What is big data?
Before defining big data, it is essential to understand what data is. Data refers to the characters, quantities, or symbols on which a computer performs operations. It can be transmitted as electrical signals and stored on mechanical, magnetic, or optical media.
The term big data refers to large sets of data collected by companies (emails, documents, databases, business process histories, etc.), as well as data from sensors, smartphones, content published on the web (images, videos, texts, etc.), e-commerce transactions, messages on social networks, geolocated data, etc. This data can be explored and analyzed to extract valuable information, or used in data science projects.
The 5 V's of big data
The Association for Computing Machinery first used the term « Big Data » in 1997. In 2001, Doug Laney described big data in terms of three « V's »:
- Volume: the increasingly massive quantity of data, which traditional databases cannot store;
- Variety: the data comes in many types, for example an email, a text file, or an image;
- Velocity: the data is produced, collected, and analyzed in real time.
Today, big data is usually defined by five V's, which add:
- Veracity: the reliability of the data, which is essential if the company is to benefit from it and turn it into valuable information.
- Value: investing in a big data project is expensive, so it must deliver real added value for the company.
Big data application areas
The use of big data has opened up new perspectives in many fields: scientific research, politics, medicine, ecology, finance, commerce, communication, meteorology, etc. Thanks to data science techniques, researchers, companies, and administrations can perform trend or predictive analyses, draw up profiles, anticipate risks, and monitor phenomena in real time.
The different types of Big Data
Big data comes from a variety of sources and can therefore take many forms. There are three main categories:
- When data can be stored and processed in a fixed and well-defined format, it is referred to as « structured » data. Thanks to the many advances in computing, the techniques available today make it possible to work effectively with this data and extract value from it.
- Data with an unknown format or structure is considered « unstructured » data. Beyond its massive volume, this type of data presents many challenges in terms of processing and exploitation. A typical example is a dataset containing a mix of image, video, and text files. This type of data is increasingly common: companies hold vast amounts of it, yet they struggle to take advantage of it because it is difficult to process.
- Finally, « semi-structured » data is halfway between these two categories. For example, it may follow a structured format, such as an XML or JSON file, without being clearly defined within a database.
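To make the three categories above more concrete, here is a minimal Python sketch using only the standard library. All sample values and names (`structured_csv`, `semi_structured`, `unstructured`) are invented for illustration; real big data workloads would rely on dedicated storage and processing tools rather than in-memory strings.

```python
import csv
import io
import json

# Structured: every row follows the same fixed, well-defined columns,
# so a question like "what is the total amount?" is easy to answer.
structured_csv = "order_id,amount\n1,19.99\n2,5.00\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))
total = sum(float(row["amount"]) for row in rows)

# Semi-structured: the format (JSON) is regular, but the fields can
# differ from one record to the next; no fixed schema is imposed.
semi_structured = '[{"user": "ana", "tags": ["sale"]}, {"user": "bob", "city": "Lyon"}]'
records = json.loads(semi_structured)
all_fields = sorted({key for record in records for key in record})

# Unstructured: free text with no predefined fields; even a simple
# question ("which words appear?") needs ad hoc processing.
unstructured = "Great product, fast delivery. Would buy again!"
words = [word.strip(".,!").lower() for word in unstructured.split()]

print(total, all_fields, words)
```

The structured sample can be queried directly by column name, the semi-structured one first has to be inspected to discover which fields exist, and the unstructured one requires custom parsing before any analysis is possible.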