Introduction to Big Data with Apache Spark in Python
Why Big Data?
What is Big Data?
First of all, let’s define what we are working with.
5 Vs
When describing Big Data, it is common to refer to the “5 Vs”, a set of five key properties:
- Volume;
- Variety;
- Velocity;
- Veracity;
- Value.
Now, let’s discuss each of them in detail.
Volume
Volume measures the size of a dataset, which can range from terabytes to petabytes and beyond.
The volume of data being produced is driven by several factors, including the spread of digital technologies, the growing number of data-generating devices, and large-scale transactions.
In practice, volume is the foundation of Big Data: a dataset that is too large to handle comfortably with conventional tools is typically considered Big Data.
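To get a feel for these scales, here is a rough back-of-the-envelope sketch in plain Python. The workload figures (events per day, bytes per event, retention period) are made up for illustration, and the helper names `human_size` and `estimate_volume` are hypothetical.

```python
# Decimal (SI) units: 1 TB = 10**12 bytes, 1 PB = 10**15 bytes.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]


def human_size(num_bytes: int) -> str:
    """Format a byte count using decimal units, e.g. 10**12 -> '1.0 TB'."""
    size = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the number is below 1000,
        # or at the largest unit we know about.
        if size < 1000 or unit == UNITS[-1]:
            return f"{size:.1f} {unit}"
        size /= 1000


def estimate_volume(records_per_day: int, bytes_per_record: int, days: int) -> int:
    """Estimate raw storage in bytes, ignoring compression and replication."""
    return records_per_day * bytes_per_record * days


# Hypothetical workload: 2 billion events per day, 500 bytes each, kept for a year.
total = estimate_volume(2_000_000_000, 500, 365)
print(human_size(total))  # 365.0 TB
```

Even this modest-looking event size adds up to hundreds of terabytes per year, which is the range where single-machine tools usually stop being practical.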
Variety
Variety refers to the different types and formats of data.
In practice, data can be structured (e.g. tables), semi-structured (e.g. JSON, XML), or unstructured (e.g. text, images, video, audio).
Higher variety implies higher complexity, since there are more types of data to store and manipulate.
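As a small illustration of the three kinds of data, the sketch below reads a structured, a semi-structured, and an unstructured sample using only Python's standard library. The sample values are invented, and this is plain Python rather than Spark code.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same shape.
csv_text = "user_id,country\n1,ES\n2,US\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: self-describing, fields may be nested or optional.
json_text = '{"user_id": 1, "tags": ["new", "mobile"], "profile": {"age": 30}}'
doc = json.loads(json_text)

# Unstructured: free text with no predefined schema; any structure must be derived.
review = "Great product, fast delivery. Would buy again!"
word_count = len(review.split())

print(rows[0]["country"])     # ES
print(doc["profile"]["age"])  # 30
print(word_count)             # 7
```

Each format needs its own parsing step before the data can be combined, which is where the extra complexity comes from.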
Variety could be divided into:
- Structural variety – corresponds to the different ways in which data is organized and formatted.
- Media variety – refers to the different types of media formats and channels through which data is generated and consumed.
- Semantic variety – pertains to the differences in meaning and interpretation of data across different contexts or domains.
- Availability variety – refers to how and when data can be accessed, as well as the reliability and timeliness of data sources.
Velocity
Velocity refers to the speed at which data is generated and needs to be processed.
In terms of Big Data, this includes real-time or near-real-time data streams, such as data from social media platforms, financial transactions, or sensors.
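With streaming data, results are usually updated as each record arrives instead of waiting for a complete dataset. The following is a minimal sketch of that idea in plain Python, assuming a hypothetical stream of temperature readings; `running_mean` is an illustrative helper, not part of any library.

```python
from typing import Iterable, Iterator


def running_mean(readings: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all readings seen so far, one value per reading."""
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        # Only two numbers are kept in memory, no matter how long the stream is.
        yield total / count


# Hypothetical sensor stream; in a real system this would be an unbounded source.
sensor_stream = iter([20.0, 22.0, 24.0, 26.0])
means = list(running_mean(sensor_stream))
print(means)  # [20.0, 21.0, 22.0, 23.0]
```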
Veracity
Veracity refers to the trustworthiness of data. It involves dealing with uncertainties and inconsistencies in data quality, such as missing values, errors, or biases.
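A simple first step in assessing data quality is to count how many records are missing or implausible. The sketch below does this in plain Python for a few invented sensor records; the field name `temp_c`, the plausible range, and the `quality_report` helper are all assumptions made for the example.

```python
# Hypothetical sensor records with typical quality problems.
records = [
    {"sensor": "a", "temp_c": 21.5},
    {"sensor": "b", "temp_c": None},    # missing value
    {"sensor": "c", "temp_c": -999.0},  # implausible reading, likely an error code
    {"sensor": "d", "temp_c": 19.0},
]


def quality_report(rows, field, low, high):
    """Count missing, out-of-range, and valid values for one numeric field."""
    missing = sum(1 for r in rows if r.get(field) is None)
    out_of_range = sum(
        1
        for r in rows
        if r.get(field) is not None and not (low <= r[field] <= high)
    )
    valid = len(rows) - missing - out_of_range
    return {"missing": missing, "out_of_range": out_of_range, "valid": valid}


report = quality_report(records, "temp_c", low=-50.0, high=60.0)
print(report)  # {'missing': 1, 'out_of_range': 1, 'valid': 2}
```

A report like this does not fix the data, but it shows how much of it can be trusted before any analysis is run.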
Value
Value refers to the usefulness of the data. Big Data is not just about having large amounts of data but about deriving meaningful information that can drive business decisions and innovations.
The goal is to convert raw data into valuable insights that can impact strategy and operations.