Introduction to Big Data with Apache Spark in Python
Why Big Data?
To begin with, we'll focus on the most important aspects of Big Data.
What is Big Data?
First of all, let's clarify what we're working with.
5 Vs
When discussing the properties of Big Data, we should mention the concept of the "5 Vs", which highlights its key characteristics:
- Volume;
- Variety;
- Velocity;
- Veracity;
- Value.
Now, let's explore each of them in detail.
Volume
It measures the size of a dataset, which can range from terabytes to petabytes and beyond.
The volume of data generated is influenced by several factors, including the proliferation of digital technologies, the increasing number of data-generating devices, and large-scale transactions.
In practical terms, volume is the defining characteristic: once a dataset grows too large to be stored or processed efficiently on a single machine, it is typically treated as big data.
Variety
It refers to the different types and formats of data.
In practice, data can be structured (e.g., tabular data), semi-structured (e.g., JSON, XML), or unstructured (e.g., text, images, videos, audio, etc.).
A higher variety of data leads to greater complexity, as it requires managing and storing multiple data types, as the sketch below illustrates.
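Since this course works with Apache Spark in Python, here is a minimal sketch of how PySpark can load each kind of data. The file paths and application name are hypothetical placeholders, not part of any real dataset.

```python
from pyspark.sql import SparkSession

# Start a local Spark session (the app name and paths below are placeholders).
spark = SparkSession.builder.appName("variety-demo").getOrCreate()

# Structured data: a CSV file with a header row and inferred column types.
orders = spark.read.csv("data/orders.csv", header=True, inferSchema=True)

# Semi-structured data: newline-delimited JSON; Spark infers a nested schema.
events = spark.read.json("data/events.json")

# Unstructured data: plain text, loaded as a single string column named "value".
logs = spark.read.text("data/app.log")

orders.printSchema()
events.printSchema()
logs.show(5, truncate=False)

spark.stop()
```

The point of the sketch is that the same DataFrame API handles all three shapes of data, which is one way Spark keeps variety manageable.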
Velocity
It refers to the speed at which data is generated, collected, and processed. In the context of Big Data, this encompasses real-time or near-real-time data streams, such as data from social media platforms, financial transactions, and sensors.
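As a rough illustration of handling fast-moving data, the sketch below uses PySpark's Structured Streaming API with a local socket source (for example, one started with `nc -lk 9999`); the host and port are assumptions for demonstration only.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("velocity-demo").getOrCreate()

# Read a continuous stream of text lines from a local socket (placeholder host/port).
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Maintain a running count of all records received so far.
counts = lines.groupBy().count()

# Print the updated total to the console each time new data arrives.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())

query.awaitTermination()
```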
Veracity
It refers to the accuracy and trustworthiness of data, and involves addressing uncertainties and inconsistencies such as missing values, errors, or biases.
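As a small sketch of what this looks like in practice, the example below uses an invented in-memory dataset and two common PySpark cleaning steps: dropping rows with missing values and filtering out implausible ones. The column names and values are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("veracity-demo").getOrCreate()

# A tiny sample with typical quality issues: a missing age and a negative amount.
df = spark.createDataFrame(
    [("alice", 34, 120.0), ("bob", None, -5.0), ("carol", 29, 80.0)],
    ["name", "age", "amount"],
)

cleaned = (df
           .dropna(subset=["age"])         # drop rows with missing ages
           .filter(F.col("amount") >= 0))  # remove implausible negative amounts

cleaned.show()
spark.stop()
```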
Value
It's not only about having large volumes of data; it's about extracting meaningful information that can inform business decisions and drive innovations.
The objective is to transform raw data into valuable insights that can influence strategy and operations.
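To make this concrete, here is a minimal sketch with a made-up sales dataset, where a simple PySpark aggregation turns raw rows into a summary a business could act on; the region names and revenue figures are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("value-demo").getOrCreate()

# Hypothetical sales records: the raw rows by themselves say little.
sales = spark.createDataFrame(
    [("north", 120.0), ("south", 80.0), ("north", 200.0), ("west", 50.0)],
    ["region", "revenue"],
)

# Aggregating produces an actionable summary: total revenue per region, highest first.
insight = (sales.groupBy("region")
           .agg(F.sum("revenue").alias("total_revenue"))
           .orderBy(F.desc("total_revenue")))

insight.show()
spark.stop()
```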