Why Big Data? | Big Data Basics
Introduction to Big Data with Apache Spark in Python

Why Big Data?

What is Big Data?

First of all, let's define what we are working with.

5 Vs

Big Data is commonly described through a concept called the "5 Vs", which names its most important properties:

  • Volume;
  • Variety;
  • Velocity;
  • Veracity;
  • Value.

Now, let's discuss each of them in detail.

Volume

Volume measures the size of a dataset, which can range from terabytes to petabytes and beyond.

The volume of data being produced is driven by several factors, including the spread of digital technologies, the growing number of data-generating devices, and large-scale transactions.

In practice, volume is the foundation of Big Data: once a dataset is large enough, it can be considered Big Data.
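To get a feel for these orders of magnitude, here is a minimal sketch that estimates the raw size of a dataset as rows multiplied by average row size. The helper names (`human_size`, `estimate_dataset_bytes`) and the workload numbers are assumptions made up for illustration; real storage size also depends on file format and compression.

```python
# Rough size estimate: rows * average bytes per row (both numbers are assumed).
UNITS = ["B", "KB", "MB", "GB", "TB", "PB"]


def human_size(num_bytes: float) -> str:
    """Convert a byte count to a readable string using 1024-based units."""
    size = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit that fits, or at the largest unit we know.
        if size < 1024 or unit == UNITS[-1]:
            return f"{size:.1f} {unit}"
        size /= 1024


def estimate_dataset_bytes(rows: int, avg_row_bytes: int) -> int:
    """Estimate the uncompressed size of a dataset in bytes."""
    return rows * avg_row_bytes


# Hypothetical workload: 10 billion events of about 200 bytes each.
total = estimate_dataset_bytes(10_000_000_000, 200)
print(human_size(total))  # 1.8 TB
```

Even a modest 200-byte record reaches the terabyte range at this row count, which is the scale where a single machine typically stops being enough.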

Variety

Variety refers to the different types and formats of data.

In practice, data can be structured (e.g. tables), semi-structured (e.g. JSON, XML), or unstructured (e.g. text, images, video, audio).

Higher variety implies higher complexity, since there are more types of data to store and manipulate.
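The difference between these three kinds of data can be sketched with Python's standard library alone. The sample values below are invented for illustration; the point is only how much structure each format gives us for free.

```python
import csv
import io
import json

# Structured: a table with a fixed schema (CSV text here).
csv_text = "id,name,age\n1,Ann,34\n2,Bob,29\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: JSON, where fields can be nested or absent.
json_text = '{"id": 3, "name": "Cy", "tags": ["new"], "address": {"city": "Oslo"}}'
record = json.loads(json_text)

# Unstructured: free text with no schema; any structure must be imposed by us.
text = "Ann bought two books and Bob bought one."
word_count = len(text.split())

print(rows[0]["name"])            # Ann
print(record["address"]["city"])  # Oslo
print(word_count)                 # 8
```

Each format needs its own parsing logic, which is one concrete way variety adds complexity.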

Variety can be divided into:

  • Structural variety – corresponds to the different ways in which data is organized and formatted.
  • Media variety – refers to the different types of media formats and channels through which data is generated and consumed.
  • Semantic variety – pertains to the differences in meaning and interpretation of data across different contexts or domains.
  • Availability variety – refers to how and when data can be accessed, as well as the reliability and timeliness of data sources.

Velocity

Velocity refers to the speed at which data is generated and needs to be processed. In terms of Big Data, this includes real-time or near-real-time data streams, such as data from social media platforms, financial transactions, or sensors.
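As a rough illustration of processing data as it arrives rather than after it has all been collected, the sketch below computes a rolling mean over a stream of values. The function name `rolling_mean` and the sensor readings are hypothetical, and a plain list stands in for a real stream.

```python
from collections import deque
from typing import Iterable, Iterator


def rolling_mean(stream: Iterable[float], window: int) -> Iterator[float]:
    """Yield the mean of the last `window` values as each new value arrives."""
    recent = deque(maxlen=window)  # older values are dropped automatically
    for value in stream:
        recent.append(value)
        yield sum(recent) / len(recent)


# Hypothetical sensor readings arriving one at a time.
readings = [10.0, 12.0, 14.0, 20.0]
means = list(rolling_mean(readings, window=2))
print(means)  # [10.0, 11.0, 13.0, 17.0]
```

Because the function is a generator, it keeps only the last `window` values in memory, which is the usual constraint when the incoming data never stops.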

Veracity

Veracity refers to the quality and trustworthiness of data. It involves dealing with uncertainties and inconsistencies, such as missing values, errors, or biases.
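A simple quality check might look like the sketch below, which counts missing, out-of-range, and duplicated records. The sample records, the `quality_report` helper, and the 0 to 120 age range are all assumptions chosen for illustration.

```python
# Hypothetical records with typical quality problems.
records = [
    {"user": "ann", "age": 34},
    {"user": "bob", "age": None},  # missing value
    {"user": "cy", "age": -5},     # invalid value
    {"user": "ann", "age": 34},    # exact duplicate of the first record
]


def quality_report(rows):
    """Count missing, out-of-range, and duplicated records."""
    missing = sum(1 for r in rows if r["age"] is None)
    invalid = sum(
        1 for r in rows if r["age"] is not None and not 0 <= r["age"] <= 120
    )
    seen = set()
    duplicates = 0
    for r in rows:
        key = (r["user"], r["age"])
        if key in seen:
            duplicates += 1
        seen.add(key)
    return {"missing": missing, "invalid": invalid, "duplicates": duplicates}


report = quality_report(records)
print(report)  # {'missing': 1, 'invalid': 1, 'duplicates': 1}
```

Checks like these do not catch bias, which usually requires knowledge of how the data was collected.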

Value

Value refers to the usefulness of data. Big Data is not just about having large amounts of data, but about deriving meaningful information that can drive business decisions and innovation.

The goal is to convert raw data into valuable insights that can impact strategy and operations.
