Learn Types of Data | Big Data Fundamentals
Mastering Big Data with PySpark

1. Big Data Fundamentals
2. Distributed Systems
3. Spark Core
4. Spark SQL
5. Structured Streaming
6. MLlib

Types of Data

Before you store, process, or analyze big data, you need to understand which kinds of data you are dealing with — and how they affect the choice of tools and techniques.

In practice, data is usually grouped into three broad categories according to how well it is organized:

  • Structured data: rigid, table-like, schema-based;
  • Semi-structured data: loosely organized, uses tags or markers;
  • Unstructured data: free-form, no fixed schema.

As a rule of thumb: the more structured the data, the easier it is to query and analyze; the less structured it is, the more flexibility you gain at the cost of additional processing work.

Structured Data

Definition

Structured data is information that conforms to a predefined schema (tables, columns, data types, constraints).

Because every record follows the same pattern, structured data is highly searchable, sortable, and joinable. Most traditional information systems — from banking ledgers to e-commerce catalogs — rely on structured storage in relational databases.

Typical sources:

  • Customer profiles in a CRM;
  • Point-of-sale transaction logs;
  • Inventory spreadsheets.
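
As a minimal sketch of what "conforms to a predefined schema" means in practice, the snippet below uses Python's built-in sqlite3 module. The `sales` table, its columns, and the sample rows are hypothetical and chosen only for illustration; they do not come from the course.

```python
import sqlite3

# Hypothetical point-of-sale schema, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    " id INTEGER PRIMARY KEY,"
    " product TEXT NOT NULL,"
    " quantity INTEGER NOT NULL CHECK (quantity > 0),"
    " unit_price REAL NOT NULL)"
)
rows = [
    (1, "keyboard", 2, 30.0),
    (2, "mouse", 1, 15.0),
    (3, "keyboard", 1, 30.0),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

# Every row has the same columns, so grouping and sorting fit in one query.
revenue_by_product = conn.execute(
    "SELECT product, SUM(quantity * unit_price) AS revenue"
    " FROM sales GROUP BY product ORDER BY revenue DESC"
).fetchall()

# The schema rejects a record that violates a constraint (quantity must be > 0).
try:
    conn.execute("INSERT INTO sales VALUES (4, 'monitor', 0, 120.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(revenue_by_product)
print(rejected)
```

The point of the sketch is that the database, not the application code, enforces the shape of every record, which is what makes the data easy to search, sort, and aggregate.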

Semi-Structured Data

Definition

Semi-structured data does not fit neatly into tables, yet it still carries self-describing tags or key–value pairs that reveal an internal hierarchy.

Because the schema is implicit rather than enforced, each record can evolve independently. That flexibility speeds up feature delivery (add a field, deploy today) but makes downstream validation harder.

Typical sources:

  • Web API JSON or XML responses;
  • IoT sensor streams with varying payloads;
  • Application and server log files.
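
The trade-off between flexibility and validation can be illustrated with a small sketch that uses only Python's json module. The sensor payloads and the choice of required fields (`device`, `temp_c`) are assumptions made up for this example.

```python
import json

# Hypothetical IoT payloads: same source, different fields per record.
raw_records = [
    '{"device": "s1", "temp_c": 21.5}',
    '{"device": "s2", "temp_c": 19.0, "humidity": 40}',
    '{"device": "s3", "battery": {"level": 0.8}}',
]
records = [json.loads(line) for line in raw_records]

# The keys describe each record, but nothing guarantees a key is present.
all_keys = sorted({key for rec in records for key in rec})

# Downstream code has to validate; here we assume "device" and "temp_c" are required.
required = {"device", "temp_c"}
valid = [rec for rec in records if required.issubset(rec)]
invalid_devices = [rec["device"] for rec in records if not required.issubset(rec)]

print(all_keys)
print(len(valid), invalid_devices)
```

Each record parses without any schema being declared up front, yet the consumer still has to decide what to do with the record that lacks a temperature reading.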

Unstructured Data

Definition

Unstructured data lacks any predefined model. The meaning is embedded in the content itself rather than in an external schema.

Before unstructured data can be queried or analyzed, features usually have to be extracted from it with computer vision, audio processing, or natural language processing (NLP) techniques.

Typical sources:

  • Images and video archives;
  • Free-form customer support emails;
  • Social-media posts and comments;
  • Call-center voice recordings.
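
To make "extracting features" concrete, here is a deliberately simple sketch that turns a free-form message into word counts using only the Python standard library. The sample email and the stopword list are invented for illustration, and real NLP pipelines are considerably more involved than this toy stand-in.

```python
import re
from collections import Counter

# Hypothetical customer support email; there is no schema to rely on.
email = "My order arrived late. The order was damaged, and I want a refund."

# Lowercase the text and split it into word tokens.
tokens = re.findall(r"[a-z']+", email.lower())

# A tiny, hand-picked stopword list (an assumption for this example).
stopwords = {"my", "the", "was", "and", "i", "a"}
counts = Counter(t for t in tokens if t not in stopwords)

# Derived features that a downstream system could store in a table.
top_term = counts.most_common(1)
mentions_refund = "refund" in counts

print(top_term)
print(mentions_refund)
```

Once features such as term counts or a refund flag have been extracted, they can be stored and queried like structured data.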

Fill in the blanks

The type of data that uses tags or markers to structure information but does not enforce a table layout is called ___.
Voice recordings from a call center are examples of ___ data.
A banking ledger stored in a relational database represents ___ data.


Section 1. Chapter 2
