Mastering Big Data with PySpark
Types of Data
Before you store, process, or analyze big data, you need to understand which kinds of data you are dealing with — and how they affect the choice of tools and techniques.
In practice, data is usually grouped into three broad categories according to how well it is organized:
- Structured data: rigid, table-like, schema-based;
- Semi-structured data: loosely organized, self-describing through tags or markers;
- Unstructured data: free-form, no fixed schema.
As a rule of thumb: the more structured the data, the easier it is to query and analyze; the less structured it is, the more flexibility you gain at the cost of additional processing work.
Structured Data
Structured data is information that conforms to a predefined schema (tables, columns, data types, constraints).
Because every record follows the same pattern, structured data is highly searchable, sortable, and joinable. Most traditional information systems — from banking ledgers to e-commerce catalogs — rely on structured storage in relational databases.
Typical sources:
- Customer profiles in a CRM;
- Point-of-sale transaction logs;
- Inventory spreadsheets.
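To make "searchable, sortable, and joinable" concrete, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for a relational database. The `customers` and `orders` tables and their rows are hypothetical, invented purely for illustration.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The schema is declared up front: column names, types, and constraints.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id),"
    " amount REAL NOT NULL)"
)

# Hypothetical sample rows; every record follows the same pattern.
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Linus")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 25.0), (11, 1, 15.5), (12, 2, 40.0)],
)

# Because the structure is known, joining, aggregating, and sorting
# take a single declarative query.
totals = conn.execute(
    """
    SELECT c.name, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY total DESC
    """
).fetchall()

# A record that violates the schema is rejected at write time.
try:
    conn.execute("INSERT INTO customers VALUES (?, ?)", (3, None))
    schema_enforced = False
except sqlite3.IntegrityError:
    schema_enforced = True

print(totals)           # [('Ada', 40.5), ('Linus', 40.0)]
print(schema_enforced)  # True
```

The constraint check at the end shows the trade-off of structured storage: invalid records never get in, but every change to the shape of the data requires a schema change first.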
Semi-Structured Data
Semi-structured data does not fit neatly into tables, yet it still carries self-describing tags or key–value pairs that reveal an internal hierarchy.
Because the schema is implicit rather than enforced, each record can evolve independently. That flexibility speeds up feature delivery (add a field, deploy today) but makes downstream validation harder.
Typical sources:
- Web API JSON or XML responses;
- IoT sensor streams with varying payloads;
- Application and server log files.
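The validation problem described above can be sketched with Python's built-in `json` module. The sensor payloads, the field names, and the `REQUIRED` tuple are assumptions made up for this example; real payloads and validation rules will differ.

```python
import json

# Hypothetical IoT payloads: each record is self-describing, but the
# set of fields varies from one record to the next.
raw_records = [
    '{"device": "t-1", "temp_c": 21.5}',
    '{"device": "t-2", "temp_c": 19.0, "humidity": 40}',
    '{"device": "t-3", "battery": {"level": 0.8}}',
]

# Fields that downstream code is assumed to depend on.
REQUIRED = ("device", "temp_c")


def parse(line):
    """Parse one JSON record and list any required fields it lacks."""
    record = json.loads(line)
    missing = [key for key in REQUIRED if key not in record]
    return record, missing


parsed = [parse(line) for line in raw_records]

# Nothing stopped the incomplete record from being produced, so the
# consumer has to validate it after the fact.
valid = [record for record, missing in parsed if not missing]
rejected = [(record.get("device"), missing) for record, missing in parsed if missing]

# The effective schema is only discoverable by scanning the data.
all_keys = sorted({key for record, _ in parsed for key in record})

print(len(valid))  # 2
print(rejected)    # [('t-3', ['temp_c'])]
print(all_keys)    # ['battery', 'device', 'humidity', 'temp_c']
```

Adding `humidity` or `battery` required no schema migration, which is the flexibility the section describes; the cost shows up in the consumer, which has to discover the fields and handle the gaps itself.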
Unstructured Data
Unstructured data lacks any predefined model. The meaning is embedded in the content itself rather than in an external schema.
Working with unstructured data usually means extracting features first: computer vision for images and video, audio processing for voice recordings, and NLP for free text.
Typical sources:
- Images and video archives;
- Free-form customer support emails;
- Social-media posts and comments;
- Call-center voice recordings.
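As a rough illustration of feature extraction from free text, the sketch below turns a support email into word counts using only the standard library. It is a deliberately simplistic stand-in for real NLP tooling, and the email text and stop-word list are invented for the example.

```python
import re
from collections import Counter

# Hypothetical free-form support email: no fields, no schema.
email = (
    "Hi, my order arrived late. The order was damaged and the box was open. "
    "Please refund the order."
)

# A tiny, hand-picked stop-word list (an assumption for this example).
STOP_WORDS = {"the", "and", "was", "my", "hi", "please"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


# Feature extraction: the meaning has to be derived from the content itself.
tokens = [token for token in tokenize(email) if token not in STOP_WORDS]
counts = Counter(tokens)

print(counts.most_common(1))  # [('order', 3)]
print(counts["refund"])       # 1
```

Once text has been reduced to features like these counts, it can be stored and queried like structured data, which is why feature extraction is usually the first step when working with unstructured sources.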