Mastering Big Data with PySpark
Types of Data
Before you begin to process and analyze big data, you need to understand the types of data you might encounter and the formats in which that data is stored. This matters because different tools, storage systems, and processing techniques work with different types and formats of data.
First, let's talk about the types of data. Data is often categorized into three types based on how well it is organized: structured, semi-structured, and unstructured. Each comes with its own benefits and challenges, but as a general rule, the more structured the data, the easier it is to store, query, and analyze.
Structured Data
Structured data is data that fits into a predefined schema or model.
Structured data is highly organized and easy to search. It typically resides in relational databases and is formatted into tables with rows and columns. Each column has a specific data type and constraints, making the data predictable and easy to analyze.
Examples:
Customer records in a database;
Financial transactions;
Inventory spreadsheets.
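To make the idea of a predefined schema concrete, here is a minimal sketch that uses Python's built-in sqlite3 module rather than PySpark, so it runs without any setup. The table name, column names, and sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database; the "customers" table and its columns are
# hypothetical, used only to illustrate a fixed schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customers (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        balance REAL NOT NULL CHECK (balance >= 0)
    )
    """
)
conn.executemany(
    "INSERT INTO customers (id, name, balance) VALUES (?, ?, ?)",
    [(1, "Alice", 120.50), (2, "Bob", 0.0)],
)

# Every row follows the same schema, so filtering by column is simple.
rows = conn.execute(
    "SELECT name, balance FROM customers WHERE balance > 100"
).fetchall()

# A row that violates a column constraint is rejected by the database.
try:
    conn.execute(
        "INSERT INTO customers (id, name, balance) VALUES (3, NULL, 5.0)"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

Because each column has a declared type and constraints, the query can rely on `balance` always being present and numeric, and the invalid row never reaches the table.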
Semi-Structured Data
Semi-structured data is data that doesn't fit neatly into tables but still has an internal structure, using tags or markers to separate elements.
Semi-structured data offers more flexibility than structured data because no strict schema is enforced. Even entities belonging to the same class can vary in the fields they contain, their order, or the data types used. This makes it easier to evolve data structures over time but harder to enforce consistency.
Examples:
Web API responses;
Sensor outputs;
Log files.
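The variation between records of the same class can be sketched with JSON, a common semi-structured format. The example below uses Python's built-in json module; the records and field names are hypothetical and stand in for something like two responses from the same web API.

```python
import json

# Hypothetical records describing the same kind of entity.
# Note the differing fields and the inconsistent type of "id".
raw_records = [
    '{"id": 1, "name": "Alice", "email": "alice@example.com"}',
    '{"id": "2", "name": "Bob", "tags": ["vip"]}',
]
records = [json.loads(raw) for raw in raw_records]

# The set of fields differs from record to record.
all_fields = sorted({key for rec in records for key in rec})

# Consumers have to handle missing fields and mixed types themselves.
emails = [rec.get("email") for rec in records]  # None when the field is absent
ids = [int(rec["id"]) for rec in records]       # normalize str/int to int
```

Nothing stops the second record from omitting `email` or storing `id` as a string, which is the flexibility described above, and the reading code has to compensate for it.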
Unstructured Data
Unstructured data is data that lacks a predefined schema or model.