Data Cleaning in Sports Analytics
When working with sports data, you will often face issues that can affect the quality and reliability of your analysis. One of the most common problems is missing data—for example, a player's height or weight might not be recorded for every game, or match statistics could be incomplete due to data entry errors. If you ignore missing values, your calculations may be inaccurate or misleading.
Inconsistent formats are another frequent challenge. Player names might appear in different formats across datasets, such as LeBron James in one file and James, LeBron in another. Dates may be recorded in different styles, such as 2024-06-01 versus 06/01/2024, making it hard to merge or compare data.
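To make records like these comparable, you can standardize names and dates before merging. The snippet below is a small sketch, not part of the lesson's dataset: the column names and the "Last, First" pattern are assumptions for illustration. It rewrites names into a single "First Last" form and parses both date styles into proper datetime values with pandas.

import pandas as pd

# Hypothetical records mixing name and date formats
raw = pd.DataFrame({
    "Player": ["LeBron James", "James, LeBron"],
    "GameDate": ["2024-06-01", "06/01/2024"]
})

def normalize_name(name):
    # Rewrite "Last, First" as "First Last"; leave other names unchanged
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        return f"{first} {last}"
    return name

raw["Player"] = raw["Player"].apply(normalize_name)

# Parse each date string individually; pandas reads "06/01/2024" month-first by default
raw["GameDate"] = raw["GameDate"].apply(pd.to_datetime)

print(raw)

Once both columns share one format, merging or comparing the two files on Player and GameDate becomes straightforward.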
Finally, outliers—values that are unusually high or low compared to the rest of the data—can distort analysis. For instance, a typo could record a basketball player's points as 500 instead of 50, or a sensor error might log an impossible sprint time. Identifying and handling these outliers is essential to ensure your results reflect real-world performance.
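There is no single rule for detecting outliers, but a common heuristic is the interquartile range (IQR): values far outside the middle 50% of the data deserve a closer look. The sketch below uses made-up point totals, including the 500 typo mentioned above, and flags anything beyond 1.5 times the IQR from the quartiles.

import pandas as pd

# Hypothetical points column containing a likely typo (500 instead of 50)
points = pd.Series([18, 22, 25, 31, 27, 500, 19, 24])

q1 = points.quantile(0.25)
q3 = points.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the 1.5 * IQR fences
outliers = points[(points < lower) | (points > upper)]
print("Suspected outliers:\n", outliers)

Whether you correct, remove, or keep a flagged value depends on context: a 500-point game is almost certainly a typo, while an unusually fast sprint time might simply be a great performance.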
import pandas as pd

# Hardcoded sports data with missing values
data = {
    "Player": ["Alex Smith", "Jordan Lee", "Chris Ray", "Sam Green", "Alex Smith"],
    "Team": ["Eagles", "Falcons", "Eagles", "Falcons", "Eagles"],
    "Points": [15, None, 22, 18, 15],
    "Assists": [5, 7, None, 6, 5]
}

df = pd.DataFrame(data)

# Identify missing values
missing_values = df.isnull().sum()

# Fill missing values: replace None with 0
df_filled = df.fillna(0)

print("Missing values per column:\n", missing_values)
print("\nDataFrame after filling missing values:\n", df_filled)
In this code sample, you first create a pandas DataFrame from hardcoded sports data that includes missing values. The isnull() method checks each cell in the DataFrame and returns a DataFrame of booleans, with True wherever a value is missing. By chaining .sum(), you count the number of missing values in each column, which helps you quickly identify where data is incomplete.
To handle the missing values, the fillna(0) method replaces all None or NaN values with zero. This is a common approach when missing statistics should be treated as zero, such as a player not recording any assists in a game. After filling the missing values, you print the updated DataFrame to confirm that there are no longer any empty cells. These pandas methods—isnull(), sum(), and fillna()—are essential tools for cleaning sports datasets and ensuring your analysis is based on complete data.
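Filling with zero is not always the right choice: a missing height or weight does not mean zero, it means unknown. As a rough alternative sketch, using the same hardcoded data as above, you can fill each numeric column's gaps with that column's mean instead.

import pandas as pd

data = {
    "Player": ["Alex Smith", "Jordan Lee", "Chris Ray", "Sam Green", "Alex Smith"],
    "Team": ["Eagles", "Falcons", "Eagles", "Falcons", "Eagles"],
    "Points": [15, None, 22, 18, 15],
    "Assists": [5, 7, None, 6, 5]
}
df = pd.DataFrame(data)

# Fill each numeric column's gaps with that column's mean instead of 0
df_mean_filled = df.fillna(df.mean(numeric_only=True))
print(df_mean_filled)

Which strategy fits best depends on what the statistic means; choose zero when "missing" genuinely means "none recorded" and an average (or median) when it means "not measured".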
import pandas as pd

# Hardcoded sports data with duplicate records
data = {
    "Player": ["Alex Smith", "Jordan Lee", "Chris Ray", "Sam Green", "Alex Smith"],
    "Team": ["Eagles", "Falcons", "Eagles", "Falcons", "Eagles"],
    "Points": [15, 20, 22, 18, 15],
    "Assists": [5, 7, 8, 6, 5]
}

df = pd.DataFrame(data)

# Remove duplicate records
df_no_duplicates = df.drop_duplicates()

print("DataFrame after removing duplicates:\n", df_no_duplicates)
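In this second code sample, drop_duplicates() removes rows whose values match across every column, so the repeated Alex Smith record appears only once in the result. If you only need uniqueness on certain columns, drop_duplicates() also accepts a subset argument. The short sketch below continues from the DataFrame defined above; using Player and Team as the key columns is just an illustrative choice.

# Keep only the first row for each Player/Team combination (illustrative key choice)
df_unique_players = df.drop_duplicates(subset=["Player", "Team"], keep="first")
print("One row per player and team:\n", df_unique_players)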