Introduction to Pandas with AI

Working with Duplicates

AI in Action

import pandas as pd

df = pd.read_csv("passengers.csv")

# Count duplicate rows, then remove them
print(df.duplicated().sum())
df = df.drop_duplicates()

Detecting Duplicates

You can check for duplicates in a DataFrame using the .duplicated() method, which returns a Boolean Series marking each row that repeats an earlier one. Chaining .sum() counts how many rows are duplicates.

import pandas as pd

df = pd.read_csv("https://staging-content-media-cdn.codefinity.com/courses/64641555-cae4-4cd0-8d29-807aeb6bc0c4/datasets/passengers.csv")

# Check which rows are duplicates
print(df.duplicated())

# Count duplicate rows
print(df.duplicated().sum())
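
The Boolean Series returned by .duplicated() can also be used as a mask to inspect the duplicated rows themselves. A minimal sketch, assuming the same passengers.csv dataset:

import pandas as pd

df = pd.read_csv("https://staging-content-media-cdn.codefinity.com/courses/64641555-cae4-4cd0-8d29-807aeb6bc0c4/datasets/passengers.csv")

# Select only the rows flagged as duplicates
duplicate_rows = df[df.duplicated()]
print(duplicate_rows)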

By default, pandas checks all columns when identifying duplicates. You can also check for duplicates based only on a specific subset of columns:

import pandas as pd

df = pd.read_csv("https://staging-content-media-cdn.codefinity.com/courses/64641555-cae4-4cd0-8d29-807aeb6bc0c4/datasets/passengers.csv")

print(df.duplicated(subset=["Ticket"]).sum())
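
By default, .duplicated() marks only the second and later occurrences as duplicates. If you want to see every row that shares a ticket, including the first occurrence, you can pass keep=False. A short sketch, assuming the same dataset:

import pandas as pd

df = pd.read_csv("https://staging-content-media-cdn.codefinity.com/courses/64641555-cae4-4cd0-8d29-807aeb6bc0c4/datasets/passengers.csv")

# keep=False flags every row whose "Ticket" value appears more than once
shared_tickets = df[df.duplicated(subset=["Ticket"], keep=False)]
print(shared_tickets)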

Removing Duplicates

After you confirm that the duplicate rows shouldn't remain, remove them using the .drop_duplicates() method:

import pandas as pd

df = pd.read_csv("https://staging-content-media-cdn.codefinity.com/courses/64641555-cae4-4cd0-8d29-807aeb6bc0c4/datasets/passengers.csv")

# Remove duplicate rows
print(df.drop_duplicates())

# Remove duplicates based only on values in a subset of columns
print(df.drop_duplicates(subset=["Ticket"]))
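
Keep in mind that .drop_duplicates() returns a new DataFrame rather than modifying df in place, so assign the result back if you want to keep it. A quick way to confirm how many rows were removed is to compare the row counts before and after:

import pandas as pd

df = pd.read_csv("https://staging-content-media-cdn.codefinity.com/courses/64641555-cae4-4cd0-8d29-807aeb6bc0c4/datasets/passengers.csv")

rows_before = len(df)
df = df.drop_duplicates()  # reassign to keep the deduplicated DataFrame
rows_after = len(df)

print(f"Removed {rows_before - rows_after} duplicate rows")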

Counting Unique Values

To check how many distinct values each column has, use the .nunique() method:

import pandas as pd

df = pd.read_csv("https://staging-content-media-cdn.codefinity.com/courses/64641555-cae4-4cd0-8d29-807aeb6bc0c4/datasets/passengers.csv")

# Count unique values for each column
print(df.nunique())

# Count unique values for a single column
print(df["Embarked"].nunique())

This helps you identify columns with limited categories or verify whether an ID column is truly unique.
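
For example, an ID column should have exactly as many distinct values as the DataFrame has rows. A minimal sketch, assuming the dataset contains a "PassengerId" column (the column name here is illustrative):

import pandas as pd

df = pd.read_csv("https://staging-content-media-cdn.codefinity.com/courses/64641555-cae4-4cd0-8d29-807aeb6bc0c4/datasets/passengers.csv")

# "PassengerId" is an assumed ID column name for this example
print(df["PassengerId"].nunique() == len(df))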

1. What does df.duplicated() return?

2. How can you remove all duplicate rows from a DataFrame?

3. How would you count the number of unique elements in the "Class" column?

