Working with Duplicates | Data Cleaning
Introduction to Pandas with AI

AI in Action

import pandas as pd

df = pd.read_csv("passengers.csv")

print(df.duplicated().sum())
df = df.drop_duplicates()

Detecting Duplicates

Use the .duplicated() method to check for duplicates in a DataFrame. It returns a boolean Series, so chaining .sum() counts how many rows are duplicates.

import pandas as pd

df = pd.read_csv("https://staging-content-media-cdn.codefinity.com/courses/64641555-cae4-4cd0-8d29-807aeb6bc0c4/datasets/passengers.csv")

# Check which rows are duplicates
print(df.duplicated())

# Count duplicate rows
print(df.duplicated().sum())
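By default, .duplicated() marks only the later copies of a row, keeping the first one unflagged. You can change this with the keep parameter. A minimal sketch using a toy DataFrame (invented here for illustration, not the course dataset):

```python
import pandas as pd

# Toy DataFrame with one repeated row (rows 0 and 2 are identical)
df = pd.DataFrame({"Name": ["Ann", "Bob", "Ann"],
                   "Ticket": ["A1", "B2", "A1"]})

# keep="first" (the default): only later copies are flagged
print(df.duplicated(keep="first").tolist())   # [False, False, True]

# keep=False: every member of a duplicate group is flagged
print(df.duplicated(keep=False).tolist())     # [True, False, True]
```

keep=False is handy when you want to inspect all copies of a duplicated row before deciding which one to keep.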

By default, pandas compares all columns when identifying duplicates. You can also check for duplicates within a specific subset of columns:

import pandas as pd

df = pd.read_csv("https://staging-content-media-cdn.codefinity.com/courses/64641555-cae4-4cd0-8d29-807aeb6bc0c4/datasets/passengers.csv")

# Count rows whose Ticket value already appeared in an earlier row
print(df.duplicated(subset=["Ticket"]).sum())
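The difference between full-row and subset checks is easiest to see on a tiny example. In this sketch (toy data, not the course dataset) no two rows are fully identical, but two rows share a Ticket value:

```python
import pandas as pd

# Toy data: all full rows differ, but rows 0 and 1 share a Ticket
df = pd.DataFrame({"Name": ["Ann", "Bob", "Cat"],
                   "Ticket": ["A1", "A1", "C3"]})

# Full-row check: no row is an exact copy of another
print(df.duplicated().sum())                    # 0

# Subset check: Bob repeats Ann's ticket
print(df.duplicated(subset=["Ticket"]).sum())   # 1
```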

Removing Duplicates

After you confirm that the duplicate rows shouldn't remain, remove them using the .drop_duplicates() method:

import pandas as pd

df = pd.read_csv("https://staging-content-media-cdn.codefinity.com/courses/64641555-cae4-4cd0-8d29-807aeb6bc0c4/datasets/passengers.csv")

# Remove duplicate rows
print(df.drop_duplicates())

# Remove duplicates based only on values in a subset of columns
print(df.drop_duplicates(subset=["Ticket"]))
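Like .duplicated(), .drop_duplicates() accepts a keep parameter, and it also supports ignore_index for renumbering the result. A small sketch on a toy DataFrame (invented for illustration):

```python
import pandas as pd

# Toy DataFrame: rows 0 and 2 are identical
df = pd.DataFrame({"Name": ["Ann", "Bob", "Ann"],
                   "Ticket": ["A1", "B2", "A1"]})

# Keep the last copy of each duplicate group instead of the first
print(df.drop_duplicates(keep="last")["Name"].tolist())   # ['Bob', 'Ann']

# ignore_index=True renumbers the result 0, 1, ... instead of
# preserving the original row labels
deduped = df.drop_duplicates(ignore_index=True)
print(deduped.index.tolist())                             # [0, 1]
```

Note that .drop_duplicates() returns a new DataFrame; assign the result back (as the "AI in Action" snippet does) or pass inplace=True to modify the original.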

Counting Unique Values

To check how many distinct values each column has, use the .nunique() method:

import pandas as pd

df = pd.read_csv("https://staging-content-media-cdn.codefinity.com/courses/64641555-cae4-4cd0-8d29-807aeb6bc0c4/datasets/passengers.csv")

# Count unique values for each column
print(df.nunique())

# Count unique values for a single column
print(df["Embarked"].nunique())

This helps you identify columns with limited categories or verify whether an ID column is truly unique.
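For the ID-uniqueness check specifically, pandas also offers the Series property .is_unique. A minimal sketch with toy data (not the course dataset):

```python
import pandas as pd

# Toy data: unique IDs, a low-cardinality category column
df = pd.DataFrame({"PassengerId": [1, 2, 3],
                   "Embarked": ["S", "C", "S"]})

# Unique counts per column
print(df.nunique().to_dict())       # {'PassengerId': 3, 'Embarked': 2}

# True only if every value in the column appears exactly once
print(df["PassengerId"].is_unique)  # True
```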

1. What does df.duplicated() return?

2. How can you remove all duplicate rows from a DataFrame?

3. How would you count the number of unique elements in the "Class" column?

Section 2. Chapter 2
