Detecting and Removing Duplicates | Data Cleaning and Wrangling Essentials
Data Cleaning and Wrangling in R


Definition

Duplicate data refers to records in your dataset that are exact copies of other records, either entirely or based on certain key columns. Duplicates can arise from data entry errors, system glitches, or merging datasets. They are problematic because they skew analyses, inflate counts, and lead to misleading conclusions.

When working with real-world datasets, you will often encounter duplicate entries. Detecting these duplicates is an essential step in data cleaning, as failing to address them can compromise the quality and reliability of your results. In R, you can use the duplicated() function to flag repeated rows, and the distinct() function from the dplyr package to extract only unique records. Both functions are useful when working with simulated or real datasets.

To see how this works, consider a simulated dataset that might contain duplicate rows. You can create a simple data frame and use R functions to find duplicates:

# Simulate a dataset with duplicate rows
df <- data.frame(
  id   = c(1, 2, 2, 3, 4, 4, 4),
  name = c("Alice", "Bob", "Bob", "Carol", "Dave", "Dave", "Dave")
)

# Find duplicate rows using duplicated()
duplicated_rows <- df[duplicated(df), ]
print(duplicated_rows)
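It can also help to see what duplicated() actually returns: a logical vector with one entry per row, where every repeat of an earlier row is flagged TRUE. Because a logical vector can be summed, this gives a quick duplicate count. A minimal sketch using the same simulated data frame:

```r
# Same simulated data as above
df <- data.frame(
  id   = c(1, 2, 2, 3, 4, 4, 4),
  name = c("Alice", "Bob", "Bob", "Carol", "Dave", "Dave", "Dave")
)

# duplicated() flags every repeat of an earlier row as TRUE
flags <- duplicated(df)
print(flags)  # FALSE FALSE TRUE FALSE FALSE TRUE TRUE

# Summing a logical vector counts the TRUEs
sum(flags)    # 3 duplicate rows
```

Note that the first occurrence of each row is always FALSE; only the repeats are flagged.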

To demonstrate removing duplicates, suppose you want to keep only one row for each unique combination of values in your simulated dataset. You can use distinct() to achieve this, and you can also specify columns if you want to define duplicates more narrowly. For example, you might want to remove duplicates based only on the id column, ignoring the name.

library(dplyr)

# Remove duplicate rows, keeping only the first occurrence
df_unique <- distinct(df)
print(df_unique)

# Remove duplicates based on the 'id' column only
df_unique_id <- distinct(df, id, .keep_all = TRUE)
print(df_unique_id)

When handling duplicates, it is important to consider your analysis goals. Sometimes, keeping the first occurrence of a duplicate is appropriate, especially if the records are identical or you want to preserve the earliest entry. In other cases, you may want to keep the last occurrence or use another method to decide which record to retain. Always document your approach and make sure it aligns with your data's context and the questions you are trying to answer.
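Keeping the last occurrence can be done in base R, since duplicated() accepts a fromLast argument that scans the data from the bottom up. The sketch below uses a hypothetical score column (not part of the earlier example) so that the first and last occurrences of an id actually differ:

```r
# Simulated data: repeated ids with differing values, so first vs. last matters
df <- data.frame(
  id    = c(1, 2, 2, 3),
  score = c(10, 20, 25, 30)
)

# Default: keep the FIRST occurrence of each id
first_kept <- df[!duplicated(df$id), ]
print(first_kept$score)  # 10 20 30

# fromLast = TRUE: keep the LAST occurrence of each id
last_kept <- df[!duplicated(df$id, fromLast = TRUE), ]
print(last_kept$score)   # 10 25 30
```

Which row you retain changes the result whenever duplicates differ in other columns, which is exactly why the choice should be documented.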

1. What function can you use to detect duplicate rows in R?

2. How does distinct() differ from duplicated()?

3. Why might you want to keep the first occurrence of a duplicate?



Section 1. Chapter 15
