Detecting and Removing Duplicates | Data Cleaning and Wrangling Essentials
Data Cleaning and Wrangling in R


Definition

Duplicate data refers to records in your dataset that are exact copies of other records, either entirely or based on certain key columns. Duplicates can arise from data entry errors, system glitches, or merging datasets. They are problematic because they skew analyses, inflate counts, and lead to misleading conclusions.

When working with real-world datasets, you will often encounter duplicate entries. Detecting these duplicates is an essential step in data cleaning, as failing to address them can compromise the quality and reliability of your results. In R, you can use the duplicated() function to flag repeated rows, and the distinct() function from the dplyr package to extract only unique records. Both functions are useful when working with simulated or real datasets.

To see how this works, consider a simulated dataset that might contain duplicate rows. You can create a simple data frame and use R functions to find duplicates:

# Simulate a dataset with duplicate rows
df <- data.frame(
  id   = c(1, 2, 2, 3, 4, 4, 4),
  name = c("Alice", "Bob", "Bob", "Carol", "Dave", "Dave", "Dave")
)

# Find duplicate rows using duplicated()
duplicated_rows <- df[duplicated(df), ]
print(duplicated_rows)
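It can also help to see what duplicated() actually returns: a logical vector with one entry per row, where every repeat of an earlier row is flagged TRUE. Because a logical vector can be summed, this gives a quick duplicate count. A minimal sketch using the same simulated data frame:

```r
# Same simulated data as above
df <- data.frame(
  id   = c(1, 2, 2, 3, 4, 4, 4),
  name = c("Alice", "Bob", "Bob", "Carol", "Dave", "Dave", "Dave")
)

# duplicated() flags every repeat of an earlier row as TRUE
flags <- duplicated(df)
print(flags)  # FALSE FALSE TRUE FALSE FALSE TRUE TRUE

# Summing a logical vector counts the TRUEs
sum(flags)    # 3 duplicate rows
```

Note that the first occurrence of each row is always FALSE; only the repeats are flagged.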

To demonstrate removing duplicates, suppose you want to keep only one row for each unique combination of values in your simulated dataset. You can use distinct() to achieve this, and you can also specify columns if you want to define duplicates more narrowly. For example, you might want to remove duplicates based only on the id column, ignoring the name.

library(dplyr)

# Remove duplicate rows, keeping only the first occurrence
df_unique <- distinct(df)
print(df_unique)

# Remove duplicates based on the 'id' column only
df_unique_id <- distinct(df, id, .keep_all = TRUE)
print(df_unique_id)

When handling duplicates, it is important to consider your analysis goals. Sometimes, keeping the first occurrence of a duplicate is appropriate, especially if the records are identical or you want to preserve the earliest entry. In other cases, you may want to keep the last occurrence or use another method to decide which record to retain. Always document your approach and make sure it aligns with your data's context and the questions you are trying to answer.
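Keeping the last occurrence can be done in base R, since duplicated() accepts a fromLast argument that scans the data from the bottom up. The sketch below uses a hypothetical score column (not part of the earlier example) so that the first and last occurrences of an id actually differ:

```r
# Simulated data: repeated ids with differing values, so first vs. last matters
df <- data.frame(
  id    = c(1, 2, 2, 3),
  score = c(10, 20, 25, 30)
)

# Default: keep the FIRST occurrence of each id
first_kept <- df[!duplicated(df$id), ]
print(first_kept$score)  # 10 20 30

# fromLast = TRUE: keep the LAST occurrence of each id
last_kept <- df[!duplicated(df$id, fromLast = TRUE), ]
print(last_kept$score)   # 10 25 30
```

Which row you retain changes the result whenever duplicates differ in other columns, which is exactly why the choice should be documented.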

1. What function can you use to detect duplicate rows in R?

2. How does distinct() differ from duplicated()?

3. Why might you want to keep the first occurrence of a duplicate?



Section 1. Chapter 15
