Challenge: Link Employee Records | Record Linkage Techniques
Data Cleaning Techniques in Python

Challenge: Link Employee Records

Task

You are given two employee datasets originating from different internal systems. Each dataset contains employee attributes, but names and cities may differ slightly in formatting or spelling.

Your goal is to link matching employees across both datasets using fuzzy similarity.
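To make "fuzzy similarity" concrete, here is a minimal sketch using SequenceMatcher from Python's standard difflib module. Its ratio() method returns a score between 0 and 1, where 1 means the two strings are identical. The two key strings are hypothetical examples, not values from the challenge datasets.

```python
from difflib import SequenceMatcher

# Hypothetical composite keys for the same person as two systems might store them.
key_a = "john smith new york"
key_b = "jon smith new york"

# ratio() is 2 * (matched characters) / (total characters in both strings).
score = SequenceMatcher(None, key_a, key_b).ratio()
print(round(score, 3))

# An exact match always scores 1.0.
print(SequenceMatcher(None, key_a, key_a).ratio())
```

A one-letter spelling difference barely lowers the score, which is why a threshold below 1 can still pair records that an exact join would miss.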

Follow these steps:

  1. Create a composite matching key for each record using the fields first_name, last_name, and city.
  2. Convert all composite keys to lowercase strings with spaces separating the parts.
  3. Use the SequenceMatcher class from the standard-library difflib module to compute similarity scores between composite keys in the two datasets.
  4. For every employee in the first dataset, find all employees in the second dataset whose similarity score is 0.80 or higher.
  5. Store all matching pairs in a list named linked_records. Each element must be a dictionary containing:
    • "index_df1" — index of the record in the first dataset;
    • "index_df2" — index of the record in the second dataset;
    • "similarity" — computed similarity score.

Make sure the variable linked_records is declared and contains the correct linked employee pairs.

Solution
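One possible way to carry out the steps in the task is sketched below. It is a minimal illustration, not the course's reference solution. It assumes each dataset is available as a list of dictionaries with the keys first_name, last_name, and city, and that a record's position in the list serves as its index. The helper names make_key and link_records and the sample rows are hypothetical.

```python
from difflib import SequenceMatcher


def make_key(record):
    """Join first_name, last_name and city with spaces, then lowercase."""
    parts = (record["first_name"], record["last_name"], record["city"])
    return " ".join(str(part) for part in parts).lower()


def link_records(records1, records2, threshold=0.80):
    """Return every cross-dataset pair whose key similarity is >= threshold."""
    # Build the second dataset's keys once instead of once per comparison.
    keys2 = [make_key(rec) for rec in records2]
    linked = []
    for i, rec1 in enumerate(records1):
        key1 = make_key(rec1)
        for j, key2 in enumerate(keys2):
            score = SequenceMatcher(None, key1, key2).ratio()
            if score >= threshold:
                linked.append(
                    {"index_df1": i, "index_df2": j, "similarity": score}
                )
    return linked


# Hypothetical sample data standing in for the two employee datasets.
df1_records = [
    {"first_name": "John", "last_name": "Smith", "city": "New York"},
    {"first_name": "Maria", "last_name": "Garcia", "city": "Chicago"},
]
df2_records = [
    {"first_name": "Jon", "last_name": "Smith", "city": "new york"},
    {"first_name": "Wei", "last_name": "Chen", "city": "Seattle"},
]

linked_records = link_records(df1_records, df2_records)
for pair in linked_records:
    print(pair)
```

If the data lives in pandas DataFrames, df.to_dict("records") produces a list of dictionaries in this shape; list positions match the DataFrame index labels only when the frame uses the default integer index. The nested loop compares every record in the first dataset with every record in the second, which is fine for small tables but grows quadratically with dataset size.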

Section 3. Chapter 3