Challenge: Link Employee Records | Record Linkage Techniques
Data Cleaning Techniques in Python

Challenge: Link Employee Records

Task

You are given two employee datasets originating from different internal systems. Each dataset contains employee attributes, but names and cities may differ slightly in formatting or spelling.

Your goal is to link matching employees across both datasets using fuzzy similarity.

Follow these steps:

  1. Create a composite matching key for each record using the fields first_name, last_name, and city.
  2. Convert all composite keys to lowercase strings with spaces separating the parts.
  3. Use the SequenceMatcher class from the difflib module to compute a similarity score for each pair of composite keys across the two datasets.
  4. For every employee in the first dataset, find all employees in the second dataset whose similarity score is 0.80 or higher.
  5. Store all matching pairs in a list named linked_records. Each element must be a dictionary containing:
    • "index_df1": index of the record in the first dataset;
    • "index_df2": index of the record in the second dataset;
    • "similarity": computed similarity score.

Make sure the variable linked_records is declared and contains the correct linked employee pairs.
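
The steps above could be approached roughly as follows. This is a minimal sketch, not the course's reference solution: the sample records, the helper names make_key and link_records, and the use of plain lists of dictionaries in place of the two datasets are all assumptions made for illustration. It also assumes that SequenceMatcher's ratio() is the intended similarity score.

```python
from difflib import SequenceMatcher

# Hypothetical sample data standing in for the two employee datasets.
records_1 = [
    {"first_name": "John", "last_name": "Smith", "city": "New York"},
    {"first_name": "Maria", "last_name": "Garcia", "city": "Chicago"},
]
records_2 = [
    {"first_name": "Jon", "last_name": "Smith", "city": "New York"},
    {"first_name": "Peter", "last_name": "Nowak", "city": "Boston"},
    {"first_name": "maria", "last_name": "Garcia", "city": "CHICAGO"},
]


def make_key(record):
    """Steps 1-2: join first_name, last_name, city with spaces, lowercased."""
    parts = (record["first_name"], record["last_name"], record["city"])
    return " ".join(str(part) for part in parts).lower()


def link_records(rows_1, rows_2, threshold=0.80):
    """Steps 3-5: compare every key pair and keep those at or above threshold."""
    # Build the second dataset's keys once instead of once per comparison.
    keys_2 = [make_key(row) for row in rows_2]
    linked = []
    for i, row in enumerate(rows_1):
        key_1 = make_key(row)
        for j, key_2 in enumerate(keys_2):
            # Assumption: ratio() is used as the similarity score (0.0 to 1.0).
            score = SequenceMatcher(None, key_1, key_2).ratio()
            if score >= threshold:
                linked.append(
                    {"index_df1": i, "index_df2": j, "similarity": score}
                )
    return linked


linked_records = link_records(records_1, records_2)
for pair in linked_records:
    print(pair)
```

If the datasets are pandas DataFrames with a default integer index, df.to_dict("records") yields a list of dictionaries in this shape, and the positions produced by enumerate then line up with the row index. The nested loop compares every pair, which is fine for small tables but grows with the product of the two dataset sizes.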
