Learn Challenge: Link Employee Records | Record Linkage Techniques
Data Cleaning Techniques in Python

Challenge: Link Employee Records

Task

You are given two employee datasets originating from different internal systems. Each dataset contains employee attributes, but names and cities may differ slightly in formatting or spelling.

Your goal is to link matching employees across both datasets using fuzzy similarity.
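To get a feel for what fuzzy similarity means here, the sketch below scores two composite keys that differ by one dropped letter. The sample strings are made up for illustration and are not taken from the challenge data. SequenceMatcher.ratio() returns a float between 0 and 1, where 1 means the strings are identical.

```python
from difflib import SequenceMatcher

# Hypothetical keys for the same person as recorded in two systems.
key_a = "john smith boston"
key_b = "jon smith boston"

# The first argument is an optional "junk" filter; None disables it.
score = SequenceMatcher(None, key_a, key_b).ratio()

# A score close to 1 suggests the two keys describe the same employee.
print(f"similarity: {score:.2f}")
```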

Follow these steps:

  1. Create a composite matching key for each record using the fields first_name, last_name, and city.
  2. Convert all composite keys to lowercase strings with spaces separating the parts.
  3. Use the SequenceMatcher class from Python's standard difflib module to compute similarity scores between composite keys in the two datasets.
  4. For every employee in the first dataset, find all employees in the second dataset whose similarity score is 0.80 or higher.
  5. Store all matching pairs in a list named linked_records. Each element must be a dictionary containing:
    • "index_df1" — index of the record in the first dataset;
    • "index_df2" — index of the record in the second dataset;
    • "similarity" — computed similarity score.

Make sure the variable linked_records is declared and contains the correct linked employee pairs.

Solution
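The page does not show the challenge datasets or a reference solution, so what follows is only a minimal sketch of one way to carry out the steps. The DataFrames df1 and df2, their sample rows, the composite_key helper, and the THRESHOLD constant are assumptions introduced for illustration. Only the column names, the 0.80 cutoff, and the linked_records structure come from the task description.

```python
from difflib import SequenceMatcher

import pandas as pd

# Hypothetical sample data standing in for the two internal systems.
df1 = pd.DataFrame({
    "first_name": ["John", "Maria", "Peter"],
    "last_name": ["Smith", "Garcia", "Novak"],
    "city": ["Boston", "Madrid", "Prague"],
})
df2 = pd.DataFrame({
    "first_name": ["Jon", "Maria", "Linda"],
    "last_name": ["Smith", "Garcia", "Chen"],
    "city": ["boston", "Madrd", "Toronto"],
})


def composite_key(row):
    """Join first_name, last_name and city with spaces, in lowercase."""
    parts = (row["first_name"], row["last_name"], row["city"])
    return " ".join(str(part).strip() for part in parts).lower()


# Steps 1 and 2: one lowercase composite key per record.
keys1 = df1.apply(composite_key, axis=1)
keys2 = df2.apply(composite_key, axis=1)

THRESHOLD = 0.80  # minimum similarity required by the task

# Steps 3 to 5: compare every key in df1 against every key in df2
# and keep each pair that meets the threshold.
linked_records = []
for index_df1, key1 in keys1.items():
    for index_df2, key2 in keys2.items():
        similarity = SequenceMatcher(None, key1, key2).ratio()
        if similarity >= THRESHOLD:
            linked_records.append({
                "index_df1": index_df1,
                "index_df2": index_df2,
                "similarity": similarity,
            })

for record in linked_records:
    print(record)
```

This compares every record in the first dataset with every record in the second, which is fine for small tables but grows quadratically with dataset size. For larger data, a common refinement is to restrict comparisons to records that share a cheap blocking attribute first.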


Section 3. Chapter 3