Learn Challenge: Link Employee Records | Record Linkage Techniques
Data Cleaning Techniques in Python

Challenge: Link Employee Records

Task

You are given two employee datasets originating from different internal systems. Each dataset contains employee attributes, but names and cities may differ slightly in formatting or spelling.

Your goal is to link matching employees across both datasets using fuzzy similarity.

Follow these steps:

  1. Create a composite matching key for each record using the fields first_name, last_name, and city.
  2. Convert all composite keys to lowercase strings with spaces separating the parts.
  3. Use the SequenceMatcher class from the difflib module of the Python standard library to compute similarity scores between composite keys in the two datasets.
  4. For every employee in the first dataset, find all employees in the second dataset whose similarity score is 0.80 or higher.
  5. Store all matching pairs in a list named linked_records. Each element must be a dictionary containing:
    • "index_df1" — index of the record in the first dataset;
    • "index_df2" — index of the record in the second dataset;
    • "similarity" — computed similarity score.

Make sure the variable linked_records is declared and contains the correct linked employee pairs.
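
The steps above can be sketched roughly as follows. This is a minimal illustration, not the reference solution: the sample records, the helper name composite_key, and the constant THRESHOLD are all hypothetical, and the two datasets are modelled as plain lists of dictionaries so that the example runs with the standard library alone.

```python
from difflib import SequenceMatcher

# Hypothetical sample data; the real exercise supplies its own two datasets.
records_1 = [
    {"first_name": "John", "last_name": "Smith", "city": "New York"},
    {"first_name": "Maria", "last_name": "Garcia", "city": "Chicago"},
]
records_2 = [
    {"first_name": "Jon", "last_name": "Smith", "city": "New York"},
    {"first_name": "Peter", "last_name": "Novak", "city": "Boston"},
    {"first_name": "Maria", "last_name": "Garcia", "city": "Chicago"},
]

THRESHOLD = 0.80  # minimum similarity score required by step 4


def composite_key(record):
    """Join first_name, last_name and city into one lowercase, space-separated key."""
    parts = (record["first_name"], record["last_name"], record["city"])
    return " ".join(str(part) for part in parts).lower()


# Steps 1-2: build the lowercase composite keys for both datasets.
keys_1 = [composite_key(record) for record in records_1]
keys_2 = [composite_key(record) for record in records_2]

# Steps 3-5: compare every key in the first dataset with every key in the
# second one and keep the pairs that reach the threshold.
linked_records = []
for i, key_1 in enumerate(keys_1):
    for j, key_2 in enumerate(keys_2):
        score = SequenceMatcher(None, key_1, key_2).ratio()
        if score >= THRESHOLD:
            linked_records.append(
                {"index_df1": i, "index_df2": j, "similarity": score}
            )

for pair in linked_records:
    print(pair)
```

If the datasets are pandas DataFrames, as the key names index_df1 and index_df2 suggest, the same nested loop could run over the rows of each frame (for example with iterrows()) and store the DataFrame index labels instead of list positions. The pairwise comparison grows with the product of the two dataset sizes, which is fine for small exercises but slow for large tables.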

