Exploratory Data Analysis | Additional Applications of ARM
Association Rule Mining
1. Introduction to Association Rule Mining
2. Mining Frequent Itemsets
3. Additional Applications of ARM

Exploratory Data Analysis

We have already seen how association rule mining algorithms such as Apriori and FP-growth are applied in market basket analysis. ARM can also address more specialized tasks, and here we give a concise overview of some of them.

Association Rule Mining (ARM) can support classification and regression tasks by augmenting the exploratory data analysis (EDA) process and uncovering latent patterns or relationships among the features of a dataset.

By employing ARM, we can identify associations or "if-then" relationships among variables, which can be valuable for making predictions or deriving insights from the data.
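To make the "if-then" idea concrete, the sketch below computes by hand the three metrics usually reported for a rule: support, confidence, and lift. The records are a hypothetical toy sample, not rows from any real dataset, and `support` and `rule_metrics` are illustrative helper names rather than functions from a library.

```python
# Toy records (hypothetical): each record is the set of binary features
# that are "on" for one patient.
records = [
    {"slope_2", "thal_2", "target"},
    {"slope_2", "thal_2", "target"},
    {"slope_2", "target"},
    {"thal_2"},
    {"slope_2", "thal_2"},
    {"target"},
    set(),
    {"thal_2", "target"},
]


def support(itemset, records):
    """Fraction of records that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= record for record in records) / len(records)


def rule_metrics(antecedent, consequent, records):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    support_a = support(antecedent, records)
    support_c = support(consequent, records)
    support_ac = support(antecedent | consequent, records)
    if support_a == 0 or support_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    confidence = support_ac / support_a   # estimate of P(consequent | antecedent)
    lift = confidence / support_c         # > 1 means a positive association
    return {"support": support_ac, "confidence": confidence, "lift": lift}


metrics = rule_metrics({"slope_2", "thal_2"}, {"target"}, records)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Confidence answers "how often does the consequent hold when the antecedent holds?", while lift compares that frequency with how often the consequent holds overall.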

Example

Let's consider a heart disease classification dataset that records several medical features for each patient. We will apply ARM to look for hidden patterns in it:

    import warnings

    import pandas as pd
    from mlxtend.frequent_patterns import apriori, association_rules
    from sklearn.preprocessing import OneHotEncoder

    # Ignore all warnings
    warnings.filterwarnings('ignore')

    # Load the heart dataset
    df = pd.read_csv('https://codefinity-content-media-v2.s3.eu-west-1.amazonaws.com/courses/a7e17f02-2cc9-4b92-abe0-cc8710d7011e/heart.csv')

    # Select features from the DataFrame
    selected_features = ['sex', 'cp', 'restecg', 'slope', 'ca', 'thal', 'fbs', 'target']

    # Create a new DataFrame containing only the selected features
    df_selected = df[selected_features]

    # One-hot encode 'cp', 'restecg', 'slope', 'thal', and 'ca' using sklearn's OneHotEncoder
    # (`sparse_output` replaces the older `sparse` argument from scikit-learn 1.2 onward)
    encoder = OneHotEncoder(drop='first', sparse_output=False)
    encoded_cols = ['cp', 'restecg', 'slope', 'thal', 'ca']
    df_encoded_cols = pd.DataFrame(
        encoder.fit_transform(df_selected[encoded_cols]),
        columns=encoder.get_feature_names_out(encoded_cols),
        index=df_selected.index,
    )

    # Drop the original columns and replace them with the encoded ones;
    # cast the 0/1 values to bool, the input type apriori works with natively
    df_encoded = df_selected.drop(columns=encoded_cols)
    df_encoded = pd.concat([df_encoded, df_encoded_cols], axis=1).astype(bool)

    # Mine frequent itemsets using the Apriori algorithm
    frequent_itemsets = apriori(df_encoded, min_support=0.2, use_colnames=True)

    # Generate association rules (stored as `rules` so the imported function is not shadowed)
    rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.8)

    # Print antecedent, consequent, confidence, and lift metrics
    print('Association Rules:')
    print(rules[['antecedents', 'consequents', 'confidence', 'lift']])

What conclusions can we draw?

  1. If a patient has thalassemia type 3 (thal_3), they are likely to be male (sex) with a confidence of 87.07%. This suggests a strong association between thal_3 and being male;
  2. If a patient has both slope type 2 (slope_2) and restecg type 1 (restecg_1), they are likely to have heart disease (target) with a confidence of 80.36%. This indicates a strong association between slope_2, restecg_1, and having heart disease;
  3. If a patient has both thalassemia type 2 (thal_2) and restecg type 1 (restecg_1), they are likely to have heart disease (target) with a confidence of 84.75%. This suggests a strong association between thal_2, restecg_1, and having heart disease;
  4. If a patient has both slope type 2 (slope_2) and thalassemia type 2 (thal_2), they are likely to have heart disease (target) with a confidence of 85.45%. This indicates a strong association between slope_2, thal_2, and having heart disease;
  5. All lift values are greater than 1 for the provided rules. This indicates that the antecedents and consequents occur together more frequently than expected if they were independent. In other words, the occurrence of the antecedents increases the likelihood of the consequents, suggesting a positive association between the variables.
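The lift check in the last point matters because confidence alone can look impressive when the consequent is simply common. The sketch below relies on the identity lift = confidence / support(consequent). The base rates it uses are made-up numbers chosen for illustration, not values measured on the heart dataset, and `lift_from_confidence` is a hypothetical helper name.

```python
def lift_from_confidence(confidence, consequent_support):
    """Lift of a rule A -> C, given confidence(A -> C) and support(C)."""
    if not 0 < consequent_support <= 1:
        raise ValueError("consequent_support must be in (0, 1]")
    return confidence / consequent_support


# Same confidence, two different (assumed) base rates for the consequent.
for base_rate in (0.5, 0.8):
    lift = lift_from_confidence(0.8, base_rate)
    verdict = "positive association" if lift > 1 else "no better than the base rate"
    print(f"confidence=0.80, support(C)={base_rate:.2f} -> lift={lift:.2f} ({verdict})")
```

A rule with 80% confidence tells us little if 80% of all records already satisfy the consequent, which is why confidence and lift are best read together.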

Using rules 2-4, we can even perform simple rule-based classification: if a patient has the feature values listed in a rule's antecedent, we can predict heart disease without training an ML model.
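A minimal sketch of such a rule-based classifier is shown below. It hard-codes the three rules from the list above whose consequent is `target` (items 2, 3 and 4), with the confidence values quoted there. The function name `predict_heart_disease`, the choice to abstain when no rule fires, and the choice to report the highest matching confidence are all assumptions made for illustration.

```python
# Rules with consequent `target` (heart disease), taken from the list above:
# (antecedent feature set, confidence reported for the rule).
RULES = [
    ({"slope_2", "restecg_1"}, 0.8036),
    ({"thal_2", "restecg_1"}, 0.8475),
    ({"slope_2", "thal_2"}, 0.8545),
]


def predict_heart_disease(active_features, rules=RULES):
    """Classify one patient from the set of one-hot features that equal 1.

    Returns (1, confidence) if at least one rule's antecedent is fully
    contained in `active_features`, using the highest matching confidence.
    Returns (None, None) when no rule fires, i.e. the classifier abstains.
    """
    active = set(active_features)
    matches = [conf for antecedent, conf in rules if antecedent <= active]
    if not matches:
        return None, None
    return 1, max(matches)


# Hypothetical patients, described by their active one-hot features.
print(predict_heart_disease({"sex", "slope_2", "thal_2"}))
print(predict_heart_disease({"sex", "fbs"}))
```

Because these rules only cover patients who match an antecedent, such a classifier leaves many cases undecided; in practice it would serve as an interpretable baseline or as a complement to a trained model.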


Section 3. Chapter 1