Exploratory Data Analysis

We have already discussed how association rule mining (ARM) algorithms such as Apriori and FP-growth can be applied in market basket analysis. ARM can also address more specialized tasks, and this section gives a concise overview of some of them.

In classification and regression tasks, ARM can augment exploratory data analysis (EDA) by uncovering latent patterns or relationships among the features of a dataset.

ARM identifies associations, or "if-then" relationships, among variables, which can be valuable for making predictions or deriving insights from the data.
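To make the "if-then" idea concrete, here is a minimal sketch that measures one rule by hand with pandas. The toy table, the column names `bread` and `butter`, and the helper functions `support` and `confidence` are all assumptions made for illustration; they are not part of the heart dataset or of any library.

```python
import pandas as pd

# Toy data (hypothetical): each row is one record, each column a boolean "item".
records = pd.DataFrame({
    "bread":  [True, True, True, False, False, True],
    "butter": [True, True, False, False, True, True],
})

def support(df, items):
    """Fraction of rows in which every item in `items` is True."""
    return df[list(items)].all(axis=1).mean()

def confidence(df, antecedent, consequent):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    return support(df, antecedent + consequent) / support(df, antecedent)

# Rule: "if bread, then butter"
rule_support = support(records, ["bread", "butter"])
rule_confidence = confidence(records, ["bread"], ["butter"])
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Here three of the six rows contain both items and four contain `bread`, so the rule has support 0.50 and confidence 0.75.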

Example

Let's consider a Heart Disease Classification dataset, which contains medical measurements for a set of patients. We will apply ARM to detect hidden patterns in it:


import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from mlxtend.frequent_patterns import apriori, association_rules
import warnings

# Ignore all warnings
warnings.filterwarnings('ignore')

# Load the heart dataset
df = pd.read_csv('https://staging-content-media-cdn.codefinity.com/courses/a7e17f02-2cc9-4b92-abe0-cc8710d7011e/heart.csv')

# Select features from the DataFrame
selected_features = ['sex', 'cp', 'restecg', 'slope', 'ca', 'thal', 'fbs', 'target']

# Create a new DataFrame containing only the selected features
df_selected = df[selected_features]

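# One-hot encode 'cp', 'restecg', 'slope', 'thal', and 'ca' with sklearn's OneHotEncoder.
# Note: `sparse_output` replaced the `sparse` argument in scikit-learn 1.2;
# on older versions use sparse=False instead.
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded_cols = ['cp', 'restecg', 'slope', 'thal', 'ca']
df_encoded_cols = pd.DataFrame(
    encoder.fit_transform(df[encoded_cols]),
    columns=encoder.get_feature_names_out(encoded_cols),
    index=df.index,  # keep rows aligned with the original DataFrame for the concat below
)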

# Drop the original columns and replace them with the encoded ones
df_encoded = df_selected.drop(columns=encoded_cols)
df_encoded = pd.concat([df_encoded, df_encoded_cols], axis=1)

# Mine frequent itemsets using the Apriori algorithm (mlxtend expects boolean columns)
frequent_itemsets = apriori(df_encoded.astype(bool), min_support=0.2, use_colnames=True)

# Generate association rules; store them in `rules` so the imported
# association_rules function is not overwritten
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.8)

# Print antecedent, consequent, confidence, and lift metrics
print('Association Rules:')
print(rules[['antecedents', 'consequents', 'confidence', 'lift']])

What conclusions can we draw?

If a patient has thalassemia type 3 (thal_3), they are male (sex) with a confidence of 87.07%. In other words, 87.07% of the thal_3 records in the dataset belong to male patients.
If a patient has both slope type 2 (slope_2) and restecg type 1 (restecg_1), they have heart disease (target) with a confidence of 80.36%.
If a patient has both thalassemia type 2 (thal_2) and restecg type 1 (restecg_1), they have heart disease (target) with a confidence of 84.75%.
If a patient has both slope type 2 (slope_2) and thalassemia type 2 (thal_2), they have heart disease (target) with a confidence of 85.45%.
All lift values for these rules are greater than 1, which means that each antecedent and its consequent occur together more often than they would if they were independent. Observing the antecedent therefore raises the probability of the consequent, so the associations are positive. Confidence alone would not show this, because a rule can have high confidence simply because its consequent is common.
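For reference, the lift discussed above follows the standard definition, written here for a rule with antecedent $A$ and consequent $B$ (the symbols $A$ and $B$ are introduced only for this formula):

```latex
\mathrm{lift}(A \Rightarrow B)
  = \frac{\mathrm{confidence}(A \Rightarrow B)}{\mathrm{support}(B)}
  = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)\,\mathrm{support}(B)}
```

A value of 1 corresponds to independence, values above 1 to a positive association, and values below 1 to a negative one.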

Using the rules whose consequent is target, we can even perform simple rule-based classification: if a patient has the feature values listed in a rule's antecedent, we can predict heart disease without training an ML model.
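Rule-based classification of this kind could be sketched as follows. This is a minimal illustration under stated assumptions, not a validated classifier: the `RULES` list simply restates the antecedents discussed above, `predict_target` is a hypothetical helper, the patient rows are made up, and predicting 0 when no rule fires is a simplification (the rules say nothing about such patients).

```python
import pandas as pd

# Antecedents of the rules that point to `target` (restated by hand, as an
# assumption): a rule fires when all of its one-hot columns equal 1.
RULES = [
    {"slope_2", "restecg_1"},
    {"thal_2", "restecg_1"},
    {"slope_2", "thal_2"},
]

def predict_target(row, rules=RULES):
    """Return 1 if any rule's antecedent is fully satisfied by `row`, else 0."""
    return int(any(all(row[col] == 1 for col in rule) for rule in rules))

# Toy patients (made-up values, for illustration only)
patients = pd.DataFrame({
    "slope_2":   [1, 0, 1],
    "restecg_1": [1, 1, 0],
    "thal_2":    [0, 0, 1],
})
patients["predicted_target"] = patients.apply(predict_target, axis=1)
print(patients)
```

The first and third patients each satisfy one antecedent and are predicted as 1; the second satisfies none and falls back to 0.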
