Testing the Hypothesis of Independence of Two Random Variables


In real-life tasks, we often need to analyze the dependence between different features. For example:

Gender and political party affiliation: We can test whether there is a relationship between gender and political party affiliation;
Education level and job satisfaction: We can test whether there is a relationship between education level and job satisfaction;
Age and voting behavior: We can test whether there is a relationship between age and voting behavior;
Income level and preferred mode of transportation: We can test whether there is a relationship between income level and preferred mode of transportation.

But how can we check whether the variables are independent when we have only small samples of them rather than the entire population? A sample can never prove independence, but it can show whether the data contradict it. For this, we can use the chi-square independence criterion, also known as the chi-square test of independence.

Hypothesis formulation

We can use this criterion to test the following hypotheses:
Main (null) hypothesis: the random variables are independent of each other.
Alternative hypothesis: there is some relationship between the random variables.
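
Written out formally, the two hypotheses might look as follows. The symbols X and Y (the two variables) and x_i, y_j (their categories) are notation assumed here for illustration; they do not appear elsewhere in this chapter.

```latex
\begin{aligned}
H_0 &: \; P(X = x_i,\, Y = y_j) = P(X = x_i)\, P(Y = y_j) \quad \text{for all } i, j \\
H_1 &: \; P(X = x_i,\, Y = y_j) \neq P(X = x_i)\, P(Y = y_j) \quad \text{for at least one pair } (i, j)
\end{aligned}
```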

Contingency table

To use the chi-square independence test, we first need to preprocess the data by creating a contingency table. A contingency table, also known as a cross-tabulation table, summarizes categorical data from two or more variables. It presents the joint distribution of the variables as the frequency (count) of each combination of categories. Let's look at an example:


import pandas as pd

# Example data
data = {'Gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'F', 'M'],
        'Smoker': ['No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes']}
df = pd.DataFrame(data)

# Create contingency table using pandas crosstab
cont_table = pd.crosstab(df['Gender'], df['Smoker'])

print(cont_table)
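
The test compares these observed counts with the counts we would expect if the two variables were independent: for each cell, (row total × column total) / sample size. The following is a minimal sketch of that calculation on the same toy data; the names observed, expected and expected_table are illustrative choices.

```python
import numpy as np
import pandas as pd

# Same toy data as in the example above
data = {'Gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'F', 'M'],
        'Smoker': ['No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes']}
df = pd.DataFrame(data)

# Observed counts
observed = pd.crosstab(df['Gender'], df['Smoker'])

# Marginal totals and sample size
row_totals = observed.sum(axis=1).to_numpy()
col_totals = observed.sum(axis=0).to_numpy()
n = observed.to_numpy().sum()

# Expected counts under independence: row total * column total / n
expected = np.outer(row_totals, col_totals) / n
expected_table = pd.DataFrame(expected, index=observed.index, columns=observed.columns)

print(expected_table)
```

In this toy sample the observed counts happen to equal the expected ones in every cell, so the sample shows no sign of dependence between Gender and Smoker.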

The contingency table for continuous random variables is built a little differently: we first split the values into several discrete bins and only then build the contingency table. For example:


import pandas as pd

# Create a sample dataset
data = pd.DataFrame({'age': [22, 45, 32, 19, 28, 57, 39, 41, 36, 24],
                     'income': [50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000]})

# Define the bins for age
bins = [18, 25, 35, 45, 55, 65, 75]

# Create a new column with the age bins
data['age_group'] = pd.cut(data['age'], bins)
# Create a new column with income split into 3 equal-width bins
data['income_group'] = pd.cut(data['income'], 3)
# Create a contingency table
contingency_table = pd.crosstab(data['age_group'], data['income_group'])

print(contingency_table)
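
When choosing bins, it can help to check how many observations fall into each one, because some bins may end up empty. A small sketch using the same ages and bin edges; the string labels are hypothetical names added here for readability.

```python
import pandas as pd

ages = pd.Series([22, 45, 32, 19, 28, 57, 39, 41, 36, 24])
bins = [18, 25, 35, 45, 55, 65, 75]

# Hypothetical labels; pd.cut intervals are right-closed, e.g. '18-25' means (18, 25]
labels = ['18-25', '25-35', '35-45', '45-55', '55-65', '65-75']

age_group = pd.cut(ages, bins, labels=labels)

# Count observations per bin, keeping the bins in their natural order
counts = age_group.value_counts().sort_index()
print(counts)
```

Here the bins 45-55 and 65-75 receive no observations. Very sparse tables are usually a poor fit for the chi-square test, so merging or widening bins is a common remedy.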

Chi-square independence criterion in Python

Finally, let's use the chi-square independence criterion to check independence on a real dataset.


import pandas as pd
from scipy.stats import chi2_contingency

data = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/Advanced+Probability+course+media/heart.csv')
# Calculate the correlation between target and the other features
correlations = data.corr()['target']
for i in data.columns:
  print('Correlation between', i, 'and target is:', correlations.loc[i])

# A correlation that is not close to zero indicates a linear relationship between the features
# fbs and target have a very small correlation, so there is no sign of a linear dependence between them
# Let's test the hypothesis that fbs and heart disease occurrence are independent

# Choose significance level
alpha = 0.05

# Contingency table for discrete target and fbs
cont_table = pd.crosstab(data['fbs'], data['target'])

# Perform the chi-square independence test
chi2_stat, p_val, dof, expected = chi2_contingency(cont_table, correction=True)

if p_val < alpha:
  print('\n Fbs and heart disease occurrence are dependent')
else:
  print('\n We cannot reject the hypothesis that fbs and heart disease occurrence are independent')
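
To see what chi2_contingency computes, the statistic can be reproduced by hand. The sketch below uses a small hypothetical 2x2 table (not the heart dataset) and compares the manual result with SciPy's, with and without the continuity correction that correction=True applies to 2x2 tables.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical observed counts, chosen only for illustration
observed = np.array([[20, 30],
                     [30, 20]])

# Expected counts under independence: row total * column total / n
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
n = observed.sum()
expected = np.outer(row_totals, col_totals) / n

# Pearson chi-square statistic and its p-value
stat_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = chi2.sf(stat_manual, dof)

# SciPy without and with the continuity correction
stat_plain, p_plain, _, _ = chi2_contingency(observed, correction=False)
stat_corr, p_corr, _, _ = chi2_contingency(observed, correction=True)

print('manual:      ', stat_manual, p_manual)
print('uncorrected: ', stat_plain, p_plain)
print('corrected:   ', stat_corr, p_corr)
```

For this table the uncorrected statistic is 4.0 and the corrected one is 3.24, so at alpha = 0.05 the two variants lead to different decisions. The correction setting can matter when the result is close to the significance threshold.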


Section 4. Chapter 6
