Advanced Probability Theory
Testing the Hypothesis of Independence of Two Random Variables
Real-world tasks often require analyzing whether different features depend on each other. For example, we can test whether there is a relationship between:
- gender and political party affiliation;
- education level and job satisfaction;
- age and voting behavior;
- income level and preferred mode of transportation.
But how can we check whether the variables are independent when we do not have the entire population, only small samples of the corresponding variables? For this, we can use the chi-square independence criterion (also known as the chi-square test of independence).
Hypothesis formulation
We can use this criterion to test the following hypotheses:
Main (null) hypothesis: the corresponding random variables are independent of each other.
Alternative hypothesis: there is some relationship between the considered random variables.
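For discrete variables, the two hypotheses can also be written in terms of probabilities. The notation below (random variables X and Y with values x and y) is introduced here for illustration and is not part of the original lesson:

```latex
\begin{aligned}
H_0 &: \; P(X = x,\, Y = y) = P(X = x)\,P(Y = y) \quad \text{for all pairs } (x, y), \\
H_1 &: \; P(X = x,\, Y = y) \neq P(X = x)\,P(Y = y) \quad \text{for at least one pair } (x, y).
\end{aligned}
```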
Contingency table
To use the chi-square independence test, we first have to preprocess the data by creating a contingency table. A contingency table, also known as a cross-tabulation table, summarizes categorical data from two or more variables. It presents the joint distribution of the variables as the frequency (count) of each combination of categories. Let's look at an example:
    import pandas as pd

    # Example data
    data = {'Gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'F', 'M'],
            'Smoker': ['No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes']}

    df = pd.DataFrame(data)

    # Create contingency table using pandas crosstab
    cont_table = pd.crosstab(df['Gender'], df['Smoker'])

    print(cont_table)
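To see how a contingency table feeds into the test, here is a minimal sketch that computes the expected counts under independence (row total times column total, divided by the sample size) and the resulting chi-square statistic by hand. It reuses the Gender/Smoker data from the example; the variable names `observed`, `expected`, `chi2_stat`, and `dof` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Same toy data as in the lesson's example
data = {'Gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'F', 'M'],
        'Smoker': ['No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes']}
df = pd.DataFrame(data)
cont_table = pd.crosstab(df['Gender'], df['Smoker'])

# Observed counts as a plain NumPy array
observed = cont_table.to_numpy()

# Marginal totals and sample size
row_totals = observed.sum(axis=1, keepdims=True)  # shape (rows, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, cols)
n = observed.sum()

# Expected counts if the two variables were independent
expected = row_totals @ col_totals / n

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (number of rows - 1) * (number of columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(expected)
print(chi2_stat, dof)
```

In this toy sample the observed counts coincide with the expected ones, so the statistic is zero and the data give no indication of dependence. With only ten observations, this should be read as an illustration of the arithmetic rather than a meaningful test.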
The contingency table for continuous random variables is built a little differently. We first split the values into several discrete bins and only then build the contingency table, for example:
    import pandas as pd

    # Create a sample dataset
    data = pd.DataFrame({'age': [22, 45, 32, 19, 28, 57, 39, 41, 36, 24],
                         'income': [50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000]})

    # Define the bins for age
    bins = [18, 25, 35, 45, 55, 65, 75]

    # Create a new column with the age bins
    data['age_group'] = pd.cut(data['age'], bins)

    # Create a new column with income split into 3 equal-width bins
    data['income_group'] = pd.cut(data['income'], 3)

    # Create a contingency table
    contingency_table = pd.crosstab(data['age_group'], data['income_group'])

    print(contingency_table)
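Binning a small sample into many groups produces a sparse table. A commonly cited rule of thumb (not stated in the lesson, so treat it as an assumption) is that the chi-square approximation works best when most expected cell counts are at least 5. The sketch below checks this for the age/income example using `expected_freq` from `scipy.stats.contingency`; the name `share_small` is introduced here for illustration.

```python
import pandas as pd
from scipy.stats.contingency import expected_freq

# Same sample dataset and binning as in the lesson's example
data = pd.DataFrame({'age': [22, 45, 32, 19, 28, 57, 39, 41, 36, 24],
                     'income': [50000, 60000, 70000, 80000, 90000,
                                100000, 110000, 120000, 130000, 140000]})
data['age_group'] = pd.cut(data['age'], [18, 25, 35, 45, 55, 65, 75])
data['income_group'] = pd.cut(data['income'], 3)
contingency_table = pd.crosstab(data['age_group'], data['income_group'])

# Expected counts under independence for every cell of the table
expected = expected_freq(contingency_table.to_numpy())

# Share of cells whose expected count falls below the rule-of-thumb threshold
share_small = (expected < 5).mean()
print(f'Share of cells with expected count below 5: {share_small:.0%}')
```

If this share is large, using fewer and wider bins or collecting more data is usually a safer choice than running the test on the sparse table.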
Chi-square independence criterion in Python
Finally, let's use the chi-square independence criterion to check independence on a real dataset.
    import pandas as pd
    from scipy.stats import chi2_contingency

    data = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/Advanced+Probability+course+media/heart.csv')

    # Calculate the correlation between target and the other features
    # (computed once, outside the loop)
    correlations = data.corr()['target']
    for column in data.columns:
        print('Correlation between', column, 'and target is:', correlations.loc[column])

    # If the correlation is not close to zero, then the features have some linear relationship
    # fbs and target have a very small correlation, so there is no noticeable linear dependence between them
    # Let's check the hypothesis that fbs and heart disease occurrence are independent

    # Choose significance level
    alpha = 0.05

    # Contingency table for discrete target and fbs
    cont_table = pd.crosstab(data['fbs'], data['target'])

    # Perform the chi-square independence test
    chi2_stat, p_val, dof, expected = chi2_contingency(cont_table, correction=True)

    # A small p-value is evidence against independence;
    # a large one only means independence cannot be rejected
    if p_val < alpha:
        print('\n Fbs and heart disease occurrence are dependent')
    else:
        print('\n We cannot reject independence of fbs and heart disease occurrence')
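The example above needs network access to load the dataset. As a self-contained sketch, the same call can be tried on a small table of made-up counts to see what each of the four returned values means. The counts below are hypothetical and chosen only for illustration; they are not taken from the heart dataset.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table of counts (rows: two groups, columns: two outcomes)
observed = np.array([[30, 10],
                     [20, 40]])

# correction=True applies Yates' continuity correction,
# which SciPy uses only when the table has 1 degree of freedom
chi2_stat, p_val, dof, expected = chi2_contingency(observed, correction=True)

# The same test without the continuity correction, for comparison
chi2_raw, p_raw, _, _ = chi2_contingency(observed, correction=False)

print('Statistic with correction:   ', chi2_stat)
print('Statistic without correction:', chi2_raw)
print('p-value:', p_val)
print('Degrees of freedom:', dof)
print('Expected counts under independence:')
print(expected)

alpha = 0.05
if p_val < alpha:
    print('Reject the hypothesis of independence')
else:
    print('Cannot reject the hypothesis of independence')
```

For these counts the corrected statistic is smaller than the uncorrected one, which makes the corrected test slightly more conservative. The decision is the same either way, since the p-value is far below 0.05.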