Heart disease prediction

Data analysis, Random Forest, Logistic Regression, and SVM with the heart disease dataset
Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Data

The data is available at the UCI Machine Learning Repository.

The website contains 4 datasets concerning heart disease diagnosis. The data for these datasets was collected at the following four locations:

1. Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

There are 2 versions of each dataset:

  • Full dataset with 76 attributes
  • Dataset with 14 attributes

The reduced dataset exists because only the subset of 14 attributes has been used in prior research and experiments.

Files used for this project:

  • processed.switzerland.data
  • processed.cleveland.data
  • processed.hungarian.data
  • processed.va.data

I create a single dataset by combining these four.

Download data
!wget https://archive.ics.uci.edu/static/public/45/heart+disease.zip
!mkdir heart_disease_data
!unzip heart+disease.zip -d heart_disease_data 

Creating a single dataset

The datasets share the same columns, have no header rows, and encode missing values as ?.

Code
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak",
          "slope", "ca", "thal", "disease"]
sdf = pd.read_csv("heart_disease_data/processed.switzerland.data", header=None, names=columns, na_values='?')
cdf = pd.read_csv("heart_disease_data/processed.cleveland.data", header=None, names=columns, na_values='?')
hdf = pd.read_csv("heart_disease_data/processed.hungarian.data", header=None, names=columns, na_values='?')
vdf = pd.read_csv("heart_disease_data/processed.va.data", header=None, names=columns, na_values='?')

df = pd.concat([sdf, cdf, vdf, hdf], ignore_index=True)
# binarize the target: 0 = no disease, values 1-4 = disease present
df['disease'] = df['disease'].apply(lambda x: 1 if x > 0 else 0)
df
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal disease
0 32.0 1.0 1.0 95.0 0.0 NaN 0.0 127.0 0.0 0.7 1.0 NaN NaN 1
1 34.0 1.0 4.0 115.0 0.0 NaN NaN 154.0 0.0 0.2 1.0 NaN NaN 1
2 35.0 1.0 4.0 NaN 0.0 NaN 0.0 130.0 1.0 NaN NaN NaN 7.0 1
3 36.0 1.0 4.0 110.0 0.0 NaN 0.0 125.0 1.0 1.0 2.0 NaN 6.0 1
4 38.0 0.0 4.0 105.0 0.0 NaN 0.0 166.0 0.0 2.8 1.0 NaN NaN 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
915 52.0 1.0 4.0 160.0 331.0 0.0 0.0 94.0 1.0 2.5 NaN NaN NaN 1
916 54.0 0.0 3.0 130.0 294.0 0.0 1.0 100.0 1.0 0.0 2.0 NaN NaN 1
917 56.0 1.0 4.0 155.0 342.0 1.0 0.0 150.0 1.0 3.0 2.0 NaN NaN 1
918 58.0 0.0 2.0 180.0 393.0 0.0 0.0 110.0 1.0 1.0 2.0 NaN 7.0 1
919 65.0 1.0 4.0 130.0 275.0 0.0 1.0 115.0 1.0 1.0 2.0 NaN NaN 1

920 rows × 14 columns

Attributes

Numerical attributes

  • age - age in years

  • trestbps - resting blood pressure (in mm Hg on admission to the hospital)

  • chol - serum cholesterol in mg/dl

  • thalach - maximum heart rate achieved

  • oldpeak - ST depression induced by exercise relative to rest. ‘ST’ refers to a segment of the electrocardiographic (ECG) plot.

  • ca - number of major vessels (0-3) colored by fluoroscopy. Fluoroscopy is one of the most popular non-invasive coronary artery disease diagnostic methods. It enables the doctor to see the flow of blood through the coronary arteries in order to evaluate the presence of arterial blockages.

Categorical attributes

  • sex - sex of the patient
    • 1 = male
    • 0 = female
  • cp - chest pain type
    • 1 = typical angina
    • 2 = atypical angina
    • 3 = non-anginal pain
    • 4 = asymptomatic
  • fbs - fasting blood sugar > 120 mg/dl
    • 1 = true
    • 0 = false
  • restecg - resting electrocardiographic (ECG) results
    • 0 = normal
    • 1 = ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    • 2 = probable or definite left ventricular hypertrophy by Estes’ criteria
  • exang - exercise induced angina. Angina is a type of chest pain caused by reduced blood flow to the heart.
    • 1 = yes
    • 0 = no
  • slope - the slope of the peak exercise ST segment (ECG)
    • 1 = upsloping
    • 2 = flat
    • 3 = downsloping
  • thal - result of the thallium stress test, which traces blood flow through the heart
    • 3 = normal blood flow
    • 6 = fixed defect (no blood flow in some part of the heart)
    • 7 = reversible defect (blood flow is observed but it is not normal)
  • disease - presence of heart disease in the patient. In the raw data it is integer valued from 0 (no presence) to 4; it was binarized above, so 0 means no disease and 1 means disease present.

Exploratory data analysis

EDA helper functions
df_eda = df.copy()

def plot_categorical(data: pd.DataFrame,
                     column: str,
                     labels=(),
                     target='disease',
                     target_labels=('healthy', 'heart disease'),
                     title='', font=10, ax=None):
    crosstab = pd.crosstab(data[column], data[target])
    ax = crosstab.plot(kind='bar', figsize=(15, 7), rot=0, fontsize=font, ax=ax)
    ax.set_xticks(np.arange(len(labels)))
    ax.set_xticklabels(labels)
    ax.legend(target_labels)
    plt.ylabel("count", size=14)
    plt.title(title)

    bars = ax.patches

    # percentage of healthy patients within each category
    total_by_category = crosstab.sum(axis=1)
    healthy_perc = round((crosstab[0] / total_by_category) * 100)

    # the first half of the bars belongs to the 'healthy' series,
    # the second half to the 'heart disease' series
    for (i, bar) in enumerate(bars):
      prc = healthy_perc.iloc[i] if i < len(healthy_perc) else 100 - healthy_perc.iloc[i % len(healthy_perc)]
      plt.annotate(str(int(prc)) + '%',
                    (bar.get_x() + bar.get_width() / 2., bar.get_height()),
                    ha='center', va='center',
                    xytext=(0, 9),
                    textcoords='offset points')


def plot_numeric(data: pd.DataFrame, column: str, title: str = ''):
    fig, ax = plt.subplots(1, 2)
    fig.set_size_inches(20, 7)

    # density of the column for each target group
    healthy = data.loc[data.disease == 0, column]
    sick = data.loc[data.disease == 1, column]
    healthy.plot.density(ax=ax[0])
    sick.plot.density(ax=ax[0])
    ax[0].legend(['healthy', 'heart disease'])

    # box plot of the column split by the target
    data.boxplot(by='disease', column=[column], ax=ax[1])

    fig.suptitle(title, fontsize=19)


def describe_numeric(data: pd.DataFrame, column: str):
    temp = data[[column, 'disease']].copy()
    temp['healthy'] = data[data.disease == 0][column]
    temp['sick'] = data[data.disease == 1][column]
    return temp[[column, 'healthy', 'sick']].describe()

def plot_missing(data: pd.DataFrame):
  # percentage of missing values per column, largest first
  na_percent = (data.isna().mean() * 100).round(2).sort_values(ascending=False)
  na_percent.plot.bar()
  plt.ylabel("Percentage", size=14)
  plt.title('Missing values')

Numeric data summary

Code
df_numeric = df_eda[['age', 'trestbps', 'chol', 'thalach', 'oldpeak']]
df_numeric.describe()
age trestbps chol thalach oldpeak
count 920.000000 861.000000 890.000000 865.000000 858.000000
mean 53.510870 132.132404 199.130337 137.545665 0.878788
std 9.424685 19.066070 110.780810 25.926276 1.091226
min 28.000000 0.000000 0.000000 60.000000 -2.600000
25% 47.000000 120.000000 175.000000 120.000000 0.000000
50% 54.000000 130.000000 223.000000 140.000000 0.500000
75% 60.000000 140.000000 268.000000 157.000000 1.500000
max 77.000000 200.000000 603.000000 202.000000 6.200000
Code
dcorr = df_numeric.corr(method='pearson')
sns.heatmap(data=dcorr, annot=True, fmt=".2f")

Categorical data summary

Code
plt.figure(figsize=(5, 5), dpi=200)
for (i, col) in enumerate(['sex', 'cp', 'fbs', 'restecg', 'exang', 'disease', 'thal', 'slope']):
  plt.subplot(3, 3, i + 1)
  df_eda[col].value_counts().sort_index().plot(kind='pie', autopct='%1.1f%%', title=col, textprops={'fontsize': 5})

Analysing risk factors for heart disease

Cholesterol (‘chol’)

Cholesterol has a lot of 0 values, which is not physiologically possible, so I replace them with NaN.

Code
df_eda.chol.hist(bins=20)
df_eda['chol'] = df_eda['chol'].replace({0: np.nan})  # treat zeros as missing

Analysis

People with heart disease on average have a higher cholesterol level.

Code
plot_numeric(df_eda, 'chol', 'Cholesterol')

Code
describe_numeric(df_eda, 'chol')
chol healthy sick
count 718.000000 372.000000 346.000000
mean 246.832869 240.158602 254.008671
std 58.527062 55.767559 60.620439
min 85.000000 85.000000 100.000000
25% 210.000000 204.000000 216.000000
50% 239.500000 233.000000 248.000000
75% 276.750000 270.250000 284.750000
max 603.000000 564.000000 603.000000

Binning shows that people with a cholesterol level above 254 mg/dl have a more than 50% chance of having heart disease.

Code
df_eda['chol_range'] = pd.qcut(df_eda['chol'], 10, duplicates = 'drop')
plot_categorical(df_eda, 'chol_range', df_eda.chol_range.unique().sort_values(), title = 'Cholesterol intervals')
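
The per-interval disease rates behind this statement can also be read numerically; a minimal sketch using the chol_range column created above (observed=True just skips empty bins):

Code
df_eda.groupby('chol_range', observed=True)['disease'].mean().round(2)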

Age

People with heart disease are on average about 5 years older than healthy people.

Code
df_eda.age.hist(bins=20)

Code
describe_numeric(df_eda, 'age')
age healthy sick
count 920.000000 411.000000 509.000000
mean 53.510870 50.547445 55.903733
std 9.424685 9.433700 8.718959
min 28.000000 28.000000 31.000000
25% 47.000000 43.000000 51.000000
50% 54.000000 51.000000 57.000000
75% 60.000000 57.000000 62.000000
max 77.000000 76.000000 77.000000
Code
plot_numeric(df_eda, 'age', 'age')

We can see that the risk of getting heart disease increases after the age of 54.

Code
df_eda['age_range'] = pd.qcut(df_eda['age'], 10)
plot_categorical(df_eda, 'age_range', df_eda.age_range.unique().sort_values())

Resting blood pressure (‘trestbps’)

Code
df_eda.trestbps.hist(bins=20)

There’s one unrealistic value of 0; I replace it with NaN for the analysis.

Code
df_eda['trestbps'] = df_eda['trestbps'].replace({0: np.nan})

People with heart disease on average have a slightly higher blood pressure than healthy patients.

Code
plot_numeric(df_eda, 'trestbps', 'Resting blood pressure')

Code
describe_numeric(df_eda, 'trestbps')
trestbps healthy sick
count 860.000000 391.000000 469.000000
mean 132.286047 129.913043 134.264392
std 18.536175 16.869867 19.617889
min 80.000000 80.000000 92.000000
25% 120.000000 120.000000 120.000000
50% 130.000000 130.000000 130.000000
75% 140.000000 140.000000 145.000000
max 200.000000 190.000000 200.000000

Most patients with normal blood pressure do not have heart disease, while most patients with elevated blood pressure have heart disease.

Code
bins = [0, 120, 130, 140, np.inf]
labels = ['Normal', 'Elevated', 'High', 'Critically high']
df_eda['bp_range'] = pd.cut(df_eda['trestbps'], bins)
plot_categorical(df_eda, 'bp_range', labels)

Exercise induced angina (‘exang’)

Most of the patients with heart disease experienced angina during exercise, while the majority of the healthy group had no such symptoms.

Code
plot_categorical(df_eda, 'exang', ['no', 'yes'])

Number of Blocked Vessels (‘ca’)

The chance of having heart disease grows with the number of blocked vessels: patients with 0 blocked vessels have only a 27% chance of having heart disease, rising to 85% for the group with 3 blocked vessels.

Code
labels=["0", "1", "2", "3"]
plot_categorical(df_eda, 'ca', labels)
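
As a quick check of the quoted percentages, the disease share within each ‘ca’ category can be computed from a row-normalized crosstab; a minimal sketch:

Code
# each row sums to 1; the column for disease=1 is the heart disease rate per category
pd.crosstab(df_eda['ca'], df_eda['disease'], normalize='index').round(2)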

Gender

The majority of the men in the dataset have heart disease, while only 26% of the women do.

Code
plot_categorical(df_eda, 'sex', ['women', 'men'])

Chest pain type (‘cp’)

Among the patients with no chest pain (asymptomatic), almost 80% had heart disease. Patients with atypical angina had the lowest rate of heart disease. Overall, we cannot say that chest pain by itself is a reliable risk factor for heart disease.

Code
labels=['typical angina', 'atypical angina', 'non-anginal pain', 'no pain']
plot_categorical(df_eda, 'cp', labels)

Fasting blood sugar

Most people in the dataset do not have a high fasting blood sugar level, but among those who do, almost 70% have heart disease.

Code
plot_categorical(df_eda, 'fbs', ['Normal', 'High'])

Resting electrocardiographic results (‘restecg’)

Patients with type 1 and type 2 ECG abnormalities have a higher chance of having heart disease than the group with a normal ECG.

Code
labels=['normal', 'type 1', 'type 2']
plot_categorical(df_eda, 'restecg', labels)

Maximum heart rate achieved (‘thalach’)

Code
df_eda.thalach.plot.hist(bins=20)

Patients without heart disease are able to reach a higher maximum heart rate than patients with the disease.

Code
plot_numeric(df_eda, 'thalach', 'Maximum heart rate achieved')

ST depression (oldpeak)

Code
df_eda.oldpeak.hist()

There are some negative values; I replace them with NaN.

Code
df_eda['oldpeak'] = df_eda['oldpeak'].apply(lambda x: np.nan if x < 0 else x)
df_eda.oldpeak.hist()

Oldpeak is another ECG parameter, measuring ST depression during exercise: it represents a distance between specific points on the ECG plot. There is a notable difference in the distributions between the groups; sick people on average have higher values of this parameter.

Code
plot_numeric(df_eda, 'oldpeak', 'oldpeak')

Code
describe_numeric(df_eda, 'oldpeak')
oldpeak healthy sick
count 846.000000 387.000000 459.000000
mean 0.906265 0.425840 1.311329
std 1.071192 0.712184 1.153295
min 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000
50% 0.500000 0.000000 1.200000
75% 1.500000 0.800000 2.000000
max 6.200000 4.200000 6.200000

Patients with high oldpeak values have a higher chance of having heart disease.

Code
df_eda['oldpeak_range'] = pd.qcut(df_eda['oldpeak'], 5, duplicates = 'drop')
plot_categorical(df_eda, 'oldpeak_range', df_eda.oldpeak_range.unique().sort_values(), title = 'oldpeak')

The slope of the peak exercise ST segment (‘slope’)

This is another ECG parameter, measured during exercise. Almost 80% of the patients with a ‘flat’ or ‘downsloping’ ST slope had heart disease, while most of the people with an ‘upsloping’ slope were healthy.

Code
labels=["upslopping", "flat", "downslopping"]
plot_categorical(df_eda, 'slope', labels)

Thallium stress test result (‘thal’)

This attribute describes blood-flow defects observed during a thallium stress test. Patients with a defect are at high risk: almost 80% of those with either a fixed or a reversible defect have heart disease.

Code
labels=['normal', 'fixed defect', 'reversible defect']
plot_categorical(df_eda, 'thal', labels)

Numeric attributes relationships

Younger patients are able to achieve higher maximum heart rate. Patients with heart disease are older and have lower maximum heart rate.

Code
sns.pairplot(df_eda, hue='disease', vars=['age', 'thalach']);

Blood pressure is higher in older patients. But these factors do not form a clear separation between healthy and sick patients.

Code
sns.pairplot(df_eda, hue='disease', vars=['age', 'trestbps']);

EDA conclusion

Exploratory data analysis identified the following groups with a high risk of heart disease:

  • Patients with high cholesterol
  • Patients older than 54 years
  • Male patients
  • Patients with high fasting blood sugar
  • Patients with abnormalities in their ECG
  • Patients with a low maximum heart rate
  • Patients with a high oldpeak value
  • Patients with a flat or downsloping ST slope
  • Patients with blocked heart vessels
  • Patients with a defect on the thallium stress test
  • Patients with high blood pressure (more than 120 mm Hg)

Data quality

Incorrect values

As discovered during the EDA, there are some incorrect values in the dataset:

  • Cholesterol (‘chol’) has lots of zeros.
  • ST depression (‘oldpeak’) has some negative values. In the existing research this parameter is >= 0.
  • Resting blood pressure (‘trestbps’) has one zero value.

Missing values

I count the incorrect values as missing values.

Code
df_missing = df.copy()
df_missing.chol = df.chol.replace({0:np.nan})
df_missing.trestbps = df.trestbps.replace({0:np.nan})
df_missing.loc[df.oldpeak < 0, 'oldpeak'] = np.nan

plot_missing(df_missing)

Data Preparation

Data cleaning

Removing incorrect values.

Code
def remove_incorrect(data: pd.DataFrame) -> pd.DataFrame:
  copy = data.copy()
  copy['chol'] = copy['chol'].replace(0, np.nan)
  copy['trestbps'] = copy['trestbps'].replace(0, np.nan)
  copy.loc[copy['oldpeak'] < 0, 'oldpeak'] = np.nan
  return copy

df_clean = remove_incorrect(df)

#plot the result
fig, axs = plt.subplots(2, 2, figsize=(10, 10))

#chol
sns.histplot(data=df, x="chol", ax = axs[0,0])
axs[0, 0].title.set_text('Cholesterol raw values')
sns.histplot(data=df_clean, x="chol", ax = axs[0,1])
axs[0, 1].title.set_text('Cholesterol removed zeros')

#oldpeak
sns.histplot(data=df, x="oldpeak", ax = axs[1,0])
axs[1, 0].title.set_text('Oldpeak raw values')
sns.histplot(data=df_clean, x="oldpeak", ax = axs[1,1])
axs[1, 1].title.set_text('Oldpeak removed negative values')

Deleting columns with more than 30% missing data

Code
df_reduced = df_clean.drop(['ca', 'thal', 'slope'],axis=1)
df_reduced.head(1)
age sex cp trestbps chol fbs restecg thalach exang oldpeak disease
0 32.0 1.0 1.0 95.0 NaN NaN 0.0 127.0 0.0 0.7 1
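
The dropped columns can also be derived programmatically from the missing-value shares instead of being hard-coded; a minimal sketch, assuming the 30% threshold stated above:

Code
# columns whose share of missing values exceeds 30%
threshold = 0.3
df_clean.columns[df_clean.isna().mean() > threshold].tolist()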

Removing duplicate rows.

Code
duplicates = df_reduced.loc[df_reduced.duplicated(), :] 
df_no_duplicates = df_reduced.drop_duplicates(keep='first') 
duplicates
age sex cp trestbps chol fbs restecg thalach exang oldpeak disease
613 58.0 1.0 3.0 150.0 219.0 0.0 1.0 118.0 1.0 0.0 1
728 49.0 0.0 2.0 110.0 NaN 0.0 0.0 160.0 0.0 0.0 0

Outliers

Code
def find_outliers(data: pd.DataFrame):
    return [col for col in data.columns if has_outliers(data[col])]

def get_box_limits(col=pd.Series):
    # box-plot whisker limits: 1.5 * IQR beyond the quartiles
    q1 = col.quantile(.25)
    q3 = col.quantile(.75)
    IQR = q3 - q1
    ll = q1 - (1.5 * IQR)
    ul = q3 + (1.5 * IQR)
    return (ll, ul)

def has_outliers(col: pd.Series):
    ll, ul = get_box_limits(col)
    upper_outliers = col[col > ul].count() > 0
    lower_outliers = col[col < ll].count() > 0
    return upper_outliers or lower_outliers

outliers_columns = find_outliers(df_no_duplicates[['age', 'trestbps', 'chol', 'thalach', 'oldpeak']])

fig, axs = plt.subplots(1, len(outliers_columns), figsize=(20, 7))
for (i, col) in enumerate(outliers_columns):
    df_no_duplicates[col].plot.box(ax=axs[i], fontsize=18)

These values are real, and removing them might affect the modeling, so I keep them.
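
For a sense of scale, the number of points outside the 1.5 × IQR whiskers can be counted with the get_box_limits helper defined above; a minimal sketch:

Code
for col in outliers_columns:
    ll, ul = get_box_limits(df_no_duplicates[col])
    n_out = ((df_no_duplicates[col] < ll) | (df_no_duplicates[col] > ul)).sum()
    print(f"{col}: {n_out} points outside the whiskers")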

Statistical significance testing


Chi-square test for categorical data

Code
from itertools import combinations
from scipy.stats import chi2_contingency

def test(col1: pd.Series, col2: pd.Series, alpha: float):
  # chi-square test of independence: True means independence cannot be rejected
  data_cont = pd.crosstab(col1, col2)
  p_value = chi2_contingency(data_cont)[1]
  return p_value > alpha

def color_df(x):
  if x == True:
    return 'color: %s' % 'red'
  elif pd.isna(x): return None
  else: return 'color: %s' % 'green'

def show_chi_square_results(data: pd.DataFrame):
  categorical = ['sex','cp','fbs','restecg','exang', 'disease']
  cmb = list(combinations(categorical, 2))

  chi_square_results_df = pd.DataFrame(index = categorical, columns = categorical)
  for c in cmb:
    res = test(data[c[0]], data[c[1]], 0.05)
    chi_square_results_df.loc[c[0], c[1]] = res
    chi_square_results_df.loc[c[1], c[0]] = res

  return chi_square_results_df.style.applymap(color_df)

show_chi_square_results(df_no_duplicates)
  sex cp fbs restecg exang disease
sex nan False False True False False
cp False nan True False False False
fbs False True nan False True False
restecg True False False nan False False
exang False False True False nan False
disease False False False False False nan

The table shows whether a variable is statistically independent of another variable (label ‘True’). No variable is independent of the target (disease) column.
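
For intuition about these labels, here is a toy run of the same test on a synthetic 2×2 table (made-up counts, for illustration only); the small p-value makes the test() helper above return False, i.e. the variables are treated as dependent:

Code
from scipy.stats import chi2_contingency
toy = [[30, 10], [10, 30]]  # synthetic contingency table
p_value = chi2_contingency(toy)[1]
print(p_value > 0.05)  # False -> independence rejected at alpha = 0.05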

T-test for numeric features

A two-tailed, two-sample t-test is performed.

Null Hypothesis: The means of features for patients with heart disease and healthy patients are the same.

Alternative hypothesis: There are statistically significant differences in the feature means for the healthy and sick patients.

Code
from scipy.stats import ttest_ind

def test_numeric(data: pd.DataFrame, col: str):
    data = data[~data[col].isna()]
    # Welch's two-sample t-test between the healthy and sick groups
    healthy = data.loc[data.disease == 0, col]
    sick = data.loc[data.disease == 1, col]
    _, p = ttest_ind(healthy, sick, equal_var=False)
    return p

def show_ttest_results(data: pd.DataFrame):
  results = [(col, test_numeric(data, col))  for col in ['chol', 'trestbps', 'age', 'oldpeak', 'thalach']]
  return pd.DataFrame(results, columns = ['Column', 'p-value'])
Code
show_ttest_results(df_no_duplicates)
Column p-value
0 chol 0.000000e+00
1 trestbps 0.000000e+00
2 age 0.000000e+00
3 oldpeak 9.389716e-19
4 thalach 0.000000e+00

I reject the null hypothesis and conclude that there are statistically significant differences in the feature means between the healthy and sick patients. I keep all the features.

Split train/test

70% train, 30% test

Code
train_data = df_no_duplicates.sample(frac=0.7, random_state=123)
test_data = df_no_duplicates[~df_no_duplicates.index.isin(train_data.index)]
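
The same 70/30 split could equivalently be produced with scikit-learn; a minimal sketch (train_alt and test_alt are illustrative names, and the sampled rows will differ from the index-based split above):

Code
from sklearn.model_selection import train_test_split
train_alt, test_alt = train_test_split(df_no_duplicates, test_size=0.3, random_state=123)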

Data imputation

Code
def impute(data, cols, impute_fn):
    imputed = data.copy()
    for col in cols:
        imputed[col + "_missing"] = imputed[col].isna().astype(int)  # flag column marking originally missing values
        imputed[col] = impute_fn(imputed[col])
    return imputed

def impute_categorical(data: pd.DataFrame) -> pd.DataFrame:
    cols = ['restecg', 'exang', 'fbs']
    impute_fn = lambda x: x.fillna(x.mode()[0])  # most frequent value
    return impute(data, cols, impute_fn)

def impute_numeric(data: pd.DataFrame) -> pd.DataFrame:
    cols = ['oldpeak', 'trestbps', 'thalach', 'chol']
    impute_fn = lambda x: x.fillna(x.median())  # median value
    return impute(data, cols, impute_fn)

def plot_numeric_imputation(data_before: pd.DataFrame, data_after: pd.DataFrame):
  cols = ['oldpeak', 'trestbps', 'thalach', 'chol']
  fig, axs = plt.subplots(1, len(cols), figsize = (25, 5))
  for i, col in enumerate(cols):
    sns.kdeplot(data_before[col], color='b', fill=True, ax = axs[i], alpha = 0.07)
    sns.kdeplot(data_after[col], color='r', fill=True, ax = axs[i], alpha = 0.07)
    axs[i].legend(['Before imputation', 'After imputation'])

def plot_categorical_imputation(data_before: pd.DataFrame, data_after: pd.DataFrame):
  cols = ['restecg', 'exang', 'fbs']
  fig, axs = plt.subplots(1, len(cols), figsize = (25, 5))
  for i, col in enumerate(cols):
    sns.histplot(data=data_before, x=col, ax=axs[i], color='b')
    sns.histplot(data=data_after, x=col, ax=axs[i], color='r', alpha=0.2)
    
plot_numeric_imputation(train_data, impute_numeric(train_data))
plot_categorical_imputation(train_data, impute_categorical(train_data))

One hot encoding

Code
def encode_dummy(data: pd.DataFrame, drop_first=True) -> pd.DataFrame:
  df = data.copy()
  for col in ['cp', 'restecg']:
    dummy = pd.get_dummies(df[col], prefix = col,drop_first=drop_first)
    df = pd.concat([df, dummy], axis=1)
    df.drop(col, axis=1, inplace=True)
  return df

encode_dummy(train_data).head(5)
age sex trestbps chol fbs thalach exang oldpeak disease cp_2.0 cp_3.0 cp_4.0 restecg_1.0 restecg_2.0
349 47.0 1.0 112.0 204.0 0.0 143.0 0.0 0.1 0 0 0 1 0 0
654 38.0 1.0 140.0 297.0 0.0 150.0 0.0 0.0 0 1 0 0 0 0
7 38.0 1.0 115.0 NaN 0.0 128.0 1.0 0.0 1 0 1 0 0 0
571 54.0 1.0 NaN 182.0 0.0 NaN NaN NaN 0 1 0 0 1 0
171 65.0 0.0 140.0 417.0 1.0 157.0 0.0 0.8 0 0 1 0 0 1

Feature Scaling

Normalization/Standardization

Machine learning algorithms such as linear regression, logistic regression, and neural networks that use gradient descent as an optimization technique require the data to be scaled. Distance-based algorithms like KNN, K-means, and SVM are the most affected by the range of the features. Therefore, I scale the data before employing a distance-based algorithm so that all the features contribute equally to the result.
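
Min-max scaling maps each feature into [0, 1] via x' = (x - min) / (max - min). A minimal sketch of the same transform applied by hand to a single column, equivalent to what MinMaxScaler does below:

Code
age = train_data['age']
age_scaled = (age - age.min()) / (age.max() - age.min())  # values now lie in [0, 1]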

Code
from sklearn.preprocessing import MinMaxScaler

def scale(data: pd.DataFrame):
  return pd.DataFrame(MinMaxScaler().fit_transform(data), columns=data.columns)

def plot_normalization(data_before: pd.DataFrame, data_after: pd.DataFrame, cols: list):
  fig, axs = plt.subplots(len(cols), 2, figsize = (25, 15))
  for i, col in enumerate(cols):
    sns.histplot(data_before[col], color='b', ax = axs[i, 0])
    axs[i, 0].title.set_text(col + ' before scaling')
    sns.histplot(data_after[col], color='r', ax = axs[i, 1])
    axs[i, 1].title.set_text(col + ' after scaling')

plot_normalization(train_data, scale(train_data), ['chol', 'age', 'sex'])

Modeling

Helper functions
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

def run_modeling(model, train_data, test_data):
    # fit on the train split
    x_train = train_data.drop('disease', axis=1)
    y_train = train_data['disease']
    model.fit(x_train, y_train)

    # evaluate accuracy on the test split
    x_test = test_data.drop('disease', axis=1)
    y_test = test_data['disease']
    y_model = model.predict(x_test)
    return accuracy_score(y_test, y_model), model


def impute_all(data):
    return impute_categorical(impute_numeric(data))

def full_pipeline(data):
    data = impute_all(data)
    data = encode_dummy(data)
    return scale(data)

Random forest

rf_train_data = impute_all(train_data)
rf_test_data = impute_all(test_data)

acc, rf_model = run_modeling(RandomForestClassifier(random_state=0), rf_train_data,rf_test_data)
print(f'accuracy: {acc}')
accuracy: 0.7527272727272727

Logistic regression

lr_train_data = full_pipeline(train_data)
lr_test_data = full_pipeline(test_data)
    

acc, model = run_modeling(LogisticRegression(), lr_train_data, lr_test_data)
print(f'accuracy: {acc}')
accuracy: 0.7781818181818182

SVM

An SVM has several hyperparameters that are not easy to choose by hand, so I use scikit-learn’s GridSearchCV to select them. The main parameters are the kernel, C, and gamma.

from sklearn import svm
from sklearn.model_selection import GridSearchCV

parameters = {'kernel':('linear', 'rbf'), 'C':[1], 'gamma':[0.05, 0.07, 0.125, 0.25, 0.5]}
model = GridSearchCV(svm.SVC(), parameters, scoring='accuracy')

svm_train_data = full_pipeline(train_data)
svm_test_data = full_pipeline(test_data)

acc, model = run_modeling(model, svm_train_data, svm_test_data)
print(f'accuracy: {acc}')
accuracy: 0.7781818181818182
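
Since run_modeling returns the fitted GridSearchCV object, the selected kernel, C, and gamma can be inspected afterwards; a short sketch:

# hyperparameters chosen by the grid search (available after fitting)
print(model.best_params_)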

Interpretation

I use the random forest feature importances for interpretation.

From the feature importance values I conclude that the most important variables for predicting heart disease are:

  • Chest pain type (cp)
  • Maximum heart rate achieved (thalach)
  • Oldpeak
  • Age
  • Cholesterol

Meanwhile, parameters such as blood sugar and the resting ECG results contributed the least to the heart disease prediction in the random forest model.

Most of the missing-value flags were not important for the model.

Code
(pd.Series(rf_model.feature_importances_, index=rf_train_data.drop(['disease'], axis=1).columns)
   .sort_values()
   .plot(kind='barh', figsize=(10, 10)))