Predicting Heart Diseases using ML Algorithms
In this post I am going to use several classification algorithms
- Logistic Regression,
- Linear SVM,
- rbf SVM,
- KNearestNeighbors,
- GaussianNB,
- DecisionTree,
- RandomForestClassifier,
- GradientBoostingClassifier
to predict whether a person has a high probability of getting heart disease.
Dataset
This dataset contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date. The “goal” field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4.
The data comes from the UCI Machine Learning Repository (Heart Disease dataset).
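The heart.csv file used below already collapses the 0–4 goal field into a binary target column. If you start from the raw UCI Cleveland file instead, a minimal conversion sketch looks like this; the file name and column order here are assumptions taken from the UCI documentation:
import pandas as pd
# assumed raw UCI file: 'num' holds the 0-4 goal field, '?' marks missing values
cols = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
        'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']
raw = pd.read_csv('processed.cleveland.data', names=cols, na_values='?')
raw['target'] = (raw['num'] > 0).astype(int)  # 0 = no disease, 1 = disease present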
Attribute Information:
1. age = age in years
2. sex = (1 = male; 0 = female)
3. cp = chest pain type
4. trestbps = resting blood pressure (in mm Hg on admission to the hospital)
5. chol = serum cholesterol in mg/dl
6. fbs = fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7. restecg = resting electrocardiographic results
8. thalach = maximum heart rate achieved
9. exang = exercise induced angina (1 = yes; 0 = no)
10. oldpeak = ST depression induced by exercise relative to rest
11. slope = the slope of the peak exercise ST segment
12. ca = number of major vessels (0–3) colored by fluoroscopy
13. thal = 3 = normal; 6 = fixed defect; 7 = reversible defect
14. target = 1 (disease) or 0 (no disease)
Data Exploration
#importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('heart.csv')
df.head()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
age 303 non-null int64
sex 303 non-null int64
cp 303 non-null int64
trestbps 303 non-null int64
chol 303 non-null int64
fbs 303 non-null int64
restecg 303 non-null int64
thalach 303 non-null int64
exang 303 non-null int64
oldpeak 303 non-null float64
slope 303 non-null int64
ca 303 non-null int64
thal 303 non-null int64
target 303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.3 KB
plt.figure(figsize=(10,10))
sns.heatmap(df.corr(), annot=True)
plt.show()
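If you want the numbers behind the heatmap, a small sketch that ranks each feature by its correlation with target:
# rank features by absolute correlation with the target
corr_with_target = df.corr()['target'].drop('target')
print(corr_with_target.reindex(corr_with_target.abs().sort_values(ascending=False).index))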
#age analysis
df.age.value_counts()[:10]
58    19
57    17
54    16
59    14
52    13
51    12
62    11
44    11
60    11
56    11
Name: age, dtype: int64
Now let's visualize these counts with a barplot.
sns.barplot(x=df.age.value_counts()[:10].index, y=df.age.value_counts()[:10].values)
plt.xlabel('Age')
plt.ylabel('Age counter')
plt.title('Age Analysis')
plt.show()
df.target.value_counts()
1 165
0 138
Here we can see the counts of the target variable, i.e., how many records do and do not support the presence of heart disease.
countNoDisease = len(df[df.target == 0])
countHaveDisease = len(df[df.target == 1])
print("Percentage of patients that don't have heart disease : {:.2f}%".format((countNoDisease / len(df.target)) * 100))
print("Percentage of patients that have heart disease : {:.2f}%".format((countHaveDisease / len(df.target)) * 100))
Percentage of patients that don't have heart disease : 45.54%
Percentage of patients that have heart disease : 54.46%
countFemale = len(df[df.sex == 0])
countMale = len(df[df.sex == 1])
print("Percentage of Female Patients : {:.2f}%".format((countFemale / len(df.sex)) * 100))
print("Percentage of Male Patients : {:.2f}%".format((countMale / len(df.sex)) * 100))
Percentage of Female Patients : 31.68%
Percentage of Male Patients : 68.32%
young_age = df[(df.age >= 29) & (df.age < 40)]
middle_age = df[(df.age >= 40) & (df.age < 55)]
elderly_age = df[df.age >= 55]
print("Patients of Young age", len(young_age))
print("Patients of middle age", len(middle_age))
print("Patients of elderly age", len(elderly_age))
Patients of Young age 16
Patients of middle age 128
Patients of elderly age 159
colors = ['blue', 'green', 'cyan']
explode = [0.05, 0.05, 0.05]  # small offset so the wedges separate slightly
plt.figure(figsize=(8, 8))
plt.pie([len(young_age), len(middle_age), len(elderly_age)],
        labels=['young age', 'middle age', 'elderly age'],
        colors=colors, explode=explode)
plt.show()
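The same age grouping can be written in one step with pd.cut instead of three boolean filters (a sketch, assuming the same cut points of 40 and 55):
# bin ages into the same three groups in one call
age_groups = pd.cut(df.age, bins=[29, 40, 55, df.age.max() + 1], right=False,
                    labels=['young age', 'middle age', 'elderly age'])
print(age_groups.value_counts())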
#chest pain analysis
df.cp.value_counts()
0    143
2     87
1     50
3     23
Name: cp, dtype: int64
df.target.unique()
array([1, 0], dtype=int64)
sns.countplot(x=df.target)
plt.xlabel('Target')
plt.ylabel('Count')
plt.title('Target 1 & 0')
plt.show()
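Before building the model, it can also help to cross-tabulate a feature against the target, e.g. disease frequency by sex (a quick sketch using pandas):
# fraction of records with/without disease within each sex group
print(pd.crosstab(df.sex, df.target, normalize='index'))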
Building the model
X_data = df.drop(['target'], axis=1)
y = df.target.values
X_train, X_test, y_train, y_test = train_test_split(X_data, y, test_size=0.2, random_state=0)
# SHAPE OF TRAINING AND TESTING DATA
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(242, 13)
(61, 13)
(242,)
(61,)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC,SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
# Importing functions to get the model fitting for the data
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score,confusion_matrix
from sklearn.model_selection import cross_val_score
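One optional tweak worth flagging before fitting: with only 303 rows, passing stratify=y to train_test_split keeps the 54/46 class balance identical in both subsets (a sketch; the results below were produced without it):
# alternative split that preserves the class balance in train and test
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X_data, y, test_size=0.2, random_state=0, stratify=y)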
# Fitting all the models at the same time using a 'for' loop
models = [LogisticRegression(multi_class="multinomial", solver="newton-cg"),
          LinearSVC(),
          SVC(kernel='rbf', gamma="auto"),
          KNeighborsClassifier(n_neighbors=10, metric="euclidean"),
          GaussianNB(),
          DecisionTreeClassifier(criterion="gini", max_depth=10),
          RandomForestClassifier(n_estimators=100),
          GradientBoostingClassifier()
          ]
model_names = ['LogisticRegression',
               'LinearSVM',
               'rbfSVM',
               'KNearestNeighbors',
               'GaussianNB',
               'DecisionTree',
               'RandomForestClassifier',
               'GradientBoostingClassifier',
               ]
acc = []
for model in range(len(models)):
    classification_model = models[model]
    classification_model.fit(X_train, y_train)
    y_pred = classification_model.predict(X_test)
    acc.append(accuracy_score(y_test, y_pred))
    print("\n\n", "confusion matrix of", model_names[model], "is", "\n", confusion_matrix(y_test, y_pred))
d = {'Modelling Algorithm': model_names, 'Accuracy': acc}
acc_table = pd.DataFrame(d)
acc_table
confusion matrix of LogisticRegression is
[[21 6]
[ 3 31]]
confusion matrix of LinearSVM is
[[ 2 25]
[ 0 34]]
confusion matrix of rbfSVM is
[[ 2 25]
[ 0 34]]
confusion matrix of KNearestNeighbors is
[[19 8]
[13 21]]
confusion matrix of GaussianNB is
[[21 6]
[ 3 31]]
confusion matrix of DecisionTree is
[[22 5]
[ 8 26]]
confusion matrix of RandomForestClassifier is
[[21 6]
[ 4 30]]
confusion matrix of GradientBoostingClassifier is
[[21 6]
[ 6 28]]
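The near-degenerate confusion matrices for Linear SVM and rbf SVM are typical of unscaled inputs: SVMs and KNN are distance-based, so large-range features like chol and trestbps dominate. A sketch of one way to check this, wrapping a model in a scaling pipeline:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# standardize features before the distance-based classifier
scaled_svc = make_pipeline(StandardScaler(), SVC(kernel='rbf', gamma='auto'))
scaled_svc.fit(X_train, y_train)
print(accuracy_score(y_test, scaled_svc.predict(X_test)))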
sns.barplot(y='Modelling Algorithm', x='Accuracy', data=acc_table)
plt.show()
# Finding 10-fold cross validation scores for all the models at the same time using a 'for' loop
models = [LogisticRegression(multi_class="multinomial", solver="newton-cg"),
          LinearSVC(),
          SVC(kernel='rbf', gamma="auto"),
          KNeighborsClassifier(n_neighbors=10, metric="euclidean"),
          GaussianNB(),
          DecisionTreeClassifier(criterion="gini", max_depth=10),
          RandomForestClassifier(n_estimators=100),
          GradientBoostingClassifier()
          ]
model_names = ['LogisticRegression',
               'LinearSVM',
               'rbfSVM',
               'KNearestNeighbors',
               'GaussianNB',
               'DecisionTree',
               'RandomForestClassifier',
               'GradientBoostingClassifier',
               ]
cvs = []
for model in range(len(models)):
    classification_model = models[model]
    clf = classification_model.fit(X_train, y_train)
    # note: the 10-fold cross validation here is run on the held-out test split
    scores = cross_val_score(clf, X_test, y_test, cv=10)
    cvs.append(scores.mean())
    print(model_names[model], "10 fold cross validation score ==", scores.mean(), "\n\n")
LogisticRegression 10 fold cross validation score == 0.8280952380952382
LinearSVM 10 fold cross validation score == 0.6976190476190476
rbfSVM 10 fold cross validation score == 0.5585714285714285
KNearestNeighbors 10 fold cross validation score == 0.6742857142857142
GaussianNB 10 fold cross validation score == 0.7985714285714285
DecisionTree 10 fold cross validation score == 0.6776190476190476
RandomForestClassifier 10 fold cross validation score == 0.832857142857143
GradientBoostingClassifier 10 fold cross validation score == 0.7685714285714285
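Note that the loop above cross-validates on the 61-row test split only; a more conventional check is 10-fold CV over the full dataset (a sketch for one model):
# 10-fold cross validation over the full dataset instead of the test split
full_scores = cross_val_score(LogisticRegression(multi_class="multinomial", solver="newton-cg"),
                              X_data, y, cv=10)
print(full_scores.mean())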
For prediction we take an observation from the dataset itself and check whether our models work: since this row belongs to the positive class, each model should predict “1”.
new_obs = [[52, 1, 2, 172, 199, 1, 1, 162, 0, 0.5, 2, 0, 3]]
pv = []
for model in range(len(models)):
    classification_model = models[model]
    pv.append(classification_model.predict(new_obs))
d = {'Modelling Algorithm': model_names, 'Predicted value': pv}
pred_table = pd.DataFrame(d)
pred_table
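Most of these models also expose predict_proba, so you can inspect a probability instead of a hard label; a sketch for the random forest (index 6 in models, fitted in the loop above):
# class probabilities for the new observation from the fitted random forest
print(models[6].predict_proba(new_obs))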
All models predict correctly except Linear SVM and KNearestNeighbors.
My Github Profile