Predicting Heart Disease Using ML Algorithms

harish reddy
Dec 28, 2019


In this post I am going to use the following classification algorithms to predict whether a person is likely to have heart disease:

  1. Logistic Regression
  2. Linear SVM
  3. RBF SVM
  4. KNearestNeighbors
  5. GaussianNB
  6. DecisionTree
  7. RandomForestClassifier
  8. GradientBoostingClassifier

Dataset

This dataset contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date. The “goal” field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4.
For reference, see the Heart Disease dataset in the UCI Machine Learning Repository.
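In the file used here the label has already been collapsed to a binary target. If you start from the raw Cleveland file instead, a minimal sketch of that collapse (assuming the raw file follows the UCI layout, with '?' for missing values and the 0-4 label in a column we name num):

import pandas as pd

# Read the raw UCI file: no header row, '?' marks missing values
raw = pd.read_csv('processed.cleveland.data', header=None, na_values='?')
raw.columns = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
               'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']
# Collapse the 0-4 'num' label into a binary target: 0 = absent, 1-4 = present
raw['target'] = (raw['num'] > 0).astype(int)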

Attribute Information:

1. age = age in years

2. sex = (1 = male; 0 = female)

3. cp = chest pain type

4. trestbps = resting blood pressure (in mm Hg on admission to the hospital)

5. chol = serum cholesterol in mg/dl

6. fbs = fasting blood sugar > 120 mg/dl (1 = true; 0 = false)

7. restecg = resting electrocardiographic results

8. thalach = maximum heart rate achieved

9. exang = exercise induced angina (1 = yes; 0 = no)

10. oldpeak = ST depression induced by exercise relative to rest

11. slope = the slope of the peak exercise ST segment

12. ca = number of major vessels (0–3) colored by fluoroscopy

13. thal = 3 = normal; 6 = fixed defect; 7 = reversible defect

14. target = 1 (disease) or 0 (no disease)

Data Exploration

#importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('heart.csv')
df.head()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
age 303 non-null int64
sex 303 non-null int64
cp 303 non-null int64
trestbps 303 non-null int64
chol 303 non-null int64
fbs 303 non-null int64
restecg 303 non-null int64
thalach 303 non-null int64
exang 303 non-null int64
oldpeak 303 non-null float64
slope 303 non-null int64
ca 303 non-null int64
thal 303 non-null int64
target 303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.3 KB
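All 303 rows are complete, so no imputation is needed. Since several of the fields are integer codes, here is a small optional sketch (assuming the column names above) that maps a few of them to readable labels while exploring:

# Decode a few integer-coded fields into labels for easier reading
readable = df.assign(
    sex=df['sex'].map({1: 'male', 0: 'female'}),
    exang=df['exang'].map({1: 'yes', 0: 'no'}),
    fbs=df['fbs'].map({1: 'fbs > 120', 0: 'fbs <= 120'}),
)
readable[['sex', 'exang', 'fbs']].head()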
plt.figure(figsize=(10,10))
sns.heatmap(df.corr(), annot=True)
plt.show()
Correlation map
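Rather than reading correlations off the heatmap, they can also be pulled out directly; a quick sketch:

# Correlation of each feature with the target, weakest to strongest
df.corr()['target'].drop('target').sort_values()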
#age analysis
df.age.value_counts()[:10]
58 19
57 17
54 16
59 14
52 13
51 12
62 11
44 11
60 11
56 11
Name: age, dtype: int64

Now let's plot these counts as a bar chart.

sns.barplot(x= df.age.value_counts()[:10].index, y= df.age.value_counts()[:10].values  )
plt.xlabel('Age')
plt.ylabel("Age counter")
plt.title("Age Analysis")
plt.show()
df.target.value_counts()
1 165
0 138

We can see the counts of the target variable, i.e., how many cases do and do not have heart disease.

countNoDisease = len(df[df.target == 0])
countHaveDisease = len(df[df.target == 1])
print("Percentage of patients that don't have heart disease : {:.2f}%".format((countNoDisease/(len(df.target)))*100))
print("Percentage of patients that have heart disease : {:.2f}%".format((countHaveDisease/(len(df.target)))*100))
Percentage of patients that don't have heart disease : 45.54%
Percentage of patients that have heart disease : 54.46%
countFemale= len(df[df.sex == 0])
countMale = len(df[df.sex == 1])
print(" Percentage of Female Patients : {:.2f}%".format((countFemale/(len(df.sex))*100)))
print("Percentage of Male Patients : {:.2f}%".format((countMale/(len(df.sex))*100)))
Percentage of Female Patients : 31.68%
Percentage of Male Patients : 68.32%
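To see how disease prevalence differs between the sexes, rather than just the sex split itself, a one-line crosstab sketch:

# Share of target = 0 / 1 within each sex (rows sum to 1)
pd.crosstab(df.sex, df.target, normalize='index')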
young_age = df[(df.age>=29)&(df.age<40)]
middle_age = df[(df.age>=40)&(df.age<55)]
elderly_age = df[(df.age>=55)]
print("Patients of Young age", len(young_age))
print("Patients of middle age", len(middle_age))
print("Patients of elderly age", len(elderly_age))
Patients of Young age 16
Patients of middle age 128
Patients of elderly age 159
colors = ['blue', 'green', 'cyan']
explode = [0.05, 0.05, 0.05]
plt.figure(figsize=(8, 8))
plt.pie([len(young_age), len(middle_age), len(elderly_age)],
        labels=['young age', 'middle age', 'elderly age'],
        colors=colors, explode=explode)
plt.show()
#chest pain analysis
df.cp.value_counts()
0 143
2 87
1 50
3 23
Name: cp, dtype: int64
df.target.unique()
array([1, 0], dtype=int64)
sns.countplot(df.target)
plt.xlabel('Target')
plt.ylabel('Count')
plt.title('Target 1 & 0')
plt.show()

Building the model

X_data = df.drop(['target'], axis = 1)
y = df.target.values
X_train, X_test, y_train, y_test = train_test_split(X_data, y, test_size = 0.2, random_state= 0)
# SHAPE OF TRAINING AND TESTING DATA
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(242, 13)
(61, 13)
(242,)
(61,)
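Since the classes are mildly imbalanced (54.46% vs 45.54%), a stratified split is worth knowing about: it preserves the class proportions in both halves. A sketch of that variant (not the split used below):

# Stratified variant: keeps the 1/0 ratio identical in train and test
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X_data, y, test_size=0.2, random_state=0, stratify=y)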
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC,SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier


# Importing functions to get the model fitting for the data
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score,confusion_matrix
from sklearn.model_selection import cross_val_score
# Fitting all the models one after another using a 'for' loop
models = [LogisticRegression(multi_class="multinomial", solver="newton-cg"),
          LinearSVC(),
          SVC(kernel='rbf', gamma="auto"),
          KNeighborsClassifier(n_neighbors=10, metric="euclidean"),
          GaussianNB(),
          DecisionTreeClassifier(criterion="gini", max_depth=10),
          RandomForestClassifier(n_estimators=100),
          GradientBoostingClassifier()]
model_names = ['LogisticRegression',
               'LinearSVM',
               'rbfSVM',
               'KNearestNeighbors',
               'GaussianNB',
               'DecisionTree',
               'RandomForestClassifier',
               'GradientBoostingClassifier']
acc = []

for model in range(len(models)):
    classification_model = models[model]
    classification_model.fit(X_train, y_train)
    y_pred = classification_model.predict(X_test)
    acc.append(accuracy_score(y_test, y_pred))
    print("\n\n", "confusion matrix of", model_names[model], "is", "\n", confusion_matrix(y_test, y_pred))

d = {'Modelling Algorithm': model_names, 'Accuracy': acc}
acc_table = pd.DataFrame(d)
acc_table
confusion matrix of LogisticRegression is
[[21 6]
[ 3 31]]


confusion matrix of LinearSVM is
[[ 2 25]
[ 0 34]]


confusion matrix of rbfSVM is
[[ 2 25]
[ 0 34]]


confusion matrix of KNearestNeighbors is
[[19 8]
[13 21]]


confusion matrix of GaussianNB is
[[21 6]
[ 3 31]]


confusion matrix of DecisionTree is
[[22 5]
[ 8 26]]


confusion matrix of RandomForestClassifier is
[[21 6]
[ 4 30]]


confusion matrix of GradientBoostingClassifier is
[[21 6]
[ 6 28]]
sns.barplot(y='Modelling Algorithm',x='Accuracy',data=acc_table)
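One caveat on the tables above: LinearSVM and rbfSVM push almost every test sample into class 1, a classic symptom of unscaled features, since margin- and distance-based models (SVMs, KNN) are sensitive to the very different ranges of columns like chol and oldpeak. A hedged sketch of how scaling could be folded in with sklearn's StandardScaler and Pipeline (accuracies will differ from those above):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale features, then fit the SVM; the pipeline applies the same scaling at predict time
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel='rbf', gamma='auto'))
scaled_svm.fit(X_train, y_train)
print("rbf SVM with scaling:", accuracy_score(y_test, scaled_svm.predict(X_test)))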
# Finding 10-fold cross validation scores for all the models using a 'for' loop
models = [LogisticRegression(multi_class="multinomial", solver="newton-cg"),
          LinearSVC(),
          SVC(kernel='rbf', gamma="auto"),
          KNeighborsClassifier(n_neighbors=10, metric="euclidean"),
          GaussianNB(),
          DecisionTreeClassifier(criterion="gini", max_depth=10),
          RandomForestClassifier(n_estimators=100),
          GradientBoostingClassifier()]
model_names = ['LogisticRegression',
               'LinearSVM',
               'rbfSVM',
               'KNearestNeighbors',
               'GaussianNB',
               'DecisionTree',
               'RandomForestClassifier',
               'GradientBoostingClassifier']
cvs = []
for model in range(len(models)):
    classification_model = models[model]
    clf = classification_model.fit(X_train, y_train)
    scores = cross_val_score(clf, X_test, y_test, cv=10)
    cvs.append(scores.mean())
    print(model_names[model], " 10 fold cross validation score == ", scores.mean(), " \n\n")
LogisticRegression 10 fold cross validation score == 0.8280952380952382


LinearSVM 10 fold cross validation score == 0.6976190476190476


rbfSVM 10 fold cross validation score == 0.5585714285714285


KNearestNeighbors 10 fold cross validation score == 0.6742857142857142


GaussianNB 10 fold cross validation score == 0.7985714285714285


DecisionTree 10 fold cross validation score == 0.6776190476190476


RandomForestClassifier 10 fold cross validation score == 0.832857142857143


GradientBoostingClassifier 10 fold cross validation score == 0.7685714285714285
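Note that the loop above cross-validates on the 61-row test set alone, so each fold trains on roughly 55 rows. A more common pattern is to cross-validate on the full dataset; a sketch for one model (scores will differ from those above):

# 10-fold CV over all 303 rows instead of only the held-out test set
scores = cross_val_score(RandomForestClassifier(n_estimators=100), X_data, y, cv=10)
print("RandomForest 10-fold CV on full data ==", scores.mean())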

For prediction, we take one observation from the dataset itself and check whether our models work: this record actually has target = 1, so when we feed it to the models we should get "1" as the predicted value.

new_obs = [[52, 1, 2, 172, 199, 1, 1, 162, 0, 0.5, 2, 0, 3]]
pv = []
for model in range(len(models)):
    pv.append(models[model].predict(new_obs))

d = {'Modelling Algorithm': model_names, 'Predicted value': pv}
pred_table = pd.DataFrame(d)
pred_table

All models predict correctly on this observation except LinearSVM and KNearestNeighbors.
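Beyond the hard 0/1 labels, most of these models can also report a confidence via predict_proba (LinearSVC, and SVC as configured here, do not expose it by default); a sketch using the logistic regression fitted above:

# Probability of [class 0, class 1] for the same observation
log_reg = models[0]   # the fitted LogisticRegression from the list above
print(log_reg.predict_proba(new_obs))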





harish reddy

A Statistics postgraduate and data science enthusiast. I love to see how data analysis and ML techniques are solving the world's critical problems.