# Simple KNN For Multi-Class Classification, rendered as a notebook with key aspects explained along the way.
# Importing the required libraries
import seaborn as sns
import pandas
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
# Loading the dataset from Google Drive
data_diabetes = pandas.read_csv('/content/drive/MyDrive/ML&Big_Data/diabetes.csv')
# Describing the dataframe: count, mean, standard deviation,
# quartile values, and the minimum/maximum of each column
data_diabetes.describe()
 | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | Veggies | ... | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income | Diabetes
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 253680.000000 | 253680.000000 | 253680.000000 | 253680.000000 | 253680.000000 | 253680.000000 | 253680.000000 | 253680.000000 | 253680.000000 | 253680.000000 | ... | 253680.000000 | 253680.000000 | 253680.000000 | 253680.000000 | 253680.000000 | 253680.000000 | 253680.000000 | 253680.000000 | 253680.000000 | 253680.000000 |
mean | 0.429001 | 0.424121 | 0.962670 | 28.382364 | 0.443169 | 0.040571 | 0.094186 | 0.756544 | 0.634256 | 0.811420 | ... | 0.084177 | 2.511392 | 3.184772 | 4.242081 | 0.168224 | 0.440342 | 8.032119 | 5.050434 | 6.053875 | 0.296921 |
std | 0.494934 | 0.494210 | 0.189571 | 6.608694 | 0.496761 | 0.197294 | 0.292087 | 0.429169 | 0.481639 | 0.391175 | ... | 0.277654 | 1.068477 | 7.412847 | 8.717951 | 0.374066 | 0.496429 | 3.054220 | 0.985774 | 2.071148 | 0.698160 |
min | 0.000000 | 0.000000 | 0.000000 | 12.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 |
25% | 0.000000 | 0.000000 | 1.000000 | 24.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | ... | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6.000000 | 4.000000 | 5.000000 | 0.000000 |
50% | 0.000000 | 0.000000 | 1.000000 | 27.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 8.000000 | 5.000000 | 7.000000 | 0.000000 |
75% | 1.000000 | 1.000000 | 1.000000 | 31.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 0.000000 | 3.000000 | 2.000000 | 3.000000 | 0.000000 | 1.000000 | 10.000000 | 6.000000 | 8.000000 | 0.000000 |
max | 1.000000 | 1.000000 | 1.000000 | 98.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 1.000000 | 5.000000 | 30.000000 | 30.000000 | 1.000000 | 1.000000 | 13.000000 | 6.000000 | 8.000000 | 2.000000 |
8 rows × 22 columns
data_diabetes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 253680 entries, 0 to 253679
Data columns (total 22 columns):
 #   Column                Non-Null Count   Dtype
---  ------                --------------   -----
 0   HighBP                253680 non-null  int64
 1   HighChol              253680 non-null  int64
 2   CholCheck             253680 non-null  int64
 3   BMI                   253680 non-null  int64
 4   Smoker                253680 non-null  int64
 5   Stroke                253680 non-null  int64
 6   HeartDiseaseorAttack  253680 non-null  int64
 7   PhysActivity          253680 non-null  int64
 8   Fruits                253680 non-null  int64
 9   Veggies               253680 non-null  int64
 10  HvyAlcoholConsump     253680 non-null  int64
 11  AnyHealthcare         253680 non-null  int64
 12  NoDocbcCost           253680 non-null  int64
 13  GenHlth               253680 non-null  int64
 14  MentHlth              253680 non-null  int64
 15  PhysHlth              253680 non-null  int64
 16  DiffWalk              253680 non-null  int64
 17  Sex                   253680 non-null  int64
 18  Age                   253680 non-null  int64
 19  Education             253680 non-null  int64
 20  Income                253680 non-null  int64
 21  Diabetes              253680 non-null  int64
dtypes: int64(22)
memory usage: 42.6 MB
## Dataframe Analysis of Values
We can see that the provided dataset has no null values, which is good: it suggests the data has already been pre-processed to some extent. Even so, we will still perform feature scaling with `StandardScaler` so that the features we use as predictors are on comparable scales and contribute appropriately to the model's final prediction. With that said, let's move on.
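As a quick illustration of what `StandardScaler` does under the hood, here is a minimal sketch on a few made-up BMI values (not taken from the dataset): it subtracts each column's mean and divides by its standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical BMI values, just to illustrate the z-score transform
bmi = np.array([[24.0], [27.0], [31.0], [45.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(bmi)

# Equivalent manual computation: z = (x - mean) / std
manual = (bmi - bmi.mean()) / bmi.std()
print(np.allclose(scaled, manual))  # True
```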
# Inspecting the unique values of the 'Diabetes' label column
data_diabetes['Diabetes'].unique()
array([0, 2, 1])
## Unique Classification Values
The output above shows that the label takes one of three classes. As a working assumption, I'll treat `0` as no-diabetes, `1` as type 1 diabetic, and `2` as type 2 diabetic. Now we can start explaining what KNN will do in this multi-class classification model.
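For readability later on, one could keep this assumed mapping in a small dictionary (a hypothetical helper, not part of the original cells) and check how the classes are distributed:

```python
# Assumed mapping of label codes to class names (author's interpretation)
label_names = {0: "No Diabetes", 1: "Type 1 Diabetic", 2: "Type 2 Diabetic"}

# Class distribution of the label column, with readable names
print(data_diabetes['Diabetes'].value_counts().rename(index=label_names))
```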
## Objective 1: How KNN Works
To my understanding, KNN works by calculating the distance between the data points the model has been fitted on and each new instance, which in our case comes from the test set the model has never seen. By computing the Euclidean distance to its nearest neighbours, it can predict whether a new instance should be classified as no-diabetes, type 1 diabetic, or type 2 diabetic. After training, when a new test-set instance is fed into the features the model generalizes on, the model should assign that instance a predicted value of `0`, `1`, or `2`, according to our mapping of features to labels.
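To make that concrete, here is a minimal from-scratch sketch of the idea (illustrative only; the notebook itself relies on sklearn's `KNeighborsClassifier`, and `knn_predict` and the toy data below are hypothetical): compute the Euclidean distance from a query point to every training point, take the k closest, and predict by majority vote.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=7):
    # Euclidean distance from the query point to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k nearest neighbours
    nearest = np.argsort(distances)[:k]
    # Majority vote among their labels (0, 1, or 2 in our case)
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny made-up example with two features
X_train = np.array([[0.1, 0.2], [0.2, 0.1], [2.0, 2.1], [2.1, 2.0]])
y_train = np.array([0, 0, 2, 2])
print(knn_predict(X_train, y_train, np.array([0.15, 0.15]), k=3))  # -> 0
```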
## Feature Scaling
Now, we can run the following code block to scale the features before splitting, and to store the values of the label column `Diabetes` in a separate object. That label object is what we later compare our predictions against.
# Feature scaling code block
unscaled_features = data_diabetes.drop(columns=['Diabetes'])
label = data_diabetes['Diabetes']
scaler_metric = StandardScaler()
scaled_features = scaler_metric.fit_transform(unscaled_features)
scaled_features.view()
array([[ 1.15368814,  1.16525449,  0.19692156, ...,  0.31690008, -1.06559465, -1.4744874 ],
       [-0.86678537, -0.85818163, -5.07816412, ..., -0.33793279,  0.96327159, -2.44013754],
       [ 1.15368814,  1.16525449,  0.19692156, ...,  0.31690008, -1.06559465,  0.93963796],
       ...,
       [-0.86678537, -0.85818163,  0.19692156, ..., -1.97501498, -0.05116153, -1.95731247],
       [ 1.15368814, -0.85818163,  0.19692156, ..., -0.33793279, -0.05116153, -2.44013754],
       [ 1.15368814,  1.16525449,  0.19692156, ...,  0.31690008,  0.96327159, -1.95731247]])
# Grid search over test-set sizes and neighbor counts
test_vals = [0.15, 0.2, 0.25, 0.3]
neighbor_values = [2, 5, 7, 10]
for value in test_vals:
    features_tr, features_te, labels_tr, labels_te = train_test_split(scaled_features, label, test_size=value, random_state=42)
    for iteration in neighbor_values:
        knnmodel = KNeighborsClassifier(n_neighbors=iteration)
        knnmodel.fit(features_tr, labels_tr)
        pred = knnmodel.predict(features_te)
        pred_accuracy = accuracy_score(labels_te, pred)
        print(f"This is the subset of {value:.0%} Testing Set")
        print(f"The Accuracy of\t`{iteration}`\tN-Neighbor Value is {pred_accuracy:.2f}")
        confusion = confusion_matrix(labels_te, pred)
        plt.figure(figsize=(6, 4))
        sns.heatmap(confusion, annot=True, cmap='Blues', fmt='g',
                    xticklabels=["No Diabetes", "Type 1 Diabetic", "Type 2 Diabetic"],
                    yticklabels=["No Diabetes", "Type 1 Diabetic", "Type 2 Diabetic"])
        plt.title("Confusion Matrix For KNN")
        plt.xlabel("Predicted Label")
        plt.ylabel("True Label")
        plt.show()
        print("\n")
This is the subset of 15% Testing Set The Accuracy of `2` N-Neighbor Value is 0.83
This is the subset of 15% Testing Set The Accuracy of `5` N-Neighbor Value is 0.83
This is the subset of 15% Testing Set The Accuracy of `7` N-Neighbor Value is 0.84
This is the subset of 15% Testing Set The Accuracy of `10` N-Neighbor Value is 0.84
This is the subset of 20% Testing Set The Accuracy of `2` N-Neighbor Value is 0.83
This is the subset of 20% Testing Set The Accuracy of `5` N-Neighbor Value is 0.83
This is the subset of 20% Testing Set The Accuracy of `7` N-Neighbor Value is 0.84
This is the subset of 20% Testing Set The Accuracy of `10` N-Neighbor Value is 0.84
This is the subset of 25% Testing Set The Accuracy of `2` N-Neighbor Value is 0.83
This is the subset of 25% Testing Set The Accuracy of `5` N-Neighbor Value is 0.83
This is the subset of 25% Testing Set The Accuracy of `7` N-Neighbor Value is 0.83
This is the subset of 25% Testing Set The Accuracy of `10` N-Neighbor Value is 0.84
This is the subset of 30% Testing Set The Accuracy of `2` N-Neighbor Value is 0.83
This is the subset of 30% Testing Set The Accuracy of `5` N-Neighbor Value is 0.83
This is the subset of 30% Testing Set The Accuracy of `7` N-Neighbor Value is 0.84
This is the subset of 30% Testing Set The Accuracy of `10` N-Neighbor Value is 0.84
## Optimal K-Neighbors Value & Test Subset
After multiple iterations, we determined that the best-performing split was a `test_size` of 20% (leaving 80% for training), with `n_neighbors` = 7. I've chosen 7 instead of 10 because we're already at the elbow of the curve, and pushing further just wastes runtime and resources. For good measure, I've run this configuration again with its own plot below, and then compared it against the two weighting methods provided by the `sklearn` library.
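Before that, as an optional check, the elbow itself can be visualized by scanning a range of k values on the fixed 20% split. This is a minimal sketch reusing the variables defined earlier; the range of k values is an arbitrary choice, not something tuned in this notebook.

```python
# Accuracy vs. number of neighbours on the 20% test split (illustrative elbow plot)
features_tr, features_te, labels_tr, labels_te = train_test_split(
    scaled_features, label, test_size=0.2, random_state=42)

k_range = range(1, 16)  # hypothetical range of k values to scan
accuracies = []
for k in k_range:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(features_tr, labels_tr)
    accuracies.append(accuracy_score(labels_te, model.predict(features_te)))

plt.plot(list(k_range), accuracies, marker='o')
plt.xlabel("n_neighbors (k)")
plt.ylabel("Test accuracy")
plt.title("Accuracy vs. k (elbow check)")
plt.show()
```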
features_tr, features_te, labels_tr, labels_te = train_test_split(scaled_features, label, test_size=0.2, random_state=42)
optimal_k = KNeighborsClassifier(n_neighbors=7)
optimal_k.fit(features_tr, labels_tr)
optimal_pred = optimal_k.predict(features_te)
optimal_accuracy = accuracy_score(labels_te, optimal_pred)
print(f"This is the subset of 0.20% Testing Set")
print(f"The Accuracy of 7 N-Neighbor Value is {optimal_accuracy:.2f}")
confusion = confusion_matrix(labels_te, optimal_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(confusion, annot=True, cmap='Blues', fmt='g',
xticklabels=["No Diabetes", "Type 1 Diabetic", "Type 2 Diabetic"],
yticklabels=["No Diabetes", "Type 1 Diabetic", "Type 2 Diabetic"])
plt.title("Confusion Matrix For KNN")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
print("\n")
This is the subset of 20% Testing Set The Accuracy of 7 N-Neighbor Value is 0.84
print(optimal_accuracy)
0.8354225796278777
## Weighted KNN For Optimal Value of K-Neighbors and Test Subset
Now that we've decided on the optimal K-Neighbors value and the exact test subset, we can set the `weights` parameter of the KNN classifier to each of its two options, `uniform` and `distance`, to determine whether weighting by distance yields a better prediction than the unweighted (uniform) model.
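For intuition, `weights='distance'` makes closer neighbours count more by weighting each neighbour's vote by the inverse of its distance. The toy calculation below (made-up distances and labels, not from the dataset) shows how the two settings can disagree:

```python
import numpy as np

# Hypothetical distances and labels of the 7 nearest neighbours of one query point
distances = np.array([0.2, 0.25, 0.3, 1.5, 1.6, 1.8, 2.0])
labels    = np.array([0,   0,    0,   2,   2,   2,   2  ])

# 'uniform': every neighbour votes equally -> class 2 wins 4 votes to 3
uniform_vote = np.bincount(labels).argmax()

# 'distance': each vote is weighted by 1/distance -> the three much closer
# class-0 neighbours outweigh the four distant class-2 neighbours
weights = 1.0 / distances
weighted_scores = {c: weights[labels == c].sum() for c in np.unique(labels)}
distance_vote = max(weighted_scores, key=weighted_scores.get)

print(uniform_vote, distance_vote)  # -> 2 0
```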
features_tr, features_te, labels_tr, labels_te = train_test_split(scaled_features, label, test_size=0.2, random_state=42)
weights_list = ['uniform', 'distance']
for weight in weights_list:
    weighted_knn = KNeighborsClassifier(n_neighbors=7, weights=weight)
    weighted_knn.fit(features_tr, labels_tr)
    weighted_pred = weighted_knn.predict(features_te)
    weighted_pred_accuracy = accuracy_score(labels_te, weighted_pred)
    print(f"This is the subset of 20% Testing Set using the {weight} weighting method")
    print(f"The Accuracy of 7 N-Neighbor value with the {weight} method is {weighted_pred_accuracy:.2f}")
    confusion = confusion_matrix(labels_te, weighted_pred)
    plt.figure(figsize=(6, 4))
    sns.heatmap(confusion, annot=True, cmap='Blues', fmt='g',
                xticklabels=["No Diabetes", "Type 1 Diabetic", "Type 2 Diabetic"],
                yticklabels=["No Diabetes", "Type 1 Diabetic", "Type 2 Diabetic"])
    plt.title("Confusion Matrix For KNN")
    plt.xlabel("Predicted Label")
    plt.ylabel("True Label")
    plt.show()
    print("\n")
This is the subset of 20% Testing Set using the uniform weighting method The Accuracy of 7 N-Neighbor value with the uniform method is 0.84
## Comparison Between Models (Unweighted & Weighted)
On my particular run there was no major difference in the output, although the `uniform` weighting method came out about 0.1% higher in accuracy than the `distance` weighting method. My conclusion is that to improve this model I would need to revisit the feature engineering process and explore other implementations: for example, importing `VectorAssembler` from pyspark to assemble all feature values into a single vector, or dropping columns such as `Education`, which may not correlate with the label as strongly as `Age` does. That in turn would reduce model bias, eventually leading to more informed decisions and, hopefully, a better-generalizing model. Changing the classifier itself, to try out newer schemes and transformations, is another option.
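As a quick sketch of one of those ideas, dropping the `Education` column and retraining the tuned model would look roughly like this (a hypothetical follow-up, not something run in this notebook):

```python
# Hypothetical follow-up: drop 'Education', rescale, and retrain the tuned model
reduced_features = data_diabetes.drop(columns=['Diabetes', 'Education'])
reduced_scaled = StandardScaler().fit_transform(reduced_features)

f_tr, f_te, l_tr, l_te = train_test_split(reduced_scaled, label, test_size=0.2, random_state=42)

reduced_knn = KNeighborsClassifier(n_neighbors=7)
reduced_knn.fit(f_tr, l_tr)
print(f"Accuracy without 'Education': {accuracy_score(l_te, reduced_knn.predict(f_te)):.2f}")
```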
;)
Thanks, I hope you like this notebook.
!jupyter nbconvert --to html /content/drive/MyDrive/Colab-Notebooks/simple-knn-mcc.ipynb