In [1]:
# Simple KNN for multi-class classification; key aspects are explained throughout the notebook.
In [2]:
# Importing the required libraries

import seaborn as sns
import pandas
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
In [3]:
# Loading the dataset from Google Drive

data_diabetes = pandas.read_csv('/content/drive/MyDrive/ML&Big_Data/diabetes.csv')
In [4]:
# Summary statistics for each column: count, mean, standard deviation,
# quartile values, and minimum/maximum values

data_diabetes.describe()
Out[4]:
HighBP HighChol CholCheck BMI Smoker Stroke HeartDiseaseorAttack PhysActivity Fruits Veggies ... NoDocbcCost GenHlth MentHlth PhysHlth DiffWalk Sex Age Education Income Diabetes
count 253680.000000 253680.000000 253680.000000 253680.000000 253680.000000 253680.000000 253680.000000 253680.000000 253680.000000 253680.000000 ... 253680.000000 253680.000000 253680.000000 253680.000000 253680.000000 253680.000000 253680.000000 253680.000000 253680.000000 253680.000000
mean 0.429001 0.424121 0.962670 28.382364 0.443169 0.040571 0.094186 0.756544 0.634256 0.811420 ... 0.084177 2.511392 3.184772 4.242081 0.168224 0.440342 8.032119 5.050434 6.053875 0.296921
std 0.494934 0.494210 0.189571 6.608694 0.496761 0.197294 0.292087 0.429169 0.481639 0.391175 ... 0.277654 1.068477 7.412847 8.717951 0.374066 0.496429 3.054220 0.985774 2.071148 0.698160
min 0.000000 0.000000 0.000000 12.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 0.000000
25% 0.000000 0.000000 1.000000 24.000000 0.000000 0.000000 0.000000 1.000000 0.000000 1.000000 ... 0.000000 2.000000 0.000000 0.000000 0.000000 0.000000 6.000000 4.000000 5.000000 0.000000
50% 0.000000 0.000000 1.000000 27.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 ... 0.000000 2.000000 0.000000 0.000000 0.000000 0.000000 8.000000 5.000000 7.000000 0.000000
75% 1.000000 1.000000 1.000000 31.000000 1.000000 0.000000 0.000000 1.000000 1.000000 1.000000 ... 0.000000 3.000000 2.000000 3.000000 0.000000 1.000000 10.000000 6.000000 8.000000 0.000000
max 1.000000 1.000000 1.000000 98.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 5.000000 30.000000 30.000000 1.000000 1.000000 13.000000 6.000000 8.000000 2.000000

8 rows × 22 columns

In [5]:
data_diabetes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 253680 entries, 0 to 253679
Data columns (total 22 columns):
 #   Column                Non-Null Count   Dtype
---  ------                --------------   -----
 0   HighBP                253680 non-null  int64
 1   HighChol              253680 non-null  int64
 2   CholCheck             253680 non-null  int64
 3   BMI                   253680 non-null  int64
 4   Smoker                253680 non-null  int64
 5   Stroke                253680 non-null  int64
 6   HeartDiseaseorAttack  253680 non-null  int64
 7   PhysActivity          253680 non-null  int64
 8   Fruits                253680 non-null  int64
 9   Veggies               253680 non-null  int64
 10  HvyAlcoholConsump     253680 non-null  int64
 11  AnyHealthcare         253680 non-null  int64
 12  NoDocbcCost           253680 non-null  int64
 13  GenHlth               253680 non-null  int64
 14  MentHlth              253680 non-null  int64
 15  PhysHlth              253680 non-null  int64
 16  DiffWalk              253680 non-null  int64
 17  Sex                   253680 non-null  int64
 18  Age                   253680 non-null  int64
 19  Education             253680 non-null  int64
 20  Income                253680 non-null  int64
 21  Diabetes              253680 non-null  int64
dtypes: int64(22)
memory usage: 42.6 MB

Dataframe Analysis of Values¶

We can see that the dataset has no null values, which suggests it has already been pre-processed to some extent. We will still apply feature scaling with StandardScaler so that all features sit on a comparable scale; without it, features with large numeric ranges (such as BMI or Age) would dominate the distance calculations that KNN relies on. With that said, let's move on.

In [6]:
# Unique values of the label column 'Diabetes'
data_diabetes['Diabetes'].unique()
Out[6]:
array([0, 2, 1])

Unique Classification Values¶

The code above shows that the label takes one of three classes. As the analyst, I will assume that 0 = no diabetes, 1 = type 1 diabetic, and 2 = type 2 diabetic. Now we can explain what KNN will do in this multi-class classification model.
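
Before moving on, it is also worth checking how the three classes are distributed, since a strong class imbalance can make plain accuracy look better than the model really is. A minimal check (not part of the original run) could look like this:

In [ ]:
# Class distribution of the label, as counts and as proportions
print(data_diabetes['Diabetes'].value_counts())
print(data_diabetes['Diabetes'].value_counts(normalize=True))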

Objective 1: How KNN Works¶

To my understanding, KNN works by computing the distance between the data points the model has been fit on and each new instance, i.e. a point from the test set that the model has never seen. Using the Euclidean distance to its k nearest training neighbours, it predicts whether the new instance should be classified, in our case, as no diabetes, type 1 diabetic, or type 2 diabetic. After training, whenever an instance from the test set is fed to the model, it returns a prediction of 0, 1, or 2 according to our mapping of features to labels.
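
As a rough illustration of the mechanics only (a hand-rolled sketch with made-up toy data, not part of the fitted pipeline below), a single prediction with k = 3 can be done directly with NumPy:

In [ ]:
import numpy as np
from collections import Counter

# Toy training data: 2 features, labels 0/1/2 (illustrative values only)
X_train = np.array([[0.1, 0.2], [0.2, 0.1], [2.0, 2.1], [2.1, 1.9], [4.0, 4.2], [3.9, 4.1]])
y_train = np.array([0, 0, 1, 1, 2, 2])
x_new = np.array([2.05, 2.0])   # an unseen instance

# Euclidean distance from the new instance to every training point
distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))

# Take the k nearest neighbours and use a majority vote over their labels
k = 3
nearest_labels = y_train[np.argsort(distances)[:k]]
prediction = Counter(nearest_labels).most_common(1)[0][0]
print(prediction)   # expected: 1, since the closest points belong to class 1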

Feature Scaling¶

Now we can run the following code block to scale the features before splitting, and to keep the label column Diabetes in a separate object. StandardScaler standardises each feature to zero mean and unit variance (z = (x - mean) / std), and the separate label series is what we will later compare our predictions against.

In [7]:
# Feature scaling code block

unscaled_features = data_diabetes.drop(columns=['Diabetes'])
label = data_diabetes['Diabetes']

scaler_metric = StandardScaler()
scaled_features = scaler_metric.fit_transform(unscaled_features)
scaled_features  # display the scaled feature array
Out[7]:
array([[ 1.15368814,  1.16525449,  0.19692156, ...,  0.31690008,
        -1.06559465, -1.4744874 ],
       [-0.86678537, -0.85818163, -5.07816412, ..., -0.33793279,
         0.96327159, -2.44013754],
       [ 1.15368814,  1.16525449,  0.19692156, ...,  0.31690008,
        -1.06559465,  0.93963796],
       ...,
       [-0.86678537, -0.85818163,  0.19692156, ..., -1.97501498,
        -0.05116153, -1.95731247],
       [ 1.15368814, -0.85818163,  0.19692156, ..., -0.33793279,
        -0.05116153, -2.44013754],
       [ 1.15368814,  1.16525449,  0.19692156, ...,  0.31690008,
         0.96327159, -1.95731247]])
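
As a quick sanity check (an optional extra cell, not in the original run), each scaled column should now have mean close to 0 and standard deviation close to 1:

In [ ]:
# Sanity check: StandardScaler output should have mean ~0 and std ~1 per feature
print(scaled_features.mean(axis=0).round(3))
print(scaled_features.std(axis=0).round(3))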
In [8]:
test_vals = [0.15, 0.2, 0.25, 0.3,]
neighbor_values = [2, 5, 7, 10]
for value in test_vals:
  features_tr, features_te, labels_tr, labels_te = train_test_split(scaled_features, label, test_size=value, random_state=42)
  for iteration in neighbor_values:
    knnmodel = KNeighborsClassifier(n_neighbors=iteration)
    knnmodel.fit(features_tr, labels_tr)
    pred = knnmodel.predict(features_te)

    pred_accuracy = accuracy_score(labels_te, pred)
    print(f"This is the subset of {value}0% Testing Set")
    print(f"The Accuracy of\t`{iteration}`\tN-Neighbor Value is {pred_accuracy:.2f}")
    confusion = confusion_matrix(labels_te, pred)
    plt.figure(figsize=(6, 4))
    sns.heatmap(confusion, annot=True, cmap='Blues', fmt='g',
                xticklabels=["No Diabetes", "Type 1 Diabetic", "Type 2 Diabetic"],
                yticklabels=["No Diabetes", "Type 1 Diabetic", "Type 2 Diabetic"])
    plt.title("Confusion Matrix For KNN")
    plt.xlabel("Predicted Label")
    plt.ylabel("True Label")
    plt.show()
    print("\n")
This is the subset of 15% Testing Set
The Accuracy of	`2`	N-Neighbor Value is 0.83
[Confusion matrix heatmap]

This is the subset of 15% Testing Set
The Accuracy of	`5`	N-Neighbor Value is 0.83
[Confusion matrix heatmap]

This is the subset of 15% Testing Set
The Accuracy of	`7`	N-Neighbor Value is 0.84
[Confusion matrix heatmap]

This is the subset of 15% Testing Set
The Accuracy of	`10`	N-Neighbor Value is 0.84
[Confusion matrix heatmap]

This is the subset of 20% Testing Set
The Accuracy of	`2`	N-Neighbor Value is 0.83
[Confusion matrix heatmap]

This is the subset of 20% Testing Set
The Accuracy of	`5`	N-Neighbor Value is 0.83
[Confusion matrix heatmap]

This is the subset of 20% Testing Set
The Accuracy of	`7`	N-Neighbor Value is 0.84
[Confusion matrix heatmap]

This is the subset of 20% Testing Set
The Accuracy of	`10`	N-Neighbor Value is 0.84
[Confusion matrix heatmap]

This is the subset of 25% Testing Set
The Accuracy of	`2`	N-Neighbor Value is 0.83
[Confusion matrix heatmap]

This is the subset of 25% Testing Set
The Accuracy of	`5`	N-Neighbor Value is 0.83
[Confusion matrix heatmap]

This is the subset of 25% Testing Set
The Accuracy of	`7`	N-Neighbor Value is 0.83
[Confusion matrix heatmap]

This is the subset of 25% Testing Set
The Accuracy of	`10`	N-Neighbor Value is 0.84
[Confusion matrix heatmap]

This is the subset of 30% Testing Set
The Accuracy of	`2`	N-Neighbor Value is 0.83
[Confusion matrix heatmap]

This is the subset of 30% Testing Set
The Accuracy of	`5`	N-Neighbor Value is 0.83
[Confusion matrix heatmap]

This is the subset of 30% Testing Set
The Accuracy of	`7`	N-Neighbor Value is 0.84
[Confusion matrix heatmap]

This is the subset of 30% Testing Set
The Accuracy of	`10`	N-Neighbor Value is 0.84
[Confusion matrix heatmap]

Optimal K-Neighbors Value & Test Subset¶

After multiple iterations, the best-yielding split was test_size=0.2 (20% test, 80% train) with n_neighbors=7. I chose 7 over 10 because by that point we are already at the elbow of the accuracy curve, and a larger k only adds runtime and resource cost without a meaningful gain. For good measure, the next code block re-runs that configuration on its own, and after that we compare it with the two weighting methods provided by scikit-learn. A rough sketch of the accuracy-versus-k elbow curve follows this paragraph.
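
The elbow can be visualised by plotting test accuracy against k on the 20% split. This sketch is not part of the original run (it reuses the variables defined above, and can take a while on the full dataset), but it shows the kind of curve the choice of k=7 is based on:

In [ ]:
# Hypothetical elbow plot: test accuracy as a function of k on the 20% split
features_tr, features_te, labels_tr, labels_te = train_test_split(scaled_features, label, test_size=0.2, random_state=42)
k_values = [1, 2, 3, 5, 7, 10, 15]
accuracies = []
for k in k_values:
  model = KNeighborsClassifier(n_neighbors=k)
  model.fit(features_tr, labels_tr)
  accuracies.append(accuracy_score(labels_te, model.predict(features_te)))

plt.plot(k_values, accuracies, marker='o')
plt.xlabel("n_neighbors (k)")
plt.ylabel("Test accuracy")
plt.title("Accuracy vs. k (elbow curve)")
plt.show()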

In [9]:
features_tr, features_te, labels_tr, labels_te = train_test_split(scaled_features, label, test_size=0.2, random_state=42)

optimal_k = KNeighborsClassifier(n_neighbors=7)
optimal_k.fit(features_tr, labels_tr)

optimal_pred = optimal_k.predict(features_te)
optimal_accuracy = accuracy_score(labels_te, optimal_pred)

print(f"This is the subset of 0.20% Testing Set")
print(f"The Accuracy of 7 N-Neighbor Value is {optimal_accuracy:.2f}")

confusion = confusion_matrix(labels_te, optimal_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(confusion, annot=True, cmap='Blues', fmt='g',
            xticklabels=["No Diabetes", "Type 1 Diabetic", "Type 2 Diabetic"],
            yticklabels=["No Diabetes", "Type 1 Diabetic", "Type 2 Diabetic"])
plt.title("Confusion Matrix For KNN")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
print("\n")
This is the subset of 20% Testing Set
The Accuracy of 7 N-Neighbor Value is 0.84
[Confusion matrix heatmap]

In [10]:
print(optimal_accuracy)
0.8354225796278777

Weighted KNN For Optimal Value of K-Neighbors and Test Subset¶

Now that we have decided on the optimal k value and the test subset, we can set the weights parameter of the KNN classifier to 'uniform' and 'distance', the two built-in options, to see whether either yields a better prediction than the model above. With 'uniform' (which is also the default, so that run reproduces the unweighted model) every neighbour's vote counts equally, while with 'distance' closer neighbours get proportionally larger votes, weighted by the inverse of their distance.
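
To make the difference concrete, here is an optional sketch (not in the original run) that inspects the 7 nearest training neighbours of a single test point, using the fitted optimal_k model from above, and prints the inverse-distance weights that weights='distance' would effectively assign:

In [ ]:
import numpy as np

# Distances and indices of the 7 nearest training neighbours of the first test point
distances, indices = optimal_k.kneighbors(features_te[:1], n_neighbors=7)
print("Distances:", distances[0].round(3))
print("Neighbour labels:", labels_tr.iloc[indices[0]].values)
# weights='distance' votes with 1/d; the small epsilon only guards against an exact-duplicate neighbour
print("Inverse-distance weights:", (1.0 / np.maximum(distances[0], 1e-12)).round(3))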

In [ ]:
features_tr, features_te, labels_tr, labels_te = train_test_split(scaled_features, label, test_size=0.2, random_state=42)
weights_list = ['uniform', 'distance']
for weight in weights_list:
  weighted_knn = KNeighborsClassifier(n_neighbors=7, weights=weight)
  weighted_knn.fit(features_tr, labels_tr)

  weighted_pred = weighted_knn.predict(features_te)
  weighted_pred_accuracy = accuracy_score(labels_te, weighted_pred)

  print(f"This is the subset of 0.20% {weight} method Testing Set")
  print(f"The Accuracy of 7 N-Neighbor value with {weight} method:  is {weighted_pred_accuracy:.2f}")

  confusion = confusion_matrix(labels_te, weighted_pred)
  plt.figure(figsize=(6, 4))
  sns.heatmap(confusion, annot=True, cmap='Blues', fmt='g',
              xticklabels=["No Diabetes", "Type 1 Diabetic", "Type 2 Diabetic"],
              yticklabels=["No Diabetes", "Type 1 Diabetic", "Type 2 Diabetic"])
  plt.title("Confusion Matrix For KNN")
  plt.xlabel("Predicted Label")
  plt.ylabel("True Label")
  plt.show()
  print("\n")
This is the 20% Testing Set with the uniform weighting method
The Accuracy of 7 N-Neighbor value with the uniform method is 0.84
[Confusion matrix heatmap]

Comparison Between Models (Unweighted & Weighted)¶

In my run there was no major difference between the two weighting methods, with uniform weighting coming out about 0.1 percentage points more accurate than distance weighting. To improve this model, I would revisit the feature engineering process: for example, assembling all feature values into one vector by importing VectorAssembler from pyspark, or dropping columns such as Education that may correlate with the label far less than Age does, which could reduce noise in the distance computation and hopefully lead to a better-generalising model. Changing the classifier itself, to try out other schemes and transformations, is another option. A quick way to rank the columns by their correlation with the label is sketched below.
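
As a starting point for that feature review, a minimal sketch (not part of the original run) could rank the features by their absolute Pearson correlation with the label; this is only a first-pass signal, since the label is categorical:

In [ ]:
# Rank features by absolute correlation with the Diabetes label (illustrative only)
correlations = data_diabetes.corr()['Diabetes'].drop('Diabetes')
print(correlations.abs().sort_values(ascending=False))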

;)

Thanks, I hope you like this notebook.

In [ ]:
!jupyter nbconvert --to html /content/drive/MyDrive/Colab-Notebooks/simple-knn-mcc.ipynb