Predictive Modelling of Breast Cancer Diagnosis using K-Nearest Neighbors Algorithm

In this classification project we will use the K-Nearest Neighbors (KNN) algorithm on the Breast Cancer Wisconsin (Diagnostic) dataset, available from the UCI Machine Learning Repository. The dataset contains 30 features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.

Our task is to develop a predictive model that accurately classifies tumours as benign or malignant given a set of features. The target is a binary label, either M = Malignant or B = Benign. Next we import the necessary libraries as well as the dataset; note that it arrives already split into feature and target dataframes. The features are summarised below (the suffixes 1, 2 and 3 denote the mean, standard error and "worst" value of each measurement, respectively):

Variable Name        Role     Type         Missing Values
ID                   ID       Categorical  no
Diagnosis            Target   Categorical  no
radius1              Feature  Continuous   no
texture1             Feature  Continuous   no
perimeter1           Feature  Continuous   no
area1                Feature  Continuous   no
smoothness1          Feature  Continuous   no
compactness1         Feature  Continuous   no
concavity1           Feature  Continuous   no
concave_points1      Feature  Continuous   no
symmetry1            Feature  Continuous   no
fractal_dimension1   Feature  Continuous   no
radius2              Feature  Continuous   no
texture2             Feature  Continuous   no
perimeter2           Feature  Continuous   no
area2                Feature  Continuous   no
smoothness2          Feature  Continuous   no
compactness2         Feature  Continuous   no
concavity2           Feature  Continuous   no
concave_points2      Feature  Continuous   no
symmetry2            Feature  Continuous   no
fractal_dimension2   Feature  Continuous   no
radius3              Feature  Continuous   no
texture3             Feature  Continuous   no
perimeter3           Feature  Continuous   no
area3                Feature  Continuous   no
smoothness3          Feature  Continuous   no
compactness3         Feature  Continuous   no
concavity3           Feature  Continuous   no
concave_points3      Feature  Continuous   no
symmetry3            Feature  Continuous   no
fractal_dimension3   Feature  Continuous   no
# import necessary libraries
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split, validation_curve
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

warnings.filterwarnings('ignore')

# import data
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
breast_cancer_wisconsin_diagnostic = fetch_ucirepo(id=17) 
  
# data (as pandas dataframes) 
X = breast_cancer_wisconsin_diagnostic.data.features 
y = breast_cancer_wisconsin_diagnostic.data.targets
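
Before displaying any rows, a quick optional sanity check of the dimensions and the class balance can be useful; the counts in the comments below are the well-known figures for this dataset.

# quick sanity check: dataset size and class balance
print(X.shape)                        # (569, 30)
print(y["Diagnosis"].value_counts())  # B: 357, M: 212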

Now let’s display the first five entries of both features and targets to get a better idea of the data.

radius1 texture1 perimeter1 area1 smoothness1 compactness1 concavity1 concave_points1 symmetry1 fractal_dimension1 ... radius3 texture3 perimeter3 area3 smoothness3 compactness3 concavity3 concave_points3 symmetry3 fractal_dimension3
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678

5 rows × 30 columns (the *2 columns are truncated by pandas' default display)

Diagnosis
0 M
1 M
2 M
3 M
4 M

We can see that the feature entries are all numerical whereas the target entries are categorical. We will now start with the data preparation; our first task is to encode the categorical target variable.

Data Preparation

We start by changing the entries of the target dataframe to match the mapping M = 1 and B = 0.

# Encode the labels: M (malignant) -> 1, B (benign) -> 0
y = y.copy()  # work on a copy to avoid modifying the fetched dataframe in place
y["Diagnosis"] = y["Diagnosis"].map({'M': 1, 'B': 0})

# Display the last five rows
y.tail()

Now the targets dataframe contains numerical values. This makes it possible to perform mathematical operations on the data.

Diagnosis
564 1
565 1
566 1
567 1
568 0

Next we will use MinMaxScaler to normalise the input features. We do this because KNN is a distance-based algorithm, and it is clear from the features dataframe that the features span very different ranges: some values lie between 0 and 1 while others lie between 100 and 1000, so the larger-valued features would otherwise dominate the distance calculations. Min-max scaling maps each value x to (x - min) / (max - min), which places every feature in the range 0 to 1.
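
To see the spread concretely, compare the raw ranges of two of the features (a quick optional check):

# raw feature ranges differ by orders of magnitude
print(X[["area1", "smoothness1"]].agg(["min", "max"]))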

# create a scaler object
scaler = MinMaxScaler()
X_normalised = scaler.fit_transform(X)

# create a pandas dataframe
X_normalised_df = pd.DataFrame(X_normalised, columns=X.columns)

All the features now lie within the same 0 to 1 range since we have normalised the data. Next we can start to train the model.
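
One subtlety worth flagging: fitting the scaler on the full dataset before splitting lets the test set's minima and maxima influence the transformation, a mild form of data leakage. A stricter variant (shown here as a sketch, not the pipeline used in this post) fits the scaler on the training split only:

# stricter alternative: split first, then fit the scaler on the training data only
X_train_raw, X_test_raw, y_train_s, y_test_s = train_test_split(X, y, test_size=0.20, random_state=1)

scaler_strict = MinMaxScaler()
X_train_scaled = scaler_strict.fit_transform(X_train_raw)  # learn min/max from training data
X_test_scaled = scaler_strict.transform(X_test_raw)        # apply the same transform to the test data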

Training the Model

We start by splitting the features and targets into training and testing sets, with 20% of the data held out for testing and the remaining 80% used to train the model. After this we create a KNN object and fit it to the training data. However, we still have to choose the number of neighbors, so we optimise this selection by plotting validation curves for a range of candidate values and making a choice based on the results.

# split the data
X_train, X_test, y_train, y_test = train_test_split(X_normalised_df, y, test_size=0.20, random_state=1)

# range of n_neighbors values
param_range = [1, 3, 5, 7, 9, 11, 13, 15]

# 5-fold cross-validated train and test scores for each candidate n_neighbors
train_scores, test_scores = validation_curve(
    KNeighborsClassifier(), X_train, y_train.values.ravel(),
    param_name="n_neighbors", param_range=param_range, cv=5
)

# validation curves
plt.figure(figsize=(10, 6))
plt.plot(param_range, np.mean(train_scores, axis=1), label="Training score")
plt.plot(param_range, np.mean(test_scores, axis=1), label="Cross-validation score")
plt.xlabel("n_neighbors")
plt.ylabel("Score")
plt.title("Validation Curve for KNN")
plt.legend()
plt.grid()
plt.show()

Validation Curves

We see that the cross-validation score curve peaks at n_neighbors = 7, so this will be our selection. Now let’s create the model.

# Build the KNN model
knn = KNeighborsClassifier(n_neighbors=7)

# Fit the model to the training data (ravel gives the 1-D label array scikit-learn expects)
knn.fit(X_train, y_train.values.ravel())

# Create predictions
y_prediction = knn.predict(X_test)
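
As an optional cross-check on the choice of k, the same 5-fold search can be automated with GridSearchCV; a minimal sketch:

# optional: confirm the validation-curve choice with a grid search
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(KNeighborsClassifier(), param_grid={"n_neighbors": param_range}, cv=5)
grid.fit(X_train, y_train.values.ravel())
print(grid.best_params_)  # expected to agree with the curve: {'n_neighbors': 7}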

Model Performance and Evaluation Metrics

Now that the model has been created we can evaluate its performance on the testing data. We will look at the model accuracy, its confusion matrix and then create a classification report.

# accuracy
accuracy = accuracy_score(y_test, y_prediction)

# confusion matrix
cm = confusion_matrix(y_test, y_prediction)

# classification report
report = classification_report(y_test, y_prediction)

print(f"Model Accuracy Score: {accuracy}")

print("\nModel Confusion Matrix:\n")
display(cm)

print("\nClassification Report:\n")
print(report)
Model Accuracy Score: 0.9473684210526315

Model Confusion Matrix:

array([[71,  1],
       [ 5, 37]], dtype=int64)

Classification Report:

              precision    recall  f1-score   support

           0       0.93      0.99      0.96        72
           1       0.97      0.88      0.93        42

    accuracy                           0.95       114
   macro avg       0.95      0.93      0.94       114
weighted avg       0.95      0.95      0.95       114

Our model has an accuracy score of 94.7%, which is strong. Looking at the classification report, note that class 1 corresponds to malignant cases and class 0 to benign cases. From the precision column we see that 93% of the samples predicted as benign were actually benign, and 97% of the samples predicted as malignant were actually malignant.

Furthermore, from the recall column we see that 99% of the actual benign samples were correctly identified, but only 88% of the malignant samples were. The five missed malignant cases visible in the confusion matrix are false negatives, the costliest kind of error in a diagnostic setting, so recall on the malignant class is the figure to watch. The F1 scores show that both classes maintain a reasonable balance between precision and recall, and the support column tells us that 72 of the test samples were benign and 42 were malignant.
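
These figures follow directly from the confusion matrix, whose rows are the true classes (0 = benign, 1 = malignant) and whose columns are the predictions. As a quick check, the per-class precision and recall can be recomputed by hand:

# recompute the per-class metrics from the confusion matrix entries
tn, fp, fn, tp = cm.ravel()                           # 71, 1, 5, 37
print(f"Malignant precision: {tp / (tp + fp):.2f}")   # 37/38 = 0.97
print(f"Malignant recall:    {tp / (tp + fn):.2f}")   # 37/42 = 0.88
print(f"Benign precision:    {tn / (tn + fn):.2f}")   # 71/76 = 0.93
print(f"Benign recall:       {tn / (tn + fp):.2f}")   # 71/72 = 0.99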

In summary, our model predicted the correct class for 95% of the testing samples, which indicates strong performance in distinguishing between “Benign” and “Malignant” breast cancer cases.

Data Visualisation

Since we have many features, we can use principal component analysis (PCA) to reduce the dimensionality of the data to two components, which lets us plot the model's decision boundary.

def plot_decision_boundaries(X, y, model):
    # project the features onto the first two principal components
    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(X)

    # refit the classifier in the 2-D PCA space so its boundary can be drawn
    model.fit(X_pca, y)

    # evaluate the classifier on a grid covering the projected data
    x_min, x_max = X_pca[:, 0].min() - 1, X_pca[:, 0].max() + 1
    y_min, y_max = X_pca[:, 1].min() - 1, X_pca[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # shade the predicted regions and overlay the projected samples
    plt.figure(figsize=(10, 6))
    plt.contourf(xx, yy, Z, alpha=0.4)
    scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, s=20, edgecolor='k')
    plt.xlabel('PCA axis 1')
    plt.ylabel('PCA axis 2')
    plt.title('KNN Decision Boundaries with PCA')
    plt.colorbar()
    legend1 = plt.legend(*scatter.legend_elements(), title="Classes")
    plt.gca().add_artist(legend1)
    plt.show()

plot_decision_boundaries(X_test, y_prediction, knn)
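
Before reading too much into the plot, note that the helper refits the classifier on a two-dimensional PCA projection, so it is worth checking how much of the variance in the features those two components actually capture. A quick optional check (the helper does not expose its PCA object, so we fit a fresh one on the same data):

# fraction of the variance in the test features captured by two components
pca_check = PCA(n_components=2)
pca_check.fit(X_test)
print(pca_check.explained_variance_ratio_.sum())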

Decision Boundaries

The plot above clearly shows the decision boundary of our model, with a clean separation between “Class 0 - Benign” and “Class 1 - Malignant” observations. Keep in mind that the boundary is drawn for a classifier refit in the two-dimensional PCA space, so it approximates rather than reproduces the full 30-feature model. The model is reasonably accurate and can be saved for future use.
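
Saving the model also means saving the scaler, since any new sample must be normalised the same way before prediction. A minimal sketch using joblib (the filenames are illustrative):

import joblib

# persist the fitted scaler and classifier together (filenames are illustrative)
joblib.dump(scaler, "wdbc_minmax_scaler.joblib")
joblib.dump(knn, "wdbc_knn_model.joblib")

# later: reload both and classify new samples without retraining
# scaler = joblib.load("wdbc_minmax_scaler.joblib")
# knn = joblib.load("wdbc_knn_model.joblib")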