We vary the K-nearest-neighbors (KNN) hyper-parameters to understand KNN’s performance better.

Number of neighbors

KNN has two hyperparameters that significantly affect its performance. The first is the number of neighbors ($n$) used to determine the classification of a new data point (sample). We vary $n$ between $1$ and $6$ (1).
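
As a minimal sketch of this sweep, assuming scikit-learn's `KNeighborsClassifier` (the synthetic data below merely stands in for the processed CICIDS data; the shapes are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the processed data: each row is a sample,
# each column a feature (shapes are illustrative, not CICIDS's).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n in range(1, 7):  # n = 1, ..., 6
    knn = KNeighborsClassifier(n_neighbors=n)
    knn.fit(X_train, y_train)
    print(f"n={n}: accuracy={knn.score(X_test, y_test):.4f}")
```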

Minkowski distance

The Minkowski distance is a generalization of the Euclidean distance. For two vectors $x$ and $y$ in $d$-dimensional space, the Minkowski distance of order $p \geq 1$ is calculated as $\left(\sum_{i=1}^{d} |x_i - y_i|^{p}\right)^{1/p}$, where $x_i, y_i$ denote the $i$-th components of the vectors. Setting $p = 1$ recovers the Manhattan distance, and $p = 2$ the Euclidean distance.
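
The formula translates directly into a few lines of NumPy; a small sketch, not the code used in our experiments:

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of order p between vectors x and y."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])
print(minkowski(x, y, 1))  # 7.0 (Manhattan distance)
print(minkowski(x, y, 2))  # 5.0 (Euclidean distance)
```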

In the case of the processed CICIDS data (2), each row is a vector, and the features (columns) are the dimensions.

We vary $p$ over $\{1.0, 2.0, 3.0\}$.
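
A hedged sketch of the joint sweep over $n$ and $p$, again on synthetic stand-in data (scikit-learn's `p` parameter selects the Minkowski order):

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same synthetic stand-in as in the earlier sketch.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grid over the number of neighbors n and the Minkowski order p.
for n, p in product(range(1, 7), [1.0, 2.0, 3.0]):
    knn = KNeighborsClassifier(n_neighbors=n, p=p)
    knn.fit(X_train, y_train)
    print(f"n={n}, p={p}: accuracy={knn.score(X_test, y_test):.4f}")
```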

Invariant performance

We find that KNN’s performance hovers around 99% irrespective of the values of $n$ and $p$ (3). This steady performance suggests that the structure of the data favors KNN.

Investigation challenges

Given that we processed the data to make it suitable for machine learning algorithms, we ask, “Is the structure of the original data well suited for KNN, or did the processing make it so?”

Since we use open and free (or cheap) computing resources such as Colab and GitHub, answering this question is not easy. Free platforms limit how much data one can store. Further, more data requires more computation, pushing run times beyond one’s patience.

Thus, future investigation of the raw data’s structure hinges on decreasing the data size without significantly changing that structure.
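
One candidate approach is stratified subsampling, which keeps the class proportions intact while discarding most rows. A sketch under that assumption (the 5% fraction is illustrative, not a value we have validated, and `train_test_split` is used purely as a convenient sampler):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in; for the real study, X and y would be the raw data.
X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

# Keep 5% of the rows while preserving class proportions (stratify=y).
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=0.05, stratify=y, random_state=0
)
print(X_small.shape)  # (5000, 20)
```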

References