On processed data

We process the raw CICIDS2017 data to get it into a form that machine learning algorithms can use.

Combining

The raw data consists of 8 comma-separated values (CSV) files totaling about 2.83M rows (864MB). As a first step, we combine all the CSV files into one (¹,²).
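The combining step can be sketched roughly as follows. This is a minimal illustration that uses only the Python standard library, not necessarily the code behind this post. The function name `combine_csv_files` is hypothetical, and the sketch assumes that all files share the same header once surrounding whitespace is stripped from the column names.

```python
import csv


def combine_csv_files(paths, out_path):
    """Concatenate CSV files that share a header; return the number of data rows.

    Assumption: every file has the same columns in the same order.
    """
    header = None
    n_rows = 0
    with open(out_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        for path in paths:
            with open(path, newline="") as in_file:
                reader = csv.reader(in_file)
                first = next(reader, None)
                if first is None:
                    continue  # empty file, nothing to copy
                # Strip stray whitespace around column names before comparing.
                file_header = [name.strip() for name in first]
                if header is None:
                    header = file_header
                    writer.writerow(header)  # write the header only once
                elif file_header != header:
                    raise ValueError(f"{path} has a different header")
                for row in reader:
                    if row:  # skip blank lines
                        writer.writerow(row)
                        n_rows += 1
    return n_rows
```

Calling `combine_csv_files(list_of_paths, "combined.csv")` would write a single file and return the row count, which is a quick way to confirm that no rows were lost along the way.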

Cleaning

Subsequently, we clean the data.

The raw data is relatively clean to begin with, likely because it was machine-generated. However, two features, “flow bytes per second” and “flow packets per second”, contain non-numeric values (³). We assume that these values stem from a coding error dating back to the creation of the dataset, and we drop both features from the cleaned data.

We may need to revisit the decision to drop these two features in the future.
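One way to locate and drop such features is sketched below. The helper names are hypothetical. The sketch also assumes that "non-numeric" means a value that either fails to parse as a number or parses to infinity or NaN, which is an interpretation rather than something the dataset documents. Rows are represented as dictionaries, as produced by `csv.DictReader`.

```python
import math


def is_finite_number(text):
    """Return True if text parses as a finite float (not inf, not NaN)."""
    try:
        return math.isfinite(float(text))
    except (TypeError, ValueError):
        return False


def non_numeric_columns(rows, label_column="Label"):
    """Collect the names of feature columns holding at least one bad value."""
    bad = set()
    for row in rows:
        for name, value in row.items():
            if name != label_column and not is_finite_number(value):
                bad.add(name)
    return bad


def drop_columns(rows, columns):
    """Return copies of the rows without the given columns."""
    return [{k: v for k, v in row.items() if k not in columns} for row in rows]
```

Dropping whole columns keeps every row, at the cost of losing the two features. Dropping only the affected rows would be the alternative to weigh if the decision is revisited.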

Balancing

An overwhelming majority of the data is from the “BENIGN” class (⁴). While some attack classes such as “DDOS” have a healthy representation, other attack classes such as “Web Attack SQL Injection” are barely represented in the data.

We resample the “BENIGN” class and the attack classes such that the “BENIGN” class has 40,000 entries, and each of the 12 attack classes has 8,000 entries.
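The resampling can be sketched as below. The text above does not say how the rare classes reach 8,000 entries, so this sketch makes an assumption: classes with enough rows are downsampled without replacement, and classes with too few rows are upsampled with replacement. The function name `resample_classes` and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def resample_classes(rows, targets, label_column="Label", seed=0):
    """Resample rows so that each class in targets has exactly the target count.

    Assumption: large classes are sampled without replacement, and
    small classes are sampled with replacement (duplicating rows).
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_column]].append(row)

    balanced = []
    for label, target in targets.items():
        pool = by_class.get(label, [])
        if not pool:
            raise ValueError(f"no rows found for class {label!r}")
        if len(pool) >= target:
            balanced.extend(rng.sample(pool, target))     # downsample
        else:
            balanced.extend(rng.choices(pool, k=target))  # upsample
    rng.shuffle(balanced)
    return balanced
```

With `targets` set to 40,000 for “BENIGN” and 8,000 for each attack class, the result would have the class sizes described above. Note that upsampling with replacement creates exact duplicates, which matters if the data is later split into training and test sets.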

Other processing

To make the features and labels easier to reference, we change the label for each row from a string (for example, “BENIGN”) to a number (for example, 0). Similarly, we change the column names from long strings (for example, “Destination Port” and “Label”) to shorter strings (for example, “X1” and “YY”).
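A minimal sketch of both mappings follows. The helper names are hypothetical. Only the “BENIGN” to 0 mapping is taken from the text; the ordering of the attack classes (alphabetical here) is an assumption.

```python
def build_label_codes(labels, benign="BENIGN"):
    """Map each class name to an integer, with the benign class as 0.

    Assumption: attack classes are numbered in alphabetical order.
    """
    others = sorted(set(labels) - {benign})
    return {name: code for code, name in enumerate([benign] + others)}


def short_column_names(columns, label_column="Label"):
    """Map feature columns to X1, X2, ... in order, and the label column to YY."""
    mapping = {}
    position = 0
    for name in columns:
        if name == label_column:
            mapping[name] = "YY"
        else:
            position += 1
            mapping[name] = f"X{position}"
    return mapping
```

For example, `short_column_names(["Destination Port", "Flow Duration", "Label"])` gives `{"Destination Port": "X1", "Flow Duration": "X2", "Label": "YY"}`. Applying the renaming to the full original column list, before any features are dropped, keeps each short name tied to its original column position.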

Processed data

We end up with 76 features (X1 to X14 and X17 to X78), since the two features dropped above correspond to X15 and X16. We also have 13 numeric labels (0 to 12). The final processed file has 136,000 entries (⁵).
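These counts can be double-checked with a few lines of arithmetic. The snippet below only restates the numbers given in this post; it does not read the processed file.

```python
# Feature names X1..X78 with the two dropped features (X15, X16) removed.
dropped = {15, 16}
feature_names = [f"X{i}" for i in range(1, 79) if i not in dropped]

# Class sizes after balancing: label 0 (benign) plus 12 attack labels.
class_sizes = {0: 40_000, **{code: 8_000 for code in range(1, 13)}}

n_features = len(feature_names)
n_labels = len(class_sizes)
n_entries = sum(class_sizes.values())
print(n_features, n_labels, n_entries)  # 76 13 136000
```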

References