This blog and the associated GitHub repository discuss data science in information security. Much of the blog analyzes the CICIDS2017 traffic flow dataset, but there are also posts on related topics such as machine learning at scale and general data science techniques.
Posts
Visualizing neural network metrics with TensorBoard
We finish our port of the neural network model to Keras and TensorFlow by incorporating TensorBoard into the Colab notebook.
Data Science Operations at Scale
This is article 5 of a 5-part series on data science operations.
Model Monitoring
This is article 4 of a 5-part series on data science operations.
Model Deployment
This is article 3 of a 5-part series on data science operations.
Model Development and Maintenance
This is article 2 of a 5-part series on data science operations.
Infrastructure for Data Science
This is article 1 of a 5-part series on data science operations. The series was originally written in September 2019 but is being posted in November 2020.
Reimplementing the neural network classifier with Keras
We reimplement the neural network classifier using Keras to develop a feel for the differences between Keras and PyTorch.
Anomaly detection with isolation forest
In this experiment, we use an isolation forest to detect Heartbleed traffic flows.
Retrying principal component analysis and Gaussian mixture models
We create a simpler dataset and apply principal component analysis (PCA) and Gaussian mixture models (GMMs) to it.
Visualizing the principal components
We use principal component analysis (PCA) to extract components and attempt to visualize the data.
Exploring the data using Gaussian mixture models
We use Gaussian mixture models (GMMs) to improve our understanding of the attack class data.
Measuring classification performance
We use two measures for classification performance in this project: accuracy and F1-score.
Varying K nearest neighbors hyper-parameters
We vary the K-nearest-neighbors (KNN) hyper-parameters to understand KNN’s performance better.
Experimenting with K nearest neighbors
We attempt classification using K-nearest-neighbors (KNN) to increase the diversity of techniques used with the dataset.
Using neural networks
We attempt to beat the baseline classification accuracy of logistic regression with a neural network-based classifier.
Developing a baseline with logistic regression
We use logistic regression as the first classification technique on the processed data to develop a baseline for classification results.
On processed data
We process the raw CICIDS2017 data to get it into a form that is usable by machine learning algorithms.
About the raw data
The raw CICIDS2017 data is a summarization of network traffic flows from a test network. The data can be used for training and comparing machine learning models.
Introduction to the IDS analysis project
The IDS analysis project seeks to analyze the CICIDS2017 dataset from the University of New Brunswick (UNB). The CICIDS2017 dataset contains information on network traffic flows. Each traffic flow is labeled as benign or as one of several attack types. The analysis project attempts to understand the characteristics of various techniques that separate benign traffic flows from attack traffic flows.
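As a quick illustration of the two measures named in the "Measuring classification performance" post above, here is a minimal sketch of accuracy and F1-score for a two-class case. The function name, the "benign"/"attack" labels, and the toy predictions are hypothetical and are not taken from the project's notebooks; the posts themselves may compute these measures per attack class or with a library.

```python
def accuracy_and_f1(y_true, y_pred, positive="attack"):
    """Return (accuracy, F1) treating `positive` as the class of interest.

    Hypothetical helper for illustration only; binary simplification.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    pairs = list(zip(y_true, y_pred))
    # Confusion-matrix counts for the positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Accuracy: fraction of flows whose predicted label matches the true label.
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)

    # F1: harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN).
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy labels (made up for this example, not real CICIDS2017 results).
y_true = ["benign", "benign", "benign", "attack", "attack", "benign"]
y_pred = ["benign", "benign", "attack", "attack", "benign", "benign"]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.3f}, F1={f1:.3f}")
```

Reporting both measures is useful when benign flows greatly outnumber attack flows: a classifier can then score high accuracy while still missing most attacks, which the F1-score for the attack class would reveal.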