Logistic Regression with Numerical Features
We attempt to understand the data, particularly the parts of the data that contain the most signal, by running logistic regression.
Poor results
The data (1) contains both text and numerical features. However, some samples (rows) have missing values for one or both types of fields.
As a first cut, we remove the text fields and only record the presence (or absence) of each sample’s features (2). We split the dataset into a training set and a test set. Subsequently, we train a logistic regression classifier on the training set and test the model with the test set.
The logistic regression classifier produces an accuracy of 94.78% and an F1 score of 5.41% on the test set. At first glance, the accuracy score looks impressive. However, we know that the accuracy score is misleading since the dataset is highly imbalanced, with most job postings labeled as 0 (real jobs). The F1 score is a better measure of the classifier’s performance. At 5.41%, the F1 score is disappointing.
The signal in the text
It appears from the above experiment that the numerical features in the data (including conversion of text features to numerical) hold only a small amount of signal. We conclude that future experimentation should focus on extracting signals from the text fields.