To improve the F1 score, we switch from using the presence of individual features to using the full text of the job descriptions.

Classifier design

The classifier is a fully-connected neural network with two hidden layers: the first with 100 units and the second with 10 units. The output layer is a single unit with sigmoid activation, trained with binary cross-entropy loss. For regularization, we apply dropout at both hidden layers.
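
As a concrete illustration, here is a minimal sketch of such a network in Keras. The framework, the ReLU activations on the hidden layers, the Adam optimizer, and the dropout rate of 0.5 are assumptions; the description above does not specify them.

    from tensorflow import keras
    from tensorflow.keras import layers

    def build_classifier(input_dim):
        model = keras.Sequential([
            keras.Input(shape=(input_dim,)),
            layers.Dense(100, activation="relu"),   # first hidden layer (ReLU assumed)
            layers.Dropout(0.5),                    # dropout rate assumed
            layers.Dense(10, activation="relu"),    # second hidden layer
            layers.Dropout(0.5),
            layers.Dense(1, activation="sigmoid"),  # single sigmoid output unit
        ])
        model.compile(optimizer="adam",             # optimizer assumed
                      loss="binary_crossentropy",
                      metrics=["accuracy"])
        return model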

The input is a bag-of-words vector, one per sample, produced by Scikit-learn's CountVectorizer (1).
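
To make the pipeline concrete, here is a sketch of the vectorization step on toy stand-ins for the job descriptions; the variable names are hypothetical.

    from sklearn.feature_extraction.text import CountVectorizer

    # Toy stand-ins for the real job descriptions.
    train_descriptions = ["python developer wanted", "work from home, earn fast"]
    test_descriptions = ["senior python engineer wanted"]

    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(train_descriptions)  # learns the vocabulary
    X_test = vectorizer.transform(test_descriptions)        # reuses that vocabulary

    print(X_train.shape)  # (n_samples, vocabulary_size)

fit_transform returns a sparse matrix of word counts, which is what the network consumes as its bag-of-words input.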

Small dataset

We first try the classifier on a small version of the dataset (2). Converting the job descriptions to a bag-of-words and training a neural network on the result is likely to be expensive, so we want some idea of our approach's performance before committing to the full dataset.
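
One way to carve out such a subset, assuming the postings are loaded into a pandas DataFrame; the file name and subset size here are hypothetical.

    import pandas as pd

    df = pd.read_csv("job_postings.csv")        # hypothetical file name
    small = df.sample(n=1000, random_state=42)  # hypothetical subset size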

As it turns out, the classifier reports an F1 score of 0% on the test portion of the small dataset.
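
For context, an F1 score of 0 arises when the classifier produces no correct positive predictions, for example when it predicts the negative class for every sample. A quick check with Scikit-learn's f1_score:

    from sklearn.metrics import f1_score

    y_true = [0, 0, 1, 1]
    y_pred = [0, 0, 0, 0]            # never predicts the positive class
    print(f1_score(y_true, y_pred))  # 0.0: no true positives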

Medium dataset

We hypothesize that the classifier’s poor performance is due to insufficient training data. To test this hypothesis, we create a medium-sized version of the dataset (3) and redo the training and test runs. Encouragingly, the run does not take long and reports an F1 score of 80.95%.

Full dataset

We wonder if using more data - the full dataset (4) - would improve performance further.

Unfortunately, when we attempt to use the full dataset, the Colab instance crashes while running the CountVectorizer. Colab’s error message indicates that the crash is due to excessive memory usage.

We look through CountVectorizer’s documentation and find an option to make it use 8-bit integers rather than the default 64-bit integers. With this change, the training run completes smoothly.
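
The knob in question is CountVectorizer's dtype parameter, which sets the dtype of the returned count matrix. A sketch:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    # int8 cuts per-count storage from 8 bytes to 1. Caveat: any word that
    # appears more than 127 times in a single description would overflow.
    vectorizer = CountVectorizer(dtype=np.int8)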

The classifier returns an F1 score of 88.64%, a noticeable improvement over the performance on the medium dataset.

Adding back numerical features

We wonder if the classifier can be improved by adding some numerical features (those with no text counterpart). The numerical features may have a small amount of additional information.

We create a version of the fully-connected neural network model (5) that stacks the numerical features onto the bag-of-words produced by the CountVectorizer. The resulting classifier achieves an F1 score of 86.04% on the test dataset. Since this F1 score is lower than the previous model's, we conclude that the numerical fields do not add useful information.
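
A sketch of the stacking step, using scipy.sparse.hstack to keep the combined matrix sparse; the toy data and variable names are hypothetical.

    import numpy as np
    from scipy.sparse import csr_matrix, hstack
    from sklearn.feature_extraction.text import CountVectorizer

    texts = ["python developer wanted", "work from home, earn fast"]
    numeric = np.array([[1, 0], [0, 1]])  # toy numerical features, one row per sample

    X_bow = CountVectorizer().fit_transform(texts)
    X_combined = hstack([X_bow, csr_matrix(numeric)], format="csr")
    print(X_combined.shape)  # (2, vocabulary_size + 2)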

References