This blog and the associated GitHub repository discuss the use of data science to detect fake job postings. Much of the blog analyzes the Fake Jobs Postings dataset.
Posts
Word2Vec experiments to match the performance of Logistic Regression
Since Word2Vec performed better than FastText, we experiment with the hyperparameters used to generate the embeddings.
More Experimentation with Embeddings
We investigate the poor performance of the composite model with external embeddings.
Experimentation with FastText Embedding
We extend our experiment with the composite token and character model to include an externally generated (non-inline) embedding.
A composite of char and token-based models
We continue our experimentation by creating a neural network model composed of a character-based and a token (word)-based model.
On character-level models
We attempt to classify the data using character-level (rather than word-level) models.
ROC curve comparison of models
Given that the models achieve similar F1 scores, we draw Receiver Operating Characteristic (ROC) curves to understand their performance at low false-positive rates.
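As a reminder of the underlying computation, an ROC curve sweeps the decision threshold over a model's scores and plots the true-positive rate against the false-positive rate at each threshold. A minimal NumPy sketch, using hypothetical scores and labels rather than the project's data:

```python
import numpy as np

def roc_points(scores, labels):
    """Compute (fpr, tpr) pairs by sweeping the decision threshold."""
    order = np.argsort(-scores)        # sort by descending score
    labels = labels[order]
    tps = np.cumsum(labels)            # true positives accepted so far
    fps = np.cumsum(1 - labels)        # false positives accepted so far
    tpr = tps / labels.sum()
    fpr = fps / (1 - labels).sum()
    return fpr, tpr

# hypothetical scores for one model; labels 1 = fraudulent, 0 = real
labels = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.2, 0.6])
fpr, tpr = roc_points(scores, labels)

# area under the curve via the trapezoid rule
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
```

At low false-positive rates, only the leftmost portion of the curve matters, which is why two models with similar F1 scores can still behave quite differently there.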
A Neural Network with No Hidden Layers
Given the encouraging results with the bag-of-words (BOW) plus logistic regression model, we seek to replicate the performance with a simple neural network that mimics logistic regression.
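For reference, a dense output layer with a sigmoid activation and no hidden layers, trained with cross-entropy loss, is mathematically logistic regression. A minimal NumPy sketch of that equivalence, on toy synthetic data rather than the project's data:

```python
import numpy as np

# A "network" with no hidden layers: one dense layer + sigmoid, trained by
# gradient descent on cross-entropy loss. Data below is synthetic.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])                # hypothetical generating weights
y = (X @ true_w + rng.normal(scale=0.1, size=200) > 0).astype(float)

w, b, lr = np.zeros(3), 0.0, 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))         # sigmoid activation
    w -= lr * (X.T @ (p - y)) / len(y)             # cross-entropy gradient
    b -= lr * np.mean(p - y)

accuracy = np.mean(((X @ w + b) > 0).astype(float) == y)
```

The sigmoid output is exactly the logistic function, and the gradient of the cross-entropy loss matches the logistic regression gradient term for term, so the trained layer recovers the same decision boundary.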
Returning to Logistic Regression
We return to a logistic regression model to understand which job description words influence a job posting’s designation as fraudulent.
Ensembles of other models
Finally, we create ensembles from some of the other models we have built in this project.
A Transformer model from scratch
We create and experiment with a Transformer model and compare results with the fully-connected neural network and LSTM models.
LSTM with a pre-trained embedding layer
We experiment with a pre-trained word-embedding layer as part of an LSTM model.
LSTM with an embedding layer
LSTM is a specialized form of neural network and a classic technique for processing word sequences. Given that the job descriptions are just word sequences, we experiment with LSTM for classification.
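For background, an LSTM cell consumes one token embedding at a time, carrying a hidden state and a cell state across the sequence; gates decide what to keep, forget, and emit. A minimal NumPy sketch of a single cell, with hypothetical dimensions and random weights:

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step: gates computed from input x and previous state h."""
    z = W @ x + U @ h + b              # stacked pre-activations for all gates
    H = len(h)
    i = 1 / (1 + np.exp(-z[:H]))       # input gate
    f = 1 / (1 + np.exp(-z[H:2*H]))    # forget gate
    o = 1 / (1 + np.exp(-z[2*H:3*H]))  # output gate
    g = np.tanh(z[3*H:])               # candidate cell update
    c_new = f * c + i * g              # blend old memory with new candidate
    h_new = o * np.tanh(c_new)         # expose a gated view of the memory
    return h_new, c_new

# run a toy "sentence" of 5 embedded tokens through the cell
rng = np.random.default_rng(1)
D, H = 4, 3                            # embedding dim, hidden dim (hypothetical)
W = rng.normal(scale=0.1, size=(4*H, D))
U = rng.normal(scale=0.1, size=(4*H, H))
b = np.zeros(4*H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):      # stand-ins for token embeddings
    h, c = lstm_step(x, h, c, W, U, b)
```

The forget gate is what lets the cell retain signal across long word sequences, which is why LSTMs were the classic choice for text before Transformers.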
Bag-of-words with Fully-connected Neural Network
We use the text of the job descriptions instead of the presence of individual features to improve the F1 score.
Logistic Regression with Numerical Features
We attempt to understand the data, particularly the parts of the data that contain the most signal, by running logistic regression.
Introduction to the Fake Jobs Detection Project
The Fake Jobs Detection project analyzes the Fake Jobs Postings dataset from Kaggle, a collection of real (label 0) and fraudulent (label 1) job postings. The project attempts to understand the characteristics of various techniques for separating real job postings from fraudulent ones.