This blog and the associated GitHub repository discuss the use of data science to detect fake job postings. Much of the blog analyzes the Fake Jobs Postings dataset.
Posts
Word2Vec experiments to match the performance of Logistic Regression
Since Word2Vec performed better than FastText, we experiment with the hyperparameters used to generate the embeddings.
More Experimentation with Embeddings
We investigate the poor performance of the composite model with external embeddings.
Experimentation with FastText Embedding
We extend our experiment with the composite token and character model to include an externally generated (non-inline) embedding.
A composite of char and token-based models
We continue our experimentation by creating a neural network model composed of a character-based and a token (word)-based model.
On character-level models
We attempt to classify the data using character-level (rather than word-level) models.
ROC curve comparison of models
Given that the models achieve similar F1 scores, we draw Receiver Operating Characteristic (ROC) curves to understand their performance at low false-positive rates.
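As a reminder of the underlying computation, an ROC curve sweeps the decision threshold over a model's scores and plots the true-positive rate against the false-positive rate at each threshold. A minimal NumPy sketch, using hypothetical scores and labels rather than the project's data:

```python
import numpy as np

def roc_points(scores, labels):
    """Compute (fpr, tpr) pairs by sweeping the decision threshold."""
    order = np.argsort(-scores)        # sort by descending score
    labels = labels[order]
    tps = np.cumsum(labels)            # true positives accepted so far
    fps = np.cumsum(1 - labels)        # false positives accepted so far
    tpr = tps / labels.sum()
    fpr = fps / (1 - labels).sum()
    return fpr, tpr

# hypothetical scores for one model; labels 1 = fraudulent, 0 = real
labels = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.2, 0.6])
fpr, tpr = roc_points(scores, labels)

# area under the curve via the trapezoid rule
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
```

At low false-positive rates, only the leftmost portion of the curve matters, which is why two models with similar F1 scores can still behave quite differently there.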
A Neural Network with No Hidden Layers
Given the encouraging results with the bag-of-words (BOW) plus logistic regression model, we seek to replicate the performance with a simple neural network that mimics logistic regression.
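For reference, a dense output layer with a sigmoid activation and no hidden layers, trained with cross-entropy loss, is mathematically logistic regression. A minimal NumPy sketch of that equivalence, on toy synthetic data rather than the project's data:

```python
import numpy as np

# A "network" with no hidden layers: one dense layer + sigmoid, trained by
# gradient descent on cross-entropy loss. Data below is synthetic.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])                # hypothetical generating weights
y = (X @ true_w + rng.normal(scale=0.1, size=200) > 0).astype(float)

w, b, lr = np.zeros(3), 0.0, 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))         # sigmoid activation
    w -= lr * (X.T @ (p - y)) / len(y)             # cross-entropy gradient
    b -= lr * np.mean(p - y)

accuracy = np.mean(((X @ w + b) > 0).astype(float) == y)
```

The sigmoid output is exactly the logistic function, and the gradient of the cross-entropy loss matches the logistic regression gradient term for term, so the trained layer recovers the same decision boundary.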
Returning to Logistic Regression
We return to a logistic regression model to understand which job description words influence a job posting’s designation as fraudulent.
Ensembles of other models
Finally, we create ensembles from some of the other models we have built in this project.
A Transformer model from scratch
We create and experiment with a Transformer model and compare results with the fully-connected neural network and LSTM models.
LSTM with a pre-trained embedding layer
We experiment with a pre-trained word-embedding layer as part of an LSTM model.
LSTM with an embedding layer
LSTM is a specialized form of neural network and a classic technique for processing word sequences. Given that the job descriptions are just word sequences, we experiment with LSTM for classification.
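For background, an LSTM cell consumes one token embedding at a time, carrying a hidden state and a cell state across the sequence; gates decide what to keep, forget, and emit. A minimal NumPy sketch of a single cell, with hypothetical dimensions and random weights:

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step: gates computed from input x and previous state h."""
    z = W @ x + U @ h + b              # stacked pre-activations for all gates
    H = len(h)
    i = 1 / (1 + np.exp(-z[:H]))       # input gate
    f = 1 / (1 + np.exp(-z[H:2*H]))    # forget gate
    o = 1 / (1 + np.exp(-z[2*H:3*H]))  # output gate
    g = np.tanh(z[3*H:])               # candidate cell update
    c_new = f * c + i * g              # blend old memory with new candidate
    h_new = o * np.tanh(c_new)         # expose a gated view of the memory
    return h_new, c_new

# run a toy "sentence" of 5 embedded tokens through the cell
rng = np.random.default_rng(1)
D, H = 4, 3                            # embedding dim, hidden dim (hypothetical)
W = rng.normal(scale=0.1, size=(4*H, D))
U = rng.normal(scale=0.1, size=(4*H, H))
b = np.zeros(4*H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):      # stand-ins for token embeddings
    h, c = lstm_step(x, h, c, W, U, b)
```

The forget gate is what lets the cell retain signal across long word sequences, which is why LSTMs were the classic choice for text before Transformers.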
Bag-of-words with Fully-connected Neural Network
We use the text of the job descriptions instead of the presence of individual features to improve the F1 score.
Logistic Regression with Numerical Features
We attempt to understand the data, particularly the parts of the data that contain the most signal, by running logistic regression.
Introduction to the Fake Jobs Detection Project
The Fake Jobs Detection project analyzes the Fake Jobs Postings dataset from Kaggle, a collection of real (label 0) and fraudulent (label 1) job postings. The project attempts to understand the characteristics of various techniques for separating real job postings from fraudulent ones.