Developing a baseline with logistic regression
We use logistic regression as the first classification technique on the processed data to develop a baseline for classification results.
Difficulty converging
Running logistic regression directly on the processed data (1,2), we find that the classification results are poor and that logistic regression has difficulty converging. Scikit-learn's logistic regression solver hits its default cap on the number of (internal) iterations while classification accuracy remains well below 90%.
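A minimal sketch of this first attempt, assuming Scikit-learn's LogisticRegression with its defaults; the synthetic dataset (with deliberately mismatched feature scales) is only a stand-in for the processed data described above, which is not shown here.

    # Sketch: LogisticRegression with default settings on unscaled data.
    # The synthetic data below is a stand-in for the processed dataset.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=5000, n_features=30,
                               n_informative=10, random_state=0)
    X = X * np.logspace(0, 4, X.shape[1])   # mimic features with very different scales

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                        random_state=0)

    clf = LogisticRegression()     # lbfgs solver, max_iter defaults to 100
    clf.fit(X_train, y_train)      # typically emits a ConvergenceWarning on such data
    print(accuracy_score(y_test, clf.predict(X_test)))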
Standardizing data
On standardizing the data, classification accuracy increases to about 92% (3), although Scikit-learn still warns about hitting the default maximum iteration limit.
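Continuing the sketch above, one way to standardize is to place a StandardScaler in front of the same classifier in a pipeline, so the scaler's statistics are fitted on the training split only:

    # Continuing the sketch: standardize features before the same classifier.
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    scaled_clf = make_pipeline(StandardScaler(), LogisticRegression())
    scaled_clf.fit(X_train, y_train)   # may still warn about the iteration limit
    print(accuracy_score(y_test, scaled_clf.predict(X_test)))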
Maximum iterations
To eliminate the warning, we increase the maximum number of iterations from the default of 100 to 10,000. The higher limit silences the warning, but the solver runs for much longer while accuracy tops out at about 93%.
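In the same sketch, the iteration cap is set through the max_iter argument of LogisticRegression:

    # Raising the solver's iteration cap removes the warning at the cost of run time.
    patient_clf = make_pipeline(StandardScaler(),
                                LogisticRegression(max_iter=10000))
    patient_clf.fit(X_train, y_train)
    print(accuracy_score(y_test, patient_clf.predict(X_test)))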
Baseline accuracy
Given the experience above, we choose 92% as the baseline accuracy that other classification techniques would have to beat.