Developing a baseline with logistic regression
We use logistic regression as the first classification technique on the processed data to develop a baseline for classification results.
Difficulty converging
Running logistic regression directly on the processed data (1,2), we find that the classification results are poor and that logistic regression has difficulty converging. Scikit-learn's logistic regression solver hits its default cap on the number of (internal) iterations while classification accuracy remains well below 90%.
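A minimal sketch of this first attempt, assuming Scikit-learn's LogisticRegression with its defaults; the synthetic dataset (with deliberately mismatched feature scales) is only a stand-in for the processed data described above, which is not shown here.

    # Sketch: LogisticRegression with default settings on unscaled data.
    # The synthetic data below is a stand-in for the processed dataset.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=5000, n_features=30,
                               n_informative=10, random_state=0)
    X = X * np.logspace(0, 4, X.shape[1])   # mimic features with very different scales

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                        random_state=0)

    clf = LogisticRegression()     # lbfgs solver, max_iter defaults to 100
    clf.fit(X_train, y_train)      # typically emits a ConvergenceWarning on such data
    print(accuracy_score(y_test, clf.predict(X_test)))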
Standardizing data
On standardizing the data, classification accuracy increases to about 92% (3), although Scikit-learn still warns about hitting the default maximum iteration limit.
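Continuing the sketch above, one way to standardize is to place a StandardScaler in front of the same classifier in a pipeline, so the scaler's statistics are fitted on the training split only:

    # Continuing the sketch: standardize features before the same classifier.
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    scaled_clf = make_pipeline(StandardScaler(), LogisticRegression())
    scaled_clf.fit(X_train, y_train)   # may still warn about the iteration limit
    print(accuracy_score(y_test, scaled_clf.predict(X_test)))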
Maximum iterations
To eliminate the warning, we increase the maximum number of iterations from the default of 100 to 10,000. The higher limit silences the warning, but the solver runs for much longer while accuracy tops out at about 93%.
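In the same sketch, the iteration cap is set through the max_iter argument of LogisticRegression:

    # Raising the solver's iteration cap removes the warning at the cost of run time.
    patient_clf = make_pipeline(StandardScaler(),
                                LogisticRegression(max_iter=10000))
    patient_clf.fit(X_train, y_train)
    print(accuracy_score(y_test, patient_clf.predict(X_test)))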
Baseline accuracy
Given the experience above, we choose 92% as the baseline accuracy that other classification techniques would have to beat.