Death from Swapping
Can Python be used for machine learning with 1,400,000 samples on a regular desktop PC?
The Python programming language showed its worth for rapid model development in a machine learning open innovation challenge. However, it meets its limit when the model and the data set cannot fit in the computer's main memory at the same time, which results in extensive memory swapping and inefficient use of the CPU. I wish to thank Lukas Jannik Bjerrum for the loan of 4 GB of RAM modules from his gaming PC.
The challenge was a refreshing change from the data analysis and modelling of molecular datasets. With the number of samples being much larger than the number of variables, there was plenty of well-structured data available for testing. The real challenge turned out to be memory management with Python.
Python’s Pandas library, with its excellent DataFrame object structure, made it easy to handle the gigabyte-sized training data. Once the text data files had been parsed, saving the DataFrame object to disk and loading it again was fast using the HDFStore utility of Pandas. Numerical Python (NumPy) made matrix and mathematical manipulation of the object's data fast.
Model building was done with the Scikit-Learn machine learning library in Python. It quickly turned out that the random forest model was the most efficient at building predictive classification models from the provided training data. It was superior to linear classification models such as logistic regression, probably because some of the numerical values showed discretization with a nonlinear relationship to the classes. A random forest can handle the nonlinear nature of the data by applying several splitting criteria to the same variable in different tree nodes. For example, one node may split at > 0.5 and another at < 0.75.
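The effect of splitting twice on the same variable can be illustrated with a small sketch. The data below is synthetic and only a stand-in for the blinded challenge data: a single feature where the positive class lies in the band between 0.5 and 0.75, taken from the example thresholds above. The model settings (50 trees, default logistic regression) are assumptions chosen for illustration, not the settings used in the challenge.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the challenge data): one feature in [0, 1),
# with the positive class only inside the band 0.5 < x < 0.75.
rng = np.random.default_rng(0)
X = rng.random((2000, 1))
y = ((X[:, 0] > 0.5) & (X[:, 0] < 0.75)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A linear model can only place a single threshold on the feature.
linear = LogisticRegression().fit(X_train, y_train)

# A forest can split on the same feature in several nodes (> 0.5, then < 0.75).
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

linear_acc = linear.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)
print(f"logistic regression accuracy: {linear_acc:.3f}")
print(f"random forest accuracy:       {forest_acc:.3f}")
```

On data shaped like this, a single linear threshold cannot isolate the band, so the logistic regression should do little better than predicting the majority class, while the forest can recover both boundaries.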
Unfortunately, it was not possible to max out the accuracy by adding enough trees. Beyond a certain number of trees, the ensemble could no longer fit in the computer's 4 GB, later 8 GB, of memory, and swapping became intense, resulting in inefficient use of the CPU.
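How the size of a forest grows with the number of trees can be probed on a small scale before committing to a full run. The following is a minimal sketch on synthetic data, not the script used in the challenge. It relies on the `warm_start` option of Scikit-Learn's `RandomForestClassifier` to add trees to an existing ensemble, and it uses the length of the pickled model as a rough proxy for its memory footprint, which is an assumption: the in-memory size will differ.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the challenge data).
rng = np.random.default_rng(0)
X = rng.random((5000, 10))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

forest = RandomForestClassifier(n_estimators=10, warm_start=True, random_state=0)

sizes = []
for n_trees in (10, 20, 40):
    forest.set_params(n_estimators=n_trees)
    # With warm_start=True, fit() keeps the existing trees and adds the rest.
    forest.fit(X, y)
    # Pickled size in bytes, used here only as a rough proxy for memory use.
    sizes.append(len(pickle.dumps(forest)))
    print(f"{n_trees} trees: about {sizes[-1] // 1024} KiB pickled")
```

Extrapolating such measurements to the full data set gives an early hint of how many trees will fit before swapping sets in.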
This could be partially solved by returning to old-school for-loop programming that handles batches of data instead of a completely vectorized approach. However, the Scikit-Learn implementation of random forests could not easily be split up in memory, and it left traces which were not immediately removed by Python's garbage collector. Python’s lack of explicit memory management became a problem, signaling that for this particular problem it was time to use another programming language (or buy a computer with more RAM).
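The batch-wise for-loop approach can be sketched as follows. This is an illustration under assumptions, not the original script: `predict_in_batches` is a hypothetical helper, the data is synthetic, and the batch size is arbitrary. The explicit `gc.collect()` call only asks Python to collect unreachable objects; it does not guarantee that memory is handed back to the operating system.

```python
import gc

import numpy as np
from sklearn.ensemble import RandomForestClassifier


def predict_in_batches(model, X, batch_size=1000):
    """Hypothetical helper: predict on row slices of X and join the results."""
    parts = []
    for start in range(0, X.shape[0], batch_size):
        # Only one slice of the data is passed through the model at a time.
        parts.append(model.predict(X[start:start + batch_size]))
    return np.concatenate(parts)


# Synthetic stand-in data (not the challenge data).
rng = np.random.default_rng(0)
X_train = rng.random((2000, 5))
y_train = (X_train[:, 0] > 0.5).astype(int)
X_new = rng.random((10500, 5))

model = RandomForestClassifier(n_estimators=20, random_state=0)
model.fit(X_train, y_train)

batched = predict_in_batches(model, X_new, batch_size=1000)

# Drop references that are no longer needed and request a collection pass.
del X_train, y_train
gc.collect()
```

Batching limits the size of the temporary arrays created during prediction, but it does not shrink the fitted forest itself, which matches the experience that the approach only partially solved the problem.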
On the other hand, Python really shone as a platform for rapid model development, and the Pandas and Numerical Python modules showed surprisingly good efficiency with the large datasets. The total modelling and prediction script was fewer than 100 lines of Python code. Moving the modelling to another platform was not within the scope or time frame of the participation.
Overall, it was fun to gain experience with handling large datasets and to test the limits of Python's capabilities. It was a pity that the available hardware limited the model performance. It was also a pity that the variables were blinded and encoded, as it could otherwise have been interesting to experiment with more intelligent and rational feature and variable selection.
Esben Jannik Bjerrum
CEO, Wildcard Pharmaceutical Consulting