My Machine Learning Toolbox

I have been doing machine learning work at my day job for quite some time now. My work is mainly focused on applying existing ML knowledge to solve problems; I don’t have the background required to contribute to the academic or theoretical side of ML, i.e. things like coming up with new activation functions or other novel techniques. When I started, my only background was a course in artificial intelligence and a course in machine learning, both taken during my undergraduate engineering degree. The coursework involved Java programming for the former, and Matlab/Octave scripting and Weka for the latter.

Since I was aware that the state of the art had changed since my undergraduate days, I enrolled in a couple of free online courses offered by Udacity: “Intro to Machine Learning” (UD120) and “Deep Learning” (UD730). The former was a refresher that covered most, if not all, of the material in my undergraduate ML course. The latter was an introduction to deep learning and the use of TensorFlow. Overall, both courses helped me get up to speed. I have picked up the rest by reading research papers and discussions around the Internet.

Since my work involves applying ML techniques, tools are quite important. Most people who do ML work develop a toolbox of their favourites; I would like to describe some of mine, in no particular order of preference. Most of my work is implemented in the Python programming language, a consequence of its favoured status in the ML community, most likely due to very important foundational libraries (e.g. NumPy) for efficient math operations.

scikit-learn

scikit-learn is a Python library with almost all well-known ML algorithms implemented behind an easy-to-use and consistent programming interface. This is a very common pattern for using the library:

from sklearn.linear_model import LogisticRegression

# X_train/y_train and X_test/y_test are assumed to be prepared elsewhere
model = LogisticRegression()
model.fit(X_train, y_train)
print("Score: %.4f" % (model.score(X_test, y_test),))

Of course, this is a superficial example that leaves out things like data loading and parameter tuning. The value of the library is being able to try out different algorithms quickly without overhauling large sections of your code.

The library also ships with tools to perform many common tasks such as splitting a dataset into training and test sets, feature scaling, and parameter selection.
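
As a rough sketch of how those pieces fit together (assuming X and y are already loaded, and a scikit-learn version with the model_selection module, i.e. 0.18 or later):

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# scale features to zero mean and unit variance, fitting only on training data
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# cross-validated search over the regularization strength C
search = GridSearchCV(LogisticRegression(), {"C": [0.01, 0.1, 1.0, 10.0]})
search.fit(X_train, y_train)
print("Score: %.4f" % search.score(X_test, y_test))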

Keras

I use Keras when I need to build neural networks. It sits as an abstraction on top of TensorFlow and Theano, which makes it easy to switch between the two tensor libraries without having to rewrite code.

What I really like about Keras is that many of the micro-decisions that go into each part of a neural network architecture, like the initialization of weights or the initial parameters of an optimizer, have good defaults already chosen for you by people who are very knowledgeable about state-of-the-art ML research, helping you “fall into the pit of success”. You can, of course, easily override these defaults.
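
For example, assuming the Keras 2 API (in Keras 1 the argument was simply called init), a Dense layer defaults to the glorot_uniform weight initializer, but overriding it is a single argument:

from keras.layers import Dense

# Dense(64) would use the glorot_uniform initializer by default;
# here we swap in He initialization instead
layer = Dense(64, activation='relu', kernel_initializer='he_normal')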

Another great thing is that all (or at least most) of the well-known neural network layers, like convolutional layers and embedding layers, have already been defined. Sure, you could use a tensor library and implement these layers yourself, but in Keras someone has already done the work and tested it. There are a lot of things that can go wrong when building a neural network; screwing up a layer implementation without knowing it is something you really cannot afford.
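
As a minimal sketch (the input shape and layer sizes here are hypothetical, Keras 2 API assumed), stacking a few of these predefined layers into a small image classifier takes only a handful of lines:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
# a predefined convolutional layer: 32 filters of size 3x3
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(10, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])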

Ruby

Ruby?!

I find some tasks to be easier and faster to complete in Ruby. This is probably because I am much more familiar with developing in Ruby than in Python.

My main use case for incorporating Ruby into my workflow is sourcing data from databases using the Sequel gem and then formatting it into an output file format like CSV or JSON Lines.
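
A minimal sketch of what that looks like (the connection string, table, and column names here are hypothetical):

require 'sequel'
require 'csv'

DB = Sequel.connect('postgres://localhost/my_database')

CSV.open('examples.csv', 'w') do |csv|
  csv << %w[id label text]      # header row
  DB[:examples].each do |row|   # each row is a hash keyed by column name
    csv << row.values_at(:id, :label, :text)
  end
end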

I also write a bunch of project-specific utilities, mainly for further processing of files. This is a very common code block that I use:

ARGF.each_line do |line|
  # process each line from every file passed to the script
end
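
One nice property of ARGF is that it falls back to standard input when no filenames are given, so the same script works with both arguments and pipes (process_lines.rb is a made-up name):

$ ruby process_lines.rb input1.txt input2.txt
$ cat input1.txt | ruby process_lines.rb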

kaggle/python

This is a Docker image containing almost every Python tool out there for ML and associated tasks. It is a pretty hefty image, but I’m happy to let someone else take care of installing all the tools and making sure they all play nicely with one another. Some Python libraries require compiling platform-dependent code, so not having to do any of that, and keeping my dev environment less polluted, is another benefit.

Follow Kaggle’s article to learn more about using the image.
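
As a rough example (the mount path and script name here are placeholders), mounting the current directory into the container and running a script inside it looks something like:

$ docker run --rm -it -v $PWD:/tmp/working -w /tmp/working kaggle/python python my_script.py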

caffeinate

This is a command line tool that ships with macOS and prevents the machine from sleeping. Depending on the ML algorithm, you may end up with a script that takes a long time to complete, so time is a resource you are constantly spending. At some point, you are probably going to leave your computer unattended, and it may switch to sleep mode automatically. When this happens, you’ve lost progress, something that has happened to me a number of times.

To use caffeinate, you just let it run your program for you, for example:

$ caffeinate python my_script.py

Your computer will not go to sleep until the Python process finishes. If you already have a running process, you can find its PID and use:

$ caffeinate -w PID
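
If you don’t know the PID offhand, pgrep can look it up by command line (assuming my_script.py is unique enough to match only your process):

$ caffeinate -w $(pgrep -f my_script.py)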

So there you have it, a list of some of my most commonly used tools.