For the past year at Rock Solid Knowledge, we’ve been running a weekly Machine Learning Club to skill up our developers in this space.
As a software company, we are constantly taking on new projects. Incorporating machine learning into these projects had come up more and more over the past year, so in 2019 we decided to send a few of our developers on a machine learning training course.
While some of us had academic experience with machine learning, none of us had any real commercial experience and our skills were a little rusty.
We ideally wanted to stick with C#. With .NET Core 3, Microsoft brought in their new machine learning framework, ML.NET. The timing seemed ideal, so we decided to go ahead and try out ML.NET.
ML.NET was released back in 2018 and has since seen major API reworks, with a lot of care and attention from Microsoft going into the product. Microsoft clearly want to make machine learning more accessible and get it into the hands of developers, rather than leaving it solely to data scientists.
ML.NET contains all the tools necessary to create and consume machine learning pipelines within the .NET ecosystem. ML.NET includes automated machine learning (AutoML) through its Model Builder feature. Feed Model Builder a dataset, and it will perform training and pick the best algorithm.
Whilst ML.NET is not perfect, it is a good starting point. The quality of the model will vary depending on how complex the structure of the data is; ML.NET is smart enough to one-hot encode categorical data, but complex transforms are out of its reach.
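For readers new to the term, one-hot encoding turns a categorical column into one 0/1 column per category. The snippet below is a minimal plain-Python sketch of the idea, not ML.NET's implementation; the function name and the colour values are made up for illustration.

```python
def one_hot_encode(values):
    """Return (categories, rows), where each row is a 0/1 vector over the sorted categories."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the slot that belongs to this value's category
        rows.append(row)
    return categories, rows


# Hypothetical categorical column, used only for illustration.
colours = ["red", "green", "blue", "red"]
categories, encoded = one_hot_encode(colours)
print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```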
The AutoML space has grown over the past few years, with companies offering similar services, and ML.NET seems to be a good free alternative.
Machine learning roles
In a more traditional structure, data scientists would spend their time:
- Gathering data
- Doing feature correlation
- Training/Creating the model
- Evaluating the model
Software developers would incorporate that model into the product and bring it to production. However, once the machine learning part of the project is mostly over, you’re left with a team of data scientists with not much to do (unless you’ve got a string of machine learning projects ready to go).
There is a clear separation between these two disciplines, data science and development. Each uses different tools, terminology, and languages, but the goal is the same: to solve a problem. This might work fine for a larger organization or for a company whose focus is machine learning, but for a smaller company (like us) it makes the barrier to entry a little too high.
This is similar to how DevOps and development used to be viewed. In the past they were two separate disciplines, but job titles such as "site reliability engineer", which combine the two skill sets, have been rising in popularity thanks to Google, who pioneered the job title.
With the way Microsoft have integrated ML.NET into the framework, and the clear synergy between the data science and software engineering skill sets, I think we’ll start to see the regular development role merge with the data scientist role in the future.
Training our developers
In 2019, our developers were undertaking research and development with ML.NET and Microsoft LUIS. However, we didn’t have any data scientists and knew we needed to understand the machine learning space more deeply.
To continue our exploration, we needed some formal training, so a group of us attended a machine learning training course. This was great as both a refresher and an introduction to what machine learning was.
Although the course was taught in Python (a first for some of our developers), we learnt:
- The basics of each algorithm
- How and why the algorithms work the way they do
- Real-life scenarios to use the algorithms in
We got the chance to ask all the questions we wanted. It really helped to demystify machine learning and put it into a perspective that developers could understand.
The trainer introduced us to a few different resources, like Kaggle, and showed us how and where to find data sets on which we could continue to develop our machine learning skills.
During training breaks, I spent the time going through various competitions on Kaggle.
The way Kaggle works is:
- A user uploads a data set and creates a competition, challenging others to build a model from that data set to achieve a goal
- Users use notebooks to show their findings, theories and models, mostly in Python
- Users upload these notebooks for other people to critique and learn from
As submissions are open to the public, the quality does vary, but the Kaggle competitions and their notebooks were a great way to get inside the minds of data scientists and see how they went about their work. Sometimes the competitions are sponsored by a company, with prizes for the best model. One of the best-known competitions of this kind, although it predates Kaggle, was the Netflix Prize, which offered $1 million for an improved recommendation engine.
After the training
As with all training, if you’re not actively using the skill it can easily be forgotten. The best time to make use of something you’ve just learnt is when it is still fresh in your mind, not three to six months later.
We had a similar situation after our technical writing training. To help retain and practise the skills we learnt there, we as a company started to regularly schedule article writing days, alongside fortnightly sessions dedicated to technical writing and peer review/learning.
This style of continuing training has been great, both in developing confidence in our writing skills and in producing a significant increase in our productivity and writing output.
Machine Learning Club
I got together a few other interested developers from the training and proposed an after-work Machine Learning Club where we would find a Kaggle competition we were interested in and work through it as a team each week. Pooling our combined experience and learning from each other, we would work through each part of the data science process together.
When you get down to it, machine learning and each of its different algorithms are essentially patterns. Slightly different variations of each exist, but the concepts remain constant throughout. It is much like design patterns: a developer who has never studied them might use one and never know that what they did was called a strategy or observer pattern.
If you break down common machine learning problems and think about how as a developer you would solve these problems with code, you might end up with something close in concept to what these algorithms are doing.
So that’s the approach we took: breaking down machine learning problems in ways we could understand as regular developers, and comparing that to the techniques we had learnt from our training.
Project 1: Surviving the Titanic disaster
The first project we chose was based on the Titanic disaster. This Kaggle competition had a data set of all the Titanic passengers, including their age, number of family on board and whether they survived the disaster. The goal of the competition was to create a model that could predict if a person would survive.
We followed the process we were taught in our training: understanding and refining. Machine Learning Club members worked together to:
- Research the Titanic disaster to gain context for the data
- Remove features that were unimportant or had too much missing data
- Build a picture of what the inputs to our model looked like
- Undertake feature correlation to separate the features that correlated highly with survival from those that did not
- Come up with various hypotheses
- Attempt to prove or disprove each hypothesis using the data we had
We found (unsurprisingly, if you have seen the movie) that age, then gender, were the most important factors.
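To give a feel for what feature correlation means in practice, here is a minimal Python sketch that computes the Pearson correlation between one feature and the survival label. It is an illustration under stated assumptions, not the analysis we ran: the rows are made up, and `is_female` and `survived` are hypothetical column names.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up rows for illustration only; these are not real passenger records.
is_female = [1, 1, 0, 0, 1, 0]
survived = [1, 1, 0, 0, 1, 1]

print(round(pearson(is_female, survived), 3))  # 0.707
```

A value close to 1 or -1 suggests the feature moves with the label and is worth keeping; a value near 0 suggests it carries little linear signal on its own.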
We also discovered that we could extract a person’s title from their name, and that the title correlated highly with survival. Although men did not have a high chance of survival, their title could affect it. For example, "Mr. John Smith", with the common title of "Mr", had a much lower chance of surviving than "Dr. John Smith" or "Col. John Smith". We separated titles out into two groups, rare and common.
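Extracting and grouping titles can be sketched with a regular expression. This is a minimal example, assuming the title is the first word in the name that is followed by a full stop; the `COMMON_TITLES` set is a hypothetical choice for illustration and not necessarily the grouping we settled on.

```python
import re

# Assumption: the title is the first alphabetic word followed by a full stop,
# which covers both "Mr. John Smith" and "Smith, Mr. John".
TITLE_PATTERN = re.compile(r"\b([A-Za-z]+)\.")

# Hypothetical set of titles treated as common; everything else counts as rare.
COMMON_TITLES = {"Mr", "Mrs", "Miss", "Master"}


def extract_title(name):
    """Return the title found in the name, or None if there is no title."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None


def title_group(name):
    """Bucket a name into 'common', 'rare' or 'unknown' based on its title."""
    title = extract_title(name)
    if title is None:
        return "unknown"
    return "common" if title in COMMON_TITLES else "rare"


print(title_group("Mr. John Smith"))  # common
print(title_group("Dr. John Smith"))  # rare
```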
In machine learning, you can also improve feature correlation by adding features together. We found this by combining "Number of Siblings" and "Number of Parents" to create a new feature called "Family Count". The combined feature correlated much better with survival rate than either feature did on its own.
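Combining features like this is a one-line transformation per row. The sketch below uses made-up records, and the key names `siblings`, `parents` and `family_count` are assumptions for illustration rather than the data set's own column names.

```python
def add_family_count(passenger):
    """Return a copy of the record with a combined family_count feature added."""
    enriched = dict(passenger)  # copy so the original record is left untouched
    enriched["family_count"] = passenger["siblings"] + passenger["parents"]
    return enriched


# Made-up records for illustration only.
passengers = [
    {"name": "A", "siblings": 1, "parents": 0},
    {"name": "B", "siblings": 0, "parents": 2},
    {"name": "C", "siblings": 0, "parents": 0},
]

for row in map(add_family_count, passengers):
    print(row["name"], row["family_count"])
```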
The project was a good start to our club, and gave us a taste for the next project, refreshed with some new club members.
Project 2: Disaster Tweets
Our next project, Disaster Tweets, saw us deciding if a tweet was about a real disaster or not (in other words, just a regular tweet).
Twitter is usually the first place that disaster news breaks. The idea behind the Kaggle competition this was based on was to create an early warning system for disasters.
This Natural Language Processing (NLP) problem was a great experience for us, and we learnt a lot about how different algorithms fit different data structures/problems better than others. For example, random forest or similar algorithms handle missing data much better than others due to how decision trees are built. With each tweet being so small compared to our final vector, random forest came out on top.
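To show why each tweet is so small compared to the final vector, here is a minimal Python sketch. It assumes a plain bag-of-words encoding, which may differ from the featurisation we actually used, and the tweets are made up.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(tweets):
    """Map every distinct token in the corpus to a column index."""
    tokens = sorted({token for tweet in tweets for token in tokenize(tweet)})
    return {token: i for i, token in enumerate(tokens)}


def vectorize(tweet, vocabulary):
    """Return a count vector with one slot per vocabulary word."""
    counts = Counter(tokenize(tweet))
    vector = [0] * len(vocabulary)
    for token, count in counts.items():
        if token in vocabulary:
            vector[vocabulary[token]] = count
    return vector


# Made-up tweets for illustration only.
tweets = [
    "Forest fire near the town, residents told to evacuate",
    "This new album is fire",
    "Flooding has closed the main road",
]
vocabulary = build_vocabulary(tweets)
vector = vectorize(tweets[1], vocabulary)
non_zero = sum(1 for value in vector if value)
print(f"{non_zero} of {len(vector)} slots are non-zero")
```

Even with three tweets, most slots in each vector are zero. With thousands of tweets the vocabulary grows far faster than the length of any single tweet, so the vectors become very sparse.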
Once we had created our model, we next had to tune it. While other tools might have functionality to do this for you, ML.NET doesn't yet. To speed things up, we created a basic genetic algorithm to optimise our model.
We took the core concepts of a genetic algorithm and applied them to our training configuration.
This process helped us refine our model and get the final few percentage points of improvement in our model's score.
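The general shape of this kind of tuning can be sketched in Python. This is an illustration under stated assumptions, not our actual implementation: it uses the textbook steps of selection, crossover and mutation, the `trees` and `depth` parameters and their ranges are hypothetical, and `fitness` is a toy stand-in for training a model and returning its score.

```python
import random

# Hypothetical search space; the parameter names and ranges are assumptions.
SEARCH_SPACE = {"trees": (10, 200), "depth": (2, 20)}


def random_config(rng):
    """Draw a random training configuration from the search space."""
    return {name: rng.randint(low, high) for name, (low, high) in SEARCH_SPACE.items()}


def fitness(config):
    """Toy stand-in for 'train the model and return its score'; peaks at trees=120, depth=8."""
    return -((config["trees"] - 120) ** 2) / 100 - (config["depth"] - 8) ** 2


def crossover(parent_a, parent_b, rng):
    """Build a child by taking each parameter from one parent at random."""
    return {name: rng.choice((parent_a[name], parent_b[name])) for name in SEARCH_SPACE}


def mutate(config, rng, rate=0.2):
    """With probability `rate`, replace a parameter with a fresh random value."""
    mutated = dict(config)
    for name, (low, high) in SEARCH_SPACE.items():
        if rng.random() < rate:
            mutated[name] = rng.randint(low, high)
    return mutated


def evolve(generations=30, population_size=20, survivors=5, seed=0):
    """Run the genetic algorithm and return the best config plus the best score per generation."""
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(population_size)]
    history = []
    for _ in range(generations):
        # Selection: keep the best-scoring configurations unchanged (elitism).
        population.sort(key=fitness, reverse=True)
        elite = population[:survivors]
        history.append(fitness(elite[0]))
        # Crossover and mutation refill the rest of the population.
        children = [
            mutate(crossover(rng.choice(elite), rng.choice(elite), rng), rng)
            for _ in range(population_size - survivors)
        ]
        population = elite + children
    best = max(population, key=fitness)
    return best, history


best, history = evolve()
print(best, fitness(best))
```

Because the best configurations are carried over unchanged each generation, the best score never gets worse. In a real setup each fitness call means a full training run, so population size and generation count need to be kept modest.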
Project 3: Fake news detection
The next project we’re working on is Fake News Detection. We enjoyed working with NLP and we want to continue in that direction.
Although there are plenty of datasets on fake news, their bias is something to consider. Creating a true model to tell fake news from real news is most likely far beyond our capabilities.
What we want out of this Machine Learning Club project is to bring a model to a production-like environment. The goal is to:
- Create the model
- Develop a simple application using the model
- Continue training and refining the model with new data
So far each of our projects has been limited to Visual Studio or notebooks. A big part of ML.NET is that it includes everything a developer needs to bring a model to production rather than just training/creating a model. We’re excited to learn the capabilities of ML.NET in production during this project.
Machine Learning Club is great and has given our developers the confidence to tackle machine learning problems. We’ve already used these new skills in production:
- We have built an automated customer service bot to detect the intention and sentiment of an email using Microsoft LUIS
- We’ve just released our first machine learning based commercial product, AimHappy: a Zendesk app to detect sentiment in customer messages
We’re always looking for other places to make use of the skill, so get in touch if you have a machine learning project you’d like to discuss.