The topic of how to approach Machine Learning (ML) research projects has come up many times over the past several years in conversations with people across the field and industry. One of the fundamental aspects of ML projects is that they often carry larger risks and unknowns than other projects being undertaken within your company’s engineering department.

While we have made tremendous progress in ML technologies, modelling algorithms, platforms, and libraries, we are still very far from having what I call textbook algorithms that would reduce the risk of implementing a project.

As a result, I’ve created and adapted over time a rough approach for handling these kinds of projects, one that I’ve found leads to good results in practice.

To start, imagine you are starting a new ML project. Let’s say you intend to design an API that automatically extracts the title and the names of the authors of a book from a given input webpage. Here are the stages and steps I would take.

Stage 1: Implement your evaluation pipeline

Contrary to many other approaches I’ve seen that leave this part to the last stage, building the evaluation pipeline is, to me, the first and one of the most important steps you will have to take. Why? Let me take you through the individual steps and the reasons will hopefully become clear.

1.1. Gather and create benchmark datasets

Doing this step makes sure that you have access to at least one dataset that is large enough and covers the cases for which you would like to measure performance. Unless you have access to a good and large in-house dataset, this is going to be one of the crucial steps that you will have to spend a good amount of time on.

Be prepared to check the web for public and open datasets, use crowdsourcing platforms for human labelling, adapt an existing knowledge base to give you annotations, check your application logs and user feedback, and add feedback mechanisms to your other apps to capture datasets.

Any time and money invested here will pay off big, so take this very, very seriously. And don’t opt for partial datasets (e.g. a list of webpages with only book titles labelled in the example above): it will not take you very far and you will not be able to answer simple questions such as what the overall performance is.

Also be sure to version your datasets and benchmarks. You should know which version of the dataset a model was trained on and which version of the test dataset it was evaluated on. Note that it’s better to have your benchmark dataset change at a slower pace than your training dataset, making it easier to understand the impact of a new training dataset, a new model, a new bug fix, and so on.
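
As a minimal sketch of that versioning idea (the manifest file name and the hashing scheme are my own assumptions, not a prescribed tool), you could record which dataset snapshots each model run used:

```python
import hashlib
import json
from pathlib import Path


def dataset_fingerprint(path: str) -> str:
    """Hash a dataset file so you can record exactly which snapshot was used."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()[:12]


def record_run(model_name: str, train_path: str, benchmark_path: str,
               manifest_path: str = "runs_manifest.jsonl") -> None:
    """Append one line per experiment: model plus train/benchmark dataset versions."""
    entry = {
        "model": model_name,
        "train_dataset": {"path": train_path, "version": dataset_fingerprint(train_path)},
        "benchmark_dataset": {"path": benchmark_path, "version": dataset_fingerprint(benchmark_path)},
    }
    with open(manifest_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
```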

1.2. Identify your main evaluation metric

No matter what clever modelling you pull off later on, you need to be able to evaluate all your models on metrics that matter and that are closest to what users would experience if they used your API.

As a rule of thumb, it’s good to have multiple metrics to cover how models behave, but choose one overall metric such that an improvement on it will be evident to your users. That’s the one you need to ace!

In the example above, there are many choices: you might consider classical metrics such as accuracy, precision, recall, AUC, or AUPRC. But remember that you are dealing with a structured output: a book entry will have a title and a list of authors. How would you compute a metric for this? Would you evaluate the model outputs independently or in conjunction? Would you weight them differently? Does weighted averaging of these metrics make sense? How would you decide on the weights? If you improved your metric of choice by 10%, would you be able to translate that into what the new experience is going to be for your typical (possibly non-technical) user?

Answers to these will depend on your particular application, but tackling them head on will make you aware of the kind of modelling approaches you will need to take later on.
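
To make this concrete, here is a minimal sketch of one possible combined metric for the book example; the exact-match rule for titles, the set-based F1 for authors, and the 50/50 weighting are all illustrative assumptions, not the “right” answers:

```python
def title_score(predicted: str, gold: str) -> float:
    """Exact match on the normalised title (one of many possible choices)."""
    return float(predicted.strip().lower() == gold.strip().lower())


def author_f1(predicted: list[str], gold: list[str]) -> float:
    """Set-based F1 over normalised author names."""
    pred = {a.strip().lower() for a in predicted}
    true = {a.strip().lower() for a in gold}
    if not pred or not true:
        return float(pred == true)
    precision = len(pred & true) / len(pred)
    recall = len(pred & true) / len(true)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)


def book_score(pred: dict, gold: dict, title_weight: float = 0.5) -> float:
    """Weighted combination; the weight itself is an application-specific decision."""
    return (title_weight * title_score(pred["title"], gold["title"])
            + (1 - title_weight) * author_f1(pred["authors"], gold["authors"]))
```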

1.3. Implement a dummy random algorithm

This will be your first model! Pure random predictions!

While it might not be intuitive, doing this step gives you the first trivial algorithm to evaluate. Another side-effect is that it provides you with a rough sketch of the prediction API: no matter the algorithm underneath, you need to figure out what the necessary inputs and outputs are, i.e. your run-time API.
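
For the book example, a dummy predictor and the run-time contract it forces you to pin down could look something like this (the signature and the placeholder vocabulary are purely illustrative):

```python
import random


# Hypothetical run-time contract: a webpage's HTML in, a structured book entry out.
def predict(page_html: str) -> dict:
    """Dummy model: ignores the input and returns random tokens as 'predictions'."""
    words = ["alpha", "beta", "gamma", "delta"]  # placeholder vocabulary
    return {
        "title": " ".join(random.choices(words, k=random.randint(1, 4))),
        "authors": [" ".join(random.choices(words, k=2))
                    for _ in range(random.randint(0, 3))],
    }
```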

1.4. Evaluate your trivial algorithms

Pass the test dataset you’ve made through the random algorithm and measure the performance. While you shouldn’t expect miracles here, be prepared to see skewed results showing up, indicating peculiarities of your test set. Make multiple runs and analyse how the performance varies.

Now that you’ve tested your first model, test other trivial models, such as predicting a fixed class for all test instances, not predicting anything at all, etc. Many of these will tell you about characteristics of your dataset as well as your metrics of choice. If your dataset is imbalanced (95% negative vs 5% positive) and you’re measuring accuracy, these trivial algorithms will show you that you might want to switch to a metric that takes the class imbalance into account or that assigns different penalties to different types of mistakes.
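
For a plain (non-structured) classification task, scikit-learn’s DummyClassifier makes these trivial baselines one-liners; the 95/5 split below just mirrors the imbalance mentioned above:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Illustrative imbalanced labels: 95% negative, 5% positive.
rng = np.random.default_rng(0)
y = rng.choice([0, 1], size=1000, p=[0.95, 0.05])
X = rng.normal(size=(1000, 4))  # features are irrelevant to a dummy model

for strategy in ["uniform", "most_frequent"]:
    clf = DummyClassifier(strategy=strategy).fit(X, y)
    pred = clf.predict(X)
    print(strategy,
          "accuracy:", round(accuracy_score(y, pred), 3),
          "f1:", round(f1_score(y, pred, zero_division=0), 3))
# "most_frequent" scores roughly 0.95 accuracy while never predicting the positive
# class, which is exactly the kind of red flag these trivial baselines surface.
```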

1.5. Bonus steps

If you have the time and means to do this without a lot of effort, consider:

  • Setting up a repository for collecting and displaying the evaluation results of different models (this could be as simple as a backed-up CSV file with a new line appended to it; a minimal sketch follows after this list)
  • Triggering automatic evaluation runs on commits to the main code repository, or running regular nightly jobs if the evaluation takes a lot of time
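
A minimal sketch of such a results “repository” as an append-only CSV (the column layout is just an assumption):

```python
import csv
from datetime import datetime, timezone


def log_result(model_name: str, dataset_version: str, metric_name: str,
               value: float, path: str = "evaluation_results.csv") -> None:
    """Append one evaluation result per line; back the file up somewhere safe."""
    row = [datetime.now(timezone.utc).isoformat(), model_name,
           dataset_version, metric_name, f"{value:.4f}"]
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(row)


# e.g. log_result("random-baseline", "bench-v1", "book_score", 0.03)
```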

Stage 2: Implement your baseline API

This is also a step that many might not pursue when they start a new project. There’s a lot of enthusiasm about the latest arXiv paper you read a few days ago, so you are dreaming of just getting started on that super awesomeness.

Document all these ideas and papers, they are very important down the line. But consider establishing a quick baseline API first.

2.1. Implement a simple baseline

This should be the quickest model that you can build to establish a baseline. Consider using off-the-shelf ML algorithms that don’t require huge amounts of time for fine-tuning or training, and build a model using the training dataset you have. Many ML and Deep Learning libraries come with tutorials and example demos bolted on, so don’t shy away from writing wrappers around them and using them for this.

Any heuristically built model is also a good baseline: wear your engineering hat and come up with hacky solutions that convert inputs to predictions.
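
For instance, a deliberately hacky heuristic for the book example might just trust the <title> tag and any citation_author meta tags, assuming BeautifulSoup is available and knowing full well that many pages won’t cooperate:

```python
from bs4 import BeautifulSoup  # third-party, but widely used


def heuristic_predict(page_html: str) -> dict:
    """Hacky baseline: trust the <title> tag and any citation_author meta tags."""
    soup = BeautifulSoup(page_html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    authors = [m.get("content", "").strip()
               for m in soup.find_all("meta", attrs={"name": "citation_author"})]
    return {"title": title, "authors": [a for a in authors if a]}
```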

And don’t overthink it! You are preparing for your big guns down the road! Just implement a sensible baseline as quickly as possible.

These baselines are there to give you a sense of how difficult the problem is: if your hacky off-the-shelf algorithm is already doing pretty well, is the problem an easy one? Or maybe you don’t have a good benchmark dataset? Or maybe something is wrong with your metrics?

If the results are poor, start analysing the error cases. Is there a repeating pattern in errors? Do these errors feel genuinely difficult? Would another approach have a better chance?

All of these will give you important tips and directions when you’re building more complex models.

2.2. Demo your baseline API

Now that you have a model that is hopefully doing better than random predictions, why not actually put all the bits together and release your first API? Depending on how good the baseline algorithm was, you might well be looking at the v0.1 release of your API here.

Demo this to your colleagues and see what they say; many of them are smart people, so you have a good chance of gathering invaluable feedback here. Slap a simple front-end on it if that makes it easier to show what’s happening. Visualise intermediate outputs from your model to help you debug and interpret what is happening when people are using the demo app.
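
One possible shape for that v0.1 demo API, using Flask purely as an example framework and assuming a predict() function like the one sketched earlier lives in a module of your own:

```python
from flask import Flask, jsonify, request

from model import predict  # assumed module holding your current baseline model

app = Flask(__name__)


@app.route("/v0.1/extract", methods=["POST"])
def extract():
    """Thin wrapper: raw HTML in the request body, the model's prediction out."""
    page_html = request.get_data(as_text=True)
    return jsonify(predict(page_html))


if __name__ == "__main__":
    app.run(port=8000)  # demo only; not a production deployment
```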

Stage 3: Let’s Rock

Now is the time to bring in the big guns and get fancy with the new model or architecture you thought about implementing. This is a very exciting phase: unless you’ve been dealing with a simple problem, there’s a lot of ground to be covered here with respect to your baselines.

This stage is often a cycle of a few repeating steps:

3.1. Research a new ML algorithm

Pick good papers and read with colleagues. Brainstorm on possible scenarios and approaches. Check and study open source implementations if they exist.

And early on, choose algorithms that have a good track record over a new architecture that you’ve just seen published with marginal performance gains. You will be able to come up with novel methods of your own to get similar or much better improvements, so from a benchmarking point of view it’s better to cover the more established models first.

3.2. Develop a prototype

Try implementing quick, albeit dirty, prototypes first. The goal here is to get to the end of training and evaluation as fast as possible.

Don’t spend a lot of time on the cleanest code, the best infrastructure for your model, etc., unless you genuinely know that you will be reusing these and that they will make your subsequent work much smoother. To me it’s always better to have a model that is proven to perform on a difficult problem, even if the code is complicated and not up to standard, than a very well designed codebase following all the standards of modern software practice that does only slightly better than the simple baseline above. The first problem is a great one to have compared to the second!

3.3. Do the engineering work and release

Once you’re happy with how your model performs and you see a considerable jump in performance, take on the engineering work: think about the best architecture, how you would scale the training and run-time, how you would deploy this as an API, how you would handle the DevOps side of it, and how you would write tests and documentation and spell out API contracts.

Remember to version your models (as well as the datasets mentioned above). Also add instrumentation to your API/UI to capture user feedback, create a consistent logging system for your models, and create dashboards on how your models are being used in production.
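
A minimal sketch of such instrumentation, where the version strings and the structure of the log entry are my own assumptions:

```python
import json
import logging
from datetime import datetime, timezone
from typing import Optional

logger = logging.getLogger("book_extractor")

MODEL_VERSION = "model-v0.2"        # assumed versioning scheme
DATASET_VERSION = "train-v3"        # assumed; tie this to your dataset manifest


def log_prediction(request_id: str, prediction: dict,
                   feedback: Optional[dict] = None) -> None:
    """One structured log line per request, so dashboards and error analysis stay easy."""
    logger.info(json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "request_id": request_id,
        "model_version": MODEL_VERSION,
        "train_dataset_version": DATASET_VERSION,
        "prediction": prediction,
        "user_feedback": feedback,  # e.g. thumbs up/down captured from the UI
    }))
```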

3.4. Repeat

If there is still good room for improvement, consider repeating the R&D cycle. Spend time analysing the error cases again, this time from your live API; check what users have been trying and see whether your models perform well on those cases.

Before embarking on a completely new model, remember that having a bigger/better dataset might actually make a lot of difference. So in parallel to R&D consider expanding the size and diversity of the dataset.

If you’re lucky enough to have a good user base and you’ve built mechanisms to capture their feedback, you might end up with a dataset that grows over time. If this is the case, consider automating and scheduling the training of your models to benefit from the new and hopefully better data. You can also use the benchmark evaluation as an additional criterion for the automatic publication of your newly trained models!
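
The publication gate itself can be very simple; here is a sketch, assuming a single benchmark score per model and an arbitrary minimum-gain threshold:

```python
def should_publish(new_score: float, current_score: float, min_gain: float = 0.01) -> bool:
    """Only promote a newly trained model if it beats the currently deployed one
    on the benchmark by a minimum margin (both the margin and the single-score
    comparison are simplifying assumptions)."""
    return new_score >= current_score + min_gain


# e.g. should_publish(new_score=0.71, current_score=0.68)  -> True
```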

3.5. Publish

Hopefully what you’ve been working on is innovative, and the gains you are measuring against state-of-the-art baselines are interesting and substantial enough to be useful to the research community. So consider writing a draft paper once you see signs of interesting outcomes (both positive and negative results). It’s always a good idea to review these documents with colleagues and peers during your R&D cycle to gather feedback on the methods you are developing, things to watch out for, and other ideas or methods you’ve not seen or considered. Early on, also think about publication venues (top conferences and workshops will have deadlines that you need to hit), as well as publishing your code and datasets. These steps can be challenging due to IP issues, so you need to get ahead of them to be able to hit deadlines.

Closing comments

The whole point of the approach I outlined above is to have a method that ensures you are taking solid steps to reduce the risks associated with ML projects. It also has an emphasis on speed and continuous delivery, compared to far-apart pure R&D cycles with occasional releases. If you’re lucky with the dataset situation and you have a couple of ML engineers/scientists, there is a good chance you will be demoing your v0.1 API (step 2.2) within a sprint or two, and your v0.2 with a more sophisticated model might be another couple of months or so away (don’t take this timeline as definite; it will depend highly on the nature of your problem, how familiar the team is with the project, whether you need special infrastructure, and so on. Think about your problem and adapt!).

This often sets you on a good path: you’ve covered the important metrics, you have a reliable training dataset and a test set you can trust, you’ve figured out your APIs, you will be gathering invaluable feedback from alpha/beta users, and you have solid baselines to compare against.

While I tried to cover as much as possible, I might have missed a few points here and there. I also don’t believe this is the only way of approaching ML projects, so I would love to hear about your experience.

Other recommendations

  1. Incorporate baseline methods as well as external datasets (if directly applicable) inside your codebases.
    1. Having the ability to run models on both internal and external datasets will accelerate the pace of research and how quickly you can go from an idea to a working example benchmarked against various methods and datasets (a minimal sketch follows after this list).
    2. This will also help the publication process, by knowing the performance against the state of the art and avoiding the pitfall of not having run your systems on external datasets.
  2. Prefer end-to-end testing as much as possible.
    1. Many of your systems will have an impact on downstream systems, which in turn will have a cascade of impacts until it reaches actual users. So try to capture the ripple effect of changes down the pipeline as much as you can.
    2. Perform error analysis on real-world live systems and use that to guide your research.
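
As a sketch of the cross-evaluation idea from recommendation 1 above (the .predict() interface and the (inputs, gold) dataset format are assumptions, not a specific library):

```python
def run_benchmarks(models: dict, datasets: dict, metric) -> dict:
    """Cross-evaluate every model on every dataset so internal and external
    benchmarks are always reported side by side. `models` maps names to objects
    with a .predict(x) method and `datasets` maps names to (inputs, gold) pairs."""
    results = {}
    for model_name, model in models.items():
        for dataset_name, (inputs, gold) in datasets.items():
            preds = [model.predict(x) for x in inputs]
            scores = [metric(p, g) for p, g in zip(preds, gold)]
            results[(model_name, dataset_name)] = sum(scores) / len(scores)
    return results
```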