Parts 1 and 2 of this series focussed on the technical aspects of improving the experimentation process. This started with rethinking how code is created, stored and used, and ended with utilising large-scale parallelisation to cut down the time taken to run experiments. This article takes a step back from the implementation details and instead takes a wider look at how and why we experiment, and how we can reduce the time to value of our projects by being smarter about experimenting.

Failing to plan is planning to fail

Starting a new project is often a very exciting time as a data scientist. You are faced with a new dataset with different requirements from previous projects, and you may have the chance to try out modelling techniques you have never used before. It is sorely tempting to jump straight into the data, starting with EDA and possibly some preliminary modelling. You are feeling energised and optimistic about the prospects of building a model that can deliver results to the business.

While enthusiasm is commendable, the situation can quickly change. Imagine now that months have passed and you are still running experiments, having already run hundreds, trying to tweak hyperparameters to gain an extra 1-2% in model performance. Your final model configuration has turned into a complex, interconnected ensemble using 4-5 base models that all need to be trained and monitored. Finally, after all of this, you find that your model barely improves upon the current process in place.

All of this could have been avoided if a more structured approach to the experimentation process had been taken. You are a data scientist, with emphasis on the scientist part, so knowing how to conduct an experiment is critical. In this article, I want to give some guidance on how to structure your project experimentation efficiently, to ensure you stay focussed on what is important when providing a solution to the business.

Gather more business information and then start simple

Before any modelling begins, you need to set out very clearly what you are trying to achieve. This is where a disconnect can happen between the technical and business sides of a project. The most important thing to remember as a data scientist is:

Your job is not to build a model, your job is to solve a business problem that may involve a model!

Using this point of view is invaluable in succeeding as a data scientist. I have been on projects before where we built a solution that had no problem to solve. Framing everything you do around supporting your business will greatly improve the chances of your solution being adopted.

With this in mind, your first steps should always be to gather the following pieces of information if they have not already been supplied:

  • What is the current business situation?
  • What are the key metrics that define the problem, and how do they want to improve them?
  • What is an acceptable metric improvement to consider any proposed solution a success?

An example of this would be:

You work for an online retailer who needs to make sure they are always adequately stocked. They are currently experiencing issues with either having too much stock lying around, which takes up inventory space, or not having enough stock to meet customer demand, which leads to delays. They require you to improve this process, ensuring they have enough product to meet demand while not overstocking.

Admittedly this is a contrived problem, but it hopefully illustrates that your role here is to unblock a business problem, not necessarily to build a model. From here you can dig deeper and ask:

  • How often are they overstocked or understocked?
  • Is it better to be overstocked or understocked?

Now that we have the problem properly framed, we can start thinking of a solution. Again, before going straight to a model, think about whether there are simpler methods that could be used. While training a model to forecast future demand may give great results, it also comes with baggage:

  • Where is the model going to be deployed?
  • What will happen if performance drops and the model needs to be retrained?
  • How can you explain its decisions to stakeholders if something goes wrong?

Starting with something simpler and non-ML based gives us a baseline to work from. There is also the possibility that this baseline could solve the problem at hand, entirely removing the need for a complex ML solution. Continuing the above example, perhaps a simple or weighted rolling average of previous customer demand may be sufficient. Or perhaps the items are seasonal and you need to scale demand up or down depending on the time of year.

Simpler methods may be able to answer the business question. Image by author
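As a concrete illustration, here is a minimal sketch of such a baseline, assuming a daily demand series held in a pandas DataFrame (the column names, window size and weights are hypothetical choices, not the only sensible ones):

```python
import numpy as np
import pandas as pd

# Hypothetical daily demand history for a single product
rng = np.random.default_rng(42)
demand = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=90, freq="D"),
    "units_sold": rng.poisson(lam=100, size=90),
})

# Simple baseline: forecast tomorrow's demand as the mean of the last 7 days
demand["forecast_simple"] = demand["units_sold"].rolling(window=7).mean().shift(1)

# Weighted variant: more recent days count for more
weights = np.arange(1, 8)  # oldest day gets weight 1, most recent gets weight 7
demand["forecast_weighted"] = (
    demand["units_sold"]
    .rolling(window=7)
    .apply(lambda window: np.average(window, weights=weights), raw=True)
    .shift(1)
)
```

A seasonal adjustment could be layered on top by multiplying the forecast by a per-month factor estimated from historical data.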

If a non-model baseline is not feasible or cannot answer the business problem, then moving onto a model-based solution is the next step. Taking a principled approach to iterating through ideas and trying out different experiment configurations will be critical to ensure you arrive at a solution in a timely manner.

Have a clear plan about experimentation

Once you have decided that a model is required, it is time to think about how you approach experimenting. While you could go straight into an exhaustive search of every possible model, hyperparameter, feature selection process, data treatment and so on, being more focussed in your setups and having a deliberate strategy will make it easier to determine what is working and what isn't. With this in mind, here are some ideas that you should consider.

Be aware of any constraints

Experimentation does not happen in a vacuum; it is one part of the project development process, which is itself just one of many projects going on within an organisation. As such, you will be forced to run your experimentation subject to limitations placed by the business. These constraints will require you to be economical with your time and may steer you towards particular solutions. Some example constraints that are likely to be placed on experiments are:

  • Timeboxing: Letting experiments go on forever is a risky endeavour, as you run the risk of your solution never making it to productionisation. As such, it is common to give a set time to develop a viable working solution, after which you move onto something else if it is not feasible.
  • Monetary: Running experiments takes up compute time, and that isn't free. This is especially true if you are leveraging third-party compute, where VMs are typically priced by the hour. If you are not careful you could easily rack up a huge compute bill, especially if you require GPUs. So care must be taken to understand the cost of your experimentation (a quick estimate like the sketch after this list can help).
  • Resource Availability: Your experiment will not be the only one going on in your organisation and there may be fixed computational resources. This means you may be limited in how many experiments you can run at any one time. You will therefore need to be smart in choosing which lines of work to explore.
  • Explainability: While understanding the decisions made by your model is always important, it becomes critical if you work in a regulated industry such as finance, where any bias or prejudice in your model could have serious repercussions. To ensure compliance you may need to restrict yourself to simpler but easier to interpret models such as regressions, Decision Trees or Support Vector Machines.

You may be subject to one or all of these constraints, so be prepared to navigate them.
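On the monetary point, a back-of-envelope estimate is often enough to sanity-check a plan before launching it. All figures below are made up for illustration; substitute your own cloud pricing and experiment counts:

```python
# Hypothetical figures: adjust to your own pricing and experiment plan
n_experiments = 200          # candidate configurations to try
avg_runtime_hours = 1.5      # average wall-clock time per experiment
vm_price_per_hour = 3.00     # e.g. an hourly-billed GPU instance, in £

estimated_cost = n_experiments * avg_runtime_hours * vm_price_per_hour
print(f"Estimated compute bill: £{estimated_cost:,.2f}")  # £900.00
```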

Start with simple baselines

When dealing with binary classification, for example, it can be tempting to go straight to a complex model such as LightGBM, as there is a wealth of literature on its efficacy for solving these types of problems. Before that, however, training a simple Logistic Regression model to serve as a baseline (sketched below) comes with the following benefits:

  • Few hyperparameters to assess, so quick iteration of experiments
  • A very straightforward decision process to explain
  • Any more complicated model has to beat it to justify its complexity
  • It may be enough to solve the problem at hand

Assessing clearly what additional complexity brings you in terms of performance is important. Image by author
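A minimal sketch of such a baseline, where a synthetic scikit-learn dataset stands in for your real data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for your real tabular dataset
X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

# Scaled Logistic Regression: almost nothing to tune, fast to fit
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(baseline, X, y, cv=5, scoring="f1")
print(f"Baseline F1: {scores.mean():.3f} ± {scores.std():.3f}")
```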

Beyond Logistic Regression, having an 'untuned' experiment for a particular model (little to no data treatments, no explicit feature selection, default hyperparameters) can also be important, as it will give an indication of how far you can push a particular avenue of experimentation. For example, if different experimental configurations are barely outperforming the untuned experiment, that could be evidence that you should refocus your efforts elsewhere.
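The untuned reference is equally cheap to set up. LightGBM is used here only because it was mentioned above; substitute whichever model family you are exploring:

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Same synthetic stand-in dataset as in the previous snippet
X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

# Default hyperparameters, no feature selection, no data treatments
untuned = LGBMClassifier(random_state=0)
scores = cross_val_score(untuned, X, y, cv=5, scoring="f1")
print(f"Untuned LightGBM F1: {scores.mean():.3f}")

# If carefully tuned configurations barely beat this number,
# that is a signal to refocus your effort elsewhere.
```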

Using raw vs semi-processed data

From a practicality standpoint, the data you receive from data engineering may not be in the perfect format to be consumed by your experiment. Issues can include:

  • Thousands of columns and millions of transactions, making it a strain on memory resources
  • Features which cannot easily be used within a model, such as nested structures like dictionaries, or datatypes like datetimes

Non-tabular data poses a problem to traditional ML methods. Image by author

There are a few different tactics to handle these scenarios:

  • Scale up the memory allocation of your experiment to handle the data size requirements. This may not always be possible
  • Include feature engineering as part of the experiment process
  • Process your data slightly prior to experimentation

There are pros and cons to each approach, and it is up to you to decide. Doing some pre-processing, such as removing features with complex data structures or incompatible datatypes, may be beneficial now, but it may require backtracking if those features come into scope later in the experimentation process. Feature engineering within the experiment may give you better control over what is being created, but it will introduce extra processing overhead for something that may be common across all experiments. There is no correct choice in this scenario; it is very much situation dependent.
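As an illustration of the "process slightly prior to experimentation" option, here is a sketch that flattens a nested column and converts a datetime into model-friendly numeric features. The column names and structures are hypothetical:

```python
import pandas as pd

# Hypothetical raw extract: a nested dict column and a raw datetime
raw = pd.DataFrame({
    "transaction_ts": pd.to_datetime(["2024-03-01 09:15", "2024-03-02 17:40"]),
    "basket": [{"items": 3, "value": 42.50}, {"items": 1, "value": 9.99}],
})

# Flatten the nested structure into ordinary columns
basket = pd.json_normalize(raw["basket"].tolist()).add_prefix("basket_")

# Turn the datetime into numeric features a model can consume
processed = pd.concat([raw.drop(columns=["basket"]), basket], axis=1)
processed["hour"] = raw["transaction_ts"].dt.hour
processed["day_of_week"] = raw["transaction_ts"].dt.dayofweek
processed = processed.drop(columns=["transaction_ts"])
```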

Evaluate model performance fairly

Calculating final model performance is the end goal of your experimentation. This is the result you are going to present to the business with the hope of getting approval to move onto the production phase of your project. So it is crucial that you give a fair and unbiased evaluation of your model that aligns with stakeholder requirements. Key aspects are:

  • Make sure your evaluation dataset took no part in your experimentation process
  • Your evaluation dataset should reflect a real life production setting
  • Your evaluation metrics should be business and not model focussed

Unbiased evaluation gives absolute confidence in results. Image by author

Having a standalone dataset for final evaluation ensures there is no bias in your results. For example, evaluating on the validation dataset you used to select features or hyperparameters is not a fair comparison, as you run the risk of overfitting your solution to that data. You therefore need a clean dataset that hasn't been used before. This may feel simplistic to call out, but it is so important that it bears repeating.

Your evaluation dataset being a true reflection of production gives confidence in your results. As an example, models I have trained in the past used months' or even years' worth of data to ensure behaviours such as seasonality were captured. Due to these time scales, the data volume was too large to use in its raw state, so downsampling had to occur prior to experimenting. However, the evaluation dataset should not be downsampled or modified in a way that distorts it from real life. This is acceptable because, for inference, you can use techniques like streaming or mini-batching to ingest the data.

Your evaluation data should also cover at least the minimum scoring window that will be used in production, and ideally multiples of that window. For example, if your model will score data every week, then having a day's worth of evaluation data is not sufficient. It should be at least a week's worth, and ideally 3 or 4 weeks' worth, so you can assess variability in results.
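A sketch of carving out such an evaluation set, assuming time-ordered data with a timestamp column (the file path, column name and four-week holdout are assumptions mirroring the guidance above):

```python
import pandas as pd

# Hypothetical extract; substitute your own data source and column names
df = pd.read_parquet("transactions.parquet")
df = df.sort_values("event_ts")

# Hold out the final four weeks, untouched, for the final evaluation
cutoff = df["event_ts"].max() - pd.Timedelta(weeks=4)
development = df[df["event_ts"] <= cutoff]  # EDA, training, validation, tuning
evaluation = df[df["event_ts"] > cutoff]    # used exactly once, at the end

# Downsample only the development data if volume is an issue;
# the evaluation set stays at full, production-like volume
development = development.sample(frac=0.1, random_state=0)

# Score week by week so you can assess variability across scoring windows
for week, chunk in evaluation.groupby(pd.Grouper(key="event_ts", freq="W")):
    pass  # evaluate the model on each weekly chunk here
```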

Validating the business value of your solution links back to what was said earlier about your role as a data scientist. You are here to solve a problem, not merely build a model. As such, it is very important to balance statistical and business significance when deciding how to showcase your proposed solution. The first aspect of this is to present results in terms of a metric the business can act on. Stakeholders may not know what a model with an F1 score of 0.95 means, but they know what a model that can save them £10 million annually brings to the company.
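One way to make that translation concrete is to attach a monetary value to each cell of the confusion matrix. The per-case values and volumes below are made up for illustration; in practice they would be agreed with stakeholders:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy stand-ins; in practice these come from your held-out evaluation set
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 1])

# Hypothetical per-case values, agreed with the business
VALUE_PER_TRUE_POSITIVE = 500.0   # £ saved per correctly flagged case
COST_PER_FALSE_POSITIVE = 50.0    # £ wasted per false alarm
CASES_PER_YEAR = 250_000          # annual volume the model will score

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
saving_per_case = (tp * VALUE_PER_TRUE_POSITIVE
                   - fp * COST_PER_FALSE_POSITIVE) / len(y_true)
print(f"Projected annual saving: £{saving_per_case * CASES_PER_YEAR:,.0f}")
```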

The second aspect is to take a cautious view of any proposed solution and think of all the failure points that can occur, especially if we start introducing complexity. Consider two proposed models:

  • A Logistic Regression model that operates on raw data with a projected saving of £10 million annually
  • A 100M parameter Neural Network that required extensive feature engineering, selection and model tuning with a projected saving of £10.5 million annually

The Neural Network is better in terms of absolute return, but it comes with significantly more complexity and potential points of failure. Additional engineering pipelines, complex retraining protocols and loss of explainability are all important aspects to consider, and we need to think about whether this overhead is worth an extra 5% uplift in performance. This scenario is contrived, but it hopefully illustrates the need to have a critical eye when evaluating results.

Know when to stop

When running the experimentation phase you are balancing two objectives: the desire to try out as many different experimental setups as possible versus any constraints you are facing, most likely the time allocated by the business for you to experiment. There is a third aspect to consider, and that is knowing when to end the experimentation phase early. This can be for a variety of reasons:

  • Your proposed solution already answers the business problem
  • Further experiments are experiencing diminishing returns
  • Your experiments aren’t producing the results you wanted

Your first instinct will be to use up all your available time, either to try to fix your model or to really push your solution to be the best it can be. However, you need to ask yourself whether your time could be better spent elsewhere, whether by moving onto productionisation, re-interpreting the current business problem if your solution isn't working, or moving onto another problem entirely. Your time is precious, and you should treat it accordingly to make sure whatever you are working on will have the biggest impact on the business.
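A lightweight way to formalise the diminishing-returns check is to track the best score across experiments and stop once recent runs stop moving it. The window size and threshold below are arbitrary choices to illustrate the idea:

```python
def should_stop(scores, window=10, min_improvement=0.005):
    """Stop if the best score has not improved by at least
    `min_improvement` over the last `window` experiments."""
    if len(scores) <= window:
        return False
    best_before = max(scores[:-window])
    best_recent = max(scores[-window:])
    return best_recent - best_before < min_improvement

# Example: metric history from a sequence of experiments, plateauing
history = [0.71, 0.74, 0.76, 0.762, 0.761, 0.763, 0.762,
           0.763, 0.764, 0.763, 0.762, 0.764, 0.763, 0.764]
print(should_stop(history))  # True: the last 10 runs added only 0.002
```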

Conclusion

In this article we have considered how to plan the model experimentation phase of your project. We have focussed less on technical details and more on the ethos you need to bring to experimentation. This started with taking time to understand the business problem, to define clearly what needs to be achieved for any proposed solution to be considered a success. We then looked at the constraints you may face and how they can impact your experimentation, and spoke about the importance of simple baselines as a reference point against which more complicated solutions can be compared. We finished by emphasising the importance of a fair evaluation dataset for calculating business metrics, to ensure there is no bias in your final result. By adhering to the recommendations laid out here, we greatly increase our chances of reducing the time to value of our data science projects by quickly and confidently iterating through the experimentation process.
