In part 1 of this series we spoke about creating re-usable code assets that can be deployed across multiple projects. Leveraging a centralised repository of common data science steps ensures that experiments can be carried out more quickly and with greater confidence in the results. A streamlined experimentation phase is critical in ensuring that you deliver value to the business as quickly as possible.

In this article I want to focus on how you can increase the velocity at which you can experiment. You may have 10s–100s of ideas for different setups that you want to try, and carrying them out efficiently will greatly increase your productivity. Carrying out a full retraining when model performance decays and exploring the inclusion of new features when they become available are just some situations where being able to quickly iterate over experiments becomes a great boon.

We Need To Talk About Notebooks (Again)

While Jupyter Notebooks are a great way to teach yourself about libraries and concepts, they can easily be misused and become a crutch that actively stands in the way of fast model development. Consider a data scientist moving onto a new project. The first step is typically to open up a new notebook and begin some exploratory data analysis: understanding what kind of data is available, computing some simple summary statistics, examining the outcome variable and finally producing some simple visualisations of the relationship between the features and the outcome. These steps are a worthwhile endeavour, as understanding your data is critical before you begin the experimentation process.

The issue isn’t with the EDA itself, but with what comes after. What normally happens is that the data scientist instantly opens a new notebook and begins writing their experiment framework, usually starting with data transformations. This is typically done by copying code snippets over from the EDA notebook. Once the first notebook is ready, it is executed and the results are either saved locally or written to an external location. That data is then picked up by another notebook and processed further, for example by feature selection, before being written back out. The process repeats until your experiment pipeline is formed of 5–6 notebooks which a data scientist must trigger sequentially for a single experiment to run.

Chaining notebooks together is an inefficient process. Image by author

With such a manual approach to experimentation, iterating over ideas and trying out different scenarios becomes a labour-intensive task. You end up with parallelisation at the human level, where whole teams of data scientists devote themselves to running experiments by keeping local copies of the notebooks and diligently editing their code to try different setups. The results are then added to a report, and once experimentation has finished the best performing setup is picked out from all the others.

All of this is not sustainable. Team members go off sick or take holidays, experiments run overnight in the hope that the notebook doesn’t crash, and nobody quite remembers which experimental setups have been done and which are still to do. These should not be worries you have when running an experiment. Thankfully there is a better way, one that lets you iterate over ideas in a structured and methodical manner at scale. This will greatly simplify the experimentation phase of your project and decrease its time to value.

Embrace Scripting To Create Your Experimental Pipeline

The first step in accelerating your ability to experiment is to move beyond notebooks and start scripting. This should be the simplest part of the process: you simply put your code into a .py file as opposed to the cells of a .ipynb. From there you can invoke your script from the command line, for example:

python src/main.py

if __name__ == "__main__":
    
    # placeholders: point these at your data and fill in the configuration
    # for each reusable step from your shared library
    input_data = ""
    output_loc = ""
    dataprep_config = {}
    featureselection_config = {}
    hyperparameter_config = {}
    
    # run the reusable pipeline components in sequence
    data = DataLoader().load(input_data)
    data_train, data_val = DataPrep().run(data, dataprep_config)
    features_to_keep = FeatureSelection().run(data_train, data_val, featureselection_config)
    model_hyperparameters = HyperparameterTuning().run(data_train, data_val, features_to_keep, hyperparameter_config)
    evaluation_metrics = Evaluation().run(data_train, data_val, features_to_keep, model_hyperparameters)
    ArtifactSaver(output_loc).save([data_train, data_val, features_to_keep, model_hyperparameters, evaluation_metrics])

Note that adhering to the principle of controlling your workflow by passing arguments into functions can greatly simplify the layout of your experimental pipeline. A script like this has already improved your ability to run experiments: you now only need a single invocation, as opposed to the stop-start nature of running multiple notebooks in sequence.
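
To make that principle concrete, here is a hypothetical sketch of what one such component could look like. DataPrep appears in the script above, but its internals are not shown in this series; the method body below, including the nan_treatment handling and the train/validation split, is purely illustrative.

import pandas as pd
from sklearn.model_selection import train_test_split


class DataPrep:
    """Illustrative data preparation step whose behaviour is driven by its arguments."""

    def run(self, data: pd.DataFrame, config: dict):
        # the workflow is controlled entirely by the passed-in config
        if config.get("nan_treatment") == "drop":
            data = data.dropna()
        elif config.get("nan_treatment") == "impute_zero":
            data = data.fillna(0)

        # hand back a train/validation split for the downstream steps
        data_train, data_val = train_test_split(data, test_size=0.2, random_state=42)
        return data_train, data_val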

You may want to add some input arguments to this script, such as being able to point to a particular data location, or specifying where to store output artefacts. You could easily extend your script to take some command line arguments:

python src/main_with_arguments.py --input_data <path_to_data> --output_loc <path_to_outputs>

if __name__ == "__main__":
    
    input_data, output_loc = parse_input_arguments()
    dataprep_config = {}
    featureselection_config = {}
    hyperparameter_config = {}
    
    data = DataLoader().load(input_data)
    data_train, data_val = DataPrep().run(data, dataprep_config)
    features_to_keep = FeatureSelection().run(data_train, data_val, featureselection_config)
    model_hyperparameters = HyperparameterTuning().run(data_train, data_val, features_to_keep, hyperparameter_config)
    evaluation_metrics = Evaluation().run(data_train, data_val, features_to_keep, model_hyperparameters)
    ArtifactSaver(output_loc).save([data_train, data_val, features_to_keep, model_hyperparameters, evaluation_metrics])
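
The parse_input_arguments helper is not shown above; a minimal sketch using argparse, with argument names chosen to match the invocation, might look like this:

import argparse


def parse_input_arguments():
    """Read the data location and artefact output location from the command line."""
    parser = argparse.ArgumentParser(description="Run the experiment pipeline")
    parser.add_argument("--input_data", required=True, help="location of the input data")
    parser.add_argument("--output_loc", required=True, help="where to store output artefacts")
    args = parser.parse_args()
    return args.input_data, args.output_loc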

At this point you have the start of a good pipeline; you can set the input and output locations and invoke your script with a single command. However, trying out new ideas is still a relatively manual endeavour: you need to go into your codebase and make changes. As previously mentioned, switching between different experiment setups should ideally be as simple as modifying the input arguments to a wrapper function that controls what needs to be carried out. We can bring all of these arguments into a single location so that modifying your experimental setup becomes trivial. The simplest way of implementing this is with a configuration file.

Configure Your Experiments With a Separate File

Storing all of your relevant function arguments in a separate file comes with several benefits. Splitting the configuration from the main codebase makes it easier to try out different experimental setups. You simply edit the relevant fields with whatever your new idea is and you are ready to go. You can even swap out entire configuration files with ease. You also have complete oversight over what exactly your experimental setup was. If you maintain a separate file per experiment then you can go back to previous experiments and see exactly what was carried out.

So what does a configuration file look like, and how does it interface with the experiment pipeline script you have created? A simple implementation is a YAML file set up in the following manner:

  1. Top level boolean flags to turn on and off the different parts of your pipeline
  2. For each step in your pipeline, define what calculations you want to carry out

file_locations:
    input_data: ""
    output_loc: ""

pipeline_steps:
    data_prep: True
    feature_selection: False
    hyperparameter_tuning: True
    evaluation: True
    
data_prep:
    nan_treatment: "drop"
    numerical_scaling: "normalize"
    categorical_encoding: "ohe"

This is a flexible and lightweight way of controlling how your experiments are run. You can then modify your script to load in this configuration and use it to control the workflow of your pipeline:

python src/main_with_config.py --config_loc <path_to_config>

if __name__ == "__main__":
    
    config_loc = parse_input_arguments()
    config = load_config(config_loc)
    
    data = DataLoader().load(config["file_locations"]["input_data"])
    
    # default to None so that any skipped step still leaves its outputs defined
    data_train, data_val = None, None
    features_to_keep = None
    model_hyperparameters = None
    evaluation_metrics = None
    
    if config["pipeline_steps"]["data_prep"]:
        data_train, data_val = DataPrep().run(data, 
                                              config["data_prep"])
        
    if config["pipeline_steps"]["feature_selection"]:
        features_to_keep = FeatureSelection().run(data_train, 
                                                  data_val,
                                                  config["feature_selection"])
    
    if config["pipeline_steps"]["hyperparameter_tuning"]:
        model_hyperparameters = HyperparameterTuning().run(data_train, 
                                                           data_val, 
                                                           features_to_keep, 
                                                           config["hyperparameter_tuning"])
    
    if config["pipeline_steps"]["evaluation"]:
        evaluation_metrics = Evaluation().run(data_train, 
                                              data_val, 
                                              features_to_keep, 
                                              model_hyperparameters)
    
    
    ArtifactSaver(config["file_locations"]["output_loc"]).save([data_train, 
                                                                data_val, 
                                                                features_to_keep, 
                                                                model_hyperparameters, 
                                                                evaluation_metrics])
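
As before, the helper functions are not shown. In this version parse_input_arguments would be reduced to a single --config_loc argument, and load_config simply reads the YAML file into a dictionary; a minimal sketch, assuming PyYAML is installed:

import yaml


def load_config(config_loc: str) -> dict:
    """Read the YAML experiment configuration into a plain dictionary."""
    with open(config_loc, "r") as f:
        return yaml.safe_load(f)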

We have now completely decoupled the setup of our experiment from the code that executes it. Which experimental setup we try is determined entirely by the configuration file, making it trivial to test new ideas. We can even control which steps are carried out, allowing scenarios like:

  1. Running only data preparation and feature selection to generate an initial processed dataset, which can then form the basis of more detailed experimentation with different models and their hyperparameters

Leverage Automation and Parallelism

We now have the ability to configure different experimental setups via a configuration file and launch a full end-to-end experiment with a single command line invocation. All that is left is to scale this capability so we can iterate over different experiment setups as quickly as possible. The keys to this are:

  1. Automation to programmatically modify the configuration file
  2. Parallel execution of experiments

Step 1) is relatively trivial. We can write a shell script, or even a secondary Python script, whose job is to iterate over the different experimental setups the user supplies and launch a pipeline run for each one. For example:

#!/bin/bash

# CONFIG_LOC points at the experiment configuration file;
# update_config_file is assumed to be a helper that edits the
# nan_treatment field of that config in place
for nan_treatment in drop impute_zero impute_mean
do
  update_config_file "$nan_treatment" "$CONFIG_LOC"
  python3 ./src/main_with_config.py --config_loc "$CONFIG_LOC"
done
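
The same loop can equally be written as a secondary Python script. The sketch below edits the config with PyYAML and launches each run via subprocess; the config path is illustrative.

import subprocess

import yaml

CONFIG_LOC = "configs/experiment.yaml"  # illustrative path to the experiment config

for nan_treatment in ["drop", "impute_zero", "impute_mean"]:
    # read the current config, modify the field under test and write it back
    with open(CONFIG_LOC, "r") as f:
        config = yaml.safe_load(f)
    config["data_prep"]["nan_treatment"] = nan_treatment
    with open(CONFIG_LOC, "w") as f:
        yaml.safe_dump(config, f)

    # launch the pipeline with the updated configuration
    subprocess.run(["python3", "src/main_with_config.py", "--config_loc", CONFIG_LOC], check=True)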

Step 2) is a more interesting proposition and is very much situation dependent. The experiments you run are self-contained and have no dependency on each other, which means that in theory they can all be launched at the same time. In practice this relies on having access to external compute, either in-house or through a cloud service provider. If you do, each experiment can be launched as a separate job on that compute. This brings other considerations, such as deploying Docker images to ensure a consistent environment across experiments and working out how to get your code onto the external compute, but once these are solved you can launch as many experiments as you wish; you are limited only by the resources of your compute provider.
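
How jobs are submitted will depend entirely on your platform, so the sketch below only illustrates the principle on a single machine: because the runs are independent, each configuration file can be handed to its own process. On external compute, each call would instead become a separate job submission. The file paths are illustrative.

import subprocess
from concurrent.futures import ProcessPoolExecutor

# one configuration file per experiment (paths are illustrative)
CONFIG_FILES = ["configs/exp_drop.yaml", "configs/exp_impute_zero.yaml", "configs/exp_impute_mean.yaml"]


def launch_experiment(config_loc: str) -> int:
    """Run one fully configured experiment in its own process and return its exit code."""
    return subprocess.run(["python3", "src/main_with_config.py", "--config_loc", config_loc]).returncode


if __name__ == "__main__":
    # the experiments are self-contained, so they can run side by side
    with ProcessPoolExecutor(max_workers=3) as executor:
        exit_codes = list(executor.map(launch_experiment, CONFIG_FILES))
    print(exit_codes)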

Embed Loggers and Experiment Trackers for Easy Oversight

Having the ability to launch hundreds of parallel experiments on external compute is a clear victory on the path to reducing the time to value of data science projects. However, abstracting away this process makes it harder to interrogate, especially when something goes wrong. The interactive nature of notebooks made it possible to execute a cell and instantly look at the result.

Tracking the progress of your pipeline can be achieved by using a logger in your experiment. You can capture key results, such as the features chosen by the selection process, or use it to signpost what is currently executing in the pipeline. If something goes wrong, you can reference the log entries to figure out where the issue occurred, and then add more logging to better understand and resolve it. For example:

import logging

# minimal logger setup; in a real pipeline this would typically live in a shared module
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

logger.info("Splitting data into train and validation set")
df_train, df_val = create_data_split(df, method = 'random')
logger.info(f"training data size: {df_train.shape[0]}, validation data size: {df_val.shape[0]}")
            
logger.info(f"treating missing data via: {missing_method}")
df_train = treat_missing_data(df_train, method = missing_method)

logger.info(f"scaling numerical data via: {scale_method}")
df_train = scale_numerical_features(df_train, method = scale_method)

logger.info(f"encoding categorical data via: {encode_method}")
df_train = encode_categorical_features(df_train, method = encode_method)
logger.info(f"number of features after encoding: {df_train.shape[1]}")

The final aspect of launching large scale parallel experiments is finding efficient ways of analysing them to quickly find the best performing setup. Reading through event logs or having to open up performance files for each experiment individually will quickly undo all the hard work you have done in ensuring a streamlined experimental process.

The easiest thing to do is to embed an experiment tracker into your pipeline script. A variety of first- and third-party tools are available that let you set up a project space and log the important performance metrics of every experimental setup you consider. They normally come with a configurable front end that lets you create simple plots for comparison, which makes finding the best performing experiment a much simpler endeavour.
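
As one concrete illustration, using MLflow purely as an example (other trackers expose similar APIs), the end of the pipeline script could log the configuration and results of each run. This assumes evaluation_metrics is a dictionary of numeric scores and reuses the config and features_to_keep variables from the script above.

import mlflow

# group every run of this pipeline under one experiment (the name is illustrative)
mlflow.set_experiment("experiment-pipeline")

with mlflow.start_run():
    # record the setup of this run alongside its results
    mlflow.log_params(config["data_prep"])
    mlflow.log_param("n_features_kept", len(features_to_keep))
    # assumes evaluation_metrics maps metric name -> float
    mlflow.log_metrics(evaluation_metrics)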

Conclusion

In this article we have explored how to create pipelines that make it effortless to carry out the experimentation process. This involved moving out of notebooks and converting your experiment process into a single script. That script is backed by a configuration file that controls the setup of your experiment, making it trivial to try out different setups. External compute is then leveraged to parallelize the execution of the experiments. Finally, we spoke about using loggers and experiment trackers to maintain oversight of your experiments and more easily track their outcomes. All of this allows data scientists to greatly accelerate their ability to run experiments, reducing the time to value of their projects and delivering results to the business more quickly.
