The Importance of Effectively Experimenting in an AI PhD • David Stutz


When working in AI, especially during an empirical PhD, the engineering required for effective experimentation is incredibly important. I procrastinated on writing this article for a while, but the topic has become more and more apparent now that a large portion of research has shifted to working with large foundation models. Even ~7 years ago, when I started my PhD, I quickly realized that running experiments effectively would be crucial: effective experimentation means that research hypotheses can be tested quickly and the results provide insights into the next hypotheses to test. For example, my TPAMI paper on bit error robustness includes thousands of trained and evaluated models to test various orthogonal hypotheses.

Unfortunately, infrastructure and utilities for managing experiments are rarely open-sourced alongside research code. This may be because these aspects often receive less attention or are extremely specific to circumstances such as hardware, operating systems, etc. Even my open-sourced code usually includes only the essentials and discards many utilities that I used daily for running experiments. Nevertheless, you can find individual artifacts here and there, for example the logging utilities and setup tests from my confidence-calibrated adversarial training work, or the JSON configuration files from my work on 3D shape completion.

In this article, I want to share some more general lessons that I learned during my PhD for running large-scale experiments with machine learning models. Specifically, I identified the following aspects as important:

  • Running the right experiments
  • Making experiments reproducible
  • Analysis through logging and monitoring
  • Automating everything

Running the Right Experiments

This is the key problem we tackle in research every day, and I feel it is to a large extent intuition. However, there are some tricks I learned that make it easier. For me, ideas for experiments usually came from reading papers or discussing problems with colleagues and advisors. Once an idea became more concrete, I would try to write down a hypothesis and a high-level experiment design. Sometimes this experimental setup needs to be iterated on: how exactly to implement it, what to measure, and how to summarize the results. Then, once the setup is finalized and implemented, I would run the experiment and make sure to save the evaluation results and analysis, which could be some numbers, a plot, or just a Jupyter notebook.
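
To make this concrete, the following is a minimal sketch of what such an experiment record could look like; the fields and the experiments.jsonl file name are my own choices for illustration rather than a fixed format:

import json
import time

# A hypothetical record summarizing one experiment: the hypothesis it tests,
# the high-level design, and where the results/analysis were saved.
experiment = {
    'name': 'resnet20_bit_error_p001',
    'hypothesis': 'Weight clipping during training improves robustness to random bit errors.',
    'design': 'Train ResNet-20 with and without clipping, evaluate accuracy under injected bit errors.',
    'commit': 'abc1234',  # git commit the experiment was run from
    'config': 'configs/clipping.json',
    'results': 'experiments/resnet20_bit_error_p001/analysis.ipynb',
    'timestamp': time.strftime('%Y-%m-%d %H:%M:%S'),
}

# Append to a running log of all experiments.
with open('experiments.jsonl', 'a') as f:
    f.write(json.dumps(experiment) + '\n')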

Ideally, every experiment should answer a fairly specific question. Especially at the beginning of a PhD, it can be useful to start with rather incremental research questions based on prior published work. Of course, this is somewhat idealistic: there will be plenty of experiments that are mainly meant to test code or that fail for various unrelated reasons, and sometimes one question involves a series of experiments, for example, hyper-parameter tuning. However, I found that repeatedly doing this exercise of explicitly writing down a specific hypothesis and then evaluating it with a series of experiments helps to develop intuition and the ability to come up with good research questions for the next experiments. It also makes sure that each experiment serves a purpose.

Reproducible Experiments

Before starting experiments, it is important to ensure they are reproducible. Reproducibility is a key element of academic research as it allows others to replicate experiments and eventually come to a consensus about the research hypothesis being tackled. In AI, reproducibility is even more important since a lot of the research we do is constructive in nature, coming up with new algorithms or training new models. In my experience, the three most important aspects enabling reproducibility are version control, taming randomness and controlling the coding environment. Specific to ML experiments, I also learned that having explicit configuration files is important. And even though machine learning models are hard to test, I am convinced that having at least some tests can improve reproducibility significantly.

Version control sounds straightforward and should be the default. However, it mostly addresses version control of code and does not necessarily tell us how to deal with experiments. I learned that associating each and every experiment with a commit is the ideal outcome of using version control. In practice, I implemented this by having my code checked out twice: once for active development, and once for running experiments. The latter is not even opened in an IDE, so there is no way to make changes; it is only meant to run experiments. In many version control setups, this could also be accomplished using separate branches: a “development” branch and an “experiment” branch. I then have a launch script that makes sure all changes in the “development” branch are committed as part of launching the experiment. These changes are then checked out in the “experiment” branch to run the actual experiment. With some additional logging, this looks as follows:


import os

# log, LogLevel and common.experiments.Monitor come from my custom logging and
# monitoring utilities; args holds the parsed command-line arguments of this
# launch script.


def yes_or_no():
    answer = input('Commit? (y/n): ').lower().strip()
    while answer not in ('y', 'yes', 'n', 'no'):
        answer = input('Commit? (y/n): ').lower().strip()
    return answer[0] == 'y'


# This is run from the development checkout of the code, and WDIR
# holds the directory from which experiments are run:
WDIR = '...'
if os.path.normpath(WDIR) != os.path.normpath(os.getcwd()):
    if yes_or_no():
        # Commit and push all local changes (commit arguments elided).
        os.system('git commit ..')
        os.system('git push origin master')
    else:
        exit()

# Name of the experiment.
name = '...'
# All experiments were started on a contact server as tmux sessions so I can
# check on their progress and debug if I want to.
response = os.system('tmux has-session -t=%s' % name)
# tmux has-session exits with 1 (raw status 256) if the session does not exist.
exists = (response != 256)
if exists:
    log('Name taken!', LogLevel.WARNING)
    log('Continue to kill session before starting the experiment:')
    input('press any key')

files = args.file.split(',')
if len(files) > 1:
    log('[Error] multiple commands not supported on Slurm')
    exit()

# The actual server to run the experiment on, script with arguments to run.
server = '...'  # This could be a Slurm submit server.
script = '...'
arguments = '...'
script_cmd = 'python3 %s %s' % (script, arguments)
# There could be an optional step to set up some Slurm launch file or any
# other file needed for a specific cluster.

commands = [
    'tmux kill-session -t %s' % name,
    'tmux new-session -d -s %s' % name,
    # ssh into the server where the experiment is actually run.
    'tmux send-keys -t %s "ssh %s" ENTER' % (name, server),
    'tmux send-keys -t %s "cd %s" ENTER' % (name, WDIR),
    # Some optional bash profile to set the right CUDA env.
    'tmux send-keys -t %s "source ~/.bashrc_cuda10" ENTER' % name,
    # Git pull the commit we did above.
    'tmux send-keys -t %s "git pull" ENTER' % name,
    # Run the experiment; this will also log progress, experiment configuration etc.
    'tmux send-keys -t %s "%s" ENTER' % (name, script_cmd),
]

for command in commands:
    # Log each command to the monitor and the regular log before running it.
    common.experiments.Monitor.get_instance().log(command)
    log(command)
    os.system(command)

Another important aspect of version control is the question of what to include in versioning and what not. While in classical software development configuration files and data are usually handled separately, I believe it is important to include all configuration files with hyper-parameters. This also includes all random seeds and paths to data files, and might even include some data files, for example data normalization values that have been tuned.

Randomness is a key ingredient in all of machine learning. It determines the initialization of models, data splits, the order in which we see training examples, components such as dropout, and any random noise during optimization. As a result, randomness commonly stems from various different libraries and functions; for example, it is fairly common to deal with TensorFlow, PyTorch and NumPy random seeds at the same time. Personally, I think JAX is currently doing the best job in making randomness explicit in function calls. For reproducibility, it is crucial to be aware of all sources of randomness and control them relentlessly; this includes not only training but also evaluation and testing. I usually have explicit seeds for all sources of randomness that can be controlled in configuration files.
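
As a minimal sketch of what this can look like, assuming the seeds come from the experiment configuration discussed below, seeding NumPy and PyTorch globally and threading an explicit key through JAX might be done as follows:

import numpy as np
import torch
import jax

# Hypothetical seeds; in practice these are read from the experiment configuration.
config = {'numpy_seed': 42, 'torch_seed': 43, 'jax_seed': 44}

# NumPy and PyTorch use global random state that is seeded once.
np.random.seed(config['numpy_seed'])
torch.manual_seed(config['torch_seed'])
torch.cuda.manual_seed_all(config['torch_seed'])

# JAX makes randomness explicit: a key is derived from the seed and split
# whenever a new source of randomness is needed.
key = jax.random.PRNGKey(config['jax_seed'])
key, init_key, noise_key = jax.random.split(key, 3)
noise = jax.random.normal(noise_key, (3, 3))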

The environment is fairly easy to control these days, but can still be challenging when having to support various types of GPUs across machine learning projects. There are tools like conda to fully control all software versions used, and I recommend relying heavily on such tools. This is important as the environment may vary across experiments, especially when working with open-sourced baselines. For conda, I found it useful to update, export and commit the environment.yml whenever updating packages.
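
As part of a setup or launch script, exporting and committing the environment can be automated along these lines; this is only a sketch using conda's standard export command, and the commit message is a placeholder:

import os

# Export the current conda environment (without machine-specific build strings)
# and commit it alongside the code, so every experiment records its environment.
os.system('conda env export --no-builds > environment.yml')
os.system('git add environment.yml')
os.system('git commit -m "Update conda environment." environment.yml')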

The experiment configuration includes all hyper-parameters, seeds, model configuration, and the evaluation and dataset configuration. As mentioned above, it should be part of version control. It is also worth collecting all these parameters and any additional magic numbers in an explicit configuration file: each experiment should be fully defined by its configuration. Then, given the commit in the repository and the experiment configuration (including random seeds), each experiment should be fully reproducible. Over the years, it became common practice to store these hyper-parameters in JSON files or as Python dictionaries; another alternative is Google’s ml_collections. I also learned that dumping these configuration files alongside the experiment output can be useful for debugging.
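
A minimal sketch of this pattern, with hypothetical keys and paths chosen purely for illustration, could look like this:

import json
import os

# Load the experiment configuration; together with the git commit, this file
# should fully define the experiment.
with open('configs/experiment.json', 'r') as f:
    config = json.load(f)
# Example contents:
# {
#   "name": "resnet20_baseline", "dataset": "cifar10", "data_dir": "/data/cifar10",
#   "model": "resnet20", "lr": 0.05, "batch_size": 128, "epochs": 250,
#   "numpy_seed": 42, "torch_seed": 43
# }

# Dump the configuration alongside the experiment output for debugging.
output_dir = os.path.join('experiments', config.get('name', 'experiment'))
os.makedirs(output_dir, exist_ok=True)
with open(os.path.join(output_dir, 'config.json'), 'w') as f:
    json.dump(config, f, indent=2)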

Finally, a short note on tests: since this is research, most PhD students are reluctant to add elaborate tests, and experiments failing is a very common occurrence and part of research. However, everyone knows how annoying it is to start an experiment on Friday evening and discover on Monday that it failed because it could not find the data, a package update messed up an import, it could not run on GPUs with the right CUDA version, or some location was not writable/readable as expected. So I started having a test suite that checks (a) all necessary imports, (b) all data sources, (c) readable/writable locations, and (d) basic operations on GPU, among others. This setup.py is an early version of that. Such a suite should be lightweight enough to run before each experiment.
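
The following is a minimal sketch of such a check, not the setup.py linked above; the data paths, output location and package list are placeholders:

import importlib
import os

import torch


def check_imports():
    # (a) All necessary packages can be imported.
    for package in ['numpy', 'torch', 'torchvision', 'h5py']:
        importlib.import_module(package)


def check_data():
    # (b) All data sources exist and are readable.
    for path in ['/path/to/train.h5', '/path/to/test.h5']:
        assert os.path.exists(path) and os.access(path, os.R_OK), path


def check_output():
    # (c) Output locations are writable.
    assert os.access('/path/to/experiments/', os.W_OK)


def check_gpu():
    # (d) Basic operations work on the GPU with the installed CUDA version.
    assert torch.cuda.is_available()
    x = torch.rand(8, 8, device='cuda')
    assert torch.allclose(x @ torch.eye(8, device='cuda'), x)


if __name__ == '__main__':
    check_imports()
    check_data()
    check_output()
    check_gpu()
    print('All checks passed.')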

Analysis through Logging and Monitoring

After properly starting an experiment, logging and monitoring are important for debugging and analysis. The goal of logging is to output useful information that helps track down errors; monitoring, in contrast, focuses on analysis. For example, this includes tracking key quantities throughout training or evaluation that are later used to actually evaluate the experiment and the corresponding research hypothesis. Both logging and monitoring are somewhat cumbersome to set up because they often require passing information around across models, training epochs, datasets, etc.

In the beginning of my PhD, I wrote a custom logging utility myself, but I am sure there are better alternatives around nowadays. Essentially, I learned that it is useful to always log key experiment information to a (non-temporary) file. Loggers typically include timestamps and the file a message was logged from. Specifically, I found the following things very useful to log (a minimal sketch of such a setup follows the list):

  • Experiment environment (host/machine, GPU info, any scheduling/descheduling events, Python/package versions)
  • Experiment configuration (network architecture, all hyper-parameters, etc.)
  • Data loading (number of examples, files read, image or input sizes, network size after initialization, etc.)
  • Losses, including all components such as regularization terms
  • Checkpoints, model files written, data files written, plots produced, etc.
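
My own utility is not public in full, but the sketch below uses Python's standard logging module as one possible alternative; the logged values and the log file name are placeholders:

import logging
import sys

# Log to a non-temporary file *and* to stdout, including the timestamp and the
# file/line each message was logged from.
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s [%(filename)s:%(lineno)d] %(levelname)s: %(message)s',
    handlers=[
        logging.FileHandler('log.txt'),
        logging.StreamHandler(sys.stdout),
    ],
)

logging.info('Host: %s, GPU: %s', 'gpu-node-07', 'NVIDIA V100')
logging.info('Configuration: %s', {'model': 'resnet20', 'lr': 0.05, 'batch_size': 128})
logging.info('Loaded %d training examples of size %s.', 50000, (32, 32, 3))
logging.info('Epoch %d: loss %.4f (cross-entropy %.4f + weight decay %.4f).', 10, 0.3123, 0.2987, 0.0136)
logging.info('Checkpoint written to %s.', 'experiments/resnet20/model_010.pth')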

Beyond basic logging, monitoring actually saves intermediate state or experiment results that are relevant to the research hypotheses being tackled. Usually this includes the trained model or evaluation results such as metrics or raw predictions. But it might also include intermediate model activations, predictions on the test set throughout training, or some training or test inputs to check data augmentation schemes. Many of these can be combined with tools such as TensorBoard to make monitoring interactive. Essentially, I wanted to have access to all the information needed to run the planned analysis.
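
As one concrete option, assuming PyTorch with its TensorBoard integration is used, a monitoring sketch could look as follows; the logged quantities are placeholders for real training metrics and model state:

import torch
from torch.utils.tensorboard import SummaryWriter

# Write monitored quantities to an event file that TensorBoard displays interactively.
writer = SummaryWriter(log_dir='runs/experiment')

for epoch in range(10):
    train_loss = 1.0 / (epoch + 1)  # placeholder for the real training loss
    test_error = 0.5 / (epoch + 1)  # placeholder for the real test error
    writer.add_scalar('train/loss', train_loss, epoch)
    writer.add_scalar('test/error', test_error, epoch)
    # Intermediate state can be monitored as well, e.g., weight histograms:
    writer.add_histogram('model/layer1_weights', torch.randn(100), epoch)

writer.close()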

Automating everything

A lot of the above is implicitly about automation: automated version control, scheduling, logging, monitoring, etc. Fundamentally, automation requires all of these components because otherwise your time becomes the bottleneck. The number of experiments you can run is then limited by your bandwidth, and you cannot parallelize experiments with other work items such as paper writing, discussions and programming. However, automation can go even further. I typically try to also automate plotting and analysis to a large extent. For example, evaluation can run automatically after training, and the analysis can be a Jupyter notebook that is executed automatically and saved as PDF or HTML afterwards. This makes it possible to run larger sets of experiments, especially for ablations or appendices when preparing top-tier paper submissions.
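
Chaining evaluation and analysis after training could look like the following sketch; it relies on jupyter nbconvert's standard --execute option, and the script and notebook names are placeholders:

import os

# After training finishes, run the evaluation and then execute the analysis
# notebook, saving the rendered result as HTML next to the experiment output.
os.system('python3 evaluate.py --config experiments/resnet20/config.json')
os.system('jupyter nbconvert --to html --execute analysis.ipynb '
          '--output-dir experiments/resnet20 --output analysis.html')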

Conclusion

Overall, effectively running experiments is important for an empirical PhD in AI. For me, running the right experiments, making experiments reproducible, proper logging and monitoring, and automation are the key aspects of enabling effective experiments. All of these make it easier to test research hypotheses, allow you to run more experiments, and free your time for other work items besides “baby-sitting” experiments. Now at Google DeepMind, I have also learned that these aspects are what make many projects incredibly successful.


