Scale And Track Your AI/ML Workflows: neptune.ai + Flyte & Union Integration

In machine learning (ML) and artificial intelligence (AI), managing, tracking, and visualizing model training is a significant challenge because of the scale and complexity of the data, models, and resources involved.

Union, an optimized and more performant version of the open-source solution Flyte, provides scalability, declarative infrastructure, and data lineage, allowing AI developers to iterate and productionize AI or ML workflows quickly.

Neptune is an experiment tracker that allows AI researchers to monitor their model training in real time, visualize and compare experiments, and collaborate on them with a team. Like Union, Neptune excels in scalability, making it the ideal tracking solution for teams working on large-scale model training.

The new Neptune Flyte plugin enables you to use Neptune to track, visualize, and manage your models. The plugin automatically logs Flyte’s execution metadata into Neptune and adds a link in Union’s UI to Neptune. In this blog post, you’ll learn how to use the Neptune plugin on Union.

Orchestrate and track your models with Flytekit’s Neptune Plugin

In Union, data and compute are fundamental building blocks for developing all workflows. You can train models using machine learning or AI libraries such as PyTorch Lightning or XGBoost. Union is built on Flyte, which uses declarative orchestration to scale any computation easily.

In this first example, flytekit’s neptune_init_run decorator configures the Neptune run, and Neptune’s PyTorch Lightning integration then tracks the model’s progress automatically. With Flyte’s declarative infrastructure, you set accelerator=A100 to allocate an NVIDIA A100 GPU to the training task that runs Lightning. The neptune_init_run decorator initializes a Neptune Run object and stores it in Flyte’s context, where the task body can retrieve it.

With the plugin, Union’s execution page now has a link that goes directly to Neptune’s web app dashboard.

The Neptune Run object can be used directly or passed to one of Neptune’s integrations with machine learning libraries. With PyTorch Lightning, for example, the run can track metrics during training.
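As a small illustration of using the run directly, the sketch below appends per-epoch metrics to dictionary-style fields, which is how a Neptune Run accepts series values. To keep the snippet runnable without credentials, a defaultdict stands in for the real Run. The helper name log_training_metrics, the field names, and the metric values are all made up for this example.

```python
from collections import defaultdict


def log_training_metrics(run, losses, accuracies):
    """Append one loss and one accuracy value per epoch to the run."""
    for loss, acc in zip(losses, accuracies):
        run["train/loss"].append(loss)
        run["train/accuracy"].append(acc)


# Stand-in for the Neptune Run; in a Flyte task the run would come from
# Flyte's context instead of being created here.
run = defaultdict(list)

# Illustrative values only, not real training results.
log_training_metrics(run, losses=[0.9, 0.5, 0.3], accuracies=[0.60, 0.78, 0.88])
```

With PyTorch Lightning you would normally not write this loop yourself: handing the run to Neptune’s Lightning integration lets the trainer log the metrics for you.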

Scale to multiple training tasks with dynamic workflows

With Flyte’s dynamic workflows, you can quickly scale up to multiple training tasks, each with its own resources. This example uses Flyte’s declarative infrastructure to train several XGBoost models. As in the previous example, Flyte’s context provides a Neptune Run object, which is passed to Neptune’s XGBoost integration.

Neptune’s XGBoost integration will automatically log metadata associated with training the XGBoost model.
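The fan-out shape of such a workflow can be sketched in plain Python. This is only an outline of the control flow under simplifying assumptions: in Flyte, the outer function would be a dynamic workflow and each call to train_model would become a separate task with its own resources and Neptune run. Here both are ordinary functions, a dict stands in for the run, no model is trained, and the names train_model, train_multiple_models, and max_depth are hypothetical.

```python
from typing import Any, Dict, List


def train_model(max_depth: int) -> Dict[str, Any]:
    """Stand-in for one training task with its own run."""
    run: Dict[str, Any] = {}  # placeholder for the task's Neptune run
    run["params/max_depth"] = max_depth
    # A real task would train an XGBoost model here and pass the run to
    # Neptune's XGBoost integration; this sketch only records the parameter.
    return run


def train_multiple_models(max_depths: List[int]) -> List[Dict[str, Any]]:
    """Stand-in for the dynamic workflow: one train_model call per value."""
    return [train_model(max_depth=depth) for depth in max_depths]


results = train_multiple_models([2, 4, 6])
```

Because the list of hyperparameter values is an ordinary input, the number of training tasks is decided at run time, which is what makes the workflow dynamic.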

In the Union UI, the workflow dynamically scales out to multiple tasks, each with a link to Neptune.

Wrapping up

Union’s declarative infrastructure and scalable orchestration platform make it simple to scale up your machine learning or AI workflows and put them into production. With flytekit’s Neptune plugin, you can easily track your experiments, visualize results, and debug your models. To use the plugin, install it with pip install flytekitplugins-neptune.

To learn more about Union, contact the team at union.ai/demo.

To learn more about Neptune, get in touch with us at neptune.ai/contact-us.