3 Takes on End-to-End For the MLOps Stack: Was It Worth It?

As machine learning (ML) drives innovation across industries, organizations seek ways to improve and optimize their ML workflows. End-to-end (E2E) MLOps platforms promise to simplify the complicated process of building, deploying, and maintaining ML models in production.

However, while E2E MLOps platforms promise convenience and integration, they may not always align with an organization’s specific needs, existing infrastructure, or long-term goals. In some cases, assembling a custom MLOps stack using individual components may provide greater flexibility, control, and cost-effectiveness.

To help you make this decision, I interviewed three MLOps experts about their experiences with end-to-end platforms, stacks composed of open-source components, or a mix of both:

  • Ricard Borràs is a staff machine learning engineer at Veriff, an identity verification company.
  • Médéric Hurier is a freelance MLOps engineer at Decathlon Digital, the technology branch of a leading company in the multisport retail market.
  • Maria Vechtomova is a tech lead and product manager for the MLOps framework at one of the world’s largest retailers.

Ricard Borràs’ take on E2E solutions: success or failure?

Ricard Borràs is an experienced machine learning engineer leading MLOps efforts at Veriff. Veriff is an identity verification platform that combines AI-powered automation with human feedback, deep insights, and expertise.

When I spoke to Ricard, he made it clear right away that he prefers building an MLOps stack with individual components rather than relying solely on end-to-end (E2E) solutions:

If you work with three super basic models that are easy to evaluate, maybe it’s enough [to use E2E platforms]. But I recommend open-source components if you work with more complicated tasks such as computer vision, LLMs, etc.

Ricard Borràs

MLOps Lead at Veriff

The MLOps workflow

When I asked about his preferred MLOps workflow, Ricard told me that his very first task at Veriff was to reduce the time it took to develop and use ML models in production.

At first, the process was complicated, and deploying models took months. Ricard’s goal was to streamline this process to make it faster and more cost-effective while easing the workload for data scientists.

Ricard’s team at Veriff implemented a two-part MLOps workflow:

1. Experimentation platform: This platform builds on Metaflow for orchestration and data sharing, and uses Comet for ML experiment tracking.

In our interview, Ricard highlighted the importance of data sharing among different models and tasks, a key factor in accelerating the experimentation process:

Basically, we divided the process into two parts. One part is what we call an experimentation platform. It is a set of processes and libraries to allow for fast experimentation. It is especially targeted at data sharing because it is difficult to curate datasets. The problem, early on, was that the datasets were usually curated for one task. However, we also need to reuse the same dataset for different purposes, hence the need for sharing data.

Ricard Borràs

MLOps Lead at Veriff

2. Production deployment: Veriff uses a combination of NVIDIA Triton Inference Server and Amazon SageMaker multi-model endpoints (MMEs) for model deployment. Ricard explained how this enables them to deploy models easily: They convert them to the Triton format and copy them to S3, from where SageMaker picks them up. The SageMaker MMEs provide auto-scaling and reduce operational overhead.
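The "convert to the Triton format and copy to S3" step can be sketched roughly as follows. This is a minimal illustration, not Veriff's actual code: it assumes the model has already been exported to a file Triton can load (an ONNX file here) and that the archive follows Triton's model repository layout of `<model_name>/config.pbtxt` plus `<model_name>/<version>/<model file>`. The function name `package_triton_model`, the model name, and the bucket in the comment are all hypothetical.

```python
import tarfile
from pathlib import Path


def package_triton_model(model_file: Path, config_file: Path,
                         model_name: str, out_dir: Path,
                         version: int = 1) -> Path:
    """Bundle one model into a tar.gz laid out like a Triton model repository.

    Resulting archive layout (names are illustrative):
        <model_name>/config.pbtxt
        <model_name>/<version>/<model file name>
    """
    archive_path = out_dir / f"{model_name}.tar.gz"
    with tarfile.open(archive_path, "w:gz") as tar:
        # Triton reads the model configuration from the model's top-level folder.
        tar.add(config_file, arcname=f"{model_name}/config.pbtxt")
        # Each numbered subfolder holds one version of the model artifact.
        tar.add(model_file, arcname=f"{model_name}/{version}/{model_file.name}")
    return archive_path


# Copying the archive to S3 could then look like this (assumes boto3 is
# installed and credentials are configured; bucket and prefix are made up):
#
#   import boto3
#   boto3.client("s3").upload_file(
#       str(archive_path), "my-mme-bucket", f"models/{archive_path.name}"
#   )
```

Keeping the packaging step separate from the upload makes it easy to inspect the archive locally before a multi-model endpoint ever loads it.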

Ricard’s team uses Metaflow to create automated model monitoring flows that compare live production data to the original training data weekly. This allows the data scientists in his team to log and analyze the live predictions for each model.
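The interview does not say which statistic these monitoring flows use to compare live data with training data. As one possible illustration, the sketch below computes the Population Stability Index (PSI), a common drift measure, for a single numeric feature. The choice of PSI, the bin count, and the function name are assumptions made for this example.

```python
import math
from typing import List, Sequence


def population_stability_index(reference: Sequence[float],
                               live: Sequence[float],
                               bins: int = 10,
                               eps: float = 1e-6) -> float:
    """Compare two samples of one feature; 0.0 means identical bin shares."""
    lo, hi = min(reference), max(reference)
    # Equal-width bins over the reference range; fall back to 1.0 if constant.
    width = (hi - lo) / bins or 1.0

    def bin_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range go into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), eps) for c in counts]

    ref_share = bin_shares(reference)
    live_share = bin_shares(live)
    return sum((l - r) * math.log(l / r)
               for r, l in zip(ref_share, live_share))


if __name__ == "__main__":
    training = [i / 100 for i in range(100)]        # stand-in training sample
    last_week = [0.5 + i / 200 for i in range(100)]  # stand-in live sample
    print(f"PSI: {population_stability_index(training, last_week):.3f}")
```

A weekly flow could compute such a score per feature and flag models whose inputs have shifted. A commonly cited rule of thumb treats PSI values above roughly 0.25 as a significant shift, but any threshold should be tuned per use case.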

(By the way, if you’re interested in diving deeper, Ricard and colleagues described this setup in more detail on the AWS Machine Learning Blog.)

Customization over convenience

When our discussion shifted to the pros and cons of E2E solutions more generally, Ricard stressed that they are often opaque, more expensive, and may need significant external support to navigate:

We have tried to use SageMaker, especially because we are on AWS. But the problem with SageMaker is that it’s super difficult to know how it works. The documentation is poor, and you need people from AWS to tell you how it works. Also, it’s more expensive because they charge a premium for the resources and the service that you manage through SageMaker compared to their regular prices.

Ricard Borràs

MLOps Lead at Veriff

In contrast, he found that using a combination of open-source tools such as Metaflow allows for greater customization and control over the deployment pipeline, catering specifically to the needs of the data science team without the overhead costs associated with fully managed services. Ricard particularly endorses Metaflow, praising its robust design and ease of use.

Ricard claims that this component-based approach has allowed Veriff to reduce model deployment time by 80% while cutting costs in half compared to their previous setup, which used Kubernetes to host each model as a microservice.

Practical recommendations

At the end of our interview, I challenged Ricard to summarize his stance on choosing between an E2E platform and a custom stack. While building a custom ML stack requires upfront investment, Ricard believes the long-term benefits of using flexible, open-source components outweigh the costs of opinionated SaaS platforms for most use cases.

Médéric Hurier: a balanced perspective on E2E ML platforms

Médéric Hurier, a freelance MLOps engineer currently working with Decathlon Digital, offered a nuanced perspective on using end-to-end (E2E) MLOps platforms rather than assembling a stack from individual components.

Médéric told me that over the past few years, he explored various MLOps platforms and earned certifications on GCP, Databricks, and Azure to compare their user experience and advise his customers.

The case for E2E platforms

Médéric believes that E2E platforms like Databricks, Azure ML, Vertex AI, and SageMaker offer a cohesive and integrated experience akin to “using Apple products but with the user experience of Linux.” These platforms bundle multiple tools and services, simplifying the setup and reducing the need for extensive infrastructure management.

However, Médéric pointed out to me that these platforms often have a steep learning curve, lock in users (vendor lock-in), and can be quite complex: 

SageMaker is a good tool, but for me, it’s a bit complex. It’s more of a tool made for engineers, not data scientists. Often, the people who love it the most are people who are the most technically skilled.

And it’s the same for most of these products: they only work well with other components in their ecosystem. This means that the more AWS services you use along with SageMaker, the better it becomes. But if you want to switch to another solution, it may not be easy.

Médéric Hurier

Senior MLOps Engineer

When asked to compare the cloud behemoths, Médéric highlighted SageMaker as a powerful but complex E2E platform, Azure ML as a smoother but less feature-rich option, and Vertex AI as the solid middle ground.

He also praised Databricks for having the simplest interface and being a more accessible platform for data scientists:

Databricks is the solution I recommend to my customers most often because it’s the simplest one to use. Our data scientists at Decathlon love that it combines data analytics, data engineering, and data science in one UI.

Médéric Hurier

Senior MLOps Engineer

The flexibility of open-source components

Throughout our conversation, Médéric emphasized the flexibility and control offered by using open-source components.

For companies with a strategic mindset and skilled engineers, Médéric suggested building their own MLOps platform by integrating open-source tools like MLflow, Argo Workflows, and Airflow. However, he acknowledged that this requires significant engineering resources and infrastructure expertise.

[Figure: Médéric’s proposal for a high-level architecture of an MLOps platform that decouples the components, data, and configuration. Modified based on: source]

As an alternative, managed platforms provide individual capabilities such as workflow orchestration. Médéric said that, in his experience, stitching together SaaS components works well for startups that need to move quickly. However, he pointed out that European data privacy restrictions can make sending data to an external provider challenging.

The hardest path is building an MLOps platform yourself from different components. This usually only makes sense for companies with a strategic mindset saying ‘We want to be completely independent. We’ll allocate a lot of engineers, and we’ll pay less for the platform in the long run.’

Médéric Hurier

Senior MLOps Engineer

Practical recommendations

Toward the end of our interview, I asked Médéric to make a recommendation for organizations starting with MLOps. Instead of listing a specific tech stack, he emphasized once more that evaluating a team’s specific needs and technical capabilities is paramount.

He believes smaller companies or those with less technical expertise benefit from the simplicity and integrated nature of E2E platforms. In contrast, Médéric shared that, in his experience, larger organizations with skilled engineering teams typically prefer the flexibility and cost savings of assembling their own MLOps stack from open-source components.

Overall, Médéric pointed out that there are many different scalability requirements—whether you need real-time online inference, batch processing, or the ability to scale your data team:

Deploying solutions at scale depends on your dimension. If you want to do batch inference and scale the whole data team, I’d say consider Databricks. If your workload involves online inference and generative AI, go with SageMaker.

If you are already using a cloud platform like GCP, give it a chance first instead of trying out other platforms simultaneously. They mostly have the same features. Adopting another cloud service usually makes no sense if you already have the service provider within your organization.

Médéric Hurier

Senior MLOps Engineer

To build a robust and user-friendly MLOps pipeline that can adapt to changing requirements and scale effectively, Médéric recommended involving end-users early in the process, explaining model results frequently, and iteratively deploying models to gather feedback and reduce rework.

Maria Vechtomova’s insight on using end-to-end ML platforms for MLOps

Maria Vechtomova is a tech lead and product manager for the MLOps framework at one of the world’s largest retailers, Ahold Delhaize. She brings a wealth of experience to the discussion on end-to-end MLOps platforms.

Maria has developed, deployed, and managed ML systems across multiple brands, which gives her a deep understanding of the intricacies of MLOps.

Choosing Databricks for E2E MLOps stack integration and maintenance

Reflecting on her experience, Maria emphasized the convenience of having an integrated solution like Databricks that covers multiple aspects of the MLOps lifecycle, including orchestration, model training, and serving:

We use Databricks Workflows for orchestration, Databricks Feature Store, MLflow for model registry and experiment tracking, Databricks Runtime for model training, serverless model endpoints, and Kubernetes for serving. We have custom Streamlit, Grafana, and even PowerBI monitoring dashboards.

Maria Vechtomova

MLOps Tech Lead and Product Manager

In Maria’s case, the organization’s prior adoption of Databricks impacted the decision to use an end-to-end platform. Given the team’s limited capacity to manage solutions, she said, opting for a managed product was the logical choice. Their decision ultimately came down to choosing between Azure ML native services and Databricks:

When there is feature parity between alternatives, we almost always prefer Databricks since we do not have to deploy any extra infrastructure and go through security approvals.

Maria Vechtomova

MLOps Tech Lead and Product Manager

Operational efficiency vs. customization when choosing E2E platforms

When I proposed that end-to-end platforms are the best way for teams to get up to speed quickly, Maria agreed that E2E solutions are appealing, especially when they offer a cohesive set of tools. However, she highlighted a core problem:

The main challenge of using end-to-end ML platforms for your MLOps stack is that nothing works exactly as you need. For example, Databricks has a certain definition of URL and payload to interact with model endpoints. Consumers of the APIs may have different needs. You may build some hacks to get it to work. This is way harder than when you have your own custom solution.

Maria Vechtomova

MLOps Tech Lead and Product Manager
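To make the payload mismatch Maria describes concrete, here is a small sketch of the kind of adapter a team might put between API consumers and a Databricks model serving endpoint. It assumes the endpoint is queried at `/serving-endpoints/<name>/invocations` with a `dataframe_records` request body and returns a `predictions` list. The consumer-side format (`items`, `id`, `features`, `results`, `score`), the function names, and the host and endpoint names are invented for illustration and do not come from the interview.

```python
from typing import Any, Dict, List, Tuple


def invocation_url(workspace_host: str, endpoint_name: str) -> str:
    """Build the invocation URL for a named serving endpoint."""
    return f"https://{workspace_host}/serving-endpoints/{endpoint_name}/invocations"


def to_databricks_request(
    consumer_payload: Dict[str, Any],
) -> Tuple[List[str], Dict[str, Any]]:
    """Translate the (hypothetical) consumer format into a request body.

    Returns the item ids separately so they can be re-attached afterwards,
    because only the feature records are sent to the model.
    """
    items = consumer_payload["items"]
    ids = [item["id"] for item in items]
    body = {"dataframe_records": [item["features"] for item in items]}
    return ids, body


def from_databricks_response(response_body: Dict[str, Any],
                             ids: List[str]) -> Dict[str, Any]:
    """Map the endpoint's predictions back onto the consumer's item ids."""
    predictions = response_body["predictions"]
    if len(predictions) != len(ids):
        raise ValueError("prediction count does not match request size")
    return {"results": [{"id": i, "score": p}
                        for i, p in zip(ids, predictions)]}


if __name__ == "__main__":
    payload = {"items": [{"id": "a1", "features": {"basket_size": 3}}]}
    ids, body = to_databricks_request(payload)
    print(invocation_url("example.cloud.databricks.com", "demo-endpoint"))
    print(body)
    # A fake response stands in for the real HTTP call in this sketch.
    print(from_databricks_response({"predictions": [0.42]}, ids))
```

The adapter itself is simple, but it is one more piece of code to own, test, and migrate whenever the platform changes its contract.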

While Maria was willing to concede that ML platforms have come a long way over the years, she pointed out that even comprehensive platforms like Databricks require periodic migrations and updates:

Over my career, I have built ML platforms four times all over again with different tools. I think ML platforms have experienced significant improvement over the years. It is important to remember that none of the platforms solve all your problems, and you would have to integrate them with existing tooling. 

Also, platforms change significantly over time (like Databricks with the Unity Catalog). You would have to do migrations every few years regardless of your choice.

Maria Vechtomova

MLOps Tech Lead and Product Manager

My key takeaways

In my interviews, Ricard, Médéric, and Maria all emphasized that an organization considering end-to-end MLOps platforms must carefully weigh its team’s specific needs and the tools it already uses.

While E2E platforms offer convenience and eliminate the need to manage individual components, they may not align with unique requirements. Thus, organizations must weigh the pros and cons, considering factors like team size, the infrastructure they already have, and the need for customization.

If the key to MLOps platform success is assessing requirements rather than picking a solution based on popularity or the number of features, where do we start? I recommend the excellent article on the AI/ML Platform Build vs. Buy Decision by my fellow writer, Luís Silva, which walks through the discovery and decision process in great detail. 

As an avid podcast fan, I’ve also learned much from listening to experienced MLOps engineers share their experiences building platforms on our MLOps Platform podcast.

By stp2y