Building MLOps Capabilities at GitLab as a One-Person ML Platform Team

Eduardo Bonet is an incubation engineer at GitLab, building out their MLOps capabilities.

One of the first features Eduardo implemented in this role was a diff for Jupyter Notebooks, bringing code reviews into the data science process.

Eduardo believes in an iterative, feedback-driven product development process, although he emphasizes that “minimum viable change” does not necessarily mean that there is an immediately visible value-add from the user’s point of view.

While LLMs are quickly gaining traction, Eduardo thinks they won’t replace ML or traditional software engineering but will add to their capabilities. Thus, he believes that GitLab’s current focus on MLOps – rather than LLMOps – is exactly right.

This article was originally an episode of the ML Platform Podcast. In this show, Piotr Niedźwiedź and Aurimas Griciūnas, together with ML platform professionals, discuss design choices, best practices, example tool stacks, and real-world learnings from some of the best ML platform professionals.

In this episode, Eduardo Bonet shares what he learned from building MLOps capabilities at GitLab as a one-person ML platform team. 

You can watch it on YouTube:

Or listen to it as a podcast on: 

But if you prefer a written version, here you have it! 

In this episode, you will learn about: 

  1. Code reviews in the data science job flow
  2. CI/CD pipelines vs. ML training pipelines
  3. The relationship between DevOps and MLOps
  4. Building a native experiment tracker at GitLab from scratch
  5. MLOps vs. LLMOps

Who is Eduardo? 

Aurimas: Hello everyone, and welcome to the Machine Learning Platform Podcast. As always, together with me is my co-host, Piotr Niedźwiedź, the CEO and founder of neptune.ai.

Today on the show, we have our guest, Eduardo Bonet. Eduardo, a staff incubation engineer at GitLab, is responsible for bringing all of the capabilities and goodies of MLOps to GitLab natively. 

Hi, Eduardo. Please share more about yourself with our audience.

Eduardo Bonet: Hello everyone, thanks for having me. 

I’m originally from Brazil, but I’ve lived in the Netherlands for six years. I have a somewhat unusual background: it’s control and automation engineering, but I’ve always worked in software development, just not always in the same area. 

I’ve been a backend, frontend, Android developer, data scientist, machine learning engineer, and now I’m an incubation engineer. 

I live in Amsterdam with my partner, my kid, and my dog. That’s the general gist.

What’s an incubation engineer? 

Piotr: Talking of your background, I’ve never heard of the term incubation engineer before. What is it about? 

Eduardo Bonet: The incubation department at GitLab consists of a handful of incubation engineers. It’s a group of people who try to explore or incubate new features or new markets in GitLab. 

We’re all engineers, so we are supposed to deliver code to the code base. We are supposed to find a group or a new persona that we want to bring into GitLab, talk to them, see what they want, introduce new features, and explore whether those features make sense in GitLab or not. 

It’s a very early stage of new feature development, hence the term incubation. Incubation engineers focus on moving from zero to eighty. At that point, we pass it on to a regular team, which, if it makes sense, takes it from eighty to ninety-five or to a hundred.

A day in the life of an incubation engineer

Aurimas: You’re a single-person team building out the MLOps capabilities, right?

Eduardo Bonet: Yes.

Aurimas: Can you give us a glimpse into your day-to-day? How do you manage to do all of that?

Eduardo Bonet: GitLab is great because I don’t have a lot of meetings—at least not internally. I spend most of my day coding and implementing features, and then I get in contact with customers either directly by scheduling calls with them or by reaching out to the community on Slack, LinkedIn, or physical meetups. I talk to them about what they want, what they need, and what the requirements are. 

One of the challenges is that the people I have to think about are not the current users of GitLab but the people who don’t use GitLab yet. Those are the ones that I’m building for. Those are the ones that I build features for, because the ones that are already using GitLab already use GitLab. 

Incubation is more about bringing new markets and people into GitLab’s ecosystem. Relying only on the customers we already have is not enough. I need to go out and look at users who want to use it or maybe have it available but don’t have reasons to use it.

Aurimas: But when it comes to, let’s say, new capabilities that you are building, you mentioned that you are communicating with customers, right?

I would guess these are organizations that develop regular software but would also like to use GitLab for machine learning. Or are you directly targeting customers who are not yet GitLab customers – let’s call them “probable users”? 

Eduardo Bonet: Yes, both of them. 

The easiest ones are customers who are already on GitLab and have a data science group in their company, but that data science group doesn’t find good reasons to use GitLab. I can approach them and see what they need, which makes it easier because they can start using it immediately. 

But there are also brand new users who have never had GitLab. They have a more data science-heavy workflow, and I’m trying to find out how they set up their MLOps cycle and how GitLab can be an option for them.

Aurimas: Was it easy to narrow down the capabilities that you’re going to build next? Let’s say you started at the very beginning. 

Eduardo Bonet: Yeah. 

In DevOps, you have the DevOps lifecycle, and I’m currently looking at the Dev part, which is everything up until a model is ready for deployment.

I started with code review. I implemented Jupyter Notebook diffs and code reviews for Jupyter Notebooks a while ago. Then, I implemented model experiments, which were released recently, and now I’m working on the model registry within GitLab. 

Once you have the model registry, there are some things that you can add, but right now, that’s the main one. Observability can be added later once you have the registry, so that’s more part of the Ops, but on the Dev side of things, this is what I’ve been looking at: 

  • Code reviews 
  • Model experiments 
  • Model registry
  • Pipelines

Aurimas: And these requests came straight from the users, I guess? 

Eduardo Bonet: It depends. I was a machine learning engineer and a data scientist before, so a lot of what I do is solving personal pain points.

I bring a lot of my experience into looking at what could be because I was a GitLab user before as a data scientist and as an engineer. So I could see what could be done with GitLab but also what I couldn’t do because the tooling was not there. So I bring that to the table, and I talk to a lot of customers.

In the past, customers have suggested features such as integrating MLflow, model experiments, and the model registry.

There are a lot of things to be done, and it’s hard to choose what to work on. At that point, I usually go with what I’m most excited about, because if I’m excited about something, I can build faster, and then I can build more.

Kickstarting a new initiative

Piotr: I have more questions on the organizational level.

It concerns something I’ve read in the GitLab Handbook. For those who don’t know what it is, it’s a kind of open-source, public wiki or a set of documents that describes how GitLab is organized. 

It’s a great source of inspiration for how to structure different aspects of a software company, from HR to engineering products.

There was a paragraph about how you start new things, like MLOps support or GitLab’s MLOps offering for the MLOps community. You’re an example of this policy.

On the one hand, they are starting lean. You’re a one-man show, right? But they put a super senior guy in charge of it. For me, it sounds like a smart thing to do, but it is surprising, and I think that I’ve made this mistake in the past when I wanted to start something new.

I wanted to start lean, so I put a more junior-level person in charge because it is about being lean. However, it was not necessarily successful because the problem was not sufficiently well-defined. 

Therefore, my question is, what are the hats you’re effectively wearing to run this? It sounds like an interdisciplinary project. 

Eduardo Bonet: There are many ways of kickstarting a new initiative within a company. Starting lean with incubation engineers is more for the risky stuff: things that we don’t really know whether they make sense or not, or that are more likely not to make sense than to make sense. 

In other cases, every team outside incubation can also kickstart its own initiatives. They have their own process for how to approach it. They have more people. They have UX support. They have a lot of different resources.

Our way is to have an idea, build it, ship it, and test it with users. The hats I usually have to wear are mostly:

  • Backend/frontend engineer – to deploy the features that I need
  • Product manager – to talk to customers, enter the process of deploying things at GitLab, understand the release cycle, manage everything around, and manage the process with other teams.
  • UX – there’s a little bit of UX, but I prefer to delegate it to actual UX researchers and designers. But for the early version, I usually build something instead of asking a UX or a designer to create a design. I build something and ask them to improve it.

Piotr: You also have this design system, Pajamas, right?

Eduardo Bonet: Yes, Pajamas helps a lot. At least you get the blocks going and moving, but you can still build something bad even if you have blocks. So I usually ask for UX support once there’s something aligned or something more tangible that they can look at. At this point, we can already ship to users as well, so the UX has feedback from users directly.

There’s also the data scientist hat, but it’s not really about delivering things. When I chat with customers, it’s really helpful that I was a data scientist and a machine learning engineer because then I can talk with them in more technical or more direct terms. Sometimes the users want to talk technical, sometimes they want to stay at a higher level, and sometimes they want to get right down into the details. So that’s very helpful.

On a day-to-day basis, the data science and machine learning hat is more for conversations and deciding what needs to be done than for hands-on work right now.

Piotr: Who would be the next person you would invite to your team to support you? If you can choose, what would be the position?

Eduardo Bonet: Right now, it would be a UX designer and then more engineers. That’s how it would grow a bit more.

Piotr: I’m asking this question because what you do is a kind of extreme hardcore version of an ML platform team, where the ML platform team is supposed to serve data science and ML teams within the organization. Still, you have a broader spectrum of teams to serve.

Eduardo Bonet: We now have both data science and machine learning teams within GitLab. I separate both because one helps the business make decisions, and the other uses machine learning and AI for product development. They are customers of what I build, so I have internal customers of what I build. 

But I built both so that we can use them internally, and external customers can, too. It’s great to have that direct dogfooding within the company. A lot of GitLab is built around dogfooding because we use our product for nearly everything.

Having them use the tooling as well, the model experiments, for example, was great. They were early users, and they gave me some feedback on what was working and what was not in Notebook diffs. So that’s great as well. It’s better to have them around.

Code reviews in the data science job flow

Aurimas: Are these machine learning teams using other third-party tools, or are they relying solely on what you have built?

Eduardo Bonet: No, what I’ve built is insufficient for a full MLOps lifecycle. The teams are using other tools as well.

Aurimas: I guess what you’re building will replace what they are currently using?

Eduardo Bonet: If what I built is better than that specific solution that they need, yes, then hopefully, they will replace it with what I built.

Aurimas: So you’ve been at it for around one and a half years, right?

Eduardo Bonet: Yes.

Aurimas: Could you describe the success of your projects? How do you measure them? What are the statistics?

Eduardo Bonet: I have internal metrics that I use, for example, for Jupyter Notebook diffs or code reviews. The initial hypothesis is that data scientists want to have code reviews, but they can’t because the tooling is not there, so we deployed code reviews. It was the first thing that I worked on. There was a huge spike in code reviews after the feature was deployed—even if I had to hack the implementation a bit. 

I implemented my own version of diffs for Jupyter Notebooks, and we saw a huge, sustained spike. There was a jump and then a sustained number of reviews and comments on Jupyter Notebooks. That means the hypothesis was correct. They wanted to do code reviews, but they just didn’t have any way to do it.

But we also rely on a lot of qualitative feedback because I’m not looking at our current users; I’m looking at new users coming in. For that, I use a lot of social media to get an idea of what users want or whether they like the features, and I also chat with other folks. 

It’s funny because I went to the pub with ex-colleagues, including a data scientist, and there was a bug in the Jupyter diffs. They almost made me take out my laptop to fix the bug right there, and I fixed it the next week. But I now see more data scientists coming in and asking for data science stuff in GitLab than before.

Aurimas: You mentioned code reviews. Do I understand correctly that you mean being able to display Jupyter Notebook diffs? That would then result in code reviews because previously, you couldn’t do that.

Eduardo Bonet: Yes.

Piotr: Is it done in the way of pull requests, or is it more about, “Okay, here is a Jupyter Notebook”? Because I see a few – let’s call them “jobs to be done” – around it.

For example, I’ve done something in a Jupyter Notebook, maybe some data exploration and model training within the notebook. I see results, and I want to get feedback, you know, on where to learn and where I should change something, like suggestions from colleagues. This is one use case that comes to my mind.

Second, and that’s something I have not seen, but maybe because this functionality was not available, is a pull request, a merge situation.

Eduardo Bonet: The focus was exactly on the merge request flow. When you push a change to Jupyter Notebook, and you create a merge request, you will see the diff of the Jupyter Notebook with the images displayed over there, in a simplified version.

I convert both Jupyter Notebooks to their markdown forms, do some cleanup because there’s a lot of stuff in there that’s not necessary (the goal is to maximize information and reduce noise), and then diff those markdown versions. Then, you can comment on and discuss the notebooks’ markdown versions. For the user, nothing changes: they push, and the diff is there.
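GitLab’s actual implementation lives in its Rails codebase, but the general approach Eduardo describes can be sketched in a few lines of Python: convert both notebook revisions to Markdown, strip noisy metadata such as execution counts, and diff the cleaned-up text. The file names below are hypothetical placeholders.

```python
# A minimal sketch of the general idea (not GitLab's actual implementation):
# convert two revisions of a notebook to Markdown, strip noisy metadata,
# and diff the cleaned-up Markdown versions.
import difflib

import nbformat
from nbconvert import MarkdownExporter


def notebook_to_markdown(path: str) -> str:
    """Load a notebook, drop execution counts, and export it as Markdown."""
    nb = nbformat.read(path, as_version=4)
    for cell in nb.cells:
        # Execution counts change on every run and only add diff noise.
        if cell.get("cell_type") == "code":
            cell["execution_count"] = None
    body, _resources = MarkdownExporter().from_notebook_node(nb)
    return body


def diff_notebooks(old_path: str, new_path: str) -> str:
    """Return a unified diff of the Markdown forms of two notebook revisions."""
    old_md = notebook_to_markdown(old_path).splitlines(keepends=True)
    new_md = notebook_to_markdown(new_path).splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(old_md, new_md, fromfile=old_path, tofile=new_path)
    )


if __name__ == "__main__":
    print(diff_notebooks("analysis_old.ipynb", "analysis_new.ipynb"))
```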

When I was in data science, this was not even primarily about the people who use notebooks for machine learning. Those are important, but it’s also about the data scientists who are more focused on the business cases. The final artifact of their work is usually a report, and the notebook is the final technical part of that report.

When I was a data scientist, we would review each other’s documents—the final reports—and there would be graphs and stuff, but nobody would see how those graphs were generated. For example, what was the code? What was the equation? Was there a missing plus sign somewhere that could completely flip the decision being made in the end? Not knowing that is very dangerous.

I would say that for this feature, the users who can get the most out of it are not the ones who only focus on machine learning but those who are more on the business side of data science.

Piotr: This makes sense. This concept of pull requests and code review in the context of reporting makes perfect sense to me. I was not sure about model building, for instance; I have not seen many pull requests there. Maybe if you have a shared feature engineering library, then yes. Pipelining, yes, but you wouldn’t necessarily build pipelines in notebooks – at least it wouldn’t be my recommendation – but yeah, it makes sense.

Aurimas: Even in machine learning, the experimentation environments benefit a lot from code review before you actually push your pipeline to production, right?

Eduardo Bonet: Yeah. 

And there’s another concept about code review that was important to me: code review is where code culture grows. It’s a kickstarter to create a culture of development, a shared culture of development among your peers.

Data scientists don’t have that. It’s not that they don’t want to; it’s that if they don’t do code reviews, they don’t talk about the code, and they don’t share what is common practice and what is not, what the mistakes are, and what the best practices are. 

For me, code review is less about correctness and more about mentoring and discussing what is being pushed. 

I hope that with Jupyter code reviews, along with the regular code reviews and all of the other things we have, we can bring this code review culture to data scientists, allowing them to develop this culture themselves by giving them the necessary tools.

Piotr: I really like what you said. I’ve been an engineer almost all my life, and code review is one of my favorite parts of it. 

If you’re working as a team—again, not about correctness but about discovering how something can be done simpler or differently—also make sure that a team understands each other’s code and that you have it covered so you don’t depend on one person.

It is not obvious how to make it part of the process when you’re working on models, for me at least, but I’m really seeing that we are missing something here as MLOps practitioners.

Eduardo Bonet: The second part that comes to this, to the merge request, is the model experiments themselves. I’m building that second part independently of merge requests, but eventually, ideally, it will be part of the merge request flow.

So when you push a change to a model, it already runs hyperparameter tuning on your CI/CD pipelines. The merge request already displays, along with the changes, the potential models and the potential performance of each one, so you can select which model to deploy. Those are your candidates; I call each one a candidate.

From the merge request, you can select which model will go into production or become a model version consumed later. That’s the second part of the merge requests that we’re looking at.
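As a rough illustration of the kind of job such a merge request pipeline could run (a hypothetical sketch, not GitLab’s implementation), the step below sweeps a small hyperparameter grid, scores each candidate with cross-validation, and writes a Markdown table that a CI step could attach to the merge request. The dataset, model, and output file are assumptions.

```python
# Hypothetical sketch of a CI job step: train a few candidate models across a
# small hyperparameter grid and write a Markdown summary for the merge request.
import json

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
grid = ParameterGrid({"n_estimators": [50, 100], "max_depth": [3, 6]})

candidates = []
for params in grid:
    model = RandomForestClassifier(random_state=0, **params)
    score = cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
    candidates.append({"params": params, "cv_accuracy": round(float(score), 4)})

# Sort so the best candidate appears first in the report.
candidates.sort(key=lambda c: c["cv_accuracy"], reverse=True)

with open("candidates.md", "w") as f:
    f.write("| candidate | params | cv_accuracy |\n|---|---|---|\n")
    for i, c in enumerate(candidates, start=1):
        f.write(f"| {i} | `{json.dumps(c['params'])}` | {c['cv_accuracy']} |\n")
```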

Piotr: So you’re saying this will also be part of the report: once there is a change, you will conduct hyperparameter optimization to determine the model’s potential quality after those changes, and you see that at the level of the merge request.

We have something like that, right? When we are working on the code, you will get a report from the tests, at least the unit tests. Yeah, it passed. The security test passed, okay. It looks good…

Eduardo Bonet: In the same way that you have this for software development, where you have security scanning, dependency scanning, and everything else, you will have the report of the candidates being generated for that merge request.

Then, you have a view of what changed. You can track down where the change came from and how it impacts the model or the experiment over time. Once the merge request is merged, you can deploy the model.

CI/CD pipelines vs. ML training pipelines

Aurimas: I have a question here. It’s about making your machine learning training pipeline part of your CI/CD pipeline. If I hear correctly, you’re treating them as the same thing, correct?

Eduardo Bonet: There are multiple pipelines that you can take a look at, and there are multiple tools that do pipelines. GitLab pipelines are more thought out for CI/CD, after the code is in the repository. Other tools, like Kubeflow or Airflow, are better at running any pipeline. 

A lot of our users use GitLab for CI/CD once the code is there, and then they trigger the pipeline. They use GitLab to orchestrate triggering pipelines on Kubeflow or whatever tool they are using, like Airflow or something—it’s usually one of the two. 

Some people also only use GitLab pipelines, which I used to do as well when I was a machine learning engineer. I was using GitLab pipelines, and then I worked on migrating to Kubeflow, and then I regretted it because my models were not that big for my use case. It was fine to run on the CI/CD pipeline, and I didn’t need to deploy a whole other set of tooling to handle my use case—it was just better to leave it at GitLab.
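A minimal sketch of that orchestration pattern, assuming a reachable Kubeflow Pipelines endpoint and a compiled pipeline definition stored in the repository (both placeholders), might look like the script below; the exact `kfp` client API depends on the SDK version. The `CI_COMMIT_SHA` variables are GitLab CI’s predefined variables, so the same script can run in a CI job.

```python
# Hypothetical sketch of a script a GitLab CI job might call to trigger a
# training pipeline on Kubeflow Pipelines instead of running it on the runner.
# The host URL and pipeline package path are placeholders.
import os

import kfp


def trigger_training_run() -> None:
    # e.g. KFP_HOST=https://kubeflow.example.com/pipeline (assumed env var name)
    client = kfp.Client(host=os.environ["KFP_HOST"])
    run = client.create_run_from_pipeline_package(
        "training_pipeline.yaml",  # compiled pipeline definition kept in the repo
        arguments={"git_commit": os.environ.get("CI_COMMIT_SHA", "local")},
        run_name=f"train-{os.environ.get('CI_COMMIT_SHORT_SHA', 'local')}",
    )
    print(f"Started Kubeflow run: {run.run_id}")


if __name__ == "__main__":
    trigger_training_run()
```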

We are working on improving the CI, our pipeline runner. In version 16.1, which is out now, we have runners with GPU support, so if you need GPUs, you can use GitLab runners. We need to improve other aspects to make CI better at handling the data science use case of pipelines, because those pipelines start earlier in the process than they usually do in software development in general.

Piotr: When you said GitLab runners support GPU now, or you can pick up one with GPU, we are, by the way, GitLab users as a company, but I was unaware of that, or maybe I misunderstood it. Do you also provide your customers with infrastructure, or are you a proxy over cloud providers? How does it work?

Eduardo Bonet: We provide those through a partnership. There are two types of GitLab users: self-managed, where you deploy your own GitLab, and users of the SaaS platform. Self-managed users have been able to use their own GPU runners for a while.

What was released in this new version is what we provide on gitlab.com. If you’re a user of the SaaS platform, you can now use GPU-enabled runners as well.

The relationship between DevOps and MLOps

Piotr: Thanks for explaining! I wanted to ask you about it because, maybe half a year or more ago, I shared a blog post on the MLOps community Slack about the relationship between MLOps and DevOps. I had a thesis that we should think of MLOps as an addition to the DevOps stack rather than an independent stack that is inspired by DevOps but independent. 

You’re in a DevOps company—at least, that’s how GitLab presents itself today—you have many DevOps customers, and you understand the processes there. At the same time, you have extensive experience in data science and ML and are running an MLOps initiative at GitLab. 

What, in your opinion, are we missing in a traditional DevOps stack to support MLOps processes? 

Eduardo Bonet: For me, there is no difference between MLOps and DevOps. They are the same thing. DevOps is the art of deploying useful software, and MLOps is the art of deploying useful software that includes machine learning features. That’s the difference between the two. 

As a DevOps company, we cannot fall into the trap of saying, “Okay, you can just use DevOps.” There are some use cases and specific features that are necessary for the MLOps workflow and are not present in traditional software development. That stems from the non-determinism of machine learning. 

When you write code, you have inputs and outputs. You know the logic because it was written down. You might not know the results, but the logic is there. In machine learning, you can define the logic for some models, but for most of them, you can only approximate the logic they learned from the data.

There’s the process of allowing the model to extract the patterns from the data, which is not present in traditional software development – so the models are like the developers. The models develop patterns from the input data to the output data. 

The other part is, “How do you know if it is doing what you’re supposed to be doing?” But to be fair, that is also present in DevOps—that’s why you do A-B testing and things like that on regular software. Even if you know what the change is, it doesn’t mean users will see it in the same way. 

You don’t know if it will be a better product if you deploy the change you have, so you do A/B testing, user testing, and tests, right? So that part is also present, but it’s even more important for machine learning because it’s the only way you know if it’s working.

With regular or traditional software, when you deploy a change, you can at least test whether the change is correct, even if you don’t know whether it moves the metrics or not. For machine learning, that is often the only kind of test you can implement, and those tests are non-deterministic. 

The regular testing stack that you use for software development doesn’t really apply to machine learning because, by definition, machine learning involves a lot of flaky tests. So, your way of determining whether it is correct will be in production. You can, at most, get a proxy for whether it works the way you intended, but you only really know at the production level. 

Machine learning puts stress on different places than traditional software development does. It includes everything that traditional software development has, but it adds new stresses in different areas. And to be fair, every way of developing software puts stress somewhere. 

For example, Android development puts its own stresses on how you develop and deploy; for instance, you cannot know which version the user is running. That problem is not unique to mobile development, but it is very apparent there. ML is the same: it has its own stresses that will require its own tooling.

Piotr: Let’s talk more about examples. Let’s say that we have a SaaS company that has not used machine learning, at least on the production level, so far, but they are very sophisticated or follow the best practices when it comes to software development.

So let’s say they have GitLab, a dedicated SRE, engineering, and DevOps teams. They are monitoring their software on production using, let’s say Splunk. (I’m building the tech stack on the fly here.)

They are about to release two models to production: first, a recommender system, and second, a chatbot for their documentation and SDK. There are two data science teams, but these ML teams are made up of data scientists, so they are not necessarily skilled in MLOps or DevOps. 

You have a DevOps team and you’re a CTO. What would you do here? Should the DevOps team support them in moving into production? Should we start by thinking about setting up an MLOps team? What would be your practical recommendation here?

Eduardo Bonet: My recommendation doesn’t matter very much, but I would probably start with the DevOps team supporting the data scientists and identifying the bottlenecks within that specific company that the existing DevOps path doesn’t cover. For example, retraining: to implement retraining, the DevOps team is probably the best one to work on it. They might not know exactly what retraining is, but they know how the infrastructure is set up; they know how everything works over there. 

If there is enough demand, the DevOps team might eventually be split, and part of it might become an ML platform team in itself. But if you don’t want to hire anyone, if you want to start lean, picking someone from the DevOps team to support your data scientists could be the best way to start.

Piotr: The GitLab customer list is quite large. But let’s talk about those you met personally. Have you seen DevOps engineers or DevOps teams successfully supporting ML teams? Do you see any common patterns? How do DevOps engineers work? What is the path for a DevOps engineer to get familiar with MLOps processes and be ready to be called an MLOps engineer?

Eduardo Bonet: It usually fails when one side does something and ships it to the other to do their thing. Let’s say the data scientist spends a few months building their model and then goes, “Oh, I have a model, deploy it.” That doesn’t work, really. They need to be involved early, but that’s true for software development as well. 

If you say that you’re developing something, some new feature, some new service, and then you deploy it, you make the entire service, and then you go to the DevOps team and say, “Okay, deploy this thing.” That doesn’t work. There are gonna be a lot of issues deploying that software. 

There’s a lot more stress in this when you talk about machine learning because fetching data can be slower, or there’s more processing, or the model can be heavy, so a pipeline can fail when the model is loaded into memory during a run. It’s better if they are involved in the process: not necessarily working on it, but attending the meetings and discussions, following the issues and threads, and giving insight early, so that when the model is at a stage where it can be deployed, it’s easier. 

But it’s also important that the model is not the first solution. Deploy something first, even if it’s a bad classical software solution that doesn’t perform as well, and then improve it. I see machine learning much more as an optimization in most cases rather than the first solution that you’ll employ to solve the problem.

I’ve seen it be successful. I’ve also seen data science teams trying to support themselves, succeeding and failing; DevOps teams succeeding and failing at supporting ML; and ML platform teams succeeding and failing at support. It will depend on the company culture and on the people in the group, but communication usually makes these problems at least a little bit smaller. Involve people early, not at the moment of deploying the thing.

End-to-end ML teams 

Aurimas: And what is your opinion about these end-to-end machine learning teams? Like fully self-service machine learning teams, can they manage the entire development and monitoring flow encapsulated in a single team? Because that’s what DevOps is about, right? Containing the flow of development in a single team.

Eduardo Bonet: I might not be the best person to ask because I’m biased, since I do end-to-end stuff. I like it. It reduces the number of hops you have to go through and the amount of communication lost from team to team. 

I like multidisciplinary teams, even product ones. You have your backend, your frontend, your PM, and everybody together, and then you build it and you ship it. It’s the mentality that you are responsible for your own DevOps, with a DevOps platform to build on. 

In my opinion, I prefer when they take ownership end to end, really going and saying, okay, we’re going to go from talking to the customer to understanding what they need. I like to see even the engineers talking to customers or to support, all of them deploying the feature, shipping it, measuring it, and iterating over it.

Aurimas: What would be the composition of this team, which would be able to deliver a machine learning product?

Eduardo Bonet: It will have its data scientist or a machine learning engineer. Nowadays, I prefer to start more on the software than on the data science part. A machine learning engineer would start with the software. Then, the data scientist eventually makes it even better. So start with the feature you’re building—front-end and back-end—and then add your machine learning engineer and data scientist.

You can also do a lot more with the DevOps part. The important part is to ship fast, to ship something, even if it’s bad in the beginning, and iterate off that something bad rather than trying to find something that is good and just applying it six months later. But at this point, you don’t even know if the users want that or not. You deploy that really nice model that no one cares about.

For us, smaller iterations are better. You tend to deploy better products by shipping small things faster rather than trying to get to the good product in one go, because your definition of good is only in your head. Your users have another definition of “good,” and you only learn their definition by putting things in front of them to use or test. And if you do it in small chunks, they can consume it better than if you just say, okay, there is this huge feature here for you, please test it.

Building a native experiment tracker at GitLab from scratch

Aurimas: I have some questions related to your work at GitLab. One of them is that you’re now building native capabilities in GitLab, including experiment tracking. I know that it’s implemented so it can be used via the MLflow client, but you manage all of the server side yourselves. 

How did you decide not to bring a third-party tool and rather build this natively?

Eduardo Bonet: I usually don’t do it because I don’t like re-implementing stuff myself, but GitLab caters to our self-managed customers, and GitLab is mostly a Rails monolith. The codebase is Rails, and it doesn’t use microservices. 

I could ship MLflow behind a feature flag, so that installing GitLab would also set up MLflow at the same time. But then I would have to handle how to install it in all the different places that GitLab is installed, which are a lot – I’ve seen an installation on a mainframe or something – and I don’t want to handle all those installations.

Second, I want to integrate across the platform. I don’t want model experiments to be just their own vertical feature. I want this to be integrated with the CI. I want this to be integrated with the merge request. I want it to be integrated with issues.

If the data is in the GitLab database, it’s much simpler to cross-reference all these things. For example, I deployed integration with the CI/CD last week. If you create your candidate or run through GitLab CI, you can pass a flag, and we’ll already connect the candidate to the CI job. You can see the merge request, the CI, the logs, and everything else. 

We want to be able to manage that on our side since it’s better for the users if we own this on the GitLab side. It does mean I had to strip out many of the MLflow server’s features, so, for example, there are no visualizations in GitLab yet. I’ll be adding them over time; this will come. I had to be able to deploy something useful first, and over time we’ll keep adding. But that’s the reasoning behind re-implementing the backend while still using the MLflow client.
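From the client side, this typically looks like ordinary MLflow code pointed at the GitLab project. The sketch below follows GitLab’s documented setup around the 16.x releases; the endpoint path, token variable, and project ID are placeholders, so check the current documentation for exact details.

```python
# Sketch of using the standard MLflow client against GitLab's experiment
# tracking backend. URL path and token handling are based on GitLab's
# documented setup around 16.x; verify against the current docs.
import os

import mlflow

# Point the MLflow client at the GitLab project instead of an MLflow server.
os.environ["MLFLOW_TRACKING_URI"] = (
    "https://gitlab.example.com/api/v4/projects/<project_id>/ml/mlflow"
)
# A token with API access; GITLAB_ACCESS_TOKEN is an assumed env var name.
os.environ["MLFLOW_TRACKING_TOKEN"] = os.environ["GITLAB_ACCESS_TOKEN"]

mlflow.set_experiment("demand-forecasting")

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("max_depth", 6)
    mlflow.log_metric("val_rmse", 12.3)
```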

Piotr: As part of the iterative process, is this what you’re calling “minimal viable change”?

Eduardo Bonet: Yeah, it’s even a little bit below that minimum, because now that it’s available, users can tell me what’s needed for it to become the minimum that is useful for them.

Piotr: As a product team, we are inspired by GitLab. Recently, I was asked whether a minimal viable change that would bring value was too big to be done in one sprint. I think it was something around webhooks and setting some foundations for webhooks: a system that can retry the call if the system receiving the call is down.

The challenge was about providing something, some foundation, that would bring value to the end user. But how would you do it at GitLab? 

For instance, to bring value to the user, you need to set up a kind of backend and implement something in the backend that wouldn’t be exposed to the user. Would it fit into a sprint at GitLab?

Eduardo Bonet: It does. A lot of what I did was not visible or directly useful on its own. I spent five months working on these model experiments until I could say I could onboard the first user, and that was not dogfooding. So it was five months. 

I still had to find ways of getting feedback, for example, with the videos I share now and then to discuss progress. Even if it’s just discussing the direction or the vision, you can feel whether people want that vision, or whether there are better ways to achieve it. But it’s work that has to be done, even if it’s not visible.

Not all work will be visible. Even if you go iterative, you can still do work that is not visible, but it needs to be done. So, I had to refactor how packages are handled in our model experiments and our experiment tracking. That’s more of a change that would make my life easier over time than the user’s life, but it was still necessary.

Piotr: So there is no silver bullet because we are struggling with this type of approach and are super curious about how you do it. At first glance, it sounds, for me at least, that every change has to bring some value to the user. 

Eduardo Bonet: I don’t think every change has to bring value to the user because then you fall into traps. This puts some major stresses on your decision-making, such as biases toward short-term things that need to be delivered, and it pushes things like code quality, for example, away from that line of thinking. 

But both ways of thinking are necessary. You cannot use only one. If you only apply “minimum viable change” all the time, all you end up with is a pile of minimum viable changes. That’s not what users really want. They want a product. They want something tangible. Otherwise, there’s no product. That’s why software engineering is challenging.

MLOps vs. LLMOps

Piotr: We are recording this in 2023, so it would be strange not to ask about it. We’re asking because everybody is asking questions about large language models. 

I’m not talking about the impact on humanity, even though those are all fair questions. Let’s talk more tactically, from your perspective and current understanding, about how businesses can use large language models, foundational models, in production.

What similarities do you see to MLOps? What will stay the same? What is completely unnecessary? What types of “jobs to be done” are missing in the more traditional MLOps stack?

So let’s do kind of a diff. We have an MLOps stack. We had DevOps and added MLOps, right? There was a diff, we discussed that. Now, we are also adding large language models to the picture. What is the diff?

Eduardo Bonet: There are two components here. When you’re talking about LLMOps, you can think of the prompt plus the large model that you’re using as the model itself; the conjunction of the two is the model. From there on, it behaves very much like a regular machine learning model, so you’re going to need the same observability levels and the same things that you need to take care of when deploying it in production.

On the create side, though, only now are we seeing prompts being treated as their own artifacts that you need to version, discuss, and provide the right ways of changing, measuring, and exploring. They will behave differently with different models, and any change to the prompt can change the behavior.

I’ve seen some companies start to implement a prompt registry where the product manager can go and change the prompt without needing the backend or frontend to go into the codebase. That is one of them, and that’s an early one. Right now, at least, you only have a prompt that you probably populate with data, meta prompts, or second-layer prompts. 

But the next level is prompt-generating prompts, and we haven’t explored that level yet. There’s another whole level of Ops that we don’t know about yet. So, how can you manage prompts that generate prompts or take flags? For example, I can have a prompt where I pass an option that appends something to the prompt: be short, be small, be concise, for example. 

Prompts will become their own programming language, and functions are defined as prompts. You pass arguments that are prompts themselves to these functions.
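A purely illustrative sketch of that idea, with a hypothetical prompt registry, version tags, and modifier flags (none of this is an existing product or API):

```python
# Illustrative sketch only: treating prompts as versioned, composable functions.
# The registry, version tags, and modifier flags are hypothetical.
PROMPT_REGISTRY: dict[str, str] = {
    "summarize@v1": "Summarize the following text for a technical audience:\n{text}",
    "summarize@v2": "Summarize the following text in plain language:\n{text}",
}

MODIFIERS: dict[str, str] = {
    "concise": "Be short and concise.",
    "bulleted": "Answer as a bulleted list.",
}


def build_prompt(name: str, *, flags: list[str] | None = None, **kwargs: str) -> str:
    """Look up a versioned prompt template, fill it in, and append modifier flags."""
    prompt = PROMPT_REGISTRY[name].format(**kwargs)
    for flag in flags or []:
        prompt += "\n" + MODIFIERS[flag]
    return prompt


def meta_prompt(task_description: str) -> str:
    """A 'prompt-generating prompt': its output is itself a prompt for a second call."""
    return (
        "Write a prompt that instructs a model to do the following task well:\n"
        + task_description
    )


if __name__ == "__main__":
    print(build_prompt("summarize@v2", flags=["concise"], text="<document text>"))
    print(meta_prompt("classify support tickets by urgency"))
```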

How do you manage the agents in your stack? How do you manage versions of agents, what they’re doing right now, and their impact in the end when you have five or six different agents interacting? There are a lot of challenges that we have yet to learn about because it’s so early in the process. It’s been two months since this became usable as a product, so it’s very, very early.

Piotr: I just wanted to add the observation that in most use cases, if it’s in production, there is a human in the loop. Sometimes, the human in the loop is a customer, right?  Especially if we are talking about the chat type of experience.

But I’m curious to see use cases of foundational models in the context where humans are not available, like predictive maintenance, demand prediction, and predatory scoring—things that you would like to truly automate without having humans in the loop. How will it behave? How would we be able to test and validate those – I’m not even sure whether we should call them models, prompts, or agent configuration.

Another question I’m curious to hear your thoughts on: How will we, and if yes, how will we connect foundational models with more classical deep learning and machine learning models? Will it be connected via agents or differently? Or not at all?

Eduardo Bonet: I think it will be through agents because agents are a very broad abstraction. You can include anything as an agent. So it’s really easy to say agents because, well, you can do that with agents, or policies, or whatever.

But that’s how you provide more context. For example, search is very complicated when you have too many labels that you cannot encode in a prompt. You need an easy way of finding things, even a dumb way of running a query, so you give your agent a tool (some also call this a “tool”). You give your agent tools. 

This can be as simple as running a query or as complicated as calling an API that makes predictions. The agent will learn to pass the right parameters to this API. You’ll still use generative AI because you’re not coding the whole pipeline, but for some parts it makes sense, even if you already have something working.

Perhaps it’s better if you split off some deterministic chunks where you know what the output of that specific tool is, and give your agent access to it.
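Here is an illustrative sketch of that tool-giving pattern, with hypothetical tool names and a hard-coded tool call standing in for what the LLM would normally produce:

```python
# Illustrative sketch of the "give your agent tools" idea: deterministic chunks
# (a label lookup, a call to an existing predictive API) are wrapped as named
# tools, and the agent only chooses which tool to call and with which arguments.
# The tool names, functions, and dispatch format are hypothetical.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    name: str
    description: str
    run: Callable[[dict], str]


def search_labels(args: dict) -> str:
    # Placeholder for a deterministic lookup, e.g. a query over label metadata.
    return f"labels matching '{args.get('query', '')}'"


def churn_score(args: dict) -> str:
    # Placeholder for a call to an existing (non-LLM) predictive model API.
    return f"churn probability for customer {args.get('customer_id')}: 0.42"


TOOLS = {
    t.name: t
    for t in [
        Tool("search_labels", "Find labels matching a text query.", search_labels),
        Tool("churn_score", "Predict churn probability for a customer.", churn_score),
    ]
}


def dispatch(tool_call: dict) -> str:
    """Execute the tool the agent asked for, e.g. {'tool': 'churn_score', 'args': {...}}."""
    tool = TOOLS[tool_call["tool"]]
    return tool.run(tool_call.get("args", {}))


if __name__ == "__main__":
    # In practice the LLM would produce this structured call; here it is hard-coded.
    print(dispatch({"tool": "churn_score", "args": {"customer_id": "C-123"}}))
```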

Piotr: So, my last question—I would play devil’s advocate here—is: Maybe GitLab should skip the MLOps part and just focus on the LLMOps part. It’s going to be a bigger market. Will we need MLOps when we use large language models? Does it make sense to invest in it?

Eduardo Bonet: I think so. We’re still learning the boundaries of when to apply classic ML, and every kind of model has its own places where it’s better and where it’s not. LLMs are also part of this.

There will be cases where regular ML is better. For example, you might first deploy your feature with LLM, then improve the software, and then improve with machine learning, so ML becomes the third level of optimization.

I don’t think LLMs will kill ML. Nothing kills anything. People have been saying that Ruby, COBOL, and Java will die, or that decision trees would be dead because now we have neural networks. Even if it’s just to keep things simple, sometimes you don’t want those more complicated models. You want something that you can control, where you know what the input and the output are.

MLOps is a better focus for now, at the beginning, until we start learning what LLMOps is, because we have a better understanding of how MLOps fits into GitLab itself. But it’s something we are thinking about, like how to use it, because we’re also using LLMs internally. 

We are dogfooding our own problems with how to deploy AI-backed features. We are learning from it, and yes, those could become a product eventually. Prompt management could become a product eventually, but at this point, even for us to handle our own models, the model registry is more of a concern than a prompt writer or whatever.

Aurimas: Eduardo, it was really nice talking with you. Do you have anything that you would like to share with our listeners?

Eduardo Bonet: The model experiments we’ve been discussing are available to our users as of GitLab 16.0. I’ll leave a link to the documentation if you want to test it out. 

If you want to follow what I do, I usually post a short YouTube video about my advancements every two weeks or so. There’s also a playlist that you can follow. 

If you’re in Amsterdam, drop by the MLOps community meetup we organize.

Aurimas: Thank you, Eduardo. I’m super glad to have had you here. And also thank you to everyone who was listening. And see you in the next episode.
