ML/AI platforms provide the ecosystem for building, deploying, and managing the lifecycle of machine-learning models and AI services.
There is no one-size-fits-all approach to implementing an ML/AI platform:
Building an in-house platform can be fast for specific use cases and help maximize return on investment quickly.
Buying a platform or platform components saves internal development effort, but it can be a long bureaucratic process and difficult to assess third-party components.
Most organizations ultimately decide on a buy-and-build solution, where third-party components and tools are integrated into a custom platform.
An ML/AI platform provides a coherent collection of tools and frameworks to build, deploy, and manage machine learning (ML) models and AI services. Sooner rather than later, teams and organizations looking to streamline their ML workflows and bring AI-driven products to market more quickly face the question: Should they build a custom platform or buy one?
Given the steep rise in interest in AI technology, the big cloud companies are investing heavily in ML/AI platforms as an integral part of their product landscape. At the same time, there is a wide range of specialized tools and a thriving open-source ecosystem.
The AI landscape is vast and complex, making it difficult to navigate the market. Further, each company is unique in terms of the kind of data used, the core business model, compliance standards, development processes, deployment infrastructure, and the technology stack.
Over the past decade, I’ve helped my teams build ML/AI platforms from scratch and architected platforms integrating different cloud vendors and SaaS solutions. I’ve participated in countless vendor assessments and have recently led an initiative to replace an in-house solution with a SaaS tool.
In this article, I’ll summarize all that I have learned to help you figure out what to think about when deciding whether to buy or build an ML/AI platform.
Build vs buy: benefits and drawbacks
Despite what marketing will have you believe, you won’t find a ready-to-use ML/AI platform off the shelf. In my experience, you will always need to combine multiple tools to arrive at a solution that fits your needs.
Thus, “buying” an ML/AI platform means acquiring several different products, even if one of them is an “end-to-end solution.” For the purpose of this article, I’m defining “buying” as whenever you spend money purchasing or licensing a tool. Otherwise, I’ll consider it “building.”
When faced with the “build vs buy” decision, at first, the options can seem straightforward:
- Buy: It is fast to start producing value and cheaper in total, as no upfront development is needed. However, third-party products might be less flexible in the long run.
- Build: Costly and slow to get off the ground, but a perfect match with the business needs.
Unfortunately, it’s not that simple. Buying an ML/AI platform can be slow, bureaucratic, and technically demanding to test. Conversely, building an ML/AI platform can be quick for experienced teams and offer a good return on investment right away.
The distinction between “build” and “buy” is also not as clear as it seems at first glance: Usually, you already have an existing codebase and internal processes that you need to move to the new platform. And even an end-to-end SaaS solution requires configuration and customization.
Where does open-source software (OSS) fit here?
Many popular ML/AI tools are open-source, and it’s highly unlikely that any team building an ML/AI platform will not use some open-source component.
Open-source tools are appealing because you can acquire them without licensing fees and start using them immediately. However, even though you don’t pay for open-source software, it’s still a third-party component you don’t control. While it’s theoretically possible to create an in-house fork and fully adapt a tool to your needs, I’ve rarely seen this happen in reality.
I like to include open-source software in the “build” category, as it necessarily entails doing your own integration and maintenance work. Widely used open-source projects often come with documentation and a helpful volunteer community. However, there is usually no support available beyond that. Thus, the work required to adopt an OSS tool into your platform does not differ in many ways from integrating a component you have written from scratch.
When it comes to open-source tools and platforms (rather than libraries and frameworks), many of them are backed by businesses that offer paid support and/or a SaaS version. This allows you to delegate the effort to host and maintain a tool to a third party later on.
You can also consider starting with a managed service to get running quickly and later switching to a self-hosted version if your requirements exceed what the managed service lets you customize. In line with my definitions for “build” and “buy” above, I’ll categorize using a SaaS version of an OSS tool as “buy.”
Benefits of building a custom ML/AI platform
- Customization: You can tailor the platform to meet specific organizational needs and standards, which is particularly relevant when your data is not in a standard modality or format. For example, several tools are biased toward tabular data.
- Efficiency: You can optimize the platform to streamline development and deployment. This might involve specialized algorithms, specific pipelines, or particular requirements for model deployment.
- Integration: You can seamlessly integrate with existing systems and workflows since you control the APIs and adapters. This can be crucial if your company upholds high standards of security and compliance.
- Governance and compliance: You can enforce governance policies and ensure compliance. Requirements can come from data privacy regulations (GDPR and HIPAA) and security standards like SOC2, as well as from within your organization.
Benefits of buying an ML/AI platform
- Cost: Buying an ML/AI platform eliminates the need for upfront investments in development work and infrastructure. It also reduces ongoing maintenance and management, which consume significant resources and often require new personnel.
- Time: Buying or licensing an ML/AI platform can be quick. You can gain a serious competitive advantage by shortening the time-to-market of your AI services.
- Expertise and support: Licensing plans for ML/AI platforms often include access to dedicated support and experts who help with implementation, troubleshooting, and updates.
- Focus on core competencies: By purchasing an ML/AI platform, you can focus your resources on your core business. This is especially true for organizations where AI technology is not the product but a means to an end.
Drawbacks of building a custom ML/AI platform
- Complexity: Developing and maintaining a platform is complex and resource-intensive. It requires an in-house team of specialists from several areas, such as data engineering, ML engineering, security, procurement, and operations.
- Cost: Designing, developing, and deploying a custom platform is a heavy investment. The costs of hosting, maintaining, and evolving a platform rarely remain constant but tend to increase over time.
- Performance, scalability, and versatility: Designing and developing high-scale solutions that can grow with your business is very complex. Even when your team has prior experience building ML/AI platforms, it is easy to underestimate the range of expertise and effort required.
- Risk of overengineering and scope creep: It is easy to incorporate unnecessary features or complexity that delivers little or no value. Adding more and more features can delay the delivery and slow future development, reducing utility and increasing costs.
- Security and privacy: When building an internal platform, you must deal with all the complexity of this topic yourself. You must understand and implement different standards imposed by customers, domain-specific regulations, and the law.
Drawbacks of buying an ML/AI platform
- Vendor lock-in: Buying an ML/AI platform makes you dependent on that particular vendor, restricts flexibility, and impedes delivery velocity. Migrating off a core component of your platform later on can be highly complex, especially if you cannot afford production downtime.
- Customization and integration limitations: Many commercial ML/AI platforms offer an extensive range of built-in features. Yet, they may not align with your needs, and their constraints will require compromises. Further, operating a platform where you only use a fraction of the features can ultimately be worse than working around the lack of one specific capability.
- Data privacy and security risks: Entrusting data to a third-party vendor carries inherent data privacy and safety risks. Enforcing your own standards can prove challenging or downright impossible. A typical way out is installing a SaaS solution in your own cloud infrastructure, which gives you more flexibility to meet your needs but brings with it many of the drawbacks of the “build” option.
- Limited control over technology and roadmap: You won’t have control over a vendor’s underlying technology stack, roadmap, or priorities. This can impact your organization’s future strategic decisions and innovation.
Factors to consider in the build vs buy decision-making process
Investing in an ML/AI platform is a major decision for any company. Whether you build or buy it, the decision entails a multi-year commitment.
Therefore, it is crucial to comprehend all factors to make an informed and sound decision. In the following, I’ll guide you through key areas to investigate and gain clarity on.
Technical expertise and organizational maturity
Building software is always challenging. This platitude is particularly true in the case of an ML/AI platform. Beyond managing a code base, you’ll also need to manage the data layer and models.
While traditional software only changes behavior when code changes, machine-learning models behave differently when input data changes. You will also likely be working with GPUs and huge container images. The stochastic nature of ML models requires new approaches to testing, monitoring, and observability.
You might also find yourself in an organization with proficient data science teams that can handle the intricacies I just outlined but struggle with the basics of software, infrastructure, and data engineering. That said, building software can be a net positive for small companies, startups, or new teams, as it offers an opportunity to level up on transferable skills and establish workflows.
Here are some questions to ask when exploring this area:
- Organizational maturity: How far are you on your ML/AI journey? Do you have a clear roadmap of what features and capabilities you will require in a year? Five years? The next decade?
- Skills and experience: Do you have sufficient software and infrastructure engineering expertise within your team? Do you need to hire for additional skills, and how long will it take? Do you have established DevOps practices? Is there an existing team that could handle platform operations?
- Non-functional requirements: Do you understand what it takes to integrate or build a platform in your current infrastructure? What about control management, versioning, tenant isolation, and scalability? Do you have someone who can assess, design, and/or implement it?
- Investing in expertise: Which specialties do you need to hire for? Is it realistic to currently find these skills in your labor market? Is it worth training your current team on a particular skill? Will it pay off for the organization to acquire competency in a certain area?
Costs
The costs of buying or building and maintaining an ML/AI platform are certainly top of mind for many decision-makers.
Building software from scratch is often expensive, as we have to factor in infrastructure, purchasing or licensing components, and salaries.
Infrastructure costs typically are the ongoing costs for cloud resources, which can be lowered through long-term contracts. Some organizations choose to invest in on-premise infrastructure or place their own hardware into a third-party data center.
When assessing the costs of buying a component or tool, it’s crucial to remember just how expensive the salaries of an internal team to build and run the platform would be. Companies that have not operated a large-scale platform are often surprised to discover what it costs to have people on call 24/7 to fulfill service-level agreements (SLAs).
You should also consider hiring times, opportunity costs associated with a later platform go-live, maintenance efforts, and getting the team up to speed on a new technology stack.
Here are some cost categories and questions you should consider:
- Salaries and consulting fees: Do you have a forecast of your hiring needs? Do you need more senior or junior folks? Do you need to hire locally or globally? What salaries do you have to budget for? Will you need to work with recruiters? Can you consider a contract-to-hire approach to minimize risks and speed up development?
- Licensing and subscription fees: How will the cost change as more users are onboarded to a tool and the number of managed models increases? Will additional revenue offset the cost increase of adding more customers to the platform?
- Storage capacity and cost: What data do you need to store as part of your platform (e.g., performance metrics, experiment metadata, datasets, model files, epoch snapshots, containers)? Do you need to store all that information forever, or can you implement retention policies? How much data will you store per day, month, or year going forward? Can you make use of cold storage?
- Technical debt: Are you clear where and when you will take on technical debt, particularly when building your own solution? Do you have a way to track, value, and prioritize tech debt? Have you factored in the costs of potentially re-architecting the platform due to unforeseen requirements?
- CI/CD and development tools: Do you have a reliable estimate of the CI/CD costs? For example, what will it cost to regularly check containers for vulnerabilities and run integration tests on each PR? How expensive will dev tools, GPUs, and container or package registries be?
- Infrastructure cost: Do you know what infrastructure you’ll need? Do you have experience monitoring cloud costs? Will your cloud or platform provider be able to provision enough resources? Does the SaaS product you’re looking into allow you to purchase additional capacity?
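To make some of these questions tangible, here is a minimal sketch of how per-seat licensing, salaries, and storage with a retention policy could be projected over a few years. Every name and number in it (the `BuyScenario` and `BuildScenario` classes, the fees, the cost per engineer, the storage price) is a hypothetical placeholder that I made up for illustration, not a benchmark. Replace them with your own quotes and forecasts.

```python
from dataclasses import dataclass


@dataclass
class BuyScenario:
    """Hypothetical SaaS pricing: flat platform fee plus a per-seat fee."""
    platform_fee_per_year: float
    seat_fee_per_year: float
    integration_engineers: float  # FTEs still needed for integration work


@dataclass
class BuildScenario:
    """Hypothetical in-house platform: team salaries plus infrastructure."""
    engineers: float  # FTEs building and operating the platform
    infra_cost_per_year: float


def cumulative_buy_cost(scenario, users_per_year, cost_per_engineer):
    """Running total of licensing and integration costs, one entry per year."""
    total, totals = 0.0, []
    for users in users_per_year:
        total += scenario.platform_fee_per_year
        total += users * scenario.seat_fee_per_year
        total += scenario.integration_engineers * cost_per_engineer
        totals.append(total)
    return totals


def cumulative_build_cost(scenario, years, cost_per_engineer):
    """Running total of salaries and infrastructure, one entry per year."""
    yearly = scenario.engineers * cost_per_engineer + scenario.infra_cost_per_year
    return [yearly * (year + 1) for year in range(years)]


def yearly_storage_cost(daily_tb, retention_days, year, price_per_tb_month):
    """Approximate storage cost for one year (1-based), sampled at month ends.

    Assumes a constant daily ingest and that data older than
    `retention_days` is deleted.
    """
    total = 0.0
    for month in range(1, 13):
        day = (year - 1) * 365 + month * 30
        stored_tb = daily_tb * min(day, retention_days)
        total += stored_tb * price_per_tb_month
    return total


if __name__ == "__main__":
    buy = BuyScenario(platform_fee_per_year=50_000, seat_fee_per_year=1_200,
                      integration_engineers=1.0)
    build = BuildScenario(engineers=3.0, infra_cost_per_year=60_000)
    print("buy:  ", cumulative_buy_cost(buy, [20, 60, 150], 150_000))
    print("build:", cumulative_build_cost(build, 3, 150_000))
    print("storage, 90-day retention:", yearly_storage_cost(0.1, 90, 1, 20.0))
    print("storage, keep everything: ", yearly_storage_cost(0.1, 10**9, 1, 20.0))
```

The model is deliberately crude: it ignores hiring delays, discounts from long-term contracts, and on-call costs. Its value lies in forcing you to write down your assumptions about user growth and data volume so that you can revisit them later.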
Time to market
There is an opportunity cost associated with spending time building your own ML/AI platform. While doing it, you are not allocating internal resources to areas that provide a more immediate competitive advantage. Building an in-house solution can detract from what’s essential for your organization.
Conversely, purchasing software or licensing a platform can be time-consuming. Zeroing in on possible options, thoroughly assessing them, and negotiating terms can stretch over several months. If your organization is subject to compliance or regulatory requirements, vetting third-party tools can easily take up to a year.
Whether you’re building or buying the ML/AI platform, you need to consider the duration of the recruitment process and onboarding time. Hiring AI talent tends to be a slow process in today’s volatile labor market. You’ll also likely need a significant amount of time to get your existing team up to speed on the new platform or tech stack.
Here are some questions to get you started thinking about time-to-market:
- Available team capacity: Do you have enough talent internally (quantity and quality)? Can you hire quickly enough to fill current skill gaps and increase your team’s capacity? Where can you bring in external consultants or contractors to augment your team in the short term?
- Requirements: Do you need a managed solution, or does your team prefer to install and operate the product? Which non-functional requirements are essential for your use case? Do you have a defined baseline with which to compare a new solution?
- Procurement process: Do you have people in your organization who know how to conduct a vendor assessment? Which stakeholders have to be involved, and when? Can you start using an OSS version of a tool while you’re negotiating a SaaS contract?
- Development process: Who will lead and steer the platform development? Does your organization have a proven track record of delivering complex software projects on time? Is your team working in an established development process, or do you need to set up development teams from scratch?
Maintenance and customization needs
Maintenance and operations are significant resource drains, especially when building your own ML/AI platform.
You must keep a roadmap and plan for new features. You also need to provide support for existing functionalities, which in some cases requires a dedicated team. Since there is no third party or community around a custom solution, you’ll need to write and maintain all documentation internally.
While focusing on a build-or-buy decision, it’s easy to forget to plan to support the platform’s end users. Neglecting this aspect can quickly become a source of friction or lead to a platform not being utilized properly.
Another consideration is the level of customization required to meet organizational requirements. A custom solution offers greater flexibility in this regard, whereas buying software may offer built-in features with limited customization options.
Here are some questions to get you started:
- Customization: What customizations do you need for a given tool? Are the customizations possible as part of the standard offering, or do you need to negotiate an addition to the contract? Can you implement customizations in-house, or do you need to pay the vendor or a third-party consultant?
- Platform product management: Who will steer platform product development in the long run? What is the process to place, evaluate, and decide on feature requests? How will you work with a vendor to keep up with product updates and ensure you are not surprised by breaking changes?
- Support for end users: How will you onboard data scientists and ML engineers to the platform? Who is responsible for hosting workshops or maintaining educational material? How will you solicit feedback? Do you need to embed platform specialists with other teams?
Vendor reputation and target market
Understanding a vendor’s business and outlook is crucial. Small vendors may have less stable roadmaps and are likelier to disappear or pivot. However, they might be more willing to work with you on specific needs. Large vendors offer more long-term reliability but often provide generic solutions that need extensive customization. They can also be slow to deliver new features if a product is not their flagship offering.
Assessing a tool’s or platform’s customer base, ecosystem, and community is equally vital. Generally, it’s preferable if you’re a vendor’s target customer or the core audience of an OSS project. This ensures the vendor is prepared to work with businesses of your size and maturity. It also makes it more likely that future changes to a product will fall in line with your needs.
I recommend finding a balance and moving forward with that. It’s unlikely you will find a perfect fit for your use case. Here are some questions to guide your assessment:
- Long-term perspective: What does the product’s roadmap look like? Can you be sure that the tool or platform will be available for the years to come? Will the platform grow with your needs and future requirements? What role does the platform you’re looking for play in a vendor’s portfolio?
- Other users: Who else is using the platform or tool? Are they comparable to your organization and team? How large is the user base of the OSS component you’re considering? What are users saying in online communities or other online places? Are there current or former users in your personal network? Can you attend a conference or a meetup to get first-hand insights?
- Open-source project governance: Is the OSS project backed by a business or foundation? Does the project rest on the shoulders of a sufficiently large number of maintainers? Are security issues and bugs appropriately handled? What is the process for suggesting or contributing new features?
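One way to find that balance is to turn the answers to these questions into a simple weighted scoring matrix. The following is a minimal sketch, not a formal assessment method. The criteria names, weights, ratings, and candidate names are all hypothetical and only illustrate the mechanics.

```python
def weighted_score(weights, ratings):
    """Weighted average of one candidate's ratings (e.g., on a 1 to 5 scale)."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[c] * ratings[c] for c in weights) / total_weight


def rank_candidates(weights, candidates):
    """Return (name, score) pairs, best score first."""
    scores = {name: weighted_score(weights, ratings)
              for name, ratings in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical criteria derived from the questions above.
    weights = {"roadmap": 3, "comparable_users": 2, "project_governance": 1}
    candidates = {
        "vendor_a": {"roadmap": 4, "comparable_users": 3, "project_governance": 5},
        "oss_tool_b": {"roadmap": 5, "comparable_users": 2, "project_governance": 2},
    }
    for name, score in rank_candidates(weights, candidates):
        print(f"{name}: {score:.2f}")
```

The resulting number matters less than the discussion about the weights. Agreeing on them with your stakeholders before you rate any candidate helps keep the assessment honest.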
Compliance, privacy, and governance
In my experience, more technically minded folks often underestimate this factor. However, implementing compliance requirements can drain your entire delivery capacity if not planned for in advance.
It usually requires a company-wide effort to achieve compliance with standards and regulations. If done properly, it can open new opportunities and even entire markets as customers seek offers that meet their requirements.
Compliance entails following good software development practices within your team and platform. Typically, specific standards and processes are integrated into the platform to ensure enforcement. Many of these standards adhere to security-by-design principles, such as minimizing attack surface area, the principle of least privilege, and the separation of concerns.
If you are subject to strict regulations, there’s likely no way around hosting the platform on your own (cloud) infrastructure. This gives you full control over your data and lets you enforce any additional standards.
Here are some questions you should know the answer to before deciding on an ML/AI platform:
- Regulations: Are you subject to regulations like FedRAMP, SOC 2, HIPAA, or GDPR? What exact requirements do you need to fulfill on a technical and organizational level?
- On-premise deployment: Are you restricted to self-hosting the platform? Can a platform be installed on your own cloud infrastructure, e.g., in a dedicated VPC? Does a third-party platform support custom registries for containers and packages?
- Data privacy and governance: What demands are imposed by regulations or customers? Are you allowed to store your customers’ data on infrastructure hosted by a third party? Do you need to control which data can be used for which models? Do you need to track how data is used in models and downstream?
- Application security: Do you need capabilities to deal with model hallucinations, prompt injection, or malicious user input? What measures and processes do you need to implement to prevent, detect, and mitigate security issues? What guarantees can a vendor or OSS project give?
- Technical measures: Do you need to enforce security standards at the CI level? Do you need to scan your container images regularly? Do you need to have internal base images?
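To illustrate what a technical measure at the CI level can look like, here is a minimal sketch of a check that flags Dockerfiles whose base images do not come from an approved internal registry. The registry name `registry.internal.example/` is a hypothetical placeholder, and the parsing is intentionally simple: base images passed in through build arguments (for example, `FROM ${BASE_IMAGE}`) are reported as violations because the script cannot resolve them.

```python
import re

# Matches "FROM [--platform=...] <image> [AS <stage>]" instructions.
FROM_PATTERN = re.compile(
    r"^\s*FROM\s+(?:--platform=\S+\s+)?(\S+)(?:\s+AS\s+(\S+))?",
    re.IGNORECASE,
)


def find_disallowed_base_images(dockerfile_text, allowed_prefixes):
    """Return base images that do not start with any of the allowed prefixes."""
    stage_names = set()
    violations = []
    for line in dockerfile_text.splitlines():
        match = FROM_PATTERN.match(line)
        if not match:
            continue
        image, alias = match.group(1), match.group(2)
        # "FROM <stage>" can refer to an earlier build stage, and "scratch"
        # is the empty image, so neither counts as an external base image.
        is_internal_ref = image.lower() in stage_names or image == "scratch"
        if not is_internal_ref and not image.startswith(tuple(allowed_prefixes)):
            violations.append(image)
        if alias:
            stage_names.add(alias.lower())
    return violations


if __name__ == "__main__":
    example = """
FROM registry.internal.example/base/python:3.11 AS builder
RUN pip install .
FROM builder AS test
FROM docker.io/library/python:3.11
"""
    bad = find_disallowed_base_images(example, ["registry.internal.example/"])
    for image in bad:
        print(f"disallowed base image: {image}")
    raise SystemExit(1 if bad else 0)
```

Run as a pipeline step, a non-zero exit code would fail the build. A check like this complements, but does not replace, regular vulnerability scans of the resulting container images.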
My recommendation
In my experience, irrespective of your initial approach, you will converge to a buy-and-build solution. Usually, organizations looking for an ML/AI platform already have an existing codebase and established internal workflows. Thus, you will always need to integrate with the existing organizational processes and infrastructure.
Adapting and customizing an ML/AI platform to meet internal requirements is always necessary and will be an ongoing process, no matter if you buy a platform or build it from scratch. The need for these changes typically comes from specific data ingestion or deployment requirements.
Cost savings will soon become a top priority. As your organization expands, you will feel pressure from management to keep your cloud bill and salaries in check. In this situation, it pays off if your initial cost estimates were well-founded and you can clearly demonstrate the platform’s value to the business.
If I had to summarize everything I’ve learned over the years into a single sentence, my recommendation is to start as small as possible and scale up as needed. Despite your best efforts, your initial budget estimates and understanding of the requirements will be incomplete.
If you quickly get a basic platform up and running, you can learn from your users’ feedback and correct the course without wasting time and money. To do so, you must iterate and get to market quickly. Focusing on a full-featured platform too early is counterproductive to this. Instead, prioritize shipping core features for business-critical use cases.
Conclusion
The decision to build or buy an ML/AI platform is significant and will impact your company’s capacity to deliver for years to come. By embarking on the process clear-eyed and carefully evaluating various factors, you can make an informed decision that will likely turn out to be the right one.
As I’ve argued, a buy-and-build approach represents the best strategy. Following this philosophy, you can either start by building software tailored to your immediate needs or buying software with minimal investment.
The key to long-term success is never to consider a platform “done.” Instead, be open to adapting to new internal processes, compliance needs, and the ever-changing landscape of MLOps tools.
Feel free to reach out to me on LinkedIn. I’m more than happy to respond to any questions or feedback you might have.