TL;DR: with data-intensive architectures, there often comes a pivotal point where building in-house data platforms makes more sense than buying off-the-shelf solutions.
The Mystical Pivot Point
Buying off-the-shelf data platforms is a popular choice for startups looking to accelerate their business, especially in the early stages. However, is it true that companies that have bought will never need to pivot to building, as service providers promise? There are reasonable arguments on both sides:
- Need to Pivot: The cost of buying will eventually exceed the cost of building, as the cost grows faster when you buy.
- No need to Pivot: The platform’s requirements will continue to evolve and increase the cost of building, so buying will always be cheaper.
It is quite a puzzle, yet few articles have discussed it. In this post, we will delve into the topic, analyzing three dynamics that strengthen the case for building and two strategies to consider when deciding to pivot.
| Dynamics | Pivot Strategies |
| --- | --- |
| – Growth of Technical Credit<br>– Shift of Customer Persona<br>– Misaligned Priority | – Cost-Based Pivoting<br>– Value-Based Pivoting |
Growth of Technical Credit
It all began outside the scope of the data platform. Whether you want it or not, to improve the efficiency of your operations, your company needs to build up Technical Credit at three different levels. Whether you realise it or not, that credit will gradually make building easier for you.
What is technical credit? Check out this article published in ACM.
Those three levels of Technical Credits are:
| Technical Credits | Key Purposes |
| --- | --- |
| Cluster Orchestration | Enhance efficiency in managing multi-flavor Kubernetes clusters. |
| Container Orchestration | Enhance efficiency in managing microservices and open-source stacks. |
| Function Orchestration | Enhance efficiency by setting up an internal FaaS (Function as a Service) that abstracts all infrastructure details away. |
For cluster orchestration, there are typically three different flavors of Kubernetes clusters.
- Clusters for microservices
- Clusters for streaming services
- Clusters for batch processing
Each of them requires a different provisioning strategy, especially in network design and auto-scaling. Check out this post for an overview of the network design differences.
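To make the differences concrete, here is a minimal sketch of how provisioning parameters might diverge across the three flavors. The field names and values are hypothetical, not a recommendation:

```python
from dataclasses import dataclass

# Hypothetical provisioning knobs that tend to differ per cluster flavor.
@dataclass
class ClusterFlavor:
    name: str
    pod_cidr_prefix: int   # pod IP block size per node; streaming/batch pods churn faster
    autoscaling: str       # steady horizontal scaling vs. bursty scale-to-zero
    node_lifetime: str     # long-lived on-demand nodes vs. spot/preemptible nodes

FLAVORS = [
    ClusterFlavor("microservices", pod_cidr_prefix=26, autoscaling="steady-horizontal", node_lifetime="on-demand"),
    ClusterFlavor("streaming",     pod_cidr_prefix=25, autoscaling="steady-horizontal", node_lifetime="on-demand"),
    ClusterFlavor("batch",         pod_cidr_prefix=24, autoscaling="burst-to-zero",     node_lifetime="spot/preemptible"),
]
```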

For container orchestration efficiency, one possible way to accelerate is by extending the Kubernetes cluster with custom resource definitions (CRDs). In this post, I shared how kubebuilder works, along with a few examples built with it, e.g., an in-house DS platform built on CRDs.
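As an illustration of the self-serve experience a CRD enables, here is a hedged sketch using the official Kubernetes Python client. The group, kind, and spec fields are hypothetical and would be defined by your own operator:

```python
from kubernetes import client, config

# Hypothetical: an operator has installed a "Notebook" CRD for an in-house DS platform.
# Users self-serve by submitting a custom resource; the operator reconciles the rest.
config.load_kube_config()
api = client.CustomObjectsApi()

notebook = {
    "apiVersion": "dsplatform.example.com/v1alpha1",
    "kind": "Notebook",
    "metadata": {"name": "churn-analysis", "namespace": "data-science"},
    "spec": {"image": "jupyter/scipy-notebook", "cpu": "2", "memory": "8Gi"},
}

api.create_namespaced_custom_object(
    group="dsplatform.example.com",
    version="v1alpha1",
    namespace="data-science",
    plural="notebooks",
    body=notebook,
)
```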

Function orchestration efficiency requires a combination of an SDK and the underlying infrastructure. Many organisations use scaffolding tools to generate code skeletons for microservices. With this inversion of control, the user's task is simply to fill in the REST API's handler body.
In this post on Towards Data Science, most services in the MLOps journey are built using FaaS. For model-serving services in particular, machine learning engineers only need to fill in a few essential functions that handle feature loading, transformation, and request routing.
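The sketch below shows a hypothetical scaffold in that spirit (not the actual code from the post): the platform owns the entry point, while the engineer only fills in the functions for feature loading, transformation, and prediction.

```python
from typing import Any

# Hypothetical skeleton generated by an internal FaaS scaffolding tool.
# The engineer fills in the three functions; the platform owns everything else
# (routing, configuration, monitoring, scaling).

def load_features(request: dict[str, Any]) -> dict[str, Any]:
    # TODO: fetch features from the feature store for this request
    raise NotImplementedError

def transform(features: dict[str, Any]) -> list[float]:
    # TODO: turn raw features into the model's input vector
    raise NotImplementedError

def predict(vector: list[float]) -> dict[str, Any]:
    # TODO: call the model and shape the response payload
    raise NotImplementedError

def handler(request: dict[str, Any]) -> dict[str, Any]:
    # Platform-owned entry point, wired into the FaaS runtime by the scaffold.
    return predict(transform(load_features(request)))
```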

The following table summarises the key user journey and area of control for each level of Technical Credit.
| Technical Credits | Key User Journey | Area of Control |
| --- | --- | --- |
| Cluster Orchestration | Self-serve on creating multi-flavour K8s clusters. | – Policy for Region, Zone, and IP CIDR Assignment<br>– Network Peering<br>– Policy for Instance Provisioning<br>– Security & OS Hardening<br>– Terraform Modules and CI/CD Pipelines |
| Container Orchestration | Self-serve on service deployment, open-source stack deployment, and CRD building. | – GitOps for Cluster Resource Releases<br>– Policy for Ingress Creation<br>– Policy for Custom Resource Definitions<br>– Policy for Cluster Auto Scaling<br>– Policy for Metric Collection and Monitoring<br>– Cost Tracking |
| Function Orchestration | Focus solely on implementing business logic by filling pre-defined function skeletons. | – Identity and Permission Control<br>– Configuration Management<br>– Internal State Checkpointing<br>– Scheduling & Migration<br>– Service Discovery<br>– Health Monitoring |
With the growth of Technical Credit, the cost of building declines.

However, transferability differs across the levels of Technical Credit: from bottom to top, it becomes less and less transferable. You will be able to enforce consistent infrastructure management and reuse microservices, but it is hard to reuse the credit built for a FaaS in one domain across other domains. Furthermore, declining building costs do not mean you need to rebuild everything yourself. For a complete build-vs-buy trade-off analysis, two more factors play a part:
- Shift of Customer Persona
- Misaligned Priority
Shift of Customer Persona
As your company grows, you will soon realize that the persona distribution for data platforms is shifting.

When you are small, the majority of your users are Data Scientists and Data Analysts. They explore data, validate ideas, and generate metrics. However, when more data-centric product features are released, engineers begin to write Spark jobs to back their online services and ML models. Those data pipelines are first-class citizens, just like microservices. Such a persona shift makes a fully GitOps data pipeline development journey acceptable, and even welcome.
Misaligned Priority
There will be misalignments between SaaS providers and you, simply because everyone needs to act in the best interest of their own company. These misalignments appear minor at first but can gradually worsen over time. The potential misalignments are:
| Priority | SaaS Provider | You |
| --- | --- | --- |
| Feature Prioritisation | Benefit of the majority of customers | Benefit of your organisation |
| Cost | Secondary impact (potential customer churn) | Direct impact (need to pay more) |
| System Integration | Standard interface | Customisable integration |
| Resource Pooling | Share between their tenants | Share across your internal systems |
For resource pooling, data systems are ideal for co-locating with online systems, as their workloads typically peak at different times. Most of the time, online systems experience peak usage during the day, whereas data platforms peak at night. With higher commitments to your cloud provider, the benefits of resource pooling become more significant. Especially when you purchase yearly reserved instance quotas, combining both online and offline workload gives you stronger bargaining power. SaaS providers, however, will prioritise pivoting to serverless architecture to enable resource pooling among their customers, thereby improving their profit margin.
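A toy calculation (with made-up numbers) shows why pooling the two workloads lowers the capacity you need to commit to:

```python
# Illustrative hourly peak demand in vCPUs; the numbers are invented for the example.
online_peak_day, online_peak_night = 1000, 400   # online systems peak during the day
batch_peak_day, batch_peak_night = 200, 800      # data platforms peak at night

# Reserved separately, you commit to each workload's own peak.
separate_commitment = max(online_peak_day, online_peak_night) + max(batch_peak_day, batch_peak_night)

# Pooled on shared capacity, you commit to the combined peak, which is lower
# because the two peaks do not coincide.
pooled_commitment = max(online_peak_day + batch_peak_day, online_peak_night + batch_peak_night)

print(separate_commitment, pooled_commitment)  # 1800 vs. 1200 in this toy example
```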
Pivot! Pivot! Pivot?
Even with the cost of building declining and the misalignments rising, building will never be an easy option. It requires domain expertise and long-term investment. However, the good news is that you don’t have to perform a complete switch. There are compelling reasons to adopt a hybrid approach or step-by-step pivoting, maximizing the return on investment from both buying and building. There are two possible ways forward:
- Cost-Based Pivoting
- Value-Based Pivoting
Disclaimer: this is my own perspective. It offers some general principles, and you are encouraged to do your own research for validation.
Approach One: Cost-Based Pivoting
The 80/20 rule applies well to Spark jobs, too. 80% of Spark jobs run in production, while the remaining 20% are submitted by users from the dev/sandbox environment. Among the 80% in production, 80% are small and straightforward, while the remaining 20% are large and complex. A premium Spark engine distinguishes itself mostly on the large and complex jobs.
Want to understand why Databricks Photon performs well on complex Spark jobs? Check out this post by Huong.
Additionally, sandbox or development environments require stronger data governance controls and data discoverability capabilities, both of which require quite complex systems. In contrast, the production environment is more focused on GitOps control, which is easier to build with existing offerings from the Cloud and the open-source community.

If you can build a cost-based dynamic routing system, such as a multi-armed bandit, to route less complex Spark jobs to a more affordable in-house platform, you can potentially save a significant amount of cost (a minimal routing sketch follows the prerequisites below). There are, however, two prerequisites:
- Platform-agnostic Artifact: A platform like Databricks may have its own SDK or notebook notation that is specific to the Databricks ecosystem. To achieve dynamic routing, you must enforce standards to create platform-agnostic artifacts that can run on different platforms. This practice is crucial to prevent vendor lock-in in the long term.
- Patching Missing Components (e.g., Hive Metastore): It is an anti-pattern to have two duplicated systems side by side, but it can be necessary when you pivot to build. For example, open-source Spark cannot leverage Databricks’ Unity Catalog to its full capability. Therefore, you may need to develop a catalog service, such as a Hive metastore, for your in-house platform.
Please also note that a small proportion of complex jobs may account for a large portion of your bill. Therefore, conducting thorough research for your case is required.
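As a rough sketch of what cost-based routing could look like, here is an epsilon-greedy policy of my own illustration; the platform names and the cost signal are hypothetical, not a reference implementation:

```python
import random

PLATFORMS = ["premium_saas", "in_house_spark"]
EPSILON = 0.1  # fraction of simple jobs used for exploration
stats = {p: {"runs": 0, "total_cost": 0.0} for p in PLATFORMS}

def avg_cost(platform: str) -> float:
    s = stats[platform]
    return s["total_cost"] / s["runs"] if s["runs"] else 0.0

def choose_platform(job_complexity: str) -> str:
    # Keep large and complex jobs on the premium engine, where it earns its price.
    if job_complexity == "complex":
        return "premium_saas"
    # Occasionally explore; otherwise exploit the cheapest platform observed so far.
    if random.random() < EPSILON:
        return random.choice(PLATFORMS)
    return min(PLATFORMS, key=avg_cost)

def record_run(platform: str, cost: float) -> None:
    # Feed actual billing data back into the router after each job finishes.
    stats[platform]["runs"] += 1
    stats[platform]["total_cost"] += cost
```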
Approach Two: Value-Based Pivoting
The second pivot approach is based on how the data pipeline generates value for your company.
- Operational: Data as Product is the value
- Analytical: Insight is the value
This breakdown is inspired by the article MLOps: Continuous delivery and automation pipelines in machine learning, which brings up an important concept called experimental-operational symmetry.

We classify our data pipelines in two dimensions:
- Based on the complexity of the artifact, they are classified into low-code, scripting, and high-code pipelines.
- Based on the value they generate, they are classified into operational and analytical pipelines.
High-code and operational pipelines require staging->production symmetry for rigorous code review and validation. Scripting and analytical pipelines require dev->staging symmetry for fast development velocity. When an analytical pipeline carries an important analytical insight and needs to be democratized, it should be transitioned to an operational pipeline with code reviews, as the health of this pipeline will become critical to many others.
The total symmetry, dev -> stg -> prd, is not recommended for scripting and high-code artifacts.
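The sketch below captures my reading of this classification; the mapping is illustrative rather than prescriptive:

```python
from enum import Enum

class Complexity(Enum):
    LOW_CODE = "low-code"
    SCRIPTING = "scripting"
    HIGH_CODE = "high-code"

class Value(Enum):
    OPERATIONAL = "operational"  # data as product
    ANALYTICAL = "analytical"    # insight as value

def required_symmetry(complexity: Complexity, value: Value) -> str:
    # High-code operational pipelines: rigorous review and validation before production.
    if value is Value.OPERATIONAL and complexity is Complexity.HIGH_CODE:
        return "staging -> production"
    # Scripting analytical pipelines: optimise for development velocity.
    if value is Value.ANALYTICAL and complexity is Complexity.SCRIPTING:
        return "dev -> staging"
    # Full dev -> stg -> prd symmetry is discouraged for scripting and high-code artifacts.
    return "decide case by case"
```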
Let’s examine the operational principles and key requirements of these different pipelines.
| Pipeline Type | Operational Principle | Key Requirements of the Platform |
| --- | --- | --- |
| Data as Product (Operational) | Strict GitOps, rollback on failure | Stability & close internal integration |
| Insight as Value (Analytical) | Fast iteration, rollover on failure | User experience & developer velocity |
Because of these different ways of yielding value and the different operational principles, you can:
- Pivot Operational Pipelines: Since internal integration is more critical for the operational pipeline, it makes more sense to pivot those to in-house platforms first.
- Pivot Low-Code Pipelines: Low-code pipelines can also be switched over easily, since their simplicity makes migration between platforms straightforward.
At Last
To pivot or not to pivot is not an easy call. In summary, these are practices you should adopt regardless of the decision you make:
- Pay attention to the growth of your internal technical credit, and refresh your evaluation of total cost of ownership.
- Promote Platform-Agnostic Artifacts to avoid vendor lock-in.
Of course, when you indeed need to pivot, have a thorough strategy. How does AI change our evaluation here?
- AI makes prompt->high-code possible. It dramatically accelerates the development of both operational and analytical pipelines. To keep up with the trend, you might buy such capabilities, or build them if you are confident.
- AI demands higher quality from data. Ensuring data quality will be more critical for both in-house platforms and SaaS providers.
Here are my thoughts on this rarely discussed topic: pivoting from buy to build. Let me know what you think. Cheers!