While many small and medium companies achieve success in building Data and ML platforms, building AI platforms is now profoundly challenging. This post discusses three key reasons why you should be cautious about building AI platforms and shares my thoughts on more promising directions instead.

Disclaimer: This post is based on personal views and does not apply to cloud providers or data/ML SaaS companies; they should instead double down on research into AI platforms.

Where I Am Coming From

In my previous article, From Data Platform to ML Platform on Towards Data Science, I shared how a data platform evolves into an ML platform. This journey applies to most small and medium-sized companies. However, there is not yet a clear path for small and medium-sized companies to continue developing their platforms into AI platforms. When leveling up to an AI platform, the path forks into two directions:

  • AI Infrastructure: The “new electricity” (AI inference) is more efficient when centrally generated. It is a game for big tech and large model providers.
  • AI Application Platform: You cannot build a “beach house” (AI platform) on constantly shifting ground. Evolving AI capabilities and emerging development paradigms make lasting standardization hard to find.

However, there are still directions that are likely to remain important even as AI models continue to evolve. They are covered at the end of this post.

High Barrier of AI Infrastructure

While Databricks may be only several times better than your own Spark jobs, DeepSeek could be 100x more efficient than you at LLM inference. Training and serving an LLM requires significantly more investment in infrastructure and, just as importantly, control over the model's structure.

Image Generated by OpenAI ChatGPT 4o

In this series, I briefly shared the infrastructure needed for LLM training, which includes parallel training strategies, topology designs, and training accelerations. On the hardware side, besides high-performance GPUs and TPUs, a significant portion of the cost goes to networking setup and high-performance storage services. Clusters require an additional RDMA network to enable non-blocking, point-to-point connections for data exchange between instances. The orchestration services must support complex job scheduling, failover strategies, hardware issue detection, and GPU resource abstraction and pooling. The training SDK needs to facilitate asynchronous checkpointing, data processing, and model quantization.
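To make one of these pieces concrete, here is a minimal sketch of asynchronous checkpointing, assuming PyTorch; the path and trigger logic are illustrative, not a reference design. The idea is to copy the weights to CPU memory on the training thread and write them to disk in a background thread, so the GPUs keep training while the file is written.

```python
import threading
import torch


def async_checkpoint(model, path):
    """Save a checkpoint without blocking the training loop.

    The state dict is copied to CPU on the caller's thread (cheap compared
    to a disk write), then persisted in a background thread so GPU work
    can continue while the file is being written.
    """
    cpu_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}

    def _write():
        torch.save(cpu_state, path)

    writer = threading.Thread(target=_write, daemon=True)
    writer.start()
    return writer  # join() before exiting to make sure the write completes


# Illustrative usage inside a training loop:
# if step % 1000 == 0:
#     async_checkpoint(model, f"/checkpoints/step_{step}.pt")
```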

Regarding model serving, model providers often optimize for inference efficiency during the model development stage. They likely have better quantization strategies, which produce the same model quality at a significantly smaller model size. They are also likely to develop better model parallelism strategies thanks to the control they have over the model structure, which allows larger batch sizes during LLM inference and effectively increases GPU utilization. Additionally, large LLM players have logistical advantages that give them access to cheaper routers, mainframes, and GPU chips. More importantly, stronger control over model structure and better model parallelism mean model providers can leverage cheaper GPU devices. For model consumers relying on open-source models, GPU deprecation could be a bigger concern.

Take DeepSeek R1 as an example. Say you use a p5e.48xlarge AWS instance, which provides 8 NVLink-connected H200 chips, at about $35 per hour. Assuming you do as well as NVIDIA and achieve 151 tokens/second, generating 1 million output tokens will cost you about $64 (1,000,000 / (151 × 3600) × $35). How much does DeepSeek charge per million output tokens? Only $2. DeepSeek can achieve roughly 60 times the efficiency of your cloud deployment (assuming a 50% margin for DeepSeek).
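A quick back-of-the-envelope script makes the comparison explicit. The figures below are simply the assumptions stated above (the $35/hour instance price, 151 tokens/second, and DeepSeek's roughly $2 per million output tokens); plug in your own numbers.

```python
# Back-of-the-envelope: self-hosting DeepSeek R1 vs. calling the API.
# All numbers are assumptions from the text above; adjust to your setup.
instance_cost_per_hour = 35.0      # p5e.48xlarge (8x H200) on-demand, USD
throughput_tokens_per_sec = 151    # assumed aggregate output throughput
api_price_per_million = 2.0        # listed API price per 1M output tokens, USD

tokens = 1_000_000
hours_needed = tokens / throughput_tokens_per_sec / 3600
self_hosted_cost = hours_needed * instance_cost_per_hour

print(f"Self-hosted cost per 1M tokens: ${self_hosted_cost:.0f}")  # ~$64
print(f"API cost per 1M tokens: ${api_price_per_million:.0f}")
# ~32x at list price; ~64x if the provider runs at a 50% margin
print(f"Ratio: {self_hosted_cost / api_price_per_million:.0f}x")
```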

So, LLM inference power is indeed like electricity. The analogy reflects the diversity of applications that LLMs can power; it also implies that it is most efficient when centrally generated. Nevertheless, you should still self-host LLM services for privacy-sensitive use cases, just as hospitals keep their own generators for emergencies.

Constantly Shifting Ground

Investing in AI infrastructure is a bold game, and building lightweight platforms for AI applications comes with its own hidden pitfalls. With the rapid evolution of AI model capabilities, no aligned paradigm for AI applications has emerged, so there is no solid foundation on which to build an AI platform.

Image Generated by OpenAI ChatGPT 4o

The simple answer to that is: be patient.

If we take a holistic view of data and ML platforms, development paradigms emerge only when the capabilities of algorithms converge.
| Domain | Algorithm Emerges | Solutions Emerge | Big Platforms Emerge |
| --- | --- | --- | --- |
| Data Platform | 2004: MapReduce (Google) | 2010–2015: Spark, Flink, Presto, Kafka | 2020–Now: Databricks, Snowflake |
| ML Platform | 2012: ImageNet (AlexNet, CNN breakthrough) | 2015–2017: TensorFlow, PyTorch, Scikit-learn | 2018–Now: SageMaker, MLflow, Kubeflow, Databricks ML |
| AI Platform | 2017: Transformers (“Attention Is All You Need”) | 2020–2022: ChatGPT, Claude, Gemini, DeepSeek | 2023–Now: ?? |

After several years of fierce competition, a few large model players remain standing in the arena. However, the evolution of AI capabilities has not yet converged. As AI models' capabilities advance, the existing development paradigm will quickly become obsolete. Big players have only just started taking their stab at agent development platforms, and new solutions are popping up like popcorn in an oven. Winners will eventually appear, I believe. For now, building their own agent standardization is a tricky call for small and medium-sized companies.

Path Dependency of Old Success

Another challenge of building an AI platform is rather subtle. It concerns the mindset of platform builders: whether they carry path dependency from their previous success in building data and ML platforms.

Image Generated by OpenAI ChatGPT 4o

As previously shared, the data and ML development paradigms have been well aligned since 2017, and the most critical task for the ML platform is standardization and abstraction. However, the development paradigm for AI applications is not yet established. If a team follows the previous success story of building a data and ML platform, it might end up prioritizing standardization at the wrong time. Possible directions are:

  • Build an AI Model Gateway: Provide centralised audit and logging of requests to LLM models (a minimal sketch follows this list).
  • Build an AI Agent Framework: Develop a self-built SDK for creating AI agents with enhanced connectivity to the internal ecosystem.
  • Standardise RAG Practices: Build a standard data indexing flow to lower the bar for engineers to build knowledge services.
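To illustrate the first direction, here is a minimal sketch of what an AI model gateway might look like: a thin wrapper that adds centralised audit logging and timing around whatever provider client you already use. The `call_provider` callable and the log fields are assumptions for illustration, not a reference design.

```python
import json
import logging
import time
import uuid
from typing import Callable

logger = logging.getLogger("ai_gateway")


def gateway_call(call_provider: Callable[[str, str], str],
                 model: str, prompt: str, user: str) -> str:
    """Forward a prompt to an LLM provider and emit a centralised audit record.

    `call_provider` is whatever client your team already uses (OpenAI SDK,
    Bedrock, a self-hosted endpoint); the gateway only adds logging and timing.
    """
    request_id = str(uuid.uuid4())
    started = time.time()
    status = "error"
    try:
        response = call_provider(model, prompt)
        status = "ok"
        return response
    finally:
        logger.info(json.dumps({
            "request_id": request_id,
            "user": user,
            "model": model,
            "prompt_chars": len(prompt),
            "latency_ms": round((time.time() - started) * 1000),
            "status": status,
        }))
```

The point of the sketch is the trade-off discussed below: the gateway adds real value (audit, cost attribution), but nothing stops a team from calling the provider directly and bypassing it.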

Those initiatives can indeed be significant, but the ROI really depends on the scale of your company. Regardless, you will face the following challenges:

  • Keeping up with the latest AI developments.
  • Driving customer adoption when it is easy for customers to bypass your abstraction.

If builders of data and ML platforms are like “Closet Organizers”, AI platform builders now should act like “Fashion Designers”. That requires embracing new ideas, conducting rapid experiments, and even accepting a level of imperfection.

My Thoughts on Promising Directions

Even though many challenges lie ahead, please be reminded that it is still gratifying to work on AI platforms right now, as you have substantial leverage that wasn't there before:

  • The transformation capability of AI is more substantial than that of data and machine learning.
  • The motivation to adopt AI is way more potent than ever.

If you pick the right direction and strategy, the transformation you can bring to your organisation is significant. Here are some of my thoughts on directions that might experience less disruption as AI models continue to scale. I think they are just as important as AI platformisation:

  • High-quality, rich-semantic data products: Data products with high accuracy and accountability, rich descriptions, and trustworthy metrics will “radiate” more impact as AI models grow.
  • Multi-modal Data Serving: A scalable knowledge service behind an MCP server may require multiple types of databases (OLTP, OLAP, NoSQL, and Elasticsearch) to support high-performance data serving. Maintaining a single source of truth and consistent performance with constant reverse ETL jobs is challenging.
  • AI DevOps: AI-centric software development, maintenance, and analytics. Code-gen accuracy has greatly increased over the past 12 months.
  • Experimentation and Monitoring: Given the increased uncertainty of AI applications, evaluating and monitoring them is even more critical (a small sketch follows this list).
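As a minimal illustration of the last point, the sketch below runs a small evaluation set against any response function and reports a pass rate. The `respond` entry point, the keyword-match check, and the file name are hypothetical placeholders for whatever your application and evaluation criteria actually are.

```python
from typing import Callable, Iterable


def evaluate(respond: Callable[[str], str],
             cases: Iterable[tuple[str, str]]) -> float:
    """Run a tiny offline evaluation.

    Ask each question, check that the expected keyword appears in the answer,
    and return the pass rate. `respond` is your application's entry point
    (agent, RAG chain, etc.); `cases` is a list of (question, expected_keyword)
    pairs. Real evaluations would use richer checks, e.g. an LLM judge.
    """
    results = [expected.lower() in respond(question).lower()
               for question, expected in cases]
    return sum(results) / len(results) if results else 0.0


# Hypothetical usage: fail the CI run if quality regresses below a threshold.
# pass_rate = evaluate(my_app.answer, load_eval_cases("eval_set.jsonl"))
# assert pass_rate >= 0.9, f"Eval regression: pass rate {pass_rate:.0%}"
```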

These are my thoughts on building AI platforms. Please let me know your thoughts on it as well. Cheers!
