Few people truly understand the revolutionary shift unfolding before their eyes when it comes to AI. It’s not just that our tools and software have become smarter — it’s that we’ve started developing software in an entirely new way.

This is understandable, of course, since there hasn’t been any dramatic change in either hardware or software. Our programs still run on digital CPUs and GPUs, and they’re still written in traditional programming languages like Python. So, where exactly is the revolutionary change?

It’s worth taking a look at the source code of large language models like GPT-2, Grok, or Meta’s LLaMA. Even to a layperson, one striking thing is how short and relatively simple this code is — which is surprising, considering the vast knowledge and problem-solving intelligence these models possess. This is when we begin to truly grasp why this is a real revolution, and why we can say that the way we develop software has fundamentally changed.

In an artificial intelligence system, the runtime code is just a marginal part of the system — the real knowledge and intelligence come from the dataset used for training. Data is the new source code!
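
To make this concrete, here is a rough sketch (in the spirit of, but not copied from, any real model's code) of what the runtime of a transformer-based language model essentially does: a handful of matrix multiplications. The weights below are random placeholders; in a real model they are loaded from multi-gigabyte checkpoint files, and that is where all of the knowledge lives.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def transformer_block(x, w):
    # x: (seq_len, d_model); w: weight matrices (random placeholders here,
    # in a real model they come from the trained checkpoint files).
    q, k, v = x @ w["wq"], x @ w["wk"], x @ w["wv"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v       # self-attention
    h = x + attn @ w["wo"]                                    # residual connection
    return h + np.maximum(h @ w["w1"], 0.0) @ w["w2"]         # feed-forward + residual

d = 64
rng = np.random.default_rng(0)
w = {name: rng.normal(0.0, 0.02, (d, d)) for name in ["wq", "wk", "wv", "wo", "w1", "w2"]}
tokens = rng.normal(size=(8, d))              # stand-in for 8 embedded input tokens
print(transformer_block(tokens, w).shape)     # (8, 64)
```

Stacked a few dozen times and scaled up, this is more or less the whole program. Everything else is data.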

That’s precisely why this new form of software has been called Software 2.0 by Andrej Karpathy — and I think that’s a very fitting name.

Open weight ≠ open source

There are several freely available models that anyone can download, run, or even modify. Examples include LLaMA, Grok, and the recently much-discussed Chinese model DeepSeek.

These models typically consist of a few Python files and several massive weight matrices (each several gigabytes in size). While it’s true that these models can be further developed — fine-tuned, quantized, distilled, and so on — they still can’t truly be considered open-source in the classical sense. This is because we don’t have access to the datasets used to train them.

It’s more accurate to call these open-weight models rather than open-source models, since the truly valuable component — the training data — remains in the hands of the publishing companies (Meta, xAI, etc.).

True open-source AI is built on open data.

Who owns the data?

Large language models are typically built by first creating a foundation model, which is then fine-tuned for a specific purpose (e.g., chatting, as with ChatGPT). This foundation model is trained on data produced by humanity and made publicly available — through websites, books, YouTube videos, and social media. Since this data wealth is the result of our collective work, it would be logical to treat these datasets as public domain resources, freely accessible to everyone.

In practice, however, the companies building on this collective data wealth give little back, which is why many services have explicitly decided to prohibit AI model developers from using their content.

Personally, I don’t fully agree with this approach, as I believe it hinders progress. I would much prefer a fair-use model that allows publicly available data to be used for AI training — on the condition that the resulting dataset and model must be made freely accessible in return.

Since no legal framework like this currently exists, and there’s no incentive for AI companies to develop truly open-source models, this responsibility falls to the community.

Decentralized storage — the ideal home for open datasets

But what would an open dataset built by a global community actually look like? That’s far from a trivial question, as there are significant ideological and cultural differences between people across different regions of the world. For this reason, it’s impossible to create a single dataset from publicly available global knowledge that everyone would agree on. Beyond that, it’s crucial that such a dataset is not owned by anyone, that access cannot be restricted, that data cannot be retroactively modified, and that no one has the power to censor it.

Given these criteria, the best choice is an immutable decentralized storage system, such as IPFS or Ethereum Swarm. These solutions use content-addressing (where the address of the data is a hash generated from its content), making unauthorized content modification virtually impossible. Storage is distributed across multiple nodes, ensuring secure and censorship-resistant access where data availability cannot be restricted.
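
As a tiny illustration of content addressing (heavily simplified; IPFS and Swarm chunk the data and use their own hash and address formats, but the principle is the same):

```python
import hashlib

def content_address(data: bytes) -> str:
    # The address is simply a hash of the bytes themselves.
    return hashlib.sha256(data).hexdigest()

original = b"dataset record: the sky is blue"
tampered = b"dataset record: the sky is green"

print(content_address(original))  # anyone can re-derive and verify this address
print(content_address(tampered))  # different content, therefore a different address
```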

These systems have another extremely useful feature: they store content in blocks. Since the address of a piece of content is derived from its hash, if the same block appears in multiple files, it only needs to be stored once. In this way, both IPFS and Swarm function similarly to a Git repository, where versioning is automatic, and forking is cheap. This is ideal in cases where we want to store multiple datasets that differ only slightly (e.g., by less than 1%). If someone disagrees with the content of a dataset, they can create a new version without needing to make a full copy — just the changes are stored. Exactly like when we fork a project on GitHub to modify something.
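
A rough sketch of why forking is cheap, using fixed-size blocks and SHA-256 as stand-ins for the real chunking and addressing schemes:

```python
import hashlib

BLOCK_SIZE = 1024

def block_addresses(data: bytes):
    # Split into fixed-size blocks and address each block by its hash.
    return {hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)}

# Two dataset versions that differ in a single record of equal length
# (real systems use content-defined chunking, so even insertions that
# shift later bytes only affect a few blocks).
dataset_v1 = b"\n".join(b"record %06d: some agreed-upon fact" % i for i in range(10_000))
dataset_v2 = dataset_v1.replace(b"record 004242: some agreed-upon fact",
                                b"record 004242: a disputed statement!")

a1, a2 = block_addresses(dataset_v1), block_addresses(dataset_v2)
print(f"v1 blocks: {len(a1)}, v2 blocks: {len(a2)}, shared: {len(a1 & a2)}")
print(f"new blocks the fork has to store: {len(a2 - a1)}")
```

Only the handful of blocks that actually changed need to be stored for the fork; every other block is shared with the original dataset.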

How blockchain can support the creation of open datasets

Blockchain and decentralized storage complement each other well. On one hand, decentralized storage makes it possible to store large amounts of data with a level of security comparable to blockchain storage. On the other hand, the blockchain can provide the incentive system and governance layer for decentralized storage. A good example is Ethereum Swarm, which could not function without a blockchain, since its incentive system — essential for the network’s optimal operation — is implemented through smart contracts running on the blockchain.

In the case of open datasets, blockchain-based DAOs could decide what gets included in a dataset. The system could function similarly to Wikipedia, where administrators ensure that false information doesn’t make it into the encyclopedia. Of course, it’s often not clear-cut what counts as false information. Wikipedia has no real solution for this issue — but in a decentralized, blockchain-based system, forks come into play.

If someone disagrees with the content of a dataset, they can create their own fork and launch a new DAO to manage the alternative version.
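
As a toy illustration (not based on any existing DAO framework, and in reality this logic would live in smart contracts rather than Python), the governance model could be sketched like this:

```python
import hashlib

def addr(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class DatasetDAO:
    def __init__(self, members, manifest=None):
        self.members = set(members)
        self.manifest = set(manifest or [])   # the dataset is a set of content addresses

    def propose_add(self, data: bytes, votes_for) -> bool:
        # A simple majority of members approves the addition.
        if len(set(votes_for) & self.members) * 2 > len(self.members):
            self.manifest.add(addr(data))
            return True
        return False

    def fork(self, new_members):
        # A fork reuses the existing manifest; only future decisions diverge,
        # and the underlying blocks are already deduplicated in storage.
        return DatasetDAO(new_members, manifest=self.manifest)

dao = DatasetDAO(members={"alice", "bob", "carol"})
dao.propose_add(b"fact A", votes_for={"alice", "bob"})     # accepted
dao.propose_add(b"claim B", votes_for={"carol"})           # rejected
alt = dao.fork(new_members={"carol", "dave", "erin"})
alt.propose_add(b"claim B", votes_for={"carol", "dave"})   # accepted in the fork
print(len(dao.manifest), len(alt.manifest))                # 1 2
```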

Decentralized Training

If data is the new source code, then in the case of Software 2.0 (artificial intelligence), training is the equivalent of compiling the program. In traditional software development, this compilation is done locally by developers on their own machines. In AI systems, however, training is an extremely energy- and computation-intensive task. Training a large language model can cost millions of dollars and requires massive computer clusters — a major challenge for community-driven models.

One option is for the community to raise funds and rent computing power from a cloud provider for centralized training. Another possibility is decentralized training, where members donate computing capacity either for free (as a public good) or in exchange for compensation.

However, decentralized training is far from a trivial task. One challenge is that large models cannot be trained on a single node — multi-node training is required, which demands high-volume communication between nodes. This communication must be optimized for training to be efficient. Fortunately, several teams are working on this issue. One notable example is DiLoCo (Distributed Low-Communication training), a protocol originally proposed by DeepMind researchers to enable training over an internet-connected network of nodes, and startups such as Exo Labs are building infrastructure in the same spirit.
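
To give a feel for the idea behind low-communication training, here is a single-process simulation of local updates with infrequent synchronization. This is my own minimal sketch rather than DiLoCo itself, which, among other refinements, applies an outer optimizer to the averaged update instead of plain averaging:

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0, 0.5])

def make_shard(n=256):
    # Each simulated node holds its own private data shard.
    X = rng.normal(size=(n, 3))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    return X, y

def local_sgd(w, X, y, steps=50, lr=0.01):
    # Many cheap local steps with no communication at all.
    w = w.copy()
    for _ in range(steps):
        i = rng.integers(len(y), size=32)
        grad = 2 * X[i].T @ (X[i] @ w - y[i]) / len(i)
        w -= lr * grad
    return w

shards = [make_shard() for _ in range(4)]   # 4 simulated nodes
w = np.zeros(3)                             # shared starting point

for outer_round in range(10):               # rare synchronization points
    local_ws = [local_sgd(w, X, y) for X, y in shards]
    w = np.mean(local_ws, axis=0)           # the only "communication" step

print(np.round(w, 3))                       # close to [ 2.  -3.   0.5]
```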

Another challenge — common to all open decentralized systems (blockchains, decentralized storage, etc.) — is the issue of trust. Since anyone can freely contribute their own devices to the system, there’s no guarantee they will act honestly. A malicious actor, for instance, could use unauthorized data instead of the DAO-approved dataset, thereby “contaminating” the model.

In these systems, trust is replaced by computational guarantees. The more security we want in an untrusted network of nodes, the more computational power is required. A good example of this is blockchain, where each node publishing a new block also validates all computations in the chain leading up to it.

This approach, however, doesn’t work for AI training, so we must explore other solutions. Here are three potential approaches:

Consensus-based Validation

One approach is to have each computation performed by multiple (e.g., three) randomly selected nodes. If the results don’t match, the dishonest node loses its staked deposit. The advantage of this method is that it provides relatively high security. The downside is that it triples the required computing power.
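
A toy version of this scheme might look like the following; the node names, stakes, and the "computation" itself are made up purely for illustration:

```python
import hashlib
import random

random.seed(42)
stakes = {f"node{i}": 100 for i in range(10)}
DISHONEST = {"node7"}                       # pretend one node cheats

def run_task(node, task: bytes) -> str:
    # The "computation" is a stand-in; a cheater returns garbage.
    result = b"garbage" if node in DISHONEST else task[::-1]
    return hashlib.sha256(result).hexdigest()

def validate(task: bytes, replication=3, slash=10):
    nodes = random.sample(list(stakes), replication)
    results = {n: run_task(n, task) for n in nodes}
    majority = max(results.values(), key=list(results.values()).count)
    for n, r in results.items():
        if r != majority:
            stakes[n] -= slash
            print(f"{n} disagreed with the majority and was slashed")

for i in range(20):
    validate(b"training micro-batch %d" % i)
print(stakes)
```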

Zero-Knowledge Proofs

With zero-knowledge proof (ZKP) technology, one can prove that a computation was performed — and do so in a way that the proof itself is cheap to verify. This technique is used in systems like zkRollups, where a zkSNARK proves that valid transactions were executed on a Layer 2 chain. The drawback is that generating the proof is computationally expensive, especially as the number of multiplications in the computation increases. This means that with current ZKP technology, training AI models this way would require drastically more computing power. Still, ZKPs are an actively researched area, and in the future, they may become efficient enough for distributed training.

Optimistic Decentralized Machine Learning

Optimistic decentralized machine learning works similarly to optimistic rollups. Computation is assumed to be correct unless someone submits a fraud-proof to show otherwise. In practice, the training node logs each step of the process — including the initial weight matrix, training data, and resulting weight matrix. If the log also records the random seeds, the entire computation becomes deterministic and reproducible.

Validator nodes can then randomly sample segments of the training log and verify them. If any inconsistencies are found, the training node loses its stake. This method has the lowest computational cost: it doesn’t require expensive zero-knowledge proof generation, and unlike consensus-based validation, only randomly selected parts of the computation need to be re-verified. This makes it the most efficient of the three approaches.
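
Here is a minimal, self-contained sketch of the idea: the trainer produces a log of seeded, deterministic steps, and a validator re-executes only a random sample of them. The toy training step stands in for a real gradient update, and in a real system the weight snapshots would be fetched from content-addressed storage rather than kept in memory:

```python
import hashlib
import random
import numpy as np

def digest(arr: np.ndarray) -> str:
    return hashlib.sha256(arr.tobytes()).hexdigest()

def training_step(w, data, seed):
    # Seeded and therefore fully deterministic and reproducible.
    rng = np.random.default_rng(seed)
    batch = data[rng.integers(len(data), size=8)]
    grad = batch.mean(axis=0) - w          # stand-in for a real gradient update
    return w + 0.1 * grad

# Trainer side: run the steps, publish snapshots and a step-by-step log.
data = np.random.default_rng(0).normal(size=(1000, 4))
w, log, snapshots = np.zeros(4), [], []
for step in range(100):
    snapshots.append(w.copy())             # would live in content-addressed storage
    new_w = training_step(w, data, step)
    log.append({"seed": step, "data": digest(data),
                "before": digest(w), "after": digest(new_w)})
    w = new_w

# Validator side: spot-check a random sample of steps instead of all of them.
random.seed(1)
for step in random.sample(range(len(log)), k=10):
    before, entry = snapshots[step], log[step]
    assert digest(before) == entry["before"], f"snapshot mismatch at step {step}"
    replayed = training_step(before, data, entry["seed"])
    assert digest(replayed) == entry["after"], f"fraud detected at step {step}"
print("10 sampled steps verified")
```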

Finally, decentralized training requires a “node marketplace” — a platform where available computing resources can be discovered and utilized. An example is Aleph Cloud, which, like other cloud providers, offers computational capacity, but as a decentralized platform designed to provide scalable storage, computing, and database services through a network of distributed nodes. It uses an ERC-20 token to pay for services, so it can be easily integrated with other blockchain-based solutions. Aleph nodes use trusted execution environments, so the validation schemes above are less relevant in this case.

Decentralized Inference

For large-scale models, not only is training non-trivial due to the high computational demands, but running the model (inference) is also challenging. This is especially true for reasoning models, where results emerge only after many consecutive forward passes, meaning the total compute spent on inference over a model’s lifetime can far exceed what was spent on training.

Since inference uses the same forward pass as training (training simply adds a backward pass and repeats the whole process many times), optimistic decentralized machine learning can also be applied here.

The main challenge in this context is privacy. Technologies like homomorphic encryption and multiparty computation (MPC) can help protect private data. At the same time, hardware performance continues to grow rapidly, and new techniques — such as 1.58-bit neural networks and distilled Mixture-of-Experts (MoE) models like DeepSeek — are increasingly making it possible to run these networks locally.
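
As a small illustration of one of these privacy techniques, here is additive secret sharing, a basic building block of MPC. It is heavily simplified (real protocols work over finite fields and also protect the model weights), but it shows how nodes can compute on data they never see in the clear:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([0.2, -1.3, 0.7, 0.05])       # "model" held by the compute nodes
user_input = np.array([1.0, 4.0, -2.0, 3.0])     # private data we want to protect

# The user splits the input into two additive shares; each looks like noise.
share_a = rng.normal(scale=10.0, size=user_input.shape)
share_b = user_input - share_a

# Each node computes on its own share only and never sees the real input.
partial_a = share_a @ weights
partial_b = share_b @ weights

print(partial_a + partial_b)      # reconstructed result of the private computation
print(user_input @ weights)       # matches the in-the-clear dot product
```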

I believe that in the long run, we will be able to run such models locally — or at the very least, within privately rented trusted environments.

Conclusion

By now, it’s clear to most people that AI is going to bring about revolutionary changes. It will reshape our world in ways we can hardly imagine — and that’s without even mentioning the impact of humanoid robots. What’s absolutely crucial is who holds the power over AI. Will it remain centralized in the hands of a few large corporations, or will it become a shared public good that benefits all of humanity?

This makes one question central to our future: Will truly decentralized AI emerge?

Building such a system requires more than just technical innovation — it demands open datasets, decentralized storage, blockchain-based governance, and incentive mechanisms that allow communities to contribute and collaborate freely. It also needs sustainable solutions for decentralized training and inference, ensuring both efficiency and privacy.

If we succeed, we won’t just democratize AI — we’ll lay the groundwork for a new digital commons, where intelligence itself is co-created, transparent, and open to all.
