It is tempting to assume that avoiding DNA testing services like 23andMe or Ancestry will keep your most sensitive data private. In reality, that control has gradually eroded.
With today’s genomic databases and modern inference methods, your genetic profile can be reconstructed without any input from you. This is not a hypothetical risk; it is happening now, as a routine consequence of applying machine learning to large collections of family-linked data.
Today, genomic databases behave less like standalone archives and more like collaborative inference systems. When enough genetically close people are represented in the data (distant cousins, second-degree relatives), a model can estimate your traits, your disease risks and even stretches of your DNA sequence. What is happening is not data theft but statistical aggregation.
This article explains the technical changes that make this possible, links them to common ML approaches and discusses what it means when biology becomes as predictable as behaviour.
The Golden State Killer Was Predicted, Not Found
When police apprehended the Golden State Killer in 2018, they did not match his DNA to anything in the database. Instead, they uploaded the crime-scene DNA to GEDmatch and identified a relative: a third cousin. From there, they built a partial family tree and narrowed in on the suspect using genetic triangulation and pedigree inference.
What made the arrest possible was not the presence of his data but the structure of the data around him. Because enough relatives had shared their genetic profiles, investigators could reconstruct what the target’s genome might look like. In essence, this is a graph search problem: a sparsely labelled biological network in which the search is constrained by recombination and inheritance patterns.
The case wasn’t built on finding an exact match. It applied the logic of nearest-neighbour classification, with similarity measured in shared haplotype blocks and probabilistic lineage rather than numeric features.
It wasn’t only a significant advance in forensics. It was a reminder that your DNA is now connected to other people’s data in ways you may never have agreed to.
DNA Inference Is Nearest-Neighbour Search in a Biologically Constrained High-Dimensional Space
In machine learning, we usually picture nearest-neighbour (k-NN) classification with points in Euclidean space that have clear, numeric features. Genomic inference follows the same pattern, except that the feature space also encodes biological relationships.
In human genomics, each person is represented as a vector of millions of single-nucleotide polymorphisms (SNPs), typically coded 0, 1 or 2 for the number of copies of the minor allele at each site. Although the raw data can run to over a million features, techniques such as PCA and identity-by-descent (IBD) segment analysis compress it while preserving genetic similarity.
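As an illustrative sketch (all names and genotypes here are invented), the 0/1/2 encoding and a simple identity-by-state similarity between two people might look like this:

```python
# Hypothetical sketch: genotypes encoded as minor-allele counts (0/1/2),
# compared with a simple identity-by-state (IBS) score.

def ibs_similarity(g1, g2):
    """Mean identity-by-state: 1.0 for identical genotypes,
    0.0 when every site differs by two alleles."""
    assert len(g1) == len(g2)
    # Each site contributes 1 - |difference| / 2 (difference is 0, 1 or 2).
    return sum(1 - abs(a - b) / 2 for a, b in zip(g1, g2)) / len(g1)

alice = [0, 1, 2, 1, 0, 2]   # minor-allele counts at six SNPs
bob   = [0, 1, 2, 0, 0, 2]   # differs at one site by one allele

print(ibs_similarity(alice, bob))
```

Real pipelines compute this over hundreds of thousands of sites, but the principle is the same: similarity is a per-site aggregate, not an exact match.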
This space has biologically meaningful structure, shaped by population stratification, shared ancestry and selective pressure. Genetic similarity measures such as kinship coefficients, shared IBD segments or FST distances take the place of Euclidean distance.
In the Golden State Killer case, investigators effectively ran a nearest-neighbour query over GEDmatch’s genotype space, measuring similarity by shared haplotype blocks and recombination patterns rather than cosine distance or an L2 norm.
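A toy version of that query, with an invented database and a simple identity-by-state score standing in for real kinship estimation:

```python
# Illustrative sketch of a nearest-neighbour query over a toy genotype
# database, standing in for the kind of kinship search GEDmatch performs.
# The database, names and similarity measure are all assumptions.

def ibs(g1, g2):
    return sum(1 - abs(a - b) / 2 for a, b in zip(g1, g2)) / len(g1)

def nearest_relatives(query, database, k=2):
    """Rank database entries by genetic similarity to the query genotype."""
    ranked = sorted(database.items(), key=lambda kv: ibs(query, kv[1]), reverse=True)
    return ranked[:k]

database = {
    "cousin_a": [0, 1, 2, 1, 0, 2, 1, 1],
    "cousin_b": [0, 1, 1, 1, 0, 2, 0, 0],
    "stranger": [2, 0, 0, 2, 2, 0, 0, 2],
}
crime_scene = [0, 1, 2, 1, 0, 2, 1, 0]

for name, genotype in nearest_relatives(crime_scene, database):
    print(name, round(ibs(crime_scene, genotype), 3))
```

The query never needs the target in the database; it only needs someone nearby in genotype space.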
When a third cousin turns up, the search proceeds backwards through the genealogy graph, using biological rules of inheritance to enumerate the genomes that could plausibly connect the relative to the unknown person.
The process works by combining a constrained k-NN search, a graph traversal and probabilistic filtering.
- k-NN finds the genetically closest nodes
- Pedigree graphs constrain the search space
- Statistical imputation models fill in missing variants
Instead of giving a classification, the result is a new genotype.
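A minimal sketch of those steps, under toy assumptions: find the genetically closest profiles, then impute the target’s missing variants as a similarity-weighted consensus of those relatives. All data here is invented for illustration.

```python
# Toy pipeline: k-NN search over relatives, then weighted imputation of
# the target's unknown sites. Pedigree constraints are reduced here to
# simply restricting attention to the k most similar profiles.

def ibs(g1, g2):
    """Identity-by-state over sites where both genotypes are known."""
    pairs = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    return sum(1 - abs(a - b) / 2 for a, b in pairs) / len(pairs)

def impute(target, relatives, k=2):
    """Fill None sites in `target` with a similarity-weighted average of
    the k nearest relatives, rounded to a 0/1/2 genotype call."""
    ranked = sorted(relatives, key=lambda r: ibs(target, r), reverse=True)[:k]
    weights = [ibs(target, r) for r in ranked]
    filled = list(target)
    for i, value in enumerate(target):
        if value is None:
            estimate = sum(w * r[i] for w, r in zip(weights, ranked)) / sum(weights)
            filled[i] = round(estimate)
    return filled

target = [0, 1, None, 1, None, 2]    # partially known genome
relatives = [
    [0, 1, 2, 1, 0, 2],              # close relative
    [0, 1, 2, 0, 0, 2],              # close relative
    [2, 0, 0, 2, 2, 0],              # unrelated profile
]
print(impute(target, relatives))
```

Real imputation uses haplotype reference panels and hidden Markov models rather than a weighted vote, but the shape of the computation is the same: unknown sites are inferred from whoever is nearby in the graph.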
This goes beyond standard inference. The approach exploits family relationships to reconstruct a genotype, which means your DNA can be recovered almost completely even if your genome has never been sequenced, because the genetic neighbourhood around you is dense with data.
In data science terms, this is feature leakage via latent graph proximity. And unlike a password or an email address, a genome cannot be reset.
Polygenic Risk Scores Are Genomic Ensembles
I first encountered polygenic risk scores (PRS) while working on predictive models. My team was building behavioural risk classifiers, and PRS looked strikingly familiar: the same approach, except that instead of surveys or wearables, the features were large numbers of SNPs spread across the genome.
A PRS is a weighted sum over a large but sparse set of features. The weights are usually derived from GWAS summary statistics, often via penalised regression such as LASSO or elastic net. More refined methods, such as LDpred or PRS-CS, apply Bayesian shrinkage and model the linkage disequilibrium between correlated SNPs.
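Stripped of the fitting procedure, the score itself is just a sparse dot product. A hedged sketch, with made-up SNP identifiers and effect sizes (real weights would come from GWAS summary statistics):

```python
# Minimal sketch of a polygenic risk score as a sparse weighted sum.
# The rs-numbers and effect sizes below are invented for illustration.

def polygenic_score(genotype, weights):
    """Sum of effect sizes times minor-allele counts, over the sparse
    set of SNPs that carry a nonzero weight."""
    return sum(beta * genotype.get(snp, 0) for snp, beta in weights.items())

# Hypothetical effect sizes for a handful of SNPs.
weights = {"rs123": 0.30, "rs456": -0.12, "rs789": 0.05}

# One person's minor-allele counts at those SNPs.
person = {"rs123": 2, "rs456": 1, "rs789": 0}

print(round(polygenic_score(person, weights), 3))  # 0.30*2 - 0.12*1 = 0.48
```

Everything interesting lives in the weights; the scoring step is deliberately simple so it can be applied to millions of genomes cheaply.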
What people outside genetics often overlook is that these trained models generalise beyond their training set. If your relatives’ genomes are in the data and linked to health outcomes, the model can estimate your genetic risk without ever examining your genome.
Put another way, PRS works like a recommender system for biology. Genetically similar individuals position you in trait space: if many people near you in genotype space carry a particular disease, the model starts flagging that risk for you, even though you never took part in the study.
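That recommender-style logic can be sketched as a k-nearest-neighbour risk estimate, the same mechanism collaborative filtering uses for ratings. The cohort below is entirely invented:

```python
# Sketch of the recommender analogy: someone's risk is estimated as the
# disease rate among their k genetically nearest neighbours. Toy data.

def ibs(g1, g2):
    return sum(1 - abs(a - b) / 2 for a, b in zip(g1, g2)) / len(g1)

def neighbour_risk(query, cohort, k=3):
    """cohort: list of (genotype, has_condition) pairs."""
    ranked = sorted(cohort, key=lambda rec: ibs(query, rec[0]), reverse=True)
    return sum(1 for _, sick in ranked[:k] if sick) / k

cohort = [
    ([0, 1, 2, 1], True),
    ([0, 1, 2, 0], True),
    ([0, 1, 1, 1], True),
    ([2, 0, 0, 2], False),
    ([2, 2, 0, 2], False),
]
you = [0, 1, 2, 1]  # never enrolled, but genetically near the first three
print(neighbour_risk(you, cohort))
```

The person being scored contributed nothing; their risk estimate is borrowed wholesale from their genetic neighbours.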
But once prediction enters the loop, it opens the door not just for scientific insight, but for manipulation. The same models that inform can also be exploited.
What Happens When Adversarial Actors Enter the Loop?
The moment we treat DNA databases as predictive systems, we also inherit their vulnerabilities. Once genomes become queryable, inferable, and connected across public and commercial platforms, adversarial behaviour becomes a modelling risk, not just an ethical one.
Genomic backsolving as inverse modelling
Suppose enough of your relatives have uploaded their genomes to open databases. In that case, an attacker can perform inverse inference, reconstructing likely segments of your DNA based on shared haplotypes and known inheritance patterns. This isn’t hypothetical: researchers have demonstrated that it’s possible to approximate a person’s genome with >60% accuracy using third-cousin-level data.
It’s not that far removed from model inversion attacks in machine learning, where someone reconstructs training data from model outputs. Only here, the “model” is the relational structure of a population.
Shadow scoring and risk pricing
Insurers and data brokers may not access your raw DNA, but with access to demographic data and public kinship graphs, they can predict your polygenic risk scores through proxy modelling. Even without violating GINA (the U.S. Genetic Information Nondiscrimination Act), they could use external inferences to re-rank you silently, affecting credit, health products, or eligibility profiles.
It’s a genomically informed version of algorithmic redlining, and it can operate invisibly.
Adversarial relatives and genomic poisoning
What if someone intentionally uploads manipulated genomes to poison a target’s inferred profile? Because these systems rely on statistical consistency across relatives, altering or faking segments could bias inference engines. Imagine someone nudging your inferred genome to raise your risk for a condition, or falsely aligning you with a crime scene sequence.

Conclusion
This article was written to unpack a reality that’s easy to miss, even for those of us working in machine learning: genomic data doesn’t need to be collected directly to be modelled accurately.
Across the piece, I explored how genomic inference operates like nearest-neighbour classification, how polygenic risk scoring resembles ensemble regression, and how relational graph structures allow your DNA to be reconstructed using statistical proximity. If you’ve ever built collaborative filtering systems, you already understand the logic behind these methods, but probably didn’t expect it to apply to something as personal as your genome.
That’s the deeper point. This isn’t just a privacy story. It’s a modelling story about how the structure of biological data makes inference not only possible, but inevitable. Whether you’ve sequenced your DNA or not, you are now part of the model, because the people connected to you have already fed it enough.
In an era of large-scale inference systems, it’s no longer enough to ask who owns data. We have to ask who owns the patterns, because patterns generalise, and generalisation doesn’t need permission.