Finding cancer patterns through massive integrated multi-omics LLM pipelines
We mask values across gene expression, copy number, and clinical data. When the model reconstructs them, it reveals the biological functions connecting these layers—patterns we never explicitly taught it.
Can an LLM learn cancer biology by reconstructing masked multi-omics data?
Traditional machine learning learns from labels. But what if the structure itself is the teacher?
If a model can reconstruct masked gene expression values using only copy number variation and clinical data, it has implicitly learned how these biological layers connect.
The LLM becomes a massive non-parametric function approximator over biological relationships—learning patterns we never explicitly encoded.
Traditional Approach
Train on labeled data. The model learns mappings from input to output but not the underlying structure.
Our Approach
Mask data, force reconstruction. The model learns biological functions to fill gaps—revealing how omics layers connect.
We don't train the model on biology papers. We train it to reconstruct masked data. The biology emerges from the structure.
Four data layers, one learned function
Each layer provides a different view into cancer biology. The model learns how they connect.
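As a concrete illustration, the four layers can live in one structure. A minimal sketch with synthetic stand-in data; all names, shapes, and the toy knowledge-graph snippet are hypothetical, not the pipeline's real schema:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_patients, n_genes = 50, 200

# Synthetic stand-ins for the four layers; names and shapes are illustrative.
layers = {
    # continuous expression values, one row per patient, one column per gene
    "expression": pd.DataFrame(rng.normal(size=(n_patients, n_genes))),
    # integer copy-number calls for the same patients and genes
    "cnv": pd.DataFrame(rng.integers(0, 5, size=(n_patients, n_genes))),
    # per-patient clinical features
    "clinical": pd.DataFrame({
        "age": rng.integers(30, 85, n_patients),
        "stage": rng.integers(1, 5, n_patients),
    }),
    # toy knowledge-graph context used for grounding
    "kg_context": {"TP53": ["cell cycle", "apoptosis"]},
}
```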
Mask, Reconstruct, Discover
A simple protocol that reveals what the model has learned about biology.
Mask
Hide 10% of values across gene expression, CNV, and clinical features. These become reconstruction targets.
Integrate
Feed remaining multi-omics data to the LLM with knowledge graph context for grounding.
Reconstruct
Model predicts masked values using cross-modal relationships it has learned.
Validate
Compare predictions to held-out ground truth. Accurate reconstruction is direct evidence that the model has learned the functions connecting the layers.
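The four-step protocol can be sketched end to end. Here a simple column-mean imputer stands in for the LLM so the loop is runnable; in the real pipeline the reconstruction step is the model call:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 20))          # e.g. an expression matrix

# 1. Mask: hide 10% of values, keeping the ground truth for validation.
mask = rng.random(X.shape) < 0.10
X_masked = X.copy()
X_masked[mask] = np.nan

# 2-3. Integrate + Reconstruct: a column-mean imputer stands in for the
# LLM here, purely so the sketch runs end to end.
col_means = np.nanmean(X_masked, axis=0)
X_hat = np.where(np.isnan(X_masked), col_means, X_masked)

# 4. Validate: compare predictions at masked positions to ground truth.
rmse = np.sqrt(np.mean((X_hat[mask] - X[mask]) ** 2))
```

Any reconstruction model can be dropped in at steps 2-3; only the validation at step 4 needs the held-out true values.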
Patterns the model found without being told
The striking finding isn't just that LLMs can reconstruct masked values—it's what their predictions reveal about biological relationships we didn't explicitly encode.
CNV-Expression Coupling
The model strongly weights copy number when predicting expression. It learned gene dosage effects—more copies means more transcription.
"More copies → More expression"
Stage-Specific Constraints
Validation confidence varies by tumor stage. Early-stage tumors are predictable; late-stage shows higher uncertainty—biological heterogeneity.
"Early = tight bounds, Late = chaos"
Pathway Co-Regulation
Genes in the same biological pathway predict each other's expression. The model learned regulatory networks from reconstruction.
"Same pathway → mutual prediction"
Patient Neighborhoods
Cohort similarity improves reconstruction accuracy. Age and stage define "biological neighborhoods" where patients share expression patterns.
"Similar patients → similar biology"
From discovery to deployment: Evidence-grounded imputation
These discoveries power a practical tool. When you have missing cancer data, the model uses its learned biological functions to validate imputations—and tells you which ones to trust.
Claim Validation
Each imputed value is treated as a claim requiring evidence.
Cross-Modal Consistency
Validates expression using CNV and clinical signals.
Similar-Patient Priors
Leverages cohort similarity for validation.
Uncertainty Quantification
Confidence bounds on every prediction.
KG Grounding
Prevents hallucination by grounding predictions in a biomedical knowledge graph.
Ensemble Filtering
Compares multiple imputation methods.
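Ensemble filtering can be illustrated with two trivial imputers standing in for the real methods; the agreement threshold below is arbitrary, chosen only for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 10))
mask = rng.random(X.shape) < 0.10
X_obs = np.where(mask, np.nan, X)       # observed matrix with gaps

# Two simple imputers stand in for the ensemble members.
mean_imp = np.where(np.isnan(X_obs), np.nanmean(X_obs, axis=0), X_obs)
median_imp = np.where(np.isnan(X_obs), np.nanmedian(X_obs, axis=0), X_obs)

# Filter: trust an imputed value only where the methods agree.
disagreement = np.abs(mean_imp - median_imp)
trusted = mask & (disagreement < 0.25)  # illustrative threshold
```

Imputed positions where the methods disagree are flagged rather than silently accepted, which is the point of treating each imputation as a claim requiring evidence.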
Why this approach works
Structure as Teacher
We don't need labeled examples of "correct biology." The multi-omics structure itself teaches the model how layers connect through reconstruction.
Non-Parametric Flexibility
No assumed functional form. The LLM learns whatever relationships exist in the data, however complex or non-linear.
Grounded Inference
Knowledge graph context prevents hallucination. The model uses only supplied evidence, never invented biology.
Uncertainty Quantification
Know what you don't know. Every prediction comes with confidence bounds, not false certainty.
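One common way to attach confidence bounds to an imputed value is to bootstrap over a similar-patient cohort. A minimal sketch; the cohort here is synthetic, and the pipeline's actual uncertainty method is not specified on this page:

```python
import numpy as np

rng = np.random.default_rng(7)
# Expression values for the patient's "biological neighborhood"
cohort = rng.normal(loc=5.0, scale=1.0, size=200)

# Bootstrap the cohort mean to get an interval, not just a point estimate.
boot = np.array([
    rng.choice(cohort, size=cohort.size, replace=True).mean()
    for _ in range(1000)
])
lo, hi = np.quantile(boot, [0.025, 0.975])
point = cohort.mean()                   # imputed value with 95% bounds
```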
Ready to discover patterns in your multi-omics data?
Stop imputing blind. Start learning the biology.