Cancer Research Discovery

Finding cancer patterns by integrating multi-omics data through large-scale LLM pipelines

We mask values across gene expression, copy number, and clinical data. When the model reconstructs them, it reveals the biological functions connecting these layers—patterns we never explicitly taught it.

[Figure: masked multi-omics input (expression, CNV, clinical; "?" marks masked values) → LLM f(x), a non-parametric function approximator → learned patterns, with the masked values reconstructed with confidence.]
The Question

Can an LLM learn cancer biology by reconstructing masked multi-omics data?

Traditional machine learning learns from labels. But what if the structure itself is the teacher?

If a model can reconstruct masked gene expression values using only copy number variation and clinical data, it has implicitly learned how these biological layers connect.

The LLM becomes a massive non-parametric function approximator over biological relationships—learning patterns we never explicitly encoded.
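
As a concrete illustration, here is a minimal sketch of that setup on synthetic data; the gene, the column names, and the effect sizes are hypothetical, chosen only to show what "reconstruct expression from CNV and clinical context" means in practice.

```python
# Minimal sketch of the masked-reconstruction setup on synthetic data.
# Gene names, column names, and effect sizes are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

cohort = pd.DataFrame({
    "TP53_cnv": rng.integers(-2, 3, n),   # copy-number change per patient
    "stage": rng.integers(1, 5, n),       # clinical stage I-IV
})
# Expression partly follows gene dosage, so CNV carries signal about it.
cohort["TP53_expr"] = 5.0 + 0.8 * cohort["TP53_cnv"] + rng.normal(0, 0.5, n)

# Mask a fraction of expression values; these become reconstruction targets
# that must be filled in from the remaining CNV and clinical columns.
masked = rng.random(n) < 0.10
ground_truth = cohort.loc[masked, "TP53_expr"].copy()
cohort.loc[masked, "TP53_expr"] = np.nan
print(f"{masked.sum()} expression values masked for reconstruction")
```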

Traditional Approach

Train on labeled data. The model learns mappings from input to output but not the underlying structure.

Our Approach

Mask data, force reconstruction. The model learns biological functions to fill gaps—revealing how omics layers connect.

"

We don't train the model on biology papers. We train it to reconstruct masked data. The biology emerges from the structure.

"
Multi-Omics Integration

Four data layers, one learned function

Each layer provides a different view into cancer biology. The model learns how they connect.

[Figure: the LLM f(x) integrates four layers: gene expression, copy number (amplifications and deletions), clinical features, and knowledge graph context.]
The Experiment

Mask, Reconstruct, Discover

A simple protocol that reveals what the model has learned about biology.

1. Mask

Hide 10% of values across gene expression, CNV, and clinical features. These become reconstruction targets.

2. Integrate

Feed the remaining multi-omics data to the LLM, along with knowledge graph context for grounding.

3. Reconstruct

The model predicts the masked values using cross-modal relationships it has learned.

4. Validate

Compare predictions to the held-out ground truth. Accurate reconstruction is evidence that the model has learned the underlying functions, as sketched below.

[Figure: Patient P1 with TP53 masked (BRCA1: 8.2, EGFR: 4.1, CNV: +2 chr17, Stage III). LLM reconstruction f(Expr, CNV, Clinical, KG) predicts TP53 = 6.7, using the learned rule that CNV amplification plus Stage III implies elevated TP53.]
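
Below is a sketch of the four-step protocol under explicit assumptions: the masking helper and prompt format are illustrative stand-ins, and the LLM call itself is deliberately left out rather than tied to a specific API.

```python
# Sketch of the mask -> integrate -> reconstruct -> validate loop.
# build_prompt() and the prompt format are illustrative, not a specific API.
import numpy as np
import pandas as pd

def mask_cells(df: pd.DataFrame, frac: float = 0.10, seed: int = 0):
    """Hide a fraction of numeric cells; return the masked table and the truth."""
    rng = np.random.default_rng(seed)
    masked, truth = df.copy(), {}
    for col in df.select_dtypes("number").columns:
        hit = rng.random(len(df)) < frac
        truth[col] = df.loc[hit, col]
        masked.loc[hit, col] = np.nan
    return masked, truth

def build_prompt(patient: pd.Series, kg_context: str) -> str:
    """Serialize one patient's observed values plus knowledge-graph context."""
    observed = ", ".join(f"{k}={v}" for k, v in patient.dropna().items())
    targets = ", ".join(patient.index[patient.isna()])
    return (f"Observed: {observed}\nContext: {kg_context}\n"
            f"Predict the masked values for: {targets}")

def reconstruction_error(pred: pd.Series, truth: pd.Series) -> float:
    """Mean absolute error against the held-out ground truth."""
    both = pd.concat({"pred": pred, "truth": truth}, axis=1).dropna()
    return float((both["pred"] - both["truth"]).abs().mean())
```
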
Discoveries

Patterns the model found without being told

The striking finding isn't just that LLMs can reconstruct masked values—it's what their predictions reveal about biological relationships we didn't explicitly encode.

CNV-Expression Coupling

The model strongly weights copy number when predicting expression. It learned gene dosage effects—more copies means more transcription.

"More copies → More expression"

Stage-Specific Constraints

Validation confidence varies by tumor stage: early-stage tumors are predictable, while late-stage tumors show higher uncertainty that reflects their biological heterogeneity.

"Early = tight bounds, Late = chaos"

Pathway Co-Regulation

Genes in the same biological pathway, such as BRCA1, RAD51, and PALB2 in DNA repair, predict each other's expression. The model learned regulatory networks from reconstruction alone.

"Same pathway → mutual prediction"

Patient Neighborhoods

Cohort similarity improves reconstruction accuracy. Age and stage define "biological neighborhoods" where patients share expression patterns.

"Similar patients → similar biology"
The Application

From discovery to deployment: Evidence-grounded imputation

These discoveries power a practical tool. When you have missing cancer data, the model uses its learned biological functions to validate imputations—and tells you which ones to trust.

Claim Validation

Each imputed value is treated as a claim requiring evidence.

Cross-Modal Consistency

Validates expression using CNV and clinical signals.

Similar-Patient Priors

Leverages cohort similarity for validation.

Uncertainty Quantification

Confidence bounds on every prediction.

KG Grounding

Prevents hallucination by grounding claims in a biomedical knowledge graph.

Ensemble Filtering

Compares multiple imputation methods.
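
The sketch below shows the shape of that evidence-based filtering with hypothetical fields, thresholds, and weights; it illustrates the idea of scoring an imputed value as a claim, not the production scoring rule.

```python
# Treat each imputed value as a claim and score it against several checks.
# Fields, thresholds, and weights are hypothetical illustrations.
from dataclasses import dataclass

@dataclass
class ImputedClaim:
    gene: str
    value: float           # imputed expression
    cnv: int               # copy-number change for the same gene
    neighbor_mean: float   # expression among clinically similar patients
    neighbor_std: float
    kg_supported: bool     # knowledge graph supports the relationship used

def confidence(c: ImputedClaim) -> float:
    """Combine cross-modal, similar-patient, and KG checks into one score."""
    score = 0.0
    # Cross-modal consistency: amplification should not pair with unusually low expression.
    if not (c.cnv > 0 and c.value < c.neighbor_mean - 2 * c.neighbor_std):
        score += 0.4
    # Similar-patient prior: the value sits inside its biological neighborhood.
    if abs(c.value - c.neighbor_mean) <= 2 * c.neighbor_std:
        score += 0.4
    # KG grounding: the imputation relied on a relationship the graph contains.
    if c.kg_supported:
        score += 0.2
    return score

claim = ImputedClaim("TP53", value=6.7, cnv=2,
                     neighbor_mean=6.4, neighbor_std=0.5, kg_supported=True)
print(f"confidence = {confidence(claim):.1f}")       # high -> keep, low -> flag
```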

Philosophy

Why this approach works

[Figure: the learned function surface f(CNV, Stage, ...) → Expression, spanning CNV amplification and Stage I to IV. Masked (unknown) values are reconstructed by reading off the surface. Key insight: the LLM learns this surface implicitly from reconstruction loss.]

Structure as Teacher

We don't need labeled examples of "correct biology." The multi-omics structure itself teaches the model how layers connect through reconstruction.

Non-Parametric Flexibility

No assumed functional form. The LLM learns whatever relationships exist in the data, however complex or non-linear.

Grounded Inference

Knowledge graph context prevents hallucination. The model uses only supplied evidence, never invented biology.

Uncertainty Quantification

Know what you don't know. Every prediction comes with confidence bounds, not false certainty.

Ready to discover patterns in your multi-omics data?

Stop imputing blind. Start learning the biology.