Finding cancer patterns through massive integrated multi-omics LLM pipelines
We mask values across gene expression, copy number, and clinical data. When the model reconstructs them, it reveals the biological functions connecting these layers—patterns we never explicitly taught it.
Can an LLM learn cancer biology by reconstructing masked multi-omics data?
Traditional machine learning learns from labels. But what if the structure itself is the teacher?
If a model can reconstruct masked gene expression values using only copy number variation and clinical data, it has implicitly learned how these biological layers connect.
The LLM becomes a massive non-parametric function approximator over biological relationships—learning patterns we never explicitly encoded.
Traditional Approach
Train on labeled data. The model learns mappings from input to output but not the underlying structure.
Our Approach
Mask data, force reconstruction. The model learns biological functions to fill gaps—revealing how omics layers connect.
We don't train the model on biology papers. We train it to reconstruct masked data. The biology emerges from the structure.
Four data layers, one learned function
Each layer provides a different view into cancer biology. The model learns how they connect.
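As a concrete illustration, the four layers can live in one structure. A minimal sketch with synthetic stand-in data; all names, shapes, and the toy knowledge-graph snippet are hypothetical, not the pipeline's real schema:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_patients, n_genes = 50, 200

# Synthetic stand-ins for the four layers; names and shapes are illustrative.
layers = {
    # continuous expression values, one row per patient, one column per gene
    "expression": pd.DataFrame(rng.normal(size=(n_patients, n_genes))),
    # integer copy-number calls for the same patients and genes
    "cnv": pd.DataFrame(rng.integers(0, 5, size=(n_patients, n_genes))),
    # per-patient clinical features
    "clinical": pd.DataFrame({
        "age": rng.integers(30, 85, n_patients),
        "stage": rng.integers(1, 5, n_patients),
    }),
    # toy knowledge-graph context used for grounding
    "kg_context": {"TP53": ["cell cycle", "apoptosis"]},
}
```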
Mask, Reconstruct, Discover
A simple protocol that reveals what the model has learned about biology.
Mask
Hide 10% of values across gene expression, CNV, and clinical features. These become reconstruction targets.
Integrate
Feed remaining multi-omics data to the LLM with knowledge graph context for grounding.
Reconstruct
Model predicts masked values using cross-modal relationships it has learned.
Validate
Compare predictions to held-out ground truth. Accurate reconstruction is direct evidence that the model has learned the functions connecting the layers.
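The four-step protocol can be sketched end to end. Here a simple column-mean imputer stands in for the LLM so the loop is runnable; in the real pipeline the reconstruction step is the model call:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 20))          # e.g. an expression matrix

# 1. Mask: hide 10% of values, keeping the ground truth for validation.
mask = rng.random(X.shape) < 0.10
X_masked = X.copy()
X_masked[mask] = np.nan

# 2-3. Integrate + Reconstruct: a column-mean imputer stands in for the
# LLM here, purely so the sketch runs end to end.
col_means = np.nanmean(X_masked, axis=0)
X_hat = np.where(np.isnan(X_masked), col_means, X_masked)

# 4. Validate: compare predictions at masked positions to ground truth.
rmse = np.sqrt(np.mean((X_hat[mask] - X[mask]) ** 2))
```

Any reconstruction model can be dropped in at steps 2-3; only the validation at step 4 needs the held-out true values.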
Patterns the model found without being told
The striking finding isn't just that LLMs can reconstruct masked values—it's what their predictions reveal about biological relationships we didn't explicitly encode.
CNV-Expression Coupling
The model strongly weights copy number when predicting expression. It learned gene dosage effects—more copies means more transcription.
"More copies → More expression"
Stage-Specific Constraints
Validation confidence varies by tumor stage. Early-stage tumors are predictable; late-stage shows higher uncertainty—biological heterogeneity.
"Early = tight bounds, Late = chaos"
Pathway Co-Regulation
Genes in the same biological pathway predict each other's expression. The model learned regulatory networks from reconstruction.
"Same pathway → mutual prediction"
Patient Neighborhoods
Cohort similarity improves reconstruction accuracy. Age and stage define "biological neighborhoods" where patients share expression patterns.
"Similar patients → similar biology"
From discovery to deployment: Evidence-grounded imputation
These discoveries power a practical tool. When you have missing cancer data, the model uses its learned biological functions to validate imputations—and tells you which ones to trust.
Claim Validation
Each imputed value is treated as a claim requiring evidence.
Cross-Modal Consistency
Validates expression using CNV and clinical signals.
Similar-Patient Priors
Leverages cohort similarity for validation.
Uncertainty Quantification
Confidence bounds on every prediction.
KG Grounding
Prevents hallucination by grounding predictions in a biomedical knowledge graph.
Ensemble Filtering
Compares multiple imputation methods.
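Ensemble filtering can be illustrated with two trivial imputers standing in for the real methods; the agreement threshold below is arbitrary, chosen only for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 10))
mask = rng.random(X.shape) < 0.10
X_obs = np.where(mask, np.nan, X)       # observed matrix with gaps

# Two simple imputers stand in for the ensemble members.
mean_imp = np.where(np.isnan(X_obs), np.nanmean(X_obs, axis=0), X_obs)
median_imp = np.where(np.isnan(X_obs), np.nanmedian(X_obs, axis=0), X_obs)

# Filter: trust an imputed value only where the methods agree.
disagreement = np.abs(mean_imp - median_imp)
trusted = mask & (disagreement < 0.25)  # illustrative threshold
```

Imputed positions where the methods disagree are flagged rather than silently accepted, which is the point of treating each imputation as a claim requiring evidence.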
Why this approach works
Structure as Teacher
We don't need labeled examples of "correct biology." The multi-omics structure itself teaches the model how layers connect through reconstruction.
Non-Parametric Flexibility
No assumed functional form. The LLM learns whatever relationships exist in the data, however complex or non-linear.
Grounded Inference
Knowledge graph context prevents hallucination. The model uses only supplied evidence, never invented biology.
Uncertainty Quantification
Know what you don't know. Every prediction comes with confidence bounds, not false certainty.
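One common way to attach confidence bounds to an imputed value is to bootstrap over a similar-patient cohort. A minimal sketch; the cohort here is synthetic, and the pipeline's actual uncertainty method is not specified on this page:

```python
import numpy as np

rng = np.random.default_rng(7)
# Expression values for the patient's "biological neighborhood"
cohort = rng.normal(loc=5.0, scale=1.0, size=200)

# Bootstrap the cohort mean to get an interval, not just a point estimate.
boot = np.array([
    rng.choice(cohort, size=cohort.size, replace=True).mean()
    for _ in range(1000)
])
lo, hi = np.quantile(boot, [0.025, 0.975])
point = cohort.mean()                   # imputed value with 95% bounds
```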
Ready to discover patterns in your multi-omics data?
Stop imputing blind. Start learning the biology.