Target discovery

How do you find synthetic lethal targets without screening everything?

There are billions of possible gene combinations, and no lab can test them all. Can an AI agent actually prioritise the right ones, and then prove them out?

Single-target drugs often fail because of resistance, so teams turn to combinations - but two genes make ~200M pairs and three make ~1.3 trillion triples. Against a known CRISPR screen, an untrained LLM scored synthetic-lethal pairs at 73.8% accuracy (AUC 0.768), beating purpose-built tools. We scaled to ~400,000 pairs, layered novelty and feasibility scores, and collapsed them to 11 hypotheses now prioritised for experimental validation - each first grounded in real omic data by Hydra.

By PharosBio

Updated June 15, 2026

Who this is for: Target-discovery and computational-biology teams facing a combinatorial search space too large for exhaustive lab screening.

In collaboration with

Boston Children's Hospital

1.3T: possible three-gene combinations - no lab screens that
73.8%: best untrained LLM accuracy (AUC 0.768), beating SLant & MAGICAL
~60%: accuracy held on a post-cutoff, unseen screen
400k → 11: pairs scored, then collapsed to testable hypotheses

The problem with combination target discovery

Most cancers are heterogeneous, so single-target drugs frequently fail - tumours become resistant, or the agent is too toxic. Combinations are the established answer, and synthetic lethality is the cleanest version: two genes that can each be knocked out alone with little consequence, but kill the cell when lost together. It's how PARP inhibitors like olaparib work - lethal only in cells that already carry a BRCA1/2 mutation.

But the combinatorics explode: one gene is ~20,000 candidates, two genes ~200 million pairs, three genes over 1.3 trillion. No screen tests a trillion combinations. Cracking this needs agents at three layers - reasoning to nominate candidates, analysis to verify them against real omic data, and experimental validation to close the loop. The question teams actually google is narrower: can an agent meaningfully filter and prioritise, so scarce screening capacity goes only to the combinations most likely to be real, novel, and feasible?
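The counts above follow from counting unordered gene combinations. A minimal sketch, assuming a round pool of 20,000 genes (the approximate figure quoted above, not an exact gene count); the function name is illustrative only:

```python
from math import comb

N_GENES = 20_000  # assumed round figure for the candidate gene pool


def search_space(n_genes: int, k: int) -> int:
    """Number of unordered k-gene combinations drawn from n_genes."""
    return comb(n_genes, k)


pairs = search_space(N_GENES, 2)
triples = search_space(N_GENES, 3)

print(f"pairs:   {pairs:,}")    # 199,990,000 (~200 million)
print(f"triples: {triples:,}")  # 1,333,133,340,000 (~1.3 trillion)
```

Going from pairs to triples multiplies the space by roughly 6,700, which is why exhaustive screening stops being an option well before three-gene combinations.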

What teams in this space search for

How do I find synthetic lethal targets without screening everything?
Can AI predict which gene combinations are synthetically lethal?
How do I prioritise combination targets to beat drug resistance?

The solution

How we solved it with Hydra

The prompt we gave Hydra

Module: Synthetic-Lethality Scoring module

“Score these ~10,000 gene pairs for synthetic lethality and tell me how an LLM compares against a known CRISPR screen. Then widen to ~400,000 pairs of clinically relevant genes, add novelty and feasibility scores, and shortlist the strongest candidates. For the survivors, validate them against real omic data, tumour-versus-normal expression, dependencies, and which patient populations would benefit.”

Diagram, three Gene A / Gene B panels: two labelled "Viable: one gene still covers the cell" and one labelled "Lethal: both lost together".

Olaparib exploits exactly this: it is a PARP inhibitor that is lethal only in cells already carrying a BRCA1/2 mutation. That makes synthetic lethality a two-gene problem with a known answer, and so an ideal test of whether an agent can reason over combinations.


Candidate space: Gene X + Gene Y, ~400,000 pairs. Clinically relevant genes, RPE1 TP53-deficient background.

Reasoning layer: LLM. Scores every pair - no training on CRISPR knockouts.

Scored on three axes, plus a written rationale for why each pair is promising:

Synthetic lethality: −1 → +1
Novelty: 1-10
Feasibility: 1-10

Shortlist: 11 pairs. Top synthetic-lethality score with novelty and feasibility both above 5 - now prioritised for validation.

The screening loop: ~400,000 gene pairs are scored by an LLM on synthetic lethality, novelty, and feasibility, then filtered to the 11 strongest candidates - collapsing a space no lab could screen into a testable shortlist.

What Hydra ran

Reasoning layer: to test whether an LLM can navigate this space, we used a problem with a known answer. We took Olivieri et al.'s 2020 CRISPR screen - which gene knockouts sensitise cells to olaparib - as ground truth, then gave the same ~10,000 pairs to several open-weight LLMs and asked each to score every pair from −1 to +1 (−1 rescue, 0 no effect, +1 lethal). To rule out memorisation, we re-ran the whole thing on a CRISPR screen released after the models' training cutoff.
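The benchmark step above amounts to comparing per-pair LLM scores with binary screen labels. A minimal sketch of that comparison in plain Python, assuming labels of 1 (hit in the screen) and 0 (no hit), and assuming a pair is called lethal when its score is at or above the threshold. The function names and toy data are hypothetical, not the preprint's code or data:

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the probability that a random positive outscores a random
    negative, with ties counted as half (the Mann-Whitney U formulation)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def balanced_accuracy(labels: Sequence[int], scores: Sequence[float],
                      threshold: float) -> float:
    """Mean of sensitivity and specificity when score >= threshold means 'lethal'."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


# Toy data (hypothetical): six pairs scored on the -1..+1 scale.
toy_labels = [1, 0, 0, 1, 0, 0]
toy_scores = [0.9, 0.1, 0.4, 0.4, -0.8, 0.0]

print(roc_auc(toy_labels, toy_scores))                            # 0.9375
print(balanced_accuracy(toy_labels, toy_scores, threshold=0.3))  # 0.875
```

The pairwise loop is quadratic in the worst case, which is acceptable for ~10,000 pairs with few positives; a rank-based implementation would be the usual choice at larger scale.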

Scaling layer: we widened to ~400,000 pairs of clinically relevant genes (in an RPE1 TP53-deficient background) and layered two more agent-assigned scores - novelty (an agent with literature access checks whether a similar pair was already tested) and feasibility (pan-essential genes, for instance, are harder to validate). Filtering to the top 100 by synthetic-lethality score with novelty and feasibility both above 5 left 11 pairs.
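The filter described above can be sketched as a two-step selection. This reads "top 100 by synthetic-lethality score" as being applied first, with the novelty and feasibility cut-offs applied to those 100; that ordering is an interpretation of the text. The class, field names, and placeholder gene symbols are hypothetical:

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ScoredPair:
    gene_a: str
    gene_b: str
    sl_score: float   # synthetic lethality: -1 (rescue) to +1 (lethal)
    novelty: int      # 1-10, agent-assigned
    feasibility: int  # 1-10, agent-assigned


def shortlist(pairs: Iterable[ScoredPair], top_n: int = 100,
              min_axis: int = 5) -> List[ScoredPair]:
    """Keep the top_n pairs by SL score, then require novelty and
    feasibility to both be strictly above min_axis."""
    ranked = sorted(pairs, key=lambda p: p.sl_score, reverse=True)[:top_n]
    return [p for p in ranked
            if p.novelty > min_axis and p.feasibility > min_axis]


# Placeholder pairs, not real results.
demo = [
    ScoredPair("G1", "G2", 0.9, 8, 7),
    ScoredPair("G3", "G4", 0.8, 3, 9),  # fails the novelty cut-off
    ScoredPair("G5", "G6", 0.7, 6, 6),
    ScoredPair("G7", "G8", 0.2, 9, 9),  # outside the top 3 by SL score
]
kept = shortlist(demo, top_n=3)
print([(p.gene_a, p.gene_b) for p in kept])  # [('G1', 'G2'), ('G5', 'G6')]
```

Ranking before filtering means a highly novel, feasible pair with a middling lethality score never reaches the shortlist, which keeps the list anchored on the strongest lethality predictions.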

Validation layer: each surviving pair goes to Hydra - our general agentic tool. Give it a hypothesis and optional data and it plans the bioinformatics, runs it across hundreds of skills (DepMap, TCGA, cBioPortal, Ensembl…), and returns a ready-to-read report on whether the hit is worth testing and which patients would benefit. The 11 pairs are now prioritised for experimental validation.

What it found

The best untrained model reached 73.8% accuracy (AUC 0.768) - beating the purpose-built tools SLant and MAGICAL - despite never being trained to predict CRISPR knockouts. On the unseen, post-cutoff screen, accuracy held at ~60%: a modest drop that points to real generalisation, not data leakage. It echoes coding agents - no one trained them to spot concurrency or security bugs; the capability emerged. Above-random performance on screens the model never saw hints at something similar happening in biology.

Scaling collapsed ~400,000 pairs to 11 testable hypotheses - impossible to surface by hand. But 11 knockout pairs are a starting point, not an answer: are the genes differentially expressed in tumour versus healthy tissue, in which tumour types, for which patients? That is what Hydra answers. On the TP53 + TUBGCP2 pair it queried Ensembl (the genes sit on different chromosomes, so cis-codeletion is impossible), pulled DepMap (TUBGCP2 is pan-essential - broad inhibition would be systemically toxic), and found no real TP53-TUBGCP2 correlation. Yet the same chain surfaced a better angle: glioblastoma frequently shows chromosome-10 monosomy alongside TUBGCP2 overexpression, leaving a single copy - a plausible, tumour-selective target.
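Two of the checks in the TP53 + TUBGCP2 example are simple rules once the annotations are in hand. A minimal sketch of those two rules only, assuming the chromosome and essentiality values have already been looked up (for example in Ensembl and DepMap) and entered by hand. The types and function are hypothetical and do not represent Hydra's implementation:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GeneAnnotation:
    symbol: str
    chromosome: str      # looked up beforehand, e.g. in Ensembl
    pan_essential: bool  # looked up beforehand, e.g. from DepMap dependency data


def triage_flags(a: GeneAnnotation, b: GeneAnnotation) -> List[str]:
    """Return plain-language caveats for a candidate pair."""
    flags = []
    if a.chromosome != b.chromosome:
        flags.append("different chromosomes: cis-codeletion is not possible")
    for gene in (a, b):
        if gene.pan_essential:
            flags.append(f"{gene.symbol} is pan-essential: "
                         "broad inhibition risks systemic toxicity")
    return flags


# Hand-entered values matching the findings described above.
tp53 = GeneAnnotation("TP53", chromosome="17", pan_essential=False)
tubgcp2 = GeneAnnotation("TUBGCP2", chromosome="10", pan_essential=True)

for flag in triage_flags(tp53, tubgcp2):
    print(flag)
```

Flags like these narrow a hypothesis rather than kill it: the glioblastoma angle above came from following up the same annotations, not from discarding the pair.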

What we learned

An LLM score is a powerful prior, not a verdict: specificity is low (false positives), and the models show a bimodal preference for particular scores - so survivors must be grounded in real data before anyone touches a pipette. That is the division of labour: the scoring module ranks the space, Hydra validates the shortlist on omics, and only then is a pair worth testing experimentally. The same reason → analyse → validate loop applies anywhere combinatorics outrun lab capacity.

Hydra turns a one-line prompt into hundreds of analyses. A single TP53/TUBGCP2 run produced over 190 results in orchestrated mode and synthesised them into a summary paper, so the volume stays readable. What used to take a bioinformatician days now runs overnight across dozens of programs - the scientist comes back in the morning to verify and judge, not to execute.

Benchmark & validation

Against the 2020 CRISPR screen by Olivieri et al., the untrained LLMs are scored by ROC-AUC. The best, gpt-oss-20B, reaches 0.768 - well clear of the purpose-built tools MAGICAL (0.550) and SLant (0.501), which sit at or near chance.

AUC ranking (bar chart, axis starts at 0.45). The plotted values appear in the table that follows.

Model	AUC	F1	Balanced acc.	Optimal threshold
gpt-oss-20B	0.768	0.044	0.733	0.30
Qwen2.5-72B	0.735	0.012	0.664	0.80
Qwen2.5-32B	0.715	0.017	0.695	0.70
gpt-oss-120B	0.715	0.036	0.709	0.55
Qwen2.5-7B	0.573	0.008	0.571	0.90
MAGICAL	0.550	0.016	0.584	0.19
Random baseline	0.502	0.006	0.500	0.00
SLant	0.501	0.060	0.500	0.00

Discrimination on the screen: higher AUC means better ranking of lethal versus non-lethal pairs. F1 is low across the board because lethal pairs are rare (heavy class imbalance), which is why AUC and balanced accuracy are the headline metrics. Numbers from the preprint.
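The effect of class imbalance on F1 can be seen with a small worked example. The confusion-matrix counts are hypothetical, chosen only to show the pattern (100 lethal pairs among 10,000); they are not taken from the preprint:

```python
def summarise(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Basic binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical screen: 100 true lethal pairs, 9,900 non-lethal pairs.
# The classifier finds 75% of positives and rejects 72% of negatives.
result = summarise(tp=75, fp=2772, tn=7128, fn=25)

for name, value in result.items():
    print(f"{name:>17}: {value:.3f}")
# balanced accuracy is about 0.735, yet F1 is only about 0.051
```

Even a modest false-positive rate yields thousands of false positives when negatives outnumber positives 99 to 1, so precision and F1 collapse while balanced accuracy stays informative.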

Demo - Hydra validating the TP53 + TUBGCP2 pair

One of the 11 shortlisted pairs, handed to Hydra. From a single prompt it runs cell-line and patient evidence in orchestrated mode, produces 190+ results, and writes a summary paper. Here it rules out the naive mechanism and surfaces the glioblastoma chromosome-10 angle instead.

Read the preprint: the full method, benchmark, and 11 shortlisted pairs - on bioRxiv

What you get

An untrained LLM scored synthetic lethality at 73.8% (AUC 0.768), beating SLant and MAGICAL
≈60% accuracy held on a post-training-cutoff screen, controlling for data leakage
~400,000 candidate pairs collapsed to 11, now prioritised for experimental validation
Each pair grounded by Hydra in DepMap, TCGA and cBioPortal before any experiment
A reusable reason → analyse → validate loop for any combinatorial discovery problem

Data sources used

Olivieri et al. 2020 CRISPR olaparib-sensitisation screen (ground truth)
A post-training-cutoff CRISPR screen (data-leakage control)
DepMap (gene dependencies & essentiality)
TCGA / cBioPortal (tumour expression & alterations)
Ensembl (genomic coordinates) + primary literature (novelty)

Figures reflect analyses PharosBio ran on public datasets and public benchmarks. Named competitors, collaborators, and logos are withheld at this stage; the methods and results shown are real and repointable to your own target.

Sources & methods

PharosBio preprint, bioRxiv 2026: https://www.biorxiv.org/content/10.64898/2026.01.28.702211v2
Ground-truth screen: Olivieri et al., 2020 (CRISPR olaparib sensitisation)
Compared tools: SLant; MAGICAL
Omic validation: DepMap; TCGA / cBioPortal; Ensembl

Frequently asked questions

What is synthetic lethality, and why use it as the test case?

Synthetic lethality is when two genes can each be lost individually with little effect, but losing both together kills the cell - the principle behind PARP inhibitors like olaparib in BRCA1/2-mutant cancers. It's an ideal benchmark because it's a two-gene problem with a known, experimentally validated answer (here, Olivieri et al.'s 2020 olaparib-sensitisation screen).

How did you get from ~400,000 pairs to 11?

We scored ~400,000 pairs of clinically relevant genes (RPE1 TP53-deficient background) for synthetic lethality, then layered two agent-assigned scores: novelty (has a similar pair been tested before?) and feasibility (pan-essential genes are harder to validate). Filtering to the top 100 by SL score with novelty and feasibility both above 5 left 11 pairs - now prioritised for experimental validation.

Isn't 73.8% just data leakage from training?

We re-tested on a CRISPR screen released after the models' training cutoff. Accuracy held at ≈60% on genuinely unseen data - a modest drop that indicates real generalisation rather than memorisation, much like coding agents spotting bugs they were never explicitly trained on.

What does Hydra add on top of the LLM scores?

Omic grounding. An LLM score is a prior, not a verdict. Hydra plans and runs the bioinformatics for each pair across DepMap, TCGA, cBioPortal and Ensembl. On TP53 + TUBGCP2 it ruled out the naive mechanism (different chromosomes; TUBGCP2 pan-essential) but surfaced a tumour-selective angle: glioblastoma's chromosome-10 monosomy plus TUBGCP2 overexpression leaves a single, targetable copy.

Where can I read the full method?

The full method, benchmark, and the 11 shortlisted pairs are described in our bioRxiv preprint, linked from this page.

Run this analysis on your question

Hydra plans, executes, and validates, so you reach a defensible answer in hours, not weeks.
