Synthegy, developed at EPFL, uses LLMs to rank synthesis routes against chemist-defined goals, matching expert judgments 71.2% of the time.
The framework was validated against 36 independent chemists across 368 evaluations.
The experiments reached alignment rates comparable to inter-expert agreement.
Designing a molecule from scratch is one of chemistry's hardest problems. It's not just about knowing what atoms to connect—it's about knowing the right order of reactions, when to protect sensitive parts of the molecule, and how to avoid dead ends that could ruin months of lab work.
Traditionally, that knowledge lives in the heads of experienced chemists. Now, a team at EPFL wants to put it into a language model.
Researchers led by Philippe Schwaller published a paper this week in Matter describing Synthegy, a framework that uses large language models as reasoning engines for chemical synthesis planning. The key insight is subtle but important: rather than asking AI to generate molecules, the team uses AI to evaluate synthesis routes that traditional software already produces.
Here's how it works: A chemist types in a goal in plain English, something like "form the pyrimidine ring in the early stages." Existing retrosynthesis software—which works by breaking target molecules into simpler pieces—then generates dozens or hundreds of possible synthesis routes.
Synthegy converts each route into text and hands it to an LLM, which scores every route on how well it matches the chemist's instruction. The best ones float to the top, with written explanations of why.
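The rank-by-instruction loop described above can be sketched in a few lines of Python. This is a minimal illustration, not Synthegy's actual code: the `Route` class, `route_to_text`, `rank_routes`, and the `toy_scorer` stand-in for an LLM call are all hypothetical names, and the sketch assumes routes are listed in forward (lab) order.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Route:
    """A synthesis route as an ordered list of step descriptions (hypothetical format)."""
    steps: List[str]


def route_to_text(route: Route) -> str:
    """Serialize a route into numbered plain-text lines an LLM could read."""
    return "\n".join(f"Step {i}: {s}" for i, s in enumerate(route.steps, start=1))


# A scorer takes (instruction, route_text) and returns (score, explanation).
# In a real system this would wrap an LLM call; here it is an assumption.
Scorer = Callable[[str, str], Tuple[float, str]]


def rank_routes(instruction: str, routes: List[Route], scorer: Scorer):
    """Score every route against the instruction and sort best-first."""
    scored = []
    for route in routes:
        score, why = scorer(instruction, route_to_text(route))
        scored.append((score, why, route))
    # Sort on the score only, highest first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


def toy_scorer(instruction: str, route_text: str) -> Tuple[float, str]:
    """Stand-in for the LLM: rewards routes whose first step mentions 'pyrimidine'."""
    lines = route_text.splitlines()
    if lines and "pyrimidine" in lines[0].lower():
        return 10.0, "ring formed in step 1"
    return 2.0, "ring formed late or not at all"


routes = [
    Route(["amide coupling", "pyrimidine ring formation"]),
    Route(["pyrimidine ring formation", "amide coupling"]),
]
ranked = rank_routes("form the pyrimidine ring in the early stages", routes, toy_scorer)
for score, why, route in ranked:
    print(score, why, route.steps)
```

Because the scorer is passed in as a plain function, swapping in a different model, or a different route generator upstream, would not change the ranking loop itself.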
"When making tools for chemists, the user interface matters a lot, and previous tools relied on cumbersome filters and rules," said Andres M. Bran, lead author of the study, in a statement from EPFL.
The system was validated in a double-blind study involving 36 independent chemists who reviewed 368 route pairs. Their selections matched Synthegy's 71.2% of the time, a number that's roughly in line with how often expert chemists agree with each other. Senior researchers (professors and research scientists) agreed with Synthegy more often than PhD students, suggesting the system captures the same strategic intuitions that come with experience.
AI has been making inroads in drug discovery for years, but most approaches focus on narrowly trained models for specific tasks. Synthegy, by contrast, is designed to be modular: it can plug into any retrosynthesis engine on the backend and any capable LLM on the reasoning side. The researchers tested several AI models, including GPT-4o, Claude, and DeepSeek-r1. Gemini-2.5-pro scored highest in the benchmark, while DeepSeek-r1 appears to be a strong open-source alternative that can run locally.
The framework also handles a second problem: reaction mechanism elucidation. This is the question of how a chemical reaction proceeds, meaning which electron movements take place at each step. Synthegy breaks reactions into elementary moves and has the LLM assess each candidate step for chemical plausibility. On simple reactions like nucleophilic substitutions, the best models achieved near-perfect accuracy.
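One way to picture step-by-step mechanism assessment is as a search in which a scorer picks the most plausible next elementary move. The sketch below is a toy under stated assumptions: the article does not describe the paper's actual search strategy, so the greedy choice, the `elucidate` function, and the abstract states "A", "B", "C", and "X" are all hypothetical, with a lookup table standing in for the LLM's plausibility judgment.

```python
from typing import Callable, Dict, List, Optional, Tuple


def elucidate(
    start: str,
    target: str,
    propose: Callable[[str], List[str]],
    plausibility: Callable[[str, str], float],
    max_steps: int = 10,
) -> Optional[List[str]]:
    """Greedily follow the most plausible elementary move until the target is reached.

    Returns the list of visited states, or None if the target was not reached.
    """
    path = [start]
    current = start
    for _ in range(max_steps):
        if current == target:
            return path
        candidates = propose(current)
        if not candidates:
            break  # dead end: no further moves available
        here = current  # bind the current state for the scoring lambda
        current = max(candidates, key=lambda nxt: plausibility(here, nxt))
        path.append(current)
    return path if current == target else None


# Toy data (assumption): abstract intermediates and made-up plausibility scores.
MOVES: Dict[str, List[str]] = {"A": ["B", "X"], "B": ["C"], "X": []}
SCORES: Dict[Tuple[str, str], float] = {("A", "B"): 0.9, ("A", "X"): 0.2, ("B", "C"): 0.8}

mechanism = elucidate(
    "A",
    "C",
    propose=lambda s: MOVES.get(s, []),
    plausibility=lambda a, b: SCORES.get((a, b), 0.0),
)
print(mechanism)
```

A greedy search is the simplest possible choice here; a real system would likely need to keep several candidate paths alive, since one wrong plausibility call sends a greedy search down a dead end.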
The potential use cases are broad. Drug discovery is the obvious one, a field where AI has already shown promise in tasks such as predicting cancer treatment outcomes. But the same approach applies anywhere chemists need to design new materials or optimize industrial reactions. One practical detail: evaluating 60 candidate routes with Synthegy takes roughly 12 minutes and costs about $2–3 in API fees.
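Those figures translate into a rough per-route budget. The arithmetic below is a back-of-the-envelope sketch using only the numbers quoted above; the `estimate` helper is hypothetical, and the assumption that time and cost scale linearly with the number of routes is ours, not the paper's.

```python
# Figures quoted in the article.
ROUTES = 60
MINUTES = 12
COST_LOW_USD, COST_HIGH_USD = 2.0, 3.0

seconds_per_route = MINUTES * 60 / ROUTES   # 12.0 seconds per route
cost_per_route_low = COST_LOW_USD / ROUTES  # about $0.033
cost_per_route_high = COST_HIGH_USD / ROUTES  # about $0.05


def estimate(n_routes: int):
    """Extrapolate time (minutes) and cost range (USD), assuming linear scaling."""
    minutes = n_routes * seconds_per_route / 60
    return minutes, n_routes * cost_per_route_low, n_routes * cost_per_route_high


print(f"{seconds_per_route:.0f} s per route, "
      f"${cost_per_route_low:.3f}-${cost_per_route_high:.3f} per route")
print(estimate(200))
```

Real costs would vary with the model chosen, route length, and prompt size, so these numbers are best read as an order-of-magnitude guide.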
The paper acknowledges current limits. LLMs sometimes misread the direction of a reaction in its text representation, leading to wrong feasibility calls. Smaller models perform no better than random guessing. Routes longer than 20 steps are harder to track coherently.