Just know stuff, proteinML edition

Posted on May 20, 2025

Acknowledgement: A big thank you to Johanna Haffner of ETH Zürich, to owl_posting, and to my colleague Jonathan Ziegler, for giving feedback on this!

So I previously wrote Just know stuff. (Or, how to achieve success in a machine learning PhD.) Success in technical subjects requires, in large part, knowing a lot of stuff about that subject.

Two years have gone by since then, and in the mean time the field of bioML has become this whole thing. Isomorphic are betting on structure prediction for small molecule therapeutics. Four of the top 20 big pharma now rely on Cradle’s ML-guided lead optimization. Xaira unironically raised a $1B ‘seed round’. A host of startups are specialising along individual verticals: Evolutionary Scale develop foundation models, Nabla are targeting multi-pass membrane proteins, Axiom are tackling liver toxicity, and so on and so on.

Machine learning is genuinely pretty good at tackling many (though certainly not all) of the problems in biology. And that means it’s time for a whole new just know stuff list, reflecting what I work on today! Specifically, ML-for-proteins. (A JKS list for all of bioML would be too large for this margin to contain.)

The goal of this post is two-fold:

  • For those starting out in proteinML: this is a useful starting point to turn ‘unknown unknowns’ into ‘known unknowns’. Like the original, this list is not intended to be comprehensive, it’s more of a sample of interesting starting points.

  • For everyone else: randomly scroll through this list and see what catches your eye! Go down a Wikipedia rabbit hole. Show off to your friends about some cool facts you learnt. (Did you know there’s a kind of jellyfish cancer that might have just… budded off and turned into its own species? Wild.)

I originally wrote this as part of an introduction for those joining Cradle as new junior hires. The expectation is that they already have exceptional software and ML experience, but potentially only limited bio experience. Correspondingly, the contents of this list are skewed the same way: the ML is a bit more specialised whilst the biology is a bit more introductory.

Alright. Let’s go! 🚀

Data

Amateurs talk models, professionals talk data. (We’ll still discuss the shiny Nobel-prize-winning deep learning foundation models later too.) ‘What can we measure’ is pretty much the first question we ask in bioML (or any flavour of ML really), as that is also what we can hope to model. The big proteinML-adjacent categories here are:

  • Proteins:
    • Sequences: the chemical structure of a protein. These look like MALLSEYGLFLAKI.... Know the properties of each of the 20 amino acids that go into these.
      • Large datasets available; cheap to measure. Useful for learning what is evolutionarily conserved, which is usually a hint for what’s important.
    • Structures: the 3d positions of the atoms in a protein.
      • Medium-size datasets available; generally expensive to measure.
      • Hard to measure or not a well-defined concept for many kinds of proteins (e.g. membrane-bound, intrinsically disordered).
      • This determines what properties the protein has. On which note…
    • Properties/function:
      • activity (of an enzyme)
      • affinity (to ligands)
      • aggregation (with itself or other proteins)
      • expression
      • solubility (in water, …)
      • stability (thermostability, biostability, solvent stability)
      • immunogenicity
      • viscosity
      • …and so on.
      • Look up each of the above.
  • DNA (or RNA) sequences. These look like CATTTACTGTATGCTCGTCGG... (or CAUUUACUGUAUGCUCGUCGG...).
  • Omics:
    • This is the general word for measuring ‘most of X’ in a cell or cell culture. Generally a lot of data gathered at one point in time, destroying the cell in the process. Metabolomics, proteomics, genomics, transcriptomics, … . Collectively: ‘multiomics’.
    • RNAseq (‘transcriptomics’): a measure of RNA abundance; corresponds to how much of each RNA was produced (i.e. how much each gene was expressed).
      • Comes in two main flavours, single-cell RNAseq (scRNAseq) and bulk RNAseq.
      • Spatial transcriptomics: a 2d picture of transcript abundance, based on location within a cell or the location of cells within a tissue.

There are many large(-ish) datasets of these kinds of properties. For example here are some of the big protein datasets – take a stroll through them and see what’s available:

  • BFD
  • ColabFold
  • UniProt, UniRef
    • I’ve linked these as they’re a great starting point. Try downloading either the XML or fasta files, and parsing them to build yourself a large dataset of proteins. (A parsing sketch follows this list.)
  • Observed antibody space (OAS)
  • SwissProt
  • PDB
  • AlphaFold Protein Structure Database (AFDB), ESM Metagenomic Atlas (synthetic data; still very useful)
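
For instance, here’s a minimal sketch of the parsing exercise above: a dependency-free FASTA reader. (The filename is hypothetical; a real UniRef download is large and gzipped, so in practice you’d want gzip.open and streaming.)

```python
from collections import Counter

def parse_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Toy usage: sequence count and overall amino acid composition.
counts, n = Counter(), 0
for _, seq in parse_fasta("uniref50.fasta"):  # hypothetical filename
    counts.update(seq)
    n += 1
print(n, counts.most_common(5))
```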

Deep learning models

BioML really took off with the invention of AlphaFold2, which won a Nobel prize. There are many other uses of deep-learning-for-biology, but due to higher-quality and more-abundant data, protein models are particularly popular.

These are typically attention-based architectures, often with a focus (but not exclusively) on structure prediction. Both autoregressive and diffusion framings are popular.

At Cradle we describe a hierarchy of model sophistication: structure predictors → single property predictors → multiple property predictors → generators with property conditioning → multi-model workflows that tie these all together.

  • AlphaFold2 (AF2), AlphaFold3. Sequence-to-structure models.
    • There’s been a bit of a cottage industry of AlphaFold3 reproductions: Boltz-1, Chai-1, Protenix.
    • In particular, credit to Boltz-1 for being the first AF3 reproduction with a commercially-available license, prompting others to similarly relicense.
    • Take a look through their codebases; these are multi-person, multi-month efforts to build.
  • Evolutionary Scale Modeling 2 (ESM2), ESM3, ESMFold, ESM-C. These come in different sizes, e.g. ESM2-8M or ESM2-15B.
    • Implement the ESM2 architecture yourself. (Of the exercises suggested in this post, this is the one I’ve personally found most valuable.)
    • Often used as sequence-to-embedding models.
    • In particular ESM2-650M (yes, an earlier, smaller one!) is notorious for hitting a good balance between compute cost and model efficacy, so despite the development of more recent models it is frequently still the workhorse. (An embedding sketch follows this list.)
    • These models do not consume multiple sequence alignments (MSAs, see later). This seems to make them a bit weaker, but substantially faster to evaluate. (Computing MSAs takes a while. There has been recent work on computing MSAs on GPUs, however: mmseqs2-gpu)
  • ProteinMPNN
    • This is the opposite of the above! A structure-to-sequence (‘inverse folding’) model.
  • EvoDiff, Chroma, RFDiffusion, RFDiffusion All-Atom, RFAntibody.
    • Take a look at Figure 3 in the Chroma paper for some of the most fun-looking protein structures you’ll ever see.
    • There’s a pretty standard RFDiffusion → ProteinMPNN → AlphaFold2 workflow for de novo generation: create a backbone, fill in the residues, look at the structure.
  • MSATransformer
  • Picking a few other interesting examples amongst many: AlphaFlow, AlphaBind, and FrameFlow.
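
As a taste of the sequence-to-embedding use case mentioned above, here’s a minimal sketch following the usage pattern of the fair-esm package (the checkpoint name here is the published 650M one; swap in others as desired):

```python
import torch
import esm  # pip install fair-esm

# ESM2-650M: the 'workhorse' checkpoint discussed above.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

# Tokenise a batch of (label, sequence) pairs.
_, _, tokens = batch_converter([("protein1", "MALLSEYGLFLAKI")])

with torch.no_grad():
    out = model(tokens, repr_layers=[33])

# Final-layer per-residue embeddings; strip the BOS/EOS tokens.
emb = out["representations"][33][0, 1:-1]
print(emb.shape)  # (sequence length, 1280)
```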

Equally important are the things you can do with a foundation model, beyond just running them forward and predicting structure/sequence/etc. I particularly liked the recent BoltzDesign1 paper. I thought Chroma was cool for their strong emphasis on classifier-guided diffusion.

On model design, I particularly recommend this just-released blog post from Pascal Notin. It’s been noted for a while now that ‘bigger is not better’ in proteinML; this post goes further by offering a nicely nuanced take around what kinds of models do well and which do not.

Finally on deep learning models, here are a few not-protein models for variety:

  • Genetic models: those trained on ACGT strings of nucleotides. Notable examples include Enformer and Evo.
    • These are actually dominated by convolutional(ly-inspired) layers, as the input sequence is typically much larger than is seen for protein sequences. (See the ‘numbers to have in your head’ below.)
  • Gene models: different to genetic models, usually trained on scRNAseq data. Examples include Geneformer, scGPT, and…
  • Docking models, e.g. DiffDock is well-known.

Finally, on to architecture. This is mostly standard transformer/diffusion fare, but in addition take note of both triangular attention (which has O(n^3) computational complexity in the length of the protein!) and SO(3) invariance.
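
To make that cubic cost concrete, here’s a toy sketch (my own, not the real implementation) of the triangle multiplicative update from AF2’s pair stack; real versions add layer norms, gating, and output projections:

```python
import torch

def triangle_multiply_outgoing(z, w_a, w_b):
    """Toy 'outgoing edges' triangle multiplicative update, after AF2.

    z: pair representation, shape (n, n, c). Edge (i, j) is updated from
    edges (i, k) and (j, k) for every k -- summing over k for all (i, j)
    pairs is where the O(n^3) cost comes from.
    """
    a = z @ w_a  # (n, n, d)
    b = z @ w_b  # (n, n, d)
    return torch.einsum("ikd,jkd->ijd", a, b)

n, c, d = 64, 128, 32
z = torch.randn(n, n, c)
out = triangle_multiply_outgoing(z, torch.randn(c, d), torch.randn(c, d))
print(out.shape)  # (64, 64, 32)
```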

Computation

Other than the big ML models, that is.

  • Multiple sequence alignment (MSA). Useful as evolution seems to be good at exploring what works and what doesn’t work, so this is valuable information to give to a model. Know that this is computationally expensive and is usually approximated.
    • MAFFT (good for aligning a batch of sequences)
    • MMseqs2 (good for aligning a sequence against a database)
    • Try building some useful Python APIs around these commandline tools. (A minimal wrapper sketch follows this list.)
    • Know both the Needleman–Wunsch and Smith–Waterman algorithms. (A toy implementation follows this list.)
    • Sequence clustering
    • Optional: guide trees.
  • Frame-aligned point error (FAPE), predicted local distance difference test (pLDDT).
  • Molecular dynamics. Used to simulate an approximation to actual physics. (Not clear how good those approximations are though.)
    • Rosetta, OpenMM.
    • Famously championed by D. E. Shaw Research and their Anton supercomputers.
    • Neural network forcefields are the current hotness, and they seem to work quite well. These look a lot like neural differential equations.
    • Optional: learn a little of the internals of how MD works. Try implementing a toy version. (One follows this list.)
      • Symplectic ODE solvers with timesteps on the order of femtoseconds.
  • Riffing on the above, neural differential equations are super good for the many things that are dynamical systems.
    • Try implementing a toy model on some synthetic data. Demo here.
  • Pharmacokinetics/pharmacodynamics models e.g. via nonlinear mixed effects. These dynamical systems are used when we want to understand how our shiny new protein actually interacts with a patient’s physiology. They’re an essential part of getting any drug approved. Frequently requires integrating ODEs, and in particular stiff ODEs. (A toy example follows this list.)
  • Lab automation: PyLabRobot and their community.
    • There’s a line of work on interacting with wet labs through LLMs that I think is amazingly cool.
    • If you’re fortunate enough to have some time to play on a lab robot, then definitely try getting something like this (or an equivalent tool) set up.
  • Other useful libraries: RDKit, PyMOL, Biotite (better than Biopython!), pyarrow (for all the variable-length data we have, although just passing around a triple of (data, offset, size) numpy arrays is often great for many applications).
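
As promised above, a minimal sketch of wrapping MAFFT’s CLI in Python. (Assumes the mafft binary is on your PATH; MAFFT writes the alignment to stdout.)

```python
import os
import subprocess
import tempfile

def mafft_align(seqs):
    """Align a batch of sequences by shelling out to MAFFT."""
    with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as f:
        for i, seq in enumerate(seqs):
            f.write(f">seq{i}\n{seq}\n")
        path = f.name
    try:
        result = subprocess.run(["mafft", "--auto", path],
                                capture_output=True, text=True, check=True)
        return result.stdout  # aligned sequences, FASTA-formatted
    finally:
        os.unlink(path)

print(mafft_align(["MALLSEYGLFLAKI", "MALLSEGLFLAKI"]))
```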
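
And the promised toy Needleman–Wunsch: global alignment as dynamic programming. (Real tools use substitution matrices like BLOSUM62 and affine gap penalties rather than flat scores; Smith–Waterman is the local-alignment variant, obtained by clamping scores at zero.)

```python
import numpy as np

def needleman_wunsch(s1, s2, match=1, mismatch=-1, gap=-2):
    """Global alignment score between two sequences."""
    m, n = len(s1), len(s2)
    F = np.zeros((m + 1, n + 1))
    F[:, 0] = gap * np.arange(m + 1)  # leading gaps
    F[0, :] = gap * np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            F[i, j] = max(F[i - 1, j - 1] + sub,  # (mis)match
                          F[i - 1, j] + gap,      # gap in s2
                          F[i, j - 1] + gap)      # gap in s1
    return F[m, n]

print(needleman_wunsch("MALLSEYGLFLAKI", "MALLSEGLFLAKI"))  # one deletion
```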
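
Here’s the toy MD version suggested above: velocity Verlet, a symplectic integrator, applied to a single particle in a harmonic well. (Reduced units; real MD has femtosecond timesteps, thermostats, and far richer force fields.)

```python
import numpy as np

def velocity_verlet(x, v, force, dt, n_steps):
    """Symplectic velocity-Verlet integration of Newton's equations."""
    f = force(x)
    for _ in range(n_steps):
        v = v + 0.5 * dt * f   # half-kick
        x = x + dt * v         # drift
        f = force(x)
        v = v + 0.5 * dt * f   # half-kick
    return x, v

force = lambda x: -x  # unit-mass particle on a unit spring
x, v = velocity_verlet(np.array([1.0]), np.array([0.0]), force,
                       dt=0.01, n_steps=1000)
# Symplectic integrators (approximately) conserve energy over long runs:
print(0.5 * (x**2 + v**2))  # stays close to the initial 0.5
```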
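
And a sketch of the PK side: a toy one-compartment oral-dosing model, integrated with a stiff-capable solver. (Real PK/PD work means nonlinear mixed effects across a population, not a single ODE; all parameter values here are made up.)

```python
import numpy as np
from scipy.integrate import solve_ivp

def one_compartment(t, y, ka, ke):
    """Drug moves gut -> plasma at rate ka, then clears at rate ke."""
    gut, plasma = y
    return [-ka * gut, ka * gut - ke * plasma]

# Radau is an implicit method, suitable for the stiff systems that
# PK/PD models frequently are.
sol = solve_ivp(one_compartment, t_span=(0.0, 24.0), y0=[100.0, 0.0],
                args=(1.5, 0.2), method="Radau", dense_output=True)
hours = np.linspace(0.0, 24.0, 7)
print(sol.sol(hours)[1])  # plasma level: rises, peaks, then clears
```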

‘Reference point’ companies

Okay, there are loads of companies out there, and there’s no way that I can usefully list them all. So this is not a list of my favourite companies. Rather, each entry on this list is selected as an exemplar of one particular thing, so that collectively they offer a representative look across what companies do in this space.

  • A-Alpha Bio (Yeast mating)
  • Adaptyv Bio (Next-generation CRO)
  • Cradle (ML-guided lead optimization)
  • Dyno Tx (AAVs)
  • Evolutionary Scale (protein foundation models)
  • Opentrons (lab automation hardware)
  • Oxford Nanopore (NGS)
  • Plasmidsaurus (Plasmid screening)
  • Profluent (ML-guided CRISPR)
  • Twist Bioscience (DNA synthesis)
  • Unite Labs (lab automation software)

Traditionally, pharma companies have tended to follow the ‘fully integrated pharma company’ model: they own pretty much every part of the stack in-house. These days we’re starting to see much greater sophistication and specialisation, and correspondingly the need for more specialised companies to tackle each piece.

Academic groups

Here are a few of the heavy hitters I particularly like:

  • Charlotte Deane lab (Oxford). Seemingly single-handedly responsible for actually keeping the world’s antibody data organised.
  • David Baker lab (Washington). Won a Nobel prize. Pretty neat.
  • Tommi Jaakkola lab (MIT). Very good computationally, which is unusual in academic circles (!)
  • Pranam Chatterjee lab (Duke). Does a lot of sequence modelling (in a field largely dominated by structural work), which is really cool.

There are others, of course, but as usual this list is about offering a few reference points, not all of them! Apologies to any other academic not on this list! :)

Places to follow

I recommend reading basically everything that they produce on biology:

Biology

The big (oversimplified) picture: lots of things are interacting inside cells, and we want to intervene in this process. Sometimes these are cells in your body (maybe we’re giving you a drug), and sometimes these are cells in vats (maybe we’re making a drug in these cells).

Most of the time, ‘intervene’ means ‘design something that sticks to something that is already there’. For example:

  • Stick to some collection of substrates, to get them to interact. (When designing a bispecific T-cell engager: to bring together a T cell from your immune system and the cancer cell it should target.)
  • Stick to some chemical (including ‘big’ chemicals like proteins) involved in metabolic reactions, to prevent that part of the metabolic pathway from occurring.
  • Stick to an invading pathogen, so that it can’t damage you. (When designing an scFv: by sticking to the part of the pathogen that makes it dangerous. When designing an IgG: by marking the pathogen for another part of your immune system to destroy.)

It’s kind of amazing that we can interact with biology with tools as crude as: sticky things sticking to other things.

Go read this book.

A computer scientist’s guide to cell biology, by William W. Cohen. A wonderfully readable 100 pages.

The four big progressions

  • Central dogma: DNA → RNA → Proteins. Only sort-of true: obligatory XKCD.
  • Protein structure: primary → secondary → supersecondary → tertiary → quaternary.
  • Protein function: sequence determines structure determines function.
  • Drug development pipeline: lead identification → lead optimisation → clinical trials.
    • Also basically applies to non-drug development. E.g. enzymes that go into detergents.

Every time I give someone a verbal introduction to synthetic biology, these four progressions are where I start. They give the map of how we tend to think about things. As always in biology, asterisks apply, and these are simplifications.

Proteins

We’re here to do proteinML. Here’s the crash course in proteins.

  • Each of the 20 standard amino acids has different properties. Go learn them.
    • Glycine is the smallest (e.g. useful as a filler or a linker).
    • Cysteines can make disulfide bonds (e.g. useful or problematic by bonding together structures).
    • Tryptophan has the coolest name.
    • Nonstandard amino acids exist too, typically by doing some kind of chemistry on one of the main 20.
  • BLOSUM matrices.
  • Secondary structures: alpha helices, beta sheets, beta barrels, hairpins.
  • Post-translational modifications (PTMs).
  • Some major protein categories: antibodies, peptides, enzymes, transcription factors, membrane proteins. Know in what contexts we care about them.
  • Allosteric effects. Hemoglobin is a great example: as it picks up oxygen, its shape changes to make it even better at picking up oxygen.
  • His-tags
  • ‘Developability’ is an imprecise word about how likely a candidate protein is to succeed in later rounds of optimisation. Go look up some of the in silico approximations people come up with.
  • Now here’s why it all matters: some important examples of protein function, and some specific important proteins.
    • Drugs, detergents, pesticides, fluorescents, gene regulation, cell signaling, flagellar motors, …
    • Green fluorescent protein (GFP). This uses one of those beta barrels from earlier.
    • Ubiquitin
    • MHC. MHC-peptide complexes.
    • Titin
    • MCF (‘makes caterpillars floppy’)! …okay this one’s not important but it’s just way too much fun not to include.

“What is your favourite protein?” Try finding a fun answer to this question! (Mine is on the previous line.)

Antibodies

Wheeeee these are some of the most important kinds of proteins out there (in the spirit of the earlier discussion: they are sticky proteins that are good at sticking to precisely one thing, so e.g. they make excellent drugs) and they have their own zoo of terminology. So again, crash course:

  • There’s a zoo of different formats.
    • Fun challenge, try making yourself a poster describing the different kinds.
    • Look up the ‘periodic table of antibodies’ if you want to feel intimidated.
    • Some of the more important: Immunoglobulin G (IgG), fragment antigen-binding (Fab), single-chain variable fragment (scFv).
  • Learn about llama immunisation: a way to generate candidate antibodies that bind to something.
    • You can also get candidates in other ways, but this is clearly the most amusing option the first time you hear it. Discovered by accident, of course.
    • Combined with the release of the LLaMA language model, this has made llamas an unexpectedly large part of my life these days.
  • VHHs, i.e. a-piece-of-a-llama-antibody. Also called ‘nanobodies’, which is technically a trademark of Sanofi, but has become a generic word.
  • Specificity, i.e. what they bind to. Bivalent, multivalent, bispecific (bsAb), multispecific, polyreactivity, cross-reactivity. These similar terms have different meanings. (Despite which almost no-one seems to get all of these right. ‘Bisomething, multimumblemumble’. Good enough.)
  • Antibody-drug conjugates (ADCs), radionuclide antibody conjugate (RACs).
  • Immunogenicity. Anti-drug antibodies (ADAs).
    • If your drug is itself an antibody (it frequently is) then you may generate anti-antibody-antibodies.
  • Antibody numbering. Kabat, Chothia, IMGT. There are several others in use. Obligatory XKCD here too.
  • Each antibody piece is formed of two opposing beta sheets. This makes antibody structures very easy to visually identify.
  • Antibodies typically bind with their complementarity-determining regions (CDRs).
    • These are highly random due to VDJ genetic recombination.
    • CDRs are held on to by the framework region of the antibody. (Those are then attached to the rest of the antibody.)
    • You can dramatically adjust binding by mutating the framework region too! Cradle won a competition doing it. Twelve times over, in fact. Technically this has been known for a while but it only started becoming more widespread knowledge in the past year or so.

DNA, RNA, anything genetic

Challenge: which way round does DNA spiral? How many base-pairs are there per twist? Once you know it you can’t unsee it when it’s drawn wrong in art!

  • Ribozymes. Because the simple rule that ‘enzymes are proteins’ is the kind of simple statement that biology laughs at.
  • Codons (three nucleotides in a row), codon tables. Start codons, stop codons, wobble base pairs. (A translation snippet follows this list.)
    • Frame shifts.
    • It really turns out that the language of life is base 64.
  • Hairpins, aka ‘when you accidentally get sticky tape stuck to itself’.
  • Promoters, enhancers, repressors. Look up gene regulatory networks and transcription factors.
  • Learn about introns and exons.
    • At this point (now that you’ve seen all of the above), you can start to appreciate that biology works in the same way that spaghetti code can be said to work.
  • CRISPR, or possibly the most metal way of ‘creating an immune system’: copy-paste the invading virus into your own genome. Humans later stole this as a mechanism for genetic editing.
  • Adeno-associated viruses (AAVs).
  • Next-generation sequencing (NGS), e.g. nanopore.
  • mRNA, tRNA, snRNA, snoRNA, …. If you knew them all you’d be a zoologist.
  • Plasmids.
  • Chromatin, histones, DNA accessibility
  • Epistasis (fancy word to mean when mutation A and mutation B combine nonlinearly).
  • How is DNA made? It’s ordered from Twist.
    • I suppose technically the answer is techniques like chemical synthesis and oligo pools.
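
A quick snippet illustrating the codon machinery above, using the example DNA from earlier. (Biopython here for brevity, though as noted previously, Biotite offers similar functionality.)

```python
from Bio.Seq import Seq  # pip install biopython

dna = Seq("CATTTACTGTATGCTCGTCGG")  # the example sequence from earlier
rna = dna.transcribe()
print(rna)              # CAUUUACUGUAUGCUCGUCGG
print(rna.translate())  # HLLYARR -- one amino acid per codon (3 bases)
```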

A lot of protein design ends up bumping into interesting details arising from this level of the ‘biology tech stack’.

Hosts, metabolism, production

Okay, based on all of the above maybe we’ve designed a super-cool protein. Now we need to make it. We take our DNA from the previous step and we place it into one of:

  • E. coli
  • Yeast
  • Chinese hamster ovary (CHO) cells
  • Human cell lines (HEK293)
  • or a few other cell lines, but the above are some of the most common.

And then we put those cells in a vat and let them grow, producing our protein. There are also cell-free (‘TXTL’) methods, although these are still less common. Mammalian cells (e.g. CHO, HEK) are more finicky to work with, but are needed for certain more complicated proteins, like antibodies.

  • DNA knock-ins, knock-outs
  • Overflow metabolism. “Grr stop making acetate and start making my product instead.”
  • Strain optimization
    • Often goes hand-in-hand with process optimization. How to feed your cells and keep them happy.
  • Look up the process of transforming E. coli with a plasmid, and what each section of the DNA in that plasmid does. (‘Transfection’ is the corresponding word for getting DNA into eukaryotic cells.)

Perhaps set up a home lab and genetically engineer yourself to cure your lactose intolerance.

Assays

Assays! How do we actually measure biology in a wet lab? I’ve given these a special heading because they are that important. With the advent of bioML, assay design has become more important than ever. (a) We need high-quality data to feed in to the models. (b) The models are now better than us at a lot of the design work, and we’d just be getting in their way if we tried to do it ourselves.

Despite which… I’m not going to talk about them any further now. There’s a whole side to wet-lab biology here that I want to understand better myself before I can do it justice. I’d love to see (maybe from me, if no-one beats me to it) a post giving the computationalist’s guide to many of the more common types of wet lab assays out there: what do they measure, how are they designed, what technology unlocked their development, what are their failure modes? My guess is that jointly optimising our wet lab assays and ML tech stack is an opportunity that is still mostly untapped.

Miscellaneous

Again, remember that the purpose of this post is not to be comprehensive! Rather, just to offer a collection of initial signposts. For the sake of some broader reading, here are a few other miscellaneous things of interest that come up in day-to-day life during protein design.

  • Fluorescence-activated cell sorting (FACS)
  • Polymerase chain reaction (PCR)
    • Error-prone PCR
    • Before thermocyclers existed, people amplified their DNA by plunging glass tubes in pots filled with hot water and swishing them around manually (!)
  • Gene Ontology (GO); GEARS
  • Alanine scanning, deep mutational scanning.
  • N-terminal acetylation; C-terminal amidation.
  • Steric hindrance
  • Directed evolution
    • Somewhat superseded by ML-guided lead optimisation these days, though.

Numbers to have in your head

Rules of thumb help you estimate the scale of a challenge:

  • ~20k kinds of proteins in humans.
  • Protein sizes range from ~10 residues to ~45000 residues (‘PKZILLA-1’). The average residue weight is 110 Daltons, so you can estimate length from weight by moving the decimal point two places (e.g. a 55 kDa protein is roughly 500 residues).
  • Antibody binding:
    • 1mM is good enough to try optimizing (it’s about the upper cut-off for assays like SPR)
    • 6nM is good enough to sell.
    • 1nM is great.
    • anything in pM is probably overkill, go optimise some other thing.
  • $50k and 1 month to experimentally determine a single protein structure.
    • (Via Cryo-EM or X-ray crystallography. I’ll be honest that new structure measurements aren’t something I work with very often, so I might be a bit off on these numbers. There are also economies of scale if e.g. you’re measuring the same protein bound to multiple ligands.)
  • 96-well or 384-well plates are a typical throughput at which to measure the property of protein samples.
    • The wells at the edge of the plate often exhibit some extra variance in their measurements.
  • 2 weeks to 6 months to get a round of plate measurements.
  • 10 cents per base pair when synthesising DNA. That works out to around $5k-$50k per plate (e.g. a 96-well plate of ~1 kb genes is 96 × 1000 × $0.10 ≈ $10k). Gene synthesis is often the bulk of experimental cost when developing proteins.
    • This goes down by orders of magnitude if using combinatorial methods – at the cost of getting random distributions over your DNA sequences, rather than precisely controlled choices.

Conclusion

Phew!

That was a long list.

And yet, if you’re already adjacent to this field then this list probably still feels woefully incomplete. There are so many other interesting topics at the interface of proteins and ML (virtual cells, miniaturisation/Raygun, all-atom modelling, cell differentiation and pseudotime, modelling reactor scale-up challenges, …).

So hopefully that means I’ve hit a reasonable compromise.

To recap, the goal of this list is to offer a set of reference points for an introduction to proteinML: some ML, some bio, plus some commercial and academic reference points. It’s not a textbook, but it is a curriculum of sorts. If you’re a researcher or PhD student in the target audience – currently-in-ML, aspiring-in-bioML – then I hope this post might be useful to you!

As the above text mentions, I currently work as an ML researcher at Cradle in Zürich. So you know, obligatory disclaimer.
If you’re curious, we’re a Series B scaleup, we use ML to lead-optimize proteins way faster than folks ever could before, and now the rocketship is taking off and so we’re all hanging on for dear life.