Show HN: An open-source implementation of AlphaFold3
https://github.com/Ligo-Biosciences/AlphaFold3
Hi HN - we’re the founders of Ligo Biosciences and are excited to share an open-source implementation of AlphaFold3, the frontier model for protein structure prediction.
Google DeepMind and their new startup Isomorphic Labs are expanding into drug discovery. They developed AlphaFold3 as their model to accelerate drug discovery and create demand from big pharma. They already signed Novartis and Eli Lilly for $3 billion - Google’s becoming a pharma company! (https://www.isomorphiclabs.com/articles/isomorphic-labs-kick...)
AlphaFold3 is a biomolecular structure prediction model that can do three main things: (1) Predict the structure of proteins; (2) Predict the structure of drug-protein complexes; (3) Predict the structure of nucleic acid-protein complexes.
AlphaFold3 is incredibly important for science because it vastly accelerates the mapping of protein structures. Solving a single structure experimentally can take a PhD student their entire PhD. With AlphaFold3, you get a prediction in minutes that is on par with experimental accuracy.
There’s just one problem: when DeepMind published AlphaFold3 in May (https://www.nature.com/articles/s41586-024-07487-w), there was no code. This brought up questions about reproducibility (https://www.nature.com/articles/d41586-024-01463-0) as well as complaints from the scientific community (https://undark.org/2024/06/06/opinion-alphafold-3-open-sourc...).
AlphaFold3 is a fundamental advance in structure modeling technology, and the entire biotech industry deserves to benefit from it. Its applications are vast, including:
- CRISPR gene editing technologies, where scientists can see exactly how the DNA interacts with the scissor Cas protein;
- Cancer research - predicting how a potential drug binds to the cancer target. One of the highlights in DeepMind’s paper is the prediction of a clinical KRAS inhibitor in complex with its target.
- Antibody / nanobody to target predictions. AlphaFold3 improves accuracy on this class of molecules two-fold compared to the next best tool.
Unfortunately, no companies can use it since it is under a non-commercial license!
Today we are releasing the full model trained on single-chain proteins (capability 1 above), with the other two capabilities to be trained and released soon. We also include the training code. Weights will be released once training and benchmarking are complete. We wanted this to be truly open source, so we used the Apache 2.0 license.
DeepMind published the full structure of the model, along with each component’s pseudocode, in their paper. We translated this fully into PyTorch, which required more reverse engineering than we thought!
When building the initial version, we discovered multiple issues in DeepMind’s paper that would interfere with the training - we think the deep learning community might find these especially interesting. (Diffusion folks, we would love feedback on this!) These include:
- MSE loss scaling differs from Karras et al. (2022). The weighting provided in the paper does not downweight the loss at high noise levels.
- Omission of residual layers in the paper - we add these back and see benefits in gradient flow and convergence. Anyone have any idea why DeepMind may have omitted the residual connections in the DiT blocks?
- The MSA module, in its current form, has dead layers. The last pair weighted averaging and transition layers cannot contribute to the pair representation, hence no grads. We swap the order to the one in the ExtraMsaStack in AlphaFold2. An alternative solution would be to use weight sharing, but whether this is done is ambiguous in the paper.
More about those issues here: https://github.com/Ligo-Biosciences/AlphaFold3
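A minimal sketch of the first point (loss weighting), under stated assumptions. The Karras et al. (2022) weight is (sigma^2 + sigma_data^2) / (sigma * sigma_data)^2. The second function assumes the paper's weight uses a sum rather than a product in the denominator, (sigma^2 + sigma_data^2) / (sigma + sigma_data)^2; that form and the value sigma_data = 16 are not stated in this post, so treat both as assumptions and check them against the paper and the repo. The function names are made up for illustration.

```python
def edm_weight(sigma: float, sigma_data: float) -> float:
    """Loss weight from Karras et al. (2022): product in the denominator."""
    return (sigma**2 + sigma_data**2) / (sigma * sigma_data) ** 2


def additive_weight(sigma: float, sigma_data: float) -> float:
    """ASSUMED form of the paper's weight: sum in the denominator."""
    return (sigma**2 + sigma_data**2) / (sigma + sigma_data) ** 2


# sigma_data = 16 is an assumption, not a value taken from this post.
SIGMA_DATA = 16.0

if __name__ == "__main__":
    # Compare the two weights from low noise to high noise.
    for sigma in (0.1, 1.0, 16.0, 160.0, 1600.0):
        print(
            f"sigma={sigma:>7}: "
            f"edm={edm_weight(sigma, SIGMA_DATA):.6f}  "
            f"additive={additive_weight(sigma, SIGMA_DATA):.6f}"
        )
```

Under these assumptions the additive form stays between 0.5 and 1 at every noise level, so high-noise samples count about as much as low-noise ones. The Karras et al. form behaves like 1 / sigma^2 at low noise and flattens out at 1 / sigma_data^2 at high noise, so high-noise samples are strongly downweighted relative to low-noise ones.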
How this came about: we are building Ligo (YC S24), where we are using ideas from AlphaFold3 for enzyme design. We thought open sourcing it was a nice side quest to benefit the community.
For those on Twitter, there was a good thread a few days ago that has more information:
https://twitter.com/ArdaGoreci/status/1830744265007480934
A few shoutouts:
A huge thanks to OpenFold for pioneering the previous open source implementation of AlphaFold.
We did a lot of our early prototyping with proteinFlow, developed by Lisa at AdaptyvBio. We also look forward to partnering with them to bring you the next versions!
We are also partnering with Basecamp Research to supply this model with the best sequence data known to science.
Matthew Clark (https://batisio.co.uk) for his amazing animations!
We’re around to answer questions and look forward to hearing from you!
- Who would've thought only releasing pseudo-code isn't good enough...glad to see the scientific immune system fighting back against closed-source science. Your move Google.
-- snolbert
- How dare they make money with something that is not advertising!
-- nolist_policy
- There's nothing wrong with trade secrets, but that's business, not science.
-- throwaway48476
- I mean it shouldn't be enough to publish in Nature. The whole point of science is that it can be validated. It's totally fine that they're hosting their models for free on closed servers with limits, even though it's not exactly the most ergonomic.
-- lofatdairy
- It was already validated by winning CASP and the paper by Paul Adams (https://www.nature.com/articles/s41592-023-02087-4), which, although it reads like criticism, is actually high praise. Everything the model can do will be (or already has been) replicated by the open community.
Also, for work of the highest art (of which AF3 is an example), publication in Nature really is the fundamental unit of scientific currency because it ensures all their competitors will get hyped up and work extra-hard to disprove it.
-- dekhn
- The paper by Paul Adams used an earlier version of AlphaFold that was publicly available, not AlphaFold 3, which is not.
-- natechols
- My statement is correct; both AF papers were published in Nature, and both won CASP. AF3 is superior to AF2, which means if Adams wrote another paper, it would be on increasingly less interesting fine details.
-- dekhn
- This seems really neat!
DeepMind and AlphaFold are clearly moving in a closed-source direction, since they created Isomorphic Labs as a division of Alphabet essentially focused on doing this stuff closed source. In theory it seems nice for academic tools to have an open source version, although I'm not familiar enough with this field to point to a specific benefit of it.
So what's your plan for the company itself, do you intend to continue working on this open source project as part of your business model, or was it more of a one-off? Your website seems very nonspecific about what exactly you intend to be selling.
-- lacker
- Our long-term goal is to design enzymes for chemical manufacturing. We decided to build AlphaFold3 because we had seen how useful AlphaFold2 had been for the protein design field. No one else was building it fast enough for us, so we decided we should do it ourselves. We are committed to training and open-sourcing the full version with ligand and nucleic acid prediction capabilities as well, since it is so useful for the biotech industry.
-- EdHarris
- If I'm understanding correctly, the model code itself is only a tiny proportion of the challenge. The training compute and training data are far bigger parts.
Google has access to training compute on a scale perhaps nobody else has.
-- londons_explore
- Is that really the case though? Available compute sounds unlikely to be the limiting factor here, compared to data which is way scarcer than what's being used to train LLMs, and I suspect Google used mostly publicly available data for training unless they signed deals beforehand with biotechnology companies which have access to more data. That's possible of course, but that doesn't feel very google-y.
-- littlestymaar
- Thanks for releasing this, I've been looking forward to a truly open version I can use in a commercial setting. What a way to launch the company!
-- boldlybold
- Thanks!
-- EdHarris
- Hi, how are predictions verified? Does one still use experimental techniques (X-ray crystallography, cryo-EM, etc.) once you have the prediction? Or are predictions so close to reality that you can progress without experiment?
-- dwayne_dibley
- The predictions can be verified by comparing the predicted structure to the experimentally solved structure, either crystal or cryo-EM. The model is still training and improving; we will release the benchmarking results after it's complete.
-- EdHarris
- Have you considered publishing your own paper about your implementation? It would make it easier to cite in the literature later on. Would major journals accept such a paper? I would assume they would if they really had questions about reproducibility.
-- fngjdflmdflg
- OpenFold, which was AlphaFold2's open-source implementation, was published in Nature Methods. We will prepare a similar publication once the model is more mature and when we have a nice set of experiments showing the model's interesting properties.
-- EdHarris
- You probably want to change the name of this implementation as it's not truly AlphaFold3. I wouldn't be surprised if you got a C&D from DM for using the name.
-- dekhn
- Yes, this is a good point. We are actively speaking with our counsel to check this. Thanks for flagging, though.
-- EdHarris
- I did a very brief stint on computational proteomics. That stuff is absolutely next level.
-- benreesman
- Amazing! What kind of things did you work on?
-- EdHarris
- My job was mostly mundane machine learning: classification over very large categorical sets.
I never had anything more than a dim intuition of the serious chemistry going on before the bytes got to me.
-- benreesman
- Where were you working? That sounds super interesting.
-- MylesHollowed
- I was a contractor for like a month so I’m not at liberty to talk about the details.
There are a number of companies doing innovative things around quantifying proteins and their concentrations in various samples.
I had the privilege to rub elbows with folks working on such cool stuff.
-- benreesman
- I’m a big fan of what you folks are doing by the way.
Haskell (and Nix) people are fond of talking about “constraints as power”.
https://github.com/Ligo-Biosciences/AlphaFold3/blob/ebdf3b12...
-- benreesman
- Are you familiar with ColabFold?
https://github.com/sokrypton/ColabFold
-- inciampati
- What's your next step? Why did you decide to focus on enzyme design?
-- ck_one
- We think enzymes are super cool! You can build molecular assembly lines at the atomic scale with them. Many pharmaceuticals, such as the diabetes drug Januvia, are already manufactured with enzymes. Engineering them is a big bottleneck though - it takes years and millions of dollars. We want to speed this up with AI-powered design. The next step is the ligand-protein prediction capability of AlphaFold3, which is also super useful for modelling enzyme-substrate interactions.
-- EdHarris
- Possibly because it dovetails with pharma mfg and [potentially] food mfg. Could see a case made for enzymatically brewed 'meat inks' [very sorry for this term ;p] for 3d printing the next gen of lab meats.
-- ricopags