The De-Hyped Journey of AlphaFold — Simplified

Amgad Muhammad
Published in The Startup
Dec 3, 2020 · 4 min read


source: https://deepmind.com/blog/article/AlphaFold-Using-AI-for-scientific-discovery

DeepMind made the news again with a new breakthrough in protein folding using AI. Here is a de-hyped description of the journey and why it is a big deal.

What is the problem?

Proteins are made of amino acids, and they are responsible for almost every function in our body: digestion, muscle movement, sensing temperature, and so on.

In biology, shape and function go hand in hand: a protein's shape determines its function.

  • Y-shaped proteins are utilized by the immune system as antibodies.
  • Collagen proteins, which are responsible for skin elasticity, are shaped like cords.

Unfortunately, knowing the sequence of amino acids that make up a protein doesn't tell us what shape it will fold into so it can carry out its function; this is known as the protein folding problem. Misfolded proteins are also behind a number of diseases and organ failures.

According to Levinthal’s paradox, it would take longer than the age of the known universe to randomly enumerate all possible configurations of a typical protein before reaching the true 3D structure. Think of an origami that can be folded in 10^300 ways.
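To get a feel for those numbers, here is a back-of-the-envelope sketch. The 300-residue length and the three conformations per residue are illustrative assumptions, not exact biophysics:

```python
# Back-of-the-envelope version of Levinthal's paradox.
# Assumptions (illustrative only): a 300-residue protein,
# ~3 stable conformations per residue, and one configuration
# sampled per picosecond.
residues = 300
conformations_per_residue = 3

total_configurations = conformations_per_residue ** residues
print(f"about 10^{len(str(total_configurations)) - 1} possible configurations")

seconds_needed = total_configurations * 1e-12   # one configuration per picosecond
age_of_universe_s = 4.35e17                     # ~13.8 billion years in seconds
print(f"{seconds_needed / age_of_universe_s:.1e} times the age of the universe")
```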

Why is this important?

AlphaFold tries to solve the protein folding problem using machine learning. Given a sequence of amino acids, can ML predict the 3D shape of the protein, and hence its function?

If we can solve the protein folding problem, then given a string of amino acids we can:

1- Reverse engineer misfolded proteins to zero in on the genes that are causing the misfolding.

2- Engineer Y-shaped proteins for our immune systems.

3- Build Collagen proteins to fight old age.

4- Synthesize new proteins to fix misfolded proteins and cure diseases.

The list of possibilities can go on….

AlphaFold 1 Approach and Results:

As there are many ways a protein can fold, the problem had to be constrained so it could be solved. One way of constraining the problem is amino acid contact prediction.

Imagine that you have a ribbon (the amino acid sequence), you fold it on itself multiple times, and then you press a pen on one spot of the folded ribbon to stain it. If you unfold the ribbon, all the places that were in contact with the pen will carry the same stain. Amino acids in contact work the same way: they evolve together and share similar characteristics, so if you can find which amino acids are in contact, you limit the ways the sequence can fold into the final protein shape.
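One classical way to turn that co-evolution signal into contact hints is to take a multiple sequence alignment (MSA) of related proteins and measure how strongly pairs of columns vary together. The toy example below uses a tiny made-up MSA and mutual information; real pipelines (including the features AlphaFold consumes) use far richer statistics, so treat this purely as an illustration of the idea:

```python
from collections import Counter
from math import log2

# A tiny, made-up multiple sequence alignment (one string per related sequence).
# Columns 0 and 4 are constructed to co-vary, mimicking two positions in contact.
msa = ["ARNDA", "GRNDG", "ARNEA", "GRNEG", "ARNDA", "GRNEG"]

def column(i):
    return [seq[i] for seq in msa]

def mutual_information(i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    pi, pj = Counter(column(i)), Counter(column(j))
    pij = Counter(zip(column(i), column(j)))
    mi = 0.0
    for (a, b), count in pij.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi

# A high score hints that the two positions co-evolve, i.e. they are
# likely close to each other in the folded structure.
print(mutual_information(0, 4))   # co-varying pair   -> 1.0
print(mutual_information(1, 2))   # conserved columns -> 0.0
```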

That's how it was typically done. What DeepMind did differently (in a very simplified version) is that:

  1. They used a different way to find amino acids in contact: they predicted the distances between amino acid pairs instead (using the good old convolutional neural network), which proved to be much more accurate than contact prediction. If you can accurately predict the distance between two amino acids, you can solve the contact prediction problem easily, since a small distance means the pair is in contact.
  2. The predicted distances are then passed to gradient descent to start folding (a minimal sketch of this step follows below).
source: https://www.nature.com/articles/s41586-019-1923-7.epdf?author_access_token=Z_KaZKDqtKzbE7Wd5HtwI9RgN0jAjWel9jnR3ZoTv0MCcgAwHMgRx9mvLjNQdB2TlQQaa7l420UCtGo8vYQ39gg8lFWR9mAZtvsN_1PrccXfIbc6e-tGSgazNL_XdtQzn1PHfy21qdcxV7Pw-k3htw%3D%3D
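To make step 2 concrete, here is a minimal sketch of the fold-by-gradient-descent idea: given a matrix of target pairwise distances (here taken from a random toy structure, standing in for the network's predictions), move 3D coordinates to shrink the mismatch. This is only the optimization step in isolation; DeepMind's actual potential also includes torsion angles and physical terms:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20  # toy number of residues

# Stand-in for the network output: pairwise distances of a random "true" structure.
true_coords = rng.normal(size=(n, 3))
target = np.linalg.norm(true_coords[:, None] - true_coords[None, :], axis=-1)

def mismatch(coords):
    dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    return np.abs(dist - target)[np.triu_indices(n, 1)].mean()

# Start from random coordinates and descend on the squared distance error.
coords = rng.normal(size=(n, 3))
print("initial mismatch:", round(mismatch(coords), 3))

lr = 0.001
for _ in range(5000):
    diff = coords[:, None] - coords[None, :]      # (n, n, 3) pairwise offsets
    dist = np.linalg.norm(diff, axis=-1)          # (n, n) pairwise distances
    np.fill_diagonal(dist, 1.0)                   # avoid division by zero
    err = dist - target
    np.fill_diagonal(err, 0.0)
    grad = 4 * (err / dist)[:, :, None] * diff    # gradient of the sum of err^2
    coords -= lr * grad.sum(axis=1)

print("final mismatch:", round(mismatch(coords), 3))
```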

AlphaFold was benchmarked on CASP, a biennial global competition that has become the gold standard for assessing predictive techniques in protein folding, using the TM score as the measure of accuracy.

The TM score, ranging between 0 and 1, measures how closely the overall (backbone) shape of a proposed structure matches the native structure.
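For reference, the standard TM score formula is easy to compute once the two structures have been optimally superposed; the sketch below assumes that superposition has already been done and uses C-alpha coordinates only:

```python
import numpy as np

def tm_score(pred, native):
    """TM score of a predicted structure against the native one.

    Both arguments are (L, 3) arrays of already-superposed C-alpha
    coordinates; the alignment/superposition step is assumed done.
    """
    L = len(native)
    d0 = 1.24 * (L - 15) ** (1 / 3) - 1.8        # length-dependent scale (angstroms)
    d = np.linalg.norm(pred - native, axis=1)    # per-residue deviation
    return float(np.mean(1.0 / (1.0 + (d / d0) ** 2)))

# Identical structures score 1.0; larger deviations push the score toward 0.
rng = np.random.default_rng(1)
native = rng.normal(size=(100, 3)) * 10
print(tm_score(native, native))                                        # 1.0
print(tm_score(native + rng.normal(scale=4, size=(100, 3)), native))   # much lower
```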

AlphaFold created high-accuracy structures (with template modelling (TM) scores of 0.7 or higher) for 24 out of 43 domains, whereas the next best method, which used sampling and contact information, achieved such accuracy for only 14 out of 43.

AlphaFold 2 Approach and Results:

The paper has not been published yet, but based on some speculation, the changes are:

  1. They replaced the CNN with a Transformer, a new deep learning architecture that was introduced in 2017. No surprise here; across the field, more and more researchers are using Transformers to replace CNNs and LSTMs (a generic sketch of the attention idea follows this list).
  2. They used end-to-end learning to replace some of the auxiliary losses, instead of extracting distances and then using gradient descent for folding in two disconnected steps.
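Since the AlphaFold 2 paper is not out yet, the sketch below is only a generic scaled dot-product self-attention block, the basic Transformer ingredient, and not DeepMind's actual architecture. The point is simply that every residue can attend to every other residue directly, rather than through a stack of local convolutions:

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Plain scaled dot-product self-attention over per-residue features.

    A generic Transformer building block for illustration only,
    not AlphaFold 2's (unpublished) architecture.
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])            # every residue vs. every residue
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over positions
    return weights @ v

rng = np.random.default_rng(0)
L, d = 16, 8                   # toy sequence length and feature size
x = rng.normal(size=(L, d))    # per-residue features (e.g. derived from the MSA)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)   # (16, 8)
```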

According to the news, they scored 90/100 in prediction accuracy, which could mean that the new method had TM scores of 0.6 or higher for 38 of the 43 tested structures.

Once the paper is published, I'll update the approach and the results for AlphaFold 2, but I hope I helped you understand what the problem is and why it is such a big deal.
