Proteins are integral biomolecules responsible for “executing” the instructions contained in DNA. As DNA is a sequence of nucleotides, the proteins created by DNA are also sequences. In humans, an alphabet of 20 amino acids form the characters of protein strings, which vary in length from a few dozen to a few thousand. Deciphering the “language” of proteins is far from an easy task, and even large-scale molecular simulations fail to accurately portray the structure that protein sequences form.
This project aims to develop and deploy an end-to-end hardware-accelerated NLP system to expedite drug discovery using unsupervised learning.