Leverage Natural Language Processing for Drug Discovery

Accelerate the learning of protein language representations by designing and deploying an NLP system to expedite drug discovery.

Proteins are the biomolecules responsible for “executing” the instructions contained in DNA. Because DNA is a sequence of nucleotides, the proteins it encodes are also sequences. In humans, an alphabet of 20 amino acids forms the characters of protein strings, which vary in length from a few dozen to a few thousand residues. Deciphering the “language” of proteins is far from easy: even large-scale molecular simulations fail to accurately predict the structures that protein sequences fold into.
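To make the "protein as a string" view concrete, the sketch below turns an amino-acid sequence into the integer token ids that a BERT-style language model would consume. The special tokens, the id ordering, the `encode` helper, and the example sequence are illustrative assumptions, not the vocabulary used by any particular protein model.

```python
# The 20 standard amino acids, written in their one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical special tokens, borrowed from BERT-style conventions.
SPECIAL_TOKENS = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"]

# Vocabulary: special tokens first, then one token per amino acid.
VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + list(AMINO_ACIDS))}


def encode(sequence, max_len):
    """Map a protein string to a fixed-length list of token ids."""
    # Residues outside the 20-letter alphabet fall back to [UNK].
    body = [VOCAB.get(residue, VOCAB["[UNK]"]) for residue in sequence.upper()]
    body = body[: max_len - 2]  # leave room for [CLS] and [SEP]
    ids = [VOCAB["[CLS]"]] + body + [VOCAB["[SEP]"]]
    ids += [VOCAB["[PAD]"]] * (max_len - len(ids))  # pad to max_len
    return ids


ids = encode("MKTAYIAKQR", max_len=16)  # arbitrary 10-residue example
print(ids)
```

Because each residue becomes one token, the token count tracks the protein length directly, so a protein with a few thousand residues produces an input a few thousand tokens long.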

This project aims to develop and deploy an end-to-end hardware-accelerated NLP system to expedite drug discovery using unsupervised learning.

Protein language models have enabled breakthrough approaches to protein structure prediction, function annotation, and drug discovery. A primary limitation to their widespread adoption is the high computational cost of training and inference, especially at longer sequence lengths. We present the architecture, microarchitecture, and hardware implementation of a protein design and discovery accelerator, ProSE (Protein Systolic Engine). ProSE has a collection of custom heterogeneous systolic arrays and special functions that process transfer learning model inferences efficiently. The architecture marries SIMD-style computations with systolic array architectures, optimizing coarse-grained operation sequences across model layers to achieve efficiency without sacrificing generality. ProSE performs Protein BERT inference at up to 6.9× speedup and 48× better power efficiency (performance/Watt) compared to one NVIDIA A100 GPU. Compared to TPUv3 (TPUv2), ProSE achieves up to 5.5× (12.7×) speedup and 173× (249×) better power efficiency.
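For readers unfamiliar with systolic arrays, the following is a minimal software sketch of how a generic output-stationary systolic array computes a matrix product, the operation that dominates BERT-style inference. It is a textbook illustration only: the function name `systolic_matmul`, the one-PE-per-output layout, and the skewed data schedule are assumptions made for this example and do not describe ProSE's actual microarchitecture.

```python
def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing C = A @ B.

    A is M x K and B is K x N, both given as lists of rows. The processing
    element PE(i, j) keeps the running sum for C[i][j]. Rows of A enter from
    the left and columns of B enter from the top, each delayed (skewed) by
    its row or column index, so operand pair k reaches PE(i, j) at cycle
    t = i + j + k.
    """
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0] * N for _ in range(M)]
    cycles = M + N + K - 2  # the last operand pair arrives at t = M + N + K - 3
    for t in range(cycles):
        # All PEs work in parallel in hardware; here we visit them in turn.
        for i in range(M):
            for j in range(N):
                k = t - i - j  # which operand pair reaches PE(i, j) this cycle
                if 0 <= k < K:
                    C[i][j] += A[i][k] * B[k][j]  # one multiply-accumulate
    return C, cycles


C, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C, cycles)
```

In this model every processing element performs at most one multiply-accumulate per cycle and operands are passed only between neighbors, which is why such arrays reuse data heavily and need little memory bandwidth per operation.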

Publications

*Eyes Robson, *Ceyu Xu, and Lisa Wu Wills, "ProSE: The Architecture and Design of a Protein Discovery Engine". ACM Architectural Support for Programming Languages and Operating Systems (ASPLOS) 2022. *Co-first authors.