This library demonstrates the performance of a series of classifiers on the task of predicting whether a given amino acid sequence belongs to one or more of three target peptide/protein classes.
Specifically, this library compares the performance of classic ML classifiers trained on vastly different feature representations of the amino acid sequences, ranging from one-hot encodings to pre-trained embeddings from large protein language models.
Classification Tasks: Given a sequence of amino acids, classify whether the resulting peptide or protein is one of the following:
- Anticancer peptide (ACP)
- DNA-binding protein
- Antimicrobial peptide (AMP)
peptide
- Git clone the repo.
- Create a new conda environment using the environment.yml file.

```
cd peptide
conda env create -n peptide -f environment.yml
```

- To try things out, install this library in editable mode.

```
pip install -e .
```
ProSE
To create LSTM embeddings:
- Clone the ProSE repo.
- Complete the setup instructions detailed in the ProSE repo, also summarized below:
  - Download the pre-trained embedding models.
  - Create a conda environment and install its dependencies.
ESM
To create Transformer embeddings:
- Clone the ESM repo.
- Install ESM and its dependencies as detailed in their repo; one option is summarized below (see the sketch after this list):
  - Create a new conda environment with Python 3.9.
  - In that conda environment, `pip install torch` and `pip install fair-esm==0.5.0`.
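As a rough illustration of what the Transformer embeddings look like, here is a minimal sketch of extracting a per-sequence embedding with fair-esm. This is not code from this library; the model name (`esm1b_t33_650M_UR50S`), layer index, example sequence, and mean-pooling choice are assumptions for illustration — check the ESM repo for the models available in your fair-esm version.

```python
# Minimal sketch (not code from this library): extract a fixed-length ESM
# embedding for one sequence with fair-esm. Model name, layer index, and
# mean-pooling are illustrative assumptions.
import torch
import esm

model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("example_peptide", "KWKLFKKIEKVGQNIRDG")]  # hypothetical sequence
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33])
token_reprs = out["representations"][33]

# Mean-pool over residue positions (excluding BOS/EOS tokens) to get one vector
seq_embedding = token_reprs[0, 1 : len(strs[0]) + 1].mean(dim=0)
print(seq_embedding.shape)  # torch.Size([1280])
```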
Read through any (or all) of the following quick start guides to get a general idea. Then try running any of them as detailed below:
- Option 1: run any of the following Jupyter notebooks in the `nbs` folder (see the one-hot encoding sketch after this list):
  - `03_onehot.ipynb`
  - `04_lstm.ipynb`
  - `05_transformer.ipynb`
- Option 2: just open a Jupyter notebook and copy, paste & run cell-by-cell from any of the following quick start guides.
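For orientation, here is a generic sketch of one-hot encoding an amino acid sequence. It is not necessarily how `03_onehot.ipynb` implements it; the alphabet, padding length, and function name are illustrative.

```python
# Generic sketch (not necessarily how 03_onehot.ipynb does it): one-hot encode
# an amino acid sequence over the 20 standard residues, zero-padded to a fixed
# length. AMINO_ACIDS, max_len, and the function name are illustrative.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_IDX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(seq: str, max_len: int = 50) -> np.ndarray:
    """Return a (max_len, 20) one-hot matrix, truncating or zero-padding to max_len."""
    mat = np.zeros((max_len, len(AMINO_ACIDS)), dtype=np.float32)
    for i, aa in enumerate(seq[:max_len].upper()):
        if aa in AA_TO_IDX:  # unknown residues (e.g. X) stay all-zero
            mat[i, AA_TO_IDX[aa]] = 1.0
    return mat

x = one_hot_encode("KWKLFKKIEK")
print(x.shape)  # (50, 20)
```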
Settings file and DATASTORE
- The steps demonstrated in these notebooks use default locations for the datastore, etc., as detailed in Basics.
- The first cell in every notebook imports these settings for convenience.
- So if you intend to use the default settings, make sure to place the datasets in the `DATASTORE` as detailed next.
Note: The settings file and default folder structure will be created either by executing `from peptide.basics import *` in a cell or by running the first cell in any of the above notebooks. This will create a `DATASTORE` variable pointing to the path `~/.peptide/datasets`.
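For example, a typical first cell looks like this (the printed path will depend on your home directory):

```python
# Typical first notebook cell, per the note above: importing the default
# settings creates the settings file / folder structure and a DATASTORE
# variable pointing at ~/.peptide/datasets.
from peptide.basics import *

print(DATASTORE)  # e.g. /home/<user>/.peptide/datasets
```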
Copy Datasets Into DATASTORE
- Copy dataset directories into the location pointed to by the `DATASTORE` global variable, for example `~/.peptide/datasets`.
- The resulting folder structure must be:

```
~/.peptide/datasets/acp/train_data.csv
~/.peptide/datasets/amp/all_data.csv
~/.peptide/datasets/dna_binding/train.csv
```
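If you want to double-check the layout, here is a small illustrative snippet (not part of the library) that verifies the expected files exist under the default location:

```python
# Optional sanity check (illustrative, not part of the library): confirm the
# expected dataset files exist under the default DATASTORE location before
# running the notebooks.
from pathlib import Path

datastore = Path.home() / ".peptide" / "datasets"
expected = [
    datastore / "acp" / "train_data.csv",
    datastore / "amp" / "all_data.csv",
    datastore / "dna_binding" / "train.csv",
]
for path in expected:
    print(path, "OK" if path.exists() else "MISSING")
```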
This library was created using the awesome nbdev v1 and will soon be upgraded to nbdev v2.
Pretrained embeddings used in this library are from the following papers:
- LSTM: Protein Sequence Embeddings (ProSE), multi-task and masked language model-based protein sequence embedding models (GitHub)
  - Bepler, T., Berger, B. Learning the protein language: evolution, structure, and function. Cell Systems 12, 6 (2021). https://doi.org/10.1016/j.cels.2021.05.017
  - Bepler, T., Berger, B. Learning protein sequence embeddings using information from structure. International Conference on Learning Representations (2019). https://openreview.net/pdf?id=SygLehCqtm
- Transformer: Evolutionary Scale Modeling (ESM) (GitHub)
  - Rives, A., Meier, J., Sercu, T., Goyal, S., Lin, Z., Liu, J., Guo, D., Ott, M., Zitnick, C. L., Ma, J., Fergus, R. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. bioRxiv 622803. https://doi.org/10.1101/622803