acp_data = ACPDataset(DATA_STORE)
amp_data = AMPDataset(DATA_STORE)
dna_data = DNABindDataset(DATA_STORE)
Details of the pretrained model:
Protein Sequence Embeddings (ProSE)
- Multi-task and masked language model-based protein sequence embedding models.
- https://github.com/tbepler/prose
- The procedure for generating fasta files using the BioPython library is detailed in the data module.
- fasta files are created by the generate_fasta_files() method in the dataset classes, for example: acp_data.generate_fasta_files()
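The generate_fasta_files() implementation lives in the data module; as a rough sketch of the idea, BioPython's SeqIO can write a list of SeqRecord objects out as a fasta file. The record ids and peptide sequences below are made up for illustration; in the real pipeline they come from the dataset.

```python
import os, tempfile
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical peptide records; real ids/sequences come from the dataset.
records = [
    SeqRecord(Seq("GLFDIVKKVVGALGSL"), id="pep_0", description=""),
    SeqRecord(Seq("FLPIVGKLLSGLL"), id="pep_1", description=""),
]

# SeqIO.write returns the number of records written.
fasta_path = os.path.join(tempfile.mkdtemp(), "toy.fasta")
n_written = SeqIO.write(records, fasta_path, "fasta")
print(n_written)  # 2
```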
In order to create embeddings in bulk from fasta files:
- the ProSE Github repo needs to be cloned
- a conda environment for it needs to be created as detailed in the repo (link above)
- then in that conda environment at the root of the cloned repo, the following commands need to be run
Sample python commands to generate embeddings from the pretrained model in the ProSE codebase are as follows:
- The following command will generate average-pooled features from the pretrained ProSE model for the sequences in the fasta file provided as input. Remove -d 0 from the command to run on CPU.

python embed_sequences.py -d 0 --pool avg -o ~/.peptide/datasets/acp/lstm/acp_avgpool_test.h5 ~/.peptide/datasets/acp/fasta/acp_test.fasta
# loading the pre-trained ProSE MT model
# writing: /home/vinod/.peptide/datasets/acp/lstm/acp_avgpool_test.h5
# embedding with pool=avg
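Conceptually, pooling collapses ProSE's per-residue embedding matrix into a single fixed-length vector, so sequences of different lengths all become comparable feature vectors. A minimal NumPy sketch (the embedding dimension of 100 is purely illustrative, not ProSE's actual size):

```python
import numpy as np

def pool_embedding(residue_embs: np.ndarray, pool: str = "avg") -> np.ndarray:
    """Collapse a (seq_len, emb_dim) per-residue embedding matrix into
    one fixed-length (emb_dim,) feature vector."""
    if pool == "avg":
        return residue_embs.mean(axis=0)
    if pool == "max":
        return residue_embs.max(axis=0)
    raise ValueError(f"unknown pool: {pool}")

# Sequences of different lengths map to vectors of the same size.
short_seq = np.random.rand(10, 100)   # 10 residues
long_seq = np.random.rand(150, 100)   # 150 residues
print(pool_embedding(short_seq).shape, pool_embedding(long_seq, "max").shape)
```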
- Sample command for max pooling:

python embed_sequences.py -d 0 --pool max -o ~/.peptide/datasets/acp/lstm/acp_maxpool_train.h5 ~/.peptide/datasets/acp/fasta/acp_train.fasta
# loading the pre-trained ProSE MT model
# writing: /home/vinod/.peptide/datasets/acp/lstm/acp_maxpool_train.h5
# embedding with pool=max
- For AMP, some type of pooling needs to be done on the non-truncated sequences, as train and test have different max sequence lengths.

python embed_sequences.py -d 0 --pool avg -o ~/.peptide/datasets/amp/lstm/amp_avgpool_train.h5 ~/.peptide/datasets/amp/fasta/amp_train.fasta
# loading the pre-trained ProSE MT model
# writing: /home/vinod/.peptide/datasets/amp/lstm/amp_avgpool_train.h5
# embedding with pool=avg
- Truncated data set max pool command:

python embed_sequences.py -d 0 --pool max -o ~/.peptide/datasets/amp/lstm/amp_maxpool_test_seqlen_150.h5 ~/.peptide/datasets/amp/fasta/amp_test_seqlen_150.fasta
# loading the pre-trained ProSE MT model
# writing: /home/vinod/.peptide/datasets/amp/lstm/amp_maxpool_test_seqlen_150.h5
# embedding with pool=max
- Same as AMP, some pooling is needed for the full non-truncated sequences.

python embed_sequences.py -d 0 --pool avg -o ~/.peptide/datasets/dna/lstm/dna_avgpool_test.h5 ~/.peptide/datasets/dna/fasta/dna_test.fasta
# loading the pre-trained ProSE MT model
# writing: /home/vinod/.peptide/datasets/lstm/avg/dnabind_avgpool_test.h5
# embedding with pool=avg
- Truncated sequence example with max pooling:

python embed_sequences.py -d 0 --pool max -o ~/.peptide/datasets/dna/lstm/dna_maxpool_train_seqlen_300.h5 ~/.peptide/datasets/dna/fasta/dna_train_seqlen_300.fasta
# loading the pre-trained ProSE MT model
# writing: /home/vinod/.peptide/datasets/dna/lstm/dna_maxpool_train_seqlen_300.h5
# embedding with pool=max
The procedure for reading the generated embeddings from the H5 file is detailed in the data module.
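The actual reading logic lives in get_lstm_emb() in the data module; as a minimal h5py sketch, and assuming the H5 file stores one fixed-length vector per sequence keyed by its fasta record id (an assumption about embed_sequences.py's output layout), loading the embeddings into a feature matrix looks roughly like this:

```python
import os, tempfile
import numpy as np
import h5py

# Build a tiny stand-in file with the assumed layout: one fixed-length
# embedding vector per sequence, keyed by its fasta record id.
path = os.path.join(tempfile.mkdtemp(), "toy_avgpool.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("seq_0", data=np.zeros(4, dtype=np.float32))
    f.create_dataset("seq_1", data=np.ones(4, dtype=np.float32))

# Read every embedding back and stack into an (n_sequences, emb_dim) matrix.
with h5py.File(path, "r") as f:
    keys = sorted(f.keys())
    X = np.stack([f[k][()] for k in keys])

print(X.shape)  # (2, 4)
```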
X_train, y_train, X_test, y_test = acp_data.get_lstm_emb('acp_avgpool_train.h5', 'acp_avgpool_test.h5')
X_train.shape, y_train.shape, X_test.shape, y_test.shape
pca = PCA(n_components=50)
X_train_pca = pca.fit_transform(X_train)
print(f'X_train_pca.shape: {X_train_pca.shape}')
print(
f"Explained variance ratio of the first 10 principal components:\n{pca.explained_variance_ratio_[:10]}"
)
visualize_2pcs(X_train_pca, y_train)
visualize_3pcs(X_train_pca, y_train)
Evaluation on full data
train_predict(X_train, y_train, X_test, y_test)
Evaluation on reduced data
X_test_pca = pca.transform(X_test)
train_predict(X_train_pca, y_train, X_test_pca, y_test)
X_train, y_train, X_test, y_test = acp_data.get_lstm_emb('acp_avgpool_train.h5', 'acp_avgpool_test.h5')
acp_avgpool_learner = Learner(X_train, y_train, X_test, y_test)
# max pool Learner
X_train, y_train, X_test, y_test = acp_data.get_lstm_emb('acp_maxpool_train.h5', 'acp_maxpool_test.h5')
acp_maxpool_learner = Learner(X_train, y_train, X_test, y_test)
acp_avgpool_learner.pipeline.steps
_, _ = acp_avgpool_learner.train()
Run grid search on max pooled embedding to compare results
_, _ = acp_maxpool_learner.train()
acp_maxpool_learner.run_label_spreading()
acp_avgpool_learner.run_label_spreading()
In the case of label spreading, avg pooling performs better.
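The internals of run_label_spreading() aren't shown here; assuming it wraps scikit-learn's semi-supervised LabelSpreading, the core pattern is to mark unlabeled points with -1 and read the inferred labels back from transduction_:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelSpreading

# Toy binary classification data standing in for the pooled embeddings.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Mask roughly half the labels as -1, sklearn's marker for "unlabeled".
rng = np.random.RandomState(0)
unlabeled = rng.rand(len(y)) < 0.5
y_partial = y.copy()
y_partial[unlabeled] = -1

# Labels propagate from labeled to unlabeled points over a knn graph.
model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y_partial)
acc = (model.transduction_[unlabeled] == y[unlabeled]).mean()
print(f"accuracy on unlabeled points: {acc:.2f}")
```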
acp_avgpool_learner.predict()
acp_maxpool_learner.predict()
XGB shows improvement when using max pooled embeddings over avg pooled
Saving results.
acp_avgpool_learner.predict_results.to_csv(f'{EXPERIMENT_STORE}/acp_lstm_avgpool_learner.csv')
acp_maxpool_learner.predict_results.to_csv(f'{EXPERIMENT_STORE}/acp_lstm_maxpool_learner.csv')
X_pca = acp_avgpool_learner.pick_k()
acp_avgpool_learner.analyze_clusters(X_pca, k=6)
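The selection logic inside pick_k() isn't shown; assuming it follows the usual elbow heuristic, the idea is to fit KMeans over a range of k and look for where the inertia curve flattens. The blob data below is a toy stand-in for the PCA-reduced embeddings:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy stand-in for the PCA-reduced embeddings, with 4 true clusters.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Record inertia (within-cluster sum of squares) for each candidate k;
# the "elbow" where the curve flattens suggests a good k.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(2, 9)}
for k, v in inertias.items():
    print(k, round(v, 1))
```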
X_train, y_train, X_test, y_test = amp_data.get_lstm_emb('amp_avgpool_train.h5', 'amp_avgpool_test.h5')
X_train.shape, y_train.shape, X_test.shape, y_test.shape
pca = PCA(n_components=50)
X_train_pca = pca.fit_transform(X_train)
print(f'X_train_pca.shape: {X_train_pca.shape}')
print(
f"Explained variance ratio of the first 10 principal components:\n{pca.explained_variance_ratio_[:10]}"
)
visualize_2pcs(X_train_pca, y_train)
visualize_3pcs(X_train_pca, y_train)
Evaluation on full data
train_predict(X_train, y_train, X_test, y_test)
Evaluation on reduced data
X_test_pca = pca.transform(X_test)
train_predict(X_train_pca, y_train, X_test_pca, y_test)
X_train, y_train, X_test, y_test = amp_data.get_lstm_emb('amp_maxpool_train.h5', 'amp_maxpool_test.h5')
amp_maxpool_learner = Learner(X_train, y_train, X_test, y_test)
# avg
X_train, y_train, X_test, y_test = amp_data.get_lstm_emb('amp_avgpool_train.h5', 'amp_avgpool_test.h5')
amp_avgpool_learner = Learner(X_train, y_train, X_test, y_test)
amp_avgpool_learner.pipeline.steps
_, _ = amp_avgpool_learner.train()
Run grid search on max pooled embedding to compare results
_, _ = amp_maxpool_learner.train()
amp_maxpool_learner.run_label_spreading()
amp_avgpool_learner.run_label_spreading()
No clear winner between the two pooling strategies here.
amp_avgpool_learner.predict()
amp_maxpool_learner.predict()
Again, XGB shows improvement with max pooled embeddings.
Save results.
amp_avgpool_learner.predict_results.to_csv(f'{EXPERIMENT_STORE}/amp_lstm_avgpool_learner.csv')
amp_maxpool_learner.predict_results.to_csv(f'{EXPERIMENT_STORE}/amp_lstm_maxpool_learner.csv')
X_pca = amp_maxpool_learner.pick_k()
amp_maxpool_learner.analyze_clusters(X_pca, k=4)
X_train, y_train, X_test, y_test = dna_data.get_lstm_emb('dna_avgpool_train.h5', 'dna_avgpool_test.h5')
X_train.shape, y_train.shape, X_test.shape, y_test.shape
pca = PCA(n_components=50)
X_train_pca = pca.fit_transform(X_train)
print(f'X_train_pca.shape: {X_train_pca.shape}')
print(
f"Explained variance ratio of the first 10 principal components:\n{pca.explained_variance_ratio_[:10]}"
)
visualize_2pcs(X_train_pca, y_train)
visualize_3pcs(X_train_pca, y_train)
Evaluation on full data
train_predict(X_train, y_train, X_test, y_test)
Evaluation on reduced data
X_test_pca = pca.transform(X_test)
train_predict(X_train_pca, y_train, X_test_pca, y_test)
# max pool Learner
X_train, y_train, X_test, y_test = dna_data.get_lstm_emb('dna_maxpool_train.h5', 'dna_maxpool_test.h5')
dna_maxpool_learner = Learner(X_train, y_train, X_test, y_test)
# avg
X_train, y_train, X_test, y_test = dna_data.get_lstm_emb('dna_avgpool_train.h5', 'dna_avgpool_test.h5')
dna_avgpool_learner = Learner(X_train, y_train, X_test, y_test)
dna_avgpool_learner.pipeline.steps
_, _ = dna_avgpool_learner.train()
Run grid search on max pooled embedding to compare results
dna_maxpool_learner.pipeline.steps
_, _ = dna_maxpool_learner.train()
dna_maxpool_learner.run_label_spreading()
dna_avgpool_learner.run_label_spreading()
dna_avgpool_learner.predict()
dna_maxpool_learner.predict()
Again, XGB shows improvement with max pooled embeddings.
Save results.
dna_avgpool_learner.predict_results.to_csv(f'{EXPERIMENT_STORE}/dna_lstm_avgpool_learner.csv')
dna_maxpool_learner.predict_results.to_csv(f'{EXPERIMENT_STORE}/dna_lstm_maxpool_learner.csv')
X_pca = dna_maxpool_learner.pick_k()
dna_maxpool_learner.analyze_clusters(X_pca, k=6)