Steps to train and evaluate ML models on embeddings from a pretrained LSTM.
 

Load Data

acp_data = ACPDataset(DATA_STORE)
amp_data = AMPDataset(DATA_STORE)
dna_data = DNABindDataset(DATA_STORE)

Pretrained Model - ProSE

Details of the pretrained model:

Protein Sequence Embeddings (ProSE)

FASTA + Biopython

  • The procedure for generating FASTA files using the Biopython library is detailed in the data module.
  • FASTA files are created by the generate_fasta_files() method on the dataset classes, for example acp_data.generate_fasta_files(); a minimal sketch of the pattern is shown below.
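
generate_fasta_files() lives in the project's data module; the sketch below shows the general Biopython pattern it presumably follows. The helper name write_fasta and its arguments are illustrative assumptions, not the project's actual API.

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def write_fasta(sequences, ids, path):
    # one SeqRecord per peptide, keyed by its dataset id (illustrative helper)
    records = [
        SeqRecord(Seq(seq), id=str(seq_id), description='')
        for seq, seq_id in zip(sequences, ids)
    ]
    SeqIO.write(records, path, 'fasta')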

Create LSTM Embeddings From FASTA

To create embeddings in bulk from FASTA files:

  • the ProSE GitHub repo needs to be cloned
  • a conda environment for it needs to be created, as detailed in the repo (link above)
  • then, in that conda environment, at the root of the cloned repo, run the following commands

Sample Python commands for generating embeddings with the pretrained model from the ProSE codebase:

ACP

  • The following command generates average-pooled features from the pretrained ProSE model for the sequences in the input FASTA file.
  • Remove -d 0 from the command to run on CPU.
    python embed_sequences.py -d 0 --pool avg -o ~/.peptide/datasets/acp/lstm/acp_avgpool_test.h5 ~/.peptide/datasets/acp/fasta/acp_test.fasta
    # loading the pre-trained ProSE MT model
    # writing: /home/vinod/.peptide/datasets/acp/lstm/acp_avgpool_test.h5
    # embedding with pool=avg
    
  • Sample command for max pooling
    python embed_sequences.py -d 0 --pool max -o ~/.peptide/datasets/acp/lstm/acp_maxpool_train.h5 ~/.peptide/datasets/acp/fasta/acp_train.fasta
    # loading the pre-trained ProSE MT model
    # writing: /home/vinod/.peptide/datasets/acp/lstm/acp_maxpool_train.h5
    # embedding with pool=max
    

AMP

  • For AMP, some form of pooling is required on the non-truncated sequences, since the train and test sets have different maximum sequence lengths.
    python embed_sequences.py -d 0 --pool avg -o ~/.peptide/datasets/amp/lstm/amp_avgpool_train.h5 ~/.peptide/datasets/amp/fasta/amp_train.fasta
    # loading the pre-trained ProSE MT model
    # writing: /home/vinod/.peptide/datasets/amp/lstm/amp_avgpool_train.h5
    # embedding with pool=avg
    
  • Max-pool command for the truncated dataset (max sequence length 150):
    python embed_sequences.py -d 0 --pool max -o ~/.peptide/datasets/amp/lstm/amp_maxpool_test_seqlen_150.h5 ~/.peptide/datasets/amp/fasta/amp_test_seqlen_150.fasta
    # loading the pre-trained ProSE MT model
    # writing: /home/vinod/.peptide/datasets/amp/lstm/amp_maxpool_test_seqlen_150.h5
    # embedding with pool=max
    

DNA Binding

  • As with AMP, pooling is required for the full non-truncated sequences.
    python embed_sequences.py -d 0 --pool avg -o ~/.peptide/datasets/dna/lstm/dna_avgpool_test.h5 ~/.peptide/datasets/dna/fasta/dna_test.fasta
    # loading the pre-trained ProSE MT model
    # writing: /home/vinod/.peptide/datasets/dna/lstm/dna_avgpool_test.h5
    # embedding with pool=avg
    
  • Max-pool example for the truncated dataset (max sequence length 300):
    python embed_sequences.py -d 0 --pool max -o ~/.peptide/datasets/dna/lstm/dna_maxpool_train_seqlen_300.h5 ~/.peptide/datasets/dna/fasta/dna_train_seqlen_300.fasta
    # loading the pre-trained ProSE MT model
    # writing: /home/vinod/.peptide/datasets/dna/lstm/dna_maxpool_train_seqlen_300.h5
    # embedding with pool=max
    

Get Embeddings - Read From H5

The procedure for reading the generated embeddings from the H5 files is detailed in the data module; a minimal sketch follows.
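
get_lstm_emb() is the project's reader; the sketch below shows one way to load the H5 output with h5py, under the assumption (worth verifying) that ProSE's embed_sequences.py writes one dataset per input sequence, keyed by its FASTA header.

import os
import h5py
import numpy as np

def read_embeddings(h5_path):
    # assumed layout: one fixed-size embedding vector per sequence name
    with h5py.File(os.path.expanduser(h5_path), 'r') as f:
        names = list(f.keys())
        X = np.stack([f[name][...] for name in names])
    return names, X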

Anticancer Peptide Dataset (ACP)

ACP - Dimensionality Reduction (PCA) vs Full Data

X_train, y_train, X_test, y_test = acp_data.get_lstm_emb('acp_avgpool_train.h5', 'acp_avgpool_test.h5')
X_train.shape, y_train.shape, X_test.shape, y_test.shape
((1378, 6165), (1378,), (344, 6165), (344,))
from sklearn.decomposition import PCA

pca = PCA(n_components=50)
X_train_pca = pca.fit_transform(X_train)
print(f'X_train_pca.shape: {X_train_pca.shape}')
print(
    f"Explained variance ratio of the first 10 principal components:\n{pca.explained_variance_ratio_[:10]}"
)
X_train_pca.shape: (1378, 50)
Explained variance ratio of the first 10 principal components:
[0.28138205 0.13824382 0.0479676  0.0339361  0.03260109 0.01959571
 0.01888285 0.0157898  0.01357551 0.01306792]
visualize_2pcs(X_train_pca, y_train)
visualize_3pcs(X_train_pca, y_train)
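
visualize_2pcs() and visualize_3pcs() are project plotting helpers; below is a minimal sketch of the 2-PC version, assuming matplotlib (only the name and signature come from the calls above, the body is illustrative).

import matplotlib.pyplot as plt

def visualize_2pcs(X_pca, y):
    # scatter the first two principal components, colored by class label
    fig, ax = plt.subplots(figsize=(6, 5))
    scatter = ax.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='coolwarm', s=8, alpha=0.6)
    ax.set_xlabel('PC1')
    ax.set_ylabel('PC2')
    ax.legend(*scatter.legend_elements(), title='label')
    plt.show()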

Evaluation on full data

train_predict(X_train, y_train, X_test, y_test)
acc recall precision f1
lr 0.738372 0.755814 0.730337 0.742857
svc 0.718023 0.761628 0.700535 0.729805
xgb 0.729651 0.773256 0.711230 0.740947
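
train_predict() is a project helper; the sketch below is a minimal stand-in, assuming it fits the three models named in the results table (lr, svc, xgb) with default-ish settings and tabulates test-set metrics. Hyperparameters are illustrative.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from xgboost import XGBClassifier

def train_predict(X_train, y_train, X_test, y_test):
    models = {
        'lr': LogisticRegression(max_iter=1000),
        'svc': LinearSVC(),
        'xgb': XGBClassifier(),
    }
    rows = {}
    for name, model in models.items():
        # fit each model on the embeddings and score it on the held-out test set
        y_pred = model.fit(X_train, y_train).predict(X_test)
        rows[name] = [accuracy_score(y_test, y_pred), recall_score(y_test, y_pred),
                      precision_score(y_test, y_pred), f1_score(y_test, y_pred)]
    return pd.DataFrame.from_dict(rows, orient='index',
                                  columns=['acc', 'recall', 'precision', 'f1'])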

Evaluation on reduced data

X_test_pca = pca.transform(X_test)
train_predict(X_train_pca, y_train, X_test_pca, y_test)
acc recall precision f1
lr 0.674419 0.720930 0.659574 0.688889
svc 0.665698 0.709302 0.652406 0.679666
xgb 0.735465 0.773256 0.718919 0.745098

ACP - Get Data and Learners (Avg vs Max Pooling)

X_train, y_train, X_test, y_test = acp_data.get_lstm_emb('acp_avgpool_train.h5', 'acp_avgpool_test.h5')
acp_avgpool_learner = Learner(X_train, y_train, X_test, y_test) 

# max pool Learner
X_train, y_train, X_test, y_test = acp_data.get_lstm_emb('acp_maxpool_train.h5', 'acp_maxpool_test.h5')
acp_maxpool_learner = Learner(X_train, y_train, X_test, y_test) 

ACP - Grid Search (Supervised Learning)

acp_avgpool_learner.pipeline.steps
[('pca', PCA(n_components=50)), ('classifier', 'passthrough')]
_, _ = acp_avgpool_learner.train()
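
Learner.train() presumably wraps a grid search over this pipeline; the sketch below shows the standard scikit-learn pattern in which each grid entry swaps a different estimator into the 'passthrough' classifier slot. The parameter grids are illustrative, not the project's actual search space.

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier

pipeline = Pipeline([('pca', PCA(n_components=50)),
                     ('classifier', 'passthrough')])
# each grid entry replaces the 'classifier' step with a different estimator
param_grid = [
    {'classifier': [LogisticRegression(max_iter=1000)],
     'classifier__C': [0.01, 0.1, 1.0, 10.0, 100.0]},
    {'classifier': [LinearSVC(max_iter=10000)],
     'classifier__C': [0.01, 0.1, 1.0, 10.0, 100.0]},
    {'classifier': [XGBClassifier()],
     'classifier__max_depth': [3, 6]},
]
search = GridSearchCV(pipeline, param_grid, scoring='accuracy', cv=5, n_jobs=-1)
search.fit(X_train, y_train)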

Run grid search on the max-pooled embeddings to compare results

_, _ = acp_maxpool_learner.train()

ACP - Label Spreading (Semi-Supervised Learning)

acp_maxpool_learner.run_label_spreading()
              precision    recall  f1-score   support

           0       0.71      0.71      0.71       172
           1       0.71      0.70      0.71       172

    accuracy                           0.71       344
   macro avg       0.71      0.71      0.71       344
weighted avg       0.71      0.71      0.71       344

acp_avgpool_learner.run_label_spreading()
              precision    recall  f1-score   support

           0       0.75      0.73      0.74       172
           1       0.74      0.76      0.75       172

    accuracy                           0.74       344
   macro avg       0.74      0.74      0.74       344
weighted avg       0.74      0.74      0.74       344

With label spreading, avg pooling performs better.
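
run_label_spreading() is a project method; the sketch below shows the scikit-learn semi-supervised pattern it presumably follows: stack train and test features, mask the test labels with -1, fit LabelSpreading, and score the transduced labels on the test portion. The kernel and alpha mirror the best_params shown below; everything else is an assumption.

import numpy as np
from sklearn.metrics import classification_report
from sklearn.semi_supervised import LabelSpreading

# in practice the PCA-reduced features (X_train_pca, X_test_pca) keep this tractable
X_all = np.vstack([X_train, X_test])
y_all = np.concatenate([y_train, np.full(len(y_test), -1)])  # -1 marks unlabeled points
ls = LabelSpreading(kernel='knn', alpha=0.01)
ls.fit(X_all, y_all)
y_pred = ls.transduction_[len(y_train):]  # labels spread onto the test rows
print(classification_report(y_test, y_pred))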

ACP - Prediction Results

Comparing prediction performance between avg- and max-pooled embeddings

acp_avgpool_learner.predict()
best_params accuracy recall precision f1
LogisticRegression {'classifier': LogisticRegression(C=100.0, max... 0.674419 0.72093 0.659574 0.688889
LinearSVC {'classifier': LinearSVC(C=100.0, max_iter=100... 0.65407 0.703488 0.640212 0.67036
XGBClassifier {'classifier': XGBClassifier(base_score=None, ... 0.732558 0.75 0.724719 0.737143
LabelSpreading {'alpha': 0.01, 'gamma': 20, 'kernel': 'knn', ... 0.744186 0.761628 0.735955 0.748571
acp_maxpool_learner.predict()
best_params accuracy recall precision f1
LogisticRegression {'classifier': LogisticRegression(C=0.1, max_i... 0.671512 0.726744 0.65445 0.688705
LinearSVC {'classifier': LinearSVC(C=0.01, max_iter=1000... 0.674419 0.75 0.651515 0.697297
XGBClassifier {'classifier': XGBClassifier(base_score=None, ... 0.741279 0.784884 0.721925 0.752089
LabelSpreading {'alpha': 0.01, 'gamma': 20, 'kernel': 'knn', ... 0.706395 0.703488 0.707602 0.705539

XGB improves when using max-pooled embeddings instead of avg-pooled ones.

Saving results.

acp_avgpool_learner.predict_results.to_csv(f'{EXPERIMENT_STORE}/acp_lstm_avgpool_learner.csv')
acp_maxpool_learner.predict_results.to_csv(f'{EXPERIMENT_STORE}/acp_lstm_maxpool_learner.csv')

ACP - KMeans Clustering (Unsupervised Learning)

X_pca = acp_avgpool_learner.pick_k()
n_clusters: 2 -- avg silhouette score: 0.3598043620586395
n_clusters: 3 -- avg silhouette score: 0.2560119032859802
n_clusters: 4 -- avg silhouette score: 0.19085708260536194
n_clusters: 5 -- avg silhouette score: 0.18616080284118652
n_clusters: 6 -- avg silhouette score: 0.18351595103740692
n_clusters: 7 -- avg silhouette score: 0.1913120150566101
n_clusters: 8 -- avg silhouette score: 0.18931667506694794
n_clusters: 9 -- avg silhouette score: 0.19355764985084534
n_clusters: 10 -- avg silhouette score: 0.17628590762615204
acp_avgpool_learner.analyze_clusters(X_pca, k=6)
Cluster counts: Counter({0: 488, 2: 391, 1: 298, 4: 257, 5: 208, 3: 80})
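
pick_k() and analyze_clusters() are project helpers; below is a minimal sketch of the silhouette sweep and cluster-count analysis they appear to perform, assuming scikit-learn KMeans (hyperparameters are illustrative).

from collections import Counter
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_k(X_pca, k_range=range(2, 11)):
    # sweep k and report the average silhouette score for each clustering
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_pca)
        print(f'n_clusters: {k} -- avg silhouette score: {silhouette_score(X_pca, labels)}')
    return X_pca

def analyze_clusters(X_pca, k):
    # refit at the chosen k and inspect how the points distribute over clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_pca)
    print('Cluster counts:', Counter(labels))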

Antimicrobial Peptide Dataset (AMP)

AMP - PCA vs Full Data

X_train, y_train, X_test, y_test = amp_data.get_lstm_emb('amp_avgpool_train.h5', 'amp_avgpool_test.h5')
X_train.shape, y_train.shape, X_test.shape, y_test.shape
((3234, 6165), (3234,), (808, 6165), (808,))
pca = PCA(n_components=50)
X_train_pca = pca.fit_transform(X_train)
print(f'X_train_pca.shape: {X_train_pca.shape}')
print(
    f"Explained variance ratio of the first 10 principal components:\n{pca.explained_variance_ratio_[:10]}"
)
X_train_pca.shape: (3234, 50)
Explained variance ratio of the first 10 principal components:
[0.16400845 0.15573642 0.04399751 0.03372195 0.02795825 0.02183319
 0.01885984 0.01691326 0.01427688 0.01275587]
visualize_2pcs(X_train_pca, y_train)
visualize_3pcs(X_train_pca, y_train)

Evaluation on full data

train_predict(X_train, y_train, X_test, y_test)
acc recall precision f1
lr 0.931931 0.919395 0.940722 0.929936
svc 0.925743 0.911839 0.935401 0.923469
xgb 0.929455 0.901763 0.952128 0.926261

Evaluation on reduced data

X_test_pca = pca.transform(X_test)
train_predict(X_train_pca, y_train, X_test_pca, y_test)
acc recall precision f1
lr 0.923267 0.904282 0.937337 0.920513
svc 0.924505 0.899244 0.944444 0.921290
xgb 0.919554 0.886650 0.946237 0.915475

AMP - Get Data and Learners (Avg vs Max Pooling)

X_train, y_train, X_test, y_test = amp_data.get_lstm_emb('amp_maxpool_train.h5', 'amp_maxpool_test.h5')
amp_maxpool_learner = Learner(X_train, y_train, X_test, y_test)

# avg
X_train, y_train, X_test, y_test = amp_data.get_lstm_emb('amp_avgpool_train.h5', 'amp_avgpool_test.h5')
amp_avgpool_learner = Learner(X_train, y_train, X_test, y_test)

AMP - Grid Search (Supervised Learning)

amp_avgpool_learner.pipeline.steps
[('pca', PCA(n_components=50)), ('classifier', 'passthrough')]
_, _ = amp_avgpool_learner.train()

Run grid search on the max-pooled embeddings to compare results

_, _ = amp_maxpool_learner.train()

AMP - Label Spreading (Semi-Supervised Learning)

amp_maxpool_learner.run_label_spreading()
              precision    recall  f1-score   support

           0       0.89      0.96      0.93       411
           1       0.96      0.88      0.92       397

    accuracy                           0.92       808
   macro avg       0.93      0.92      0.92       808
weighted avg       0.92      0.92      0.92       808

amp_avgpool_learner.run_label_spreading()
              precision    recall  f1-score   support

           0       0.90      0.95      0.93       411
           1       0.95      0.89      0.92       397

    accuracy                           0.92       808
   macro avg       0.92      0.92      0.92       808
weighted avg       0.92      0.92      0.92       808

No clear winner between the two poolings.

AMP - Prediction Results

Comparing prediction performance between avg- and max-pooled embeddings

amp_avgpool_learner.predict()
best_params accuracy recall precision f1
LogisticRegression {'classifier': LogisticRegression(C=100.0, max... 0.92698 0.90932 0.940104 0.924456
LinearSVC {'classifier': LinearSVC(C=100.0, loss='hinge'... 0.92698 0.906801 0.942408 0.924262
XGBClassifier {'classifier': XGBClassifier(base_score=None, ... 0.925743 0.896725 0.949333 0.92228
LabelSpreading {'alpha': 0.01, 'gamma': 20, 'kernel': 'knn', ... 0.92203 0.891688 0.946524 0.918288
amp_maxpool_learner.predict()
best_params accuracy recall precision f1
LogisticRegression {'classifier': LogisticRegression(C=100.0, max... 0.928218 0.911839 0.94026 0.925831
LinearSVC {'classifier': LinearSVC(loss='hinge', max_ite... 0.930693 0.914358 0.942857 0.928389
XGBClassifier {'classifier': XGBClassifier(base_score=None, ... 0.924505 0.90932 0.935233 0.922095
LabelSpreading {'alpha': 0.01, 'gamma': 20, 'kernel': 'knn', ... 0.92203 0.879093 0.958791 0.917214

For XGB, max pooling improves recall here, though accuracy and F1 are essentially unchanged.

Save results.

amp_avgpool_learner.predict_results.to_csv(f'{EXPERIMENT_STORE}/amp_lstm_avgpool_learner.csv')
amp_maxpool_learner.predict_results.to_csv(f'{EXPERIMENT_STORE}/amp_lstm_maxpool_learner.csv')

AMP - KMeans Clustering (Unsupervised Learning)

X_pca = amp_maxpool_learner.pick_k()
n_clusters: 2 -- avg silhouette score: 0.4392211437225342
n_clusters: 3 -- avg silhouette score: 0.30619361996650696
n_clusters: 4 -- avg silhouette score: 0.3071313202381134
n_clusters: 5 -- avg silhouette score: 0.20854830741882324
n_clusters: 6 -- avg silhouette score: 0.19198007881641388
n_clusters: 7 -- avg silhouette score: 0.188777893781662
n_clusters: 8 -- avg silhouette score: 0.18562379479408264
n_clusters: 9 -- avg silhouette score: 0.18900585174560547
n_clusters: 10 -- avg silhouette score: 0.15258486568927765
amp_maxpool_learner.analyze_clusters(X_pca, k=4)
Cluster counts: Counter({1: 2363, 0: 670, 2: 605, 3: 404})

DNA Binding Dataset

DNA - PCA vs Full

X_train, y_train, X_test, y_test = dna_data.get_lstm_emb('dna_avgpool_train.h5', 'dna_avgpool_test.h5')
X_train.shape, y_train.shape, X_test.shape, y_test.shape
((14189, 6165), (14189,), (2272, 6165), (2272,))
pca = PCA(n_components=50)
X_train_pca = pca.fit_transform(X_train)
print(f'X_train_pca.shape: {X_train_pca.shape}')
print(
    f"Explained variance ratio of the first 10 principal components:\n{pca.explained_variance_ratio_[:10]}"
)
X_train_pca.shape: (14189, 50)
Explained variance ratio of the first 10 principal components:
[0.14483151 0.08522432 0.04459034 0.0348114  0.03089704 0.0273882
 0.02518505 0.02346268 0.02010697 0.01673359]
visualize_2pcs(X_train_pca, y_train)
visualize_3pcs(X_train_pca, y_train)

Evaluation on full data

train_predict(X_train, y_train, X_test, y_test)
acc recall precision f1
lr 0.834067 0.889853 0.804075 0.844792
svc 0.841109 0.932350 0.791605 0.856233
xgb 0.879842 0.962706 0.828358 0.890493

Evaluation on reduced data

X_test_pca = pca.transform(X_test)
train_predict(X_train_pca, y_train, X_test_pca, y_test)
acc recall precision f1
lr 0.798415 0.844753 0.777334 0.809643
svc 0.796655 0.847355 0.773555 0.808775
xgb 0.869278 0.963573 0.813324 0.882096

DNA - Get Data and Learners (Max vs Avg Pooling)

X_train, y_train, X_test, y_test = dna_data.get_lstm_emb('dna_maxpool_train.h5', 'dna_maxpool_test.h5')
dna_maxpool_learner = Learner(X_train, y_train, X_test, y_test)

# avg
X_train, y_train, X_test, y_test = dna_data.get_lstm_emb('dna_avgpool_train.h5', 'dna_avgpool_test.h5')
dna_avgpool_learner = Learner(X_train, y_train, X_test, y_test)

DNA - Grid Search (Supervised Learning)

dna_avgpool_learner.pipeline.steps
[('pca', PCA(n_components=50)), ('classifier', 'passthrough')]
_, _ = dna_avgpool_learner.train()

Run grid search on the max-pooled embeddings to compare results

dna_maxpool_learner.pipeline.steps
[('pca', PCA(n_components=50)), ('classifier', 'passthrough')]
_, _ = dna_maxpool_learner.train()

DNA - Label Spreading (Semi-Supervised Learning)

dna_maxpool_learner.run_label_spreading()
              precision    recall  f1-score   support

           0       0.89      0.72      0.79      1119
           1       0.77      0.91      0.83      1153

    accuracy                           0.81      2272
   macro avg       0.83      0.81      0.81      2272
weighted avg       0.83      0.81      0.81      2272

dna_avgpool_learner.run_label_spreading()
              precision    recall  f1-score   support

           0       0.89      0.71      0.79      1119
           1       0.77      0.91      0.83      1153

    accuracy                           0.81      2272
   macro avg       0.83      0.81      0.81      2272
weighted avg       0.83      0.81      0.81      2272

DNA - Prediction Results

Comparing prediction performance between avg- and max-pooled embeddings

dna_avgpool_learner.predict()
best_params accuracy recall precision f1
LogisticRegression {'classifier': LogisticRegression(C=100.0, max... 0.802377 0.847355 0.7816 0.81315
LinearSVC {'classifier': LinearSVC(C=10.0, loss='hinge',... 0.800616 0.85516 0.775157 0.813196
XGBClassifier {'classifier': XGBClassifier(base_score=None, ... 0.875 0.964441 0.820664 0.886762
LabelSpreading {'alpha': 0.01, 'gamma': 20, 'kernel': 'knn', ... 0.81338 0.911535 0.765477 0.832146
dna_maxpool_learner.predict()
best_params accuracy recall precision f1
LogisticRegression {'classifier': LogisticRegression(max_iter=100... 0.801496 0.848222 0.779904 0.81263
LinearSVC {'classifier': LinearSVC(C=100.0, loss='hinge'... 0.805018 0.847355 0.78537 0.815186
XGBClassifier {'classifier': XGBClassifier(base_score=None, ... 0.865757 0.961839 0.809489 0.879112
LabelSpreading {'alpha': 0.01, 'gamma': 20, 'kernel': 'knn', ... 0.814701 0.910668 0.767544 0.833003

XGB is again the strongest model overall, though here the avg-pooled embeddings slightly outperform the max-pooled ones.

Save results.

dna_avgpool_learner.predict_results.to_csv(f'{EXPERIMENT_STORE}/dna_lstm_avgpool_learner.csv')
dna_maxpool_learner.predict_results.to_csv(f'{EXPERIMENT_STORE}/dna_lstm_maxpool_learner.csv')

DNA - KMeans Clustering (Unsupervised Learning)

X_pca = dna_maxpool_learner.pick_k()
n_clusters: 2 -- avg silhouette score: 0.13949118554592133
n_clusters: 3 -- avg silhouette score: 0.15438680350780487
n_clusters: 4 -- avg silhouette score: 0.14487028121948242
n_clusters: 5 -- avg silhouette score: 0.12199439108371735
n_clusters: 6 -- avg silhouette score: 0.11718294024467468
n_clusters: 7 -- avg silhouette score: 0.1381966769695282
n_clusters: 8 -- avg silhouette score: 0.13304507732391357
n_clusters: 9 -- avg silhouette score: 0.13672156631946564
n_clusters: 10 -- avg silhouette score: 0.10999424010515213
dna_maxpool_learner.analyze_clusters(X_pca, k=6)
Cluster counts: Counter({3: 3971, 4: 3876, 0: 3620, 1: 3066, 5: 1364, 2: 564})