acp_data = ACPDataset(DATA_STORE)
amp_data = AMPDataset(DATA_STORE)
dna_data = DNABindDataset(DATA_STORE, max_seq_len=300)
Details of the pretrained model:
Evolutionary Scale Modeling (ESM)
- Transformer protein language models.
- https://github.com/facebookresearch/esm
- Generating fasta files with the BioPython library is detailed in the data module.
- fasta files are created by the `generate_fasta_files()` method on the dataset classes, e.g. `acp_data.generate_fasta_files()`.

ESM has a maximum sequence length of 1024, so `max_seq_len <= 1024` must be used when generating fasta files.
dna_data.generate_fasta_files(use_seq_max_len=True)
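As background for what `generate_fasta_files()` does, here is a minimal self-contained sketch of writing FASTA records while clipping sequences to ESM's 1024-token limit. The helper name, the `records` argument, and writing without BioPython are all illustrative; the project itself uses BioPython and the dataset classes.

```python
def write_fasta(records, path, max_seq_len=1024):
    """Write (id, sequence) pairs as FASTA, truncating to max_seq_len.

    ESM-1b accepts at most 1024 tokens, so longer sequences are clipped.
    This helper is a sketch, not the project's actual API.
    """
    with open(path, "w") as fh:
        for seq_id, seq in records:
            fh.write(f">{seq_id}\n{seq[:max_seq_len]}\n")

# A 1500-residue toy sequence gets clipped to 1024 characters.
write_fasta([("seq1", "MKT" * 500)], "example.fasta", max_seq_len=1024)
```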
To create embeddings in bulk from the fasta files:
- clone the ESM GitHub repo
- create a conda environment for it as detailed in the repo (link above)
- in that conda environment, at the root of the cloned repo, run the commands below

Sample commands to generate embeddings from the pretrained model in the ESM codebase:
ACP
- Train
```sh
python scripts/extract.py esm1b_t33_650M_UR50S ~/.peptide/datasets/acp/fasta/acp_train.fasta ~/.peptide/datasets/acp/transformer/train/ \
    --repr_layers 0 32 33 --include mean
Transferred model to GPU
Read /home/vinod/.peptide/datasets/acp/fasta/acp_train.fasta with 1378 sequences
Processing 1 of 10 batches (292 sequences)
Processing 2 of 10 batches (215 sequences)
Processing 3 of 10 batches (178 sequences)
Processing 4 of 10 batches (157 sequences)
Processing 5 of 10 batches (132 sequences)
Processing 6 of 10 batches (117 sequences)
Processing 7 of 10 batches (105 sequences)
Processing 8 of 10 batches (91 sequences)
Processing 9 of 10 batches (80 sequences)
Processing 10 of 10 batches (11 sequences)
```
- Test
```sh
python scripts/extract.py esm1b_t33_650M_UR50S ~/.peptide/datasets/acp/fasta/acp_test.fasta ~/.peptide/datasets/acp/transformer/test/ \
    --repr_layers 0 32 33 --include mean
Transferred model to GPU
Read /home/vinod/.peptide/datasets/acp/fasta/acp_test.fasta with 344 sequences
Processing 1 of 3 batches (175 sequences)
Processing 2 of 3 batches (113 sequences)
Processing 3 of 3 batches (56 sequences)
```
AMP
- Train
```sh
python scripts/extract.py esm1b_t33_650M_UR50S ~/.peptide/datasets/amp/fasta/amp_train.fasta ~/.peptide/datasets/amp/transformer/train/ \
    --repr_layers 0 32 33 --include mean
Transferred model to GPU
Read /home/vinod/.peptide/datasets/amp/fasta/amp_train.fasta with 3234 sequences
Processing 1 of 30 batches (273 sequences)
Processing 2 of 30 batches (240 sequences)
...
Processing 29 of 30 batches (24 sequences)
Processing 30 of 30 batches (1 sequences)
```
- Test
```sh
python scripts/extract.py esm1b_t33_650M_UR50S ~/.peptide/datasets/amp/fasta/amp_test.fasta ~/.peptide/datasets/amp/transformer/test/ \
    --repr_layers 0 32 33 --include mean
Transferred model to GPU
Read /home/vinod/.peptide/datasets/amp/fasta/amp_test.fasta with 808 sequences
Processing 1 of 9 batches (204 sequences)
Processing 2 of 9 batches (157 sequences)
...
Processing 9 of 9 batches (3 sequences)
```
DNA Binding
- Train
```sh
python scripts/extract.py esm1b_t33_650M_UR50S ~/.peptide/datasets/dna/fasta/dna_train_seqlen_1024.fasta ~/.peptide/datasets/dna/transformer/train/ \
    --repr_layers 0 32 33 --include mean
Transferred model to GPU
Read /home/vinod/.peptide/datasets/dna/fasta/dna_train_seqlen_1024.fasta with 14189 sequences
Processing 1 of 1525 batches (67 sequences)
Processing 2 of 1525 batches (64 sequences)
```
- Whereas the following results in a max sequence length error:
```sh
python scripts/extract.py esm1b_t33_650M_UR50S ~/.peptide/datasets/dna/fasta/dna_train.fasta ~/.peptide/datasets/dna/transformer/train/ \
    --repr_layers 0 32 33 --include mean
Transferred model to GPU
Read /home/vinod/.peptide/datasets/dna/fasta/dna_train.fasta with 14189 sequences
Processing 1 of 1641 batches (67 sequences)
...
```
- Test
```sh
python scripts/extract.py esm1b_t33_650M_UR50S ~/.peptide/datasets/dna/fasta/dna_test.fasta ~/.peptide/datasets/dna/transformer/test/ \
--repr_layers 0 32 33 --include mean
Transferred model to GPU
Read /home/vinod/.peptide/datasets/dna/fasta/dna_test.fasta with 2272 sequences
Processing 1 of 289 batches (56 sequences)
...
```
Read Transformer Embeddings
The procedure for reading the embeddings generated from the pretrained ESM model is detailed in the data module.
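The actual loading lives in `get_transformer_emb`; as background, ESM's `scripts/extract.py` writes one `.pt` file per sequence, and with `--include mean` each file holds a dict with a `label` and a `mean_representations` dict keyed by layer number. A hypothetical helper (`load_mean_embeddings` is not the project API) for stacking them into a feature matrix might look like:

```python
from pathlib import Path
import torch

def load_mean_embeddings(emb_dir, layer=33):
    """Stack per-sequence mean embeddings written by ESM's scripts/extract.py.

    Each .pt file contains {'label': ..., 'mean_representations': {layer: tensor}}.
    """
    X, labels = [], []
    for f in sorted(Path(emb_dir).glob("*.pt")):
        d = torch.load(f)
        X.append(d["mean_representations"][layer])
        labels.append(d["label"])
    return torch.stack(X), labels
```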
X_train, y_train, X_test, y_test = acp_data.get_transformer_emb('acp_train.fasta', 'acp_test.fasta')
X_train.shape, y_train.shape, X_test.shape, y_test.shape
pca = PCA(n_components=50)
X_train_pca = pca.fit_transform(X_train)
print(f'X_train_pca.shape: {X_train_pca.shape}')
print(
f"Explained variance ratio of the first 10 principal components:\n{pca.explained_variance_ratio_[:10]}"
)
visualize_2pcs(X_train_pca, y_train)
Looks like scaling is needed
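A quick self-contained illustration (synthetic data, not the ACP embeddings) of why scaling matters for PCA: a feature with much larger variance dominates the leading principal component, so the explained-variance ratio is uninformative until features are standardized.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two features on similar scales plus one large-scale noise feature.
X = np.column_stack([
    rng.normal(0, 1, 500),
    rng.normal(0, 1, 500),
    rng.normal(0, 100, 500),  # large-scale noise
])

def explained_variance_ratio(X):
    """PCA explained-variance ratios via SVD of the centered matrix."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s ** 2
    return var / var.sum()

print(explained_variance_ratio(X))         # noise feature dominates PC1
Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize, as StandardScaler does
print(explained_variance_ratio(Xs))        # variance spread across components
```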
scaler = StandardScaler()
pca = PCA(n_components=50)
X_train_pca = pca.fit_transform(scaler.fit_transform(X_train))
print(f'X_train_pca.shape: {X_train_pca.shape}')
print(
f"Explained variance ratio of the first 10 principal components:\n{pca.explained_variance_ratio_[:10]}"
)
visualize_2pcs(X_train_pca, y_train)
visualize_3pcs(X_train_pca, y_train)
Evaluation on full data
train_predict(X_train, y_train, X_test, y_test)
Evaluation on reduced data
X_test_pca = pca.transform(scaler.transform(X_test))
train_predict(X_train_pca, y_train, X_test_pca, y_test)
Layer 33 is the final layer of the ESM model. Here we read embeddings from layers 33 and 32 and initialize learners to run training and prediction on both.
- Get data using the `ACPDataset` object
- Instantiate a `Learner` object using:
X_train, y_train, X_test, y_test = acp_data.get_transformer_emb('acp_train.fasta', 'acp_test.fasta')
acp_lyr33_learner = Learner(X_train, y_train, X_test, y_test, scaler=True)
# layer 32
X_train, y_train, X_test, y_test = acp_data.get_transformer_emb('acp_train.fasta', 'acp_test.fasta', emb_layer=32)
acp_lyr32_learner = Learner(X_train, y_train, X_test, y_test, scaler=True)
acp_lyr33_learner.pipeline.steps
_, _ = acp_lyr33_learner.train()
acp_lyr32_learner.pipeline.steps
_, _ = acp_lyr32_learner.train()
acp_lyr33_learner.run_label_spreading()
acp_lyr32_learner.run_label_spreading()
acp_lyr33_learner.predict()
acp_lyr32_learner.predict()
Save results.
acp_lyr33_learner.predict_results.to_csv(f'{EXPERIMENT_STORE}/acp_transformer_lyr33_learner.csv')
acp_lyr32_learner.predict_results.to_csv(f'{EXPERIMENT_STORE}/acp_transformer_lyr32_learner.csv')
X_pca = acp_lyr33_learner.pick_k()
acp_lyr33_learner.analyze_clusters(X_pca, k=4)
X_train, y_train, X_test, y_test = amp_data.get_transformer_emb('amp_train.fasta', 'amp_test.fasta')
X_train.shape, y_train.shape, X_test.shape, y_test.shape
scaler = StandardScaler()
pca = PCA(n_components=50)
X_train_pca = pca.fit_transform(scaler.fit_transform(X_train))
print(f'X_train_pca.shape: {X_train_pca.shape}')
print(
f"Explained variance ratio of the first 10 principal components:\n{pca.explained_variance_ratio_[:10]}"
)
visualize_2pcs(X_train_pca, y_train)
visualize_3pcs(X_train_pca, y_train)
Evaluation on full data
train_predict(X_train, y_train, X_test, y_test)
Evaluation on reduced data
X_test_pca = pca.transform(scaler.transform(X_test))
train_predict(X_train_pca, y_train, X_test_pca, y_test)
X_train, y_train, X_test, y_test = amp_data.get_transformer_emb('amp_train.fasta', 'amp_test.fasta')
amp_lyr33_learner = Learner(X_train, y_train, X_test, y_test, scaler=True)
# layer 32
X_train, y_train, X_test, y_test = amp_data.get_transformer_emb('amp_train.fasta', 'amp_test.fasta', emb_layer=32)
amp_lyr32_learner = Learner(X_train, y_train, X_test, y_test, scaler=True)
amp_lyr33_learner.pipeline.steps
_, _ = amp_lyr33_learner.train()
amp_lyr32_learner.pipeline.steps
_, _ = amp_lyr32_learner.train()
amp_lyr33_learner.run_label_spreading()
amp_lyr32_learner.run_label_spreading()
amp_lyr33_learner.predict()
amp_lyr32_learner.predict()
Save results.
amp_lyr33_learner.predict_results.to_csv(f'{EXPERIMENT_STORE}/amp_transformer_lyr33_learner.csv')
amp_lyr32_learner.predict_results.to_csv(f'{EXPERIMENT_STORE}/amp_transformer_lyr32_learner.csv')
X_pca = amp_lyr33_learner.pick_k()
amp_lyr33_learner.analyze_clusters(X_pca, k=6)
X_train, y_train, X_test, y_test = dna_data.get_transformer_emb('dna_train_seqlen_300.fasta', 'dna_test_seqlen_300.fasta')
X_train.shape, y_train.shape, X_test.shape, y_test.shape
pca = PCA(n_components=50)
X_train_pca = pca.fit_transform(X_train)
print(f'X_train_pca.shape: {X_train_pca.shape}')
print(
f"Explained variance ratio of the first 10 principal components:\n{pca.explained_variance_ratio_[:10]}"
)
visualize_2pcs(X_train_pca, y_train)
visualize_3pcs(X_train_pca, y_train)
Evaluation on full data
train_predict(X_train, y_train, X_test, y_test)
Evaluation on reduced data
X_test_pca = pca.transform(X_test)
train_predict(X_train_pca, y_train, X_test_pca, y_test)
X_train, y_train, X_test, y_test = dna_data.get_transformer_emb('dna_train_seqlen_300.fasta', 'dna_test_seqlen_300.fasta')
dna_lyr33_learner = Learner(X_train, y_train, X_test, y_test, scaler=True)
# layer 32
X_train, y_train, X_test, y_test = dna_data.get_transformer_emb('dna_train_seqlen_300.fasta', 'dna_test_seqlen_300.fasta', emb_layer=32)
dna_lyr32_learner = Learner(X_train, y_train, X_test, y_test, scaler=True)
dna_lyr33_learner.pipeline.steps
_, _ = dna_lyr33_learner.train()
dna_lyr32_learner.pipeline.steps
_, _ = dna_lyr32_learner.train()
dna_lyr33_learner.run_label_spreading()
dna_lyr32_learner.run_label_spreading()
dna_lyr33_learner.predict()
dna_lyr32_learner.predict()
Save results.
dna_lyr33_learner.predict_results.to_csv(f'{EXPERIMENT_STORE}/dna_transformer_lyr33_learner.csv')
dna_lyr32_learner.predict_results.to_csv(f'{EXPERIMENT_STORE}/dna_transformer_lyr32_learner.csv')
X_pca = dna_lyr33_learner.pick_k()
dna_lyr33_learner.analyze_clusters(X_pca, k=6)