DATA_STORE
os.listdir(f"{DATA_STORE}")
os.listdir(f"{DATA_STORE}/acp")
raw_acp_train_df = pd.read_csv(f"{DATA_STORE}/acp/train_data.csv")
raw_acp_test_df = pd.read_csv(f"{DATA_STORE}/acp/test_data.csv")
for df in [raw_acp_train_df, raw_acp_test_df]:
    display(df.head(5))
for df in [raw_acp_train_df, raw_acp_test_df]:
    display(df.describe().T)
print(f"Train: {raw_acp_train_df.label.sum() / len(raw_acp_train_df) : .2%}")
print(f"Test: {raw_acp_test_df.label.sum() / len(raw_acp_test_df) : .2%}")
No class imbalance - class split is 50 - 50
len(raw_acp_test_df) / (len(raw_acp_train_df) + len(raw_acp_test_df))
Train / Test split in the total dataset
- Test ~ 20%
- Train ~ 80%
os.listdir(f"{DATA_STORE}/amp")
raw_amp_df = pd.read_csv(f"{DATA_STORE}/amp/all_data.csv")
raw_amp_df.head(5)
raw_amp_df.describe().T
raw_amp_df.label.sum() / len(raw_amp_df)
- No class imbalance - class distribution is 50%
- Need to split into train (80%) and test (20%)
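The 80/20 split noted above could be done along the following lines. This is only a sketch: the helper name `stratified_split`, the seed and the toy frame are assumptions made for illustration, not the notebook's actual code. It samples 20% of each class for the test set, so the 50/50 class balance carries over to both splits.

```python
import pandas as pd


def stratified_split(df, label_col="label", test_frac=0.2, seed=42):
    """Return (train_df, test_df), sampling test_frac of each class for test."""
    test_df = df.groupby(label_col, group_keys=False).sample(
        frac=test_frac, random_state=seed
    )
    # Everything that was not sampled for test becomes train.
    train_df = df.drop(index=test_df.index)
    return train_df.reset_index(drop=True), test_df.reset_index(drop=True)


# Toy stand-in for raw_amp_df: 10 positive and 10 negative sequences.
toy_df = pd.DataFrame(
    {"sequence": [f"SEQ{i}" for i in range(20)], "label": [0, 1] * 10}
)
toy_train_df, toy_test_df = stratified_split(toy_df)
print(len(toy_train_df), len(toy_test_df))
```

scikit-learn's `train_test_split` with its `stratify` argument would be an equally reasonable choice here.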
os.listdir(f"{DATA_STORE}/dna")
raw_dnab_train_df = pd.read_csv(f"{DATA_STORE}/dna/train.csv")
raw_dnab_test_df = pd.read_csv(f"{DATA_STORE}/dna/test.csv")
for df in [raw_dnab_train_df, raw_dnab_test_df]:
    display(df.head(5))
for df in [raw_dnab_train_df, raw_dnab_test_df]:
    display(df.describe().T)
print(f"Train: {raw_dnab_train_df.label.sum() / len(raw_dnab_train_df) : .2%}")
print(f"Test: {raw_dnab_test_df.label.sum() / len(raw_dnab_test_df) : .2%}")
No class imbalance - class split is 50 - 50
len(raw_dnab_test_df) / (len(raw_dnab_train_df) + len(raw_dnab_test_df))
Train / Test split in the total dataset
- Test ~ 14%
- Train ~ 86%
- Load, clean, split all 3 datasets
  - Clean = retain only 2 columns in all 3 dfs - sequence and label
  - Split AMP data set into train (80%) and test (20%)
ProteinDataset is an abstract base class that implements the common methods and declares abstract methods for the specific classes, one each for ACP, AMP and DNABinding.
Abstract method:
- clean_data()
Implemented methods, to be inherited without change:
- extract_features_labels()
- generate_fasta_files()
- get_lstm_emb()
- get_transformer_emb()
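The abstract / inherited split listed above can be sketched as follows. The class names `ProteinDatasetSketch` and `ToyDataset`, and the use of plain lists of dicts instead of DataFrames, are assumptions made to keep the example self-contained; this is not the actual ProteinDataset implementation.

```python
from abc import ABC, abstractmethod


class ProteinDatasetSketch(ABC):
    """Hypothetical, simplified stand-in for ProteinDataset."""

    def __init__(self, rows):
        # rows: list of dicts (assumption; the real class loads data from disk).
        self.rows = self.clean_data(rows)

    @abstractmethod
    def clean_data(self, rows):
        """Dataset-specific cleaning; each subclass must implement this."""

    def extract_features_labels(self):
        """Shared method, inherited unchanged by every subclass."""
        X = [row["sequence"] for row in self.rows]
        y = [row["label"] for row in self.rows]
        return X, y


class ToyDataset(ProteinDatasetSketch):
    def clean_data(self, rows):
        # Keep only the two columns every dataset shares.
        return [{"sequence": r["sequence"], "label": r["label"]} for r in rows]


toy = ToyDataset([{"sequence": "FLPLLL", "label": 0, "source": "demo"}])
X, y = toy.extract_features_labels()
print(X, y)
```

Because `clean_data()` is marked abstract, instantiating the base class directly raises a `TypeError`, which forces every dataset class to supply its own cleaning step.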
Details are as follows ...
Both pretrained models can read amino acid sequences from fasta files in bulk and generate the corresponding embeddings. The generate_fasta_files() method generates the fasta files that are used as input to the pretrained models' bulk APIs. Each generated fasta file has this format:
>0 |0
FLPLLLSALPSFLCLVFKKC
>1 |0
DKLIGSCVWLAVNYTSNCNAECKRRGYKGGHCGSFLNVNCWCET
>2 |0
AVKDTYSCFIMRGKCRHECHDFEKPIGFCTKLNANCYM
For each amino acid sequence in the dataset:
- The header is of the form >index |label on the first line
- Followed by the actual AA sequence on the second line
- If use_seq_max_len is set, then the sequence will be truncated at the max_seq_len of this dataset object.
fasta + BioPython
An example of creating a fasta record using the SeqRecord and Seq classes from the BioPython library is shown below. This is what this method implements.
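from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Build a single record; id and description together form the fasta header.
record = SeqRecord(
    Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF"),
    id="YP_025292.1",
    name="HokC",
    description="toxic membrane protein, small",
)
print(record.format('fasta'))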
A sequence can be truncated easily using standard Python slicing:
record[:5]
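To make the header format and the truncation rule concrete, here is a minimal sketch in plain Python. The function name `to_fasta` and its `max_seq_len` argument are hypothetical; the real generate_fasta_files() uses BioPython and writes to files, whereas this version only returns the text.

```python
def to_fasta(sequences, labels, max_seq_len=None):
    """Build fasta text with one '>index |label' header per sequence."""
    lines = []
    for index, (seq, label) in enumerate(zip(sequences, labels)):
        if max_seq_len is not None:
            # Truncate, mirroring what use_seq_max_len is described as doing.
            seq = seq[:max_seq_len]
        lines.append(f">{index} |{label}")
        lines.append(seq)
    return "\n".join(lines) + "\n"


fasta_text = to_fasta(
    ["FLPLLLSALPSFLCLVFKKC", "AVKDTYSC"], [0, 1], max_seq_len=10
)
print(fasta_text)
```

Keeping the label in the header means it travels with the sequence through the embedding step, so it can later be recovered by splitting the header on the `|` character.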
Reading Embeddings From H5
The get_lstm_emb() method loads the LSTM embeddings that were generated by the pretrained ProSE model, using code similar to the sample shown below.
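import h5py
import numpy as np


def get_embeddings(h5_file):
    Xs = []
    ys = []
    with h5py.File(h5_file, "r") as f:
        for key in f.keys():
            # Each key is a fasta header; the label follows the last '|'.
            label = key.split('|')[-1]
            ys.append(int(label))
            # [()] reads the whole dataset into a NumPy array.
            seq = f[key][()]
            Xs.append(seq)
    Xs = np.stack(Xs, axis=0)
    ys = np.stack(ys, axis=0)
    return Xs, ys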
Reading ESM Embeddings
The get_transformer_emb() method loads the Transformer embeddings that were generated by the pretrained ESM model, using code similar to the sample shown below.
import esm
import numpy as np
import torch


def get_embeddings(fasta_path, emb_path, emb_layer):
    ys = []
    Xs = []
    for header, _seq in esm.data.read_fasta(fasta_path):
        # The label follows the last '|' in the fasta header.
        label = header.split('|')[-1]
        ys.append(int(label))
        # One .pt file per sequence, named after the header without '>'.
        emb_file = f'{emb_path}/{header[1:]}.pt'
        embs = torch.load(emb_file)
        Xs.append(embs['mean_representations'][emb_layer])
    Xs = np.stack(Xs, axis=0)
    ys = np.stack(ys, axis=0)
    return Xs, ys
Class for ACP data.
- Only implements a single method - the clean_data() abstract method defined in the abstract base class.
  - For ACP data - it just renames the sequences column to make it consistent across datasets.
- The __init__() method loads, cleans, extracts labels & features for train and test splits of the datasets. It does this by calling the following 2 methods.
  - clean_data() method described below.
  - extract_features_labels() method described in the ProteinDataset abstract class definition above.
Class for AMP data.
- Only implements clean_data(), inherits the rest.
Class for DNA Binding data. Like the other 2, it only implements clean_data() and inherits the rest.
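As a rough illustration of what a dataset-specific clean_data() might do for ACP, the sketch below renames the sequence column and keeps only the two shared columns. The function name `clean_acp`, the raw column name "sequences" and the toy frame are assumptions; the actual column names in the ACP files may differ.

```python
import pandas as pd


def clean_acp(df):
    """Rename the sequence column and keep only the two shared columns."""
    # "sequences" as the raw column name is an assumption for this sketch.
    df = df.rename(columns={"sequences": "sequence"})
    return df[["sequence", "label"]]


# Toy stand-in for the raw ACP frame, with one column that gets dropped.
raw = pd.DataFrame(
    {"sequences": ["FLPLLL", "DKLIGS"], "label": [0, 1], "extra": [1, 2]}
)
clean = clean_acp(raw)
print(clean.columns.tolist())
```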
An example of how to use the dataset classes.
acp_data = ACPDataset(DATA_STORE)
amp_data = AMPDataset(DATA_STORE)
dna_data = DNABindDataset(DATA_STORE)
dna_data.test
acp_data.X_train
First defining some convenience functions for plotting.
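The plotting helpers themselves are not reproduced here. As a rough idea of the kind of computation an amino acid distribution plot rests on, the sketch below counts residue frequencies across a list of sequences. The function name `aa_frequencies` is hypothetical and is not claimed to match what plot_AA_dist does internally.

```python
from collections import Counter


def aa_frequencies(sequences):
    """Return the relative frequency of each amino acid across all sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    # Sort by residue letter so the result has a stable order for plotting.
    return {aa: n / total for aa, n in sorted(counts.items())}


freqs = aa_frequencies(["AAK", "KC"])
print(freqs)
```

The resulting mapping could be passed to any bar-plot function, with residues on the x-axis and frequencies on the y-axis.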
print(f"Samples in train: {len(acp_data.train)}")
plot_seqlen_dist(acp_data.train[["length", "label"]], "Anti Cancer Peptide")
plot_AA_dist(acp_data.train[["sequence"]], "Anti Cancer Peptide")
print(f"Samples in train: {len(amp_data.train)}")
plot_seqlen_dist(
    amp_data.train[["length", "label"]], "Antimicrobial Peptide", log_scale=False
)
plot_seqlen_dist(
    amp_data.train[["length", "label"]], "Antimicrobial Peptide", log_scale=True
)
plot_AA_dist(amp_data.train[["sequence"]], "Antimicrobial Peptide")
print(f"Samples in train: {len(dna_data.train)}")
plot_seqlen_dist(
    dna_data.train[["length", "label"]],
    "DNA Binding Protein",
    log_scale=False,
)
plot_seqlen_dist(
    dna_data.train[["length", "label"]],
    "DNA Binding Protein",
    log_scale=True,
)
plot_AA_dist(dna_data.train[["sequence"]], "DNA Binding Protein")
Once the dataset objects are created, the following calls will generate fasta files and persist them in the default locations.
acp_data.generate_fasta_files()
amp_data.generate_fasta_files()
amp_data.generate_fasta_files(use_seq_max_len=True)
dna_data.generate_fasta_files()
dna_data.generate_fasta_files(use_seq_max_len=True)
- Steps for generating LSTM embeddings in bulk from fasta files detailed here.
- Steps for generating Transformer embeddings in bulk from fasta files detailed here.
The following is an example of loading LSTM embeddings.
X_train, y_train, X_test, y_test = acp_data.get_lstm_emb(
    "acp_avgpool_train.h5", "acp_avgpool_test.h5"
)
X_train.shape, y_train.shape, X_test.shape, y_test.shape
And the following is an example of loading Transformer embeddings.
X_train, y_train, X_test, y_test = acp_data.get_transformer_emb(
    "acp_train.fasta", "acp_test.fasta"
)
X_train.shape, y_train.shape, X_test.shape, y_test.shape