Classes and functions to load dataset, clean and save for further processing & vocab creation.
 

Protein, Peptides, Amino Acids - Basics

Analyze Data

DATA_STORE
'/home/vinod/.peptide/datasets'
os.listdir(f"{DATA_STORE}")
['acp', 'dna', 'amp']

Anti Cancer Peptide Dataset (ACP)

os.listdir(f"{DATA_STORE}/acp")
['transformer', 'fasta', 'lstm', 'train_data.csv', 'test_data.csv']
raw_acp_train_df = pd.read_csv(f"{DATA_STORE}/acp/train_data.csv")
raw_acp_test_df = pd.read_csv(f"{DATA_STORE}/acp/test_data.csv")
for df in [raw_acp_train_df, raw_acp_test_df]:
    display(df.head(5))
sequences label
0 RRWWRRWRRW 0
1 GWKSVFRKAKKVGKTVGGLALDHYLG 0
2 ALWKTMLKKLGTMALHAGKAALGAAADTISQGTQ 1
3 GLFDVIKKVAAVIGGL 1
4 VAKLLAKLAKKVL 1
sequences label
0 FLPLLLSALPSFLCLVFKKC 0
1 DKLIGSCVWLAVNYTSNCNAECKRRGYKGGHCGSFLNVNCWCET 0
2 AVKDTYSCFIMRGKCRHECHDFEKPIGFCTKLNANCYM 0
3 GLPTCGETCFGGTCNTPGCTCDPWPVCTHN 1
4 ENCGRQAG 0
for df in [raw_acp_train_df, raw_acp_test_df]:
    display(df.describe().T)
count mean std min 25% 50% 75% max
label 1378.0 0.5 0.500182 0.0 0.0 0.5 1.0 1.0
count mean std min 25% 50% 75% max
label 344.0 0.5 0.500728 0.0 0.0 0.5 1.0 1.0
print(f"Train: {raw_acp_train_df.label.sum() / len(raw_acp_train_df) : .2%}")
print(f"Test: {raw_acp_test_df.label.sum() / len(raw_acp_test_df) : .2%}")
Train:  50.00%
Test:  50.00%

No class imbalance - class split is 50 - 50

len(raw_acp_test_df) / (len(raw_acp_train_df) + len(raw_acp_test_df))
0.1997677119628339

Train / Test split in the total dataset

  • Test ~ 20%
  • Train ~ 80%

Antimicrobial Peptide Dataset (AMP)

os.listdir(f"{DATA_STORE}/amp")
['transformer', 'all_data.csv', 'fasta', 'lstm']
raw_amp_df = pd.read_csv(f"{DATA_STORE}/amp/all_data.csv")
raw_amp_df.head(5)
PDBs_code SequenceID label
0 AP02484 GMASKAGSVLGKITKIALGAL 1
1 AP02630 NIGLFTSTCFSSQCFSSKCFTDTCFSSNCFTGRHQCGYTHGSC 1
2 AP01427 GAIKDALKGAAKTVAVELLKKAQCKLEKTC 1
3 AP02983 FFGRLKAVFRGARQGWKEHRY 1
4 AP01815 DFGCARGMIFVCMRRCARMYPGSTGYCQGFRCMCDTMIPIRRPPFIMG 1
raw_amp_df.describe().T
count mean std min 25% 50% 75% max
label 4042.0 0.5 0.500062 0.0 0.0 0.5 1.0 1.0
raw_amp_df.label.sum() / len(raw_amp_df)
0.5
  • No class imbalance - class distribution is 50%
  • Need to split into train (80%) and test (20%)

DNA-Binding Protein Dataset

os.listdir(f"{DATA_STORE}/dna")
['test.csv', 'transformer', 'fasta', 'lstm', 'train.csv']
raw_dnab_train_df = pd.read_csv(f"{DATA_STORE}/dna/train.csv")
raw_dnab_test_df = pd.read_csv(f"{DATA_STORE}/dna/test.csv")
for df in [raw_dnab_train_df, raw_dnab_test_df]:
    display(df.head(5))
code sequence label origin
0 Q6A8L0 MSGHSKWATTKHKKAAIDAKRGKLFARLIKNIEVAARLGGGDPSGN... 1 https://github.com/hfuulgb/PDB-Fusion/tree/mai...
1 Q7V7T9 MIGWLQGQKVEAWQQGTRQGVVLACAGVGYEVQIAPRHLSEMEHGQ... 1 https://github.com/hfuulgb/PDB-Fusion/tree/mai...
2 Q9ZUP2 MARILRNVYSLRSSLFSSELLRRSVVGTSFQLRGFAAKAKKKSKSD... 1 https://github.com/hfuulgb/PDB-Fusion/tree/mai...
3 Q2JVG1 MKCPRCGKQEIRVLESRSAEGGQSVRRRRECMSCGYRFTTYERIEF... 1 https://github.com/hfuulgb/PDB-Fusion/tree/mai...
4 Q9K4Q3 MTKADIIEGVYEKVGFSKKESAEIVELVFDTLKETLERGDKIKISG... 1 https://github.com/hfuulgb/PDB-Fusion/tree/mai...
code sequence label origin
0 P27204|1 AKKRSRSRKRSASRKRSRSRKRSASKKSSKKHVRKALAAGMKNHLL... 1 https://github.com/hfuulgb/PDB-Fusion/tree/mai...
1 P53528|1 MVMVVNPLTAGLDDEQREAVLAPRGPVCVLAGAGTGKTRTITHRIA... 1 https://github.com/hfuulgb/PDB-Fusion/tree/mai...
2 P52684|1 MKDDINQEITFRKLSVFMMFMAKGNIARTAEAMKLSSVSVHRALHT... 1 https://github.com/hfuulgb/PDB-Fusion/tree/mai...
3 P10961|1 MNNAANTGTTNESNVSDAPRIEPLPSLNDDDIEKILQPNDIFTTDR... 1 https://github.com/hfuulgb/PDB-Fusion/tree/mai...
4 P06023|1 MAKPAKRIKSAAAAYVPQNRDAVITDIKRIGDLQREASRLETEMND... 1 https://github.com/hfuulgb/PDB-Fusion/tree/mai...
for df in [raw_dnab_train_df, raw_dnab_test_df]:
    display(df.describe().T)
count mean std min 25% 50% 75% max
label 14189.0 0.502431 0.500012 0.0 0.0 1.0 1.0 1.0
count mean std min 25% 50% 75% max
label 2272.0 0.507482 0.500054 0.0 0.0 1.0 1.0 1.0
print(f"Train: {raw_dnab_train_df.label.sum() / len(raw_dnab_train_df) : .2%}")
print(f"Test: {raw_dnab_test_df.label.sum() / len(raw_dnab_test_df) : .2%}")
Train:  50.24%
Test:  50.75%

No class imbalance - class split is 50 - 50

len(raw_dnab_test_df) / (len(raw_dnab_train_df) + len(raw_dnab_test_df))
0.1380232063665634

Train / Test split in the total dataset

  • Test ~ 14%
  • Train ~ 86%

Dataset Classes

Organizing all pre-processing steps into classes.

  • Load, clean, split all 3 datasets
    • Clean = retain only 2 columns in all 3 dfs - sequence and label
    • Split AMP data set into train (80%) and test (20%)

class ProteinDataset[source]

ProteinDataset(name:str, location:str, max_seq_len:int=None) :: ABC

Abstract base class for protein datasets.
Type Default Details
name str name of the dataset (e.g. 'acp' or 'amp' or 'dna')
location str location of the raw dataset
max_seq_len int None amino acid sequences will be truncated at this max length

ProteinDataset is an abstract base class implementing common methods and providing abstract methods for the specific classes one each for ACP, AMP and DNABinding.

Abstract Method:

  • clean_data()

Implemented method to be inherited without change:

  • extract_features_labels()
  • generate_fasta_files()
  • get_lstm_emb()
  • get_transformer_emb()

Details are as follows ...

ProteinDataset.clean_data[source]

ProteinDataset.clean_data()

Abstract method for cleaning raw data.

ProteinDataset.extract_features_labels[source]

ProteinDataset.extract_features_labels(df:DataFrame)

Extract features and separate labels.
Type Default Details
df DataFrame cleaned train or test dataframe

ProteinDataset.generate_fasta_files[source]

ProteinDataset.generate_fasta_files(out_dir:str=None, use_seq_max_len:bool=False)

Generate fasta files (using the BioPython lib) for the given protein dataset.
Type Default Details
out_dir str None output directory, defaults to 'dataset_location/dataset_name/fasta'
use_seq_max_len bool False if True, uses truncated amino acid sequences, else full sequence

Both the pretrained models are able to read in amino acid sequences from fasta files in bulk and generate corresponding embeddings. This method is used to generate the fasta files that will be used as input to the pretrained models' bulk APIs. The fasta file generated is of this format:

>0 |0
FLPLLLSALPSFLCLVFKKC
>1 |0
DKLIGSCVWLAVNYTSNCNAECKRRGYKGGHCGSFLNVNCWCET
>2 |0
AVKDTYSCFIMRGKCRHECHDFEKPIGFCTKLNANCYM

For each amino acid sequence in the dataset:

  • The header is of the form - >index |label on the first line
  • Followed by the actual AA sequence on the second line
    • If use_seq_max_len is set, then the sequence will be truncated at the max_seq_len of this dataset object.

fasta + BioPython

An example of creating a fasta record using the SeqRecord and Sequence classes from the BioPython library is shown below. This is what this method implements.

record = SeqRecord(
    Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF"),
    id="YP_025292.1",
    name="HokC",
    description="toxic membrane protein, small",
)
print(record.format('fasta'))
>YP_025292.1 toxic membrane protein, small
MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF

And a sequence can be truncated eaily using standard Python slicing

record[:5]
SeqRecord(seq=Seq('MKQHK'), id='YP_025292.1', name='HokC', description='toxic membrane protein, small', dbxrefs=[])

ProteinDataset.get_lstm_emb[source]

ProteinDataset.get_lstm_emb(train_h5_file:str, test_h5_file:str, h5_location:str=None)

Read (ProSE) LSTM embeddings from HDF5 files, returns X_train, y_train, X_test, y_test.
Type Default Details
train_h5_file str train h5 filename
test_h5_file str test h5 filename
h5_location str None LSTM embedding directory - defaults to 'dataset_location/dataset_name/lstm'

Reading Embeddings From H5

This method reads in (loads) the LSTM embeddings that were generated by the pretrained (ProSE) model using code similar to the sample shown below.

def get_embeddings(h5_file):

    Xs = []
    ys = []
    with h5py.File(h5_file, "r") as f:
        for key in f.keys():
            label = key.split('|')[-1]
            ys.append(int(label))
            seq = f[key][()]
            Xs.append(seq)
    Xs = np.stack(Xs, axis=0)
    ys = np.stack(ys, axis=0)
    return Xs, ys

ProteinDataset.get_transformer_emb[source]

ProteinDataset.get_transformer_emb(train_fasta_file:str, test_fasta_file:str, fasta_location:str=None, emb_location:str=None, emb_layer:int=33)

Read (ESM) Transformer embeddings.
Type Default Details
train_fasta_file str train fasta filename
test_fasta_file str test fasta filename
fasta_location str None fasta directory - defaults to 'dataset_location/dataset_name/fasta'
emb_location str None ESM embedding directory - defaults to 'dataset_location/dataset_name/transformer'
emb_layer int 33 Layer of ESM Transformer to get embedding from - defaults to final layer (33)

Reading ESM Embeddings

This method reads in (loads) the Transformer embeddings that were generated by the pretrained (ESM) model using code similar to the sample shown below.

def get_embeddings(fasta_path, emb_path, emb_layer):
    ys = []
    Xs = []
    for header, _seq in esm.data.read_fasta(fasta_path):
        label = header.split('|')[-1]
        ys.append(int(label))
        emb_file = f'{emb_path}/{header[1:]}.pt'
        embs = torch.load(emb_file)
        Xs.append(embs['mean_representations'][emb_layer])
    Xs = np.stack(Xs, axis=0)
    ys = np.stack(ys, axis=0)
    return Xs, ys

class ACPDataset[source]

ACPDataset(location:str, max_seq_len:int=None) :: ProteinDataset

Class for Anticancer Peptide Dataset.
Type Default Details
location str location of raw ACP dataset (must contain dir called 'acp' with train and test CSVs in it)
max_seq_len int None max length, if amino acid sequences are to be truncated

Class for ACP data.

  • Only implements a single method - the clean_data() abstract method defined in the abstract base class.
    • For ACP data - it just renames the sequences column to make it consistent across datasets.
  • The __init__() method loads, cleans, extracts labels & features for train and test splits of the datasets. It does this by calling the following 2 methods.
    • clean_data() method described below.
    • extract_features_labels() method described in the ProteinDataset abstract class definition above.

ACPDataset.clean_data[source]

ACPDataset.clean_data()

Implemenation of abstract method - load, clean and return ACP train and test dataframes.

class AMPDataset[source]

AMPDataset(location:str, max_seq_len:int=150, test_pct:float=0.2, seed=1234) :: ProteinDataset

Class for Antimicrobial Peptide Dataset.
Type Default Details
location str location of raw AMP dataset (must contain dir called 'amp' with 'all_data.csv' in it)
max_seq_len int 150 max length that amino acid sequences will be truncated at
test_pct float 0.2 split percent for test data
seed int 1234 random state for split

Class for AMP data.

  • Only implements clean_data(), inherits the rest.

AMPDataset.clean_data[source]

AMPDataset.clean_data(test_pct:float=0.2, seed:int=1234)

Implemenation of abstract method - load, clean, split and return AMP train and test dataframes.
Type Default Details
test_pct float 0.2 split percent for test data
seed int 1234 random state for split

class DNABindDataset[source]

DNABindDataset(location:str, max_seq_len:int=300) :: ProteinDataset

Class for DNA Binding Protein Dataset.
Type Default Details
location str location of raw DNA dataset (must contain dir called 'dna' with train and test CSVs in it)
max_seq_len int 300 max length that amino acid sequences will be truncated at

Class for DNA Binding data. Like the other 2 only implements clean_data() and inherits the rest.

DNABindDataset.clean_data[source]

DNABindDataset.clean_data()

Implemenation of abstract method - load, clean and return DNA Binding train and test dataframes.

An example of how to use.

acp_data = ACPDataset(DATA_STORE)
amp_data = AMPDataset(DATA_STORE)
dna_data = DNABindDataset(DATA_STORE)
dna_data.test
sequence label length
0 AKKRSRSRKRSASRKRSRSRKRSASKKSSKKHVRKALAAGMKNHLL... 1 101
1 MVMVVNPLTAGLDDEQREAVLAPRGPVCVLAGAGTGKTRTITHRIA... 1 714
2 MKDDINQEITFRKLSVFMMFMAKGNIARTAEAMKLSSVSVHRALHT... 1 308
3 MNNAANTGTTNESNVSDAPRIEPLPSLNDDDIEKILQPNDIFTTDR... 1 833
4 MAKPAKRIKSAAAAYVPQNRDAVITDIKRIGDLQREASRLETEMND... 1 174
... ... ... ...
2267 MNFSRERTITEIQNDYKEQVERQNQLKKRRRKGLYRRLTVFGALVF... 0 125
2268 MVVVDKEIKKGQYYLVNGNVVRVTYVNGFDVYYLILKLHKRMICDR... 0 60
2269 MNPSTHVSSNGPTTPPHGPHTTFLPPTSPAPSTSSVAAATLCSPQR... 0 668
2270 MVRSGKKAVVLAAVAFCATSVVQKSHGFVPSPLRQRAAAAGAAAAS... 0 370
2271 MNYSIRLFKIMGIPIELHITFILFLVVIIGLSIMNNSIFWAVLFIL... 0 339

2272 rows × 3 columns

acp_data.X_train
array([['R', 'R', 'W', ..., None, None, None],
       ['G', 'W', 'K', ..., None, None, None],
       ['A', 'L', 'W', ..., None, None, None],
       ...,
       ['E', 'S', 'E', ..., 'Q', 'E', None],
       ['F', 'I', 'S', ..., None, None, None],
       ['R', 'L', 'S', ..., None, None, None]], dtype=object)

Exploratory Data Analysis

First defining some convenience functions for plotting.

plot_seqlen_dist[source]

plot_seqlen_dist(df:DataFrame, dataset_name:str, log_scale:bool=False, bins:int=75)

Plot the sequence length distribution given a train df with length and label columns.
Type Default
df DataFrame
dataset_name str
log_scale bool False
bins int 75

plot_AA_dist[source]

plot_AA_dist(df:DataFrame, dataset_name:str)

Plot the distribution of amino acids given a train df with "sequence" column.
Type Default
df DataFrame
dataset_name str

ACP

print(f"Samples in train: {len(acp_data.train)}")
Samples in train: 1378
plot_seqlen_dist(acp_data.train[["length", "label"]], "Anti Cancer Peptide")
plot_AA_dist(acp_data.train[["sequence"]], "Anti Cancer Peptide")
Number of amino acids in the dataset: 20
Frequencies: [('K', 4141), ('L', 3397), ('G', 3153), ('A', 2825), ('C', 2251), ('I', 2029), ('R', 1810), ('V', 1739), ('S', 1699), ('F', 1535), ('T', 1309), ('P', 1236), ('N', 1119), ('E', 759), ('D', 731), ('Y', 693), ('Q', 690), ('W', 688), ('H', 646), ('M', 383)]

AMP

print(f"Samples in train: {len(amp_data.train)}")
Samples in train: 3234
plot_seqlen_dist(
    amp_data.train[["length", "label"]], "Antimicrobial Peptide", log_scale=False
)
plot_seqlen_dist(
    amp_data.train[["length", "label"]], "Antimicrobial Peptide", log_scale=True
)
plot_AA_dist(amp_data.train[["sequence"]], "Antimicrobial Peptide")
Number of amino acids in the dataset: 21
Frequencies: [('L', 9768), ('G', 9521), ('K', 8835), ('A', 8356), ('S', 7371), ('V', 6703), ('R', 6491), ('I', 6294), ('E', 5366), ('P', 5338), ('T', 5233), ('D', 4651), ('N', 4613), ('F', 4392), ('C', 4219), ('Q', 3831), ('Y', 2962), ('H', 2618), ('M', 1830), ('W', 1578), ('X', 4)]

DNA Binding

print(f"Samples in train: {len(dna_data.train)}")
Samples in train: 14189
plot_seqlen_dist(
    dna_data.train[["length", "label"]],
    "DNA Binding Protein",
    log_scale=False,
)
plot_seqlen_dist(
    dna_data.train[["length", "label"]],
    "DNA Binding Protein",
    log_scale=True,
)
plot_AA_dist(dna_data.train[["sequence"]], "DNA Binding Protein")
Number of amino acids in the dataset: 24
Frequencies: [('L', 550524), ('S', 472593), ('A', 457288), ('E', 414062), ('G', 393279), ('K', 377806), ('V', 367475), ('R', 343987), ('T', 325286), ('P', 324499), ('D', 322736), ('I', 320614), ('Q', 272663), ('N', 271296), ('F', 215949), ('Y', 166668), ('H', 149670), ('M', 138056), ('C', 92015), ('W', 57815), ('X', 116), ('U', 3), ('O', 1), ('B', 1)]

Generate fasta Files

Once the dataset objects are created, the following calls will generate fasta files and persist them in the default locations.

acp_data.generate_fasta_files()
Created /home/vinod/.peptide/datasets/acp/fasta/acp_train.fasta with 1378 sequence records
Created /home/vinod/.peptide/datasets/acp/fasta/acp_test.fasta with 344 sequence records
amp_data.generate_fasta_files()
Created /home/vinod/.peptide/datasets/amp/fasta/amp_train.fasta with 3234 sequence records
Created /home/vinod/.peptide/datasets/amp/fasta/amp_test.fasta with 808 sequence records
amp_data.generate_fasta_files(use_seq_max_len=True)
Created /home/vinod/.peptide/datasets/amp/fasta/amp_train_seqlen_150.fasta with 3234 sequence records
Created /home/vinod/.peptide/datasets/amp/fasta/amp_test_seqlen_150.fasta with 808 sequence records
dna_data.generate_fasta_files()
Created /home/vinod/.peptide/datasets/dna/fasta/dna_train.fasta with 14189 sequence records
Created /home/vinod/.peptide/datasets/dna/fasta/dna_test.fasta with 2272 sequence records
dna_data.generate_fasta_files(use_seq_max_len=True)
Created /home/vinod/.peptide/datasets/dna/fasta/dna_train_seqlen_300.fasta with 14189 sequence records
Created /home/vinod/.peptide/datasets/dna/fasta/dna_test_seqlen_300.fasta with 2272 sequence records

Generate Embeddings From Pretrained Models

  • Steps for generating LSTM embeddings in bulk from fasta files detailed here.
  • Steps for generating Transformer embeddings in bulk from fasta files detailed here.

Load Embeddings

The following is an example of loading LSTM embeddings.

X_train, y_train, X_test, y_test = acp_data.get_lstm_emb(
    "acp_avgpool_train.h5", "acp_avgpool_test.h5"
)
X_train.shape, y_train.shape, X_test.shape, y_test.shape
((1378, 6165), (1378,), (344, 6165), (344,))

And the following is an example of loading Transformer embeddings.

X_train, y_train, X_test, y_test = acp_data.get_transformer_emb(
    "acp_train.fasta", "acp_test.fasta"
)
X_train.shape, y_train.shape, X_test.shape, y_test.shape
((1378, 1280), (1378,), (344, 1280), (344,))