Protein, Peptides, Amino Acids - Basics

Analyze Data

DATA_STORE

'/home/vinod/.peptide/datasets'

os.listdir(f"{DATA_STORE}")

['acp', 'dna', 'amp']

Anti Cancer Peptide Dataset (ACP)

os.listdir(f"{DATA_STORE}/acp")

['transformer', 'fasta', 'lstm', 'train_data.csv', 'test_data.csv']

raw_acp_train_df = pd.read_csv(f"{DATA_STORE}/acp/train_data.csv")
raw_acp_test_df = pd.read_csv(f"{DATA_STORE}/acp/test_data.csv")

for df in [raw_acp_train_df, raw_acp_test_df]:
    display(df.head(5))

for df in [raw_acp_train_df, raw_acp_test_df]:
    display(df.describe().T)

print(f"Train: {raw_acp_train_df.label.sum() / len(raw_acp_train_df) : .2%}")
print(f"Test: {raw_acp_test_df.label.sum() / len(raw_acp_test_df) : .2%}")

Train:  50.00%
Test:  50.00%

No class imbalance - class split is 50 - 50

len(raw_acp_test_df) / (len(raw_acp_train_df) + len(raw_acp_test_df))

0.1997677119628339

Train / Test split in the total dataset

Test ~ 20%
Train ~ 80%

Antimicrobial Peptide Dataset (AMP)

os.listdir(f"{DATA_STORE}/amp")

['transformer', 'all_data.csv', 'fasta', 'lstm']

raw_amp_df = pd.read_csv(f"{DATA_STORE}/amp/all_data.csv")

raw_amp_df.head(5)

raw_amp_df.describe().T

raw_amp_df.label.sum() / len(raw_amp_df)

0.5

No class imbalance - class distribution is 50%
Need to split into train (80%) and test (20%)

DNA-Binding Protein Dataset

os.listdir(f"{DATA_STORE}/dna")

['test.csv', 'transformer', 'fasta', 'lstm', 'train.csv']

raw_dnab_train_df = pd.read_csv(f"{DATA_STORE}/dna/train.csv")
raw_dnab_test_df = pd.read_csv(f"{DATA_STORE}/dna/test.csv")

for df in [raw_dnab_train_df, raw_dnab_test_df]:
    display(df.head(5))

for df in [raw_dnab_train_df, raw_dnab_test_df]:
    display(df.describe().T)

print(f"Train: {raw_dnab_train_df.label.sum() / len(raw_dnab_train_df) : .2%}")
print(f"Test: {raw_dnab_test_df.label.sum() / len(raw_dnab_test_df) : .2%}")

Train:  50.24%
Test:  50.75%

No class imbalance - class split is 50 - 50

len(raw_dnab_test_df) / (len(raw_dnab_train_df) + len(raw_dnab_test_df))

0.1380232063665634

Train / Test split in the total dataset

Test ~ 14%
Train ~ 86%

Dataset Classes

Organizing all pre-processing steps into classes.

Load, clean, split all 3 datasets
- Clean = retain only 2 columns in all 3 dfs - sequence and label
- Split AMP data set into train (80%) and test (20%)

Abstract base class for protein datasets.

ProteinDataset is an abstract base class implementing common methods and providing abstract methods for the specific classes one each for ACP, AMP and DNABinding.

Abstract Method:

clean_data()

Implemented method to be inherited without change:

extract_features_labels()
generate_fasta_files()
get_lstm_emb()
get_transformer_emb()

Details are as follows ...

Abstract method for cleaning raw data.

Extract features and separate labels.

Generate fasta files (using the BioPython lib) for the given protein dataset.

Both the pretrained models are able to read in amino acid sequences from fasta files in bulk and generate corresponding embeddings. This method is used to generate the fasta files that will be used as input to the pretrained models' bulk APIs. The fasta file generated is of this format:

>0 |0
FLPLLLSALPSFLCLVFKKC
>1 |0
DKLIGSCVWLAVNYTSNCNAECKRRGYKGGHCGSFLNVNCWCET
>2 |0
AVKDTYSCFIMRGKCRHECHDFEKPIGFCTKLNANCYM

For each amino acid sequence in the dataset:

The header is of the form - >index |label on the first line
Followed by the actual AA sequence on the second line
- If use_seq_max_len is set, then the sequence will be truncated at the max_seq_len of this dataset object.

fasta + BioPython

An example of creating a fasta record using the SeqRecord and Sequence classes from the BioPython library is shown below. This is what this method implements.

record = SeqRecord(
    Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF"),
    id="YP_025292.1",
    name="HokC",
    description="toxic membrane protein, small",
)
print(record.format('fasta'))

>YP_025292.1 toxic membrane protein, small
MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF

And a sequence can be truncated eaily using standard Python slicing

record[:5]

SeqRecord(seq=Seq('MKQHK'), id='YP_025292.1', name='HokC', description='toxic membrane protein, small', dbxrefs=[])

Note: The following 2 methods are to be used after extracting embeddings from the pretrained models - LSTM (ProSE) or Transformer (ESM) - steps for which are detailed here for ProSE and here for ESM.

Read (ProSE) LSTM embeddings from HDF5 files, returns X_train, y_train, X_test, y_test.

Reading Embeddings From H5

This method reads in (loads) the LSTM embeddings that were generated by the pretrained (ProSE) model using code similar to the sample shown below.

def get_embeddings(h5_file):

    Xs = []
    ys = []
    with h5py.File(h5_file, "r") as f:
        for key in f.keys():
            label = key.split('|')[-1]
            ys.append(int(label))
            seq = f[key][()]
            Xs.append(seq)
    Xs = np.stack(Xs, axis=0)
    ys = np.stack(ys, axis=0)
    return Xs, ys

Read (ESM) Transformer embeddings.

Reading ESM Embeddings

This method reads in (loads) the Transformer embeddings that were generated by the pretrained (ESM) model using code similar to the sample shown below.

From example on the ESM repo

def get_embeddings(fasta_path, emb_path, emb_layer):
    ys = []
    Xs = []
    for header, _seq in esm.data.read_fasta(fasta_path):
        label = header.split('|')[-1]
        ys.append(int(label))
        emb_file = f'{emb_path}/{header[1:]}.pt'
        embs = torch.load(emb_file)
        Xs.append(embs['mean_representations'][emb_layer])
    Xs = np.stack(Xs, axis=0)
    ys = np.stack(ys, axis=0)
    return Xs, ys

Class for Anticancer Peptide Dataset.

Class for ACP data.

Only implements a single method - the clean_data() abstract method defined in the abstract base class.
- For ACP data - it just renames the sequences column to make it consistent across datasets.
The __init__() method loads, cleans, extracts labels & features for train and test splits of the datasets. It does this by calling the following 2 methods.
- clean_data() method described below.
- extract_features_labels() method described in the ProteinDataset abstract class definition above.

Implemenation of abstract method - load, clean and return ACP train and test dataframes.

Class for Antimicrobial Peptide Dataset.

Class for AMP data.

Only implements clean_data(), inherits the rest.

Implemenation of abstract method - load, clean, split and return AMP train and test dataframes.

Class for DNA Binding Protein Dataset.

Class for DNA Binding data. Like the other 2 only implements clean_data() and inherits the rest.

Implemenation of abstract method - load, clean and return DNA Binding train and test dataframes.

An example of how to use.

acp_data = ACPDataset(DATA_STORE)
amp_data = AMPDataset(DATA_STORE)
dna_data = DNABindDataset(DATA_STORE)

dna_data.test

acp_data.X_train

array([['R', 'R', 'W', ..., None, None, None],
       ['G', 'W', 'K', ..., None, None, None],
       ['A', 'L', 'W', ..., None, None, None],
       ...,
       ['E', 'S', 'E', ..., 'Q', 'E', None],
       ['F', 'I', 'S', ..., None, None, None],
       ['R', 'L', 'S', ..., None, None, None]], dtype=object)

Exploratory Data Analysis

First defining some convenience functions for plotting.

Plot the sequence length distribution given a train df with length and label columns.

Plot the distribution of amino acids given a train df with "sequence" column.

ACP

print(f"Samples in train: {len(acp_data.train)}")

Samples in train: 1378

plot_seqlen_dist(acp_data.train[["length", "label"]], "Anti Cancer Peptide")

plot_AA_dist(acp_data.train[["sequence"]], "Anti Cancer Peptide")

Number of amino acids in the dataset: 20
Frequencies: [('K', 4141), ('L', 3397), ('G', 3153), ('A', 2825), ('C', 2251), ('I', 2029), ('R', 1810), ('V', 1739), ('S', 1699), ('F', 1535), ('T', 1309), ('P', 1236), ('N', 1119), ('E', 759), ('D', 731), ('Y', 693), ('Q', 690), ('W', 688), ('H', 646), ('M', 383)]

AMP

print(f"Samples in train: {len(amp_data.train)}")

Samples in train: 3234

plot_seqlen_dist(
    amp_data.train[["length", "label"]], "Antimicrobial Peptide", log_scale=False
)

plot_seqlen_dist(
    amp_data.train[["length", "label"]], "Antimicrobial Peptide", log_scale=True
)

plot_AA_dist(amp_data.train[["sequence"]], "Antimicrobial Peptide")

Number of amino acids in the dataset: 21
Frequencies: [('L', 9768), ('G', 9521), ('K', 8835), ('A', 8356), ('S', 7371), ('V', 6703), ('R', 6491), ('I', 6294), ('E', 5366), ('P', 5338), ('T', 5233), ('D', 4651), ('N', 4613), ('F', 4392), ('C', 4219), ('Q', 3831), ('Y', 2962), ('H', 2618), ('M', 1830), ('W', 1578), ('X', 4)]

DNA Binding

print(f"Samples in train: {len(dna_data.train)}")

Samples in train: 14189

plot_seqlen_dist(
    dna_data.train[["length", "label"]],
    "DNA Binding Protein",
    log_scale=False,
)

plot_seqlen_dist(
    dna_data.train[["length", "label"]],
    "DNA Binding Protein",
    log_scale=True,
)

plot_AA_dist(dna_data.train[["sequence"]], "DNA Binding Protein")

Number of amino acids in the dataset: 24
Frequencies: [('L', 550524), ('S', 472593), ('A', 457288), ('E', 414062), ('G', 393279), ('K', 377806), ('V', 367475), ('R', 343987), ('T', 325286), ('P', 324499), ('D', 322736), ('I', 320614), ('Q', 272663), ('N', 271296), ('F', 215949), ('Y', 166668), ('H', 149670), ('M', 138056), ('C', 92015), ('W', 57815), ('X', 116), ('U', 3), ('O', 1), ('B', 1)]

Generate fasta Files

Once the dataset objects are created, the following calls will generate fasta files and persist them in the default locations.

acp_data.generate_fasta_files()

Created /home/vinod/.peptide/datasets/acp/fasta/acp_train.fasta with 1378 sequence records
Created /home/vinod/.peptide/datasets/acp/fasta/acp_test.fasta with 344 sequence records

amp_data.generate_fasta_files()

Created /home/vinod/.peptide/datasets/amp/fasta/amp_train.fasta with 3234 sequence records
Created /home/vinod/.peptide/datasets/amp/fasta/amp_test.fasta with 808 sequence records

amp_data.generate_fasta_files(use_seq_max_len=True)

Created /home/vinod/.peptide/datasets/amp/fasta/amp_train_seqlen_150.fasta with 3234 sequence records
Created /home/vinod/.peptide/datasets/amp/fasta/amp_test_seqlen_150.fasta with 808 sequence records

dna_data.generate_fasta_files()

Created /home/vinod/.peptide/datasets/dna/fasta/dna_train.fasta with 14189 sequence records
Created /home/vinod/.peptide/datasets/dna/fasta/dna_test.fasta with 2272 sequence records

dna_data.generate_fasta_files(use_seq_max_len=True)

Created /home/vinod/.peptide/datasets/dna/fasta/dna_train_seqlen_300.fasta with 14189 sequence records
Created /home/vinod/.peptide/datasets/dna/fasta/dna_test_seqlen_300.fasta with 2272 sequence records

Generate Embeddings From Pretrained Models

Steps for generating LSTM embeddings in bulk from fasta files detailed here.
Steps for generating Transformer embeddings in bulk from fasta files detailed here.

Load Embeddings

The following is an example of loading LSTM embeddings.

X_train, y_train, X_test, y_test = acp_data.get_lstm_emb(
    "acp_avgpool_train.h5", "acp_avgpool_test.h5"
)

X_train.shape, y_train.shape, X_test.shape, y_test.shape

((1378, 6165), (1378,), (344, 6165), (344,))

And the following is an example of loading Transformer embeddings.

X_train, y_train, X_test, y_test = acp_data.get_transformer_emb(
    "acp_train.fasta", "acp_test.fasta"
)

X_train.shape, y_train.shape, X_test.shape, y_test.shape

((1378, 1280), (1378,), (344, 1280), (344,))

	sequences	label
0	RRWWRRWRRW	0
1	GWKSVFRKAKKVGKTVGGLALDHYLG	0
2	ALWKTMLKKLGTMALHAGKAALGAAADTISQGTQ	1
3	GLFDVIKKVAAVIGGL	1
4	VAKLLAKLAKKVL	1

	PDBs_code	SequenceID	label
0	AP02484	GMASKAGSVLGKITKIALGAL	1
1	AP02630	NIGLFTSTCFSSQCFSSKCFTDTCFSSNCFTGRHQCGYTHGSC	1
2	AP01427	GAIKDALKGAAKTVAVELLKKAQCKLEKTC	1
3	AP02983	FFGRLKAVFRGARQGWKEHRY	1
4	AP01815	DFGCARGMIFVCMRRCARMYPGSTGYCQGFRCMCDTMIPIRRPPFIMG	1

	code	sequence	label	origin
0	Q6A8L0	MSGHSKWATTKHKKAAIDAKRGKLFARLIKNIEVAARLGGGDPSGN...	1	https://github.com/hfuulgb/PDB-Fusion/tree/mai...
1	Q7V7T9	MIGWLQGQKVEAWQQGTRQGVVLACAGVGYEVQIAPRHLSEMEHGQ...	1	https://github.com/hfuulgb/PDB-Fusion/tree/mai...
2	Q9ZUP2	MARILRNVYSLRSSLFSSELLRRSVVGTSFQLRGFAAKAKKKSKSD...	1	https://github.com/hfuulgb/PDB-Fusion/tree/mai...
3	Q2JVG1	MKCPRCGKQEIRVLESRSAEGGQSVRRRRECMSCGYRFTTYERIEF...	1	https://github.com/hfuulgb/PDB-Fusion/tree/mai...
4	Q9K4Q3	MTKADIIEGVYEKVGFSKKESAEIVELVFDTLKETLERGDKIKISG...	1	https://github.com/hfuulgb/PDB-Fusion/tree/mai...

	code	sequence	label	origin
0	P27204\|1	AKKRSRSRKRSASRKRSRSRKRSASKKSSKKHVRKALAAGMKNHLL...	1	https://github.com/hfuulgb/PDB-Fusion/tree/mai...
1	P53528\|1	MVMVVNPLTAGLDDEQREAVLAPRGPVCVLAGAGTGKTRTITHRIA...	1	https://github.com/hfuulgb/PDB-Fusion/tree/mai...
2	P52684\|1	MKDDINQEITFRKLSVFMMFMAKGNIARTAEAMKLSSVSVHRALHT...	1	https://github.com/hfuulgb/PDB-Fusion/tree/mai...
3	P10961\|1	MNNAANTGTTNESNVSDAPRIEPLPSLNDDDIEKILQPNDIFTTDR...	1	https://github.com/hfuulgb/PDB-Fusion/tree/mai...
4	P06023\|1	MAKPAKRIKSAAAAYVPQNRDAVITDIKRIGDLQREASRLETEMND...	1	https://github.com/hfuulgb/PDB-Fusion/tree/mai...

	Type	Default	Details
`name`	`str`		name of the dataset (e.g. 'acp' or 'amp' or 'dna')
`location`	`str`		location of the raw dataset
`max_seq_len`	`int`	`None`	amino acid sequences will be truncated at this max length

Data

Protein, Peptides, Amino Acids - Basics

Analyze Data

Anti Cancer Peptide Dataset (ACP)

Antimicrobial Peptide Dataset (AMP)

DNA-Binding Protein Dataset

Dataset Classes

`class` `ProteinDataset`[source]

`ProteinDataset.clean_data`[source]

`ProteinDataset.extract_features_labels`[source]

`ProteinDataset.generate_fasta_files`[source]

`ProteinDataset.get_lstm_emb`[source]

`ProteinDataset.get_transformer_emb`[source]

`class` `ACPDataset`[source]

`ACPDataset.clean_data`[source]

`class` `AMPDataset`[source]

`AMPDataset.clean_data`[source]

`class` `DNABindDataset`[source]

`DNABindDataset.clean_data`[source]

Exploratory Data Analysis

`plot_seqlen_dist`[source]

`plot_AA_dist`[source]

ACP

AMP

DNA Binding

Generate fasta Files

Generate Embeddings From Pretrained Models

Load Embeddings

	sequences	label
0	FLPLLLSALPSFLCLVFKKC	0
1	DKLIGSCVWLAVNYTSNCNAECKRRGYKGGHCGSFLNVNCWCET	0
2	AVKDTYSCFIMRGKCRHECHDFEKPIGFCTKLNANCYM	0
3	GLPTCGETCFGGTCNTPGCTCDPWPVCTHN	1
4	ENCGRQAG	0

	Type	Default	Details
`out_dir`	`str`	`None`	output directory, defaults to 'dataset_location/dataset_name/fasta'
`use_seq_max_len`	`bool`	`False`	if True, uses truncated amino acid sequences, else full sequence

	Type	Default	Details
`train_h5_file`	`str`		train h5 filename
`test_h5_file`	`str`		test h5 filename
`h5_location`	`str`	`None`	LSTM embedding directory - defaults to 'dataset_location/dataset_name/lstm'

	Type	Default	Details
`train_fasta_file`	`str`		train fasta filename
`test_fasta_file`	`str`		test fasta filename
`fasta_location`	`str`	`None`	fasta directory - defaults to 'dataset_location/dataset_name/fasta'
`emb_location`	`str`	`None`	ESM embedding directory - defaults to 'dataset_location/dataset_name/transformer'
`emb_layer`	`int`	`33`	Layer of ESM Transformer to get embedding from - defaults to final layer (33)

	Type	Default	Details
`location`	`str`		location of raw ACP dataset (must contain dir called 'acp' with train and test CSVs in it)
`max_seq_len`	`int`	`None`	max length, if amino acid sequences are to be truncated

	Type	Default	Details
`location`	`str`		location of raw AMP dataset (must contain dir called 'amp' with 'all_data.csv' in it)
`max_seq_len`	`int`	`150`	max length that amino acid sequences will be truncated at
`test_pct`	`float`	`0.2`	split percent for test data
`seed`	`int`	`1234`	random state for split

	Type	Default	Details
`location`	`str`		location of raw DNA dataset (must contain dir called 'dna' with train and test CSVs in it)
`max_seq_len`	`int`	`300`	max length that amino acid sequences will be truncated at

	Type	Default
`df`	`DataFrame`
`dataset_name`	`str`
`log_scale`	`bool`	`False`
`bins`	`int`	`75`

	sequence	label	length
0	AKKRSRSRKRSASRKRSRSRKRSASKKSSKKHVRKALAAGMKNHLL...	1	101
1	MVMVVNPLTAGLDDEQREAVLAPRGPVCVLAGAGTGKTRTITHRIA...	1	714
2	MKDDINQEITFRKLSVFMMFMAKGNIARTAEAMKLSSVSVHRALHT...	1	308
3	MNNAANTGTTNESNVSDAPRIEPLPSLNDDDIEKILQPNDIFTTDR...	1	833
4	MAKPAKRIKSAAAAYVPQNRDAVITDIKRIGDLQREASRLETEMND...	1	174
...	...	...	...
2267	MNFSRERTITEIQNDYKEQVERQNQLKKRRRKGLYRRLTVFGALVF...	0	125
2268	MVVVDKEIKKGQYYLVNGNVVRVTYVNGFDVYYLILKLHKRMICDR...	0	60
2269	MNPSTHVSSNGPTTPPHGPHTTFLPPTSPAPSTSSVAAATLCSPQR...	0	668
2270	MVRSGKKAVVLAAVAFCATSVVQKSHGFVPSPLRQRAAAAGAAAAS...	0	370
2271	MNYSIRLFKIMGIPIELHITFILFLVVIIGLSIMNNSIFWAVLFIL...	0	339

Data

Protein, Peptides, Amino Acids - Basics

Analyze Data

Anti Cancer Peptide Dataset (ACP)

Antimicrobial Peptide Dataset (AMP)

DNA-Binding Protein Dataset

Dataset Classes

class ProteinDataset[source]

ProteinDataset.clean_data[source]

ProteinDataset.extract_features_labels[source]

ProteinDataset.generate_fasta_files[source]

ProteinDataset.get_lstm_emb[source]

ProteinDataset.get_transformer_emb[source]

class ACPDataset[source]

ACPDataset.clean_data[source]

class AMPDataset[source]

AMPDataset.clean_data[source]

class DNABindDataset[source]

DNABindDataset.clean_data[source]

Exploratory Data Analysis

plot_seqlen_dist[source]

plot_AA_dist[source]

ACP

AMP

DNA Binding

Generate fasta Files

Generate Embeddings From Pretrained Models

Load Embeddings

`class` `ProteinDataset`[source]

`ProteinDataset.clean_data`[source]

`ProteinDataset.extract_features_labels`[source]

`ProteinDataset.generate_fasta_files`[source]

`ProteinDataset.get_lstm_emb`[source]

`ProteinDataset.get_transformer_emb`[source]

`class` `ACPDataset`[source]

`ACPDataset.clean_data`[source]

`class` `AMPDataset`[source]

`AMPDataset.clean_data`[source]

`class` `DNABindDataset`[source]

`DNABindDataset.clean_data`[source]

`plot_seqlen_dist`[source]

`plot_AA_dist`[source]