Learner class for supervised, unsupervised, and semi-supervised learning with protein data.
 

class Learner[source]

Learner(X_train:ndarray, y_train:ndarray, X_test:ndarray, y_test:ndarray, ohe:bool=False, scaler:bool=False, pca:bool=True, pca_n_components:int=50, param_grids:list=None)

Class for training and prediction.
Parameter         Type     Default  Details
X_train           ndarray           X_train numpy ndarray
y_train           ndarray           y_train numpy ndarray
X_test            ndarray           X_test numpy ndarray
y_test            ndarray           y_test numpy ndarray
ohe               bool     False    whether to use one-hot encoding
scaler            bool     False    whether to use standard scaling
pca               bool     True     whether to use principal component analysis
pca_n_components  int      50       number of PCA components
param_grids       list     None     param_grid for grid search; if None, the default grid from utils is used

Learner.create_pipeline[source]

Learner.create_pipeline()

Create and return pipeline
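
The page does not show what the pipeline contains. The following is a minimal sketch of how the constructor flags (ohe, scaler, pca, pca_n_components) might map onto a scikit-learn Pipeline. The helper names build_pipeline and _to_dense, the step names, and the LogisticRegression final step are assumptions made for illustration, not the library's actual implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler


def _to_dense(matrix):
    # OneHotEncoder returns a sparse matrix by default; densify it for PCA.
    return matrix.toarray() if hasattr(matrix, "toarray") else matrix


def build_pipeline(ohe=False, scaler=False, pca=True, pca_n_components=50):
    """Hypothetical helper: add one preprocessing step per enabled flag."""
    steps = []
    if ohe:
        steps.append(("ohe", OneHotEncoder(handle_unknown="ignore")))
        steps.append(("to_dense", FunctionTransformer(_to_dense)))
    if scaler:
        steps.append(("scaler", StandardScaler()))
    if pca:
        steps.append(("pca", PCA(n_components=pca_n_components)))
    # Placeholder estimator (assumption); a grid search could swap this step.
    steps.append(("clf", LogisticRegression(max_iter=1000)))
    return Pipeline(steps)


# Toy integer-encoded data standing in for protein features.
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 4, size=(40, 6))
y_demo = rng.randint(0, 2, size=40)
pipe = build_pipeline(ohe=True, scaler=True, pca=True, pca_n_components=2)
pipe.fit(X_demo, y_demo)
```

Keeping each flag as its own named step means a parameter grid can address steps by name, for example `pca__n_components`.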

Learner.train[source]

Learner.train(scoring:str='accuracy', cv:int=5, n_jobs:int=-1)

Run GridSearchCV for all models on X_train and y_train of dataset.
Returns:
    train_results: list of grid search results
    grid_list: list of trained grid objects
Parameter  Type  Default   Details
scoring    str   accuracy  must be one of the scoring values listed at https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
cv         int   5         number of cross-validation folds (defaults to 5-fold CV)
n_jobs     int   -1        defaults to -1 to use all cores
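
As a rough illustration of the loop described above, the sketch below runs one GridSearchCV per model and collects the two lists that train is documented to return. The function name train_all and the (estimator, param_grid) pair format are assumptions; the format of the default grids in utils is not shown on this page.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline


def train_all(X_train, y_train, models, scoring="accuracy", cv=5, n_jobs=-1):
    """Hypothetical helper: grid-search every (estimator, param_grid) pair."""
    train_results, grid_list = [], []
    for estimator, param_grid in models:
        pipe = Pipeline([("clf", estimator)])
        grid = GridSearchCV(pipe, param_grid, scoring=scoring, cv=cv, n_jobs=n_jobs)
        grid.fit(X_train, y_train)
        train_results.append(grid.cv_results_)  # per-candidate CV scores
        grid_list.append(grid)                  # fitted search object
    return train_results, grid_list


# Toy data standing in for the protein features.
X_demo, y_demo = make_classification(n_samples=60, n_features=5, random_state=0)
models = [
    (LogisticRegression(max_iter=500), {"clf__C": [0.1, 1.0]}),
    (KNeighborsClassifier(), {"clf__n_neighbors": [3, 5]}),
]
train_results, grid_list = train_all(X_demo, y_demo, models, cv=3, n_jobs=1)
```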

Learner.get_top_5_train_results[source]

Learner.get_top_5_train_results()

Return top 5 results for each grid
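
One way to rank candidates, sketched here under the assumption that each result is a GridSearchCV cv_results_ dictionary, is to sort by rank_test_score. The helper name top_n_results and the toy dictionary are illustrative only.

```python
import numpy as np


def top_n_results(cv_results, n=5):
    """Hypothetical helper: return (params, mean_test_score) for the n best candidates."""
    order = np.argsort(cv_results["rank_test_score"])[:n]  # rank 1 is the best
    return [
        (cv_results["params"][i], float(cv_results["mean_test_score"][i]))
        for i in order
    ]


# Toy stand-in for GridSearchCV.cv_results_ with seven candidates.
toy_cv_results = {
    "rank_test_score": np.array([3, 1, 2, 6, 5, 4, 7]),
    "mean_test_score": np.array([0.7, 0.9, 0.8, 0.4, 0.5, 0.6, 0.3]),
    "params": [{"C": i} for i in range(7)],
}
top5 = top_n_results(toy_cv_results)
```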

Learner.predict[source]

Learner.predict()

Get predictions on the dataset's X_test from best estimators of GridSearchCV.
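
A minimal sketch of this step, assuming a list of fitted GridSearchCV objects is available: each search exposes its refit best model as best_estimator_, which can then predict on the test set. The helper name predict_with_best and the shape of the result dictionaries are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split


def predict_with_best(grid_list, X_test, y_test):
    """Hypothetical helper: score each fitted grid's best estimator on the test set."""
    predict_results = []
    for grid in grid_list:
        best = grid.best_estimator_  # refit on the full training set by GridSearchCV
        y_pred = best.predict(X_test)
        predict_results.append({
            "model": type(best).__name__,
            "accuracy": accuracy_score(y_test, y_pred),
            "y_pred": y_pred,
        })
    return predict_results


# Toy data standing in for the protein features.
X, y = make_classification(n_samples=80, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
grid = GridSearchCV(LogisticRegression(max_iter=500), {"C": [0.1, 1.0]}, cv=3)
grid.fit(X_train, y_train)
predict_results = predict_with_best([grid], X_test, y_test)
```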

Learner.pick_k[source]

Learner.pick_k(max_clusters:int=10, pca_n_components:int=50)

Plot elbow and silhouette curves and print silhouette scores to help determine the ideal 'k' for KMeans.
Parameter         Type  Default  Details
max_clusters      int   10       max number of clusters to try out
pca_n_components  int   50       number of components to reduce to in PCA

The pick_k method does the following to help determine the ideal k for KMeans:

  • It first concatenates X_train and X_test of this dataset into a single ndarray 'X'
  • then encodes X using OneHotEncoder
  • then scales X using StandardScaler
  • then dimensionality reduces X using PCA
  • then plots elbow & silhouette plots for X and prints silhouette scores, and returns the PCA-reduced X.
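
The steps above can be sketched roughly as follows. Plotting is left out to keep the example short: the function returns the inertia values (the elbow curve's y-axis) and the silhouette scores instead. The function name pick_k_scores, the random_state argument, and the choice to start at k=2 are assumptions, not the method's actual code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def pick_k_scores(X_train, X_test, max_clusters=10, pca_n_components=50, random_state=10):
    """Hypothetical helper: preprocess X, then score KMeans for k = 2..max_clusters."""
    X = np.concatenate([X_train, X_test], axis=0)                       # 1. concatenate
    X = OneHotEncoder(handle_unknown="ignore").fit_transform(X).toarray()  # 2. encode
    X = StandardScaler().fit_transform(X)                                # 3. scale
    X_pca = PCA(n_components=pca_n_components).fit_transform(X)          # 4. reduce
    inertias, silhouettes = {}, {}
    for k in range(2, max_clusters + 1):  # silhouette needs at least 2 clusters
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X_pca)
        inertias[k] = km.inertia_
        silhouettes[k] = silhouette_score(X_pca, km.labels_)
    return X_pca, inertias, silhouettes


# Toy integer-encoded data standing in for protein features.
rng = np.random.RandomState(0)
X_train_demo = rng.randint(0, 4, size=(30, 5))
X_test_demo = rng.randint(0, 4, size=(10, 5))
X_pca, inertias, silhouettes = pick_k_scores(
    X_train_demo, X_test_demo, max_clusters=4, pca_n_components=3
)
```

A common heuristic is to pick the k where the inertia curve bends and the silhouette score is near its peak.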

Learner.analyze_clusters[source]

Learner.analyze_clusters(X_pca:ndarray, k:int, random_state:int=10)

Perform KMeans clustering, print cluster counts and plot clusters from the result.
Parameter     Type     Default  Details
X_pca         ndarray           dim reduced X numpy ndarray
k             int               the chosen value of k for KMeans
random_state  int      10       random state for KMeans
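
As an illustration of the clustering and counting part (the plot is omitted), the sketch below fits KMeans with the chosen k and tallies how many samples fall into each cluster. The helper name cluster_counts and the make_blobs toy data are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def cluster_counts(X_pca, k, random_state=10):
    """Hypothetical helper: fit KMeans and count the samples in each cluster."""
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state)
    labels = km.fit_predict(X_pca)
    unique, counts = np.unique(labels, return_counts=True)
    return labels, {int(u): int(c) for u, c in zip(unique, counts)}


# Toy 2-D data standing in for the PCA-reduced X.
X_demo, _ = make_blobs(n_samples=60, centers=3, random_state=0)
labels, counts = cluster_counts(X_demo, k=3)
```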

Learner.run_label_spreading[source]

Learner.run_label_spreading(pca_n_components:int=50)

Run Label Spreading, print report, append results to predict_results.
Parameter         Type  Default  Details
pca_n_components  int   50       number of components to reduce to in PCA
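
A minimal sketch of semi-supervised label spreading with scikit-learn follows. It assumes the training rows are treated as labeled and the test rows as unlabeled (marked with -1), which this page does not confirm. The function name label_spreading_report, the knn kernel, and n_neighbors=7 are also illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.semi_supervised import LabelSpreading


def label_spreading_report(X_train, y_train, X_test, y_test, pca_n_components=50):
    """Hypothetical helper: spread training labels onto the test rows and report."""
    X = np.concatenate([X_train, X_test], axis=0)
    X_pca = PCA(n_components=pca_n_components).fit_transform(X)
    # scikit-learn marks unlabeled samples with -1.
    y = np.concatenate([y_train, np.full(len(y_test), -1)])
    model = LabelSpreading(kernel="knn", n_neighbors=7)
    model.fit(X_pca, y)
    y_pred = model.transduction_[len(y_train):]  # labels inferred for the test rows
    report = classification_report(y_test, y_pred, zero_division=0)
    return y_pred, report


# Toy data standing in for the protein features.
X_demo, y_demo = make_classification(n_samples=80, n_features=6, random_state=0)
y_pred, report = label_spreading_report(
    X_demo[:60], y_demo[:60], X_demo[60:], y_demo[60:], pca_n_components=3
)
print(report)
```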