Learn

Learner class for supervised, unsupervised, and semi-supervised learning with protein data.

`class` `Learner`[source]

Learner(X_train:ndarray, y_train:ndarray, X_test:ndarray, y_test:ndarray, ohe:bool=False, scaler:bool=False, pca:bool=True, pca_n_components:int=50, param_grids:list=None)

Class for training and prediction.

	Type	Default	Details
`X_train`	`ndarray`		X_train numpy ndarray
`y_train`	`ndarray`		y_train numpy ndarray
`X_test`	`ndarray`		X_test numpy ndarray
`y_test`	`ndarray`		y_test numpy ndarray
`ohe`	`bool`	`False`	to use one hot encoding or not
`scaler`	`bool`	`False`	to use standard scaling or not
`pca`	`bool`	`True`	to use principal component analysis or not
`pca_n_components`	`int`	`50`	PCA number of components
`param_grids`	`list`	`None`	param_grid for grid search, if None - gets default grid from utils

`Learner.create_pipeline`[source]

Learner.create_pipeline()

Create and return pipeline
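The source does not show how the pipeline is assembled, so the following is only a minimal sketch of how the `ohe`, `scaler`, `pca`, and `pca_n_components` flags could map onto a scikit-learn `Pipeline`. The helper names `dense_one_hot` and `build_pipeline`, the step names, and the `LogisticRegression` placeholder estimator are assumptions made for illustration, not part of `Learner`.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def dense_one_hot():
    """Hypothetical helper: OneHotEncoder with dense output on old and new scikit-learn."""
    try:
        return OneHotEncoder(handle_unknown="ignore", sparse_output=False)  # scikit-learn >= 1.2
    except TypeError:
        return OneHotEncoder(handle_unknown="ignore", sparse=False)  # older releases


def build_pipeline(ohe=False, scaler=False, pca=True, pca_n_components=50):
    """Hypothetical stand-in for create_pipeline: add one step per enabled flag."""
    steps = []
    if ohe:
        steps.append(("ohe", dense_one_hot()))
    if scaler:
        steps.append(("scaler", StandardScaler()))
    if pca:
        steps.append(("pca", PCA(n_components=pca_n_components)))
    # Placeholder estimator (assumption); a grid search can swap this step out.
    steps.append(("clf", LogisticRegression(max_iter=1000)))
    return Pipeline(steps)


pipe = build_pipeline(ohe=True, scaler=True, pca=True, pca_n_components=5)
step_names = [name for name, _ in pipe.steps]
default_step_names = [name for name, _ in build_pipeline().steps]
print(step_names)
```

Dense one-hot output is used here because `StandardScaler` with mean centering does not accept sparse input.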

`Learner.train`[source]

Learner.train(scoring:str='accuracy', cv:int=5, n_jobs:int=-1)

Run GridSearchCV for all models on the dataset's X_train and y_train.
Returns:
    train_results: list of grid search results
    grid_list: list of trained grid objects

	Type	Default	Details
`scoring`	`str`	`accuracy`	must be one of the scoring values listed at https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
`cv`	`int`	`5`	defaults to 5-fold CV
`n_jobs`	`int`	`-1`	defaults to -1 to use all cores
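As a rough illustration of running one grid search per parameter grid and collecting the two lists described above, here is a sketch under stated assumptions. The toy data, the two example grids, and the pipeline layout are invented for the example; the real default grids come from utils and are not shown in this page.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Toy data standing in for X_train / y_train (assumption).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 8))
y_train = (X_train[:, 0] > 0).astype(int)

pipe = Pipeline([("pca", PCA(n_components=4)), ("clf", LogisticRegression())])

# Hypothetical grids: each dict swaps in one model and tunes one of its parameters.
param_grids = [
    {"clf": [LogisticRegression(max_iter=1000)], "clf__C": [0.1, 1.0]},
    {"clf": [KNeighborsClassifier()], "clf__n_neighbors": [3, 5]},
]

train_results, grid_list = [], []
for grid in param_grids:
    # n_jobs=1 keeps the sketch deterministic; the documented default is -1 (all cores).
    search = GridSearchCV(pipe, grid, scoring="accuracy", cv=5, n_jobs=1)
    search.fit(X_train, y_train)
    train_results.append(search.cv_results_)  # per-candidate scores for this grid
    grid_list.append(search)                  # fitted search object, exposes best_estimator_

for search in grid_list:
    print(type(search.best_estimator_.named_steps["clf"]).__name__, search.best_score_)
```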

`Learner.get_top_5_train_results`[source]

Learner.get_top_5_train_results()

Return top 5 results for each grid

`Learner.predict`[source]

Learner.predict()

Get predictions on the dataset's X_test from best estimators of GridSearchCV.

`Learner.pick_k`[source]

Learner.pick_k(max_clusters:int=10, pca_n_components:int=50)

Plot elbow and silhouette curves and print silhouette scores to help determine the ideal 'k' for KMeans.

	Type	Default	Details
`max_clusters`	`int`	`10`	max number of clusters to try out
`pca_n_components`	`int`	`50`	number of components to reduce to in PCA

The pick_k method does the following to help determine the ideal k for KMeans:

It first concatenates X_train and X_test of this dataset into a single ndarray 'X',
then encodes X using OneHotEncoder,
then scales X using StandardScaler,
then reduces the dimensionality of X using PCA,
then plots elbow and silhouette curves for X, prints the silhouette scores, and returns the PCA-reduced X.
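The steps above can be sketched with plain scikit-learn calls. This is an illustrative approximation, not the method's actual source: the toy categorical data, the small `n_components`, and the dictionaries `inertias` and `silhouettes` are assumptions, and plotting is left out so the example only prints the numbers that the elbow and silhouette curves would show.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy categorical features standing in for encoded protein data (assumption).
rng = np.random.default_rng(0)
X_train = rng.integers(0, 4, size=(80, 6))
X_test = rng.integers(0, 4, size=(20, 6))

# 1. Concatenate train and test rows into a single array.
X = np.concatenate([X_train, X_test], axis=0)
# 2. One-hot encode, converting the sparse result to a dense array.
X = OneHotEncoder().fit_transform(X).toarray()
# 3. Standardize every column.
X = StandardScaler().fit_transform(X)
# 4. Reduce dimensionality with PCA.
X_pca = PCA(n_components=5, random_state=0).fit_transform(X)

# 5. Collect the elbow (inertia) and silhouette values for each candidate k.
max_clusters = 6
inertias, silhouettes = {}, {}
for k in range(2, max_clusters + 1):
    km = KMeans(n_clusters=k, n_init=10, random_state=10).fit(X_pca)
    inertias[k] = km.inertia_
    silhouettes[k] = silhouette_score(X_pca, km.labels_)
    print(f"k={k}: inertia={inertias[k]:.1f}, silhouette={silhouettes[k]:.3f}")
```

A common reading is to look for the k where inertia stops dropping sharply and the silhouette score is comparatively high.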

`Learner.analyze_clusters`[source]

Learner.analyze_clusters(X_pca:ndarray, k:int, random_state:int=10)

Perform KMeans clustering, print cluster counts and plot clusters from the result.

	Type	Default	Details
`X_pca`	`ndarray`		dimensionality-reduced X numpy ndarray
`k`	`int`		the chosen value of k for KMeans
`random_state`	`int`	`10`	random state for KMeans
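For orientation, the clustering and counting part of this method might look roughly like the sketch below. The two synthetic blobs standing in for `X_pca` and the `cluster_counts` dictionary are assumptions for the example, and the plotting step is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs standing in for a PCA-reduced X (assumption).
rng = np.random.default_rng(0)
X_pca = np.vstack([
    rng.normal(loc=-5.0, size=(30, 2)),
    rng.normal(loc=5.0, size=(40, 2)),
])

k = 2
labels = KMeans(n_clusters=k, n_init=10, random_state=10).fit_predict(X_pca)

# Count how many rows were assigned to each cluster.
cluster_ids, counts = np.unique(labels, return_counts=True)
cluster_counts = dict(zip(cluster_ids.tolist(), counts.tolist()))
print(cluster_counts)
```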

`Learner.run_label_spreading`[source]

Learner.run_label_spreading(pca_n_components:int=50)

Run Label Spreading, print report, append results to predict_results.

	Type	Default	Details
`pca_n_components`	`int`	`50`	number of components to reduce to in PCA
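The source does not spell out how the labelled and unlabelled rows are combined, so the sketch below assumes one common setup: the test rows are marked as unlabelled with `-1`, Label Spreading propagates the training labels to them, and the propagated labels are compared against `y_test`. The toy data, the `knn` kernel, and the small `n_components` are illustrative choices only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.semi_supervised import LabelSpreading

# Toy two-class data standing in for the dataset (assumption).
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(-3, 1, (30, 10)), rng.normal(3, 1, (30, 10))])
y_train = np.array([0] * 30 + [1] * 30)
X_test = np.vstack([rng.normal(-3, 1, (10, 10)), rng.normal(3, 1, (10, 10))])
y_test = np.array([0] * 10 + [1] * 10)

# Reduce train and test rows together with PCA.
X = np.concatenate([X_train, X_test], axis=0)
X_pca = PCA(n_components=3, random_state=0).fit_transform(X)

# scikit-learn's semi-supervised estimators treat the label -1 as "unlabelled".
y = np.concatenate([y_train, np.full(len(y_test), -1)])

model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X_pca, y)

# transduction_ holds the label inferred for every row, including the unlabelled ones.
y_pred = model.transduction_[len(y_train):]
print(classification_report(y_test, y_pred))
```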
