Learner class for supervised, unsupervised, and semi-supervised learning with protein data.
Learner(X_train:ndarray, y_train:ndarray, X_test:ndarray, y_test:ndarray, ohe:bool=False, scaler:bool=False, pca:bool=True, pca_n_components:int=50, param_grids:list=None)
Class for training and prediction.
| | Type | Default | Details |
|---|---|---|---|
| X_train | ndarray | | X_train numpy ndarray |
| y_train | ndarray | | y_train numpy ndarray |
| X_test | ndarray | | X_test numpy ndarray |
| y_test | ndarray | | y_test numpy ndarray |
| ohe | bool | False | whether to use one-hot encoding |
| scaler | bool | False | whether to use standard scaling |
| pca | bool | True | whether to use principal component analysis |
| pca_n_components | int | 50 | number of PCA components |
| param_grids | list | None | param_grid for grid search; if None, the default grid from utils is used |
Learner.create_pipeline()
Create and return pipeline
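As a rough illustration of how the `ohe`, `scaler`, and `pca` flags could map onto a scikit-learn pipeline, here is a minimal sketch. The helper names `build_pipeline` and `to_dense`, the step names, and the choice of `LogisticRegression` are assumptions made for this example; the actual `create_pipeline` may be structured differently.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler


def to_dense(X):
    """Convert a sparse matrix to a dense array; pass dense input through."""
    return X.toarray() if hasattr(X, "toarray") else np.asarray(X)


def build_pipeline(estimator, ohe=False, scaler=False, pca=True, pca_n_components=50):
    """Hypothetical stand-in for Learner.create_pipeline (not the library's code)."""
    steps = []
    if ohe:
        # OneHotEncoder returns a sparse matrix by default, so densify it
        # before the scaling and PCA steps.
        steps.append(("ohe", OneHotEncoder(handle_unknown="ignore")))
        steps.append(("dense", FunctionTransformer(to_dense)))
    if scaler:
        steps.append(("scaler", StandardScaler()))
    if pca:
        steps.append(("pca", PCA(n_components=pca_n_components)))
    steps.append(("clf", estimator))
    return Pipeline(steps)


pipe = build_pipeline(
    LogisticRegression(max_iter=1000), ohe=True, scaler=True, pca_n_components=5
)
step_names = [name for name, _ in pipe.steps]
print(step_names)
```

Keeping each preprocessing stage optional means the same grid search code can run on raw, encoded, or dimensionality-reduced features without further changes.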
Learner.train(scoring:str='accuracy', cv:int=5, n_jobs:int=-1)
Run GridSearchCV for all models on X_train and y_train of dataset.
Returns:
train_results: list of grid search results
grid_list: list of trained grid objects
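The training loop described above might look roughly like the following sketch, which runs one `GridSearchCV` per parameter grid and collects both return values. The synthetic data, the two grids, and the `clf` step name are assumptions for illustration; the default grids in utils are not shown in this document.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the dataset's X_train and y_train.
X_train, y_train = make_classification(n_samples=120, n_features=20, random_state=0)

pipeline = Pipeline([("pca", PCA(n_components=5)), ("clf", LogisticRegression())])

# Hypothetical grids: each one swaps in a model and tunes one hyperparameter.
param_grids = [
    {"clf": [LogisticRegression(max_iter=1000)], "clf__C": [0.1, 1.0]},
    {"clf": [KNeighborsClassifier()], "clf__n_neighbors": [3, 5]},
]

train_results, grid_list = [], []
for param_grid in param_grids:
    # Learner.train defaults to n_jobs=-1; n_jobs=1 keeps this toy run single-process.
    grid = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=5, n_jobs=1)
    grid.fit(X_train, y_train)
    train_results.append(grid.cv_results_)
    grid_list.append(grid)

for grid in grid_list:
    print(type(grid.best_estimator_.named_steps["clf"]).__name__, grid.best_score_)
```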
Learner.get_top_5_train_results()
Return top 5 results for each grid
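One way to extract the top results from a fitted grid is to sort `cv_results_` by `rank_test_score`. The sketch below does this with a hypothetical helper `top_n` on synthetic data; the format that `get_top_5_train_results` actually returns is not documented here and may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=100, n_features=10, random_state=0)
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 0.1, 1, 10, 100, 1000]},
    scoring="accuracy",
    cv=5,
)
grid.fit(X, y)


def top_n(cv_results, n=5):
    """Return (params, mean_test_score) for the n best-ranked candidates."""
    order = np.argsort(cv_results["rank_test_score"], kind="stable")[:n]
    return [
        (cv_results["params"][i], float(cv_results["mean_test_score"][i]))
        for i in order
    ]


top_5 = top_n(grid.cv_results_)
for params, score in top_5:
    print(params, round(score, 3))
```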
Learner.predict()
Get predictions on the dataset's X_test from best estimators of GridSearchCV.
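Predicting from the best estimator of each fitted grid could be sketched as follows. The synthetic data and the dictionary layout of each `predict_results` entry are assumptions for this example, not the class's documented output.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Stand-ins for the trained grid objects returned by a training step.
grid_list = [
    GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1.0]}, cv=5).fit(
        X_train, y_train
    ),
    GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5]}, cv=5).fit(
        X_train, y_train
    ),
]

predict_results = []
for grid in grid_list:
    # best_estimator_ is the pipeline or model refit on all of X_train.
    y_pred = grid.best_estimator_.predict(X_test)
    predict_results.append(
        {
            "model": type(grid.best_estimator_).__name__,
            "accuracy": accuracy_score(y_test, y_pred),
        }
    )

print(predict_results)
```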
Learner.pick_k(max_clusters:int=10, pca_n_components:int=50)
Plot elbow and silhouette curves and print silhouette scores to help determine the ideal 'k' for KMeans.
| | Type | Default | Details |
|---|---|---|---|
| max_clusters | int | 10 | max number of clusters to try out |
| pca_n_components | int | 50 | number of components to reduce to in PCA |
The pick_k method does the following to help determine the ideal k for KMeans:
- It first concatenates X_train and X_test of this dataset into a single ndarray 'X'
- then encodes X using OneHotEncoder
- then scales X using StandardScaler
- then reduces the dimensionality of X using PCA
- then plots elbow and silhouette plots for X, prints the silhouette scores, and returns the PCA-reduced X.
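The steps above can be sketched as follows. This version computes the inertia (elbow) and silhouette values but leaves out the plotting. The random amino-acid sequences, the reduced component count, and the `n_init` setting are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for protein data: fixed-length sequences of amino-acid letters.
rng = np.random.default_rng(0)
amino_acids = np.array(list("ACDEFGHIKLMNPQRSTVWY"))
X_train = rng.choice(amino_acids, size=(60, 12))
X_test = rng.choice(amino_acids, size=(20, 12))

# 1. Concatenate, 2. one-hot encode, 3. scale, 4. reduce with PCA.
X = np.concatenate([X_train, X_test])
X_ohe = OneHotEncoder().fit_transform(X)
X_ohe = X_ohe.toarray() if hasattr(X_ohe, "toarray") else X_ohe
X_scaled = StandardScaler().fit_transform(X_ohe)
X_pca = PCA(n_components=10).fit_transform(X_scaled)

# 5. Collect the values an elbow plot and a silhouette plot would show.
max_clusters = 6
inertias, silhouettes = {}, {}
for k in range(2, max_clusters + 1):
    km = KMeans(n_clusters=k, n_init=10, random_state=10).fit(X_pca)
    inertias[k] = km.inertia_
    silhouettes[k] = silhouette_score(X_pca, km.labels_)

for k in silhouettes:
    print(k, round(inertias[k], 1), round(silhouettes[k], 3))
```

A common heuristic is to pick the k where the inertia curve bends and the silhouette score is highest, although random sequences like these are not expected to show a clear optimum.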
Learner.analyze_clusters(X_pca:ndarray, k:int, random_state:int=10)
Perform KMeans clustering, print cluster counts and plot clusters from the result.
| | Type | Default | Details |
|---|---|---|---|
| X_pca | ndarray | | dim reduced X numpy ndarray |
| k | int | | the chosen value of k for KMeans |
| random_state | int | 10 | random state for KMeans |
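A minimal sketch of the clustering step, without the plot, might look like this. The blob data stands in for a PCA-reduced X, and `n_init=10` is an assumption; the class's own KMeans settings beyond `k` and `random_state` are not documented here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in for the PCA-reduced X returned by a pick_k-style step.
X_pca, _ = make_blobs(n_samples=90, centers=3, n_features=5, random_state=0)

k = 3
kmeans = KMeans(n_clusters=k, n_init=10, random_state=10)
labels = kmeans.fit_predict(X_pca)

# Count how many samples fall into each cluster.
cluster_ids, counts = np.unique(labels, return_counts=True)
cluster_counts = {int(c): int(n) for c, n in zip(cluster_ids, counts)}
print(cluster_counts)
```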
Learner.run_label_spreading(pca_n_components:int=50)
Run Label Spreading, print report, append results to predict_results.
| | Type | Default | Details |
|---|---|---|---|
| pca_n_components | int | 50 | number of components to reduce to in PCA |
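For orientation, a semi-supervised run with scikit-learn's `LabelSpreading` could be sketched as below. It assumes the test samples are treated as unlabeled (label -1) and scored afterwards against `y_test`; the kernel, neighbor count, and report format are illustrative choices, and the method's actual internals may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import LabelSpreading

X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Reduce all samples together, then hide the test labels by marking them -1.
X_all = np.concatenate([X_train, X_test])
X_all_pca = PCA(n_components=5).fit_transform(X_all)
y_all = np.concatenate([y_train, np.full(len(y_test), -1)])

model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X_all_pca, y_all)

# transduction_ holds the label inferred for every sample, labeled or not.
y_pred = model.transduction_[len(y_train):]
report = classification_report(y_test, y_pred)
print(report)
```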