Welcome to sbr’s documentation!

sbr is a set of functions and classes for modelling gene expression data with TensorFlow.

Full Reference

sbr.compile.one_layer_multicategorical(input_size=None, output_size=None, learning_rate: float = 0.0001, dim: int = 1000, specificityAtSensitivityThreshold: float = 0.5, sensitivityAtSpecificityThreshold: float = 0.5, kernel_initializer=tensorflow.keras.initializers.HeNormal, bias_initializer=tensorflow.zeros_initializer, output_activation: str = 'softmax', isMultilabel: bool = True, verbose: bool = True)

Compile a single layer multicategorical model.

Use sbr.visualize.plot_loss_curve to view the metrics after fitting.

Parameters
  • input_size – Usually x_train.shape[1]. Not required to compile the model, but required for calling model.summary()

  • output_size – Number of classes in the one-hot-encoded target vector; usually y_train.shape[1]

  • learning_rate – Initial learning rate. Expect it to be reduced during training when the validation loss stops improving (see lr_patience and lr_factor in sbr.fit.multicategorical_model)

  • dim – Number of nodes in the hidden layer. Something half-way between input_size and output_size is a good choice, but if input_size is very large, the number may need to be smaller to reduce the number of trainable parameters and avoid over-fitting.

  • specificityAtSensitivityThreshold – At this level of sensitivity (i.e., detecting at least this fraction of the true positives), report the specificity (i.e., the fraction of true negatives that are correctly left unflagged). This is a bit trickier for multi-class problems; see this blog article on analyticsvidhya.com

  • sensitivityAtSpecificityThreshold – Same as above with the roles swapped: at this level of specificity, report the sensitivity.

  • kernel_initializer – HeNormal initializer forces diversity of outcomes between trainings

  • bias_initializer – initializer for the biases

  • output_activation – Use softmax for multicategorical, one-hot encoded

  • isMultilabel – Should always be True for multicategorical models

  • verbose – If True, print model summary. Set to False if input_size = None to avoid error
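The two threshold metrics described above can be illustrated for a single class with a small pure-Python sketch. This is not how sbr or Keras computes them (Keras approximates the result over a fixed grid of thresholds); the function name and the toy data are hypothetical and only show the definition.

```python
def specificity_at_sensitivity(labels, scores, min_sensitivity):
    """Best specificity over all score cut-offs whose sensitivity is at least min_sensitivity."""
    positives = sum(labels)
    negatives = len(labels) - positives
    best = 0.0
    for cutoff in sorted(set(scores)):
        # Predict "positive" whenever the score reaches the cut-off.
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= cutoff)
        sensitivity = tp / positives                # fraction of true positives detected
        specificity = (negatives - fp) / negatives  # fraction of true negatives left unflagged
        if sensitivity >= min_sensitivity:
            best = max(best, specificity)
    return best


# Toy binary problem (hypothetical data): 3 positives, 3 negatives.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
at_half = specificity_at_sensitivity(labels, scores, 0.5)
at_full = specificity_at_sensitivity(labels, scores, 1.0)
print(at_half, at_full)
```

Demanding a higher sensitivity forces a lower cut-off, which lets more negatives through and lowers the reported specificity.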

Returns

model of type tf.keras.Model

Example usage:

>>> model = compile.one_layer_multicategorical(input_size=x_train.shape[1],
                                     output_size=y_train.shape[1],
                                     output_activation='softmax',
                                     learning_rate=0.0001,
                                     isMultilabel=True,
                                     dim=1000,
                                     specificityAtSensitivityThreshold=0.50,
                                     sensitivityAtSpecificityThreshold=0.50,
                                     verbose=True)
  Model: "sequential"
  _________________________________________________________________
  Layer (type)                 Output Shape              Param #
  =================================================================
  Input_BAD (BADBlock)         (None, 1000)              18968000
  _________________________________________________________________
  output (Dense)               (None, 26)                26026
  =================================================================
  Total params: 18,994,026
  Trainable params: 18,992,026
  Non-trainable params: 2,000
  _________________________________________________________________
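The parameter counts in the summary above can be reproduced by hand. The sketch below assumes 18,963 input features (the input_dim used in the BADBlock example), a hidden block made of a dense layer plus a standard batch normalization layer (four values per unit, two of them non-trainable), and 26 output classes. These are inferences from the printed numbers, not a statement about the sbr source.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


input_size, dim, n_classes = 18963, 1000, 26  # assumed sizes, see lead-in

# Batch normalization stores gamma, beta, moving mean and moving variance per unit.
batch_norm_params = 4 * dim
hidden = dense_params(input_size, dim) + batch_norm_params
output = dense_params(dim, n_classes)

# The moving mean and moving variance are updated during training but not by gradient descent.
non_trainable = 2 * dim
total = hidden + output
trainable = total - non_trainable
print(hidden, output, total, trainable, non_trainable)
```

Almost all parameters sit in the first dense layer, which is why reducing dim is the main lever for shrinking the model.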
class sbr.layers.BADBlock(*args: Any, **kwargs: Any)

Inherits from keras.layers.Dense. A Dense layer followed by Batch normalization, Activation, and Dropout. When the commonly used input_shape kwarg is passed, a Keras input layer is created and inserted before this layer, so an InputLayer does not have to be defined explicitly.

This is a very good layer to use for gene expression data to increase stability and reduce trainable parameters.
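As a rough illustration of the Dense, Batch, Activation, Dropout order, the following pure-Python sketch runs one sample through such a block at inference time. It assumes "Batch" means standard batch normalization using stored moving statistics and a ReLU activation; the function and argument names are hypothetical and this is not the BADBlock implementation.

```python
import math


def bad_block_forward(x, weights, bias, gamma, beta, moving_mean, moving_var, eps=1e-3):
    """Inference-time pass for one sample: Dense -> BatchNorm -> ReLU -> Dropout."""
    # Dense: one weighted sum per unit (weights[j] holds the input weights of unit j).
    z = [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b for w, b in zip(weights, bias)]
    # Batch normalization: rescale each unit with its stored moving statistics.
    z = [
        g * (v - m) / math.sqrt(var + eps) + bt
        for v, g, bt, m, var in zip(z, gamma, beta, moving_mean, moving_var)
    ]
    # Activation: ReLU clips negative values to zero.
    z = [max(0.0, v) for v in z]
    # Dropout only drops units during training; at inference it passes values through.
    return z


out = bad_block_forward(
    x=[1.0, 2.0],
    weights=[[1.0, 1.0], [-1.0, 0.0]],
    bias=[0.0, 0.0],
    gamma=[1.0, 1.0],
    beta=[0.0, 0.0],
    moving_mean=[0.0, 0.0],
    moving_var=[1.0, 1.0],
    eps=0.0,
)
print(out)
```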

Example 1:

Recreate this layer from its config:

>>> layer = BADBlock(units=1000)
>>> config = layer.get_config()
>>> new_layer = BADBlock.from_config(config)

Example 2:

Use in a model:

>>> import tensorflow as tf
>>> from tensorflow.keras.layers import Dense
>>> from tensorflow.keras.models import Sequential
>>> from sbr.layers import BADBlock
>>> model = Sequential()
>>> model.add(BADBlock(units=1000, input_dim = 18963, activation='relu', dropout_rate=0.50, name="BAD_1"))
>>> model.add(Dense(26, activation="softmax"))
>>> model.summary()
>>> model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy','mse'])
sbr.evaluate.compare_predictions(model, x_test, y_test, class_names=None, verbose=True)

Predicts y_test from x_test using model, then compares predictions with truth.

Parameters
  • model – the model to use model.predict

  • x_test – test features

  • y_test – targets

  • class_names – an ordered list of class name strings that map to the np.argmax(y_test,axis=1) indices in y_test. If None, class indices will be reported instead of string names.

  • verbose – if verbose, pairs are printed out (good if there aren’t a lot of mislabeled predictions)

Returns

(y_pred, pairs)

  • y_pred: the predicted outcomes from x_test

  • pairs: list of (<truth>, <false-prediction>) pairs of class names
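The comparison itself amounts to taking the arg-max of each true and predicted row and keeping the rows where they differ. A minimal sketch, assuming one-hot targets and per-class scores as plain lists (the helper names are hypothetical, not part of sbr):

```python
def argmax(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=row.__getitem__)


def mismatched_pairs(y_true, y_pred, class_names=None):
    """Return (<truth>, <false-prediction>) for every misclassified sample."""
    pairs = []
    for true_row, pred_row in zip(y_true, y_pred):
        t, p = argmax(true_row), argmax(pred_row)
        if t != p:
            # Fall back to class indices when no names are given.
            pairs.append((class_names[t], class_names[p]) if class_names else (t, p))
    return pairs


y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
y_pred = [[0.8, 0.1, 0.1], [0.2, 0.3, 0.5], [0.1, 0.1, 0.8]]
named = mismatched_pairs(y_true, y_pred, class_names=["Blood", "Heart", "Lung"])
indexed = mismatched_pairs(y_true, y_pred)
print(named, indexed)
```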

Example usage:
>>> y_pred, pairs = compare_predictions(model=model,
                                        x_test=x_test, y_test=y_test,
                                        class_names=class_names,
                                        verbose = True)
Number of test samples: 256
Mis-classifications:
(<truth>,<false-prediction>)
[('Esophagus', 'Blood Vessel'), ('Blood Vessel', 'Heart'), ('Adipose Tissue', 'Breast'), ('Salivary Gland', 'Esophagus')]
sbr.evaluate.mislabeled_pair_counts(model, X, y, class_names, sample_ids=None, batch_size=1500, verbose=False)

For multicategorical models: creates a table of observed and predicted class names for the mispredicted observations. This tends to use a lot of memory on repeated runs in a Jupyter notebook with TensorFlow 2.6; the kernel may need to be restarted before a second run. If resources continue to be a problem after restarting the kernel, reduce the batch_size.

Assumes y_pred, y_obs are one-hot encoded and class_names matches the index predictions returned from np.argmax(y_pred)

Parameters
  • model – used for model.predict

  • X – feature values

  • y – one-hot encoded true labels

  • class_names – ordered list of class_names

  • sample_ids – pass in this Series object to get back a table of pairs with their sample_ids

  • batch_size – number of samples to process in each step (to keep from swamping memory)

  • verbose – helps with debugging; messages each step/batch

Returns

(pairs_counts, pair_id_map)

  • pairs_counts: Table with compound index ‘observed’, ‘predicted’ and one column, “counts”, with the count of all the samples in that observed/predicted mislabeled pair.

  • pair_id_map: None if sample_ids wasn’t passed in, otherwise returns a table with columns observed, predicted, sample_id
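The batching and counting described above can be sketched without TensorFlow or pandas. The predict argument stands in for model.predict, and the function returns a Counter and a list of tuples rather than tables; all names here are hypothetical and the real function may differ.

```python
from collections import Counter


def argmax(row):
    return max(range(len(row)), key=row.__getitem__)


def count_mislabeled(predict, X, y, class_names, sample_ids=None, batch_size=1500):
    """Count (observed, predicted) pairs for misclassified samples, one batch at a time."""
    pair_counts = Counter()
    pair_ids = [] if sample_ids is not None else None
    for start in range(0, len(X), batch_size):
        stop = start + batch_size
        y_pred = predict(X[start:stop])  # only one batch is held in memory
        for i, (obs_row, pred_row) in enumerate(zip(y[start:stop], y_pred), start=start):
            observed = class_names[argmax(obs_row)]
            predicted = class_names[argmax(pred_row)]
            if observed != predicted:
                pair_counts[(observed, predicted)] += 1
                if pair_ids is not None:
                    pair_ids.append((observed, predicted, sample_ids[i]))
    return pair_counts, pair_ids


def fake_predict(batch):
    """Toy stand-in for model.predict: the features already are class scores."""
    return batch


X = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]]
y = [[1, 0], [1, 0], [0, 1], [0, 1]]
pair_counts, pair_ids = count_mislabeled(
    fake_predict, X, y, ["A", "B"], sample_ids=["s1", "s2", "s3", "s4"], batch_size=3
)
print(pair_counts, pair_ids)
```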

Example Usage: Get the mislabeled counts
>>> mislabeled_counts, mislabeled = mislabeled_pair_counts(model=model, X=X, y=y, class_names=class_names,
                                                           sample_ids = pd.Series(label_df["sample_id"]),
                                                           batch_size=500)
>>> mislabeled_counts
_images/mislabeled_counts.png
Example Usage: Get the mislabeled samples
>>> m=mislabeled.reset_index()
>>> m[m['observed']=="Lung"]
_images/mislabeled_lung.png
sbr.evaluate.training_report(model, x_test, y_test, sensitivityAtSpecificityThreshold=None, specificityAtSensitivityThreshold=None, verbose=True)

Calls model.evaluate(x_test,y_test) and, if verbose, reports on the performance, then returns a performance object like the one returned by model.evaluate.

Parameters
  • model – the fitted model on which model.evaluate is called

  • x_test – features

  • y_test – targets

  • verbose – if True, report to stdout

  • sensitivityAtSpecificityThreshold – If not None, and verbose, and this metric was captured in model.fit, report it to stdout

  • specificityAtSensitivityThreshold – see above

Returns

A performance object

Example Usage:
>>> performance = training_report(model, x_test, y_test,
                              sensitivityAtSpecificityThreshold=sensitivityAtSpecificityThreshold,
                              specificityAtSensitivityThreshold=specificityAtSensitivityThreshold,
                              verbose=True)
Performance:
Performance details:
  loss:0.07804308831691742
  accuracy:0.984375
  mse:0.0009617832256481051
  precision:0.984375
  recall:0.984375
  auc:0.9988833665847778
  SpecificityAtSensitivity:0.9998437762260437
  SensitivityAtSpecificity:0.99609375
  fp:4.0
  fn:4.0
  tp:252.0
  tn:6396.0
Figure(500x500)
Number of training samples: 2080
Number of validation samples: 256
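The counts in the report above are consistent with each other, which can be checked by hand. The sketch assumes that tp, fp, fn and tn are accumulated over every (sample, class) cell of the one-hot targets and that each sample has exactly one true class; under those assumptions the counts imply 256 test samples and 26 classes.

```python
tp, fp, fn, tn = 252, 4, 4, 6396  # values copied from the report above

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of true positives that were found

# One true class per sample, so tp + fn equals the number of samples.
n_samples = tp + fn
# Every sample contributes one cell per class to the four counts.
n_classes = (tp + fp + fn + tn) // n_samples
print(precision, recall, n_samples, n_classes)
```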

sbr.fit.multicategorical_model(model, model_folder, x_train, y_train, x_validation, y_validation, epochs=200, patience=4, lr_patience=2, lr_factor=0.1, batch_size=32, shuffle_value=100, initial_epoch=0, train_verbose=1, checkpoint_verbose=1)

Fits the given model with the given hyperparameters and multi-categorical data, after computing class weights and shuffling the data. Writes checkpoint and final model weights to model_folder. Look under variables/variables.* for weights.
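The source does not say how the class weights are computed. One common choice, shown here purely as an assumption, is the "balanced" heuristic n_samples / (n_classes * class_count), which gives rare classes a larger weight. The function name is hypothetical.

```python
from collections import Counter


def balanced_class_weights(y_one_hot):
    """Map class index -> weight, inversely proportional to class frequency."""
    labels = [max(range(len(row)), key=row.__getitem__) for row in y_one_hot]
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in sorted(counts.items())}


# Toy targets: three samples of class 0, one sample of class 1.
y_toy = [[1, 0], [1, 0], [1, 0], [0, 1]]
weights = balanced_class_weights(y_toy)
print(weights)
```

A dictionary of this shape is what the class_weight argument of Keras model.fit accepts.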

Assumptions

  • Model has been compiled and saved to f"{model_path}.h5" (e.g., data/model/gtex/manual/gtex_model.h5)

  • Targets are one-hot encoded

  • Features have been normalized

Tested with tensorflow v2.6.2, keras 2.6.0

Parameters
  • model – a compiled model

  • model_folder – writable folder to store the checkpoint and final model weights

  • x_train – training features, see sbr.split for help

  • y_train – training targets, see above

  • x_validation – validation features, see above

  • y_validation – validation targets, see above

  • epochs [200] – Number of epochs to train

  • patience [4] – Number of epochs with no improvement after which training will be stopped.

  • lr_patience [2] – Number of epochs with no improvement after which learning rate will be reduced.

  • lr_factor [0.1] – Factor by which the learning rate will be reduced. new_lr = lr * factor.

  • batch_size [32] – probably don’t change this

  • shuffle_value [100]

  • initial_epoch [0] – use this if you want to resume training at a particular epoch

  • train_verbose [1] – amount of information to print on each epoch. 0: silent, 1: animated progress bar, 2: one line per epoch. For example:

    • 0: <silent>

    • 1:

    [==================]
    Epoch 00015: val_loss improved from 0.06645 to 0.06611, saving model to
         data/model/gtex
    INFO:tensorflow:Assets written to: data/model/gtex/assets

    • 2:

    Epoch 1/10

  • checkpoint_verbose [1] – amount of information to print on each epoch about the checkpoint. 0: silent.
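How patience, lr_patience and lr_factor interact can be seen in a small simulation over a list of validation losses. This is a simplified sketch of plateau-based learning-rate reduction and early stopping, not the sbr or Keras implementation; the function name and the loss values are hypothetical.

```python
def simulate_training(val_losses, lr=1e-4, patience=4, lr_patience=2, lr_factor=0.1):
    """Return (last_epoch_run, final_lr) for a sequence of validation losses."""
    best = float("inf")
    since_best = 0    # epochs without improvement, used for early stopping
    since_lr_cut = 0  # epochs without improvement since the last learning-rate cut
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, since_best, since_lr_cut = loss, 0, 0
            continue
        since_best += 1
        since_lr_cut += 1
        if since_lr_cut >= lr_patience:
            lr *= lr_factor  # new_lr = lr * factor
            since_lr_cut = 0
        if since_best >= patience:
            return epoch, lr  # training stops early
    return len(val_losses), lr


# The loss improves for two epochs and then plateaus.
losses = [0.50, 0.40, 0.41, 0.42, 0.43, 0.44, 0.30]
stopped_at, final_lr = simulate_training(losses)
print(stopped_at, final_lr)
```

With the defaults, the learning rate is cut twice before training stops, so the last epoch in the list is never reached.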
    
Returns

history

A History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable). Use print(history.history.keys()) to list the recorded metrics and print(history.history['val_loss']) to print the validation loss.

Example Usage:
>>> from sbr import fit
>>> history=fit.multicategorical_model(model=model,
                             model_folder ='data/model/gtex',
                             x_train=x_train, y_train=y_train,
                             x_validation=x_validation, y_validation=y_validation,
                             epochs = 200,
                             patience = 4,
                             lr_patience = 2,
                             checkpoint_verbose=1,
                             train_verbose=0)
Example Usage: Reload with:
>>> model = load_model(f"{model_path}")
>>> model.load_weights(f"{model_folder}")
sbr.model.save_architecture(model, model_path: Optional[str] = None, file_name='model.h5', input_size=None, verbose=1)

Saves the given model to the given path and name. If possible, train the model and run this function in the same notebook, so that the trained model is still resident in memory and the save can be retried if it fails for some reason.

Note

Custom layer BADBlock will be loaded as part of the configuration.

Warning

THIS WILL OVER-WRITE ANY EXISTING MODEL.

Parameters
  • model – model object for calling model.save

  • model_path – file path where model is to be written

  • file_name – name of the file, h5 format. Any existing file will be over-written.

  • input_size – if not None, attempts to check that predictions from the saved model are close to those of the original model

  • verbose – 0: debug, 1:print out model summary. This may throw an error if model wasn’t compiled with a known input size

Returns

True on success, False otherwise. Check the return to try again if it fails while model is still resident in memory.

Example usage:

>>> success = sbr.model.save_architecture(model, model_path="data/model/manual", file_name="model.h5", verbose=1)
True
sbr.preprocessing.dataset.multicategorical_split(X, y, sample_count_threshold=100, test_fraction=0.1, validation_fraction=0.1, verbose=True, batch_size=32, seed=None, shuffle=True)

Shuffles and splits X, y into train, validation, and test sets; rounds each dataset size down to a multiple of batch_size.

Final dataset size is (sample_count_threshold * <number of classes>)

See also: sbr.preprocessing.gtex.dataset_setup

Parameters
  • X – Features

  • y – multicategorical targets (more than one column)

  • sample_count_threshold – use about this many samples from each class

  • seed[None] – set this to make function deterministic/repeatable

  • shuffle[True] – probably don’t touch this. Shuffling the data really helps down-stream model training.

Returns

(x_train, y_train, x_val, y_val, x_test, y_test)
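The shuffle, split and rounding steps can be sketched on sample indices alone. This is an illustration under assumptions (the fractions are applied to the full sample count and each subset is rounded down to whole batches); it ignores the per-class sample_count_threshold, and the function name is hypothetical.

```python
import random


def shuffle_split(n_samples, test_fraction=0.1, validation_fraction=0.1,
                  batch_size=32, seed=None):
    """Return (train, validation, test) index lists, each a whole number of batches."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # a fixed seed makes the split repeatable

    def whole_batches(n):
        return n - n % batch_size

    n_test = whole_batches(int(n_samples * test_fraction))
    n_val = whole_batches(int(n_samples * validation_fraction))
    n_train = whole_batches(n_samples - n_test - n_val)

    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:n_test + n_val + n_train]
    return train, val, test


train_idx, val_idx, test_idx = shuffle_split(1000, seed=42)
print(len(train_idx), len(val_idx), len(test_idx))
```

The returned indices can then be used to select rows of X and y, so that features and targets stay aligned.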

sbr.preprocessing.dataset.trim_list_size_to_batch_size_factor(batch_size=32, trim_list=None)

Trims the given list of multicategorical arrays down to a multiple of the given batch_size. This can avoid errors during training when the dataset is very large, a small amount of data loss isn’t a concern, and retaining a specific batch_size (e.g., 32) is preferred.

Parameters
  • trim_list – a list of arrays to be trimmed

  • batch_size[32] – probably leave this alone

Returns

the same trim_list, but trimmed
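The trimming is a slice that drops the leftover samples of the final, incomplete batch. A minimal sketch with a hypothetical function name; the same slicing works on NumPy arrays:

```python
def trim_to_whole_batches(arrays, batch_size=32):
    """Cut each array down to the largest length that is a multiple of batch_size."""
    trimmed = []
    for a in arrays:
        keep = len(a) - len(a) % batch_size  # drop the incomplete last batch
        trimmed.append(a[:keep])
    return trimmed


x = list(range(70))
y = list(range(70))
x_trimmed, y_trimmed = trim_to_whole_batches([x, y])
print(len(x_trimmed), len(y_trimmed))
```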

Example Usage:
>>> [x_train, y_train, x_val, y_val, x_test, y_test] = trim_list_size_to_batch_size_factor([x_train, y_train, x_val, y_val, x_test, y_test])
sbr.preprocessing.gtex.dataset_setup(sample_count_threshold=100, expr_path='data/gtex/expr.ftr', attr_path='dist/gtex/GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt', drop_classes_list=None, attr_class_name_column_name='SMTS', attr_sample_id_column_name='SAMPID', expr_sample_id_column_name='sample_id', verbose=True)

Reads the expression and attribute feather files, normalizes the expression values, one-hot encodes the classes, and returns the features, targets and labels in coordinated order.

Parameters
  • sample_count_threshold – drop any classes that have fewer samples than this threshold. If ‘None’, don’t drop any classes

  • expr_path

  • attr_path

  • drop_classes_list

  • attr_class_name_column_name

  • attr_sample_id_column_name

  • expr_sample_id_column_name

  • verbose

Returns

(X, y, class_names, label_df)

  • X: Normalized feature values

  • y: One-hot encoded target values

  • class_names: Ordered list of strings, one item per class. This will be handy for understanding the predictions

  • label_df: a dataframe that combines X, y, class_names, and IDs together
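How y and class_names relate can be shown with a small one-hot encoding sketch that also drops classes below the sample threshold. It assumes class names are ordered alphabetically, which the source does not state; the function name and the labels are hypothetical, and the normalization of X is not covered.

```python
from collections import Counter


def one_hot_with_threshold(labels, sample_count_threshold=None):
    """Return (y, class_names, kept_rows) after dropping under-represented classes."""
    counts = Counter(labels)
    class_names = sorted(
        name for name, n in counts.items()
        if sample_count_threshold is None or n >= sample_count_threshold
    )
    index = {name: i for i, name in enumerate(class_names)}
    # Row numbers of the samples whose class survived the threshold.
    kept_rows = [i for i, label in enumerate(labels) if label in index]
    y = [
        [1 if index[labels[i]] == j else 0 for j in range(len(class_names))]
        for i in kept_rows
    ]
    return y, class_names, kept_rows


tissue = ["Lung", "Heart", "Lung", "Blood", "Heart", "Lung"]
y_toy, names, kept = one_hot_with_threshold(tissue, sample_count_threshold=2)
print(names, kept, y_toy)
```

Column j of y corresponds to class_names[j], which is why the arg-max of a row can be used to look up the class name.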

Example Usage:
>>> X, y, class_names, label_df = dataset_setup(100)
>>> # The return from this function (X, y) can be split as such:
>>> x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, np.array(y), test_size=1.-fraction, random_state=42, shuffle=True)
>>> # Class names can be retrieved from the returned target array (y) and ordered list of class names (class_names) as such:
>>> [class_names[i] for i in np.argmax(np.array(y), axis=1)]