Welcome to sbr’s documentation!

sbr is a set of useful functions and classes for modelling gene expression data with TensorFlow.

Full Reference

sbr.compile.one_layer_multicategorical(input_size=None, output_size=None, learning_rate: float = 0.0001, dim: int = 1000, specificityAtSensitivityThreshold: float = 0.5, sensitivityAtSpecificityThreshold: float = 0.5, kernel_initializer=tensorflow.keras.initializers.HeNormal, bias_initializer=tensorflow.zeros_initializer, output_activation: str = 'softmax', isMultilabel: bool = True, verbose: bool = True)

Compile a single layer multicategorical model.

Use sbr.visualize.plot_loss_curve to view the metrics after fitting.

Parameters

input_size – Number of input features, usually x_train.shape[1]. Not required to compile the model, but required for calling model.summary()
output_size – Number of classes in the one-hot-encoded target vector; usually y_train.shape[1]
learning_rate – Initial learning rate. Expect this to be reduced during model training/fit (see lr_patience and lr_factor in sbr.fit.multicategorical_model)
dim – Number of nodes in the hidden layer. Something half-way between input_size and output_size is a good choice, but if input_size is very big, the number may need to be smaller in order to reduce the number of trainable parameters and avoid over-fitting.
specificityAtSensitivityThreshold – Sensitivity level, as a fraction between 0 and 1, at which specificity is reported: with at least this fraction of true positives detected, find the specificity (the fraction of true negatives correctly rejected). This is a bit trickier for multivariate problems; see this blog article on analyticsvidhya.com
sensitivityAtSpecificityThreshold – The converse: the specificity level at which sensitivity is reported.
kernel_initializer – HeNormal initializer forces diversity of outcomes between trainings
bias_initializer – initializer for the biases
output_activation – Use softmax for multicategorical, one-hot encoded targets
isMultilabel – Should always be True for multicategorical models
verbose – If True, print model summary. Set to False if input_size = None to avoid error
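
The two threshold parameters can be easier to reason about with a small worked example. The sketch below is illustrative only: specificity_at_sensitivity is a hypothetical helper, not part of sbr, and it scans every distinct score as a candidate threshold, which is not necessarily how sbr or Keras computes the metrics of the same name.

```python
def specificity_at_sensitivity(y_true, y_score, min_sensitivity):
    """Best specificity among thresholds whose sensitivity >= min_sensitivity.

    Hypothetical helper for illustration; binary labels (0/1) only.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best = 0.0
    for threshold in sorted(set(y_score)):
        predicted = [score >= threshold for score in y_score]
        tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p)
        tn = sum(1 for t, p in zip(y_true, predicted) if t == 0 and not p)
        sensitivity = tp / positives   # fraction of true positives detected
        specificity = tn / negatives   # fraction of true negatives rejected
        if sensitivity >= min_sensitivity:
            best = max(best, specificity)
    return best


# Made-up labels and scores for one class
y_true = [1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1, 0.05]
print(specificity_at_sensitivity(y_true, y_score, 0.5))  # 1.0
print(specificity_at_sensitivity(y_true, y_score, 1.0))  # 0.75
```

Demanding a higher sensitivity forces a lower threshold, which admits more false positives and so lowers the specificity that can be reached.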

Returns

A model of type tf.keras.Model

Example usage:

>>> model = compile.one_layer_multicategorical(input_size=x_train.shape[1],
                                     output_size=y_train.shape[1],
                                     output_activation='softmax',
                                     learning_rate=0.0001,
                                     isMultilabel=True,
                                     dim=1000,
                                     specificityAtSensitivityThreshold=0.50,
                                     sensitivityAtSpecificityThreshold=0.50,
                                     verbose=True)
  Model: "sequential"
  _________________________________________________________________
  Layer (type)                 Output Shape              Param #
  =================================================================
  Input_BAD (BADBlock)         (None, 1000)              18968000
  _________________________________________________________________
  output (Dense)               (None, 26)                26026
  =================================================================
  Total params: 18,994,026
  Trainable params: 18,992,026
  Non-trainable params: 2,000
  _________________________________________________________________
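
The parameter counts in the summary above can be checked by hand, which helps when choosing dim. The sketch below assumes 18963 input features (the value used in the BADBlock example in this reference) and assumes the hidden block holds one dense kernel plus bias and a standard batch-normalization layer; both are assumptions about the model that produced this summary.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def batch_norm_params(n_units):
    """(trainable, non-trainable) counts: gamma and beta are trained,
    the moving mean and moving variance are not."""
    return 2 * n_units, 2 * n_units


input_size, dim, output_size = 18963, 1000, 26  # input_size is an assumption

bn_trainable, bn_frozen = batch_norm_params(dim)
hidden = dense_params(input_size, dim) + bn_trainable + bn_frozen
output = dense_params(dim, output_size)
total = hidden + output

print(hidden)             # 18968000
print(output)             # 26026
print(total)              # 18994026
print(total - bn_frozen)  # 18992026 trainable
print(bn_frozen)          # 2000 non-trainable
```

Almost all parameters sit in the first layer, so halving dim roughly halves the size of the model.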

class sbr.layers.BADBlock(*args: Any, **kwargs: Any)

Inherits from keras.layers.Dense. A Dense layer followed by batch normalization, activation, and dropout. When the common kwarg input_shape is passed, Keras creates an input layer to insert before the current layer, which avoids explicitly defining an InputLayer.

This is a very good layer to use for gene expression data to increase stability and reduce trainable parameters.
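
To make the order of operations concrete, here is a minimal NumPy sketch of a training-time forward pass through dense, batch normalization, activation, and dropout. It is not the BADBlock implementation: bad_block_forward is a hypothetical name, the batch normalization uses batch statistics only (no learned scale/shift and no moving averages), and ReLU is assumed as the activation.

```python
import numpy as np


def bad_block_forward(x, weights, bias, dropout_rate=0.5, training=True,
                      rng=None, eps=1e-3):
    """Dense -> batch norm -> ReLU -> dropout (illustrative sketch only)."""
    z = x @ weights + bias                                   # dense
    z = (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)  # batch norm
    a = np.maximum(z, 0.0)                                   # ReLU activation
    if training and dropout_rate > 0:
        if rng is None:
            rng = np.random.default_rng()
        keep = rng.random(a.shape) >= dropout_rate
        a = a * keep / (1.0 - dropout_rate)                  # inverted dropout
    return a


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))        # 8 samples, 5 features
weights = rng.normal(size=(5, 3))  # 3 hidden units
bias = np.zeros(3)
print(bad_block_forward(x, weights, bias, training=False).shape)  # (8, 3)
```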

Example 1:

Recreate this layer from its config:

>>> layer = BADBlock(units=1000)
>>> config = layer.get_config()
>>> new_layer = BADBlock.from_config(config)

Example 2:

Use in a model:

>>> import tensorflow as tf
>>> from tensorflow.keras.layers import Dense
>>> from tensorflow.keras.models import Sequential
>>> from sbr.layers import BADBlock
>>> model = Sequential()
>>> model.add(BADBlock(units=1000, input_dim = 18963, activation='relu', dropout_rate=0.50, name="BAD_1"))
>>> model.add(Dense(26, activation="softmax"))
>>> model.summary()
>>> model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy','mse'])

sbr.evaluate.compare_predictions(model, x_test, y_test, class_names=None, verbose=True)

Predicts y_test from x_test using model, then compares predictions with truth.

Parameters

model – the model whose predict method is called
x_test – test features
y_test – targets
class_names – an ordered list of class name strings that map to the np.argmax(y_test,axis=1) indices in y_test. If None, class indices will be reported instead of string names.
verbose – if verbose, pairs are printed out (good if there aren’t a lot of mislabeled predictions)

Returns

(y_pred, pairs)

y_pred: the predicted outcomes from x_test
pairs: list of (<truth>, <false-prediction>) pairs of class names

Example usage:

>>> y_pred, pairs = compare_predictions(model=model,
                                        x_test=x_test, y_test=y_test,
                                        class_names=class_names,
                                        verbose = True)

Number of test samples: 256
Mis-classifications:
(<truth>,<false-prediction>)
[('Esophagus', 'Blood Vessel'), ('Blood Vessel', 'Heart'), ('Adipose Tissue', 'Breast'), ('Salivary Gland', 'Esophagus')]
[sbr.model.save_architecture] Model successfully saved at: data/model/gtex/manual/gtex_model.h5.
Model: "sequential"
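
The comparison itself can be sketched in a few lines of NumPy. This is a hedged illustration of the idea described above, not the sbr implementation: mislabeled_pairs is a hypothetical name and the arrays are made up.

```python
import numpy as np


def mislabeled_pairs(y_true, y_pred, class_names=None):
    """Return (truth, false-prediction) pairs for misclassified rows."""
    true_idx = np.argmax(y_true, axis=1)
    pred_idx = np.argmax(y_pred, axis=1)
    wrong = np.flatnonzero(true_idx != pred_idx)

    def label(i):
        # Fall back to the class index when no names are supplied
        return class_names[i] if class_names is not None else int(i)

    return [(label(true_idx[i]), label(pred_idx[i])) for i in wrong]


class_names = ["Blood Vessel", "Esophagus", "Heart"]
y_true = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 1]])
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.0, 0.1, 0.9]])
print(mislabeled_pairs(y_true, y_pred, class_names))
# [('Esophagus', 'Blood Vessel'), ('Blood Vessel', 'Heart')]
```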

sbr.evaluate.mislabeled_pair_counts(model, X, y, class_names, sample_ids=None, batch_size=1500, verbose=False)

For multicategorical models: creates a table of observed and predicted class names for the mispredicted observations. With tensorflow 2.6, this tends to use a lot of memory when run repeatedly in a Jupyter notebook; you may need to restart the kernel before a second run. If resources continue to be a problem after restarting the kernel, reduce the batch_size.

Assumes y_pred, y_obs are one-hot encoded and class_names matches the index predictions returned from np.argmax(y_pred)

Parameters

model – used for model.predict
X – feature values
y – one-hot encoded true labels
class_names – ordered list of class_names
sample_ids – pass in this Series object to get back a table of pairs with their sample_ids
batch_size – number of samples to process in each step (to keep from swamping memory)
verbose – helps with debugging; messages each step/batch

Returns

(pairs_counts, pair_id_map)

pairs_counts: Table with compound index 'observed', 'predicted' and one column, 'counts', holding the number of samples in that observed/predicted mislabeled pair.
pair_id_map: None if sample_ids wasn’t passed in, otherwise returns a table with columns observed, predicted, sample_id

Example Usage: Get the mislabeled counts

>>> mislabeled_counts, mislabeled = mislabeled_pair_counts(model=model, X=X, y=y, class_names=class_names,
                                                           sample_ids = pd.Series(label_df["sample_id"]),
                                                           batch_size=500)
>>> mislabeled_counts

Example Usage: Get the mislabeled samples

>>> m=mislabeled.reset_index()
>>> m[m['observed']=="Lung"]
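
The role of batch_size can be illustrated with a small sketch that predicts in slices and tallies mislabeled pairs as it goes. This is an assumption-laden illustration, not the sbr code: count_mislabeled_pairs is a hypothetical name, predict stands in for model.predict, and a collections.Counter replaces the returned table.

```python
from collections import Counter

import numpy as np


def count_mislabeled_pairs(predict, X, y, class_names, batch_size=1500):
    """Tally (observed, predicted) pairs for mispredicted rows, one batch at a time."""
    counts = Counter()
    for start in range(0, len(X), batch_size):
        stop = start + batch_size
        # Only one batch of predictions is held in memory at a time
        pred_idx = np.argmax(predict(X[start:stop]), axis=1)
        true_idx = np.argmax(y[start:stop], axis=1)
        for t, p in zip(true_idx, pred_idx):
            if t != p:
                counts[(class_names[t], class_names[p])] += 1
    return counts


class_names = ["Heart", "Lung", "Skin"]
X = np.eye(3)[[0, 1, 2, 1, 1]]  # stand-in features
y = np.eye(3)[[0, 2, 2, 0, 0]]  # one-hot true labels
# Identity function as a stand-in for model.predict
counts = count_mislabeled_pairs(lambda batch: batch, X, y, class_names, batch_size=2)
print(counts[("Heart", "Lung")], counts[("Skin", "Lung")])  # 2 1
```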

sbr.evaluate.training_report(model, x_test, y_test, sensitivityAtSpecificityThreshold=None, specificityAtSensitivityThreshold=None, verbose=True)

Calls model.evaluate(x_test,y_test) and, if verbose, reports on the performance, then returns a performance object like the one returned by model.evaluate.

Parameters

model – the model on which model.evaluate is called
x_test – features
y_test – targets
verbose – if True, report to stdout
sensitivityAtSpecificityThreshold – If not None, and verbose, and this metric was captured in model.fit, report it to stdout
specificityAtSensitivityThreshold – see above

Returns

A performance object

Example Usage:

>>> performance = training_report(model, x_test, y_test,
                              sensitivityAtSpecificityThreshold=sensitivityAtSpecificityThreshold,
                              specificityAtSensitivityThreshold=specificityAtSensitivityThreshold,
                              verbose=True)

Performance:
Performance details:
  loss:0.07804308831691742
  accuracy:0.984375
  mse:0.0009617832256481051
  precision:0.984375
  recall:0.984375
  auc:0.9988833665847778
  SpecificityAtSensitivity:0.9998437762260437
  SensitivityAtSpecificity:0.99609375
  fp:4.0
  fn:4.0
  tp:252.0
  tn:6396.0
Figure(500x500)
Number of training samples: 2080
Number of validation samples: 256

…
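
The confusion counts in the report above can be turned back into the rate metrics by hand, which is a useful sanity check. The sketch below uses only the tp, fp, fn, and tn values printed above; that they are pooled over all one-hot columns is an inference from the arithmetic (they sum to 256 samples × 26 classes), not something the report states.

```python
tp, fp, fn, tn = 252.0, 4.0, 4.0, 6396.0  # values from the report above

precision = tp / (tp + fp)    # of the predicted positives, how many are right
recall = tp / (tp + fn)       # sensitivity: true positives detected
specificity = tn / (tn + fp)  # true negatives correctly rejected

print(precision)          # 0.984375
print(recall)             # 0.984375
print(specificity)        # 0.999375
print(tp + fp + fn + tn)  # 6656.0, i.e. 256 * 26
```

The precision and recall computed this way match the values in the report.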

sbr.fit.multicategorical_model(model, model_folder, x_train, y_train, x_validation, y_validation, epochs=200, patience=4, lr_patience=2, lr_factor=0.1, batch_size=32, shuffle_value=100, initial_epoch=0, train_verbose=1, checkpoint_verbose=1)

Fits the given model with the given hyperparameters and multi-categorical data, after computing class weights and shuffling the data. Writes checkpoint and final model weights to model_folder. Look under variables/variables.* for weights.
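
The text above does not say how the class weights are computed. One common scheme, shown here purely as an assumption, is the "balanced" heuristic n_samples / (n_classes * class_count), which gives rare classes proportionally larger weights. balanced_class_weights is a hypothetical helper; the dictionary it returns has the shape that the class_weight argument of Keras model.fit accepts.

```python
import numpy as np


def balanced_class_weights(y_one_hot):
    """Map class index -> weight, inversely proportional to class frequency.

    Assumes every class appears at least once (otherwise this divides by zero).
    """
    counts = y_one_hot.sum(axis=0)
    n_samples, n_classes = y_one_hot.shape
    weights = n_samples / (n_classes * counts)
    return {i: float(w) for i, w in enumerate(weights)}


y = np.eye(2)[[0, 0, 0, 1]]  # three samples of class 0, one of class 1
print(balanced_class_weights(y))  # class 1 gets weight 2.0, class 0 about 0.667
```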

Assumptions

Model has been compiled and saved to f"{model_path}.h5" (e.g., data/model/gtex/manual/gtex_model.h5)
Targets are one-hot encoded
Features have been normalized

Tested with tensorflow v2.6.2, keras 2.6.0

Parameters

model – a compiled model
model_folder – writable folder to store the checkpoint and final model weights
x_train – training features, see sbr.split for help
y_train – training targets, see above
x_validation – validation features, see above
y_validation – validation targets, see above
epochs [200] – Number of epochs to train
patience [4] – Number of epochs with no improvement after which training will be stopped.
lr_patience [2] – Number of epochs with no improvement after which learning rate will be reduced.
lr_factor [0.1] – Factor by which the learning rate will be reduced. new_lr = lr * factor.
batch_size [32] – probably don’t change this
shuffle_value [100]
initial_epoch [0] – use this if you want to resume training at a particular epoch

train_verbose [1] – amount of information to print on each epoch. 0: silent, 1: animated progress bar, 2: one line per epoch. For example:

0: <silent>
1:

[==================]
Epoch 00015: val_loss improved from 0.06645 to 0.06611, saving model to
     data/model/gtex
INFO:tensorflow:Assets written to: data/model/gtex/assets

2:

Epoch 1/10

checkpoint_verbose [1] – amount of information to print on each epoch about the checkpoint. 0: silent.
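
How patience, lr_patience, and lr_factor interact can be seen in a small simulation. This is a simplified model of the behaviour described above (reduce the learning rate after lr_patience stale epochs, stop after patience stale epochs); it is not the Keras callback logic and simulate_plateau_schedule is a hypothetical name. The validation losses are made up.

```python
def simulate_plateau_schedule(val_losses, lr=1e-4, patience=4,
                              lr_patience=2, lr_factor=0.1):
    """Return [(epoch, learning_rate)] up to the epoch where training stops."""
    best = float("inf")
    stale_stop = 0  # epochs since the last improvement
    stale_lr = 0    # epochs since the last improvement or reduction
    log = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, stale_stop, stale_lr = loss, 0, 0
        else:
            stale_stop += 1
            stale_lr += 1
            if stale_lr >= lr_patience:
                lr *= lr_factor  # new_lr = lr * factor
                stale_lr = 0
        log.append((epoch, lr))
        if stale_stop >= patience:
            break  # early stop
    return log


log = simulate_plateau_schedule([0.5, 0.4, 0.41, 0.42, 0.43, 0.44, 0.45])
for epoch, lr in log:
    print(epoch, lr)
```

With these made-up losses, the best value is reached at epoch 2, the learning rate drops at epochs 4 and 6, and training stops after epoch 6.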

Returns

history

A History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable). Use print(history.history.keys()) to see all the recorded metrics and print(history.history['val_loss']) to print the validation loss.
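
Since History.history is a plain dictionary of lists, the best epoch can be found without any extra tooling. The sketch below uses a made-up dictionary as a stand-in for history.history; the numbers are illustrative, not real training results.

```python
# Stand-in for history.history; values are made up for illustration
history_dict = {
    "loss": [0.90, 0.50, 0.30, 0.25],
    "val_loss": [0.80, 0.60, 0.45, 0.50],
}

val_loss = history_dict["val_loss"]
# Epochs are 1-based in the training log, list indices are 0-based
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

print(sorted(history_dict.keys()))       # ['loss', 'val_loss']
print(best_epoch, val_loss[best_epoch - 1])  # 3 0.45
```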

Example Usage:

>>> from sbr import fit
>>> history=fit.multicategorical_model(model=model,
                             model_folder ='data/model/gtex',
                             x_train=x_train, y_train=y_train,
                             x_validation=x_validation, y_validation=y_validation,
                             epochs = 200,
                             patience = 4,
                             lr_patience = 2,
                             checkpoint_verbose=1,
                             train_verbose=0)

Example Usage: Reload with:

>>> from tensorflow.keras.models import load_model
>>> model = load_model(f"{model_path}")
>>> model.load_weights(f"{model_folder}")

sbr.model.save_architecture(model, model_path: Optional[str] = None, file_name='model.h5', input_size=None, verbose=1)

Saves the given model to the given path and name. It’s a good idea to train and then run this in a notebook if possible: the trained model stays resident in memory, so this function can be tried again if it fails for some reason.

Note

Custom layer BADBlock will be loaded as part of the configuration.

Warning

THIS WILL OVER-WRITE ANY EXISTING MODEL.

Parameters

model – model object for calling model.save
model_path – file path where model is to be written
file_name – name of the file, h5 format. Any existing file will be over-written.
input_size – if not None, attempts to check that predictions from the saved model are close to those of the original model
verbose – 0: debug, 1: print out model summary. This may throw an error if the model wasn’t compiled with a known input size

Returns

True on success, False otherwise. Check the return to try again if it fails while model is still resident in memory.

Example usage:

>>> success = sbr.model.save_architecture(model, model_path="data/model/manual", file_name="model.h5", verbose=1)
True
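
Because the function reports success as a boolean, a retry can be wrapped around it while the model is still in memory. The sketch below shows one possible pattern: save_with_retries and flaky_save are hypothetical names, and flaky_save stands in for a call such as save_architecture(model, ...).

```python
def save_with_retries(save_fn, max_attempts=3):
    """Call save_fn until it returns True; return the attempt number or None."""
    for attempt in range(1, max_attempts + 1):
        if save_fn():
            return attempt
    return None


# Stand-in for the real save call: fails once, then succeeds
outcomes = iter([False, True])


def flaky_save():
    return next(outcomes)


print(save_with_retries(flaky_save))  # 2
```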

sbr.preprocessing.dataset.multicategorical_split(X, y, sample_count_threshold=100, test_fraction=0.1, validation_fraction=0.1, verbose=True, batch_size=32, seed=None, shuffle=True)

Shuffles and splits X, y into test, train, and validation sets; rounds dataset sizes to a multiple of batch_size.

Final dataset size is (sample_count_threshold * <number of classes>)

see also: sbr.preprocessing.gtex.dataset_setup

Parameters

X – Features
y – multicategorical targets (more than one column)
sample_count_threshold – use about this many samples from each class
seed[None] – set this to make function deterministic/repeatable
shuffle[True] – probably don’t touch this. Shuffling the data really helps down-stream model training.

Returns

(x_train, y_train, x_val, y_val, x_test, y_test)
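
The effect of the fractions and batch_size on the split sizes can be sketched with index arithmetic alone. This is an illustration under assumptions, not the sbr implementation: split_indices is a hypothetical name, each split is rounded down to a multiple of batch_size, and the per-class sampling controlled by sample_count_threshold is left out.

```python
import numpy as np


def split_indices(n_samples, test_fraction=0.1, validation_fraction=0.1,
                  batch_size=32, seed=None):
    """Shuffle row indices and cut them into train/validation/test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)

    def round_down(n):
        return (n // batch_size) * batch_size

    n_test = round_down(round(n_samples * test_fraction))
    n_val = round_down(round(n_samples * validation_fraction))
    n_train = round_down(n_samples - n_test - n_val)

    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:n_train + n_val + n_test]
    return train, val, test


# 26 classes x 100 samples per class
train, val, test = split_indices(26 * 100, seed=42)
print(len(train), len(val), len(test))  # 2080 256 256
```

With 26 classes and a threshold of 100, these sizes happen to agree with the 2080 training and 256 validation samples shown in the training_report example, though the actual rounding in sbr may differ.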

sbr.preprocessing.dataset.trim_list_size_to_batch_size_factor(batch_size=32, trim_list=None)

Trims the given list of multicategorical arrays down to a multiple of the given batch_size. This can avoid errors during training when the dataset is very large, a small amount of data loss isn’t a concern, and retaining a specific batch_size (e.g., 32) is preferred.

Parameters

trim_list – a list of arrays to be trimmed
batch_size[32] – probably leave this alone

Returns

the same trim_list, but trimmed

Example Usage:

>>> [x_train, y_train, x_val, y_val, x_test, y_test] = trim_list_size_to_batch_size_factor([x_train, y_train, x_val, y_val, x_test, y_test])
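
The trimming amounts to slicing each array to the largest multiple of batch_size that fits. A minimal sketch, assuming the trailing rows are the ones dropped (trim_to_batch_multiple is a hypothetical name, not the sbr function):

```python
def trim_to_batch_multiple(arrays, batch_size=32):
    """Slice each sequence to the largest multiple of batch_size that fits."""
    return [a[: (len(a) // batch_size) * batch_size] for a in arrays]


x = list(range(100))
y = list(range(70))
x_trimmed, y_trimmed = trim_to_batch_multiple([x, y])
print(len(x_trimmed), len(y_trimmed))  # 96 64
```

The slicing works on Python lists and NumPy arrays alike, because both support len() and slicing along the first axis.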

sbr.preprocessing.gtex.dataset_setup(sample_count_threshold=100, expr_path='data/gtex/expr.ftr', attr_path='dist/gtex/GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt', drop_classes_list=None, attr_class_name_column_name='SMTS', attr_sample_id_column_name='SAMPID', expr_sample_id_column_name='sample_id', verbose=True)

Reads the expression and attribute feather files, normalizes the expression values, one-hot encodes the classes, and returns the features, targets and labels in coordinated order.

Parameters

sample_count_threshold – drop any classes with fewer samples than this threshold. If ‘None’, don’t drop any classes
expr_path
attr_path
drop_classes_list
attr_class_name_column_name
attr_sample_id_column_name
expr_sample_id_column_name
verbose

Returns

(X, y, class_names, label_df)

X: Normalized feature values
y: One-hot encoded target values
class_names: Ordered list of strings, one item per class. This will be handy for understanding the predictions
label_df: a dataframe that combines X, y, class_names, and IDs together

Example Usage:

>>> X, y, class_names, label_df = dataset_setup(100)
>>> # The return from this function (X, y) can be split as such:
>>> x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, np.array(y), test_size=1.-fraction, random_state=42, shuffle=True)
>>> # Class names can be retrieved from the returned target array (y) and ordered list of class names (class_names) as such:
>>> np.array(class_names)[np.argmax(np.array(y), axis=1)]
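
The relationship between y and class_names can be shown as a round trip: encode string labels as one-hot rows, then decode them with np.argmax. The labels below are made up, and sorting the class names alphabetically is an assumption about the ordering, not something this reference states.

```python
import numpy as np

labels = ["Lung", "Heart", "Lung", "Blood Vessel"]  # made-up sample labels

# Ordered class list; alphabetical order is an assumption
class_names = sorted(set(labels))
index = {name: i for i, name in enumerate(class_names)}

# One row per sample, a single 1 in the column of its class
y = np.zeros((len(labels), len(class_names)), dtype=int)
y[np.arange(len(labels)), [index[label] for label in labels]] = 1

# Decode: the position of the 1 in each row indexes into class_names
decoded = [class_names[i] for i in np.argmax(y, axis=1)]
print(decoded)  # ['Lung', 'Heart', 'Lung', 'Blood Vessel']
```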