API

Images

Open-i Interface

class biovida.images.openi_interface.OpeniInterface(cache_path=None, verbose=True)[source]

Python Interface for the NIH’s Open-i API.

Parameters:
  • cache_path (str or None) – path to the location of the BioVida cache. If a cache does not exist in this location, one will be created. Defaults to None, which will generate a cache in the home folder.
  • verbose (bool) – if True, print additional details.
save_records_db(path)[source]

Save the current records_db.

Parameters: path (str) – a system path ending with the ‘.p’ file extension.
load_records_db(records_db)[source]

Load a records_db.

Parameters: records_db (str or Pandas DataFrame) – a system path or the records_db itself.
records_db_short

Return records_db with nonessential columns removed.

cache_records_db_short

Return cache_records_db with nonessential columns removed.

options(search_parameter=None, print_options=True)[source]

Options for parameters of search().

Parameters:
  • search_parameter (str) – one of: ‘image_type’, ‘rankby’, ‘article_type’, ‘subset’, ‘collection’, ‘fields’, ‘specialties’, ‘video’ or ‘exclusions’. If None, print the parameters of search().
  • print_options (bool) – if True, pretty print the options, else return as a list. Defaults to True.
Returns:

a list of valid values for a given search_parameter.

Return type:

list

search(query=None, image_type=None, rankby=None, article_type=None, subset=None, collection=None, fields=None, specialties=None, video=None, exclusions=['graphics'], print_results=True)[source]

Tool to generate a search term (URL) for the NIH’s Open-i API. The computed term is stored as a class attribute (INSTANCE.current_search_url).

Parameters:
  • query (str or None) – a search term. None will be converted to an empty string.
  • image_type (str, list, tuple or None) – see OpeniInterface().options('image_type') for valid values.
  • rankby (str, list, tuple or None) – see OpeniInterface().options('rankby') for valid values.
  • article_type (str, list, tuple or None) – see OpeniInterface().options('article_type') for valid values.
  • subset (str, list, tuple or None) – see OpeniInterface().options('subset') for valid values.
  • collection (str, list, tuple or None) – see OpeniInterface().options('collection') for valid values.
  • fields (str, list, tuple or None) – see OpeniInterface().options('fields') for valid values.
  • specialties (str, list, tuple or None) – see OpeniInterface().options('specialties') for valid values.
  • video (str, list, tuple or None) – see OpeniInterface().options('video') for valid values.
  • exclusions (list, tuple or None) –

    one or both of: ‘graphics’, ‘multipanel’. Defaults to ['graphics'].

    Note

Excluding ‘multipanel’ can still result in multipanel images being returned by the Open-i API. For this reason, including ‘multipanel’ in exclusions is not currently recommended.

  • print_results (bool) – if True, print the number of search results.

Note

If passing a single option to image_type, rankby, article_type, subset, collection, fields, specialties or video, a string can be used, e.g., image_type='ct'. For multiple values, a list or tuple must be used, e.g., image_type=('ct', 'mri').
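The string-versus-iterable convention above can be illustrated with a small, hypothetical normalization helper (not part of BioVida) showing how a string, an iterable and None would each be treated:

```python
def normalize_param(value):
    """Mimic the documented convention: a plain string is treated as a
    single option; a list or tuple passes through; None means 'no filter'."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)

# A single option may be passed as a plain string...
print(normalize_param('ct'))           # ['ct']
# ...while multiple options require a list or tuple.
print(normalize_param(('ct', 'mri')))  # ['ct', 'mri']
```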

pull(image_size='large', records_sleep_time=(10, 1.5), images_sleep_time=(10, 1.5), download_limit=100, clinical_cases_only=False, use_image_caption=False, new_records_pull=True)[source]

Pull (i.e., download) the current search.

In addition to the columns provided by Open-i, this method will automatically generate the following columns by analyzing the pulled data:

  • 'age'
  • 'sex'
  • 'ethnicity'
  • 'diagnosis'
  • 'parsed_abstract'
  • duration of illness ('illness_duration_years')
  • the imaging modality (e.g., MRI) used, based on the text associated with the image ('imaging_modality_from_text')
  • the plane (‘axial’, ‘coronal’ or ‘sagittal’) of the image ('image_plane')
  • image problems (‘arrows’, ‘asterisks’ and ‘grids’) inferred from the image caption ('image_problems_from_text')

Note

The ‘parsed_abstract’ column contains abstracts coerced into dictionaries where the subheadings of the abstract form the keys and their associated information form the values. For example, a MedPix image will typically yield a dictionary with the following keys: ‘history’, ‘finding’, ‘ddx’ (differential diagnosis), ‘dxhow’ and ‘exam’.
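As an illustration, a parsed abstract for a MedPix image might take a form like the following. The keys come from the note above; the values here are invented for illustration only:

```python
# A hypothetical entry from the 'parsed_abstract' column of records_db:
# subheadings of the abstract become keys; their text becomes values.
parsed_abstract = {
    'history': '55 year old male with headache.',
    'finding': 'T2 hyperintensity in the left frontal lobe.',
    'ddx': 'glioma; abscess; demyelination',  # differential diagnosis
    'dxhow': 'MRI',
    'exam': 'MRI of the brain.',
}

# Each subheading can then be accessed directly by key:
print(parsed_abstract['ddx'])
```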

Warning

MedPix images include a distinct ‘diagnosis’ section. For images from other sources, the 'diagnosis' column is obtained by analyzing the text associated with the image. This analysis could produce inaccuracies.

Parameters:
  • image_size (str or None) –
    one of: ‘large’, ‘grid150’, ‘thumb’, ‘thumb_large’ or None. Defaults to ‘large’.
    If None, no attempt will be made to download images.

    Warning

    The analyses performed by the image_processing.OpeniImageProcessing class are most accurate with large images.

  • records_sleep_time (tuple or None) – tuple of the form: (every x downloads, period of time [seconds]). Defaults to (10, 1.5). Note: noise is randomly added to the sleep time by sampling from a normal distribution (with mean = 0, sd = 0.75).
  • images_sleep_time (tuple or None) – tuple of the form: (every x downloads, period of time [seconds]). Defaults to (10, 1.5). Note: noise is randomly added to the sleep time by sampling from a normal distribution (with mean = 0, sd = 0.75).
  • download_limit (None or int) – max. number of results to download. If None, no limit will be imposed (not recommended). Defaults to 100.
  • clinical_cases_only (bool) –
    if True, require that the harvested data pertain to a clinical case. Specifically,
    this parameter requires that ‘article_type’ be one of: ‘encounter’, ‘case_report’. Defaults to False.

    Note

    If True, this parameter will often result in fewer records being returned than the download_limit.

  • use_image_caption (bool) – if True block downloading of an image if its caption suggests the presence of problematic image properties (e.g., ‘arrows’) likely to corrupt a dataset. Defaults to False.
  • new_records_pull (bool) –

    if True, download the data for the current search. If False, use INSTANCE.records_db.

    Note

    Setting new_records_pull=False can be useful if one wishes to initially set image_size=None, truncate or otherwise modify INSTANCE.records_db and then download images.

Returns:

a DataFrame with the record information. If image_size is not None, images will also be harvested and cached.

Return type:

Pandas DataFrame

Raises:

ValueError – if search() has not been called.

Cancer Imaging Archive Interface

class biovida.images.cancer_image_interface.CancerImageInterface(api_key, cache_path=None, verbose=True)[source]

Python Interface for the Cancer Imaging Archive’s API.

Parameters:
  • api_key (str) –

    a key for the Cancer Imaging Archive’s API.

    Note

    An API key can be obtained by following the instructions provided here.

  • cache_path (str or None) – path to the location of the BioVida cache. If a cache does not exist in this location, one will be created. Defaults to None, which will generate a cache in the home folder.
  • verbose (bool) – if True, print additional details.

Warning

If you wish to use any data obtained from this resource for any form of publication, you must follow the citation guidelines provided by the study’s authors on the project’s Cancer Imaging Archive repository page.

Warning

Several studies on the Cancer Imaging Archive are subject to publication blockades. Therefore, if you intend to publish any findings which use data from this resource, you must first check that the studies you have selected are not subject to such restrictions.

save_records_db(path)[source]

Save the current records_db.

Parameters: path (str) – a system path ending with the ‘.p’ file extension.
load_records_db(records_db)[source]

Load a records_db.

Parameters: records_db (str or Pandas DataFrame) – a system path or the records_db itself.
records_db_short

Return records_db with nonessential columns removed.

cache_records_db_short

Return cache_records_db with nonessential columns removed.

update_collections()[source]

Refresh the list of collections provided by the Cancer Imaging Archive.

search(collection=None, cancer_type=None, location=None, modality=None, download_override=False, pretty_print=True)[source]

Search for studies on the Cancer Imaging Archive.

Parameters:
  • collection (list, tuple, str or None) – a collection (study), or iterable (e.g., list) of collections, hosted by the Cancer Imaging Archive. Defaults to None.
  • cancer_type (str, iterable or None) – a string or list/tuple specifying cancer types. Defaults to None.
  • location (str, iterable or None) – a string or list/tuple specifying body locations. Defaults to None.
  • modality (str, iterable or None) – the type of imaging technology. See: CancerImageInterface().dicom_modality_abbrevs for valid values. Defaults to None.
  • download_override (bool) – If True, override any existing database currently cached and download a new one. Defaults to False.
  • pretty_print (bool) – if True, pretty print the search results. Defaults to True.
Returns:

a dataframe containing the search results.

Return type:

Pandas DataFrame

Example:
>>> CancerImageInterface(YOUR_API_KEY_HERE).search(cancer_type='carcinoma', location='head')
...
   collection                   cancer_type                          modalities         subjects    location
0  TCGA-HNSC            Head and Neck Squamous Cell Carcinoma  CT, MR, PT                 164     [Head, Neck]
1  QIN-HeadNeck         Head and Neck Carcinomas               PT, CT, SR, SEG, RWV       156     [Head, Neck]
      ...                          ...                                  ...               ...         ...
extract_dicom_data(database='records_db', make_hashable=False)[source]

Extract data from all dicom files referenced in records_db or cache_records_db. Note: this requires that save_dicom is True when pull() is called.

Parameters:
  • database (str) – the name of the database to use. Must be one of: ‘records_db’, ‘cache_records_db’. Defaults to ‘records_db’.
  • make_hashable (bool) – if True, convert the extracted data to nested tuples; if False, generate nested dictionaries. Defaults to False.
Returns:

a series of the dicom data with dictionaries of the form {path: {DICOM Description: value, ...}, ...}. If make_hashable is True, all dictionaries will be converted to tuples.

Return type:

Pandas Series

pull(patient_limit=3, session_limit=1, collections_limit=None, allowed_modalities=None, save_dicom=True, save_png=False, new_records_pull=True)[source]

Pull (i.e., download) the current search.

Notes:

  • When save_png is True, 3D DICOM images are saved as individual frames.

  • PNG file names in the cache adhere to the following format:

    [instance, pull_position]__[patient_id_[Last 10 Digits of SeriesInstanceUID]]__[Image Scale ('default')].png

  • DICOM file names in the cache adhere to the following format:

    [instance, original_name_in_source_file]__[patient_id_[Last 10 Digits of SeriesInstanceUID]].dcm

where:

  • ‘instance’ denotes the image’s position in the 3D image (if applicable and available)
  • ‘pull_position’ denotes the position of the image in the set returned for the given ‘SeriesInstanceUID’ by the Cancer Imaging Archive.
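As a rough illustration, a PNG cache file name following the template above could be assembled as below. This is a sketch only: the exact separators implied by the bracket notation are assumptions, not taken from BioVida's source.

```python
def png_cache_name(instance, pull_position, patient_id,
                   series_instance_uid, image_scale='default'):
    """Assemble a PNG cache file name following the documented template:
    [instance, pull_position]__[patient_id_[last 10 of SeriesInstanceUID]]__[scale].png
    (Separator choices here are assumptions for illustration.)"""
    uid_tail = series_instance_uid[-10:]  # last 10 digits of the SeriesInstanceUID
    return "{0}_{1}__{2}_{3}__{4}.png".format(
        instance, pull_position, patient_id, uid_tail, image_scale)

print(png_cache_name(3, 2, 'TCGA-XX-0001', '1.3.6.1.4.1.9328.50.1234567890'))
# 3_2__TCGA-XX-0001_1234567890__default.png
```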
Parameters:
  • patient_limit (int or None) – limit on the number of patients to extract. Patient IDs are sorted prior to this limit being imposed. If None, no patient_limit will be imposed. Defaults to 3.
  • session_limit (int) –
    restrict image harvesting to the first n imaging sessions (days) for a given patient,
    where n is the value passed to this parameter. If None, no limit will be imposed. Defaults to 1.

    Warning

    Several studies (collections) in the Cancer Imaging Archive database have multiple imaging sessions. Latter sessions may be of patients following interventions, such as surgery, intended to eliminate cancerous tissue. For this reason it cannot be assumed that images obtained from non-baseline sessions (i.e., session number > 1) contain signs of disease.

  • collections_limit (int or None) – limit the number of collections to download. If None, no limit will be applied. Defaults to None.
  • allowed_modalities (list or tuple) – limit images downloaded to certain modalities. See: CancerImageInterface(YOUR_API_KEY_HERE).dicom_modality_abbrevs (use the keys). Note: ‘MRI’, ‘PET’, ‘CT’ and ‘X-Ray’ can also be used. This parameter is not case sensitive. Defaults to None.
  • save_dicom (bool) – if True, save the DICOM images provided by The Cancer Imaging Archive ‘as is’. Defaults to True.
  • save_png (bool) – if True, convert the DICOM images provided by The Cancer Imaging Archive to PNGs. Defaults to False.
  • new_records_pull (bool) – if True, download the data for the current search. If False, use INSTANCE.records_db.
Returns:

a DataFrame with the record information.

Return type:

Pandas DataFrame

Image Processing

class biovida.images.image_processing.OpeniImageProcessing(instance, db_to_extract='records_db', model_location=None, download_override=False, verbose=True)[source]

This class is designed to allow easy analysis and cleaning of cached Open-i image data.

Parameters:
  • instance (OpeniInterface Class) – an instance of the biovida.images.openi_interface.OpeniInterface() class.
  • db_to_extract (str) – one of: ‘records_db’ or ‘cache_records_db’. Defaults to ‘records_db’.
  • model_location (str) – the location of the model for Convnet. If None, the default model will be used. Defaults to None.
  • download_override (bool) – If True, download a new copy of the ‘visual_image_problems_model’ weights (and associated resources) regardless of whether or not files with these names are already cached. Defaults to False.
  • verbose (bool) – if True, print additional details. Defaults to True.
Variables:

image_dataframe – the dataframe passed when the class was instantiated; the results of all analyses run are cached in it as new columns.

image_dataframe_short

Return image_dataframe with nonessential columns removed.

grayscale_analysis(new_analysis=False, status=True)[source]

Analyze the images to determine whether or not they are grayscale (uses the PIL image library).

Note:
  • this tool is very conservative (even very small amounts of ‘color’ will yield False).
  • the exception to the above rule is the very rare case of an image with an even split of red, green and blue. In such an instance, this function may erroneously conclude that the image is grayscale.
Parameters:
  • new_analysis (bool) – rerun the analysis if it has already been computed. Defaults to False.
  • status (bool) – display status bar. Defaults to True.
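The conservative check described in the note can be approximated, setting PIL aside, with a sketch like the following (illustrative only; not BioVida's implementation):

```python
def looks_grayscale(rgb_pixels):
    """Return True only if every pixel has exactly equal R, G and B values.
    Tiny amounts of colour therefore yield False -- and, as the note above
    warns, a (rare) image with a perfectly even split of the three channels
    would be erroneously classified as grayscale."""
    return all(r == g == b for r, g, b in rgb_pixels)

print(looks_grayscale([(10, 10, 10), (200, 200, 200)]))  # True
print(looks_grayscale([(10, 10, 11), (200, 200, 200)]))  # False
```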
logo_analysis(match_quality_threshold=0.25, xy_position_threshold=(0.3333333333333333, 0.4), base_resizes=(0.5, 2.5, 0.1), end_search_threshold=0.875, base_image_cropping=(0.15, 0.5), new_analysis=False, status=True)[source]

Search for the MedPix logo. If located with match quality above match_quality_threshold, populate the ‘medpix_logo_bounding_box’ column of image_dataframe with its bounding box.

Parameters:
  • match_quality_threshold (float) – the minimum match quality required to accept the match. See: skimage.feature.match_template() for more information.
  • xy_position_threshold (tuple) – tuple of the form: (x_greater_check, y_greater_check). For instance, the default (1/3.0, 1/2.5) requires that the x position of the logo be greater than 1/3 of the image’s width and that the y position be greater than 1/2.5 (i.e., 0.4) of the image’s height.
  • base_resizes (tuple) – See: biovida.images.models.template_matching.robust_match_template().
  • end_search_threshold (float) – See: biovida.images.models.template_matching.robust_match_template().
  • base_image_cropping (tuple) – See: biovida.images.models.template_matching.robust_match_template()
  • new_analysis (bool) – rerun the analysis if it has already been computed.
  • status (bool) – display status bar. Defaults to True.
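A sketch of how the xy_position_threshold check might operate, assuming (from the parameter form (x_greater_check, y_greater_check)) that both coordinates are tested with a greater-than comparison. The helper below is hypothetical, not BioVida's code:

```python
def logo_position_ok(logo_x, logo_y, image_width, image_height,
                     xy_position_threshold=(1 / 3.0, 1 / 2.5)):
    """Accept a candidate logo match only if it sits far enough toward the
    image's right and bottom (assumed semantics of the greater-checks)."""
    x_greater_check, y_greater_check = xy_position_threshold
    return (logo_x > image_width * x_greater_check and
            logo_y > image_height * y_greater_check)

# A match near the right edge and below 40% of the height passes:
print(logo_position_ok(900, 500, image_width=1000, image_height=1000))  # True
# A match near the left edge is rejected:
print(logo_position_ok(100, 500, image_width=1000, image_height=1000))  # False
```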
border_analysis(signal_strength_threshold=0.25, min_border_separation=0.15, lower_bar_search_space=0.9, new_analysis=False, status=True)[source]

Wrapper for biovida.images.models.border_detection.border_detection().

Parameters:
  • signal_strength_threshold (float) – see biovida.images.models.border_detection.border_detection().
  • min_border_separation (float) – see biovida.images.models.border_detection.border_detection().
  • lower_bar_search_space (float) – see biovida.images.models.border_detection.border_detection().
  • new_analysis (bool) – rerun the analysis if it has already been computed. Defaults to False.
  • status (bool) – display status bar. Defaults to True.
crop_decision(new_analysis=False)[source]

Decide where to crop the images, if at all.

Parameters:new_analysis (bool) – rerun the analysis if it has already been computed. Defaults to False.
visual_image_problems(limit_to_known_modalities=True, new_analysis=False, status=True)[source]

This method is powered by a Convolutional Neural Network which computes probabilities for the presence of problematic image properties or types.

Currently, the model can identify the following problems:

  • arrows in images
  • images arrayed as grids
Parameters:
  • limit_to_known_modalities (bool) – if True, remove model predictions for image modalities the model has not explicitly been trained on. Defaults to True.
  • new_analysis (bool) – rerun the analysis if it has already been computed. Defaults to False.
  • status (bool) – display status bar. Defaults to True.
Examples:
>>> DataFrame['visual_image_problems']
...
0 [('valid_image', 0.82292306), ('text', 0.13276383), ('arrows', 0.10139297), ('grids', 0.021935554)]
1 [('valid_image', 0.76374823), ('arrows', 0.1085605), ('grids', 0.0024915827), ('text', 0.00037114936)]
2 [('valid_image', 0.84319711), ('text', 0.10483728), ('arrows', 0.06458132), ('grids', 0.0125442)]
3 [('valid_image', 0.84013706), ('arrows', 0.090836897), ('text', 0.055015128), ('grids', 0.0088913934)]

The first value in each tuple is the problem identified and the second is its associated probability.

auto_analysis(limit_to_known_modalities=True, new_analysis=False, status=True)[source]

Automatically analyze the image_dataframe using the class methods with their default parameter values.

Parameters:
  • limit_to_known_modalities (bool) – if True, remove model predictions for image modalities the model has not explicitly been trained on. Defaults to True.
  • new_analysis (bool) – rerun the analysis if it has already been computed. Defaults to False.
  • status (bool) – display status bar. Defaults to True.
auto_decision(valid_floor=0.8, problems_to_ignore=None)[source]

Automatically generate an ‘invalid_image’ column in the image_dataframe by deciding whether or not images are valid, using default parameter values for class methods.

Parameters:
  • valid_floor (float) –

    the smallest value needed for a ‘valid_image’ to be considered valid. Defaults to 0.8.

    Note

For an image to be considered ‘valid’, the most likely categorization for the image must be ‘valid_image’ and the probability that the model assigns when placing it in this category must be greater than or equal to valid_floor.

  • problems_to_ignore (None, list or tuple) – image problems to ignore. See INSTANCE.known_image_problems for valid values. Defaults to None.
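The validity rule described in the note can be sketched as follows (an illustration operating on tuples shaped like those shown in the visual_image_problems() example, not BioVida's implementation):

```python
def is_valid_image(visual_image_problems, valid_floor=0.8):
    """An image is 'valid' only if 'valid_image' is the single most likely
    categorization AND its probability is >= valid_floor."""
    top_label, top_prob = max(visual_image_problems, key=lambda t: t[1])
    return top_label == 'valid_image' and top_prob >= valid_floor

# Tuples shaped like the 'visual_image_problems' column entries:
predictions = [('valid_image', 0.84), ('arrows', 0.09),
               ('text', 0.05), ('grids', 0.01)]
print(is_valid_image(predictions))                   # True
print(is_valid_image(predictions, valid_floor=0.9))  # False
```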
auto(valid_floor=0.8, limit_to_known_modalities=True, problems_to_ignore=None, new_analysis=False, status=True)[source]

Automatically carry out all aspects of image preprocessing (recommended).

Parameters:
  • valid_floor (float) – the smallest value needed for a ‘valid_image’ to be considered valid. Defaults to 0.8.
  • limit_to_known_modalities (bool) – if True, remove model predictions for image modalities the model has not explicitly been trained on. Defaults to True.
  • problems_to_ignore (None, list or tuple) – image problems to ignore. See INSTANCE.known_image_problems for valid values. Defaults to None.
  • new_analysis (bool) – rerun the analysis if it has already been computed. Defaults to False.
  • status (bool) – display status bar. Defaults to True.
Returns:

the image_dataframe, complete with the results of all possible analyses (using default parameter values).

Return type:

Pandas DataFrame

clean_image_dataframe(crop_images=True, convert_to_rgb=False, status=True)[source]

Define a dataframe with rows of images found to be ‘valid’. These ‘valid’ images are cleaned and stored as PIL images in a 'cleaned_image' column.

Note

The DataFrame this method creates can be viewed with INSTANCE.image_dataframe_cleaned.

Parameters:
  • crop_images (bool) – Crop the images using analyses results from border_analysis() and logo_analysis(). Defaults to True.
  • convert_to_rgb (bool) – if True, use the PIL library to convert the images to RGB. Defaults to False.
  • status (bool) – display status bar. Defaults to True.
Returns:

self.image_dataframe where the ‘invalid_image’ column is False, with the addition of a ‘cleaned_image’ column populated by PIL images to be saved to disk.

Return type:

Pandas DataFrame

output(output_rule, create_dirs=False, allow_overwrite=True, action='copy', status=True, **kwargs)[source]

Save processed images to disk.

Parameters:
  • output_rule (str or func) –
    • if a str: the directory to save the images.
    • if a function: it must (1) accept a single parameter (argument) and (2) return system path(s) [see example below].
  • create_dirs (bool) – if True, create directories returned by output_rule if they do not exist. Defaults to False.
  • allow_overwrite (bool) – if True allow existing images to be overwritten. Defaults to True.
  • action (str) – one of ‘copy’, ‘ndarray’.
  • status (bool) – display status bar. Defaults to True.
Example:
>>> from biovida.images import OpeniInterface
>>> from biovida.images import OpeniImageProcessing
...
>>> opi = OpeniInterface()
>>> opi.search(image_type='mri')
>>> opi.pull()
...
>>> ip = OpeniImageProcessing(opi)
>>> ip.auto()

Next, prune invalid images

>>> ip.clean_image_dataframe()

A Simple Output Rule

>>> ip.output('/your/path/here/images')

A More Complex Output Rule

>>> def my_save_rule(row):
...     if isinstance(row['abstract'], str) and 'lung' in row['abstract']:
...         return '/your/path/here/lung_images'
...     elif isinstance(row['abstract'], str) and 'heart' in row['abstract']:
...         return '/your/path/here/heart_images'
...
>>> ip.output(my_save_rule)

Image Classification

class biovida.images.models.image_classification.ImageClassificationCNN(data_path=None, image_shape=(224, 224), rescale=0.00392156862745098, shear_range=0.05, zoom_range=0.3, horizontal_flip=True, vertical_flip=False, batch_size=1)[source]

Keras Convolutional Neural Networks Interface.

Parameters:
  • data_path (str) – path to the directory with the subdirectories entitled ‘train’ and ‘validation’. This directory must have this structure. Defaults to None (to be used when loading pre-computed weights).
  • image_shape (tuple or list) – the (height, width) to rescale the images to. Elements must be ints. Defaults to (224, 224).
  • rescale (float) – See: keras.preprocessing.image.ImageDataGenerator(). Defaults to 1/255.
  • shear_range (float) – See: keras.preprocessing.image.ImageDataGenerator(). Defaults to 0.05.
  • zoom_range (float) – See: keras.preprocessing.image.ImageDataGenerator(). Defaults to 0.3.
  • horizontal_flip (bool) – See: keras.preprocessing.image.ImageDataGenerator(). Defaults to True.
  • vertical_flip (bool) – See: keras.preprocessing.image.ImageDataGenerator(). Defaults to False.
  • batch_size (int) – Samples to propagate through the model. See: keras.preprocessing.image.ImageDataGenerator().flow_from_directory(). Defaults to 1.
default(classes, output_layer_activation)[source]

A Convolutional Neural Network with two convolution layers and ~5 million parameters.

Parameters:
  • classes (int) – number of neurons in the output layer (which equals the number of classes).
  • output_layer_activation (str) – the activation function to use on the output layer. See: https://keras.io/activations/#available-activations. Defaults to ‘sigmoid’.
squeezenet(classes, output_layer_activation)[source]

Keras Implementation of the SqueezeNet Model.

Sources:

  • Iandola, Forrest N., Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. ‘SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and< 0.5 MB model size.’ arXiv preprint arXiv:1602.07360 (2016).
  • Implementation adapted from: github.com/rcmalli/keras-squeezenet
Parameters:
  • classes (int) – number of neurons in the output layer (which equals the number of classes).
  • output_layer_activation (str) – the activation function to use on the output layer. See: https://keras.io/activations/#available-activations. Defaults to ‘sigmoid’.
vgg_19(classes, output_layer_activation)[source]

Keras Implementation of the VGG19 Model.

  • Very Deep Convolutional Networks for Large-Scale Image Recognition K. Simonyan, A. Zisserman. arXiv:1409.1556.
  • The Keras Implementation is available here.
Parameters:
  • classes (int) – number of neurons in the output layer (which equals the number of classes).
  • output_layer_activation (str) – the activation function to use on the output layer. See: https://keras.io/activations/#available-activations. Defaults to ‘sigmoid’.
convnet(model_to_use='default', loss='binary_crossentropy', optimizer='default', metrics=('accuracy', ), output_layer_activation='sigmoid')[source]

Define and Compile the Image Recognition Convolutional Neural Network.

Parameters:
  • model_to_use (str) –

    one of: ‘default’, ‘squeezenet’ or ‘vgg19’. Defaults to ‘default’.

    • ‘default’: a relatively simple sequential model with two convolution layers.
    • ‘squeezenet’: the SqueezeNet model.
    • ‘vgg19’: the VGG 19 model.
  • loss (str) – Loss function. Defaults to ‘binary_crossentropy’. See: keras.models.Sequential().
  • optimizer (str or keras.optimizers) – Optimizer name. Defaults to ‘default’, which will use RMSprop with learning rate = 0.0001. See: keras.models.Sequential().
  • metrics (tuple) – Metrics to evaluate. Defaults to (‘accuracy’,). Note: if round braces are used, it MUST contain a comma (to make it a tuple). See: keras.models.Sequential().
  • output_layer_activation (str) – the activation function to use on the output layer. See: https://keras.io/activations/#available-activations. Defaults to ‘sigmoid’.
fit(epochs=10, min_delta=0.1, patience=3)[source]

Fit the model to the training data and run a validation.

Parameters:
  • epochs (int) – number of epochs. See: keras.models.Sequential(). Defaults to 10.
  • min_delta (float) – see keras.callbacks.EarlyStopping().
  • patience (int) – see keras.callbacks.EarlyStopping().
Raises:

AttributeError – if ImageClassificationCNN().convnet() has not yet been called.

save(save_name, path=None, overwrite=False)[source]

Save the weights from a trained model.

Parameters:
  • save_name (str) – name of the file. Do not include the ‘.h5’ extension as it will be added automatically.
  • path (str) – path to save the data to. See: keras.models.Sequential().
  • overwrite (bool) – if True, overwrite the existing copy of the data. Defaults to False.
Raises:

AttributeError – if ImageClassificationCNN().fit() has not yet been called.

load(path, override_existing=False, default_model_load=False)[source]

Load a model from disk.

Note

This method expects an additional file ending in “_support.p”. For example: if path='/your/path/my_model.h5', a file entitled '/your/path/my_model_support.p' is also expected.

Parameters:
  • path (str) – path to the saved model data. See: keras.models.Sequential().
  • override_existing (bool) – if True and a model has already been instantiated, replace the existing model. Defaults to False.
  • default_model_load (bool) – load the default model if ImageClassificationCNN().convnet() has not been called. Defaults to False.
Raises:

AttributeError – if a model is currently instantiated.

predict(list_of_images, desc=None, status=True)[source]

Generate Predictions for a list of images.

Parameters:
  • list_of_images (list) – a list of paths (strings) to images or ndarrays.
  • desc (str or None) – description for tqdm.
  • status (bool) – True for a tqdm status bar; False for no status bar. Defaults to True.
Returns:

a list of lists with tuples of the form (name, probability).

Return type:

list

Template Matching

biovida.images.models.template_matching.robust_match_template(pattern_image, base_image, base_resizes=(0.5, 2.5, 0.1), end_search_threshold=0.875, base_image_cropping=(0.15, 0.5))[source]

Search for a pattern image in a base image using an algorithm which is robust against variation in the size of the pattern in the base image.

Method: Fast Normalized Cross-Correlation.

Limitations:

  • Cropping is limited to the top left of the base image. This can be circumvented by setting base_image_cropping=(1, 1) and cropping base_image oneself.
  • This function may become unstable in situations where the pattern image is larger than the base image.
Parameters:
  • pattern_image (str or ndarray) –

    the pattern image.

    Warning

    If a ndarray is passed to pattern_image, it must be preprocessed to be a 2D array, e.g., scipy.misc.imread(pattern_image, flatten=True)

  • base_image (str or ndarray) –

    the base image in which to look for the pattern_image.

    Warning

    If a ndarray is passed to base_image, it must be preprocessed to be a 2D array, e.g., scipy.misc.imread(base_image, flatten=True)

  • base_resizes (tuple) – the range over which to rescale the base image, defined as a tuple of the form (start, end, step size). Defaults to (0.5, 2.5, 0.1).
  • end_search_threshold (float or None) – if a match of this quality is found, end the search. Set equal to None to disable. Defaults to 0.875.
  • base_image_cropping (tuple) – the amount of the image to crop with respect to the x and y axes. Form: (height, width). Defaults to (0.15, 0.5).

Notes:

  • Decreasing height will increase the amount of the lower part of the image removed.
  • Increasing width will increase the amount of the image’s left half removed.
  • Cropping more of the base image reduces the probability of the algorithm getting confused. However, if the image is cropped too much, the target pattern itself could be removed.
Returns: A dictionary of the form: {"bounding_box": ..., "match_quality": ..., "base_image_shape": ...}.
  • bounding_box (dict): {'bottom_right': (x, y), 'top_right': (x, y), 'top_left': (x, y), 'bottom_left': (x, y)}.
  • match_quality (float): quality of the match.
  • base_image_shape (tuple): the size of the base image provided. Form: (width (x), height (y)).
Return type: dict
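To illustrate the bounding_box format, the four corners can be derived from a top-left match coordinate and the pattern's size. This is a sketch of the output format only; robust_match_template() performs the equivalent computation internally:

```python
def corners_from_top_left(top_left, pattern_width, pattern_height):
    """Expand a matched pattern's top-left (x, y) coordinate into the
    four-corner bounding_box dictionary described above."""
    x, y = top_left
    return {'bottom_right': (x + pattern_width, y + pattern_height),
            'top_right': (x + pattern_width, y),
            'top_left': (x, y),
            'bottom_left': (x, y + pattern_height)}

box = corners_from_top_left((10, 20), pattern_width=50, pattern_height=30)
print(box['bottom_right'])  # (60, 50)
```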

Border and Edge Detection

biovida.images.models.border_detection.edge_detection(image, axis=0, n_largest_override=None)[source]

Detects edges within an image.

Parameters:
  • image (ndarray) – an image represented as a matrix
  • axis (int) – 0 for columns; 1 for rows.
  • n_largest_override (int or None) – override the defaults for the number of inflections to report when searching the image along a given axis.
Returns:

the location of large inflections (changes) in the image along a given axis.

Return type:

list
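The notion of "large inflections along a given axis" can be sketched with a toy 1D signal (a simplified, pure-Python illustration of the idea, not the library's implementation):

```python
def largest_inflections(signal, n_largest=3):
    """Return the indices whose values deviate most, in absolute terms,
    from the median of the signal -- candidate edge (border) locations."""
    ordered = sorted(signal)
    mid = len(ordered) // 2
    median = (ordered[mid] if len(ordered) % 2
              else (ordered[mid - 1] + ordered[mid]) / 2.0)
    ranked = sorted(((abs(v - median), i) for i, v in enumerate(signal)),
                    reverse=True)
    return sorted(i for _, i in ranked[:n_largest])

# Column means of a toy image with bright borders at both ends:
signal = [250, 40, 38, 41, 39, 42, 40, 251]
edges = largest_inflections(signal, n_largest=2)
print(edges)  # [0, 7]
```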

biovida.images.models.border_detection.lower_bar_analysis(image_array, lower_bar_search_space, signal_strength_threshold, lower_bar_second_pass)[source]

Executes two passes looking for a lower bar. This can be helpful if the lower border is striated in such a way that could confuse the algorithm; namely, the space between two lines of text can confuse it. The first pass will truncate the first line; the second pass will truncate the line above it.

Parameters:
  • image_array (2D ndarray) – an image represented as a matrix.
  • lower_bar_search_space (float) – a value between 0 and 1 specifying the proportion of the image to search for a lower bar (e.g., 0.90).
  • signal_strength_threshold (int or float) – a value between 0 and 1 specifying the signal strength required for an area to be considered a ‘lower bar’. This is measured as the absolute value of the difference between a location and the median signal strength of the average image.
  • lower_bar_second_pass (bool) – if True perform a second pass on the lower bar.
Returns:

the location of the start of the lower bar (i.e., edge).

Return type:

int

Example:

Take the following lower border in an image:

     --- Image Here ---
_____________________________
The quick brown fox jumped
over jumps over the lazy dog
-----------------------------

One pass of this procedure should produce:

     --- Image Here ---
_____________________________
The quick brown fox jumped
-----------------------------

Following this up with a second pass should yield:

     --- Image Here ---
_____________________________
biovida.images.models.border_detection.border_detection(image, signal_strength_threshold=0.25, min_border_separation=0.15, lower_bar_search_space=0.9, report_signal_strength=False, rescale_input_ndarray=True, lower_bar_second_pass=True)[source]

Detects the borders and lower bar in an image.

At a high level, this algorithm works as follows:
  1. Along a given axis (rows or columns), vectors with standard deviation approximately equal to 0 are replaced with zero vectors. (a).
  2. Values are averaged along this same axis. This produces a signal (which can be visualized as a line graph). (b).
  3. The median value for this signal is computed. (c).
  4. The n points which are the furthest, in absolute value, from the median are selected.
  5. The signal strength of the n points is quantified using percent error, where the median value is used as the expected value.
  6. Candidates for border pairs (e.g., left and right borders) are then weighed based on three lines of evidence. Namely, their signal strength, how separated they are and their absolute distance from the image’s midline (about the corresponding axis). If a candidate fails to meet any of these criteria, it is rejected.
  7. The evidence for a lower bar concerns only its signal strength, though only an area of the image below a given height is considered when trying to locate it. A double pass, the default, will try a second time to find another lower bar (for reasons explained in the docstring for the lower_bar_analysis() function). Regardless of whether or not the second pass could find a second edge, all of the edges detected are averaged and returned as an integer. If no plausible borders could be found, None is returned.
  (a) This reduces the muffling effect that areas with solid color can have on step 2.
  (b) Large inflections after areas with little change suggest a transition from a solid background to an image.
  (c) The median is used here, as opposed to the average, because it is more robust against outliers.
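
Steps 1 through 5 can be sketched, for the column axis, with a small numpy model (illustrative only, not BioVida's implementation; `border_signal` is a hypothetical name):

```python
import numpy as np

def border_signal(image, n=2):
    """Sketch of steps 1-5 of the border-detection algorithm."""
    img = image.astype(float).copy()
    # Step 1: columns with ~0 standard deviation become zero vectors.
    img[:, np.isclose(img.std(axis=0), 0)] = 0.0
    # Step 2: average along the axis, producing a 1D signal.
    signal = img.mean(axis=0)
    # Step 3: the median acts as the 'expected' value (robust to outliers).
    expected = np.median(signal)
    # Step 4: select the n points furthest, in absolute value, from the median.
    deviation = np.abs(signal - expected)
    candidates = np.argsort(deviation)[::-1][:n]
    # Step 5: quantify signal strength via percent error against the median.
    strength = deviation[candidates] / max(abs(expected), 1e-12)
    return candidates.tolist(), strength.tolist()

# Textured interior (non-constant columns) with dark lines at columns 1 and 8.
rows = (np.arange(10) % 3) * 0.1 + 0.5
img = np.tile(rows[:, None], (1, 10))
img[:, [1, 8]] = 0.0
cols, strengths = border_signal(img, n=2)
print(sorted(cols))  # -> [1, 8]
```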
Parameters:
  • image (str or 2D ndarray) –

    a path to an image or an image represented as a 2D ndarray.

    Warning

    If a ndarray is passed, it should be the output of the biovida.images.image_tools.load_image_rescale() function. Without this preprocessing, this function’s stability is not assured.

  • signal_strength_threshold (float) – a value between 0 and 1 specifying the signal strength required for an area to be considered a ‘lower bar’. Internally, this is measured as a location’s deviation from the median signal strength of the average image.
  • min_border_separation (float) – a value between 0 and 1 that determines the proportion of the axis that two edges must be separated for them to be considered borders. (i.e., axis_size * min_border_separation)
  • lower_bar_search_space (float) – a value between 0 and 1 specifying the proportion of the image to search for a lower bar (e.g., 0.90). Set to None to disable.
  • report_signal_strength (bool) – if True include the strength of the signal suggesting the existence of an edge. Defaults to False.
  • rescale_input_ndarray (bool) – if True, rescale a 2D ndarray passed to image.
  • lower_bar_second_pass (bool) – if True perform a second pass on the lower bar. Defaults to True.
Returns:

a dictionary of the form:

{'vborder': (left, right) or None, 'hborder': (upper, lower) or None, 'hbar': int or None}

  • ’vborder’ gives the locations of vertical borders.
  • ’hborder’ gives the locations of horizontal borders.
  • ’hbar’ gives the location for the top of the horizontal bar at the bottom of the image.

Return type:

dict
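
The return dictionary can be used to crop an image directly. A minimal sketch, assuming a result dict of the documented form (the values below are hypothetical, and `crop_to_borders` is not a BioVida function):

```python
import numpy as np

def crop_to_borders(image, result):
    """Crop an image using a dict of the documented form:
    {'vborder': (left, right), 'hborder': (upper, lower), 'hbar': int}."""
    top, bottom = result['hborder'] if result['hborder'] else (0, image.shape[0])
    left, right = result['vborder'] if result['vborder'] else (0, image.shape[1])
    # If a lower bar was found above the bottom border, stop the crop there.
    if result['hbar'] is not None:
        bottom = min(bottom, result['hbar'])
    return image[top:bottom, left:right]

img = np.ones((100, 120))
result = {'vborder': (10, 110), 'hborder': (5, 95), 'hbar': 90}  # hypothetical output
print(crop_to_borders(img, result).shape)  # -> (85, 100)
```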

Genomics

DisGeNET Interface

class biovida.genomics.disgenet_interface.DisgenetInterface(cache_path=None, verbose=True)[source]

Python Interface for Harvesting Databases from DisGeNET.

Parameters:
  • cache_path (str or None) – location of the BioVida cache. If one does not exist in this location, one will be created. Defaults to None (which will generate a cache in the home folder).
  • verbose (bool) – If True, print notice when downloading database. Defaults to True.
options(database=None, pretty_print=True)[source]

DisGeNET databases which can be downloaded, as well as additional information about the databases.

Parameters:
  • database (str) – A database to review. Must be one of: ‘all’, ‘curated’, ‘snp_disgenet’ or None. If a specific database is given, the database’s full name and description will be provided. If None, a list of databases which can be downloaded will be returned (or printed). Defaults to None.
  • pretty_print (bool) – pretty print the information. Defaults to True.
Returns:

a list if database is None, else a dict with the database’s full name and description.

Return type:

list or dict

pull(database, download_override=False)[source]

Pull (i.e., download) a DisGeNET Database.

Note: if a database is already cached, it will be used instead of downloading (the download_override argument can be used to override this behaviour).

Parameters:
  • database (str) – A database to download. Must be one of: ‘all’, ‘curated’, ‘snp_disgenet’ or None. See options() for more information.
  • download_override (bool) – If True, override any existing database currently cached and download a new one. Defaults to False.
Returns:

a DisGeNET database

Return type:

Pandas DataFrame

Diagnostics

Disease Ontology Interface

class biovida.diagnostics.disease_ont_interface.DiseaseOntInterface(cache_path=None, verbose=True)[source]

Python Interface for Harvesting the Disease Ontology Database.

Parameters:
  • cache_path (str or None) – location of the BioVida cache. If one does not exist in this location, one will be created. Defaults to None (which will generate a cache in the home folder).
  • verbose (bool) – If True, print notice when downloading database. Defaults to True.
pull(download_override=False, disease_ontology_db_url=’http://purl.obolibrary.org/obo/doid.obo’)[source]

Pull (i.e., download) the Disease Ontology Database.

Notes:

  • if a database is already cached, it will be used instead of downloading (use download_override to override).
  • multiple values are separated by semicolons followed by a space, i.e., “; “.
Parameters:
  • download_override (bool) – If True, override any existing database currently cached and download a new one. Defaults to False.
  • disease_ontology_db_url (str) – URL to the disease ontology database in ‘.obo’ format. Defaults to ‘http://purl.obolibrary.org/obo/doid.obo’.
Returns:

the Disease Ontology database as a DataFrame.

Return type:

Pandas DataFrame

Disease-Symptoms Interface

class biovida.diagnostics.disease_symptoms_interface.DiseaseSymptomsInterface(cache_path=None, verbose=True)[source]

Tools to obtain databases relating symptoms to diseases.


Parameters:
  • cache_path (str or None) – location of the BioVida cache. If one does not exist in this location, one will be created. Defaults to None (which will generate a cache in the home folder).
  • verbose (bool) – If True, print notice when downloading database. Defaults to True.
hsdn_pull(download_override=False)[source]

Pull (i.e., download) the Human Symptoms Disease Network Database.

Parameters:download_override (bool) – If True, override any existing database currently cached and download a new one.
Returns:the Human Symptoms Disease Network database as a DataFrame.
Return type:Pandas DataFrame
rephetio_ml_pull(download_override=False)[source]

Pull (i.e., download) the Rephetio Medline Database.

Parameters:download_override (bool) – If True, override any existing database currently cached and download a new one.
Returns:the Rephetio Medline database as a DataFrame.
Return type:Pandas DataFrame
pull(download_override=False)[source]

Construct a dataframe by combining the Rephetio Medline and Human Symptoms Disease Network databases.

Parameters:download_override (bool) – If True, override any existing database currently cached and download a new one.
Returns:a combined dataframe with the following columns: ‘common_disease_name’ and ‘common_symptom_term’.
Return type:Pandas DataFrame

Domain Integration

Unifying Across Domains

biovida.unification.unify_domains.unify_against_images(instances, db_to_extract=’records_db’, verbose=True, fuzzy_threshold=False)[source]

Tool to unify image instances (namely OpeniInterface and/or CancerImageInterface) with Diagnostic and Genomic Data.

Parameters:
  • instances (list, tuple, OpeniInterface, CancerImageInterface or OpeniImageProcessing) – any one of OpeniInterface, CancerImageInterface or OpeniImageProcessing, or some combination inside an iterable.
  • db_to_extract (str) –
    the database to use. Must be one of: ‘records_db’, ‘cache_records_db’ or ‘image_dataframe’.
    Defaults to ‘records_db’.

    Note

    If an instance of OpeniImageProcessing is passed to instances, the image_dataframe attribute will be extracted regardless of the value passed to this argument.

  • verbose (bool) – If True, print notice when downloading database. Defaults to True.
  • fuzzy_threshold (int, bool, None) –

    an integer on (0, 100]. If True a threshold of 95 will be used. Defaults to False.

    Warning

    While this parameter will likely increase the number of matches, fuzzy searching with large databases, such as those this function integrates, is very computationally expensive.

Returns:

a dataframe which unifies image instances with genomic and diagnostics data.

Return type:

Pandas DataFrame

This function produces a DataFrame with the following columns:

  • ‘age’
  • ‘article_type’
  • ‘disease’
  • ‘image_caption’
  • ‘image_id’
  • ‘modality_best_guess’
  • ‘pull_time’
  • ‘query’
  • ‘sex’
  • ‘source_api’
  • ‘disease_family’
  • ‘disease_synonym’
  • ‘disease_definition’
  • ‘known_associated_symptoms’
  • ‘mentioned_symptoms’
  • ‘known_associated_genes’

Note

The 'known_associated_genes' column is of the form ((Gene Name, DisGeNET Evidence Score), ...).
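
Given a cell of that form, the genes can be filtered by evidence score with plain Python (the gene names and scores below are purely illustrative):

```python
# A hypothetical 'known_associated_genes' cell: ((Gene Name, Evidence Score), ...)
cell = (("TP53", 0.72), ("BRCA1", 0.41), ("EGFR", 0.09))

# Keep genes whose DisGeNET evidence score clears a threshold.
strong = [gene for gene, score in cell if score >= 0.4]
print(strong)  # -> ['TP53', 'BRCA1']
```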

Note

If an instance of the CancerImageInterface class is passed, a 'source_images_path' column will also be generated if the database extracted from the instance contains a 'cached_dicom_images_path' column.

Warning

The 'known_associated_symptoms' and 'known_associated_genes' columns denote symptoms and genes known to be associated with the disease the patient presented with. These columns are not an account of the symptomatology or genotype of the patients themselves. Conversely, the 'mentioned_symptoms' column is an account of a given patient’s symptoms if the data is from a clinical case (i.e., article_type equals ‘case_report’).

Example:
>>> from biovida.images import OpeniInterface
>>> from biovida.images import CancerImageInterface
>>> from biovida.unification import unify_against_images
...
>>> opi = OpeniInterface()
# --- Search and Pull ---
>>> udf1 = unify_against_images(opi)
...
# Adding another Interface from the images subpackage
>>> cii = CancerImageInterface(YOUR_API_KEY_HERE)
# --- Search and Pull ---
>>> udf2 = unify_against_images([opi, cii])

Cache Manipulation

Image Cache Management

biovida.images.image_cache_mgmt.image_delete(instance, delete_rule, only_recent=False, verbose=True)[source]

Delete rows, and associated images, from records_db and cache_records_db DataFrames associated with instance.

Warning

The effects of this function can only be undone by downloading the deleted data again.

Warning

The default behavior for this function is to delete all rows (and associated image(s)) that delete_rule authorizes. To limit deletion to the most recent data pull, only_recent must be set to True.

Parameters:
  • instance (OpeniInterface, OpeniImageProcessing or CancerImageInterface) – an instance of OpeniInterface, OpeniImageProcessing or CancerImageInterface.
  • delete_rule (str or func) – must be one of: 'all' (delete all data) or a function which (1) accepts a single parameter (argument) and (2) returns True when the data is to be deleted.
  • only_recent (bool) – if True, only apply delete_rule to data obtained in the most recent pull. Defaults to False.
  • verbose (bool) – if True, print additional information.
Returns:

a dictionary of the indices which were dropped. Example: {'records_db': [58, 59], 'cache_records_db': [158, 159]}.

Return type:

dict

Example:
>>> from biovida.images import image_delete
>>> from biovida.images import OpeniInterface
...
>>> opi = OpeniInterface()
>>> opi.search(image_type=['ct', 'mri'], collection='medpix')
>>> opi.pull()
...
>>> def my_delete_rule(row):
>>>     if isinstance(row['abstract'], str) and 'Oompa Loompas' in row['abstract']:
>>>         return True
...
>>> image_delete(opi, delete_rule=my_delete_rule)

Note

In this example, any rows in the records_db and cache_records_db for which the ‘abstract’ column contains the string ‘Oompa Loompas’ will be deleted. Any images associated with this row will also be destroyed.

Warning

If a function is passed to delete_rule it must return a boolean True to delete a row. All other output will be ignored.
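
The delete_rule contract can be modeled in plain Python (a simplified sketch, not BioVida's implementation; `apply_delete_rule` is a hypothetical name): a row is dropped only when the rule returns the boolean True, and any other output is ignored.

```python
def apply_delete_rule(rows, delete_rule):
    """Model of the delete_rule contract: drop a row only when the
    rule returns the boolean True; ignore any other output."""
    dropped = [i for i, row in enumerate(rows) if delete_rule(row) is True]
    kept = [row for i, row in enumerate(rows) if i not in dropped]
    return kept, dropped

rows = [{'abstract': 'Oompa Loompas appear here'},
        {'abstract': 'a routine case report'},
        {'abstract': None}]

def my_delete_rule(row):
    # Implicitly returns None (not True) for non-matching rows.
    if isinstance(row['abstract'], str) and 'Oompa Loompas' in row['abstract']:
        return True

kept, dropped = apply_delete_rule(rows, my_delete_rule)
print(dropped)  # -> [0]
```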

biovida.images.image_cache_mgmt.image_divvy(instance, divvy_rule, action=’ndarray’, db_to_extract=’records_db’, train_val_test_dict=None, create_dirs=True, allow_overwrite=True, image_column=None, stack=True, verbose=True)[source]

Grouping Cached Images.

Warning

Currently, if an OpeniImageProcessing instance is first passed to biovida.unification.unify_against_images and then to this function, image cropping will not be applied.

Parameters:
  • instance (OpeniInterface, OpeniImageProcessing, CancerImageInterface or Pandas DataFrame) – the yield of biovida.unification.unify_against_images() or an instance of OpeniInterface, OpeniImageProcessing or CancerImageInterface.
  • divvy_rule (str or func) – must be a function which (1) accepts a single parameter (argument) and (2) returns system path(s) [see examples below].
  • action (str) –

    one of: 'copy', 'ndarray'.

    • if 'copy': copy files from the cache to (i) the location prescribed by divvy_rule, when train_val_test_dict=None, else (ii) the ‘target_location’ key in train_val_test_dict.
    • if 'ndarray': return a nested dictionary of ndarray (‘numpy’) arrays (default).
  • db_to_extract (str) –

    the database to use. Must be one of:

    • ’records_db’: the dataframe resulting from the most recent search() & pull() (default).
    • ’cache_records_db’: the cache dataframe for instance.
    • ’unify_against_images’: the yield of biovida.unification.unify_against_images().

    Note

    If an instance of OpeniImageProcessing is passed, the dataframe will be extracted automatically.

  • train_val_test_dict (None or dict) –

    a dictionary denoting the proportions for any of: 'train', 'validation' and/or 'test'.

    Note

    • If action='copy', a 'target_dir' key (target directory) must also be included.
    • A 'random_state' key can be passed, with an integer as the value, to seed shuffling.
    • To delete the source files, a 'delete_source' key may be included (optional). The corresponding value provided must be a boolean. If no such key is provided, 'delete_source' defaults to False.
  • create_dirs (bool) – if True, create directories returned by divvy_rule if they do not exist. Defaults to True.
  • allow_overwrite (bool) – if True allow existing images to be overwritten. Defaults to True.
  • image_column (str) – the column to use when copying images. If None, use 'cached_images_path'. Defaults to None.
  • stack (bool) – if True, stack 3D volumes and time-series images when action='ndarray'. Defaults to True.
  • verbose (bool) – if True print additional details. Defaults to True.
Returns:

  • If divvy_rule is a string:
    • If action='copy' and train_val_test_dict is not a dictionary, this function will return a dictionary of the form {divvy_rule: [cache_file_path, cache_file_path, ...], ...}.
  • If divvy_rule is a function:
    • If action='copy' and train_val_test_dict is not a dictionary, this function will return a dictionary of the form {string returned by divvy_rule(): [cache_file_path, cache_file_path, ...], ...}.
    • If action='ndarray' and train_val_test_dict is not a dictionary, this function will return a dictionary of the form {string returned by divvy_rule(): array([Image Matrix, Image Matrix, ...]), ...}.
    • If train_val_test_dict is a dictionary, the output is produced by utilities.train_val_test (documented below under Support Tools).

Return type:

dict
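
The grouping contract for divvy_rule can be sketched in plain Python (a simplified model, not the library's implementation; `divvy` is a hypothetical name): a rule may return a single path or several, and each row's cached file is recorded under every key the rule returns.

```python
def divvy(rows, divvy_rule):
    """Model of the grouping contract: record each row's cached file
    under every key (path/category) the rule returns for it."""
    groups = {}
    for row in rows:
        result = divvy_rule(row)
        if not result:
            continue
        keys = [result] if isinstance(result, str) else list(result)
        for key in keys:
            groups.setdefault(key, []).append(row['cached_images_path'])
    return groups

rows = [{'abstract': 'leg and pelvis scan', 'cached_images_path': 'a.png'},
        {'abstract': 'leg only', 'cached_images_path': 'b.png'}]

def rule(row):
    # A rule returning zero, one or multiple categories per row.
    return [p for p in ('leg', 'pelvis') if p in row['abstract']]

print(divvy(rows, rule))  # -> {'leg': ['a.png', 'b.png'], 'pelvis': ['a.png']}
```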

Example:
>>> from biovida.images import image_divvy
>>> from biovida.images import OpeniInterface

Obtain Images
>>> opi = OpeniInterface()
>>> opi.search(image_type=['mri', 'ct'])
>>> opi.pull()

Usage 1a: Copy Images from the Cache to a New Location
>>> summary_dict = image_divvy(opi, divvy_rule="/your/output/path/here/output", action='copy')

Usage 1b: Converting to ndarrays
>>> def my_divvy_rule1(row):
>>>     if isinstance(row['image_modality_major'], str):
>>>         if 'mri' == row['image_modality_major']:
>>>             return 'mri'
>>>         elif 'ct' == row['image_modality_major']:
>>>             return 'ct'
...
>>> nd_data = image_divvy(opi, divvy_rule=my_divvy_rule1, action='ndarray')

The resultant ndarrays can be extracted as follows:

>>> ct_images = nd_data['ct']
>>> mri_images = nd_data['mri']

Usage 2a: A Rule which Invariably Returns a Single Save Location for a Single Row
>>> def my_divvy_rule2(row):
>>>     if isinstance(row['image_modality_major'], str):
>>>         if 'mri' == row['image_modality_major']:
>>>             return '/your/path/here/MRI_images'
>>>         elif 'ct' == row['image_modality_major']:
>>>             return '/your/path/here/CT_images'
...
>>> summary_dict = image_divvy(opi, divvy_rule=my_divvy_rule2, action='copy')

Usage 2b: A Rule which can Return Multiple Save Locations for a Single Row
>>> def my_divvy_rule3(row):
>>>     locations = list()
>>>     if isinstance(row['abstract'], str):
>>>         if 'leg' in row['abstract']:
>>>             locations.append('/your/path/here/leg_images')
>>>         if 'pelvis' in row['abstract']:
>>>             locations.append('/your/path/here/pelvis_images')
>>>     return locations
...
>>> summary_dict = image_divvy(opi, divvy_rule=my_divvy_rule3, action='copy')

Usage 3: Divvying into train/validation/test

i. Copying to a New Location (reusing my_divvy_rule1)

>>> train_val_test_dict = {'train': 0.7, 'test': 0.3, 'target_dir': '/your/path/here/output'}
>>> summary_dict = image_divvy(opi, divvy_rule=my_divvy_rule1, action='copy', train_val_test_dict=train_val_test_dict)

ii. Obtaining ndarrays (numpy arrays)

>>> train_val_test_dict = {'train': 0.7, 'validation': 0.2, 'test': 0.1}
>>> image_dict = image_divvy(opi, divvy_rule=my_divvy_rule1, action='ndarray', train_val_test_dict=train_val_test_dict)

The resultant ndarrays can be unpacked as follows:

>>> train_ct, train_mri = image_dict['train']['ct'], image_dict['train']['mri']
>>> val_ct, val_mri = image_dict['validation']['ct'], image_dict['validation']['mri']
>>> test_ct, test_mri = image_dict['test']['ct'], image_dict['test']['mri']

This function behaves the same way if passed an instance of OpeniImageProcessing:

>>> from biovida.images import OpeniImageProcessing
>>> ip = OpeniImageProcessing(opi)
>>> ip.auto()
>>> ip.clean_image_dataframe()
>>> image_dict = image_divvy(ip, divvy_rule=my_divvy_rule1, action='ndarray', train_val_test_dict=train_val_test_dict)
...

Note

If an instance of OpeniImageProcessing is passed to image_divvy, the image_data_frame_cleaned dataframe will be extracted.

Note

If a function passed to divvy_rule returns a system path when a dictionary has been passed to train_val_test_dict, only the basename of the system path will be used.

Warning

While it is possible to pass a function to divvy_rule which returns multiple categories (similar to my_divvy_rule2()) when divvying into train/validation/test, doing so is not recommended. Overlap between these groups is likely to lead to erroneous performance metrics (e.g., accuracy) when assessing fitted models.

Support Tools

Utilities

biovida.support_tools.utilities.train_val_test(data, train, validation, test, target_dir=None, action=’ndarray’, delete_source=False, stack=True, random_state=None, verbose=True)[source]

Split data into any combination of the following: train, validation and/or test.

Parameters:
  • data (dict or str) –
    • a dictionary of the form: {group_name: [file_path, file_path, ...], ...}.
    • the directory containing the data. This directory should contain subdirectories (the categories) populated with the files.

    Warning

    If a directory is passed, subdirectories therein entitled ‘train’, ‘validation’ and ‘test’ will be ignored.

  • train (int, float, bool or None) – the proportion of images in data to allocate to train. If False or None, no images will be allocated.
  • validation (int, float, bool or None) – the proportion of images in data to allocate to validation. If False or None, no images will be allocated.
  • test (int, float, bool or None) – the proportion of images in data to allocate to test. If False or None, no images will be allocated.
  • target_dir (str or None) –
    the location to output the images to (if action is ‘copy’ or ‘move’).
    If None, the output location will be data. Defaults to None.

    Warning

    target_dir must be a string if data is not.

  • action (str) –

    one of: ‘copy’, ‘move’ or ‘ndarray’.

    • if 'copy': copy files from data to target_dir.
    • if 'move': move files from data to target_dir.
    • if 'ndarray': return a nested dictionary of ndarray (‘numpy’) arrays (default).

    Warning

    Using ‘move’ directly on files in a cache is not recommended. However, if such an action is performed, the corresponding class must be re-instantiated to allow the cache_records_db database to update.

  • delete_source (bool) –

    if True delete the source subdirectories in data after copying is complete. Defaults to False.

    Note

    This can be useful for transforming a directory ‘in-place’, e.g., if data and target_dir are the same and delete_source=True.

  • stack (bool) – if True, stack 3D volumes and time-series images when action='ndarray'. Defaults to True.
  • random_state (None or int) – set a seed for random shuffling. Similar to sklearn.model_selection.train_test_split. Defaults to None.
  • verbose (bool) – if True, print the resultant structure. Defaults to True.
Returns:

a dictionary of the form: {one of 'train', 'validation' or 'test': {subdirectory in `data`: [file_path, file_path, ...], ...}, ...}.

  • if action='copy': the dictionary returned will be exactly as shown above.
  • if action='ndarray': ‘file_path’ will be replaced with the image as a ndarray and the lists will be ndarrays, i.e., array([Image Matrix, Image Matrix, ...]).

Return type:

dict

Raises:

ValueError – if the numeric values (i.e., int or float) passed to train, validation and test do not sum to 1.

Note

Files are randomly shuffled prior to assignment.

Warning

In the case of division with a remainder, preference is as follows: train < validation < test. For instance, if we have train=0.7, validation=0.3, and the number of files in a given subdirectory equal to 6 (as in the final example below) the number of files allocated to train will be rounded down, in favor of validation obtaining the final file. In this instance, train would obtain 4 files (floor(0.7 * 6) = 4) and validation would obtain 2 files (ceil(0.3 * 6) = 2).
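
The arithmetic in the warning above can be checked directly (a minimal sketch of the stated rounding preference, not the library's code; `allocate` is a hypothetical name):

```python
import math

def allocate(n_files, train, validation):
    """Sketch of the documented rounding preference (train < validation):
    train is floored and validation receives the remaining files."""
    n_train = math.floor(train * n_files)
    n_validation = n_files - n_train  # equals ceil(validation * n_files) here
    return n_train, n_validation

print(allocate(6, 0.7, 0.3))  # -> (4, 2)
```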

Example:

The examples below use a sample directory entitled images.

This is its structure:

$ tree /path/to/data/images
├── ct
│   ├── ct_1.png
│   ├── ct_2.png
│   ├── ct_3.png
│   ├── ct_4.png
│   ├── ct_5.png
│   └── ct_6.png
└── mri
    ├── mri_1.png
    ├── mri_2.png
    ├── mri_3.png
    ├── mri_4.png
    ├── mri_5.png
    └── mri_6.png

Usage 1: Obtaining ndarrays
>>> from biovida.support_tools import train_val_test
>>> tt = train_val_test(data='/path/to/data/images', train=0.7, validation=None, test=0.3,
...                     action='ndarray')

The resultant ndarrays can be unpacked into objects as follows:

>>> train_ct, train_mri = tt['train']['ct'], tt['train']['mri']
>>> test_ct, test_mri = tt['test']['ct'], tt['test']['mri']

Usage 2: Reorganize a Directory In-place
>>> from biovida.support_tools import train_val_test
>>> tv = train_val_test(data='/path/to/data/images', train=0.7, validation=0.3, test=None,
...                     action='copy', delete_source=True)

Which results in the following structure:

$ tree /path/to/data/images
├── train
│   ├── ct
│   │   ├── ct_4.png
│   │   ├── ct_5.png
│   │   ├── ct_3.png
│   │   └── ct_1.png
│   └── mri
│       ├── mri_2.png
│       ├── mri_1.png
│       ├── mri_5.png
│       └── mri_4.png
└── validation
    ├── ct
    │   ├── ct_2.png
    │   └── ct_6.png
    └── mri
        ├── mri_3.png
        └── mri_6.png

Note

The following changes to Usage 2 would preserve the original ct and mri directories:

  • setting delete_source=False (the default) and/or
  • providing a path to target_dir, e.g., target_dir='/path/to/output/output_data'
biovida.support_tools.utilities.reverse_train_val_test(data, delete_source=True, verbose=True)[source]

Reverse the action of train_val_test on a directory.

Parameters:
  • data (str) – the path to the directory created by train_val_test.
  • delete_source (bool) – if True delete the empty train/validation/test directories in data after the move is complete. Defaults to True.
  • verbose (bool) – if True, print the resultant structure. Defaults to True.
Example:

Initial Directory Structure:

$ tree /path/to/data/images
├── ct
│   ├── ct_1.png
│   ├── ct_2.png
│   ├── ct_3.png
│   ├── ct_4.png
│   ├── ct_5.png
│   └── ct_6.png
└── mri
    ├── mri_1.png
    ├── mri_2.png
    ├── mri_3.png
    ├── mri_4.png
    ├── mri_5.png
    └── mri_6.png
>>> from biovida.support_tools import train_val_test
>>> tv = train_val_test(data='/path/to/data/images', train=0.7, validation=0.3, test=None,
...                     action='copy', delete_source=True)
$ tree /path/to/data/images
├── train
│   ├── ct
│   │   ├── ct_4.png
│   │   ├── ct_5.png
│   │   ├── ct_3.png
│   │   └── ct_1.png
│   └── mri
│       ├── mri_2.png
│       ├── mri_1.png
│       ├── mri_5.png
│       └── mri_4.png
└── validation
    ├── ct
    │   ├── ct_2.png
    │   └── ct_6.png
    └── mri
        ├── mri_3.png
        └── mri_6.png
>>> reverse_train_val_test('/path/to/data/images')
$ tree /path/to/data/images
├── ct
│   ├── ct_1.png
│   ├── ct_2.png
│   ├── ct_3.png
│   ├── ct_4.png
│   ├── ct_5.png
│   └── ct_6.png
└── mri
    ├── mri_1.png
    ├── mri_2.png
    ├── mri_3.png
    ├── mri_4.png
    ├── mri_5.png
    └── mri_6.png

Printing Tools

biovida.support_tools.printing.dict_pprint(d, sort_keys=True, space_entries=True, max_value_length=70)[source]

Pretty prints a dictionary with vertically aligned values. Dictionary values with strings longer than max_value_length are automatically broken and aligned with the line(s) above.

Parameters:
  • d (dict) – a dictionary
  • sort_keys (bool) – if True, sort the keys in alphabetical order.
  • space_entries (bool) – if True, insert a line break between entries.
  • max_value_length (int) – max. number of characters in a string before a line break. This is a fuzzy threshold because the algorithm will only insert line breaks where there are already spaces.
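
The "fuzzy threshold" behavior (breaking only at existing spaces) is what Python's textwrap module provides. A minimal sketch of the break-and-align idea (illustrative only, not the library's code; `pprint_entry` is a hypothetical name):

```python
import textwrap

def pprint_entry(key, value, max_value_length=30):
    """Break `value` at spaces and align continuation lines
    under the start of the first value line."""
    label = f" - {key}:  "
    wrapped = textwrap.wrap(value, width=max_value_length)  # breaks at spaces only
    pad = " " * len(label)
    return "\n".join([label + wrapped[0]] + [pad + line for line in wrapped[1:]])

print(pprint_entry("part_1", "Duis sit amet nulla fermentum vestibulum mauris"))
```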
Example:
>>> d = {
'part_2': 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut dictum nisi sed euismod consequat. '
          'Donec tempus enim nec lorem ornare, non sagittis dolor consequat. Sed facilisis tortor '
          'vel enim mattis, et fermentum mi posuere.',
'part_1': 'Duis sit amet nulla fermentum, vestibulum mauris at, porttitor massa. '
          'Vestibulum luctus interdum mattis.'
}
>>> dict_pprint(d)
 - Part 2:  Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut dictum nisi
            sed euismod consequat. Donec tempus enim nec lorem ornare, non sagittis
            dolor consequat. Sed facilisis tortor vel enim mattis, et fermentum
            mi posuere.
 - Part 1:  Duis sit amet nulla fermentum, vestibulum mauris at, porttitor massa.
            Vestibulum luctus interdum mattis.
biovida.support_tools.printing.pandas_pprint(data, col_align=’right’, header_align=’center’, full_rows=False, full_cols=False, suppress_index=False, column_width_limit=None)[source]

Pretty Print a Pandas DataFrame or Series.

Parameters:
  • data (Pandas DataFrame or Pandas Series) – a dataframe or series.
  • col_align (str or dict) – ‘left’, ‘right’, ‘center’ or a dictionary of the form: {'column_name': 'alignment'}, e.g., {'my_column': 'left', 'my_column2': 'right'}.
  • header_align (str or dict) – alignment of headers. Must be one of: ‘left’, ‘right’, ‘center’.
  • full_rows (bool) – print all rows.
  • full_cols (bool) – print all columns.
  • suppress_index (bool) – if True, suppress the index. Defaults to False.
  • column_width_limit (int or None) – change limit on how wide columns can be. If None, no change will be made. Defaults to None.
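
The col_align options map directly onto Python's string alignment methods. A minimal sketch of how one cell would be padded under each setting (illustrative only; `align` is a hypothetical name):

```python
def align(text, width, how):
    """Pad a cell the way col_align describes: 'left', 'right' or 'center'."""
    return {'left': text.ljust, 'right': text.rjust, 'center': text.center}[how](width)

print(repr(align('ct', 6, 'right')))   # -> '    ct'
print(repr(align('ct', 6, 'center')))  # -> '  ct  '
```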