chromosight.utils package

Submodules

chromosight.utils.contacts_map module

Chromosight’s contact_map submodule contains classes to keep track of the different aspects of Hi-C data and automate standard pre-processing steps. The ContactMap class holds the whole genome informations and metadata (chromosome sizes, resolution, etc) without loading the actual contact map. Upon calling its “make_sub_matrices” method, it will generate a collection of ContactMap instances accessible via the sub_mats attribute. Each instance corresponds to an inter- or intra-chromosomal matrix and the Hi-C matrix of each chromosome is loaded and preprocessed upon instantiation.

class chromosight.utils.contacts_map.ContactMap(clr, extent, name='', detectable_bins=None, inter=False, max_dist=None, largest_kernel=0, dump=None, smooth=False, sample=None, use_norm=True)[source]

Bases: object

Class to store and manipulate a simple Hi-C matrix, either intra or inter-chromosomal.

clr : cooler.Cooler
Reference to a cooler object containing Hi-C data.
extent : list of tuples of ints
List of two tuples containing the start and end bin numbers of both chromosomes from the submatrix.
matrix : scipy.sparse.csr_matrix
The contact map as a sparse matrix.
detectable_bins : tuple of arrays
List containing two arrays (rows and columns) of indices from bins considered detectable in the matrix.
inter : bool
True if the matrix represents contacts between two different chromosomes, False otherwise.
max_dist : int
Maximum distance (in bins) at which contact values should be analysed. Only valid for intrachromosomal matrices.
dump : str
Base path where dump files will be generated. None means no dump.
name : str
Name of the submatrix (used for dumping).
smooth : bool
Whether isotonic regression should be used to smooth the signal for detrending. This will reduce noise at long ranges but assumes contacts can only decrease with distance from the diagonal. Do not use with circular chromosomes.
sample : int, float or None
Proportion of contacts to sample from the data if between 0 and 1. Number of contacts to keep if above 1. Keep all if None.
use_norm : bool
Whether to use the balanced matrix. If set to False, the raw contact counts are used.
create_mat()[source]
destroy_mat()[source]

Destroys contact map to clean up memory

detrend(**kwargs)[source]

Executed at run time of the wrapped method. Executes the input function with its arguments, then dumps the matrix to target path. Note args[0] will always denote the instance of the wrapped method.

keep_distance
preprocess_inter_matrix(**kwargs)[source]

Executed at run time of the wrapped method. Executes the input function with its arguments, then dumps the matrix to target path. Note args[0] will always denote the instance of the wrapped method.

preprocess_intra_matrix()[source]
remove_diags(**kwargs)[source]

Executed at run time of the wrapped method. Executes the input function with its arguments, then dumps the matrix to target path. Note args[0] will always denote the instance of the wrapped method.

subsample(**kwargs)[source]

Executed at run time of the wrapped method. Executes the input function with its arguments, then dumps the matrix to target path. Note args[0] will always denote the instance of the wrapped method.

class chromosight.utils.contacts_map.DumpMatrix(dump_name)[source]

Bases: object

This class is used as a decorator to wrap ContactMap’s methods and generate dump files of the “matrix” atribute. The matrix should be a scipy.sparse matrix and will be saved in npy format. The full dump path will be:

inst.dump / os.path.basename(inst.name) + self.dump_name + “.npy”

Where inst is the ContactMap instance of the wrapped method and self is the DumpMatrix instance. If the inst has no dump attribute, no action is performed.

Parameters:dump_name (str, os.PathLike object or None) – The basename of the file where to save the dump. If None, no action is performed.
class chromosight.utils.contacts_map.HicGenome(path, inter=False, kernel_config=None, dump=None, smooth=False, sample=None)[source]

Bases: object

Class used to manage relationships between whole genome and intra- or inter- chromosomal Hi-C sub matrices. Also handles reading and writing data.

clr : cooler.Cooler
Cooler object containing Hi-C data and related informations for the whole genome
sub_mats : pandas.DataFrame
Table containing intra- and optionally inter-chromosomal matrices.
detectable_bins : array of ints
Array containing indices of detectable rows and columns.
bins : pandas.DataFrame
Table containing bin related informations.
inter : bool
Whether interchromosomal matrices should be stored.
kernel_config : dict
Kernel configuration associated with the Hi-C genome
max_dist : int
Maximum scanning distance for convolution during pattern detection.
dump : str
Base path where dump files will be generated. None means no dump.
smooth : bool
Whether isotonic regression should be used to smooth the signal for detrending intrachromosomal sub matrices. This will reduce noise at long ranges but assumes contacts can only decrease with distance from the diagonal. Do not use with circular chromosomes.
sample : int, float or None
Proportion of contacts to sample from the data if between 0 and 1. Number of contacts to keep if above 1. Keep all if None.
bins_to_coords(bin_idx)[source]

Converts a list of bin IDs to genomic coordinates based on the whole genome contact map.

Parameters:bin_idx (numpy.array of ints) – A list of bin numbers corresponding to rows or columns of the whole genome matrix.
Returns:A subset of the bins dataframe, with columns chrom, start, end where chrom is the chromosome name (str), and start and end are the genomic coordinates of the bin (int).
Return type:pandas.DataFrame
compute_max_dist()[source]

Use the kernel config to compute max_dist

coords_to_bins(coords)[source]

Converts genomic coordinates to a list of bin ids based on the whole genome contact map.

Parameters:coords (pandas.DataFrame) – Table of genomic coordinates, with columns chrom, pos.
Returns:Indices in the whole genome matrix contact map.
Return type:numpy.array of ints
gather_sub_matrices()[source]

Gathers all processed sub_matrices into a whole genome matrix

get_full_mat_pattern(chr1, chr2, patterns)[source]

Converts bin indices of a list of patterns from an submatrix into their value in the original full-genome matrix.

Parameters:
  • chr1 (str) – Name of the first chromosome of the sub matrix (rows).
  • chr2 (str) – Name of the second chromosome of the sub matrix (cols).
  • pattern (pandas.DataFrame) – A dataframme of pattern coordinates. Each row is a pattern and columns should be bin1 and bin2, for row and column coordinates in the Hi-C matrix, respectively.
Returns:

full_patterns – A dataframe similar to the input, but with bins shifted to represent coordinates in the whole genome matrix.

Return type:

pandas.DataFrame

get_sub_mat_pattern(chr1, chr2, patterns)[source]

Converts bin indices of a list of patterns from the whole genome matrix into their value in the desired intra- or inter-chromosomal sub-matrix.

Parameters:
  • chr1 (str) – Name of the first chromosome of the sub matrix (rows).
  • chr2 (str) – Name of the second chromosome of the sub matrix (cols).
  • pattern (pandas.DataFrame) – A dataframme of pattern coordinates. Each row is a pattern and columns should be bin1 and bin2, for row and column coordinates in the Hi-C matrix, respectively.
Returns:

full_patterns – A dataframe similar to the input, but with bins shifted to represent coordinates in the target sub-matrix.

Return type:

pandas.DataFrame

make_sub_matrices()[source]

Generates a table of Hi-C sub matrices. Each sub matrix is either intra or interchromosomal. The table has 3 columns: chr1, chr2 and contact_map. The contact_map column contains instances of the ContactMap class.

Returns:The table of sub matrices which will contain n_chrom rows if the inter attribute is set to false, or (n_chrom^2) / 2 + n_chroms / 2 if inter is True (that is, the upper triangle matrix).
Return type:pandas.DataFrame
normalize(norm='auto', n_mads=5, threads=1)[source]

If the instance’s cooler is not balanced, finds detectable bins and applies ICE normalisation to the whole genome matrix. Newly computed biases are stored in the input file. If it is already balanced, detectable bins and weights will be extracted from the file.

Parameters:
  • force_norm (str) – Normalization behaviour. If ‘auto’, existing weights are reused and matrix is balanced only in the absence of weights. If ‘raw’, raw contact values are used. If ‘force’, weights are recomputed and the underlying cool file is overwritten.
  • n_mads (float) – Maximum number of median absoluted deviations (MADs) below the median of the distribution of logged bin sums to consider a bin detectable.
  • threads (int) – Number of parallel threads to use for normalization.

chromosight.utils.detection module

Chromosight’s detection submodule implements the bulk of chromosight convolution engine, as well as functions to perform the different steps of the detection algorithm (pre-processing, local maxima, post-processing…). The pattern_detector function orchestrate all those different steps.

chromosight.utils.detection.filter_foci(foci_mat, min_size=2)[source]

Given an input matrix of labelled foci (continuous islands of equal nonzero values), filters out foci consisting of fewer pixels than min_size.

Parameters:
  • foci_mat (scipy.sparse.coo_matrix) – Input matrix of labelled foci. Pixels are numbered according to their respective foci. Pixels that are not assigned to a focus are 0.
  • min_size (int) – Minimum number of pixels required to keep a focus. Pixels belonging to smaller foci will be set to 0.
Returns:

  • num_filtered (int) – Number of foci remaining after filtering.
  • filtered_mat (scipy.sparse.coo_matrix) – Matrix of filtered foci.

chromosight.utils.detection.label_foci(matrix)[source]

Given a sparse matrix of 1 and 0 values, find all foci of continuously neighbouring positive pixels and assign them a label according to their focus. Horizontal and vertical (4-way) adjacency is considered.

Parameters:matrix (scipy.sparse.coo_matrix of ints) – The input matrix where to label foci. Should be filled with 1 and 0s.
Returns:
  • num_foci (int) – Number of individual foci identified.
  • foci_mat (scipy.sparse.coo_matrix:) – The matrix with values replaced by their respective foci labels.

Example

>>> M.todense()
array([[1 0 0 0]
       [1 0 1 0]
       [1 0 1 1]
       [0 0 0 0]])
>>> num_foci, foci_mat = label_foci(M)
>>> num_foci
2
>>>foci_mat.todense()
array([[1 0 0 0]
       [1 0 2 0]
       [1 0 2 2]
       [0 0 0 0]])
chromosight.utils.detection.normxcorr2(signal, kernel, max_dist=None, sym_upper=False, full=False, missing_mask=None, missing_tol=0.75, tsvd=None, pval=False)[source]

Computes the normalized cross-correlation of a dense or sparse signal and a dense kernel. The resulting matrix contains Pearson correlation coefficients.

Parameters:
  • signal (scipy.sparse.csr_matrix or numpy.array) – The input processed Hi-C matrix.
  • kernel (numpy.array) – The pattern kernel to use for convolution.
  • max_dist (int) – Maximum scan distance, in number of bins from the diagonal. If None, the whole matrix is convoluted. Otherwise, pixels further than this distance from the diagonal are set to 0 and ignored for performance. Only useful for intrachromosomal matrices.
  • sym_upper (False) – Whether the matrix is symmetric and upper triangle. True for intrachromosomal matrices.
  • missing_mask (scipy.sparse.csr_matrix of bool or None) – Matrix defining which pixels are missing (1) or not (0).
  • full (bool) – If True, convolutions will be made in ‘full’ mode; the matrix is first padded with margins to allow scanning to the edges, and missing bins are also masked to exclude them when computing correlation scores. Computationally intensive
  • missing_mask – Mask matrix denoting missing bins, where missing is denoted as True and valid as False. Can be None to ignore missing bin information. Only taken into account when full=True.
  • missing_tol (float) – Proportion of missing values allowed in windows to keep the correlation coefficients.
  • tsvd (float or None) – If a float between 0 and 1 is given, the input kernel is factorised using truncated SVD, keeping enough singular vectors to retain this proportion of information. Factorisation speeds up convolution at the cost of a loss of information. If the number of singular vectors required to retain the desired information isDisabled by default.
  • pval (bool) – Whether to return a matrix of p-values.
Returns:

  • scipy.sparse.csr_matrix or numpy.array – The sparse matrix of correlation coefficients. Same type as the input signal.
  • scipy.sparse.csr_matrix or numpy.array or None – A map of Benjamini-Hochberg corrected p-values (q-values). Same type as the input signal. If pval=False, this will be None.

chromosight.utils.detection.pattern_detector(contact_map, kernel_config, kernel_matrix, coords=None, dump=None, full=False, tsvd=None)[source]

Detect patterns in a contact map by kernel matching, and extract windows around the detected patterns. If coordinates are provided, detection is skipped and windows are extracted around those coordinates.

Parameters:
  • contact_map (ContactMap object) – An object containing an inter- or intra-chromosomal Hi-C contact map and additional metadata.
  • kernel_config (dict) – The kernel configuration, as documented in chromosight.utils.io.load_kernel_config
  • kernel_matrix (numpy.array) – The kernel matrix to use for convolution as a 2D numpy array
  • coords (numpy.array of ints or None) – A table with coordinates of patterns, with one pattern per row and 2 columns being the row and column number of the pattern in the input contact map. If this is provided, detection is skipped and quantification is performed on those coordinates.
  • dump (str or None) – Folder in which dumps should be generated after each step of the detection process. If None, no dump is generated
  • tsvd (float or None) – If a float between 0 and 1 is given, the input kernel is factorised using truncated SVD, keeping enough singular vectors to retain this proportion of information. Factorisation speeds up convolution at the cost of a loss of information. If the number of singular vectors required to retain the desired information is disabled by default.
Returns:

  • filtered_chrom_patterns (pandas.DataFrame) – A table of detected patterns with 4 columns: bin1, bin2, score, qvalue.
  • chrom_pattern_windows (numpy array) – A 3D array containing the pile of windows around detected patterns.

chromosight.utils.detection.pick_foci(mat_conv, pearson, min_size=2)[source]

Pick coordinates of local maxima in a sparse 2D convolution heatmap. A threshold computed based on the pearson argument is applied to the heatmap. All values below that threshold are set to 0. The coordinate of the maximum value in each focus (contiguous region of high values) is returned.

Parameters:
  • mat_conv (scipy.sparse.coo_matrix of floats) – A 2D sparse convolution heatmap.
  • pearson (float) – Minimum correlation coefficient required to consider a pixel as candidate. Increasing this value reduces the amount of false positive patterns.
  • min_size (int) – Minimum number of pixels required to keep a focus. Pixels belonging to smaller foci will be set to 0.
Returns:

  • foci_coords (numpy.array of ints) – 2D array of coordinates for identified patterns corresponding to foci maxima. None is no pattern was detected.
  • labelled_mat (scipy.sparse.coo_matrix) – The map of detected foci. Pixels which were assigned to a focus are given an integer as their focus ID. Pixels not assigned to a focus are set to 0.

chromosight.utils.detection.pileup_patterns(pattern_windows)[source]

Generate a pileup (arithmetic mean) from a stack of pattern windows.

Parameters:pattern_windows (numpy.array of floats) – 3D numpy array of detected windows. Shape is (N, H, W) where N is the number of windows, H the height, and W the width of each window.
Returns:2D numpy array containing the pileup (arithmetic mean) of input windows.
Return type:numpy.array of floats
chromosight.utils.detection.remove_neighbours(patterns, win_size=8)[source]

Identify patterns that are too close from each other to exclude them. The pattern with the highest score are kept in priority.

Parameters:
  • patterns (numpy.array of float) – 2D Array of patterns, with 3 columns: bin1, bin2 and score.
  • win_size (int) – The maximum number of pixels at which patterns are considered overlapping.
Returns:

Boolean array indicating which patterns are valid (True values) and which are overlapping neighbours (False values)

Return type:

numpy.array of bool

chromosight.utils.detection.validate_patterns(coords, matrix, conv_mat, detectable_bins, kernel_matrix, drop=True, zero_tol=0.3, missing_tol=0.75)[source]

Given a list of pattern coordinates and a contact map, drop or flag patterns in low detectability regions or too close to matrix boundaries. Also returns the surrounding window of Hi-C contacts around each detected pattern.

Parameters:
  • coords (numpy.array of ints) – Coordinates of all detected patterns in the sub matrix. One pattern per row, the first column is the matrix row, second column is the matrix col.
  • matrix (scipy.sparse.csr_matrix) – Hi-C contact map of the sub matrix.
  • conv_mat (scipy.sparse.csr_matrix) – Convolution product of the kernel with the Hi-C sub matrix.
  • detectable_bins (list of numpy.array) – List of two 1D numpy arrays of ints representing ids of detectable rows and columns, respectively.
  • kernel_matrix (numpy.array of float) – The kernel that was used for pattern detection on the Hi-C matrix.
  • zero_tol (float) – Proportion of zero pixels allowed in a pattern window to consider it valid.
  • missing_tol (float) – Proportion of missing pixels allowed in a pattern window to consider it valid.
  • drop (bool) – Whether to discard pattern coordinates and windows from patterns which fall outside the matrix or do not pass validation. If those are kept, they will be given a score of np.nan and their windows will be filled with np.nan.
Returns:

  • filtered_coords (pandas.DataFrame) – Table of coordinates that passed the filters. The dataframe has 3: columns: bin1 (rows), bin2 (col) and score (the correlation coefficient).
  • filtered_windows (numpy.array) – 3D numpy array of signal windows around detected patterns. Each window spans axes 1 and 2, and they are stacked along axis 0.

chromosight.utils.detection.xcorr2(signal, kernel, threshold=0.0001, tsvd=None)[source]

Cross correlate a dense or sparse 2D signal with a dense 2D kernel.

Parameters:
  • signal (scipy.sparse.csr_matrix or numpy.array of floats) – A 2-dimensional numpy array Ms x Ns acting as the detrended Hi-C map.
  • kernel (numpy.array of floats) – A 2-dimensional numpy array Mk x Nk acting as the pattern template.
  • threshold (float) – Convolution score below which pixels will be set back to zero to save on time and memory.
  • tsvd (float or None) – If a float between 0 and 1 is given, the input kernel is factorised using truncated SVD, keeping enough singular vectors to retain this proportion of information. Factorisation speeds up convolution at the cost of a loss of information. If the number of singular vectors required to retain the desired information isDisabled by default.
  • -------
  • out (scipy.sparse.csr_matrix or numpy.array) – Convolution product of signal by kernel. Same type as the input signal.

chromosight.utils.io module

Chromosight’s io submodule contains input/output related functions to load contact matrices in cool format, and save output patterns coordinates and windows.

chromosight.utils.io.check_prefix_dir(prefix)[source]

Checks for existence of the parent directory of an output prefix

chromosight.utils.io.download_file(url, file, length=16384)[source]
chromosight.utils.io.load_bed2d(path)[source]

Loads only the first 6 columns of a 2D BED file. Will sniff for header and generate a default header only if none is present. Compatible with output of chromosight detect.

Parameters:path (str) – The path to the 2D BED file.
Returns:The content of the 2D BED file as a dataframe with 6 columns. Header will be: chrom1, start1, end1, chrom2, start2, end2.
Return type:pandas.DataFrame
chromosight.utils.io.load_cool(cool_path)[source]

Reads a cool file into memory and parses it into a COO sparse matrix and an array with the starting bin of each chromosome.

Parameters:cool (str) – Path to the input .cool file.
Returns:
  • mat (scipy.sparse.coo_matrix) – Output sparse matrix in coordinate format
  • chroms (pandas.DataFrame) – Table of chromosome information. Each row contains the name, length, first and last bin of a chromosome.
  • bins (pandas.DataFrame) – Table of genomic bins information. Each row contains the chromosome, genomic start and end coordinates of a matrix bin.
  • bin_size (int) – Matrix resolution. Corresponds to the number of base pairs per matrix bin.
chromosight.utils.io.load_kernel_config(kernel, custom=False)[source]

Load a kernel configuration from input JSON file.

All parameters associated with the kernel along its kernel matrices are loaded into a dictionary.

A kernel config file is a JSON file with the following structure:


{

“name”: str, “kernels”: [

str, …

], “max_dist”: int, “min_dist”: int, “max_iterations”: int, “max_perc_zero”: float, “max_perc_undetected”: float, “pearson”: float “resolution”: int

}

The kernels field should contain a list of path to kernel matrices to be loaded. These path should be relative to the config file. When loaded, the kernel field will contain the target matrices as 2D numpy arrays.

The kernel matrices files themselves are raw tsv files containing a dense matrix of numeric value as read by the numpy.loadtxt function.

Other fields are:

  • name : Name of the pattern
  • max_dist : maximum distance in basepairs to scan from the diagonal
  • max_iterations: maximum number of scanning iterations to perform
  • max_perc_zero: Maximum percentage of empty (0) pixels to include a pattern
  • max_perc_zero: Maximum percentage of missing (nan) pixels to include a pattern
  • pearson: Increasing this value reduces false positive patterns.
  • resolution: Basepair resolution for the kernel matrix.
Parameters:
  • kernel (str) – The name of the built-in pattern configuration to load if custom is False. Otherwise, the path to the custom JSON configuration file to load.
  • custom (bool) – Determines if a custom JSON configuration file must be loaded, or if a preset configuration is used.
Returns:

kernel_config – A dictionary containing a key: value pair for each parameter as well as list of kernel matrices under key ‘kernels’.

Return type:

dict

chromosight.utils.io.progress(count, total, status='')[source]

Basic progress bar in terminal.

Parameters:
  • count (float) – Current task id.
  • total (float) – Maximum task id.
  • status (str) – Info to write on the side of the bar.
chromosight.utils.io.save_windows(windows, output_prefix, fmt='json')[source]

Write windows surrounding detected patterns to a npy or json file. The file contains a 3D array where windows are piled on axis 0, matrix rows are on axis 1 and columns on axis 2.

Parameters:
  • windows (numpy.array of floats) – 3D numpy array with axes (windows, rows, columns).
  • output_prefix (str) – Output path where the file will be saved, an extension will be added based on the value of “format”.
  • format (str) – Format in which to save windows. Can be either npy for numpy’s binary format, or json for a general purpose text format.
chromosight.utils.io.write_patterns(coords, output_prefix, dec=10)[source]

Writes coordinates to a text file.

Parameters:
  • coords (pandas.DataFrame) – Pandas dataframe containing the coordinates and score of one detected pattern per row.
  • pattern_name (str) – Name of the pattern. Will be the basename of the output file.
  • output_dir (str) – Output path where the file will be saved.
  • dec (int) – Number of decimals to keep in correlation scores and p-values.

chromosight.utils.plotting module

Chromosight’s plotting submodule contains utilities to visualize the pileup of detected patterns or the input matrix. It also implements an interactive map recording the coordinates of double clicks.

chromosight.utils.plotting.click_finder(mat, half_w=8, xlab=None, ylab=None)[source]

Given an input Hi-C matrix, show an interactive window and record coordinates where the user double-clicks. When the interactive window is closed, the stack of windows around recorded coordinates is returned.

Parameters:
  • mat (scipy.sparse.csr_matrix) – The input Hi-C matrix to display interactively.
  • half_w (int) – Half width of the windows to return. The resulting windows
  • xlab (str) – Horizontal label to display below the matrix.
  • ylab (str) – Vertical label to display next to the matrix.
Returns:

3D stack of images around coordinates recorded interactively. The shape of the stack is (N, w, w) where N is the number of coordinates and w is 2*half_w.

Return type:

numpy.array

chromosight.utils.plotting.pileup_plot(pileup_pattern, output_prefix, name='pileup_patterns')[source]

Wrapper around matplotlib.pyplot.imshow to visualize the pileup of patterns detected by chromosight

chromosight.utils.plotting.plot_whole_matrix(clr, patterns, out=None, region=None, region2=None, log_transform=False)[source]

Visualise the input matrix with a set of patterns overlaid on top. Can optionally restrict the visualisation to a region.

Parameters:
  • mat (scipy.sparse.csr_matrix) – The whole genome Hi-C matrix to be visualized.
  • patterns (pandas.DataFrame) – The set of patterns to be plotted on top of the matrix. One pattern per row, 3 columns: bin1, bin2 and score of types int, int and float, respectively.
  • region (str) – The genomic range, in UCSC format, corresponding to rows to be plotted in the matrix. If not given, the whole matrix is used. It only region is given, but not region2, the matrix is subsetted on rows and columns to show a region on the diagonal.
  • region2 (str) – The genomic range, in UCSC format, of columns to be plotted in the matrix. Region must also be provided, or this will be ignored.
  • log_transform (bool) – Whether to log transform the matrix.
chromosight.utils.plotting.print_ascii_mat(mat, adjust=True, colored=False, print_str=True)[source]

Given a 2D numpy array of float, print it in ASCII art.

Parameters:
  • mat (np.array of floats) – Matrix to visualize.
  • adjust (bool) – Whether to adjust the drawing size to termina width.
  • colored (bool) – Whether to use colors.
  • print_str (bool) – If true, the ASCII art is printed to stdout, otherwise it is stored in a string and returned.
Returns:

An empty string is returned if print_str is True, otherwise the ASCII art is returned as a string.

Return type:

str

chromosight.utils.preprocessing module

Chromosight’s preprocessing submodule implements a number of functions to operate on Hi-C contact maps before detection. These functions can be used to improve the signal or filter unneeded signal. There are also functions to edit (zoom, crop, factorize) kernel matrices.

chromosight.utils.preprocessing.check_missing_mask(signal, mask)[source]

Ensure all elements defined as missing by the mask are set to zero in the signal. If this is not the case, raises an error.

Parameters:
  • signal (numpy.ndarray of floats or scipy.sparse.csr_matrix of floats) – The signal to be checked.
  • mask (numpy.ndarray of bools or scipy.sparse.csr_matrix of bools) – The mask defining missing values as True and valid values as False.
chromosight.utils.preprocessing.crop_kernel(kernel, target_size)[source]

Crop a kernel matrix to target size horizontally and vertically. If the target size is even, the target size is adjusted to the next integer up.

Parameters:
  • kernel (numpy.ndarray of floats) – Image to crop.
  • target_size (tuple of ints) – Tuple defining the target shape of the kernel, takes the form (rows, cols) where rows and cols are odd numbers.
Returns:

cropped – New image no larger than target dimensions

Return type:

numpy.ndarray of floats

chromosight.utils.preprocessing.detrend(matrix, detectable_bins=None, max_dist=None, smooth=False, fun=<function nanmean>, max_val=10)[source]

Detrends a Hi-C matrix by the distance law. The input matrix should have been normalised beforehandand.

Parameters:
  • matrix (scipy.sparse.csr_matrix) – The normalised intrachromosomal Hi-C matrix to detrend.
  • detectable_bins (tuple) – Tuple containing a list of detectable rows and a list of columns on which to perform detrending. Poorly interacting indices have been excluded.
  • max_dist (int) – Maximum number of bins from the diagonal at which to compute trend.
  • smooth (bool) – Whether to use isotonic regression to smooth the trend.
  • fun (callable) – Function to use on each diagonal to compute the trend.
  • max_val (float or None) – Maximum value in the detrended matrix. Set to None to disable
Returns:

The detrended Hi-C matrix.

Return type:

numpy.ndarray

chromosight.utils.preprocessing.diag_trim(mat, n)[source]

Trim an upper triangle sparse matrix so that only the first n diagonals are kept.

Parameters:
  • mat (scipy.sparse.csr_matrix or numpy.ndarray) – The sparse matrix to be trimmed
  • n (int) – The number of diagonals from the center to keep (0-based).
Returns:

The diagonally trimmed upper triangle matrix with only the first n diagonal.

Return type:

scipy.sparse.dia_matrix or numpy.ndarray

chromosight.utils.preprocessing.distance_law(matrix, detectable_bins=None, max_dist=None, smooth=True, fun=<function nanmean>)[source]

Computes genomic distance law by averaging over each diagonal in the upper triangle matrix. If a list of detectable bins is provided, pixels in missing bins will be excluded from the averages. A maximum distance can be specified to define how many diagonals should be computed.

Parameters:
  • matrix (scipy.sparse.csr_matrix) – the input matrix to compute distance law from.
  • detectable_bins (numpy.ndarray of ints) – An array of detectable bins indices to consider when computing distance law.
  • max_dist (int) – Maximum distance from diagonal, in number of bins in which to compute distance law
  • smooth (bool) – Whether to use isotonic regression to smooth the distance law.
  • fun (callable) – A function to apply on each diagonal. Defaults to mean.
Returns:

dist – the output genomic distance law.

Return type:

np.ndarray

Example

>>> m = np.ones((3,3))
>>> m += np.array([1,2,3])
>>> m
array([[2., 3., 4.],
       [2., 3., 4.],
       [2., 3., 4.]])
>>> distance_law(csr_matrix(m))
array([3. , 3.5, 4. ])
chromosight.utils.preprocessing.erase_missing(signal, valid_rows, valid_cols, sym_upper=True)[source]

Given a sparse matrix, set all pixels in missing (invalid) bins to 0.

Parameters:
  • signal (scipy.sparse.csr_matrix of floats) – Input signal on which to erase values.
  • valid_rows (numpy.ndarray of ints) – Indices of rows considered valid (not missing).
  • valid_cols (numpy.ndarray of ints) – Indices of columns considered valid (not missing).
  • sym_upper (bool) – Define if the input signal is upper symmetric.
Returns:

The input signal with all values in missing bins set to 0

Return type:

scipy.sparse.csr_matrix

chromosight.utils.preprocessing.factorise_kernel(kernel, prop_info=0.999)[source]

Performs truncated SVD on an input kernel, returning the singular vectors necessary to retain a given proportion of information contained in the kernel.

Parameters:
  • kernel (numpy.ndarray of floats) – The input 2D kernel to factorise.
  • prop_info (float) – Proportion of information to retain.
Returns:

A tuple containing the truncated left and right singular matrices, where each singular vector has been multiplied by the square root of their respective singular values.

Return type:

tuple of numpy.ndarrays of floats

chromosight.utils.preprocessing.frame_missing_mask(mask, kernel_shape, sym_upper=False, max_dist=None)[source]

Adds a frame around input mask, given a kernel. The goal of this frame is define margins around the matrix where the kernel will not perform convolution (denoted by 1). If the matrix is upper symmetric, a margin of half the kernel’s width is added below the diagonal and a maximum distance from the diagonal above which margins need not be drawn can be considered. Otherwise Margins are simply added on all 4 sides of the matrix.

signal    kernel   _________
______   ____      |#######|
|     |  |   | ==> |#     #|
|     |  |___|     |#     #|
|     |            |#     #|
|_____|            |#     #|
                   |#######|
                   --------
Parameters:
  • mask (scipy.sparse.csr_matrix of bool) – The mask around which to add margins.
  • kernels_shape (tuple of ints) – The number of rows and kernel in the input kernel. Margins will be half these values.
  • sym_upper (bool) – Whether the signal is a symmetric upper triangle matrix. If so, values on a margin below the diagonal will be masked.
  • max_dist (int or None) – Number of diagonals to keep
Returns:

framed_mask – The input mask with a padding of True around the edges. If sym_upper is True, a padding is also added below the diagonal.

Return type:

scipy.sparse.csr_matrix of bool

chromosight.utils.preprocessing.get_detectable_bins(mat, n_mads=3, inter=False)[source]

Returns lists of detectable indices after excluding low interacting bins based on the proportion of zero pixel values in the matrix bins.

Parameters:
  • mat (scipy.sparse.coo_matrix) – A Hi-C matrix in tihe form of a 2D numpy array or coo matrix
  • n_mads (int) – Number of median absolute deviation below the median required to consider bins non-detectable.
  • inter (bool) – Whether the matrix is interchromosomal. Default is to consider the matrix is intrachromosomal (i.e. upper symmetric).
Returns:

tuple of 2 1D arrays containing indices of low interacting rows and columns, respectively.

Return type:

numpy ndarray

chromosight.utils.preprocessing.make_missing_mask(shape, valid_rows, valid_cols, max_dist=None, sym_upper=False)[source]

Given lists of valid rows and columns, generate a sparse matrix mask with missing pixels denoted as 1 and valid pixels as 0. If a max_dist is provided, upper symmetric matrices will only be flagged up to max_dist pixels from the diagonal.

Parameters:
  • shape (tuple of ints) – Shape of the mask to generate.
  • valid_rows (numpy.ndarray of ints) – Array with the indices of valid rows that should be set to 0 in the mask.
  • valid_cols (numpy.ndarray of ints) – Array with the indices of valid rows that should be set to 0 in the mask.
  • max_dist (int or None) – The maximum diagonal distance at which masking should take place.
  • sym_upper (bool) – Whether the matrix is symmetric upper. If so, max_dist is ignored
Returns:

The mask containing False values where pixels are valid and True valid where pixels are missing

Return type:

scipy.sparse.csr_matrix of bool

chromosight.utils.preprocessing.resize_kernel(kernel, kernel_res=None, signal_res=None, factor=None, min_size=7, quiet=False)[source]

Resize a kernel matrix based on the resolution at which it was defined and the signal resolution. E.g. if a kernel matrix was generated for 10kb and the input signal is 20kb, kernel size will be divided by two. If the kernel is enlarged, pixels are interpolated with a spline of degree 1. Alternatively, a resize factor can be provided. In the example above, the factor would be 0.5.

Parameters:
  • kernel (numpy.ndarray) – Kernel matrix.
  • kernel_res (int) – Resolution for which the kernel was designed. Mutually exclusive with factor.
  • signal_res (int) – Resolution of the signal matrix in basepair per matrix bin. Mutually exclusive with factor.
  • factor (float) – Resize factor. Can be provided as an alternative to kernel_res and signal_res. Values above 1 will enlarge the kernel, values below 1 will shrink it.
  • min_size (int) – Lower bound, in number of rows/column allowed when resizing the kernel.
  • quiet (bool) – Suppress warnings if resize factor was adjusted.
Returns:

resized_kernel – The resized input kernel.

Return type:

numpy.ndarray

chromosight.utils.preprocessing.set_mat_diag(mat, diag=0, val=0)[source]

Set the nth diagonal of a symmetric 2D numpy array to a fixed value. Operates in place.

Parameters:
  • mat (numpy.ndarray) – Symmetric 2D array of floats.
  • diag (int) – 0-based index of the diagonal to modify. Use negative values for the lower half.
  • val (float) – Value to use for filling the diagonal
chromosight.utils.preprocessing.subsample_contacts(M, n_contacts)[source]

Bootstrap sampling of contacts in a sparse Hi-C map.

Parameters:
  • M (scipy.sparse.coo_matrix) – The input Hi-C contact map in sparse format.
  • n_contacts (int) – The number of contacts to sample.
Returns:

A new matrix with a fraction of the original contacts.

Return type:

scipy.sparse.coo_matrix

chromosight.utils.preprocessing.sum_mat_bins(mat)[source]

Compute the sum of matrices bins (i.e. rows or columns) using only the upper triangle, assuming symmetrical matrices.

Parameters:mat (scipy.sparse.coo_matrix) – Contact map in sparse format, either in upper triangle or full matrix.
Returns:1D array of bin sums.
Return type:numpy.ndarray
chromosight.utils.preprocessing.valid_to_missing(valid, size)[source]

Given an array of valid indices, return the corrsesponding array of missing indices.

Parameters:
  • valid (numpy.ndarray of ints) – The valid indices.
  • size (int) – The size of the matrix (maximum possible index + 1).
Returns:

missing – The missing indices.

Return type:

numpy.ndarray of ints

chromosight.utils.preprocessing.zero_pad_sparse(mat, margin_h, margin_v, fmt='coo')[source]

Adds margin of zeros around an input sparse matrix.

Parameters:
  • mat (scipy.sparse.csr_matrix) – The matrix to be padded.
  • margin_h (int) – The width of the horizontal margin to add on the left and right of the matrix.
  • margin_v (int) – The width of the vertical margin to add on the top and bottom of the matrix.
  • fmt (string) – The desired scipy sparse format of the output matrix
Returns:

The padded matrix of dimensions (m + 2 * margin_h, n + 2 * margin_v).

Return type:

scipy.sparse.csr_matrix

Examples

>>> m = sp.csr_matrix(np.array([[1, 2], [10, 20]]))
>>> zero_pad_sparse(m, 2, 1).toarray()
array([[ 0,  0,  0,  0,  0,  0],
       [ 0,  0,  1,  2,  0,  0],
       [ 0,  0, 10, 20,  0,  0],
       [ 0,  0,  0,  0,  0,  0]])
chromosight.utils.preprocessing.ztransform(matrix)[source]

Z transformation for Hi-C matrices.

Parameters:matrix (scipy.sparse.coo_matrix) – A Hi-C matrix in sparse format.
Returns:The detrended Hi-C matrix
Return type:scipy.sparse.coo_matrix

chromosight.utils.stats module

Chromosight’s stats submodule contains helper function to compute statistical estimators from distributions

chromosight.utils.stats.corr_to_pval(corr, n, rho0=0)[source]

Given a list of Pearson correlation coefficient, convert them to two-sided log10 p-values. The p-values are computed via the fisher transformation described on: https://w.wiki/Ksu

Parameters:
  • corr (numpy.array) – The Pearson correlation coefficients.
  • n (int or numpy.array) – The number of observations used to compute correlation coefficients. Can be given as an array of the same size as corr to give the number of sample in each coefficient.
  • rho0 (float) – The correlation value under h0. We test if corr is significantly different from rho0.
Returns:

The array of log10-transformed two-sided p-values, same size as corr.

Return type:

numpy.array

chromosight.utils.stats.fdr_correction(pvals)[source]

Applies false discovery rate correction via the Benjamini-Hochberg procedure to adjust input p-values for multiple testing. .

Parameters:pvals (numpy.array of floats) – Array of uncorrected p-values.
Returns:fdr – Array of corrected p-values (q-values).
Return type:numpy.array of floats

Module contents