chromosight.utils package¶
Submodules¶
chromosight.utils.contacts_map module¶
Chromosight’s contact_map submodule contains classes to keep track of the different aspects of HiC data and automate standard preprocessing steps. The ContactMap class holds the whole genome informations and metadata (chromosome sizes, resolution, etc) without loading the actual contact map. Upon calling its “make_sub_matrices” method, it will generate a collection of ContactMap instances accessible via the sub_mats attribute. Each instance corresponds to an inter or intrachromosomal matrix and the HiC matrix of each chromosome is loaded and preprocessed upon instantiation.

class
chromosight.utils.contacts_map.
ContactMap
(clr, extent, name='', detectable_bins=None, inter=False, max_dist=None, largest_kernel=0, dump=None, smooth=False, sample=None, use_norm=True)[source]¶ Bases:
object
Class to store and manipulate a simple HiC matrix, either intra or interchromosomal.
 clr : cooler.Cooler
 Reference to a cooler object containing HiC data.
 extent : list of tuples of ints
 List of two tuples containing the start and end bin numbers of both chromosomes from the submatrix.
 matrix : scipy.sparse.csr_matrix
 The contact map as a sparse matrix.
 detectable_bins : tuple of arrays
 List containing two arrays (rows and columns) of indices from bins considered detectable in the matrix.
 inter : bool
 True if the matrix represents contacts between two different chromosomes, False otherwise.
 max_dist : int
 Maximum distance (in bins) at which contact values should be analysed. Only valid for intrachromosomal matrices.
 dump : str
 Base path where dump files will be generated. None means no dump.
 name : str
 Name of the submatrix (used for dumping).
 smooth : bool
 Whether isotonic regression should be used to smooth the signal for detrending. This will reduce noise at long ranges but assumes contacts can only decrease with distance from the diagonal. Do not use with circular chromosomes.
 sample : int, float or None
 Proportion of contacts to sample from the data if between 0 and 1. Number of contacts to keep if above 1. Keep all if None.
 use_norm : bool
 Whether to use the balanced matrix. If set to False, the raw contact counts are used.

detrend
(**kwargs)[source]¶ Executed at run time of the wrapped method. Executes the input function with its arguments, then dumps the matrix to target path. Note args[0] will always denote the instance of the wrapped method.

keep_distance
¶

preprocess_inter_matrix
(**kwargs)[source]¶ Executed at run time of the wrapped method. Executes the input function with its arguments, then dumps the matrix to target path. Note args[0] will always denote the instance of the wrapped method.

class
chromosight.utils.contacts_map.
DumpMatrix
(dump_name)[source]¶ Bases:
object
This class is used as a decorator to wrap ContactMap’s methods and generate dump files of the “matrix” atribute. The matrix should be a scipy.sparse matrix and will be saved in npy format. The full dump path will be:
inst.dump / os.path.basename(inst.name) + self.dump_name + “.npy”Where inst is the ContactMap instance of the wrapped method and self is the DumpMatrix instance. If the inst has no dump attribute, no action is performed.
Parameters: dump_name (str, os.PathLike object or None) – The basename of the file where to save the dump. If None, no action is performed.

class
chromosight.utils.contacts_map.
HicGenome
(path, inter=False, kernel_config=None, dump=None, smooth=False, sample=None)[source]¶ Bases:
object
Class used to manage relationships between whole genome and intra or inter chromosomal HiC sub matrices. Also handles reading and writing data.
 clr : cooler.Cooler
 Cooler object containing HiC data and related informations for the whole genome
 sub_mats : pandas.DataFrame
 Table containing intra and optionally interchromosomal matrices.
 detectable_bins : array of ints
 Array containing indices of detectable rows and columns.
 bins : pandas.DataFrame
 Table containing bin related informations.
 inter : bool
 Whether interchromosomal matrices should be stored.
 kernel_config : dict
 Kernel configuration associated with the HiC genome
 max_dist : int
 Maximum scanning distance for convolution during pattern detection.
 dump : str
 Base path where dump files will be generated. None means no dump.
 smooth : bool
 Whether isotonic regression should be used to smooth the signal for detrending intrachromosomal sub matrices. This will reduce noise at long ranges but assumes contacts can only decrease with distance from the diagonal. Do not use with circular chromosomes.
 sample : int, float or None
 Proportion of contacts to sample from the data if between 0 and 1. Number of contacts to keep if above 1. Keep all if None.

bins_to_coords
(bin_idx)[source]¶ Converts a list of bin IDs to genomic coordinates based on the whole genome contact map.
Parameters: bin_idx (numpy.array of ints) – A list of bin numbers corresponding to rows or columns of the whole genome matrix. Returns: A subset of the bins dataframe, with columns chrom, start, end where chrom is the chromosome name (str), and start and end are the genomic coordinates of the bin (int). Return type: pandas.DataFrame

coords_to_bins
(coords)[source]¶ Converts genomic coordinates to a list of bin ids based on the whole genome contact map.
Parameters: coords (pandas.DataFrame) – Table of genomic coordinates, with columns chrom, pos. Returns: Indices in the whole genome matrix contact map. Return type: numpy.array of ints

get_full_mat_pattern
(chr1, chr2, patterns)[source]¶ Converts bin indices of a list of patterns from an submatrix into their value in the original fullgenome matrix.
Parameters:  chr1 (str) – Name of the first chromosome of the sub matrix (rows).
 chr2 (str) – Name of the second chromosome of the sub matrix (cols).
 pattern (pandas.DataFrame) – A dataframme of pattern coordinates. Each row is a pattern and columns should be bin1 and bin2, for row and column coordinates in the HiC matrix, respectively.
Returns: full_patterns – A dataframe similar to the input, but with bins shifted to represent coordinates in the whole genome matrix.
Return type: pandas.DataFrame

get_sub_mat_pattern
(chr1, chr2, patterns)[source]¶ Converts bin indices of a list of patterns from the whole genome matrix into their value in the desired intra or interchromosomal submatrix.
Parameters:  chr1 (str) – Name of the first chromosome of the sub matrix (rows).
 chr2 (str) – Name of the second chromosome of the sub matrix (cols).
 pattern (pandas.DataFrame) – A dataframme of pattern coordinates. Each row is a pattern and columns should be bin1 and bin2, for row and column coordinates in the HiC matrix, respectively.
Returns: full_patterns – A dataframe similar to the input, but with bins shifted to represent coordinates in the target submatrix.
Return type: pandas.DataFrame

make_sub_matrices
()[source]¶ Generates a table of HiC sub matrices. Each sub matrix is either intra or interchromosomal. The table has 3 columns: chr1, chr2 and contact_map. The contact_map column contains instances of the ContactMap class.
Returns: The table of sub matrices which will contain n_chrom rows if the inter attribute is set to false, or (n_chrom^2) / 2 + n_chroms / 2 if inter is True (that is, the upper triangle matrix). Return type: pandas.DataFrame

normalize
(norm='auto', n_mads=5, threads=1)[source]¶ If the instance’s cooler is not balanced, finds detectable bins and applies ICE normalisation to the whole genome matrix. Newly computed biases are stored in the input file. If it is already balanced, detectable bins and weights will be extracted from the file.
Parameters:  force_norm (str) – Normalization behaviour. If ‘auto’, existing weights are reused and matrix is balanced only in the absence of weights. If ‘raw’, raw contact values are used. If ‘force’, weights are recomputed and the underlying cool file is overwritten.
 n_mads (float) – Maximum number of median absoluted deviations (MADs) below the median of the distribution of logged bin sums to consider a bin detectable.
 threads (int) – Number of parallel threads to use for normalization.
chromosight.utils.detection module¶
Chromosight’s detection submodule implements the bulk of chromosight convolution engine, as well as functions to perform the different steps of the detection algorithm (preprocessing, local maxima, postprocessing…). The pattern_detector function orchestrate all those different steps.

chromosight.utils.detection.
filter_foci
(foci_mat, min_size=2)[source]¶ Given an input matrix of labelled foci (continuous islands of equal nonzero values), filters out foci consisting of fewer pixels than min_size.
Parameters:  foci_mat (scipy.sparse.coo_matrix) – Input matrix of labelled foci. Pixels are numbered according to their respective foci. Pixels that are not assigned to a focus are 0.
 min_size (int) – Minimum number of pixels required to keep a focus. Pixels belonging to smaller foci will be set to 0.
Returns:  num_filtered (int) – Number of foci remaining after filtering.
 filtered_mat (scipy.sparse.coo_matrix) – Matrix of filtered foci.

chromosight.utils.detection.
label_foci
(matrix)[source]¶ Given a sparse matrix of 1 and 0 values, find all foci of continuously neighbouring positive pixels and assign them a label according to their focus. Horizontal and vertical (4way) adjacency is considered.
Parameters: matrix (scipy.sparse.coo_matrix of ints) – The input matrix where to label foci. Should be filled with 1 and 0s. Returns:  num_foci (int) – Number of individual foci identified.
 foci_mat (scipy.sparse.coo_matrix:) – The matrix with values replaced by their respective foci labels.
Example
>>> M.todense() array([[1 0 0 0] [1 0 1 0] [1 0 1 1] [0 0 0 0]]) >>> num_foci, foci_mat = label_foci(M) >>> num_foci 2 >>>foci_mat.todense() array([[1 0 0 0] [1 0 2 0] [1 0 2 2] [0 0 0 0]])

chromosight.utils.detection.
normxcorr2
(signal, kernel, max_dist=None, sym_upper=False, full=False, missing_mask=None, missing_tol=0.75, tsvd=None, pval=False)[source]¶ Computes the normalized crosscorrelation of a dense or sparse signal and a dense kernel. The resulting matrix contains Pearson correlation coefficients.
Parameters:  signal (scipy.sparse.csr_matrix or numpy.array) – The input processed HiC matrix.
 kernel (numpy.array) – The pattern kernel to use for convolution.
 max_dist (int) – Maximum scan distance, in number of bins from the diagonal. If None, the whole matrix is convoluted. Otherwise, pixels further than this distance from the diagonal are set to 0 and ignored for performance. Only useful for intrachromosomal matrices.
 sym_upper (False) – Whether the matrix is symmetric and upper triangle. True for intrachromosomal matrices.
 missing_mask (scipy.sparse.csr_matrix of bool or None) – Matrix defining which pixels are missing (1) or not (0).
 full (bool) – If True, convolutions will be made in ‘full’ mode; the matrix is first padded with margins to allow scanning to the edges, and missing bins are also masked to exclude them when computing correlation scores. Computationally intensive
 missing_mask – Mask matrix denoting missing bins, where missing is denoted as True and valid as False. Can be None to ignore missing bin information. Only taken into account when full=True.
 missing_tol (float) – Proportion of missing values allowed in windows to keep the correlation coefficients.
 tsvd (float or None) – If a float between 0 and 1 is given, the input kernel is factorised using truncated SVD, keeping enough singular vectors to retain this proportion of information. Factorisation speeds up convolution at the cost of a loss of information. If the number of singular vectors required to retain the desired information isDisabled by default.
 pval (bool) – Whether to return a matrix of pvalues.
Returns:  scipy.sparse.csr_matrix or numpy.array – The sparse matrix of correlation coefficients. Same type as the input signal.
 scipy.sparse.csr_matrix or numpy.array or None – A map of BenjaminiHochberg corrected pvalues (qvalues). Same type as the input signal. If pval=False, this will be None.

chromosight.utils.detection.
pattern_detector
(contact_map, kernel_config, kernel_matrix, coords=None, dump=None, full=False, tsvd=None)[source]¶ Detect patterns in a contact map by kernel matching, and extract windows around the detected patterns. If coordinates are provided, detection is skipped and windows are extracted around those coordinates.
Parameters:  contact_map (ContactMap object) – An object containing an inter or intrachromosomal HiC contact map and additional metadata.
 kernel_config (dict) – The kernel configuration, as documented in chromosight.utils.io.load_kernel_config
 kernel_matrix (numpy.array) – The kernel matrix to use for convolution as a 2D numpy array
 coords (numpy.array of ints or None) – A table with coordinates of patterns, with one pattern per row and 2 columns being the row and column number of the pattern in the input contact map. If this is provided, detection is skipped and quantification is performed on those coordinates.
 dump (str or None) – Folder in which dumps should be generated after each step of the detection process. If None, no dump is generated
 tsvd (float or None) – If a float between 0 and 1 is given, the input kernel is factorised using truncated SVD, keeping enough singular vectors to retain this proportion of information. Factorisation speeds up convolution at the cost of a loss of information. If the number of singular vectors required to retain the desired information is disabled by default.
Returns:  filtered_chrom_patterns (pandas.DataFrame) – A table of detected patterns with 4 columns: bin1, bin2, score, qvalue.
 chrom_pattern_windows (numpy array) – A 3D array containing the pile of windows around detected patterns.

chromosight.utils.detection.
pick_foci
(mat_conv, pearson, min_size=2)[source]¶ Pick coordinates of local maxima in a sparse 2D convolution heatmap. A threshold computed based on the pearson argument is applied to the heatmap. All values below that threshold are set to 0. The coordinate of the maximum value in each focus (contiguous region of high values) is returned.
Parameters:  mat_conv (scipy.sparse.coo_matrix of floats) – A 2D sparse convolution heatmap.
 pearson (float) – Minimum correlation coefficient required to consider a pixel as candidate. Increasing this value reduces the amount of false positive patterns.
 min_size (int) – Minimum number of pixels required to keep a focus. Pixels belonging to smaller foci will be set to 0.
Returns:  foci_coords (numpy.array of ints) – 2D array of coordinates for identified patterns corresponding to foci maxima. None is no pattern was detected.
 labelled_mat (scipy.sparse.coo_matrix) – The map of detected foci. Pixels which were assigned to a focus are given an integer as their focus ID. Pixels not assigned to a focus are set to 0.

chromosight.utils.detection.
pileup_patterns
(pattern_windows)[source]¶ Generate a pileup (arithmetic mean) from a stack of pattern windows.
Parameters: pattern_windows (numpy.array of floats) – 3D numpy array of detected windows. Shape is (N, H, W) where N is the number of windows, H the height, and W the width of each window. Returns: 2D numpy array containing the pileup (arithmetic mean) of input windows. Return type: numpy.array of floats

chromosight.utils.detection.
remove_neighbours
(patterns, win_size=8)[source]¶ Identify patterns that are too close from each other to exclude them. The pattern with the highest score are kept in priority.
Parameters:  patterns (numpy.array of float) – 2D Array of patterns, with 3 columns: bin1, bin2 and score.
 win_size (int) – The maximum number of pixels at which patterns are considered overlapping.
Returns: Boolean array indicating which patterns are valid (True values) and which are overlapping neighbours (False values)
Return type: numpy.array of bool

chromosight.utils.detection.
validate_patterns
(coords, matrix, conv_mat, detectable_bins, kernel_matrix, drop=True, zero_tol=0.3, missing_tol=0.75)[source]¶ Given a list of pattern coordinates and a contact map, drop or flag patterns in low detectability regions or too close to matrix boundaries. Also returns the surrounding window of HiC contacts around each detected pattern.
Parameters:  coords (numpy.array of ints) – Coordinates of all detected patterns in the sub matrix. One pattern per row, the first column is the matrix row, second column is the matrix col.
 matrix (scipy.sparse.csr_matrix) – HiC contact map of the sub matrix.
 conv_mat (scipy.sparse.csr_matrix) – Convolution product of the kernel with the HiC sub matrix.
 detectable_bins (list of numpy.array) – List of two 1D numpy arrays of ints representing ids of detectable rows and columns, respectively.
 kernel_matrix (numpy.array of float) – The kernel that was used for pattern detection on the HiC matrix.
 zero_tol (float) – Proportion of zero pixels allowed in a pattern window to consider it valid.
 missing_tol (float) – Proportion of missing pixels allowed in a pattern window to consider it valid.
 drop (bool) – Whether to discard pattern coordinates and windows from patterns which fall outside the matrix or do not pass validation. If those are kept, they will be given a score of np.nan and their windows will be filled with np.nan.
Returns:  filtered_coords (pandas.DataFrame) – Table of coordinates that passed the filters. The dataframe has 3: columns: bin1 (rows), bin2 (col) and score (the correlation coefficient).
 filtered_windows (numpy.array) – 3D numpy array of signal windows around detected patterns. Each window spans axes 1 and 2, and they are stacked along axis 0.

chromosight.utils.detection.
xcorr2
(signal, kernel, threshold=0.0001, tsvd=None)[source]¶ Cross correlate a dense or sparse 2D signal with a dense 2D kernel.
Parameters:  signal (scipy.sparse.csr_matrix or numpy.array of floats) – A 2dimensional numpy array Ms x Ns acting as the detrended HiC map.
 kernel (numpy.array of floats) – A 2dimensional numpy array Mk x Nk acting as the pattern template.
 threshold (float) – Convolution score below which pixels will be set back to zero to save on time and memory.
 tsvd (float or None) – If a float between 0 and 1 is given, the input kernel is factorised using truncated SVD, keeping enough singular vectors to retain this proportion of information. Factorisation speeds up convolution at the cost of a loss of information. If the number of singular vectors required to retain the desired information isDisabled by default.
  –
 out (scipy.sparse.csr_matrix or numpy.array) – Convolution product of signal by kernel. Same type as the input signal.
chromosight.utils.io module¶
Chromosight’s io submodule contains input/output related functions to load contact matrices in cool format, and save output patterns coordinates and windows.

chromosight.utils.io.
check_prefix_dir
(prefix)[source]¶ Checks for existence of the parent directory of an output prefix

chromosight.utils.io.
load_bed2d
(path)[source]¶ Loads only the first 6 columns of a 2D BED file. Will sniff for header and generate a default header only if none is present. Compatible with output of chromosight detect.
Parameters: path (str) – The path to the 2D BED file. Returns: The content of the 2D BED file as a dataframe with 6 columns. Header will be: chrom1, start1, end1, chrom2, start2, end2. Return type: pandas.DataFrame

chromosight.utils.io.
load_cool
(cool_path)[source]¶ Reads a cool file into memory and parses it into a COO sparse matrix and an array with the starting bin of each chromosome.
Parameters: cool (str) – Path to the input .cool file. Returns:  mat (scipy.sparse.coo_matrix) – Output sparse matrix in coordinate format
 chroms (pandas.DataFrame) – Table of chromosome information. Each row contains the name, length, first and last bin of a chromosome.
 bins (pandas.DataFrame) – Table of genomic bins information. Each row contains the chromosome, genomic start and end coordinates of a matrix bin.
 bin_size (int) – Matrix resolution. Corresponds to the number of base pairs per matrix bin.

chromosight.utils.io.
load_kernel_config
(kernel, custom=False)[source]¶ Load a kernel configuration from input JSON file.
All parameters associated with the kernel along its kernel matrices are loaded into a dictionary.
A kernel config file is a JSON file with the following structure:
 {
“name”: str, “kernels”: [
str, …], “max_dist”: int, “min_dist”: int, “max_iterations”: int, “max_perc_zero”: float, “max_perc_undetected”: float, “pearson”: float “resolution”: int
}
The kernels field should contain a list of path to kernel matrices to be loaded. These path should be relative to the config file. When loaded, the kernel field will contain the target matrices as 2D numpy arrays.
The kernel matrices files themselves are raw tsv files containing a dense matrix of numeric value as read by the numpy.loadtxt function.
Other fields are:
 name : Name of the pattern
 max_dist : maximum distance in basepairs to scan from the diagonal
 max_iterations: maximum number of scanning iterations to perform
 max_perc_zero: Maximum percentage of empty (0) pixels to include a pattern
 max_perc_zero: Maximum percentage of missing (nan) pixels to include a pattern
 pearson: Increasing this value reduces false positive patterns.
 resolution: Basepair resolution for the kernel matrix.
Parameters: Returns: kernel_config – A dictionary containing a key: value pair for each parameter as well as list of kernel matrices under key ‘kernels’.
Return type:

chromosight.utils.io.
progress
(count, total, status='')[source]¶ Basic progress bar in terminal.
Parameters:

chromosight.utils.io.
save_windows
(windows, output_prefix, fmt='json')[source]¶ Write windows surrounding detected patterns to a npy or json file. The file contains a 3D array where windows are piled on axis 0, matrix rows are on axis 1 and columns on axis 2.
Parameters:  windows (numpy.array of floats) – 3D numpy array with axes (windows, rows, columns).
 output_prefix (str) – Output path where the file will be saved, an extension will be added based on the value of “format”.
 format (str) – Format in which to save windows. Can be either npy for numpy’s binary format, or json for a general purpose text format.

chromosight.utils.io.
write_patterns
(coords, output_prefix, dec=10)[source]¶ Writes coordinates to a text file.
Parameters:  coords (pandas.DataFrame) – Pandas dataframe containing the coordinates and score of one detected pattern per row.
 pattern_name (str) – Name of the pattern. Will be the basename of the output file.
 output_dir (str) – Output path where the file will be saved.
 dec (int) – Number of decimals to keep in correlation scores and pvalues.
chromosight.utils.plotting module¶
Chromosight’s plotting submodule contains utilities to visualize the pileup of detected patterns or the input matrix. It also implements an interactive map recording the coordinates of double clicks.

chromosight.utils.plotting.
click_finder
(mat, half_w=8, xlab=None, ylab=None)[source]¶ Given an input HiC matrix, show an interactive window and record coordinates where the user doubleclicks. When the interactive window is closed, the stack of windows around recorded coordinates is returned.
Parameters: Returns: 3D stack of images around coordinates recorded interactively. The shape of the stack is (N, w, w) where N is the number of coordinates and w is 2*half_w.
Return type: numpy.array

chromosight.utils.plotting.
pileup_plot
(pileup_pattern, output_prefix, name='pileup_patterns')[source]¶ Wrapper around matplotlib.pyplot.imshow to visualize the pileup of patterns detected by chromosight

chromosight.utils.plotting.
plot_whole_matrix
(clr, patterns, out=None, region=None, region2=None, log_transform=False)[source]¶ Visualise the input matrix with a set of patterns overlaid on top. Can optionally restrict the visualisation to a region.
Parameters:  mat (scipy.sparse.csr_matrix) – The whole genome HiC matrix to be visualized.
 patterns (pandas.DataFrame) – The set of patterns to be plotted on top of the matrix. One pattern per row, 3 columns: bin1, bin2 and score of types int, int and float, respectively.
 region (str) – The genomic range, in UCSC format, corresponding to rows to be plotted in the matrix. If not given, the whole matrix is used. It only region is given, but not region2, the matrix is subsetted on rows and columns to show a region on the diagonal.
 region2 (str) – The genomic range, in UCSC format, of columns to be plotted in the matrix. Region must also be provided, or this will be ignored.
 log_transform (bool) – Whether to log transform the matrix.
chromosight.utils.preprocessing module¶
Chromosight’s preprocessing submodule implements a number of functions to operate on HiC contact maps before detection. These functions can be used to improve the signal or filter unneeded signal. There are also functions to edit (zoom, crop, factorize) kernel matrices.

chromosight.utils.preprocessing.
check_missing_mask
(signal, mask)[source]¶ Ensure all elements defined as missing by the mask are set to zero in the signal. If this is not the case, raises an error.
Parameters:  signal (numpy.ndarray of floats or scipy.sparse.csr_matrix of floats) – The signal to be checked.
 mask (numpy.ndarray of bools or scipy.sparse.csr_matrix of bools) – The mask defining missing values as True and valid values as False.

chromosight.utils.preprocessing.
crop_kernel
(kernel, target_size)[source]¶ Crop a kernel matrix to target size horizontally and vertically. If the target size is even, the target size is adjusted to the next integer up.
Parameters:  kernel (numpy.ndarray of floats) – Image to crop.
 target_size (tuple of ints) – Tuple defining the target shape of the kernel, takes the form (rows, cols) where rows and cols are odd numbers.
Returns: cropped – New image no larger than target dimensions
Return type: numpy.ndarray of floats

chromosight.utils.preprocessing.
detrend
(matrix, detectable_bins=None, max_dist=None, smooth=False, fun=<function nanmean>, max_val=10)[source]¶ Detrends a HiC matrix by the distance law. The input matrix should have been normalised beforehandand.
Parameters:  matrix (scipy.sparse.csr_matrix) – The normalised intrachromosomal HiC matrix to detrend.
 detectable_bins (tuple) – Tuple containing a list of detectable rows and a list of columns on which to perform detrending. Poorly interacting indices have been excluded.
 max_dist (int) – Maximum number of bins from the diagonal at which to compute trend.
 smooth (bool) – Whether to use isotonic regression to smooth the trend.
 fun (callable) – Function to use on each diagonal to compute the trend.
 max_val (float or None) – Maximum value in the detrended matrix. Set to None to disable
Returns: The detrended HiC matrix.
Return type:

chromosight.utils.preprocessing.
diag_trim
(mat, n)[source]¶ Trim an upper triangle sparse matrix so that only the first n diagonals are kept.
Parameters:  mat (scipy.sparse.csr_matrix or numpy.ndarray) – The sparse matrix to be trimmed
 n (int) – The number of diagonals from the center to keep (0based).
Returns: The diagonally trimmed upper triangle matrix with only the first n diagonal.
Return type: scipy.sparse.dia_matrix or numpy.ndarray

chromosight.utils.preprocessing.
distance_law
(matrix, detectable_bins=None, max_dist=None, smooth=True, fun=<function nanmean>)[source]¶ Computes genomic distance law by averaging over each diagonal in the upper triangle matrix. If a list of detectable bins is provided, pixels in missing bins will be excluded from the averages. A maximum distance can be specified to define how many diagonals should be computed.
Parameters:  matrix (scipy.sparse.csr_matrix) – the input matrix to compute distance law from.
 detectable_bins (numpy.ndarray of ints) – An array of detectable bins indices to consider when computing distance law.
 max_dist (int) – Maximum distance from diagonal, in number of bins in which to compute distance law
 smooth (bool) – Whether to use isotonic regression to smooth the distance law.
 fun (callable) – A function to apply on each diagonal. Defaults to mean.
Returns: dist – the output genomic distance law.
Return type: np.ndarray
Example
>>> m = np.ones((3,3)) >>> m += np.array([1,2,3]) >>> m array([[2., 3., 4.], [2., 3., 4.], [2., 3., 4.]]) >>> distance_law(csr_matrix(m)) array([3. , 3.5, 4. ])

chromosight.utils.preprocessing.
erase_missing
(signal, valid_rows, valid_cols, sym_upper=True)[source]¶ Given a sparse matrix, set all pixels in missing (invalid) bins to 0.
Parameters:  signal (scipy.sparse.csr_matrix of floats) – Input signal on which to erase values.
 valid_rows (numpy.ndarray of ints) – Indices of rows considered valid (not missing).
 valid_cols (numpy.ndarray of ints) – Indices of columns considered valid (not missing).
 sym_upper (bool) – Define if the input signal is upper symmetric.
Returns: The input signal with all values in missing bins set to 0
Return type: scipy.sparse.csr_matrix

chromosight.utils.preprocessing.
factorise_kernel
(kernel, prop_info=0.999)[source]¶ Performs truncated SVD on an input kernel, returning the singular vectors necessary to retain a given proportion of information contained in the kernel.
Parameters:  kernel (numpy.ndarray of floats) – The input 2D kernel to factorise.
 prop_info (float) – Proportion of information to retain.
Returns: A tuple containing the truncated left and right singular matrices, where each singular vector has been multiplied by the square root of their respective singular values.
Return type: tuple of numpy.ndarrays of floats

chromosight.utils.preprocessing.
frame_missing_mask
(mask, kernel_shape, sym_upper=False, max_dist=None)[source]¶ Adds a frame around input mask, given a kernel. The goal of this frame is define margins around the matrix where the kernel will not perform convolution (denoted by 1). If the matrix is upper symmetric, a margin of half the kernel’s width is added below the diagonal and a maximum distance from the diagonal above which margins need not be drawn can be considered. Otherwise Margins are simply added on all 4 sides of the matrix.
signal kernel _________ ______ ____ #######     ==> # #   ___ # #   # # _____ # # ####### 
Parameters:  mask (scipy.sparse.csr_matrix of bool) – The mask around which to add margins.
 kernels_shape (tuple of ints) – The number of rows and kernel in the input kernel. Margins will be half these values.
 sym_upper (bool) – Whether the signal is a symmetric upper triangle matrix. If so, values on a margin below the diagonal will be masked.
 max_dist (int or None) – Number of diagonals to keep
Returns: framed_mask – The input mask with a padding of True around the edges. If sym_upper is True, a padding is also added below the diagonal.
Return type: scipy.sparse.csr_matrix of bool

chromosight.utils.preprocessing.
get_detectable_bins
(mat, n_mads=3, inter=False)[source]¶ Returns lists of detectable indices after excluding low interacting bins based on the proportion of zero pixel values in the matrix bins.
Parameters:  mat (scipy.sparse.coo_matrix) – A HiC matrix in tihe form of a 2D numpy array or coo matrix
 n_mads (int) – Number of median absolute deviation below the median required to consider bins nondetectable.
 inter (bool) – Whether the matrix is interchromosomal. Default is to consider the matrix is intrachromosomal (i.e. upper symmetric).
Returns: tuple of 2 1D arrays containing indices of low interacting rows and columns, respectively.
Return type: numpy ndarray

chromosight.utils.preprocessing.
make_missing_mask
(shape, valid_rows, valid_cols, max_dist=None, sym_upper=False)[source]¶ Given lists of valid rows and columns, generate a sparse matrix mask with missing pixels denoted as 1 and valid pixels as 0. If a max_dist is provided, upper symmetric matrices will only be flagged up to max_dist pixels from the diagonal.
Parameters:  shape (tuple of ints) – Shape of the mask to generate.
 valid_rows (numpy.ndarray of ints) – Array with the indices of valid rows that should be set to 0 in the mask.
 valid_cols (numpy.ndarray of ints) – Array with the indices of valid rows that should be set to 0 in the mask.
 max_dist (int or None) – The maximum diagonal distance at which masking should take place.
 sym_upper (bool) – Whether the matrix is symmetric upper. If so, max_dist is ignored
Returns: The mask containing False values where pixels are valid and True valid where pixels are missing
Return type: scipy.sparse.csr_matrix of bool

chromosight.utils.preprocessing.
resize_kernel
(kernel, kernel_res=None, signal_res=None, factor=None, min_size=7, quiet=False)[source]¶ Resize a kernel matrix based on the resolution at which it was defined and the signal resolution. E.g. if a kernel matrix was generated for 10kb and the input signal is 20kb, kernel size will be divided by two. If the kernel is enlarged, pixels are interpolated with a spline of degree 1. Alternatively, a resize factor can be provided. In the example above, the factor would be 0.5.
Parameters:  kernel (numpy.ndarray) – Kernel matrix.
 kernel_res (int) – Resolution for which the kernel was designed. Mutually exclusive with factor.
 signal_res (int) – Resolution of the signal matrix in basepair per matrix bin. Mutually exclusive with factor.
 factor (float) – Resize factor. Can be provided as an alternative to kernel_res and signal_res. Values above 1 will enlarge the kernel, values below 1 will shrink it.
 min_size (int) – Lower bound, in number of rows/column allowed when resizing the kernel.
 quiet (bool) – Suppress warnings if resize factor was adjusted.
Returns: resized_kernel – The resized input kernel.
Return type:

chromosight.utils.preprocessing.
set_mat_diag
(mat, diag=0, val=0)[source]¶ Set the nth diagonal of a symmetric 2D numpy array to a fixed value. Operates in place.
Parameters:  mat (numpy.ndarray) – Symmetric 2D array of floats.
 diag (int) – 0based index of the diagonal to modify. Use negative values for the lower half.
 val (float) – Value to use for filling the diagonal

chromosight.utils.preprocessing.
subsample_contacts
(M, n_contacts)[source]¶ Bootstrap sampling of contacts in a sparse HiC map.
Parameters:  M (scipy.sparse.coo_matrix) – The input HiC contact map in sparse format.
 n_contacts (int) – The number of contacts to sample.
Returns: A new matrix with a fraction of the original contacts.
Return type: scipy.sparse.coo_matrix

chromosight.utils.preprocessing.
sum_mat_bins
(mat)[source]¶ Compute the sum of matrices bins (i.e. rows or columns) using only the upper triangle, assuming symmetrical matrices.
Parameters: mat (scipy.sparse.coo_matrix) – Contact map in sparse format, either in upper triangle or full matrix. Returns: 1D array of bin sums. Return type: numpy.ndarray

chromosight.utils.preprocessing.
valid_to_missing
(valid, size)[source]¶ Given an array of valid indices, return the corrsesponding array of missing indices.
Parameters:  valid (numpy.ndarray of ints) – The valid indices.
 size (int) – The size of the matrix (maximum possible index + 1).
Returns: missing – The missing indices.
Return type: numpy.ndarray of ints

chromosight.utils.preprocessing.
zero_pad_sparse
(mat, margin_h, margin_v, fmt='coo')[source]¶ Adds margin of zeros around an input sparse matrix.
Parameters:  mat (scipy.sparse.csr_matrix) – The matrix to be padded.
 margin_h (int) – The width of the horizontal margin to add on the left and right of the matrix.
 margin_v (int) – The width of the vertical margin to add on the top and bottom of the matrix.
 fmt (string) – The desired scipy sparse format of the output matrix
Returns: The padded matrix of dimensions (m + 2 * margin_h, n + 2 * margin_v).
Return type: scipy.sparse.csr_matrix
Examples
>>> m = sp.csr_matrix(np.array([[1, 2], [10, 20]])) >>> zero_pad_sparse(m, 2, 1).toarray() array([[ 0, 0, 0, 0, 0, 0], [ 0, 0, 1, 2, 0, 0], [ 0, 0, 10, 20, 0, 0], [ 0, 0, 0, 0, 0, 0]])
chromosight.utils.stats module¶
Chromosight’s stats submodule contains helper function to compute statistical estimators from distributions

chromosight.utils.stats.
corr_to_pval
(corr, n, rho0=0)[source]¶ Given a list of Pearson correlation coefficient, convert them to twosided log10 pvalues. The pvalues are computed via the fisher transformation described on: https://w.wiki/Ksu
Parameters:  corr (numpy.array) – The Pearson correlation coefficients.
 n (int or numpy.array) – The number of observations used to compute correlation coefficients. Can be given as an array of the same size as corr to give the number of sample in each coefficient.
 rho0 (float) – The correlation value under h0. We test if corr is significantly different from rho0.
Returns: The array of log10transformed twosided pvalues, same size as corr.
Return type: numpy.array

chromosight.utils.stats.
fdr_correction
(pvals)[source]¶ Applies false discovery rate correction via the BenjaminiHochberg procedure to adjust input pvalues for multiple testing. .
Parameters: pvals (numpy.array of floats) – Array of uncorrected pvalues. Returns: fdr – Array of corrected pvalues (qvalues). Return type: numpy.array of floats