Hi-C, the high-throughput derivative of chromosome conformation capture (3C) technology, quantifies all DNA-DNA contacts found genome-wide within a population of cells. The output of a Hi-C experiment is stored in an interaction frequency (IF) matrix. At restriction fragment (RF) resolution, most Hi-C IF matrices are sparse because of the sequencing depth required and its high cost. A majority of pairwise RF-interactions receive a raw frequency of zero or one, with most contacts found at relatively short distances (<1 Mb). IFs are therefore typically analyzed at a fixed resolution (e.g., 50 kb) to increase their signal-to-noise ratio. The consequence of this reduction in resolution is that key fine-scale interactions (e.g., those examined in eQTL studies, enhancer/promoter interactions, chromatin looping events) may not be observed. A correct interpretation of Hi-C IF matrices relies on representing the observed data at the proper resolution, which involves a trade-off between signal and noise.
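To make the fixed-resolution step concrete, the following is a minimal sketch of binning an RF-level IF matrix into equally sized genomic bins. The function name `bin_if_matrix`, the use of fragment midpoints to assign bins, and the toy inputs are assumptions made for illustration; they are not taken from any specific Hi-C pipeline.

```python
import numpy as np

def bin_if_matrix(rf_matrix, rf_midpoints, bin_size):
    """Aggregate a fragment-level IF matrix into fixed-size genomic bins.

    Hypothetical helper: each restriction fragment is assigned to a bin
    by the genomic coordinate of its midpoint (an assumption of this sketch).
    """
    rf_matrix = np.asarray(rf_matrix, dtype=float)
    n_frags = rf_matrix.shape[0]
    # Map each fragment to the index of the bin containing its midpoint.
    bin_idx = (np.asarray(rf_midpoints) // bin_size).astype(int)
    n_bins = int(bin_idx.max()) + 1
    # Indicator matrix: rows are bins, columns are fragments.
    membership = np.zeros((n_bins, n_frags))
    membership[bin_idx, np.arange(n_frags)] = 1.0
    # Sum all fragment pairs that fall into the same pair of bins.
    return membership @ rf_matrix @ membership.T

# Toy example: four fragments on a 100 bp region, binned at 50 bp.
rf = np.array([[0, 1, 0, 1],
               [1, 0, 1, 0],
               [0, 1, 0, 0],
               [1, 0, 0, 0]])
binned = bin_if_matrix(rf, rf_midpoints=[10, 40, 60, 90], bin_size=50)
```

Summing over bins raises the count per cell, but any two fragments that share a bin pair become indistinguishable, which is the loss of resolution described above.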
We describe two adaptive density estimation (ADE) techniques that account for the changing density of RF-interactions across a Hi-C IF matrix, reducing noise while retaining the highest possible resolution. The first is a novel application of a Markov Random Field (MRF) to Hi-C data. To estimate the true IF from Hi-C data, the MRF considers both (i) the immediate neighborhood of RF-interactions and (ii) topologically associating domain (TAD) boundaries. The second ADE algorithm is a kernel density estimation approach that uses a dynamic bandwidth to take surrounding RF-interactions into account.
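The idea of a bandwidth that adapts to local density can be illustrated with a deliberately simple sketch. It is not the algorithm described here: it swaps in a uniform (box) kernel whose radius grows around each matrix cell until the window holds a minimum number of contacts. The function name `adaptive_box_smooth` and the parameters `min_count` and `max_radius` are hypothetical.

```python
import numpy as np

def adaptive_box_smooth(counts, min_count=5, max_radius=10):
    """Smooth a contact matrix with a locally chosen window radius.

    Illustrative assumption: a box kernel stands in for the kernel
    density estimator; dense regions keep small windows (high
    resolution), sparse regions are averaged over larger ones.
    """
    counts = np.asarray(counts, dtype=float)
    n_rows, n_cols = counts.shape
    est = np.zeros_like(counts)
    for i in range(n_rows):
        for j in range(n_cols):
            for r in range(max_radius + 1):
                # Square window of radius r, clipped at the matrix edges.
                window = counts[max(i - r, 0):i + r + 1,
                                max(j - r, 0):j + r + 1]
                # Stop growing once enough contacts are covered.
                if window.sum() >= min_count or r == max_radius:
                    est[i, j] = window.mean()
                    break
    return est

# Toy example: a single strong contact surrounded by empty cells.
toy = np.zeros((3, 3))
toy[1, 1] = 9.0
smoothed = adaptive_box_smooth(toy, min_count=5, max_radius=2)
```

In this toy case the central cell already satisfies `min_count` at radius zero and keeps its value, while the empty neighboring cells borrow signal from a radius-one window.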
We validate our ADE algorithms by demonstrating that, compared to traditional fixed binning approaches, the estimated matrices allow for higher accuracy in identifying true positive/negative contacts and yield a lower error when predicting IF across varying sequencing depths. True positive interactions are those Hi-C RF-interactions found to be mediated by RNA polymerase II and CTCF (as identified by ChIA-PET, a 3C technology that incorporates chromatin immunoprecipitation). We also show that ADE Hi-C IF matrices correlate better with 5C data (a targeted sequencing 3C technology with high sequencing depth) when observing the same genomic regions.