Virtual ChIP-seq: Prediction of Transcription Factor Binding by Learning from the Transcriptome

Mehran Karimzadeh^1,2*, Michael M. Hoffman^1,2,3

1. Department of Medical Biophysics, University of Toronto; 2. Princess Margaret Cancer Centre; 3. Department of Computer Science, University of Toronto

Introduction: Cancer frequently harbors non-coding mutations, but we know the causal role of only a few of these mutations in oncogenesis. Most known causal non-coding mutations alter transcription factor binding. ChIP-seq is the gold standard method for identifying transcription factor binding sites. Using ChIP-seq on patient samples, however, is hampered by the limited amount of available biological material and the cost of the experiment. While existing computational methods for predicting regulatory elements are not limited by expense or precious samples, they have low precision.

Approach: Our predictive model, Virtual ChIP-seq, uses ensemble learning to predict transcription factor binding using data on chromatin accessibility, genomic conservation, and binding characteristics of the transcription factor from previous experiments in other cell types. Virtual ChIP-seq also learns from the association of gene expression and transcription factor binding at different genomic regions. Therefore, not only does Virtual ChIP-seq predict indirect transcription factor binding, but it also predicts binding of transcription factors that are not sequence-specific.
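
As an illustration of what "learning from the association of gene expression and transcription factor binding" could look like, the sketch below correlates each gene's expression across training cell types with binding at a single genomic region, then uses those correlations to score a new cell type. This is a minimal sketch under stated assumptions, not the Virtual ChIP-seq implementation: the use of Pearson correlation, the gene names, the toy values, and the functions `pearson`, `expression_association`, and `expression_score` are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical training data: expression of three genes in four cell types.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [4.0, 3.0, 2.0, 1.0],
    "GENE_C": [2.0, 2.0, 2.0, 2.0],
}
# Hypothetical ChIP-seq outcome at one genomic region in the same four
# cell types (1 = bound, 0 = unbound).
binding_at_region = [0, 0, 1, 1]


def expression_association(expression, binding):
    """Per-gene correlation between expression and binding at one region."""
    return {gene: pearson(values, binding) for gene, values in expression.items()}


def expression_score(associations, new_expression):
    """Association-weighted sum of a new cell type's expression values."""
    return sum(associations[gene] * new_expression[gene] for gene in associations)


associations = expression_association(expression, binding_at_region)

# Two hypothetical new cell types with RNA-seq but no ChIP-seq data.
likely_bound = expression_score(
    associations, {"GENE_A": 4.0, "GENE_B": 1.0, "GENE_C": 2.0}
)
likely_unbound = expression_score(
    associations, {"GENE_A": 1.0, "GENE_B": 4.0, "GENE_C": 2.0}
)
```

In a sketch like this, the resulting score would be only one input feature; it would be combined with chromatin accessibility, conservation, and prior binding evidence by the downstream learner.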

Results: Virtual ChIP-seq predicts transcription factor binding with high precision in cell types or tissues with available RNA-seq and chromatin accessibility data. We show that incorporating existing ChIP-seq data into our model removes the need to represent transcription factor sequence preferences in the form of position weight matrix scores. Relying solely on transcription factor sequence preference produces many false-positive and false-negative results. Virtual ChIP-seq outperforms other available tools and can provide mechanistic interpretation for the regulatory effects of polymorphisms and somatic mutations in the non-coding genome.
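
For readers unfamiliar with position weight matrix scores, the sketch below shows the standard log-odds scoring of a DNA sequence against a motif. It is only an illustration: the 4-bp motif, the uniform background frequency, and the functions `pwm_score` and `best_pwm_hit` are hypothetical and are not taken from Virtual ChIP-seq.

```python
from math import log2

# Hypothetical position probability matrix for a 4-bp motif whose
# consensus is ACGT (one dict of base probabilities per motif position).
ppm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform background base frequency


def pwm_score(ppm, kmer):
    """Sum of per-position log-odds (motif probability vs. background)."""
    return sum(log2(column[base] / BACKGROUND) for column, base in zip(ppm, kmer))


def best_pwm_hit(ppm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(ppm)
    windows = range(len(sequence) - width + 1)
    return max((pwm_score(ppm, sequence[i:i + width]), i) for i in windows)


score, start = best_pwm_hit(ppm, "TTACGTTT")
```

Because the score depends on the sequence alone, the same motif occurrence receives the same score in every cell type, whether or not the site is accessible or bound there. This is one reason sequence preference on its own can yield both false-positive and false-negative predictions.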