ExpressionLncr: A Pipeline for Leveraging Latent Gene Expression Data in lncRNA Studies
1. Centre for Heart Lung Innovation, University of British Columbia, Vancouver, British Columbia, Canada; 2. Department of Medical Genetics, University of British Columbia, Vancouver, British Columbia, Canada
Background: Investigating the role of lncRNAs and other RNAs in regulating the genome is an active and growing area of research. Non-coding RNA databases such as NONCODE and LNCipedia contain 141k and 119k annotated human lncRNAs to date, respectively, so informatics tools are needed to sift through them and investigate functionality en masse. A wealth of functional genomics data is already deposited in NCBI GEO, but harnessing these 80 thousand experiments on 2 million samples effectively can likewise be challenging for those without informatics resources.
Hypothesis: Existing functional genomics data can be leveraged to predict functionality of lncRNAs.
Methods: We have created a program called ExpressionLncr(na) to harness this latent information. It is a pipeline that investigates the potential expression of lncRNAs by leveraging existing NCBI GEO gene expression information. The software sources lncRNAs from lncRNA databases such as NONCODE or LNCipedia, but it is not restricted to lncRNAs: users may also supply their own chromosome features. Features are restricted to reference organisms that have expression probe array annotation information in Ensembl. The tool finds positional overlaps between Ensembl expression probes and lncRNAs. Summary results from the GEO DataSets relevant to these overlapping features are then used to calculate presence or absence of expression at each lncRNA. A positive link between probe expression data and lncRNA position may suggest that the lncRNA is functional and worth further investigation. ExpressionLncr is written as a collection of command-line scripts with an optional graphical interface. The software will be freely available in two formats: as a set of tools for the popular Galaxy bioinformatics web platform, and as an operating-system-independent desktop application. Future work is planned to extend the pipeline to RNA-seq information.
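The overlap and presence/absence steps described above can be illustrated with a minimal Python sketch. It is not the ExpressionLncr implementation: the `Feature` type, the function names, the 1-based inclusive coordinates, the decision to ignore strand, and the fixed expression `threshold` are all assumptions made for illustration, since the abstract does not specify how overlaps or expression calls are computed.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Feature(NamedTuple):
    """A genomic interval (hypothetical type; 1-based, inclusive coordinates)."""
    name: str
    chrom: str
    start: int
    end: int


def overlaps(a: Feature, b: Feature) -> bool:
    """True if two features share at least one base on the same chromosome.

    Strand is ignored here, which is an assumption of this sketch.
    """
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def match_probes(lncrnas: list[Feature], probes: list[Feature]) -> dict[str, list[str]]:
    """Map each lncRNA name to the names of the probes that overlap it."""
    # Group probes by chromosome so each lncRNA is only compared with
    # probes on its own chromosome.
    by_chrom: dict[str, list[Feature]] = defaultdict(list)
    for probe in probes:
        by_chrom[probe.chrom].append(probe)

    return {
        lnc.name: [p.name for p in by_chrom.get(lnc.chrom, []) if overlaps(lnc, p)]
        for lnc in lncrnas
    }


def call_expression(
    matches: dict[str, list[str]],
    probe_values: dict[str, float],
    threshold: float,
) -> dict[str, Optional[bool]]:
    """Call presence (True), absence (False), or no data (None) per lncRNA.

    An lncRNA is called present if any overlapping probe with a value
    reaches `threshold` (an assumed rule, chosen only for illustration).
    """
    calls: dict[str, Optional[bool]] = {}
    for lnc_name, probe_names in matches.items():
        values = [probe_values[p] for p in probe_names if p in probe_values]
        calls[lnc_name] = any(v >= threshold for v in values) if values else None
    return calls


if __name__ == "__main__":
    # Toy data, invented for this example only.
    lncrnas = [Feature("lncA", "chr1", 100, 500), Feature("lncB", "chr2", 100, 200)]
    probes = [
        Feature("p1", "chr1", 450, 474),
        Feature("p2", "chr1", 600, 624),
        Feature("p3", "chr2", 201, 225),
    ]
    matches = match_probes(lncrnas, probes)
    print(matches)
    print(call_expression(matches, {"p1": 8.2, "p2": 3.0}, threshold=6.0))
```

Grouping probes by chromosome keeps the sketch short. With databases of the size quoted in the Background, a sorted sweep or an interval tree would likely be needed to keep the comparison tractable.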
Summary: ExpressionLncr is a bioinformatics pipeline to investigate the functionality of lncRNAs and other chromosomal features by computing positional overlap between features from lncRNA databases and expression probes that have existing gene expression information in NCBI GEO. Exploiting this latent information should help investigators interested in non-coding RNAs to plan new studies and to prioritise candidate non-coding RNAs for molecular biology experiments.