Details about the methodology of each analysis module in AnnoLnc

All input sequences are aligned against the reference human genome hg19 with Blat (Kent, 2002). When a single sequence is aligned in multiple places, genome-wide best alignments are identified by standard pslSort and pslReps. In case of false-positive junction sites caused by mismatches or small indels, putative exons shorter than 20bp (as well as putative introns shorter than 40bp) are discarded. The derived splicing structure is compared with known transcripts in the gene model file by Cuffcompare (see details in "Pre-calculated expression profiles of known transcripts"). And known transcripts within upstream and downstream 5Kb are extracted and sent to the UCSC genome browser to be presented as genomic context of the lncRNA in pre-tuned custom track.

RNAfold v2.0.7 in the ViennaRNA package (Lorenz, et al., 2011) is employed to predict secondary structures, with option “--noLP” enabled to avoid undesirable isolated base pairs. When multiple candidates available, the one with minimum free energy is kept, as being recommended by the authors of ViennaRNA package.

The expression profile of input transcripts is estimated based on 64 RNA-Seq datasets (covering normal adult tissues, tumor cell lines as well as human embryonic stem cell lines). We mapped reads of 34 normal tissue/cell line samples to the human genome hg19 by TopHat (v1.4.1.1). Bam files of 30 cancer samples were downloaded directly from the CGHub (see here for the number of mapped reads and CGHub IDs).

Pre-calculated expression profiles of known transcripts

We generated a gene model (GM) gtf file covering human lncRNAs in lncRNAdb v2.0 (Quek, et al., 2015) based on GENCODE v19. Firstly, we downloaded human lncRNA sequences in the lncRNAdb and got transcript structures as described in "Genomic location". These transcript structures were compared with GENCODE v19 by Cuffcompare v2.1.1. If the code is “=” or “c”, the lncRNA was replaced by known transcript; otherwise it was considered as a “novel transcript” and merged into GENCODE v19. The expression of all annotated transcripts in the gene model file was pre-calculated by StringTie v1.0.4 with the options “–e -b”, and then normalized by the geometric method in normal and cancer samples separately.

On-the-fly expression estimation of input transcripts

Taking advantage of the local nature of RNA-Seq, we developed a novel quantification method named LocExpress for real-time estimating expression level based on pre-mapped reads (Hou et al., manuscript in preparation). Briefly, LocExpress will infer the Minimum Spanning Bundle (MSB) of an input transcript based on its genomic coordination as well as reads mapped in corresponding regions. Then the estimated relative abundance will be further adjusted and normalized based on original size (i.e. the total mass derived from all mapped reads) and reported in canonical FPKM unit. In our preliminary assessment, the LocExpress method usually takes less than one minute for one novel lncRNA across all samples, and the result is highly consistent with of the classical method (see the figure below). For the normal sample set, the FPKM of two replicates of each tissue/cell line are averaged to report to users.

34 normal samples and 30 cancer samples are treated separately. To avoid the duplicated GO annotation for isoforms, we firstly got expression profiles at the gene level by adding the FPKM of all transcripts of each gene in the gene model file. Then we filtered these genes as follows, and finally got 29798 genes left in the normal sample set and 25449 in the cancer sample set.

  1. FPKM filter. The sum of FPKM in all samples should be not less than 1.
  2. Tissue specific filter. The tissue specific score is calculated by the function “getsgene” in the R package rsgcc. If a gene has a score larger than 0.85, it’s not considered fit for the co-expression analysis.

For any submitted transcripts passed above filters, the Pearson correlation with genes is calculated. Then highly correlated genes will be reported by a “gradually decreased” criteria to remove putative false positives and retain true positives. If there are more than 10 genes with r>=0.9, we will stop here and do the GO enrichment analysis with these genes. If not, we will see whether there are 10 genes above the cutoff 0.8. This process will continue until the cutoff arrives 0.7. Negatively correlated genes are identified in a similar way. GO enrichment analysis for these correlated genes will be further conducted with the R package GOstats, and significantly enriched GO terms (adjusted p-value <= 0.01) will be reported as putative functional assignments of the input transcript.

AnnoLnc integrated 498 ChIP-Seq datasets covering 159 transcript factors (TFs) in 45 cell lines. Uniform peak files generated by the ENCODE project were downloaded from here. For each input transcript, AnnoLnc will search putative TF binding sites within upstream 5Kb and downstream 1Kb, and report all sites based on their relative position to the transcript, as “upstream transcriptional start site (TSS)”, “overlap with TSS”, “inside the lncRNA loci”, “overlap with transcriptional end site (TES)” and “downstream TES”.

TargetScan v6.0 (Friedman, et al., 2009) is employed to search putative miRNA binding sites. To reduce potential false positive rate, we run the prediction on 87 highly conserved miRNA families derived from miRcode (Jeggari et al. 2012). Then conservation scores in primate, mammal and vertebrate clades for each identified site will be calculated as described in Jeggari et al. For instance, 10 species in primates are used in TargetScan prediction, and if a binding site is identified in eight species, the conservation score in primates will be 8/10 = 0.8. In mammals and vertebrates, scores are calculated in the same way except that “mammals” are “non-primate mammals” (26 species) and “vertebrates” are “non-mammal vertebrates” (10 species). To further highlight high-confident sites, predicted sites will be further screened based on a pre-compiled 61 AGO CLIP-Seq datasets (see "Calling RNA-Protein interactions based on CLIP-Seq data" for more details about CLIP-Seq analysis), and hit sites will be marked as “CLIP-Seq support”.

Calling RNA-Protein interactions based on CLIP-Seq data

CLIP-Seq is the most widely used high-throughput method to detect RNA-Protein interactions. 112 CLIP-Seq datasets were integrated in AnnoLnc, covering 51 RNA binding proteins (RBPs) other than AGO from the GEO. All these data were reanalyzed uniformly in case of methodology bias. Briefly, we first trimmed the adapter by FASTX Clipper and only reads longer than 15nt were kept and mapped to the human genome hg19 by the algorithm BWA-backtrack v0.7.10-r789 with the options “-n 1 -i 0” (allow one alignment error). Then only unique mapped reads were kept. To improve precision, we took stringent criteria for site calling with PIPE-CLIP v1.0.0 (Chen, et al., 2014): FDR cutoffs for both enriched clusters and reliable mutations were set as 0.05 (cross-linking sites in HITS-CLIP data identified by deletion, insertion and substitution were combined).

To assess the reliability of this pipeline, we compared with published works. Hoell et al. defined global RNA targets of FET family proteins by PAR-CLIP. We downloaded raw reads of wild type FET proteins (FUS, EWSR1 and TAF15) from DDBJ (SRA025082) and performed analysis as being described above. To compare with the reported results (Supplementary Data 1, Hoell, et al., 2011), cross-linking sites identified by both methods were mapped to RefSeq IDs. Our pipeline shows pretty high precision (0.95 for FUS, 0.96 for EWSR1, and 0.91 for TAF).

Meanwhile, we further evaluated a HITS-CLIP set generated by Macias et al. for DGCR8 protein. We downloaded raw reads of all the four samples (D8.1, D8.2, T7.1 and T7.2) from GEO (GSE39086) and analyzed them as described above (D8.1 data was excluded since PIPE-CLIP failed to generate cross-linking sites with a “model failed to converge” error). The comparison with the original results downloaded from here also showes good precision (0.89 for D8.2, 0.74 for T7.1, and 0.78 for T7.2).

ab initio prediction of lncRNA-protein interaction

lncPro (Lu, et al., 2013) is employed for ab initio prediction of lncRNA-protein interaction in the whole human proteome. We downloaded 99,459 human protein sequences from the Ensembl, filtered 1,917 ones that cannot be processed by lncPro (containing “*”, “X”, “U” or length not within 30 -30,000 AA), and finally got 97,542 protein sequences. For efficiency, we modified the source code of lncPro to pre-calculate all protein features in batch.

To improve specificity, we further derived the statistical significance of interaction scores reported by lncPro based on empirical NULL distribution generated by random sequence shuffling (see figures below). Interactions with p value <= 0.01 are considered to be significant. To make the results more intuitive, Ensembl IDs are finally converted into HGNC gene symbols. If multiple Ensembl IDs are mapped to one gene symbol, the score with the smallest p value will be reported.

To exhaustively detect genetic association, AnnoLnc first scans all SNPs within the transcript region (upstream 5Kb to downstream 1Kb of each input transcript). Taking one of these SNPs for instance, a SNP is linked with a tagSNP if it is within the haplotype region (defined as r2 > 0.5, tagged by the tagSNP reported in the NHGRI GWAS Catalog (Welter, et al., 2014) (downloaded from the UCSC genome browser). Then this linked SNP, corresponding tagSNPs, traits/diseases, p values, significance (defined as p value <= 5e-8), LD values and populations from which derived these LD values, as well as supporting PubMed ID, will be reported by AnnoLnc.

For each submitted lncRNA, we incorporate the 46 way phyloP score (primates, mammals/placental and vertebrates) from the UCSC genome browser, and the derived allele frequency (DAF) (1000 Genomes Project Consortium, et al., 2010) of the YRI population (Yoruba in Ibadan, Nigeria) from here for every position (if has corresponding score) of both the exon and promoter region (upstream 1Kb). All these scores will be plotted in different tracks in the integrated plot. To give an overall view, we also calculate the mean scores for exon and promoter regions and finally present them in well-organized bar charts. The phastCons conserved elements (Siepel, et al., 2005) are downloaded from the database of UCSC genome browser. The score reported to users is the LOD score. Conserved elements shorter than 20bp are discarded in the table.