Uncovering the direct target genes of transcription factors (TFs) and annotating their functions is important for understanding their biological roles. Chromatin immunoprecipitation plus DNA sequencing (ChIP-seq), is a broadly used technique for identifying TF binding sites genome wide. To identify the genes that are directly regulated by the TF we developed a model that integrates the number of binding sites and their distances from a gene’s transcription start site to estimate the TF’s potential for regulating that gene. An alternative approach for identifying TF regulated genes is to determine the genes that are differentially expressed when the activity of the TF is perturbed, through stimulation, repression, knock-down or knockout. Some of the differentially expressed genes in such experiments are not direct targets of the TF so ChIP-seq data is needed to discriminate between directly regulated genes and secondary effects. We previously developed a software package, BETA (Wang et al, Nature Protocols 2013), which combined the regulatory potentials defined by ChIP-seq peaks with relevant differential gene expression data. Cistrome-GO, extends BETA, to provide functional enrichment analysis of gene regulation by TFs in human and mouse.
Cistrome-GO has two working modes. If the user provides both a TF’s ChIP-seq peak file and a differential expression analysis file (upon the TF’s perturbation), Cistrome-GO will perform an ensemble mode analysis based on the integration of the two types of data (Figure 1). If the user uploads only a TF ChIP-seq peak file, Cistrome-GO will perform the analysis in solo mode (Figure 2).
A TF can regulate the expression of its target genes by binding cis-regulatory elements, including promoters and enhancers. For some TFs, such as MYC, many binding sites are close to (< 1 kb) transcription start sites (TSS) of target genes. Such TFs are of the promoter-dominant type. On the other hand, the binding sites for TFs such as AR and ESR1 tend to be less enriched in promoters. These TFs are of the enhancer-dominant type.
When user uploads a TF’s ChIP-seq peak file, Cistrome-GO calculates the percentages of the most significant ChIP-seq peaks (top 2000, 5000, 10000 and all peaks separately, if applicable; ranked by -log10p-value) within 1kb of the TSS. If any of the calculated percentages is larger than 20%, the TF will be classified as promoter-dominant type. Otherwise, it will be classified as enhancer-dominant type. In Cistrome-GO, promoter-dominant type and enhancer-dominant type TFs have different parameters to calculate the regulatory potential score (RPscore).
The regulatory potential score (RPscore) of a gene is its likelihood of being regulated by a TF. In Cistrome-GO, the formula of RPscore calculation is (Figure 3),
The parameter d0 is the half-decay distance, with default value 1 kb for promoter-type TFs and 10 kb for enhancer-type TFs. (Users can also specify the parameter to other values in the advanced settings). All k binding sites near the TSS of gene g (within the distance 15 * d0) will be used in the calculation, and di is the distance between the ith peak’s center and TSS.
For a given input ChIP-seq peak file, an RPscore is calculated for each gene. The gene rank RRP is derived by decreasing RPscore.
If the user provides a differential expression analysis file in addition to a ChIP-seq peak file, an additional gene rank RDE is derived based on the increasing of P-value. In the ensemble mode of Cistrome-GO, the two ranks (RRP and RDE) are integrated by rank product. For example, if genei’s ranks are RRPi and RDEi separately, its rank product score is RRPi * RDEi. The integrated gene rank is derived based on the rank product score. The genes at the top of the list tend to have higher RPscores, together with more significant expression changes upon the TF’s perturbation; therefore, they are more likely to be true direct targets of the TF. In the advances settings, users can opt to analyze up-regulated genes, down-regulated genes, or all differentially expressed gene.
In the Cistrome-GO ensemble mode, the gene rank is based on the rank product score. In the solo mode, the gene rank is based on the RPscore by itself. In both modes, the top-ranked genes are most likely to be directly regulated by the TF. In order to determine the Gene Ontology (GO) terms and KEGG pathways that are significantly enriched in the genes at the top of the list in a threshold-free way, Cistrome-GO applies the minimum hypergeometric (mHG) test on the rank for each GO terms and pathways. An enrichment score, p-value, and an FDR are calculated for each GO term or pathway, and only the terms and pathways with FDR smaller than 0.2 will be reported.
The input ChIP-seq peak file is a tab-separated plain-text file in Browser Extensible Data (BED) format. This type of file is generated as output from our MACS2 ChIP-seq analysis software. The BED format plain-text file has at least the following 5 tab-separated columns.
The 5th column is the peak score, with larger numbers representing more confident peak calls, and is used by Cistrome-GO to rank the peaks. In MACS2 the score is the -log10 (p-value). Click here for an example peak file.
An optional differential expression (DE) analysis input file can augment the ChIP-seq information. This file contains the results of a DE analysis of the transcriptional response to a perturbation of the TF of interest. The output file of the DESeq2 RNA-seq analysis can be uploaded as the input differential expression analysis file directly. This is a plain text file tab delimited with the following 7 tab-separated columns. Click here for an example file.
For the 1st column (geneName), Cistrome-GO supports official gene symbols, RefSeq, and Ensembl IDs. The 7th column (adj-pvalue) is used in Cistrome-GO to rank genes. The 3rd column (log2FoldChang) is used to distinguish between up-regulated and down-regulated genes. Cistrome-GO also supports a plain-text differential expression analysis file with three columns.Click here for an example file.
Cistrome-GO supports two species (human and mouse) and two genome assembly versions for each species (hg38 and hg19 for human; mm10 and mm9 for mouse).
The number of the most significant ChIP-seq peaks used to calculate the RPscore. The peaks are ranked based on decreasing of –log10 p-value. The default value is 10,000. The user can set this parameter by a number, for example 20000, or the characters all, which means using all peaks.
In Cistrome-GO, the promoter-dominant and enhancer-dominant TF types use different decay distances d0 in the calculation of the RPscore (1 kb for the promoter-dominant type and 10kb for the enhancer-dominant type by default). The user can also select a half-decay distance.
The users can indicate the format of differential expression analysis file. The default format is DESeq2 output format (with the 1st column for gene ID, the 3rd column for logFoldChange and the 7th column for FDR used in Cistrome-GO).
The users can choose to only consider up-regulated or down-regulated genes in differential expression analysis file into the Cistrome-GO analysis. The default is to consider all genes in the differential expression analysis file.
To determine whether the regulatory type of the TF is promoter- or enhancer-dominant the percentage of binding with 1kb of gene TSSs is determined (1kb between peak center and TSS). This is shown in a bar-plot of percentages for the top peaks (top 2000, 5000, 10000 and all peaks separately, if applicable; ranked by -log10p-value). If any of the percentages is larger than 20%, the TF is classified as promoter-dominant. Otherwise, it is classified as enhancer-dominant. For the promoter-dominant TFs, a decay distance of 1kb is used in the RPscore calculation; 10kb is used for the enhancer-dominant TFs.
Firstly, we separate genes in the defferential expression list into up-regulated genes and down-regulated genes using logFoldChange. Then using FDR<=0.05 to get significantly differentially expressed genes and regard them as the true positive. We use RP score to predict whether a gene is significantly differentially expressed. Thus we can get the receiver operating characteristic curve and precision-recall curve and their AUC to show the association between peaks and defferential expressed genes.
For the solo mode of Cistrome-GO, the output table contains the following 5 columns.
For the ensemble mode of Cistrome-GO, the output table contains the following 6 columns.
The GO and pathway enrichment output contain 4 tables, for KEGG pathways, CC (Cellular Components), MF (Molecular Functions) and BP (Biological Processes) separately. The terms with FDRs smaller than 0.2 are reported. Each table contains the following 4 columns.
|Linux||ubuntu 18.04.1 LTS||71.0.3578.98||61.0.1||n/a||n/a|
Currently, Cistrome-GO supports two species (human and mouse) and two genome assembly versions for each species (hg38 and hg19 for human; mm10 and mm9 for mouse). The user can use liftover function of UCSC Genome Browser to convert ChIP-seq peak file from other genome assembly version to the version supported by Cistrome-GO.
The input ChIP-seq peak file of Cistrome-GO must be a tab-separated BED format plain-text file, with at least 5 columns. The 5th column is used in Cistrome-GO to rank peaks by the decreasing of the value. The user can use a custom script to convert ChIP-seq peak file to the required format.
Cistrome-GO supports official gene symbol, RefSeq ID, and Ensembl ID. The user can use DAVID gene ID conversion tool to convert the other types of gene ID to one of the above three types.
Cistrome-GO also supports a plain-text differential expression analysis file with three columns: 1) geneName, 2) log FoldChange, 3) p-value/FDR. The user can use a custom script to convert differential expression analysis file to the three-column format.
The user can use the ensemble mode of Cistrome-GO, when the user has RNA-seq data before and after perturbing the TF’s level (such as overexpression, knock-down or knock-out), and the TF’s ChIP-seq data is generated in one of the above two conditions.
Cistrome-GO does not provide a list of genes as a TF’s targets. Instead, it returns a gene rank. In the ensemble mode of Cistrome-GO, the gene rank is based on the increasing of rank product score. In the solo mode, the gene rank is based on the decreasing of RPscore. In both modes, genes in the top of the list are more likely to be directly regulated by the TF.
The manuscript of Cistrome-GO is in preparation.