Cistrome-GO: a webserver for functional enrichment analysis of transcription factor binding sites

Overview

Uncovering the direct target genes of transcription factors (TFs) and annotating their functions is important for understanding their biological roles. Chromatin immunoprecipitation plus DNA sequencing (ChIP-seq), is a broadly used technique for identifying TF binding sites genome wide. To identify the genes that are directly regulated by the TF we developed a model that integrates the number of binding sites and their distances from a gene’s transcription start site to estimate the TF’s potential for regulating that gene. An alternative approach for identifying TF regulated genes is to determine the genes that are differentially expressed when the activity of the TF is perturbed, through stimulation, repression, knock-down or knockout. Some of the differentially expressed genes in such experiments are not direct targets of the TF so ChIP-seq data is needed to discriminate between directly regulated genes and secondary effects. We previously developed a software package, BETA (Wang et al, Nature Protocols 2013), which combined the regulatory potentials defined by ChIP-seq peaks with relevant differential gene expression data. Cistrome-GO, extends BETA, to provide functional enrichment analysis of gene regulation by TFs in human and mouse.

Cistrome-GO has two working modes. If the user provides both a TF’s ChIP-seq peak file and a differential expression analysis file (upon the TF’s perturbation), Cistrome-GO will perform an ensemble mode analysis based on the integration of the two types of data (Figure 1). If the user uploads only a TF ChIP-seq peak file, Cistrome-GO will perform the analysis in solo mode (Figure 2).

Figure1
Figure 1. The workflow of Cistrome-GO (ensemble mode)
Figure2
Figure 2. The workflow of Cistrome-GO (solo mode)

Method summary

1. Classification of TF as promoter-dominant or enhancer-dominant type

A TF can regulate the expression of its target genes by binding cis-regulatory elements, including promoters and enhancers. For some TFs, such as MYC, many binding sites are close to (< 1 kb) transcription start sites (TSS) of target genes. Such TFs are of the promoter-dominant type. On the other hand, the binding sites for TFs such as AR and ESR1 tend to be less enriched in promoters. These TFs are of the enhancer-dominant type.

When user uploads a TF’s ChIP-seq peak file, Cistrome-GO calculates the percentages of the most significant ChIP-seq peaks (top 2000, 5000, 10000 and all peaks separately, if applicable; ranked by -log10p-value) within 1kb of the TSS. If any of the calculated percentages is larger than 20%, the TF will be classified as promoter-dominant type. Otherwise, it will be classified as enhancer-dominant type. In Cistrome-GO, promoter-dominant type and enhancer-dominant type TFs have different parameters to calculate the regulatory potential score (RPscore).

2. Calculation of RPscore

The regulatory potential score (RPscore) of a gene is its likelihood of being regulated by a TF. In Cistrome-GO, the formula of RPscore calculation is (Figure 3),

The parameter d0 is the half-decay distance, with default value 1 kb for promoter-type TFs and 10 kb for enhancer-type TFs. (Users can also specify the parameter to other values in the advanced settings). All k binding sites near the TSS of gene g (within the distance 15 * d0) will be used in the calculation, and di is the distance between the ith peak’s center and TSS.

For a given input ChIP-seq peak file, an RPscore is calculated for each gene. The gene rank RRP is derived by decreasing RPscore.

Figure 3. Schema of regulatory potential score (RPscore) calculation for one gene
3. Integration with differential expression data by rank product

If the user provides a differential expression analysis file in addition to a ChIP-seq peak file, an additional gene rank RDE is derived based on the increasing of P-value. In the ensemble mode of Cistrome-GO, the two ranks (RRP and RDE) are integrated by rank product. For example, if genei’s ranks are RRPi and RDEi separately, its rank product score is RRPi * RDEi. The integrated gene rank is derived based on the rank product score. The genes at the top of the list tend to have higher RPscores, together with more significant expression changes upon the TF’s perturbation; therefore, they are more likely to be true direct targets of the TF. In the advances settings, users can opt to analyze up-regulated genes, down-regulated genes, or all differentially expressed gene.

4. GO and pathway analysis based on gene ranking

In the Cistrome-GO ensemble mode, the gene rank is based on the rank product score. In the solo mode, the gene rank is based on the RPscore by itself. In both modes, the top-ranked genes are most likely to be directly regulated by the TF. In order to determine the Gene Ontology (GO) terms and KEGG pathways that are significantly enriched in the genes at the top of the list in a threshold-free way, Cistrome-GO applies the minimum hypergeometric (mHG) test on the rank for each GO terms and pathways. An enrichment score, p-value, and an FDR are calculated for each GO term or pathway, and only the terms and pathways with FDR smaller than 0.2 will be reported.

Input files

1. ChIP-seq peak file (required)

The input ChIP-seq peak file is a tab-separated plain-text file in Browser Extensible Data (BED) format. This type of file is generated as output from our MACS2 ChIP-seq analysis software. The BED format plain-text file has at least the following 5 tab-separated columns.

  1. chromosome
  2. chromStart
  3. chromEnd
  4. name
  5. score

The 5th column is the peak score, with larger numbers representing more confident peak calls, and is used by Cistrome-GO to rank the peaks. In MACS2 the score is the -log10 (p-value). Click here for an example peak file.

2. Differential expression analysis file (optional)

An optional differential expression (DE) analysis input file can augment the ChIP-seq information. This file contains the results of a DE analysis of the transcriptional response to a perturbation of the TF of interest. The output file of the DESeq2 RNA-seq analysis can be uploaded as the input differential expression analysis file directly. This is a plain text file tab delimited with the following 7 tab-separated columns. Click here for an example file.

  1. geneName
  2. baseMean
  3. log2FoldChange
  4. log fold change Standard Error
  5. statistic
  6. pvalue
  7. adj-pvalue

For the 1st column (geneName), Cistrome-GO supports official gene symbols, RefSeq, and Ensembl IDs. The 7th column (adj-pvalue) is used in Cistrome-GO to rank genes. The 3rd column (log2FoldChang) is used to distinguish between up-regulated and down-regulated genes. Cistrome-GO also supports a plain-text differential expression analysis file with three columns.Click here for an example file.

  1. geneName
  2. logFoldChange
  3. p-value or FDR

Parameters

1. Species

Cistrome-GO supports two species (human and mouse) and two genome assembly versions for each species (hg38 and hg19 for human; mm10 and mm9 for mouse).

2. Peak number to use

The number of the most significant ChIP-seq peaks used to calculate the RPscore. The peaks are ranked based on decreasing of –log10 p-value. The default value is 10,000. The user can set this parameter by a number, for example 20000, or the characters all, which means using all peaks.

3. Customized half-decay distance (kb)

In Cistrome-GO, the promoter-dominant and enhancer-dominant TF types use different decay distances d0 in the calculation of the RPscore (1 kb for the promoter-dominant type and 10kb for the enhancer-dominant type by default). The user can also select a half-decay distance.

4. Expression file format

The users can indicate the format of differential expression analysis file. The default format is DESeq2 output format (with the 1st column for gene ID, the 3rd column for logFoldChange and the 7th column for FDR used in Cistrome-GO).

5. Genes for enrichment analysis

The users can choose to only consider up-regulated or down-regulated genes in differential expression analysis file into the Cistrome-GO analysis. The default is to consider all genes in the differential expression analysis file.

Output

1. Bar plot of top peaks percentages at promoters

To determine whether the regulatory type of the TF is promoter- or enhancer-dominant the percentage of binding with 1kb of gene TSSs is determined (1kb between peak center and TSS). This is shown in a bar-plot of percentages for the top peaks (top 2000, 5000, 10000 and all peaks separately, if applicable; ranked by -log10p-value). If any of the percentages is larger than 20%, the TF is classified as promoter-dominant. Otherwise, it is classified as enhancer-dominant. For the promoter-dominant TFs, a decay distance of 1kb is used in the RPscore calculation; 10kb is used for the enhancer-dominant TFs.

2. ROC and PRC of association between differential expression and peaks (only for ensemble mode)

Firstly, we separate genes in the defferential expression list into up-regulated genes and down-regulated genes using logFoldChange. Then using FDR<=0.05 to get significantly differentially expressed genes and regard them as the true positive. We use RP score to predict whether a gene is significantly differentially expressed. Thus we can get the receiver operating characteristic curve and precision-recall curve and their AUC to show the association between peaks and defferential expressed genes.

3. Table of gene regulatory potential rank

For the solo mode of Cistrome-GO, the output table contains the following 5 columns.

  1. Gene: official gene symbol
  2. Coordinate: the coordinate of the gene in the genome
  3. Visualization: the external links for visualizing the gene and ChIP-seq peak file in WashU Epigenome Browser and UCSC Genome Browser
  4. RPscore: the calculated regulatory potential score
  5. Rank: the rank of genes based on the decreasing of RPscores

For the ensemble mode of Cistrome-GO, the output table contains the following 6 columns.

  1. Gene: official gene symbol
  2. Coordinate: the coordinate of the gene in the genome
  3. Visualization: the external links for visualizing the gene and ChIP-seq peak file in WashU Epigenome Browser and UCSC Genome Browser
  4. RPscore: the calculated regulatory potential score
  5. DE p-value: the p-value of the gene from the input differential expression analysis file (upon the TF’s perturbation)
  6. Rank: the rank of genes based on the increase of rank product scores
4. Tables of GO and pathway enrichment

The GO and pathway enrichment output contain 4 tables, for KEGG pathways, CC (Cellular Components), MF (Molecular Functions) and BP (Biological Processes) separately. The terms with FDRs smaller than 0.2 are reported. Each table contains the following 4 columns.

  1. GO/KEGG terms: the name and external link for each enriched GO or KEGG term
  2. Enrichment score: the enrichment is defined as (b/n)/(B/N). N: the total number of genes; B: the total number of genes in the considered GO / KEGG term; n: the cutoff of top genes used to get the mHG p-value; b: the number of genes among the top n genes in the considered GO / KEGG term
  3. P-value: the mHG p-value
  4. FDR: Benjamini-Hochberg adjusted p-value
  5. Genes: Gene number of the GO term; Click the number to check RP scores of genes in the GO term

Browser Compatibility

OS Version Chrome Firefox Safari Microsoft Edge
MacOS 10.11.6 71.0.3578.98 64.0 10.1.2 n/a
Linux ubuntu 18.04.1 LTS 71.0.3578.98 61.0.1 n/a n/a
Windows 10 72.0.3610.2 64.0 n/a 44.177763.1.0

FAQ

1. Which genome assemblies can be used by Cistrome-GO ?

Currently, Cistrome-GO supports two species (human and mouse) and two genome assembly versions for each species (hg38 and hg19 for human; mm10 and mm9 for mouse). The user can use liftover function of UCSC Genome Browser to convert ChIP-seq peak file from other genome assembly version to the version supported by Cistrome-GO.

2. Can Cistrome-GO use a ChIP-seq peak file not generated by MACS2?

The input ChIP-seq peak file of Cistrome-GO must be a tab-separated BED format plain-text file, with at least 5 columns. The 5th column is used in Cistrome-GO to rank peaks by the decreasing of the value. The user can use a custom script to convert ChIP-seq peak file to the required format.

3. Can Cistrome-GO be used to analyze a differential expression analysis file based on other types of gene ID (such as Entrez, UniGene, etc)?

Cistrome-GO supports official gene symbol, RefSeq ID, and Ensembl ID. The user can use DAVID gene ID conversion tool to convert the other types of gene ID to one of the above three types.

4. If the user has a differential expression analysis file not generated by DESeq2, how to use Cistrome-GO?

Cistrome-GO also supports a plain-text differential expression analysis file with three columns: 1) geneName, 2) log FoldChange, 3) p-value/FDR. The user can use a custom script to convert differential expression analysis file to the three-column format.

5. When does the user choose the ensemble mode (integration of ChIP-seq and differential expression) of Cistrome-GO?

The user can use the ensemble mode of Cistrome-GO, when the user has RNA-seq data before and after perturbing the TF’s level (such as overexpression, knock-down or knock-out), and the TF’s ChIP-seq data is generated in one of the above two conditions.

6. How are target genes assigned in Cistrome-GO?

Cistrome-GO does not provide a list of genes as a TF’s targets. Instead, it returns a gene rank. In the ensemble mode of Cistrome-GO, the gene rank is based on the increasing of rank product score. In the solo mode, the gene rank is based on the decreasing of RPscore. In both modes, genes in the top of the list are more likely to be directly regulated by the TF.

7. How to cite Cistrome-GO?

The manuscript of Cistrome-GO is in preparation.