About proChIPdb

Welcome to proChIPdb, the PROkaryotic Chromatin ImmunoPrecipitation DataBase! This tool enables microbiologists to easily browse 271 ChIP-seq and ChIP-exo profiles for various transcription factors (TFs) across 13 organisms. Currently, most of our profiles are for Escherichia coli. We provide curated tables of binding sites, interactive plots and genome viewers, as well as comparisons against literature binding sites.

How to Use this Site

Begin by selecting your organism of interest from the splash page. Next, you can browse the table of transcription factors from our dataset page. proChIPdb covers well-studied transcription factors, as well as a wide scope of E. coli y-TFs (relatively poorly characterized transcription factors).

You can also search for transcription factors or genes using the search button in the upper left corner, or download all site data to do your own custom analysis.

Once you’ve reached a transcription factor dashboard, you can browse its binding sites and target genes in the curated table. Interested in a specific binding site? Click on it to see the peak in our genome viewer. You can also click through the tabs in the upper right panel to see global characterizations of the transcription factor’s binding, such as its binding motif, peak width, peak location relative to its target genes, and the concordance between the data in proChIPdb and other databases of transcription factor binding. For more details on the dashboard components, see the sections below.

It is our hope that proChIPdb will enrich your research by allowing easy access to this compendium of binding intensity data, curated binding sites, and summary statistics. If you are trying to understand how genes are transcriptionally regulated (e.g. differential expression result interpretation, gene module identification, relative binding strength comparison, binding motif analysis), then proChIPdb can provide detailed information relevant to you.

Transcription Factor Dashboards

Each dashboard features a specific transcription factor (for a specific organism and strain) and its ChIP results for at least one condition. Most transcription factors were characterized under M9 media conditions, and some were also characterized under other relevant conditions. In the following section, we include examples of each page element from the E. coli K-12 Fur dashboard.

Metadata Panel

Located in the left-side column of the page, this panel contains basic details about the organism, strain, media used, and supplement(s), if applicable. Hovering over the button adjacent to the media name will reveal additional details about the composition of the media. Supplements are included in detailed, nested dictionaries which describe the concentrations in the units used by their original publications. The accession number from GEO or SRA is also displayed, along with the PMID and DOI of the associated publication.

Links to the transcription factor's page on external databases are also included here, featuring EcoCyc, RegulonDB, UniProt, the Protein Data Bank, Pseudomonas Genome DB, and AureoWiki. Use these to access up-to-date information about the transcription factor, including relevant journal articles and protein structures.

Below the main links, there may be additional links under the heading "iModulons". iModulons are machine-learning derived gene groupings from analysis of transcriptomes, which can be associated with transcription factors. If the page's transcription factor is predicted to regulate any iModulons, then those iModulons will be listed as links here. Since iModulons capture independent signals in a dataset, they may combine the effects of several transcription factors or approximate nonlinear responses as multiple iModulons, which means that some cases (including Fur) will have more than one associated iModulon. Links go to iModulonDB, which has an about page at which you can learn more about this approach. It may be interesting to compare the genes from the iModulon with the binding sites from proChIPdb, as well as to use iModulonDB to learn more about the transcription factor's activity over a wide range of conditions.

The example on the left shows the metadata for Fur. Fur is the ferric uptake regulator, meaning it controls genes relevant to iron transport. It was tested with two conditions: iron-replete (Fe) and iron-starvation (DPD) conditions. In the presence of iron, it is expected that Fur will bind to DNA and repress its target genes. In the presence of DPD, DPD will bind any free iron to create iron starvation conditions, leading to decreased Fur binding (more details available here). More information about Fur is available at the provided links. Fur is a major cellular regulator, so it takes part in the regulation of several iModulons.

Binding Site Table

This table lists all identified binding locations for the transcription factor. Tabs across the top correspond to each binding condition. Each row represents a curated binding peak. Clicking on a row will update the Genome Viewer to display a zoomed-in plot of the corresponding peak. The rows are initially sorted by genome location; to find the tallest peaks, sort by descending "Peak Intensity" (click the column header twice). In most cases, peaks were identified by processing the sequence read data with MACE. The columns are as follows:

  • TF-Condition-#: A unique identifier for the peak, ordered by genome location.
  • Start Position & End Position: The genome coordinates of the peak, in base pairs, as determined by MACE. For each peak, MACE detects the borders at single-nucleotide resolution using the Chebyshev inequality and pairs them using the Gale–Shapley stable matching algorithm. For E. coli K-12 MG1655, these locations refer to the NC_000913.3 genome.
  • Peak Intensity: Peak signal intensity is determined from two replicates using the MACE peak-calling algorithm, which employs Shannon’s relative entropy (H) to compute signal-to-noise values across the genome for each replicate. Note that the noise value varies for each condition because it captures variation due to inexact cross-linking, exonuclease digestion, dynamic protein conformations, and PCR amplification.
  • Closest gene: Each peak is assigned to its closest gene, according to genomic positions. This column will contain a gene name regardless of the distance to the gene.
  • Operon: If a transcriptional unit exists downstream of the binding peak and is within 500 base pairs of the peak, its genes will be listed here. If that operon also has a page in an online database, such as RegulonDB for E. coli, then the operon will appear as a link to that page. There may be two operons if they diverge in both directions downstream of the binding peak.
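
The "Closest gene" and "Operon" rules can be illustrated with a minimal Python sketch. It is not the proChIPdb pipeline: the Feature class, the gap-based distance, the strand handling, and all coordinates are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Feature:
    """A gene or transcriptional unit (hypothetical structure)."""
    name: str
    start: int         # 1-based, inclusive
    end: int           # 1-based, inclusive
    strand: str = "+"  # "+" or "-"

def gap(a_start, a_end, b_start, b_end):
    """Base pairs separating two intervals; 0 if they overlap."""
    return max(0, b_start - a_end, a_start - b_end)

def closest_gene(peak_start, peak_end, genes):
    """Return the gene with the smallest gap to the peak, at any distance."""
    return min(genes, key=lambda g: gap(peak_start, peak_end, g.start, g.end))

def downstream_operons(peak_start, peak_end, tus, max_dist=500):
    """List TUs transcribed away from the peak and within max_dist bp of it."""
    hits = []
    for tu in tus:
        if tu.strand == "+":
            # The 5' end is tu.start; the peak must not lie past it.
            is_downstream = tu.start >= peak_start
        else:
            # The 5' end is tu.end for a reverse-strand TU.
            is_downstream = tu.end <= peak_end
        if is_downstream and gap(peak_start, peak_end, tu.start, tu.end) <= max_dist:
            hits.append(tu.name)
    return hits

# Made-up coordinates: a peak sitting between two divergent units.
units = [
    Feature("geneA", 1000, 2000, "-"),
    Feature("geneB", 2300, 3300, "+"),
    Feature("geneC", 9000, 9500, "+"),
]
nearest = closest_gene(2100, 2150, units)
operons = downstream_operons(2100, 2150, units)
```

In this toy layout the peak is assigned to geneA, while both geneA and geneB are reported as operons because they diverge from the peak in opposite directions.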

In the above table, we can learn a great deal about Fur binding. The DPD tab shows relatively low peak intensities, because DPD induces iron starvation that suppresses Fur binding. Switching to the Fe tab, we can sort by peak intensity to see the strongest binding events when Fur binding is stimulated. For example, the binding site Fur-11 has a very high peak intensity and corresponds to the entCEBAH operon, which produces the iron chelator enterobactin. Note that searching on our search page for any of the target genes in this table would return this page as a result.

Genome Viewer

proChIPdb’s genome viewer provides access to a complete, genome-wide view of the transcription factor’s activity under the given conditions. The genome viewer was made using igv.js, and it visualizes bigWig ChIP read files generated from BAM files using deepTools. From top to bottom, the features and tracks of the viewer are:

  • Toolbar: The toolbar contains a search bar, which can be used to search for specific loci by gene location or gene name. It also has useful tools for adding cursor guides, a center line, toggling track labels, downloading the current view, and adjusting the zoom level.
  • Genome Location: The white bar with red outlines may help orient you to which region of the genome is currently displayed. Underneath that, an axis is labeled with specific base pair numbers.
  • Nucleotides: If the viewer is zoomed close enough, color-coded nucleotide labels will appear.
  • Genes: Gene annotations as provided in the genome. Click on an element to see additional details, such as the gene name, locus tag, and product description.
  • Published TUs: Transcriptional unit annotations, as provided by the Bitome.
  • Published TSS: Transcription start sites, as provided by the Bitome.
  • Published TFBS: Transcription factor binding sites, as provided by the Bitome.
  • Data Tracks: The y-axis of these plots is the number of ChIP reads mapped to the x-axis nucleotide. Replicate numbers and unique conditions (if applicable) are indicated by the track name. To download this data for custom analysis, click “Download bigWig Files” in the upper left corner of the pane.
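
To make the y-axis of the data tracks concrete, here is a small Python sketch that counts how many reads cover each nucleotide in a region. It is only an illustration of per-base coverage, not the deepTools implementation; the function name and the read coordinates are made up.

```python
def per_base_coverage(reads, region_start, region_end):
    """Count reads covering each position of a 1-based, inclusive region.

    `reads` is an iterable of (start, end) pairs, also 1-based and inclusive.
    A difference array keeps the work linear in reads plus region length.
    """
    length = region_end - region_start + 1
    diff = [0] * (length + 1)
    for start, end in reads:
        # Clip each read to the region and skip reads that miss it entirely.
        start = max(start, region_start)
        end = min(end, region_end)
        if start > end:
            continue
        diff[start - region_start] += 1
        diff[end - region_start + 1] -= 1

    coverage = []
    running = 0
    for delta in diff[:length]:
        running += delta
        coverage.append(running)
    return coverage

# Three hypothetical reads; the last one falls outside the region.
coverage = per_base_coverage([(1, 3), (2, 5), (10, 12)], 1, 6)
```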

In the above genome viewer, the bottom four rows show the ChIP data for Fur binding (2 conditions with 2 replicates each). Note that in the DPD condition, the y axis does not reach a very high value because none of the binding events are particularly strong; this means that noise dominates our view in that condition. On the other hand, the Fe condition creates several strong peaks. If you'd like, you can scroll back up to the table for the Fe condition and select a binding site like Fur-11. This will zoom the genome viewer into that binding site to show its specific shape. The two peaks for Fur-11 align with the annotated "Fur" binding sites in the "Published TFBS" row. The target genes (starting with entC) can be seen in the "Genes" row. For our less-studied transcription factors, this view represents a powerful opportunity for discovery.

Feature Visualization Panel

The tabs within this panel provide additional characterization of the data. In the upper right hand corner of each tab, the menu button enables PNG, SVG, and data download.

Feature Visualization – Width

This tab shows a histogram of the binding peak widths from the binding site table. Hover over each bar to see a count of peaks that fall within its corresponding bin.
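
As a rough illustration of how such a histogram could be built from the start and end positions in the binding site table, consider the Python sketch below. The bin size, the inclusive-coordinate convention, and the example peaks are assumptions, not values taken from proChIPdb.

```python
from collections import Counter

def width_histogram(peaks, bin_size=10):
    """Map each bin's lower edge to the number of peaks whose width falls in it.

    `peaks` is an iterable of (start, end) pairs with inclusive coordinates.
    """
    counts = Counter()
    for start, end in peaks:
        width = end - start + 1
        bin_start = (width // bin_size) * bin_size
        counts[bin_start] += 1
    return dict(sorted(counts.items()))

# Hypothetical peaks with widths 25, 30, 32 and 22 bp.
histogram = width_histogram([(100, 124), (200, 229), (300, 331), (400, 421)])
```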

Feature Visualization – Position

In this tab, proChIPdb compares the binding locations relative to each target gene. For each gene (as listed in the Closest Gene column of the Binding Site Table), the distance from the gene start site to the binding site is measured in base pairs and normalized to the length of the gene. Points are then plotted with this value on the x axis and the peak intensity (S/N) on the y axis. Clustering on the x axis indicates the distance at which the transcription factor usually exerts its influence on gene expression. Hover over a point to view more details, such as the gene name.
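
The x-axis value can be sketched in Python as follows. The text above does not specify every detail, so this example assumes that the binding site is represented by the peak midpoint, that the gene start is strand-aware, and that negative values mean the site lies upstream of the gene start.

```python
def relative_position(peak_start, peak_end, gene_start, gene_end, strand):
    """Distance from gene start to peak center, as a fraction of gene length.

    Coordinates are 1-based and inclusive. For a reverse-strand gene the
    start of the gene is its higher coordinate (gene_end).
    """
    peak_center = (peak_start + peak_end) / 2
    gene_length = gene_end - gene_start + 1
    if strand == "+":
        offset = peak_center - gene_start
    else:
        offset = gene_end - peak_center
    return offset / gene_length

# Made-up examples: both peaks sit just upstream of a 1000 bp gene.
forward = relative_position(900, 1000, 1000, 1999, "+")
reverse = relative_position(2049, 2149, 1000, 1999, "-")
```

Under these assumptions, a cluster of points slightly below zero would suggest that the transcription factor typically binds just upstream of its target genes.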

Feature Visualization – Motif

If this tab exists, then it contains a sequence logo of a significantly enriched motif and its corresponding E-value (E < 0.001). Motifs provide valuable insight about which sequences will bind the TF. Sub-tabs across the bottom of the panel allow you to select from each of the conditions in the dashboard. The menu button in the upper right hand corner of each tab allows download of both the image itself and the position weight matrix (PWM) of the motif.

These motifs were generated using MEME-ChIP (parameters: -meme-minw 5, -meme-maxw 45, -meme-nmotifs 4, -filter-thresh 0.001), run on the binding peaks extended by a 20 bp margin.
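
If you want to run a similar analysis on downloaded peaks, the Python sketch below shows one way to extend peaks by a margin and assemble a MEME-ChIP command line. It is an assumption-laden illustration: the function names and file paths are hypothetical, the option spellings follow the parameter names quoted above, and they should be checked against the installed MEME Suite version before use.

```python
def extend_peak(start, end, margin=20, genome_length=None):
    """Widen a peak by `margin` bp on each side, clamped to the genome."""
    new_start = max(1, start - margin)
    new_end = end + margin
    if genome_length is not None:
        new_end = min(genome_length, new_end)
    return new_start, new_end

def meme_chip_command(fasta_path, out_dir):
    """Build (but do not run) an argument list for the meme-chip program."""
    return [
        "meme-chip",
        "-oc", out_dir,              # output directory, overwritten if present
        "-meme-minw", "5",           # minimum motif width
        "-meme-maxw", "45",          # maximum motif width
        "-meme-nmotifs", "4",        # number of motifs to search for
        "-filter-thresh", "0.001",   # E-value threshold for reported motifs
        fasta_path,
    ]

# Hypothetical inputs: a peak near the start of a 100 bp toy genome.
extended = extend_peak(10, 50, margin=20, genome_length=100)
command = meme_chip_command("extended_peaks.fasta", "motif_output")
```

The resulting list could be passed to subprocess.run once the extended peak sequences have been written to the FASTA file.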

In addition, a final tab named using a PMID may be available; this contains the sequence logo as it appears in the publication from which the data were obtained. You are encouraged to refer to the original publications for more details about those sequence logos.

Feature Visualization – Venn

The Venn diagram compares the target genes obtained in proChIPdb to other target gene sets in the literature. Hover over a section of the Venn diagram to see the genes it contains. Areas of agreement indicate strong evidence of direct regulation, and areas of disagreement represent opportunities to improve literature annotation or elucidate condition-specific differences in binding. The specific literature sources are mentioned below each diagram. For E. coli, EcoCyc’s transcriptional regulatory network was used.
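
The regions of a two-set diagram correspond to simple set operations, as in this short Python sketch. The function name and the gene names are placeholders rather than real proChIPdb or EcoCyc content.

```python
def venn_regions(prochipdb_genes, literature_genes):
    """Split two target gene lists into the three regions of a Venn diagram."""
    ours = set(prochipdb_genes)
    theirs = set(literature_genes)
    return {
        "both": ours & theirs,             # agreement between the sources
        "prochipdb_only": ours - theirs,   # candidate new targets
        "literature_only": theirs - ours,  # possibly condition-specific
    }

# Placeholder gene names for illustration only.
regions = venn_regions(["geneA", "geneB", "geneC"], ["geneB", "geneC", "geneD"])
```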

Search

The search page can be reached from any page by clicking "Search" in the upper right hand corner. You have the option to search transcription factors by name (e.g. "AtoC"), by PMID (e.g. "25222563") or by accession number (e.g. "GSE54901"). You can also search genes by name (e.g. "thrA"; includes common synonyms) or locus tag (e.g. "b0002"). Leave both options selected to return all relevant results. Search terms are case-insensitive. Each result that appears below the search bar will be a link to a transcription factor dashboard. The portion of the result that matches your search term will appear bolded and underlined.

Transcription factor results will simply list the name, organism, strain, PMID, and accession number of the matching page. Gene results are associated with specific binding peaks upstream of the gene of interest, so they include additional details:

  • Gene name and locus tag (in parentheses)
  • Binding site identifier in the form "Transcription factor name - #"
  • Organism and strain name
  • Condition name in the form "transcription factor name + condition". The condition may be the media type or the supplement used, for example.
  • Peak strength as a signal-to-noise (S/N) ratio. Note that these values cannot be compared directly across conditions because noise varies from condition to condition, but the peak strength still provides a valuable indicator of how important the peak is in the ChIP profile.
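
The search behavior described above can be approximated offline with a Python sketch like the one below. The record layout is an assumption, substring matching is assumed because matched portions are highlighted, and the PMID, accession, and binding site values are placeholders.

```python
def search(term, tfs, genes, include_tfs=True, include_genes=True):
    """Case-insensitive substring search over TF and gene records."""
    needle = term.lower()
    results = []
    if include_tfs:
        for tf in tfs:
            fields = (tf["name"], tf["pmid"], tf["accession"])
            if any(needle in field.lower() for field in fields):
                results.append(("tf", tf["name"]))
    if include_genes:
        for gene in genes:
            fields = [gene["name"], gene["locus_tag"], *gene.get("synonyms", [])]
            if any(needle in field.lower() for field in fields):
                results.append(("gene", gene["name"], gene["site"]))
    return results

# Hypothetical records; only the names and locus tag come from the examples above.
tfs = [{"name": "AtoC", "pmid": "00000000", "accession": "GSE00000"}]
genes = [{"name": "thrA", "locus_tag": "b0002", "synonyms": [], "site": "TF-1"}]

tf_hits = search("atoc", tfs, genes)
gene_hits = search("B0002", tfs, genes)
tf_only = search("thrA", tfs, genes, include_genes=False)
```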

Download

As described throughout this page, any specific content on a transcription factor dashboard may be downloaded using the buttons in the panels. If you would prefer to download all of the proChIPdb data, you can do so by following the link in the lower left corner of the splash page (or here). The folder is organized as follows:

  • Organisms: The first level allows you to select an organism, named "g_species" (e.g. "e_coli"). JSON files used for searching the database are also stored in the top level folder. These JSON files may be useful as full lists of all TFs or target genes.
  • Strains: After selecting an organism, choose a strain in the second level. Strains are usually named after the RefSeq accession number of the strain's genome. This level also contains a "TF_list" file that lists each TF with some additional metadata, and is used to generate the dataset page and metadata panels. Use the TF_lists as your guide for matching each of the files to their appropriate TFs. If the strain you selected does not have a TF_list available, the TF_list should be in the "all_other" organism folder.
  • Site Content: Each subfolder for a given strain contains the individual files for the TFs, organized by category.
    • annotation: Genome annotations, including GFF files for the four annotation tracks in the genome viewer.
    • binding_widths: Input files for the binding width histograms in the feature visualization panel.
    • bw: BigWig files representing the ChIP data, which are displayed in the genome viewer.
    • curated_input: GFF files containing curated binding peak locations.
    • positions: Input files for the position scatter plot in the feature visualization panel.
    • sequence: FASTA and other sequence files for the strain, used in the genome viewer.
    • table: Binding site tables.
    • venn: Input files for the Venn diagrams in the feature visualization panel.
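
When scripting against the downloaded folder, paths can be assembled from the three levels described above. The Python sketch below assumes a root folder name and a strain folder named after a RefSeq accession; both are illustrative, and the actual names should be taken from the download itself.

```python
from pathlib import PurePosixPath

# Category subfolders listed in the "Site Content" level above.
CATEGORIES = {
    "annotation", "binding_widths", "bw", "curated_input",
    "positions", "sequence", "table", "venn",
}

def category_dir(root, organism, strain, category):
    """Return the folder holding one category of files for one strain."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    return PurePosixPath(root) / organism / strain / category

# Hypothetical root and strain folder names.
bigwig_dir = category_dir("prochipdb_data", "e_coli", "NC_000913.3", "bw")
table_dir = category_dir("prochipdb_data", "e_coli", "NC_000913.3", "table")
```

Pairing the files found in these folders with the strain's TF_list remains the most reliable way to match each file to its transcription factor.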

References

proChIPdb manuscript coming soon!

Methods

Annotation Sources

Additional References

Contact Us

To ask questions, provide feedback, report an issue, or collaborate with us, please email us at be-chip-pro@eng.ucsd.edu.