Prepare input data • TOP

TOP predicts the quantitative TF occupancy from ChIP-seq around candidate TF binding sites using the site-centric approach.

We first identify candidate binding sites by motif scanning with a permissive threshold (using FIMO).

Then, for each cell type, we count (normalized) DNase and/or ATAC cleavage events occurring within 100 bp of the candidate binding site.

Similarly, we quantify TF occupancy in terms of ChIP-seq read counts within 100 bp of the candidate binding site, and this serves as the target of our regression when training TOP.

We simplify the chromatin accessibility data into predictive features using five bins that aggregate the number of cleavage events occurring within the motif itself, as well as within two non-overlapping flanking regions upstream and downstream, using the same binning scheme used in the MILLIPEDE model (Luo and Hartemink, 2013).

Input data

The input data for TOP includes a data frame for each TF x cell type of interest:

The input data frame (see example below) should contains:

Columns of the candidate binding sites: chr, start, end, site name, PWM score, strand, (and optionally p-value) from FIMO motif scanning.
Columns of DNase-seq or ATAC-seq bin counts: Five bins (MILLIPEDE M5 bins) around motif matches.
Optional: one column of ChIP-seq measurement, only needed if you want to train your own models. It could be quantitative TF occupancy (asinh transformed ChIP-seq read counts) or binary TF binding labels (obtained using ChIP-seq peaks).

You can include additional steps to correct DNase-seq or ATAC-seq bias. Yet, our binning approach (with M5 bins) should make TOP robust to DNase-seq or ATAC-seq cleavage bias.

If you have data for many TFs in many cell types or conditions, We recommend using Snakemake to automate the whole process, and we provided some example pipelines in our companion website to automate the data processing for many TFs in many cell types.

You can follow the procedure below, use our pipeline, or write your own scripts/pipelines to prepare the input data.

In order to prepare input data using TOP functions, please have the following R packages installed: GenomicRanges, Rsamtools, data.table, ggplot2, as well as the following command line tools: bedtools, bwtool, and fimo from the MEME suite and bedGraphToBigWig and bigWigAverageOverBed from UCSC binary utilities directory.

Example data preparation procedure

Here, we show an example procedure with several steps for preparing input data for CTCF in K562 cell type using the human genome assembly version hg38.

Load TOP R package

library(TOP)

Step 1: Find TF motif matches using FIMO software

To scan for TF motif matches,
we use the FIMO software from the MEME suite.

Download hg38 reference genome FASTA file and save it as hg38.fa.

wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/hg38.analysisSet.fa.gz
gunzip -c hg38.analysisSet.fa.gz > hg38.fa

Generate the chrom.sizes file which will be needed later.

index_fa('hg38.fa', chromsize_file='hg38.chrom.sizes')

Download the CTCF motif file MA0139.1.meme (in MEME format) from JASPAR.

wget https://jaspar.genereg.net/api/v1/matrix/MA0139.1.meme

We use FIMO with the threshold p-value < 1e-5 in most of our analyses. You can also set thresh = "1e-4" (FIMO’s default threshold), if you need more motif matches.

Save results in MA0139.1_1e-5.fimo.txt, which will be used in the next step.

Please set fimo_path to your command line path to executable.

fimo_motif_matches(motif_file='MA0139.1.meme', 
                   sequence_file='hg38.fa', 
                   thresh_pValue=1e-5, 
                   outname='MA0139.1_1e-5.fimo.txt',
                   fimo_path='fimo')

Step 2: Get candidate TF binding sites

We take motif matches obtained from FIMO as candidate binding sites, and add 100 bp flanking regions on both sides of the motifs, then filter candidate sites by FIMO p-value, and filter the candidate sites falling in ENCODE blacklist regions.

Download ENCODE blacklist from ENCODE portal and save as blacklist.hg38.bed.gz.

wget https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz
mv ENCFF356LFX.bed.gz blacklist.hg38.bed.gz

Obtain the candidate sites

# fimo_file: FIMO result file.
# thresh_pValue: FIMO p-value threshold.
# blacklist_file: file with ENOCDE blacklist regions.
sites <- process_candidate_sites(fimo_file='MA0139.1_1e-5.fimo.txt', 
                                 thresh_pValue=1e-5, 
                                 blacklist_file='blacklist.hg38.bed.gz')

Step 3: Count DNase- or ATAC-seq genome-wide cleavage

We use ATAC-seq reads from K562 cell line (ENCODE ID: ENCSR868FGK) for example.

There are three replicate samples in this study. Here we only use one replicate sample (ENCFF534DCE.bam, and we renamed it as K562.ATAC.bam) as an example.

We first sort and index the BAM file, and obtain the total number of mapped reads from the idxstats file, which will be used later when normalizing read counts by library sizes.

# Download the BAM file from ENCODE
wget https://www.encodeproject.org/files/ENCFF534DCE/@@download/ENCFF534DCE.bam
# Rename the bam file
mv ENCFF534DCE.bam K562.ATAC.bam

# This BAM file has already been sorted, so we skip the sorting step. 
sort_index_idxstats_bam('K562.ATAC.bam', sort=FALSE, index=TRUE, idxstats=TRUE)

Next, we count the DNase or ATAC cuts along the genome, and save the genome counts in BigWig files. This step may take a while especially for large BAM files (the example BAM file we used here is very large, ~9.4 GB), but it only needs to be done once. These BigWig files could be for extracting the DNase or ATAC count matrices around the candidate sites for different motifs.

For ATAC-seq, to address the offsets inherent in ATAC-seq reads, we shift ATAC-seq read start positions by aligning the signal across strands, thereby obtaining more accurate Tn5 binding locations (Buenrostro et al., 2013).

# bam_file: sorted BAM file.
# chrom_size_file: file of genome sizes by chromosomes.
# data_type: DNase or ATAC.
# outdir: directory for saving the BigWig files of genome counts, same as outdir in get_sites_counts().
# outname: prefix for the BigWig files, same as genomecount_name in get_sites_counts().
count_genome_cuts(bam_file='K562.ATAC.bam', 
                  chrom_size_file='hg38.chrom.sizes', 
                  data_type='ATAC',
                  outdir='processed_data',
                  outname='K562.ATAC')

Step 4: Get DNase- or ATAC-seq count matrices around candidate sites, then normalize, bin and transform the counts

Get DNase- or ATAC-seq read counts matrix around candidate sites:

We utilize bwtool to extract read counts from the BigWig files generated earlier.

# genomecount_dir: directory for genome counts, same as outdir in count_genome_cuts().
# genomecount_name: file prefix for genome counts, same as outname in count_genome_cuts().
count_matrix <- get_sites_counts(sites,
                                 genomecount_dir='processed_data',
                                 genomecount_name='K562.ATAC')
saveRDS(count_matrix, "processed_data/CTCF.K562.ATAC.counts.mat.rds")

Normalize, bin and transform counts:

# count_matrix: DNase (or ATAC) read counts matrix.
# idxstats_file: the 'idxstats.txt' file generated by sort_index_idxstats_bam().
# ref_size: Reference library size (default: 50 million for ATAC-seq, 100 million for DNase-seq).
# transform: Transformation for DNase (or ATAC) counts (default: 'asinh').
bins <- normalize_bin_transform_counts(count_matrix, 
                                       idxstats_file='K562.ATAC.bam.idxstats.txt', 
                                       ref_size=5e7,
                                       transform='asinh')

Make a data frame for candidate sites combining motif match information and transformed ATAC (or DNase) counts in five MILLIPEDE bins:

combined_data <- data.frame(sites, bins)
colnames(combined_data) <- c('chr','start','end','name','pwm.score','strand','p.value', paste0('bin', 1:ncol(bins)))

saveRDS(combined_data, 'processed_data/CTCF_MA0139.1_1e-5.K562.ATAC.M5.combined.data.rds')

head(combined_data, 3)
#>    chr start   end  name pwm.score strand  p.value      bin1      bin2 bin3
#> 1 chr1 11122 11341 site1   22.4839      - 9.17e-09 0.2232134 0.0000000    0
#> 2 chr1 11180 11399 site2   20.7581      - 5.22e-08 0.0000000 0.2232134    0
#> 3 chr1 24681 24900 site3   16.5645      - 1.31e-06 0.0000000 0.0000000    0
#>   bin4 bin5
#> 1    0    0
#> 2    0    0
#> 3    0    0

We can then apply TOP models to the data to make predictions, see “Predict TF occupancy using trained TOP model” for examples.

Prepare ChIP-seq data (optional if you want to train your own model)

If you want to train your own model, you would also need to prepare ChIP data and add those to your input data.

Download CTCF K562 ChIP-seq BAM files (ENCODE ID: ENCSR000EGM, two replicates: ENCFF172KOJ and ENCFF265ZSP).

# Download the ChIP-seq BAM files
wget https://www.encodeproject.org/files/ENCFF172KOJ/@@download/ENCFF172KOJ.bam
wget https://www.encodeproject.org/files/ENCFF265ZSP/@@download/ENCFF265ZSP.bam

# Rename the BAM files
mv ENCFF172KOJ.bam CTCF.K562.ChIPseq.rep1.bam
mv ENCFF265ZSP.bam CTCF.K562.ChIPseq.rep2.bam

Sort and index the BAM files and obtain the number of mapped reads.

# The BAM files have already been sorted, so we skip the sorting step. 
sort_index_idxstats_bam('CTCF.K562.ChIPseq.rep1.bam', sort=FALSE, index=TRUE, idxstats=TRUE)
sort_index_idxstats_bam('CTCF.K562.ChIPseq.rep2.bam', sort=FALSE, index=TRUE, idxstats=TRUE)

Count ChIP-seq reads around candidate sites (merge ChIP-seq replicates), and normalize to the reference ChIP-seq library size (default: 20 million).

sites_chip <- count_normalize_chip(sites,
                                   chip_bam_files=c('CTCF.K562.ChIPseq.rep1.bam',
                                                    'CTCF.K562.ChIPseq.rep2.bam'),
                                   chrom_size_file='hg38.chrom.sizes')

Combine motif match information, ATAC (or DNase) bins, and ChIP-seq counts into a data frame.

combined_data <- data.frame(sites, bins, chip = sites_chip$chip)
colnames(combined_data) <- c('chr','start','end','name','pwm.score','strand','p.value', 
                                paste0('bin', 1:ncol(bins)), 'chip')
saveRDS(combined_data, 'processed_data/CTCF_MA0139.1_1e-5.K562.ATAC.M5.ChIP.combined.data.rds')

head(combined_data, 3)
#>    chr start   end  name pwm.score strand  p.value      bin1      bin2 bin3
#> 1 chr1 11122 11341 site1   22.4839      - 9.17e-09 0.2232134 0.0000000    0
#> 2 chr1 11180 11399 site2   20.7581      - 5.22e-08 0.0000000 0.2232134    0
#> 3 chr1 24681 24900 site3   16.5645      - 1.31e-06 0.0000000 0.0000000    0
#>   bin4 bin5 chip
#> 1    0    0    0
#> 2    0    0    0
#> 3    0    0    0

If we want to train TOP logistic version to predict TF binding probability, we can instead use binary ChIP labels (from ChIP-seq peaks)

Download ChIP-seq peaks for CTCF in K562.

# Download ChIP-seq peaks
wget https://www.encodeproject.org/files/ENCFF660GHM/@@download/ENCFF660GHM.bed.gz
# Rename the BAM files
mv ENCFF660GHM.bed.gz CTCF.K562.ChIPseq.peaks.bed.gz

sites_chip_labels <- add_chip_peak_labels_to_sites(sites, 
                                                   chip_peak_file='CTCF.K562.ChIPseq.peaks.bed.gz')

combined_data <- data.frame(sites, bins, chip_label = sites_chip_labels$chip_label)
colnames(combined_data) <- c('chr','start','end','name','pwm.score','strand','p.value', 
                             paste0('bin', 1:ncol(bins)), 'chip_label')
saveRDS(combined_data, 'processed_data/CTCF_MA0139.1_1e-5.K562.ATAC.M5.ChIPlabels.combined.data.rds')