TOP predicts the quantitative TF occupancy from ChIP-seq around candidate TF binding sites using the site-centric approach.
We first identify candidate binding sites by motif scanning with a permissive threshold (using FIMO).
Then, for each cell type, we count (normalized) DNase and/or ATAC cleavage events occurring within 100 bp of the candidate binding site.
Similarly, we quantify TF occupancy in terms of ChIP-seq read counts within 100 bp of the candidate binding site, and this serves as the target of our regression when training TOP.
We simplify the chromatin accessibility data into predictive features using five bins that aggregate the number of cleavage events occurring within the motif itself, as well as within two non-overlapping flanking regions upstream and downstream, using the same binning scheme used in the MILLIPEDE model (Luo and Hartemink, 2013).
The input data for TOP includes a data frame for each TF x cell type of interest:
The input data frame (see example below) should contains:
Columns of the candidate binding sites: chr, start, end, site
name, PWM score, strand, (and optionally p-value) from FIMO
motif scanning.
Columns of DNase-seq or ATAC-seq bin counts: Five bins (MILLIPEDE M5 bins) around motif matches.
Optional: one column of ChIP-seq measurement, only needed if you want to train your own models. It could be quantitative TF occupancy (asinh transformed ChIP-seq read counts) or binary TF binding labels (obtained using ChIP-seq peaks).
You can include additional steps to correct DNase-seq or ATAC-seq bias. Yet, our binning approach (with M5 bins) should make TOP robust to DNase-seq or ATAC-seq cleavage bias.
If you have data for many TFs in many cell types or conditions, We recommend using Snakemake to automate the whole process, and we provided some example pipelines in our companion website to automate the data processing for many TFs in many cell types.
You can follow the procedure below, use our pipeline, or write your own scripts/pipelines to prepare the input data.
In order to prepare input data using TOP
functions,
please have the following R packages installed: GenomicRanges,
Rsamtools,
data.table,
ggplot2, as well as the
following command line tools: bedtools, bwtool, and fimo from the MEME suite and bedGraphToBigWig
and bigWigAverageOverBed
from UCSC binary
utilities directory.
Here, we show an example procedure with several steps for preparing input data for CTCF in K562 cell type using the human genome assembly version hg38.
Load TOP R package
To scan for TF motif matches,
we use the FIMO
software from the MEME
suite.
Download hg38 reference genome FASTA file and save it as
hg38.fa
.
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/hg38.analysisSet.fa.gz
gunzip -c hg38.analysisSet.fa.gz > hg38.fa
Generate the chrom.sizes
file which will be needed
later.
index_fa('hg38.fa', chromsize_file='hg38.chrom.sizes')
Download the CTCF motif file MA0139.1.meme
(in MEME
format) from JASPAR.
wget https://jaspar.genereg.net/api/v1/matrix/MA0139.1.meme
We use FIMO with the threshold p-value < 1e-5
in most
of our analyses. You can also set thresh = "1e-4"
(FIMO’s
default threshold), if you need more motif matches.
Save results in MA0139.1_1e-5.fimo.txt
, which will be
used in the next step.
Please set fimo_path
to your command line path to
executable.
fimo_motif_matches(motif_file='MA0139.1.meme',
sequence_file='hg38.fa',
thresh_pValue=1e-5,
outname='MA0139.1_1e-5.fimo.txt',
fimo_path='fimo')
We take motif matches obtained from FIMO as candidate binding sites, and add 100 bp flanking regions on both sides of the motifs, then filter candidate sites by FIMO p-value, and filter the candidate sites falling in ENCODE blacklist regions.
Download ENCODE blacklist from ENCODE
portal and save as blacklist.hg38.bed.gz
.
wget https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz
mv ENCFF356LFX.bed.gz blacklist.hg38.bed.gz
Obtain the candidate sites
# fimo_file: FIMO result file.
# thresh_pValue: FIMO p-value threshold.
# blacklist_file: file with ENOCDE blacklist regions.
sites <- process_candidate_sites(fimo_file='MA0139.1_1e-5.fimo.txt',
thresh_pValue=1e-5,
blacklist_file='blacklist.hg38.bed.gz')
We use ATAC-seq reads from K562 cell line (ENCODE ID:
ENCSR868FGK
) for example.
There are three replicate samples in this study. Here we only use one
replicate sample (ENCFF534DCE.bam
, and we renamed it as
K562.ATAC.bam
) as an example.
We first sort and index the BAM file, and obtain the total number of mapped reads from the idxstats file, which will be used later when normalizing read counts by library sizes.
# Download the BAM file from ENCODE
wget https://www.encodeproject.org/files/ENCFF534DCE/@@download/ENCFF534DCE.bam
# Rename the bam file
mv ENCFF534DCE.bam K562.ATAC.bam
# This BAM file has already been sorted, so we skip the sorting step.
sort_index_idxstats_bam('K562.ATAC.bam', sort=FALSE, index=TRUE, idxstats=TRUE)
Next, we count the DNase or ATAC cuts along the genome, and save the genome counts in BigWig files. This step may take a while especially for large BAM files (the example BAM file we used here is very large, ~9.4 GB), but it only needs to be done once. These BigWig files could be for extracting the DNase or ATAC count matrices around the candidate sites for different motifs.
For ATAC-seq, to address the offsets inherent in ATAC-seq reads, we shift ATAC-seq read start positions by aligning the signal across strands, thereby obtaining more accurate Tn5 binding locations (Buenrostro et al., 2013).
# bam_file: sorted BAM file.
# chrom_size_file: file of genome sizes by chromosomes.
# data_type: DNase or ATAC.
# outdir: directory for saving the BigWig files of genome counts, same as outdir in get_sites_counts().
# outname: prefix for the BigWig files, same as genomecount_name in get_sites_counts().
count_genome_cuts(bam_file='K562.ATAC.bam',
chrom_size_file='hg38.chrom.sizes',
data_type='ATAC',
outdir='processed_data',
outname='K562.ATAC')
Get DNase- or ATAC-seq read counts matrix around candidate sites:
We utilize bwtool to extract read counts from the BigWig files generated earlier.
# genomecount_dir: directory for genome counts, same as outdir in count_genome_cuts().
# genomecount_name: file prefix for genome counts, same as outname in count_genome_cuts().
count_matrix <- get_sites_counts(sites,
genomecount_dir='processed_data',
genomecount_name='K562.ATAC')
saveRDS(count_matrix, "processed_data/CTCF.K562.ATAC.counts.mat.rds")
Normalize, bin and transform counts:
# count_matrix: DNase (or ATAC) read counts matrix.
# idxstats_file: the 'idxstats.txt' file generated by sort_index_idxstats_bam().
# ref_size: Reference library size (default: 50 million for ATAC-seq, 100 million for DNase-seq).
# transform: Transformation for DNase (or ATAC) counts (default: 'asinh').
bins <- normalize_bin_transform_counts(count_matrix,
idxstats_file='K562.ATAC.bam.idxstats.txt',
ref_size=5e7,
transform='asinh')
Make a data frame for candidate sites combining motif match information and transformed ATAC (or DNase) counts in five MILLIPEDE bins:
combined_data <- data.frame(sites, bins)
colnames(combined_data) <- c('chr','start','end','name','pwm.score','strand','p.value', paste0('bin', 1:ncol(bins)))
saveRDS(combined_data, 'processed_data/CTCF_MA0139.1_1e-5.K562.ATAC.M5.combined.data.rds')
head(combined_data, 3)
#> chr start end name pwm.score strand p.value bin1 bin2 bin3
#> 1 chr1 11122 11341 site1 22.4839 - 9.17e-09 0.2232134 0.0000000 0
#> 2 chr1 11180 11399 site2 20.7581 - 5.22e-08 0.0000000 0.2232134 0
#> 3 chr1 24681 24900 site3 16.5645 - 1.31e-06 0.0000000 0.0000000 0
#> bin4 bin5
#> 1 0 0
#> 2 0 0
#> 3 0 0
We can then apply TOP models to the data to make predictions, see “Predict TF occupancy using trained TOP model” for examples.
If you want to train your own model, you would also need to prepare ChIP data and add those to your input data.
Download CTCF K562 ChIP-seq BAM files (ENCODE ID:
ENCSR000EGM
, two replicates: ENCFF172KOJ
and
ENCFF265ZSP
).
# Download the ChIP-seq BAM files
wget https://www.encodeproject.org/files/ENCFF172KOJ/@@download/ENCFF172KOJ.bam
wget https://www.encodeproject.org/files/ENCFF265ZSP/@@download/ENCFF265ZSP.bam
# Rename the BAM files
mv ENCFF172KOJ.bam CTCF.K562.ChIPseq.rep1.bam
mv ENCFF265ZSP.bam CTCF.K562.ChIPseq.rep2.bam
Sort and index the BAM files and obtain the number of mapped reads.
# The BAM files have already been sorted, so we skip the sorting step.
sort_index_idxstats_bam('CTCF.K562.ChIPseq.rep1.bam', sort=FALSE, index=TRUE, idxstats=TRUE)
sort_index_idxstats_bam('CTCF.K562.ChIPseq.rep2.bam', sort=FALSE, index=TRUE, idxstats=TRUE)
Count ChIP-seq reads around candidate sites (merge ChIP-seq replicates), and normalize to the reference ChIP-seq library size (default: 20 million).
sites_chip <- count_normalize_chip(sites,
chip_bam_files=c('CTCF.K562.ChIPseq.rep1.bam',
'CTCF.K562.ChIPseq.rep2.bam'),
chrom_size_file='hg38.chrom.sizes')
Combine motif match information, ATAC (or DNase) bins, and ChIP-seq counts into a data frame.
combined_data <- data.frame(sites, bins, chip = sites_chip$chip)
colnames(combined_data) <- c('chr','start','end','name','pwm.score','strand','p.value',
paste0('bin', 1:ncol(bins)), 'chip')
saveRDS(combined_data, 'processed_data/CTCF_MA0139.1_1e-5.K562.ATAC.M5.ChIP.combined.data.rds')
head(combined_data, 3)
#> chr start end name pwm.score strand p.value bin1 bin2 bin3
#> 1 chr1 11122 11341 site1 22.4839 - 9.17e-09 0.2232134 0.0000000 0
#> 2 chr1 11180 11399 site2 20.7581 - 5.22e-08 0.0000000 0.2232134 0
#> 3 chr1 24681 24900 site3 16.5645 - 1.31e-06 0.0000000 0.0000000 0
#> bin4 bin5 chip
#> 1 0 0 0
#> 2 0 0 0
#> 3 0 0 0
If we want to train TOP logistic version to predict TF binding probability, we can instead use binary ChIP labels (from ChIP-seq peaks)
Download ChIP-seq peaks for CTCF in K562.
# Download ChIP-seq peaks
wget https://www.encodeproject.org/files/ENCFF660GHM/@@download/ENCFF660GHM.bed.gz
# Rename the BAM files
mv ENCFF660GHM.bed.gz CTCF.K562.ChIPseq.peaks.bed.gz
sites_chip_labels <- add_chip_peak_labels_to_sites(sites,
chip_peak_file='CTCF.K562.ChIPseq.peaks.bed.gz')
combined_data <- data.frame(sites, bins, chip_label = sites_chip_labels$chip_label)
colnames(combined_data) <- c('chr','start','end','name','pwm.score','strand','p.value',
paste0('bin', 1:ncol(bins)), 'chip_label')
saveRDS(combined_data, 'processed_data/CTCF_MA0139.1_1e-5.K562.ATAC.M5.ChIPlabels.combined.data.rds')