ChIP-seq data processing and relative and quantitative signal normalization for Saccharomyces cerevisiae

Kris G Alavattam; Bradley M Dickson; Rina Hirano; Rachel Dell; Toshio Tsukiyama

Improve Research Reproducibility A Bio-protocol resource

Preprint

Last updated: Dec 16, 2024

Kris G. Alavattam¹,*, Bradley M. Dickson²,‡, Rina Hirano¹,‡, Rachel Dell¹, and Toshio Tsukiyama¹,*

¹Basic Sciences Division, Fred Hutchinson Cancer Center, Seattle, WA 98109, USA

²Department of Epigenetics, Van Andel Institute, Grand Rapids, MI 49503, USA

‡These authors contributed equally to this work.

*For correspondence, contact kalavattam@gmail.com or ttsukiya@fredhutch.org.

Abstract

Chromatin immunoprecipitation with high-throughput sequencing (ChIP-seq) is a widely used technique for genome-wide analyses of protein-DNA interactions. This protocol provides a guide to ChIP-seq data processing in Saccharomyces cerevisiae, with a focus on signal normalization to address data biases and enable meaningful comparisons within and between samples. Designed for researchers with minimal bioinformatics experience, it includes practical overviews and refers to scripting examples for key tasks, such as configuring computational environments, trimming and aligning reads, processing alignments, and visualizing signals. This protocol introduces siQ-ChIP and normalized (proportional) signal computation for absolute and relative comparisons of ChIP-seq data, respectively. While steps for spike-in normalization are included for context, siQ-ChIP and normalized coverage are recommended as superior alternatives due to their mathematical rigor and reliability. A particular emphasis on S. cerevisiae-specific processing and a focus on robust signal normalization distinguish this protocol from others.

Background

Chromatin immunoprecipitation followed by high-throughput DNA sequencing (ChIP-seq) is a widely used technique for studying protein-DNA interactions across the genome (Barski et al., 2007; Johnson et al., 2007; Mikkelsen et al., 2007; Robertson et al., 2007). ChIP-seq identifies regions bound by proteins such as histones, transcription factors, and other chromatin-associated factors, making it central to chromatin biology, epigenetics, and other fields. Typically performed on many cells at once, the method begins with the cross-linking of chromatin to capture DNA-protein interactions. The cross-linked chromatin is then isolated, fragmented, and immunoprecipitated using antibodies specific to the target protein. The associated DNA is recovered and sequenced with next-generation sequencing (NGS) technology. The sequenced reads (see General Note #1) are aligned to a reference genome (see General Note #2), and the resulting alignments are processed to generate a genome-wide “signal” that reflects the frequency of DNA regions interacting with the target protein.

ChIP-seq signal, typically shown as a histogram of read alignment coverage along the genome (x-axis), lends itself to comparisons of protein distributions within and across samples. However, signal variability makes it difficult to link enrichment levels (y-axis) to the biological activity of a protein, particularly across different experimental conditions. Factors such as cell state, cell number, cross-linking, fragmentation, DNA amplification, library preparation, and sequencing conditions complicate establishing a consistent scale for comparing protein enrichment, while poor antibody specificity can further undermine accuracy (Dickson et al., 2020; Jain et al., 2015; Marx, 2019; Park, 2009; Uhlen et al., 2016).

To address variability in ChIP-seq signal, researchers have developed various normalization methods (see General Note #3), including spike-in controls (Bonhoure et al., 2014; Chen et al., 2015; Egan et al., 2016; Grzybowski et al., 2015; Orlando et al., 2014). Spike-in normalization involves adding a known quantity of exogenous chromatin to experimental samples as a reference for signal scaling. However, evidence indicates that spike-ins often fail to reliably support comparisons within and between samples (Dickson et al., 2020, 2023). The recently developed sans spike-in quantitative ChIP (siQ-ChIP) method overcomes these limitations by measuring absolute protein-DNA interactions across the genome (Dickson et al., 2020, 2023). This protocol introduces the computation of siQ-ChIP-adjusted signal for absolute, quantitative comparisons of ChIP-seq data within and between samples, as well as normalized (proportional) coverage (Dickson et al., 2023) for relative comparisons. Although steps for computing spike-in-adjusted coverage are included for context, siQ-ChIP-scaled and normalized coverage are strongly recommended as mathematically rigorous and more effective tools for ChIP-seq analyses.

The protocol also provides guidance on computational setup, data organization, read processing, and signal visualization, with a focus on Saccharomyces cerevisiae and its ribosomal DNA (rDNA) locus, a site of high biological interest (D’Alfonso et al., 2024) that requires appropriate handling in ChIP-seq data processing. To support this, the Procedure section covers program installations and computational environment setup, while the Data Analysis section provides guidance on data acquisition and processing.

Software and datasets

Programs, genomes, and experimental resources.

os	version
macOS	15.1.1 24B91
Ubuntu (Linux)	18.04.6 LTS (Bionic Beaver)

Table 1. Operating systems used to test and run the protocol. Operating systems (OS) and respective versions used to implement, test, and run the protocol.

program	version	os
Atria	4.0.3	macOS, Linux
Awk (BSD)	20200816	macOS
Awk (GNU)	4.1.4	Linux
Bash	3.2.57(1)	macOS
Bash	4.4.20(1)	Linux
Bowtie2	2.5.4	macOS, Linux
Conda	24.7.1	macOS, Linux
GNU Parallel	20170422	macOS, Linux
IGV ("with Java Included")	2.18.4	macOS
Julia	1.8.5	macOS, Linux
Mamba	1.5.9	macOS, Linux
Miniforge3	24.7.1-2	macOS, Linux
Python	3.12.7	macOS, Linux
Samtools	1.21	macOS, Linux
SLURM	24.05.4	Linux
Zsh	5.9	macOS

Table 2. Programs used to test and run the protocol. Programs used to implement, test, and run the protocol are listed here, excluding dependencies and, except for pyBigWig, Python libraries. While most program versions do not need strict adherence, the following version guidelines are recommended for compatibility: Atria (Chuan et al., 2021) 4.0.0 or later, installed with Julia (Bezanson et al., 2012, 2017) 1.8.5; Bash 3.2.0 or later; deepTools (Ramírez et al., 2016) 3.5.0 or later; GNU Parallel (Tange, 2018) 20150222 or later; Python (Van Rossum and Drake, 2009) 3.6.0 or later; and Zsh version 5.0 or later.

species	genome	draft_version	database
Saccharomyces cerevisiae	S288C	R64-5-1 20240529	Saccharomyces Genome Database
Schizosaccharomyces pombe	972h-	2024-10-13	Pombase

Table 3. Species genomes used in the protocol. Species genomes used for alignment, data processing (with the Schizosaccharomyces pombe genome required only for spike-in normalization), and signal visualization, including versions and source databases. Draft versions do not need to be strictly followed. For more information, refer to Data Analysis A, H, and I.

genotype	state	factor	strain	strain_full	vol_in	vol_all	mass_in	mass_ip
WT	G1	Hho1	6336	yTT6336	20	300	72.5	2.7
WT	G1	Hho1	6337	yTT6337	20	300	81.1	5
WT	G2/M	Hho1	6336	yTT6336	20	300	104.9	6.6
WT	G2/M	Hho1	6337	yTT6337	20	300	85.2	6.1
WT	Q	Hho1	6336	yTT6336	20	300	72.7	116.9
WT	Q	Hho1	6337	yTT6337	20	300	69.6	70.6
WT	G1	Hmo1	7750	yTT7750	20	300	79.9	8.4
WT	G1	Hmo1	7751	yTT7751	20	300	63.6	3.2
WT	G2/M	Hmo1	7750	yTT7750	20	300	32.4	5.4
WT	G2/M	Hmo1	7751	yTT7751	20	300	93.4	3.4
WT	Q	Hmo1	7750	yTT7750	20	300	67.9	27.4
WT	Q	Hmo1	7751	yTT7751	20	300	106.6	14.8

Table 4. Experimental samples and parameters used to compute siQ-ChIP α proportionality constants. Parameters include input volumes (“vol_in”) and total volumes before the removal of input (“vol_all”) in microliters (µL), as well as input chromatin masses (“mass_in”) and IP chromatin masses (“mass_ip”) in nanograms (ng), measured during ChIP-seq benchwork (Dickson et al., 2020, 2023). Average fragment lengths in base pairs (bp; “length_in,” “length_ip”) are also required but do not need to be included in the table, as they can be derived directly from sample BAM files during α computation (see Data Analysis G). Additional terms: “genotype,” genetic constitution of sample; “WT,” wild type; “state,” cell-cycle phase; “G₁,” cells in the first gap phase of the cell cycle, preparing for DNA synthesis; “G₂/M,” a mix of cells in the second gap phase (G₂), which follows DNA synthesis, and in mitosis (M), representing the cell-cycle substages leading to and including cell division; “Q,” cells in quiescence, a reversible state of cell-cycle withdrawal; “factor,” immunoprecipitated protein; “Hho1,” S. cerevisiae histone H1; “Hmo1,” S. cerevisiae chromatin-associated high mobility group family member protein; “strain,” S. cerevisiae strain identifier, which can be substituted with “replicate” (or “rep”) and corresponding identifier; “strain_full,” the complete S. cerevisiae identifier that includes a three-letter prefix indicating the lab of origin (“yTT” represents “yeast Toshio Tsukiyama”). For more details, refer to Data Analysis G. Instructions for downloading experimental samples are available in the protocol repository, protocol_chipseq_signal_norm (see Software and Datasets B); see also Data Analysis A.

Companion GitHub repository for protocol implementation.

This protocol is accompanied by the GitHub repository protocol_chipseq_signal_norm (github.com/kalavattam/protocol_chipseq_signal_norm), which contains driver and utility scripts, functions, and a Markdown notebook, workflow.md. The notebook provides code examples with explanatory text for implementing most steps outlined in the Procedure and Data Analysis sections. Repository contents are organized and documented according to principles described by Noble (2009) and Ziemann et al. (2023).

The examples cover the following:

Initializing variables for directory paths, files, and computational environments, among other things.
Defining script parameters.
Organizing input and output files within structured directory systems.
Validating paths, files, dependencies, etc.
Automating experiments with driver scripts that coordinate processes like read trimming, alignment, post-processing, and signal track generation. Most driver scripts accept serialized lists (e.g., comma-delimited strings) of FASTQ or BAM input files, which can be efficiently generated using the utility script find_files.sh (see General Note #4).
Parallelizing tasks on high-performance computing clusters configured to use SLURM (Yoo et al., 2003) or on local or remote systems using GNU Parallel (Tange, 2018).
Capturing detailed logs for all executed commands to support troubleshooting and reproducibility.
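The idea of a serialized (comma-delimited) file list mentioned above can be sketched with standard tools. This is a minimal illustration, not the interface of find_files.sh; the function name serialize_files and the file names are hypothetical.

```bash
# Hypothetical helper: print a sorted, comma-delimited list of the files in a
# directory whose names match a glob pattern
serialize_files() {
    local dir="${1}"      # Directory to search
    local pattern="${2}"  # Glob pattern, e.g., "IP_*.bam"

    find "${dir}" -maxdepth 1 -type f -name "${pattern}" \
        | sort \
        | paste -sd "," -
}

# Toy demonstration with empty placeholder files
dir_tmp="$(mktemp -d)"
touch "${dir_tmp}/IP_WT_G1_Hho1_6336.bam" \
      "${dir_tmp}/IP_WT_G1_Hho1_6337.bam" \
      "${dir_tmp}/in_WT_G1_Hho1_6336.bam"

infiles="$(serialize_files "${dir_tmp}" "IP_*.bam")"
echo "${infiles}"
```

Here, only the two files beginning with "IP_" end up in the string, which could then be passed as a single argument to a driver script.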

The protocol_chipseq_signal_norm repository also includes tab-separated value (TSV) files for downloading experimental datasets (see Data Analysis C) and a TSV file containing essential siQ-ChIP metadata and parameters (see Data Analysis G).

Procedure

Note: This protocol is designed for use on Linux and macOS systems with the programs and genome species described in Tables 1–3. It has not been tested on Windows systems. All code examples have been tested with Bash and Zsh shells.

Install and configure Miniforge.

Miniforge is an open source tool for managing bioinformatics software in isolated environments. This protocol uses two environments: env_align for read alignment and processing, and env_analyze for signal computation and visualization. It is recommended to uninstall other software managers (e.g., Anaconda) before installing Miniforge.

Determine the appropriate Miniforge installer to use.

Check the operating system (OS) and system architecture:

# OS
uname -s # e.g., "Linux" or "Darwin" (for macOS)

# Architecture
uname -m # e.g., "x86_64" for Intel/AMD, "arm64" or "aarch64" for ARM

Download and install Miniforge.

Replace OS and architecture with the system details:

https="https://github.com/conda-forge/miniforge/releases/latest/download"
scr_ins="Miniforge3-OS-architecture.sh"
curl -L -O "${https}/${scr_ins}"
bash "${scr_ins}"

Allow Miniforge to initialize the conda base environment automatically (see Troubleshooting #1).
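As a sketch of how the installer name might be filled in automatically, the snippet below builds it from the output of uname. It assumes that release assets follow the pattern "Miniforge3-OS-architecture.sh" with the strings reported by uname (a pattern the Miniforge README also uses); confirm that a file with the resulting name exists on the releases page before downloading. The function name miniforge_installer is hypothetical, and the commands are printed for review rather than run.

```bash
# Hypothetical helper: build the installer file name from OS and architecture
# strings, defaulting to the values reported by uname
miniforge_installer() {
    local os="${1:-$(uname -s)}"
    local arch="${2:-$(uname -m)}"
    echo "Miniforge3-${os}-${arch}.sh"
}

https="https://github.com/conda-forge/miniforge/releases/latest/download"
scr_ins="$(miniforge_installer)"

# Print, rather than run, the download and install commands
echo "curl -L -O \"${https}/${scr_ins}\""
echo "bash \"${scr_ins}\""
```

For example, miniforge_installer Linux x86_64 prints "Miniforge3-Linux-x86_64.sh".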
Configure the Miniforge channels.

Edit the .condarc file to prioritize conda-forge and bioconda channels (see Troubleshooting #2):

channels:
- conda-forge
- bioconda
channel_priority: flexible

Clone the protocol repository and install project environments.

After configuring Miniforge, clone the protocol repository, protocol_chipseq_signal_norm, and use its script install_envs.sh to set up project environments.

Clone the repository.

mkdir -p "${HOME}/repos"
cd "${HOME}/repos"
git clone https://github.com/kalavattam/protocol_chipseq_signal_norm.git
cd protocol_chipseq_signal_norm

Install environments with install_envs.sh.

bash scripts/install_envs.sh --env_nam "env_align" --yes
bash scripts/install_envs.sh --env_nam "env_analyze" --yes

Install and configure Atria for FASTQ adapter and quality trimming.

To promote accurate alignment of sequenced reads, it is important to remove adapter sequences and low-quality base calls from FASTQ files. For this, we use Atria (Chuan et al., 2021), a tool written in Julia (Bezanson et al., 2012, 2017) that excels in adapter and quality trimming. Follow these steps to install and configure Atria:

Install Julia.

Download Julia 1.8.5 for the appropriate OS and system architecture from the official page, and unpack the file in the HOME directory. Add Julia to the PATH by appending the following line to the shell configuration file (e.g., .bashrc, .bash_profile, or .zshrc), and then source the file:

echo 'export PATH="$PATH:${HOME}/julia-1.8.5/bin"' >> ~/.bashrc
source ~/.bashrc

Replace .bashrc with .bash_profile, .zshrc, etc. as needed.

Clone and build Atria.

Clone the Atria repository, activate env_analyze (the environment containing its dependencies), and build Atria using Julia:

cd "${HOME}/repos"
git clone https://github.com/cihga39871/Atria.git
cd Atria
mamba activate env_analyze
julia build_atria.jl

Add Atria to PATH.

Locate the Atria binary and add its path to the shell configuration file. For example,

echo 'export PATH="$PATH:${HOME}/path/to/Atria/bin"' >> ~/.bashrc
source ~/.bashrc

Replace path/to with the path to the Atria binary (bin) directory, and replace .bashrc with .bash_profile, .zshrc, etc. if a different shell configuration file is used. Ensure the env_analyze environment is active when running Atria.
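Because the echo commands above append a new line each time they are run, repeating the setup can leave duplicate PATH entries in the configuration file. One possible guard is sketched below; the function name add_to_path_once and the example directory are hypothetical, and the demonstration writes to a temporary file rather than a real configuration file.

```bash
# Hypothetical helper: append an "export PATH" line for a directory to a shell
# configuration file only if that exact line is not already present
add_to_path_once() {
    local dir="${1}"   # Directory to add, e.g., "${HOME}/julia-1.8.5/bin"
    local file="${2}"  # Configuration file, e.g., "${HOME}/.bashrc"
    local line="export PATH=\"\$PATH:${dir}\""

    touch "${file}"
    if ! grep -qxF -- "${line}" "${file}"; then
        echo "${line}" >> "${file}"
    fi
}

# Toy demonstration on a temporary file
rc_tmp="$(mktemp)"
add_to_path_once "/path/to/julia-1.8.5/bin" "${rc_tmp}"
add_to_path_once "/path/to/julia-1.8.5/bin" "${rc_tmp}"  # No second line added
cat "${rc_tmp}"
```

After both calls, the temporary file contains a single export line.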

Install and configure Integrative Genomics Viewer (IGV).

Integrative Genomics Viewer (IGV) is a graphical tool for the interactive exploration of ChIP-seq and other genomic data (Robinson et al., 2011, 2017, 2023; Thorvaldsdóttir et al., 2013). To install IGV, visit the IGV download page, select the appropriate bundle for the OS (e.g., “With Java Included”), unzip the file, and move the application to a preferred directory.

Data analysis

Prepare and concatenate FASTA and GFF3 files for model and spike-in organisms.

This section describes the generation of concatenated (merged) FASTA and GFF3 files for the model organism S. cerevisiae and the spike-in control organism Schizosaccharomyces pombe. The concatenated FASTA file is used to generate Bowtie 2 index files (Langmead et al., 2009, 2019; Langmead and Salzberg, 2012) (see Data Analysis B), enabling simultaneous alignment of sequenced reads from both organisms (see General Notes #1 and #2, and Data Analysis E). The resulting alignments support the generation of spike-in-normalized signal tracks (see Data Analysis H). The concatenated GFF3 file enables visualization of signal tracks with gene and feature annotations for both organisms (see Data Analysis I).

Refer to workflow.md for detailed steps on using prepare_files_sc_sp.sh, which automates the following tasks:

Download FASTA and GFF3 files from the Saccharomyces Genome Database (S. cerevisiae) and Pombase (S. pombe).
Process the files by standardizing chromosome names and removing incompatible formatting.
1. In the S. pombe FASTA file, chromosome names are prefixed with “SP_” to enable the downstream separation of S. pombe alignments from S. cerevisiae alignments.
2. In the S. cerevisiae GFF3 file, gene and autonomously replicating sequence (ARS) Name fields are reassigned from their systematic names (e.g., “YEL021W”) to their standard, more interpretable names (e.g., “URA3”). This step is unnecessary for the S. pombe file, as its Name fields already use standard names.
Concatenate the processed files for alignment and visualization.

Note: If spike-in normalization is not needed, aligning reads to only the S. cerevisiae (model organism) FASTA file is sufficient, eliminating the need for a concatenated genome. In this case, prepare_files_sc_sp.sh can still be used, as it provides processed, non-concatenated S. cerevisiae FASTA and GFF3 files.
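The chromosome-prefixing and concatenation steps can be illustrated with toy data. This is a minimal sketch of the idea, not the implementation in prepare_files_sc_sp.sh; the file names, chromosome names, and sequences are placeholders.

```bash
# Toy FASTA files (sequences are placeholders, not real genome data)
dir_tmp="$(mktemp -d)"
printf '>I\nACGT\n>II\nGGCC\n' > "${dir_tmp}/sc.fa"
printf '>I\nTTAA\n>MT\nCCGG\n' > "${dir_tmp}/sp.fa"

# Prefix each S. pombe header with "SP_"; sequence lines pass through unchanged
awk '/^>/ { sub(/^>/, ">SP_") } { print }' "${dir_tmp}/sp.fa" \
    > "${dir_tmp}/sp_prefixed.fa"

# Concatenate the model and spike-in genomes into one FASTA file
cat "${dir_tmp}/sc.fa" "${dir_tmp}/sp_prefixed.fa" > "${dir_tmp}/sc_sp.fa"

# List chromosome names in the concatenated file
chroms="$(grep '^>' "${dir_tmp}/sc_sp.fa" | sed 's/^>//' | paste -sd "," -)"
echo "${chroms}"
```

In this toy case, both genomes contain a chromosome named "I"; after prefixing, the names are unambiguous (I,II,SP_I,SP_MT), which is what allows alignments to be separated by organism later.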

Generate Bowtie 2 indices from the concatenated FASTA file.

In this section, Bowtie 2 (Langmead et al., 2009, 2019; Langmead and Salzberg, 2012) index files are generated using the processed, concatenated S. cerevisiae and S. pombe FASTA file (see Data Analysis A). The index files are essential for aligning reads from both organisms in a single step (see Data Analysis E), which supports downstream spike-in normalization (see Data Analysis H).

See workflow.md for instructions on decompressing the concatenated FASTA file (if necessary) and running bowtie2-build to generate index files. Note: If spike-in normalization is not required, index files can be generated using only the processed S. cerevisiae FASTA file.

Obtain and organize ChIP-seq FASTQ files.

In this section, FASTQ files are retrieved, organized, and prepared for downstream analyses. The process includes generating TSV files with sample FASTQ file information and accompanying File Transfer Protocol (FTP) links, assigning custom names to the FASTQ files, and using a script to automate file downloads and organization.

Generate a TSV file with FTP links.

Use the European Nucleotide Archive (ENA) (O’Cathail et al., 2024) Browser to create a TSV file listing FASTQ files with FTP links:
1. Visit ebi.ac.uk/ena/browser, enter a BioProject (Barrett et al., 2012) or Gene Expression Omnibus Series (GSE) (Barrett et al., 2013; Clough and Barrett, 2016; Edgar et al., 2002) accession number in the “Enter accession” field, and navigate to the corresponding page.
2. Open the “Show Column Selection” menu and select only the checkboxes for “fastq_ftp” (the FTP links) and “sample_title” (the experiment names).
3. Click “TSV” under “Download report” to save the file.
Add custom names to the TSV file.
1. Add a fourth column to the downloaded table with the header “custom_name,” populated with user-defined names in the format “assay_genotype_state_treatment_factor_strain/replicate.” This format places stable attributes (e.g., assay type) on the left and variable attributes (e.g., strains or replicates) on the right (see General Note #5). Note: Instead of using “ChIP-seq” for the assay type, we use “IP” to represent the immunoprecipitate and “in” to represent the input control. Ensure that entries in the new column are tab-separated.
2. Rename the completed TSV file.
Pre-prepared TSV files for various datasets (Dickson et al., 2020, 2023; Swygert et al., 2019, 2021) are available in the repository protocol_chipseq_signal_norm.
Use the TSV file to download FASTQ files.

FASTQ files listed in the TSV file can be downloaded and organized by running execute_download_fastqs.sh, which automates the download process, creates symbolic links based on custom names (see General Note #6), and supports both paired- and single-end reads (see General Note #7) from FTP addresses and other sources. For an example of how to run execute_download_fastqs.sh, refer to workflow.md.
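To make the role of the TSV file concrete, the sketch below parses a toy table and prints, for each FASTQ file, its source address and a name derived from “custom_name.” It is an illustration only, not the logic of execute_download_fastqs.sh. It assumes a column order of run_accession, fastq_ftp, sample_title, custom_name and that paired-end entries list two semicolon-separated addresses in the “fastq_ftp” field; the accessions, addresses, function name plan_downloads, and the "_R1"/"_R2" suffixes are hypothetical.

```bash
# Toy TSV (accessions and FTP paths are hypothetical placeholders)
tsv_tmp="$(mktemp)"
printf 'run_accession\tfastq_ftp\tsample_title\tcustom_name\n' > "${tsv_tmp}"
printf 'SRR0000001\tftp.example.org/SRR0000001_1.fastq.gz;ftp.example.org/SRR0000001_2.fastq.gz\tHho1 IP G1 rep1\tIP_WT_G1_Hho1_6336\n' >> "${tsv_tmp}"
printf 'SRR0000002\tftp.example.org/SRR0000002.fastq.gz\tHho1 input G1 rep1\tin_WT_G1_Hho1_6336\n' >> "${tsv_tmp}"

# Hypothetical helper: for each data row, print "source<TAB>link name", one
# line per FASTQ file; paired-end rows get "_R1" and "_R2" suffixes
plan_downloads() {
    awk -F '\t' '
        NR == 1 { next }  # Skip the header
        {
            n = split($2, url, ";")
            for (i = 1; i <= n; i++) {
                suffix = (n == 2) ? "_R" i : ""
                print url[i] "\t" $4 suffix ".fastq.gz"
            }
        }
    ' "${1}"
}

plan="$(plan_downloads "${tsv_tmp}")"
echo "${plan}"
```

The paired-end row yields two lines and the single-end row yields one, so a download loop could read this plan line by line.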

Use Atria to perform adapter and quality trimming of sequenced reads.

Here, ChIP-seq reads are trimmed for adapter sequences and low-quality bases with the program Atria (Chuan et al., 2021). This process is automated with the script execute_trim_fastqs.sh. For a practical example of its usage, see workflow.md (see also General Note #8).

Align sequenced reads with Bowtie 2 and process the read alignments.

This section focuses on aligning ChIP-seq reads to a concatenated S. cerevisiae/S. pombe genome using Bowtie 2 (Langmead et al., 2009, 2019; Langmead and Salzberg, 2012) and processing the resulting alignments with Samtools (Danecek et al., 2021; H. Li et al., 2009). In the processing, multi-mapping alignments are preserved (see General Note #9), which is necessary for analyzing signals at repetitive loci, such as the S. cerevisiae ribosomal DNA (rDNA) locus.

Check workflow.md for an implementation of the following steps:

Run execute_align_fastqs.sh to align sequenced reads to the concatenated genome (see Data Analysis A and B). This script manages parallelization and log generation, and writes alignment output (BAM files) to designated directories. For paired-end reads, use the --req_flg flag to retain only “properly paired” alignments (see General Note #10). To retain multi-mapping alignments with up to five mismatches, set the --mapq argument to 1 (i.e., alignments must have a mapping quality, or MAPQ, of at least 1; see General Note #11); this preserves signal in repetitive regions, unlike stricter thresholds (e.g., MAPQ 20 or 30) that exclude these alignments.
Run execute_filter_bams.sh to filter BAM files for S. cerevisiae (main) and S. pombe (spike-in) alignments, saving the filtered files to separate directories for each organism. Note: This step can be skipped if spike-in normalization is unnecessary and reads were aligned to only the S. cerevisiae genome.

Note: Depending on the analysis requirements, either concatenated genomes (FASTA and index files) or individual processed genomes can be used. For workflows that do not involve spike-in normalization, the processed S. cerevisiae genome files alone are sufficient.
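The filtering criteria described above (properly paired, a minimum MAPQ, and separation by organism) can be illustrated on toy SAM-like text. This is a sketch of the logic only; the driver scripts operate on BAM files, for which Samtools exposes the first two criteria as the samtools view options -f (required flag bits) and -q (minimum MAPQ). The function name filter_sam, the read names, and the toy records are hypothetical.

```bash
# Toy SAM-like records (tab-separated): QNAME, FLAG, RNAME, POS, MAPQ
sam_tmp="$(mktemp)"
printf 'r1\t99\tXII\t100\t42\n'   >  "${sam_tmp}"  # Properly paired, high MAPQ
printf 'r2\t99\tXII\t460000\t1\n' >> "${sam_tmp}"  # Properly paired, MAPQ 1
printf 'r3\t97\tXII\t200\t42\n'   >> "${sam_tmp}"  # Paired, not properly paired
printf 'r4\t99\tIV\t300\t0\n'     >> "${sam_tmp}"  # Properly paired, MAPQ 0
printf 'r5\t99\tSP_I\t400\t30\n'  >> "${sam_tmp}"  # Spike-in alignment

# Hypothetical helper: keep records with the "properly paired" bit (0x2) set
# and MAPQ at or above a minimum, restricted to main or spike-in chromosomes;
# print the retained read names as a comma-delimited string
filter_sam() {
    local min_mapq="${1}"  # Minimum MAPQ, e.g., 1
    local organism="${2}"  # "sc" (main) or "sp" (spike-in)
    local sam="${3}"

    awk -F '\t' -v q="${min_mapq}" -v org="${organism}" '
        {
            proper = int($2 / 2) % 2   # 1 if bit 0x2 is set
            spike  = ($3 ~ /^SP_/)     # 1 if the chromosome has the SP_ prefix
            keep   = (org == "sp") ? spike : !spike
            if (proper && $5 + 0 >= q + 0 && keep) print $1
        }
    ' "${sam}" | paste -sd "," -
}

ids_sc="$(filter_sam 1 sc "${sam_tmp}")"
ids_sp="$(filter_sam 1 sp "${sam_tmp}")"
echo "sc: ${ids_sc}"
echo "sp: ${ids_sp}"
```

With a minimum MAPQ of 1, the low-MAPQ record r2 is retained while r4 (MAPQ 0) and r3 (not properly paired) are dropped; raising the minimum to 20 would also drop r2.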

Compute normalized (or raw) coverage.

This section provides a guide to computing “normalized coverage” as defined by Dickson et al. (2023). In normalized coverage, y-axis values represent the proportion of sequencing fragments overlapping each genomic position, and the values sum to 1 (unity) across the genome. Signal tracks are thus properly scaled probability distributions that highlight the relative distribution of fragments and enable meaningful comparisons across datasets or conditions. This portion of the workflow also supports raw (unadjusted) coverage and writes output in bigWig or bedGraph format for downstream signal computation (see Data Analysis G) and visualization (see Data Analysis I).

Run execute_compute_coverage.sh to calculate per-sample coverage. By default, the script outputs normalized coverage; use the --raw flag to generate raw coverage instead. Refer to workflow.md for an example implementation.
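The difference between raw and normalized coverage can be seen in a toy example. The sketch below is not the implementation in execute_compute_coverage.sh: it works at single-base resolution on three made-up fragments and, as a simplifying assumption, normalizes by dividing each position's depth by the total depth summed over all positions (the repository's scripts may weight fragments differently). The function name compute_coverage is hypothetical.

```bash
# Toy fragments in BED-like format (chrom, start, end; 0-based, half-open)
bed_tmp="$(mktemp)"
printf 'I\t0\t4\n' >  "${bed_tmp}"
printf 'I\t2\t6\n' >> "${bed_tmp}"
printf 'I\t2\t4\n' >> "${bed_tmp}"

# Hypothetical helper: print per-position coverage (chrom, position, value)
# for positions with nonzero depth
compute_coverage() {
    local mode="${1}"  # "norm" (values sum to 1) or "raw" (fragment counts)
    local bed="${2}"

    awk -F '\t' -v mode="${mode}" '
        {
            for (pos = $2 + 0; pos < $3 + 0; pos++) {
                depth[$1 "\t" pos]++
                total++
            }
        }
        END {
            for (key in depth) {
                val = (mode == "norm") ? depth[key] / total : depth[key]
                printf "%s\t%.4f\n", key, val
            }
        }
    ' "${bed}" | sort -k1,1 -k2,2n
}

cov_norm="$(compute_coverage norm "${bed_tmp}")"
echo "${cov_norm}"

# Check that the normalized values sum to 1
sum_norm="$(
    echo "${cov_norm}" | awk -F '\t' '{ s += $3 } END { printf "%.4f", s }'
)"
echo "Sum: ${sum_norm}"
```

Positions 2 and 3, covered by all three fragments, receive a raw depth of 3 and a normalized value of 0.3, and the normalized track sums to 1.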

Compute coverage with the sans spike-in quantitative ChIP-seq (siQ-ChIP) method.

This section focuses on coverage computation using the sans spike-in quantitative ChIP-seq (siQ-ChIP) method (Dickson et al., 2020, 2023). At the core of siQ-ChIP is the concept that the immunoprecipitation of chromatin fragments represents an equilibrium binding reaction, in which reactants and products balance dynamically, governed by classical mass conservation laws. Mass conservation principles enable the experimentally measured IP mass to be interpreted as the result of a competitive binding reaction, influenced by antibody affinities and the concentrations of chromatin and antibodies. Sequencing reveals the genomic distribution of the IP product, recorded as normalized coverage (see Data Analysis F). The quantitative scale of siQ-ChIP, expressed in absolute physical units, is defined by the product of this sequencing-derived distribution and the IP mass as measured through methods such as fluorometric quantification (e.g., Qubit) or spectrophotometry (e.g., NanoDrop). By establishing a quantitative scale, siQ-ChIP enables a precise measure of IP reaction efficiency, expressed as the ratio of bound (IP) signal to total (input) signal, for any genomic interval.

To generate siQ-ChIP-scaled coverage, we compute a proportionality constant, α, that connects the sequencing-derived data to the underlying IP reaction dynamics. Specifically, α is a scaling factor, essentially the IP efficiency per equation 6 of Dickson et al. (2023), derived from bench measurements such as the DNA masses and volumes of the IP and input samples (see Table 4). It ensures that signal tracks (coverage) reflect absolute, rather than relative, quantities. Multiplying α by the ratio of IP normalized coverage to input normalized coverage yields a quantitative measure of protein-DNA binding efficiency, consistent with the physical scale of the IP reaction. In the following workflow, we determine α and use it to compute siQ-ChIP-scaled coverage tracks from normalized coverage (see Data Analysis F).

Consult workflow.md for a detailed example implementing the following steps:

1. Assign variables for serialized strings representing the IP BAM files and corresponding IP normalized coverage tracks (in bigWig or bedGraph format). Also, define a variable for a tab-separated metadata table specifying experimental parameters needed to compute α for each sample (see Software and Datasets A and Table 4). Columns in the metadata table are programmatically assessed to match each IP sample with its corresponding experimental parameters (see Troubleshooting #3).
2. Run execute_calculate_scaling_factor_alpha.sh to compute sample-specific α values. This script processes the IP BAM files and metadata table to calculate scaling factors, outputting them to a TSV file. The script assumes that IP and input files are located in the same directory and share identical file name formatting except for the prefix: For a given IP file, the corresponding input file is identified by replacing “IP” in the filename with “in” (see General Note #5).
3. To generate siQ-ChIP-scaled coverage tracks in bigWig or bedGraph format, use the computed α values with the corresponding IP and input normalized coverage tracks by running execute_deeptools_compare.sh, which uses the deepTools software suite (Ramírez et al., 2016). Input files and scaling factors can be specified directly or supplied through the TSV file generated in Step #2. For each sample, the script scales the ratio of IP to input normalized coverage by multiplying it with α.
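As a worked illustration of how the Table 4 parameters might combine, the sketch below computes α for one sample and applies it to a single genomic bin. It ASSUMES that α takes the form (vol_in / (vol_all − vol_in)) × (mass_ip / mass_in) × (length_in / length_ip); this form is an assumption made for illustration and should be checked against equation 6 of Dickson et al. (2023) and the repository's scripts before use. The function name calc_alpha, the fragment lengths, and the coverage values are hypothetical.

```bash
# Hypothetical helper: compute alpha from bench measurements, ASSUMING
#   alpha = (vol_in / (vol_all - vol_in))
#           * (mass_ip / mass_in) * (len_in / len_ip)
calc_alpha() {
    awk -v vol_in="${1}" -v vol_all="${2}" \
        -v mass_in="${3}" -v mass_ip="${4}" \
        -v len_in="${5}" -v len_ip="${6}" '
        BEGIN {
            v = vol_in / (vol_all - vol_in)  # Volume term
            m = mass_ip / mass_in            # Mass term
            l = len_in / len_ip              # Fragment length term
            printf "%.6f\n", v * m * l
        }
    '
}

# Table 4 values for the first sample (WT, G1, Hho1, strain 6336); the average
# fragment lengths (200 bp each) are hypothetical placeholders
alpha="$(calc_alpha 20 300 72.5 2.7 200 200)"
echo "alpha: ${alpha}"

# Scale a toy ratio of IP to input normalized coverage in one genomic bin
# (both coverage values are hypothetical)
siq="$(
    awk -v a="${alpha}" -v ip="0.0004" -v inp="0.0002" \
        'BEGIN { printf "%.6f\n", a * ip / inp }'
)"
echo "siQ-ChIP-scaled coverage: ${siq}"
```

Under these assumptions, α is approximately 0.00266, and a bin in which IP normalized coverage is twice the input normalized coverage receives a scaled value of about 0.00532.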

Compute coverage with the spike-in method.

Here, we cover spike-in normalization, a method that attempts to address variability in ChIP-seq experiments by using exogenous chromatin as a reference to scale endogenous signal (Bonhoure et al., 2014; Chen et al., 2015; Egan et al., 2016; Grzybowski et al., 2015; Orlando et al., 2014). While this method can reveal global shifts in protein-DNA interaction levels, it relies on the assumption that spike-in chromatin behaves consistently across samples, which is often violated due to factors like variability in spike-in chromatin recovery and differences in sample composition (Dickson et al., 2020, 2023). Initially, spike-in normalization was considered a practical alternative to standard sequencing depth normalization (Bonhoure et al., 2014; Orlando et al., 2014), which assumes relatively constant global signal levels across samples and is therefore inadequate for detecting biological changes that affect overall protein abundance, such as epitope masking by chromatin remodeling or chemical or genetic epitope depletion. However, the siQ-ChIP method provides a mathematically rigorous alternative, using the quantitative properties of the immunoprecipitation reaction to compute absolute protein-DNA interaction levels from experimental data. This eliminates the need for exogenous controls, addressing the limitations of spike-ins while enabling reliable comparisons across regions, samples, and conditions. Though steps for spike-in normalization are included here for contextual reference, we very strongly recommend using normalized and siQ-ChIP-scaled coverage instead, as these methods provide more reliable and effective tools for, respectively, relative and absolute ChIP-seq analyses.

workflow.md provides an example of how to implement and execute the following steps:

1. Assign to a variable a serialized string of S. cerevisiae (model organism) IP BAM files. Because spike-in normalization is a semiquantitative method, comparisons should be limited to groups of related samples. For example, in workflow.md, samples immunoprecipitated for Hho1 are compared across G₁, G₂/M, and Q cell cycle states. This approach differs from siQ-ChIP scaling, an absolute normalization method that enables direct comparisons across different immunoprecipitated factors (see Data Analysis G). The script used to compute scaling factors, execute_calculate_scaling_factor_spike.sh, requires S. cerevisiae and S. pombe (spike-in organism) sample BAM files to be organized into subdirectories “sc” and “sp” within the same parent directory, with each subdirectory containing corresponding files (see Troubleshooting #4).
2. Run execute_calculate_scaling_factor_spike.sh to compute sample-specific scaling factors derived from spike-in controls. The script processes S. cerevisiae and corresponding S. pombe IP BAM files, outputting scaling factors to a TSV file.
3. Use the script relativize_scaling_factors.py with the TSV file of scaling factors (Step #2) to divide each sample’s scaling factor by the maximum value in its group. This process puts the scaling factors on a relative scale from 0 to 1, preventing potential signal inflation (e.g., coverage values being multiplied by factors greater than 1).
4. To generate spike-in-scaled coverage tracks in bigWig or bedGraph format, use the computed scaling factors with the corresponding S. cerevisiae IP and input BAM files by running execute_deeptools_compare.sh, which uses the deepTools software suite (Ramírez et al., 2016). Input files and scaling factors can be specified directly or provided through the TSV file generated in Step #2 and adjusted in Step #3. For each sample, the script computes the ratio of IP to input coverage and scales it by multiplying this ratio with the spike-in-derived scaling factor.
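The relativization step (putting each group's scaling factors on a scale from 0 to 1 relative to the group maximum) can be sketched in awk. This is an illustration of the arithmetic, not the code in relativize_scaling_factors.py; the function name relativize, the column names (group, sample, sf, sf_rel), and the scaling-factor values are hypothetical.

```bash
# Toy table of scaling factors (values are hypothetical): group, sample, sf
tsv_tmp="$(mktemp)"
printf 'group\tsample\tsf\n'               >  "${tsv_tmp}"
printf 'Hho1\tIP_WT_G1_Hho1_6336\t0.50\n'  >> "${tsv_tmp}"
printf 'Hho1\tIP_WT_G2M_Hho1_6336\t0.25\n' >> "${tsv_tmp}"
printf 'Hho1\tIP_WT_Q_Hho1_6336\t2.00\n'   >> "${tsv_tmp}"

# Hypothetical helper: append a column, "sf_rel", holding each scaling factor
# divided by the maximum scaling factor in its group
relativize() {
    # The file is read twice: first to find each group's maximum, then to
    # divide each value by that maximum
    awk -F '\t' -v OFS='\t' '
        FNR == 1 { if (NR != FNR) print $0, "sf_rel"; next }
        NR == FNR {
            if (!($1 in max) || $3 + 0 > max[$1]) max[$1] = $3 + 0
            next
        }
        { printf "%s\t%s\t%s\t%.4f\n", $1, $2, $3, $3 / max[$1] }
    ' "${1}" "${1}"
}

rel="$(relativize "${tsv_tmp}")"
echo "${rel}"
```

In this toy group, the Q sample has the largest factor and is assigned 1, while the G₁ and G₂/M samples are assigned 0.25 and 0.125, so no coverage value is multiplied by a factor greater than 1.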

Visualize signal tracks with IGV.

To visually explore signal tracks with respect to organism chromosomes and feature annotations, follow these steps:

1. Run IGV.

Launch IGV by double-clicking its application icon.

2. Load a FASTA genome.

In IGV, go to Genomes > Load Genome from File… and select a FASTA file, such as the S. cerevisiae/S. pombe combined genome (see Data Analysis A). If the file is compressed, decompress it first.

3. Load a corresponding GFF3 file.

Navigate to File > Load from File… and select a GFF3 file, such as the S. cerevisiae/S. pombe combined GFF3 file (see Data Analysis A). Alternatively, drag and drop the file into IGV’s interface. The file can be compressed or not.

4. Load bigWig or bedGraph signal tracks (see Data Analysis F, G, and H).

Load signal tracks by repeating the same process as in Step #3.

This configuration provides an interactive platform for examining signal tracks in the context of genomic features and annotations.
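For repeated sessions, the loading steps above can optionally be recorded in an IGV batch file, which IGV can run from Tools > Run Batch Script…. The sketch below only writes such a file using the IGV batch commands new, genome, and load; it is not part of the protocol repository, and the function name write_igv_batch and all file paths are hypothetical placeholders.

```bash
# Hypothetical helper: write an IGV batch file that starts a new session,
# loads a genome, and loads annotation and signal track files
write_igv_batch() {
    local batch="${1}"   # Output batch file
    local genome="${2}"  # FASTA file
    local file
    shift 2              # Remaining arguments: GFF3, bigWig, or bedGraph files

    {
        echo "new"
        echo "genome ${genome}"
        for file in "$@"; do
            echo "load ${file}"
        done
    } > "${batch}"
}

# Toy demonstration with placeholder paths
batch_tmp="$(mktemp)"
write_igv_batch "${batch_tmp}" \
    "/path/to/sc_sp.fa" \
    "/path/to/sc_sp.gff3" \
    "/path/to/IP_WT_G1_Hho1_6336.bw"
cat "${batch_tmp}"
```

Replace the placeholder paths with the FASTA, GFF3, and signal track files generated in Data Analysis A, F, G, and H.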

General notes and troubleshooting

General notes

What are reads?

In NGS, reads refer to the DNA sequences generated by sequencing platforms. This protocol focuses on ChIP-seq datasets with reads generated on Illumina platforms.

Illumina uses a process called “sequencing by synthesis” (Slatko et al., 2018), where DNA fragments from ChIP-seq libraries—collections of DNA fragments enriched for specific protein-DNA interactions—are attached to a flow cell and amplified to form clusters, each representing many copies of a single fragment. Sequencing occurs by synthesizing the complementary strand and incorporating fluorescently labeled nucleotides, one at a time. As each nucleotide is added, a camera captures the fluorescent signal, allowing the sequence to be determined. These sequences, often called “short reads,” typically range from 50 to 300 base pairs, depending on platform specifications.

Reads are typically stored in the FASTQ file format, which includes both sequence data and quality scores indicating the confidence level of each nucleotide base call. Analysts can use these scores to identify and filter out low-quality data. Depending on the platform, reads can be single-end, where sequencing occurs from one end of the fragment, or paired-end, where sequencing occurs from both ends (see General Note #7).
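To make the FASTQ layout concrete, the sketch below counts records in an uncompressed FASTQ file (four lines per read) and decodes a quality string into Phred scores. It assumes the Phred+33 encoding used by current Illumina pipelines; the function names are hypothetical and not part of this protocol's scripts.

```shell
# Count reads in an uncompressed FASTQ file: each record spans four lines
# (header, sequence, "+" separator, quality string).
count_fastq_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Decode a Phred+33 quality string into space-separated scores.
# Assumption: qualities are ASCII characters offset by 33.
phred33_scores() {
    printf '%s\n' "$1" | awk '
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
        {
            n = length($0)
            for (i = 1; i <= n; i++)
                printf "%s%d", (i > 1 ? " " : ""), ord[substr($0, i, 1)] - 33
            print ""
        }
    '
}

phred33_scores 'IIII5555!'   # "I" decodes to 40, "5" to 20, "!" to 0
```

A score of 40 corresponds to a nominal base-call error probability of 1 in 10,000, and a score of 20 to 1 in 100.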
What are reference genomes?

A reference genome is a digital database of nucleic acid sequences that represents the typical set of feature annotations (e.g., genes and other genomic features) and overall genome structure for a species. It serves as a standard for aligning reads (see General Note #1) from ChIP-seq and other genomic assays. Constructed by sequencing DNA from one or more individuals of a species, a reference genome is assembled into a contiguous sequence by piecing together read data from various sequencing technologies. For many species, it is a composite sequence that aims to capture the genetic diversity of the species rather than representing any single individual. Many reference genomes are continuously updated and refined as new sequencing technologies and data become available.
What is data normalization?

While the term originally referred to transforming data to fit a normal distribution, in modern bioinformatics, normalization typically means making datasets comparable by adjusting for systematic biases or effects that are not of primary interest. For example, in ChIP-seq experiments, variations in cell state, sequencing depth, sample preparation, or library composition can introduce biases that affect the apparent enrichment of protein-DNA interactions. Normalization methods adjust for these discrepancies, aiming to enable more accurate comparisons within and across samples and experimental conditions.
On find_files.sh.

The utility script find_files.sh simplifies locating files in a specified directory using the find command (see man7.org/linux/man-pages/man1/find.1.html and ss64.com/mac/find.html). The script minimizes the need for manual file listing, supports complex filtering options, and promotes reproducibility in bioinformatics workflows. It supports searches for various file types, including FASTQ, BAM, and TXT files, and returns results as a single comma-separated string (or, when called with the --fastqs flag for FASTQ files, a single semicolon- and comma-separated string) that can be passed to driver scripts. For more information, see the find_files.sh documentation.
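The kind of output described above can be approximated with a few standard tools. The sketch below is not the implementation of find_files.sh; the function name list_files_csv and its two positional arguments are assumptions for illustration. It searches a single directory level, sorts matches for reproducibility, and joins them with commas.

```shell
list_files_csv() {
    # $1: directory to search; $2: filename pattern (for example, '*.bam')
    find "$1" -maxdepth 1 -type f -name "$2" | sort | paste -sd, -
}

# Example usage with a hypothetical directory of BAM files:
# bams="$(list_files_csv "${HOME}/path/to/bams" '*.bam')"
```

Quoting the pattern keeps the shell from expanding it before find sees it.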
On the “assay_genotype_state_treatment_factor_strain/replicate” naming scheme.

For legibility and reproducibility, we recommend user-defined filenames following the format “assay_genotype_state_treatment_factor_strain/replicate.” This structure places stable attributes (e.g., assay type) on the left and more variable attributes (e.g., replicates) on the right. Below is a breakdown of each attribute:
1. “Assay” refers to the next-generation sequencing method used for the samples (e.g., “RNA-seq,” “ATAC-seq,” “Hi-C”). However, rather than using the term “ChIP-seq,” we use “IP” and “in” (for input) to distinguish between these two types of ChIP-seq data.
2. “Genotype” refers to samples’ genetic background, e.g., “WT” (wild type) or “SMC4-off” (conditional depletion of SMC4 through a Tet-Off system).
3. The term “state” signifies samples’ positions in the cell cycle. For example, samples could be in “log” (logarithmic) growth, a mixture of active cell cycle stages, or in specific stages of the cell cycle such as “G1” (G₁), “G2M” (G₂/M, i.e., a mixture of the G₂ and mitotic stages), or “Q” (quiescence).
4. “Treatment” signifies an experimental intervention applied to samples, such as a drug or control chemical. For example, H3K27me3 IP samples might be treated with a “vehicle” (control) or an EZH2 inhibitor.
5. “Factor” represents the protein targeted for immunoprecipitation.
6. If one or more attributes (such as “state,” “genotype,” or “treatment”) are not relevant for a particular set of samples, we omit them from the custom names.
Below is an example of how a custom name might be constructed, with the “treatment” attribute omitted as it is not relevant in this context. FTP addresses are replaced with ellipses for brevity.

Given the following row in the downloaded table:

run_accession    sample_title    fastq_ftp
SRR7175368    Brn1 in Log Replicate 1 Input    ...

Create the following custom name:

run_accession    sample_title    fastq_ftp    custom_name
SRR7175368    Brn1 in Log Replicate 1 Input    ...    in_WT_log_Brn1_rep1
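The assembly of such a name can be sketched as a small shell function that joins the attributes with underscores and skips any that are empty. The function name build_custom_name is hypothetical; the protocol itself does not prescribe how names are generated.

```shell
# Join non-empty attributes in the order
# assay, genotype, state, treatment, factor, strain/replicate.
build_custom_name() (
    name=""
    for part in "$@"; do
        if [ -n "${part}" ]; then
            name="${name:+${name}_}${part}"
        fi
    done
    printf '%s\n' "${name}"
)

# Treatment is left empty, so it is dropped from the result.
build_custom_name "in" "WT" "log" "" "Brn1" "rep1"   # in_WT_log_Brn1_rep1
```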
Best practices for downloading and naming files.

To avoid confusion or errors in downstream analyses, we recommend against renaming downloaded files directly. Instead, retain the original filenames and create symbolic links (“symlinks”) with custom names. Symlinks, created using the ln -s command, act as pointers to the original files, allowing one to reference them with more convenient names without modifying the original files. This preserves file integrity while providing flexibility.

We also advocate for managing downloads and symbolic link creation through TSV (or similar) files. This approach simplifies the process and serves as documentation, which prevents—or can help troubleshoot—uncertainty about filenames or sources.
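A TSV-driven approach to symlink creation might look like the following sketch. The two-column layout (run_accession, custom_name), the .fastq.gz suffix, and the function name link_from_tsv are assumptions for illustration; paired-end data would need one link per mate.

```shell
# Create symlinks named after "custom_name" that point to the downloaded files.
# $1: TSV with header "run_accession<TAB>custom_name" (hypothetical layout)
# $2: absolute path to the directory holding the original files
# $3: directory in which to create the symlinks
link_from_tsv() (
    tab="$(printf '\t')"
    mkdir -p "$3"
    while IFS="${tab}" read -r acc name || [ -n "${acc}" ]; do
        [ "${acc}" = "run_accession" ] && continue   # skip the header row
        ln -s "$2/${acc}.fastq.gz" "$3/${name}.fastq.gz"
    done < "$1"
)
```

Because the original files keep their names, the TSV doubles as a record of which accession each custom name refers to.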
Contrasting paired- and single-end short-read sequencing.

Paired-end short-read sequencing involves sequencing both ends of DNA fragments, effectively demarcating entire fragments generated during processes such as ChIP-seq library preparation. This approach obviates the need for the fragment modeling performed with single-end sequenced data (Landt et al., 2012; Nakato and Shirahige, 2018), enabling more accurate identification of protein-DNA binding sites (e.g., through peak calling) and improving the resolution of closely spaced binding events. Additionally, paired-end sequencing enhances alignment accuracy to repetitive or highly similar genomic regions, reducing ambiguity for reads that might otherwise align to multiple locations (Chung et al., 2011). Despite these advantages, single-end sequencing, which sequences only one end of DNA fragments, was historically simpler and less expensive. However, advancements in sequencing technology and economies of scale from widespread adoption have made paired-end sequencing as affordable as, or even cheaper than, single-end sequencing.
Default adapter handling by Atria.

If the arguments --adapter1 and --adapter2 (for paired-end sequenced reads) are not specified, Atria defaults to using Illumina TruSeq single and combinatorial dual index adapter sequences, which are suitable for most ChIP-seq applications.
1. Adapter sequence 1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
2. Adapter sequence 2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
For more information, refer to the Atria and Illumina Adapter Sequences documentation.
On the determination of multi-mapping alignments by Bowtie 2.

The determination of multi-mapping alignments by Bowtie 2 involves complex criteria that extend beyond the scope of this protocol. For readers seeking an in-depth understanding, we recommend the following resources:
1. The Bowtie 2 publications (Langmead et al., 2009, 2019; Langmead and Salzberg, 2012).
2. The Bowtie 2 manual.
3. A detailed explanation by John Urban (archived version), which offers a nuanced discussion on how Bowtie 2 defines and handles multi-mapping alignments (referred to as “true multireads” in the post).
What does it mean for alignments to be properly paired?

The definition of properly paired alignments varies by aligner. With Bowtie 2, the term “properly paired” refers to paired-end alignments where both reads align in a manner consistent with the expected orientation, distance, and fragment length determined during library preparation. Typically, this means that the forward read and its corresponding reverse read align to the same chromosome (or reference sequence) in an inward-facing orientation and within a specified distance from each other. This distance is defined by the Bowtie 2 parameters --minins (minimum fragment length) and --maxins (maximum fragment length). In this protocol, we retain the default values for these parameters when running Bowtie 2. For more details, refer to the Bowtie 2 manual.
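In SAM/BAM records, aligners report proper pairing through bit 0x2 of the FLAG field. As a minimal sketch (the function name is hypothetical), the check below tests that bit for a given FLAG value; in practice, a filter such as samtools view -f 2 selects alignments with this bit set.

```shell
# Return success (exit status 0) if the SAM FLAG has bit 0x2 set,
# which marks each segment as properly aligned according to the aligner.
is_properly_paired() {
    [ $(( $1 & 2 )) -ne 0 ]
}

is_properly_paired 99 && echo "FLAG 99: properly paired"       # 99 = 1 + 2 + 32 + 64
is_properly_paired 65 || echo "FLAG 65: not properly paired"   # 65 = 1 + 64
```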
What are MAPQ scores?

In the process of aligning reads, Bowtie 2 assigns a MAPQ score to each alignment. This score reflects the confidence in the correctness of the alignment. The calculation of MAPQ scores varies with the alignment program, and Bowtie 2 employs an approach that differs from the standard definition (H. Li et al., 2008; for more information on the standard definition, refer to the Sequence Alignment/Map Format Specification). While a detailed explanation of Bowtie 2’s method is beyond the scope of this protocol, we encourage readers interested in a deeper understanding to explore the following resources:
1. The Bowtie 2 publications (Langmead et al., 2009, 2019; Langmead and Salzberg, 2012).
2. The Bowtie 2 manual.
3. Two in-depth blog posts by John Urban (archived versions are also available), which present detailed discussion on how Bowtie 2 calculates MAPQ scores, including foundational concepts and code examples.
4. A post by Devon Ryan on seqanswers.com (J.-W. Li et al., 2012) that elucidates the MAPQ scoring logic through the description of a relevant C function.
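For orientation, under the standard definition in the SAM specification, MAPQ equals -10 log10 of the probability that the mapping position is wrong, rounded to the nearest integer. The sketch below inverts that relationship; the function name is hypothetical, and because Bowtie 2 computes MAPQ differently, the resulting values should not be read as literal error probabilities for Bowtie 2 alignments.

```shell
# Convert a MAPQ score to the nominal probability that the alignment is wrong,
# using the standard definition MAPQ = -10 * log10(p).
mapq_to_error_prob() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

for q in 1 10 20 30; do
    printf 'MAPQ %2d -> p(wrong) = %s\n' "${q}" "$(mapq_to_error_prob "${q}")"
done
```

Under this definition, a MAPQ of 10 corresponds to a nominal 1-in-10 chance of a wrong position and a MAPQ of 30 to 1 in 1,000.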

Troubleshooting

What to do if “no” is chosen during conda initialization in Miniforge installation.

If “no” was selected during the Miniforge installation prompt for conda initialization, manual initialization can be performed with the following commands:

# For Bash
eval "$(~/miniforge3/bin/conda shell.bash hook)"
conda init
source ~/.bashrc # Or source ~/.bash_profile

# For Zsh
eval "$(~/miniforge3/bin/conda shell.zsh hook)"
conda init
source ~/.zshrc

To ensure conda and mamba are automatically initialized when the terminal is opened, add the following block to the shell configuration file (e.g., .bashrc, .bash_profile, .zshrc, etc.; preferably near the bottom):

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$(
    '/home/username/miniforge3/bin/conda' \
        'shell.shell' \
        'hook' \
            2> /dev/null
)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/home/username/miniforge3/etc/profile.d/conda.sh" ]; then
        . "/home/username/miniforge3/etc/profile.d/conda.sh"
    else
        export PATH="/home/username/miniforge3/bin:$PATH"
    fi
fi
unset __conda_setup

if [ -f "/home/username/miniforge3/etc/profile.d/mamba.sh" ]; then
    . "/home/username/miniforge3/etc/profile.d/mamba.sh"
fi
# <<< conda initialize <<<

Replace shell.shell with the hook for the shell in use (e.g., shell.bash for Bash or shell.zsh for Zsh). Also, replace /home/username with the path to your home directory.
What to do if a .condarc file was not generated by Miniforge.

If Miniforge did not automatically generate a .condarc file in the home directory, create and populate one manually with the following commands:

touch ~/.condarc
cat << EOF > ~/.condarc
channels:
- conda-forge
- bioconda
channel_priority: flexible
EOF
Ensuring metadata and filename consistency.

To ensure compatibility between the metadata parsing script and the siQ-ChIP metadata table processed by execute_calculate_scaling_factor_alpha.sh, input filenames must adhere to the expected naming convention: “assay_genotype_state_treatment_factor_strain/replicate” (see Data Analysis C and General Note #5). The following components are required:
1. “assay:” Specifies either “IP” (immunoprecipitate) or “in” (input), and must be followed by an underscore.
2. “factor:” Indicates the target protein or factor (e.g., “Hho1”), flanked by underscores.
3. “strain/replicate:” Denotes the biological strain or replicate ID (e.g., “6336,” “rep1”), which appears at the end of the filename.
Optional metadata fields, such as “genotype,” “state,” or “treatment,” can be included if separated by underscores. For example, “_G1_untreated_” or “_log_” are acceptable additions. Nonconforming filenames will cause errors when running the script execute_calculate_scaling_factor_alpha.sh (see Step #3 in Data Analysis G).
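A quick pre-flight check of filenames can catch such problems before the script is run. The following sketch uses a regular expression that encodes one reading of the convention above (an assay prefix, any number of optional fields, then factor and strain/replicate); it is an assumption for illustration and not the parser used by execute_calculate_scaling_factor_alpha.sh.

```shell
# Return success if a BAM filename matches the hypothetical pattern
# (IP|in)_[optional fields_]factor_strain-or-replicate.
is_valid_name() (
    stem="$(basename "$1" .bam)"
    printf '%s\n' "${stem}" \
        | grep -Eq '^(IP|in)_([^_]+_)*[^_]+_[^_]+$'
)

for f in IP_WT_log_Hho1_6336.bam in_WT_log_Brn1_rep1.bam chip_Hho1_rep1.bam; do
    if is_valid_name "${f}"; then echo "ok:  ${f}"; else echo "bad: ${f}"; fi
done
```

The first two names pass and the third fails because it does not begin with “IP_” or “in_”.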
Directory structure for spike-in scaling factor computation.

The script execute_calculate_scaling_factor_spike.sh requires a specific directory structure and file naming convention for its input files. The primary input files must be coordinate-sorted S. cerevisiae (“sc”) IP BAM files. Based on these file paths, the script automatically derives paths to the additional required files: S. cerevisiae input BAM files and S. pombe (“sp”) IP and input BAM files.

These files must be organized under a common parent directory with the following structure:

align_bowtie2_global/flag-2_mapq-1/sc # IP and input BAM files for S. cerevisiae alignments
align_bowtie2_global/flag-2_mapq-1/sp # IP and input BAM files for S. pombe alignments

The script derives file paths using systematic substitutions to the file paths and names; for more information, refer to the documentation for execute_calculate_scaling_factor_spike.sh.

If any of these files are missing, the script prints an error message and terminates; deviations from the described directory structure likewise result in errors. For further details, refer to the documentation for execute_calculate_scaling_factor_spike.sh, workflow.md, and Data Analysis E.
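The path derivation described above can be pictured with simple string substitutions. In the sketch below, the assumptions that input files differ from IP files only by an in_ versus IP_ prefix and that S. pombe files share the same basename under sp/ are made purely for illustration; the actual substitutions are described in the documentation for execute_calculate_scaling_factor_spike.sh.

```shell
# Print the four BAM paths expected for one sample, given the path to its
# S. cerevisiae IP BAM file. Naming rules here are hypothetical.
derive_bam_paths() (
    sc_ip="$1"
    sc_in="$(dirname "${sc_ip}")/in_$(basename "${sc_ip}" | sed 's/^IP_//')"
    sp_ip="$(printf '%s\n' "${sc_ip}" | sed 's|/sc/|/sp/|')"
    sp_in="$(printf '%s\n' "${sc_in}" | sed 's|/sc/|/sp/|')"
    printf '%s\n' "${sc_ip}" "${sc_in}" "${sp_ip}" "${sp_in}"
)

# Report any derived file that does not exist before starting a long run.
derive_bam_paths "align_bowtie2_global/flag-2_mapq-1/sc/IP_WT_log_Brn1_rep1.bam" \
    | while read -r bam; do
        [ -f "${bam}" ] || echo "missing: ${bam}" >&2
    done
```

Checking all four paths up front makes a missing S. pombe or input file visible immediately rather than partway through a batch.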

Acknowledgments

This work was supported by R35GM139429 to T.T., Uehara Memorial Foundation Research Fellowship to R.H., and Osamu Hayaishi Memorial Scholarship for Study Abroad to R.H. Additional support was provided by the Genomics & Bioinformatics Shared Resource (RRID:SCR_022606), part of the Fred Hutch/University of Washington/Seattle Children’s Cancer Consortium (P30 CA015704).

Competing interests

The authors declare that they have no competing interests, financial or otherwise.

References

Barrett, T., Clark, K., Gevorgyan, R., Gorelenkov, V., Gribov, E., Karsch-Mizrachi, I., Kimelman, M., Pruitt, K. D., Resenchuk, S., Tatusova, T., et al. (2012). BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Res., 40(Database issue), D57–D63.

Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M., Marshall, K. A., Phillippy, K. H., Sherman, P. M., Holko, M., et al. (2013). NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res., 41(Database issue), D991–D995.

Barski, A., Cuddapah, S., Cui, K., Roh, T.-Y., Schones, D. E., Wang, Z., Wei, G., Chepelev, I. and Zhao, K. (2007). High-resolution profiling of histone methylations in the human genome. Cell, 129(4), 823–837.

Bezanson, J., Edelman, A., Karpinski, S. and Shah, V. B. (2017). Julia: A fresh approach to numerical computing. SIAM Rev. Soc. Ind. Appl. Math., 59(1), 65–98.

Bezanson, J., Karpinski, S., Shah, V. B. and Edelman, A. (2012). Julia: A fast dynamic language for technical computing. In arXiv [cs.PL]. arXiv. Retrieved from http://arxiv.org/abs/1209.5145

Bonhoure, N., Bounova, G., Bernasconi, D., Praz, V., Lammers, F., Canella, D., Willis, I. M., Herr, W., Hernandez, N., Delorenzi, M., et al. (2014). Quantifying ChIP-seq data: a spiking method providing an internal reference for sample-to-sample normalization. Genome Res., 24(7), 1157–1168.

Chen, K., Hu, Z., Xia, Z., Zhao, D., Li, W. and Tyler, J. K. (2015). The overlooked fact: Fundamental need for spike-in control for virtually all genome-wide analyses. Mol. Cell. Biol., 36(5), 662–667.

Chuan, J., Zhou, A., Hale, L. R., He, M. and Li, X. (2021). Atria: an ultra-fast and accurate trimmer for adapter and quality trimming. GigaByte, 2021, gigabyte31.

Chung, D., Kuan, P. F., Li, B., Sanalkumar, R., Liang, K., Bresnick, E. H., Dewey, C. and Keleş, S. (2011). Discovering transcription factor binding sites in highly repetitive regions of genomes with multi-read analysis of ChIP-Seq data. PLoS Comput. Biol., 7(7), e1002111.

Clough, E. and Barrett, T. (2016). The Gene Expression Omnibus database. Methods Mol. Biol., 1418, 93–110.

D’Alfonso, A., Micheli, G. and Camilloni, G. (2024). rDNA transcription, replication and stability in Saccharomyces cerevisiae. Semin. Cell Dev. Biol., 159-160, 1–9.

Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., Whitwham, A., Keane, T., McCarthy, S. A., Davies, R. M., et al. (2021). Twelve years of SAMtools and BCFtools. Gigascience, 10(2). https://doi.org/10.1093/gigascience/giab008

Dickson, B. M., Kupai, A., Vaughan, R. M. and Rothbart, S. B. (2023). Streamlined quantitative analysis of histone modification abundance at nucleosome-scale resolution with siQ-ChIP version 2.0. Sci. Rep., 13(1), 7508.

Dickson, B. M., Tiedemann, R. L., Chomiak, A. A., Cornett, E. M., Vaughan, R. M. and Rothbart, S. B. (2020). A physical basis for quantitative ChIP-sequencing. J. Biol. Chem., 295(47), 15826–15837.

Edgar, R., Domrachev, M. and Lash, A. E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res., 30(1), 207–210.

Egan, B., Yuan, C.-C., Craske, M. L., Labhart, P., Guler, G. D., Arnott, D., Maile, T. M., Busby, J., Henry, C., Kelly, T. K., et al. (2016). An alternative approach to ChIP-seq normalization enables detection of genome-wide changes in histone H3 lysine 27 trimethylation upon EZH2 inhibition. PLoS One, 11(11), e0166438.

Grzybowski, A. T., Chen, Z. and Ruthenburg, A. J. (2015). Calibrating ChIP-seq with nucleosomal internal standards to measure histone modification density genome wide. Mol. Cell, 58(5), 886–899.

Jain, D., Baldi, S., Zabel, A., Straub, T. and Becker, P. B. (2015). Active promoters give rise to false positive “Phantom Peaks” in ChIP-seq experiments. Nucleic Acids Res., 43(14), 6959–6968.

Johnson, D. S., Mortazavi, A., Myers, R. M. and Wold, B. (2007). Genome-wide mapping of in vivo protein-DNA interactions. Science, 316(5830), 1497–1502.

Landt, S. G., Marinov, G. K., Kundaje, A., Kheradpour, P., Pauli, F., Batzoglou, S., Bernstein, B. E., Bickel, P., Brown, J. B., Cayting, P., et al. (2012). ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res., 22(9), 1813–1831.

Langmead, B. and Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nat. Methods, 9(4), 357–359.

Langmead, B., Trapnell, C., Pop, M. and Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol., 10(3), R25.

Langmead, B., Wilks, C., Antonescu, V. and Charles, R. (2019). Scaling read aligners to hundreds of threads on general-purpose processors. Bioinformatics, 35(3), 421–432.

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R. and 1000 Genome Project Data Processing Subgroup. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), 2078–2079.

Li, H., Ruan, J. and Durbin, R. (2008). Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res., 18(11), 1851–1858.

Li, J.-W., Schmieder, R., Ward, R. M., Delenick, J., Olivares, E. C. and Mittelman, D. (2012). SEQanswers: an open access community for collaboratively decoding genomes. Bioinformatics, 28(9), 1272–1273.

Marx, V. (2019). What to do about those immunoprecipitation blues. Nat. Methods, 16(4), 289–292.

Mikkelsen, T. S., Ku, M., Jaffe, D. B., Issac, B., Lieberman, E., Giannoukos, G., Alvarez, P., Brockman, W., Kim, T.-K., Koche, R. P., et al. (2007). Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature, 448(7153), 553–560.

Nakato, R. and Shirahige, K. (2018). Sensitive and robust assessment of ChIP-seq read distribution using a strand-shift profile. Bioinformatics, 34(14), 2356–2363.

Noble, W. S. (2009). A quick guide to organizing computational biology projects. PLoS Comput. Biol., 5(7), e1000424.

O’Cathail, C., Ahamed, A., Burgin, J., Cummins, C., Devaraj, R., Gueye, K., Gupta, D., Gupta, V., Haseeb, M., Ihsan, M., et al. (2024). The European Nucleotide Archive in 2024. Nucleic Acids Res. https://doi.org/10.1093/nar/gkae975

Orlando, D. A., Chen, M. W., Brown, V. E., Solanki, S., Choi, Y. J., Olson, E. R., Fritz, C. C., Bradner, J. E. and Guenther, M. G. (2014). Quantitative ChIP-Seq normalization reveals global modulation of the epigenome. Cell Rep., 9(3), 1163–1170.

Park, P. J. (2009). ChIP-seq: advantages and challenges of a maturing technology. Nat. Rev. Genet., 10(10), 669–680.

Ramírez, F., Ryan, D. P., Grüning, B., Bhardwaj, V., Kilpert, F., Richter, A. S., Heyne, S., Dündar, F. and Manke, T. (2016). deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res., 44(W1), W160–W165.

Robertson, G., Hirst, M., Bainbridge, M., Bilenky, M., Zhao, Y., Zeng, T., Euskirchen, G., Bernier, B., Varhol, R., Delaney, A., et al. (2007). Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat. Methods, 4(8), 651–657.

Robinson, J. T., Thorvaldsdóttir, H., Turner, D. and Mesirov, J. P. (2023). igv.js: an embeddable JavaScript implementation of the Integrative Genomics Viewer (IGV). Bioinformatics, 39(1). https://doi.org/10.1093/bioinformatics/btac830

Robinson, J. T., Thorvaldsdóttir, H., Wenger, A. M., Zehir, A. and Mesirov, J. P. (2017). Variant review with the Integrative Genomics Viewer. Cancer Res., 77(21), e31–e34.

Robinson, J. T., Thorvaldsdóttir, H., Winckler, W., Guttman, M., Lander, E. S., Getz, G. and Mesirov, J. P. (2011). Integrative genomics viewer. Nat. Biotechnol., 29(1), 24–26.

Slatko, B. E., Gardner, A. F. and Ausubel, F. M. (2018). Overview of next-generation sequencing technologies. Curr. Protoc. Mol. Biol., 122(1), e59.

Swygert, S. G., Kim, S., Wu, X., Fu, T., Hsieh, T.-H., Rando, O. J., Eisenman, R. N., Shendure, J., McKnight, J. N. and Tsukiyama, T. (2019). Condensin-dependent chromatin compaction represses transcription globally during quiescence. Mol. Cell, 73(3), 533–546.e4.

Swygert, S. G., Lin, D., Portillo-Ledesma, S., Lin, P.-Y., Hunt, D. R., Kao, C.-F., Schlick, T., Noble, W. S. and Tsukiyama, T. (2021). Local chromatin fiber folding represses transcription and loop extrusion in quiescent cells. Elife, 10. https://doi.org/10.7554/eLife.72062

Tange, O. (2018). GNU Parallel 2018. Zenodo.

Thorvaldsdóttir, H., Robinson, J. T. and Mesirov, J. P. (2013). Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief. Bioinform., 14(2), 178–192.

Uhlen, M., Bandrowski, A., Carr, S., Edwards, A., Ellenberg, J., Lundberg, E., Rimm, D. L., Rodriguez, H., Hiltke, T., Snyder, M., et al. (2016). A proposal for validation of antibodies. Nat. Methods, 13(10), 823–827.

Van Rossum, G. and Drake, F. L., Jr. (2009). Python 3 Reference Manual: (Python Documentation Manual Part 2). Createspace.

Yoo, A. B., Jette, M. A. and Grondona, M. (2003). SLURM: Simple Linux Utility for Resource Management. In Job Scheduling Strategies for Parallel Processing (pp. 44–60). Berlin, Heidelberg: Springer Berlin Heidelberg.

Ziemann, M., Poulain, P. and Bora, A. (2023). The five pillars of computational reproducibility: bioinformatics and beyond. Brief. Bioinform., 24(6). https://doi.org/10.1093/bib/bbad375

How to cite:

Alavattam, K. G., Dickson, B. M., Hirano, R., Dell, R. and Tsukiyama, T. (2024). ChIP-seq data processing and relative and quantitative signal normalization for Saccharomyces cerevisiae. Bio-protocol Preprint. bio-protocol.org/prep2770.

This protocol preprint was submitted via the Bio-protocol Exchange "Submit a Preprint" track.
