featureCounts: an efficient general purpose program for assigning sequence reads to genomic features

Yang Liao, Gordon K. Smyth, Wei Shi

Bioinformatics·2013·Biochemistry, Genetics and Molecular Biology·28,218 citations

10 min read

Read the full paper at DOI or on arxiv

TL;DR

featureCounts is a general-purpose read summarization/counting tool for both RNA-seq and DNA-seq, supporting single-end and paired-end reads/fragments and feature-level or meta-feature (e.g., gene) summarization.

Briefing Cornell Notes

Briefing

This paper addresses a practical but surprisingly under-discussed bottleneck in next-generation sequencing (NGS) analysis pipelines: after reads are aligned to a reference genome, downstream biological interpretation often requires summarizing (counting) how many reads (or read pairs/fragments) overlap predefined genomic features such as exons, genes, promoters, or ChIP-seq peak intervals. The authors’ research question is essentially: can we build a general-purpose read summarization/counting program that is both accurate across common experimental scenarios (RNA-seq and DNA-seq; single-end and paired-end; feature-level and gene/meta-feature-level summarization) and substantially more efficient in time and memory than existing widely used tools?

This matters because read counting is ubiquitous. Count-based statistical methods for differential expression and differential binding rely on these summaries, and the computational cost of counting can rival or exceed alignment cost—especially when the number of features is large (e.g., many exons/transcripts) or when multiple counting modes and filtering options are needed. Yet, the literature and tooling have historically focused more on alignment and on statistical modeling than on the engineering of efficient, correct counting.

Methodologically, featureCounts is presented as a software solution with both algorithmic design and empirical benchmarking. The study design is comparative: the authors implement featureCounts and evaluate it against established tools—GenomicRanges/IRanges (summarizeOverlaps and countOverlaps), HTSeq’s htseq-count, and BEDTools (coverageBED)—using representative RNA-seq and ChIP-seq datasets plus a simulation designed to stress the “many reference sequences/contigs” case. They also provide a theoretical complexity analysis to explain observed performance differences.

For RNA-seq benchmarking, they use the SEQC/MAQC project dataset: 6.8 million pairs of 101 bp reads from a Universal Human Reference RNA sample, aligned to GRCh37 using Subjunc (which identifies exon–exon junctions and provides base-level mapping). Gene-level counting is performed where exons are treated as features and genes as meta-features (i.e., a read is assigned to a gene if it overlaps any exon of that gene). The annotation uses NCBI RefSeq build 37.2 with 25,702 genes and 225,071 exons. For single-end evaluation, they count only the first read of each pair. For paired-end evaluation, they count fragments (defined by the two ends).

Key accuracy comparisons are reported via concordance of counts. For single-end reads, featureCounts and summarizeOverlaps produce identical gene counts for every gene. htseq-count counts slightly fewer reads; the discrepancy is traced to an annotation boundary interpretation: htseq-count treats the right-most base of each feature as open (excluded), while featureCounts and summarizeOverlaps treat it as closed (included), consistent with the GFF specification. After modifying the htseq-count annotation by shifting feature ends by one base, htseq-count yields identical counts.

For paired-end fragments, summarizeOverlaps counts far fewer fragments than featureCounts and htseq-count. The authors attribute this to summarizeOverlaps requiring both ends of a fragment to be successfully mapped before assignment. featureCounts and htseq-count can assign fragments when only one end maps. In their results, featureCounts counts 4,796,948 fragments versus summarizeOverlaps’ 3,942,439 fragments, while htseq-count counts 4,769,913 fragments. They further validate the additional fragments by showing that 92% of the fragments counted by featureCounts but not by summarizeOverlaps are assigned to genes that also have at least 100 fragments supported by both ends mapped, and only 0.1% of the extra fragments are assigned to genes with no support from any fully mapped fragments. This supports the practical correctness of including one-end-mapped fragments.

They also examine why featureCounts counts slightly more fragments than htseq-count. The paper explains that featureCounts can use overlap size information to resolve cases where a fragment overlaps multiple genes with different numbers of mapped reads from that fragment; htseq-count treats such cases as ambiguous and may assign none. In their dataset, over 86% of fragments counted by featureCounts but not by htseq-count are assigned to genes that already have at least 100 unambiguous fragments, and only 0.2% of extra fragments are not supported by commonly assigned fragments.

The efficiency results for RNA-seq are striking. On the SEQC RNA-seq dataset (single-end reads), featureCounts runs in 1.0 minute and uses 16 MB peak memory, whereas summarizeOverlaps takes 12.1 minutes and uses 3,400 MB when run “whole genome at once,” and 41.7 minutes with 661 MB when run chromosome-by-chromosome. htseq-count takes 22.7 minutes and uses 101 MB. For paired-end fragments, featureCounts is similarly far faster and more memory efficient (the paper emphasizes an order-of-magnitude speed advantage and much lower memory usage). The authors also note that summarizeOverlaps was run chromosome-by-chromosome to reduce memory, but it still used about 20× more memory than featureCounts.

For ChIP-seq benchmarking, they use a mouse mammary stem cell dataset targeting the broad H3K27me3-associated region. The dataset contains 15 million pairs of 35 bp reads. Reads are mapped to mm9 using Subread, and fragments are filtered to those with both paired reads mapped and fragment lengths between 50 bp and 500 bp. Features are broad gene bodies plus a 3 kb upstream promoter region; genes are the features counted at the gene level (multi-overlap counting is expected, since a fragment may overlap multiple genes). featureCounts is compared to IRanges countOverlaps, htseq-count, and BEDTools coverageBED. Here, featureCounts is configured to count multi-overlap fragments.

In this ChIP-seq scenario, featureCounts and countOverlaps yield identical fragment counts for every gene, while featureCounts is much faster and uses far less memory. The reported fragment counts are 5,392,155 for featureCounts and countOverlaps. Runtime and memory: featureCounts takes 0.9 minutes and uses 4 MB, while countOverlaps takes 24.4 minutes and uses 7,000 MB when run on the whole genome at once, and 36.6 minutes with 783 MB when run by chromosome. htseq-count counts fewer fragments (4,978,050 in union mode; 4,993,644 in intersection-nonempty mode) and takes about 36 minutes with 31 MB memory. coverageBED counts slightly fewer fragments (5,366,902) and takes 4.4 minutes with 41 MB memory. The paper states featureCounts is about five times faster and uses about 10 times less memory than the next most efficient tool.

To test performance when the number of reference sequences is large, they simulate RNA-seq reads from a fragmented/incompletely assembled budgerigar genome assembly (Assemblathon 2). The annotation includes 16,204 genes and 153,724 exons across 2,850 scaffolds. They generate 8 million 100 bp single-end reads sampled from annotated exonic regions, create SAM records with filled mapping information, and summarize at gene level. featureCounts counts 7,924,065 reads in 0.6 minutes using 15 MB. summarizeOverlaps matches the counts but is slower: 12.6 minutes and 2,400 MB when run whole genome at once, or 53.3 minutes and 262 MB when run by scaffold. htseq-count counts slightly fewer reads (7,912,439) and takes 12.1 minutes with 78 MB. This shows featureCounts’ efficiency advantage grows in the “many contigs” regime.

The authors also provide a theoretical analysis of time and space complexity. They derive featureCounts time complexity as approximately $O (f lo g f + r k_{1})$ where $f$ is the number of features, $r$ is the number of reads, and $k_{1}$ is the number of features included in a genomic bin. Space complexity is given as $O (f + b_{1})$ where $b_{1}$ is the number of bins. They compare these to other tools’ complexities (e.g., summarizeOverlaps and htseq-count include additional $r lo g r$ or $r lo g f$ terms due to sorting and tree-based searches). The theoretical discussion aligns with empirical results: featureCounts avoids expensive sorting of reads and uses a hierarchical search structure.

The core algorithmic contributions are described as: (1) chromosome hashing to quickly match reference sequence names between reads and annotations; (2) genome binning into non-overlapping 128 kb bins; (3) within each bin, grouping features into blocks such that the number of blocks is roughly the square root of the number of features in the bin; and (4) a hierarchical search that narrows candidate features by checking bins, then blocks, then features. featureCounts assigns a “hit” if any overlap of at least 1 bp exists between the read/fragment and a feature, and it supports both feature-level and meta-feature-level summarization (e.g., exons to genes). It also provides options for handling multi-overlap reads/fragments (exclude them or count them for all overlapped features), with recommendations tailored to RNA-seq versus ChIP-seq.

Limitations are not framed as a formal “limitations” section, but some constraints are implicit in the methodology and evaluation. The benchmarks focus on gene-level summarization for RNA-seq and broad gene-body promoter regions for ChIP-seq, not transcript-level isoform quantification (which is a different problem). The evaluation also relies on specific aligners (Subjunc for RNA-seq and Subread for ChIP-seq) and specific annotation formats (GFF/SAF) and counting modes (e.g., multi-overlap handling). Additionally, the paper’s performance comparisons are run on a particular high-performance computing environment and often with single-thread execution; while featureCounts supports multithreading, the reported timings in the comparisons are conservative. Finally, the correctness validation is based on concordance with other tools and biological plausibility checks (e.g., support from fully mapped fragments), but not on downstream differential expression outcomes.

Practically, the implications are clear: featureCounts is positioned as a drop-in, general-purpose counting engine for NGS pipelines that need accurate and fast read summarization with flexible options. Who should care? Practitioners running RNA-seq differential expression workflows (e.g., using edgeR/limma), ChIP-seq differential binding/epigenetic analyses, and large-scale studies where counting time and memory become limiting. Bioinformatics pipeline developers benefit from the availability of both a command-line tool (Subread) and an R wrapper (Rsubread) that integrates with Bioconductor statistical packages.

Overall, the paper’s core contribution is demonstrating that a carefully engineered counting algorithm—chromosome hashing plus bin/block hierarchical feature search—can deliver substantial speed and memory improvements (often an order of magnitude) while maintaining high concordance in gene-level counts across RNA-seq and ChIP-seq use cases, including challenging scenarios with many reference sequences.

Cornell Notes

featureCounts provides an accurate, general-purpose method to assign aligned NGS reads/fragments to genomic features (exons/genes/intervals) with extensive options for RNA-seq and DNA-seq. Across RNA-seq, ChIP-seq, and simulated multi-contig scenarios, it matches other tools’ counts while running much faster and using far less memory, largely due to chromosome hashing and a hierarchical bin/block feature search implemented in C.

What problem does the paper target, and why is it important?

After alignment, many analyses require counting how many reads/fragments overlap predefined genomic features (exons/genes/promoters/intervals). This read summarization step is ubiquitous in count-based differential expression/binding pipelines and can be a major computational bottleneck.

What study design is used to evaluate featureCounts?

The authors benchmark featureCounts against established tools (GenomicRanges/IRanges, HTSeq htseq-count, and BEDTools) on representative RNA-seq and ChIP-seq datasets plus a simulation with many contigs/scaffolds, comparing both accuracy (count concordance) and efficiency (runtime and peak memory).

What data and annotation were used for the RNA-seq benchmark?

SEQC/MAQC RNA-seq: 6.8 million pairs of 101 bp reads aligned to GRCh37 using Subjunc; gene-level counting based on RefSeq build 37.2 with 25,702 genes and 225,071 exons.

How does featureCounts define a “hit” when assigning reads to features?

It calls a hit if any overlap of at least 1 bp exists between the read/fragment and a feature, accounting for alignment details such as insertions/deletions via the CIGAR string.

What accuracy result was reported for single-end RNA-seq counts?

featureCounts and summarizeOverlaps produced identical gene counts for every gene; htseq-count initially differed slightly due to an inclusive/exclusive interpretation of feature endpoints, which was corrected by shifting annotation ends by one base.

How did featureCounts differ from summarizeOverlaps for paired-end fragments?

summarizeOverlaps required both ends of a fragment to map before assigning it, whereas featureCounts could assign fragments with only one end mapped. featureCounts therefore counted more fragments (4,796,948 vs 3,942,439).

What validation did the authors provide for the extra one-end-mapped fragments?

They report that 92% of fragments counted by featureCounts but not by summarizeOverlaps were assigned to genes that also had at least 100 fragments supported by both ends mapped; only 0.1% of extra fragments were assigned to genes with no support from fully mapped fragments.

What were the efficiency results for RNA-seq gene-level counting?

On the SEQC RNA-seq data (single-end), featureCounts took 1.0 minute and used 16 MB, versus summarizeOverlaps at 12.1 minutes/3,400 MB (whole genome at once) and htseq-count at 22.7 minutes/101 MB.

What were the efficiency and accuracy results for the ChIP-seq benchmark?

For H3K27me3 broad gene regions, featureCounts and countOverlaps matched fragment counts exactly (5,392,155). featureCounts took 0.9 minutes and used 4 MB, while countOverlaps took 24.4 minutes and used 7,000 MB (whole genome at once).

What algorithmic ideas explain featureCounts’ speed?

Chromosome hashing for fast reference-name matching and a hierarchical feature search using 128 kb genomic bins and within-bin feature blocks, implemented in C with optional multithreading.

Review Questions

What are the two hierarchical levels (bins and blocks) used by featureCounts, and how do they reduce the search space for overlaps?
Why did htseq-count initially disagree with featureCounts/summarizeOverlaps on RNA-seq single-end counts, and how was the discrepancy resolved?
In paired-end RNA-seq, what is the practical consequence of requiring both ends mapped (summarizeOverlaps) versus allowing one-end-mapped fragments (featureCounts)?
How do the authors support that additional fragments counted by featureCounts are likely correctly assigned?
What does the theoretical complexity analysis suggest about why featureCounts scales better with the number of reads than some competing approaches?

Key Points

1
featureCounts is a general-purpose read summarization/counting tool for both RNA-seq and DNA-seq, supporting single-end and paired-end reads/fragments and feature-level or meta-feature (e.g., gene) summarization.
2
It assigns reads precisely by checking overlap of at least 1 bp between the read/fragment and genomic features, accounting for alignment structure (e.g., indels/junctions via CIGAR).
3
Across RNA-seq gene-level counting, featureCounts matches summarizeOverlaps exactly for single-end reads; differences with htseq-count were due to feature endpoint inclusivity and were fixed by adjusting annotation boundaries.
4
For paired-end RNA-seq, featureCounts counts more fragments than summarizeOverlaps because it can assign fragments when only one end maps; the authors validate that most extra fragments map to genuinely expressed genes (92% supported by genes with 200 fully mapped fragments; only 0.1% unsupported by any fully mapped fragments).
5
On SEQC RNA-seq, featureCounts ran in 1.0 minute using 16 MB, versus summarizeOverlaps at 12.1 minutes/3,400 MB (whole genome) and htseq-count at 22.7 minutes/101 MB.
6
On ChIP-seq (H3K27me3 broad regions), featureCounts matched countOverlaps exactly while running in 0.9 minutes and using 4 MB, compared with countOverlaps at 24.4 minutes/7,000 MB.
7
The speed and memory gains come from chromosome hashing plus a hierarchical overlap search using 128 kb genomic bins and within-bin feature blocks, implemented in C (with an R wrapper calling the same compiled code).

Highlights

“featureCounts and summarizeOverlaps yielded identical counts for every gene” (SEQC RNA-seq, single-end).

featureCounts counted 4,796,948 fragments vs summarizeOverlaps’ 3,942,439 (paired-end), and “92% of fragments…were assigned to genes that also had at least 100 assigned fragments with both ends mapped.”

On SEQC RNA-seq (single-end), featureCounts: 1.0 min, 16 MB; summarizeOverlaps: 12.1 min, 3,400 MB (whole genome at once).

On H3K27me3 ChIP-seq broad regions, featureCounts matched countOverlaps exactly (5,392,155 fragments) while using 4 MB and taking 0.9 min; countOverlaps used 7,000 MB and took 24.4 min (whole genome at once).

featureCounts’ hierarchical search uses “chromosome hashing” and “genome bins and feature blocks” (128 kb bins; blocks sized so the number of blocks per bin is about the square root of the number of features).

Topics

Bioinformatics software
Next-generation sequencing (NGS) analysis
Read alignment post-processing
RNA-seq quantification
ChIP-seq read counting
Genomic interval overlap algorithms
Differential expression and differential binding pipelines
Computational efficiency and algorithmic complexity

Mentioned

featureCounts
Subread
Rsubread
Bioconductor
GenomicRanges
IRanges
HTSeq
htseq-count
BEDTools
coverageBED
Subjunc
Subread aligner
edgeR
limma
SAM
BAM
GFF
SAF
NCBI RefSeq
Ensembl (cited)
Yang Liao
Gordon K. Smyth
Wei Shi
Leming Shi
Charles Wang
Aaron Lun
Robinson et al. (edgeR)
Law et al. (limma)
Anders (HTSeq)
Quinlan and Hall (BEDTools)
Aboyoun et al. (GenomicRanges)
Pages et al. (IRanges)
Bradnam et al. (Assemblathon 2)
Howard et al. (Assemblathon 2)
Pal et al. (ChIP-seq dataset)
Shi et al. (MAQC/SEQC)
NGS - Next-generation sequencing
RNA-seq - RNA sequencing
ChIP-seq - Chromatin immunoprecipitation sequencing
DNA-seq - DNA sequencing
SEQC - Sequencing Quality Control
MAQC - Microarray Quality Control
GRCh37 - Genome Reference Consortium human build 37
RefSeq - NCBI Reference Sequences
GFF - General Feature Format
SAF - Simplified Annotation Format
CIGAR - Compact Idiosyncratic Gapped Alignment Report
MB - Megabytes
CPU - Central Processing Unit
H3K27me3 - Trimethylation of histone H3 lysine 27
TSS - Transcription start site
RCT - Randomized controlled trial (not used in this paper)