
Second-generation PLINK: rising to the challenge of larger and richer datasets

Christopher Chang, Carson C. Chow, Laurent C.A.M. Tellier, Shashaank Vattikuti, Shaun Purcell, James J. Lee
GigaScience·2015·Biochemistry, Genetics and Molecular Biology·13,567 citations


TL;DR

The paper’s core goal is to make PLINK scalable for “larger and richer” genetic datasets by improving both performance and compatibility with modern data characteristics.

Briefing

This paper addresses a practical bottleneck in genome-wide association studies (GWAS) and population genetics: the original PLINK codebase (PLINK 1.x) is widely used, but it was designed around a data representation and computational patterns that do not scale well to today’s “larger and richer” genetic datasets. The authors’ research question is essentially engineering-focused: how can PLINK be redesigned to (i) run substantially faster on modern multicore hardware, (ii) handle datasets larger than available RAM, and (iii) prepare for future compatibility needs arising from imputation outputs (probabilistic genotype calls), phasing information, and multiallelic variants.

This matters because the field has shifted from modest, hard-call genotype matrices toward whole-genome sequencing and imputed datasets that are both larger and more information-rich. Many downstream analyses—identity-by-descent (IBD), genomic relationship matrices (GRMs), linkage disequilibrium (LD) pruning, haplotype block estimation, and permutation-based association tests—are computationally heavy and are often executed repeatedly inside pipelines. When software cannot exploit modern CPU parallelism or cannot operate directly on compressed/packed representations, researchers either lose time, lose statistical power (e.g., fewer permutations), or are forced to use high-end computing resources.

Methodologically, the paper is a technical note describing the design and implementation of “second-generation” PLINK. The first major release from this effort is PLINK 1.9, which is positioned as a drop-in replacement for PLINK 1.07 in most workflows. The authors evaluate performance using timing benchmarks across multiple machines and datasets, comparing PLINK 1.9 against PLINK 1.07 (and sometimes against other tools such as GCTA, Haploview, and PERMORY). The paper does not present a randomized clinical/biological study; instead, it uses empirical computational experiments.

The core methodological changes in PLINK 1.9 include: (1) extensive use of bit-level parallelism by rewriting inner loops to operate on packed genotype data; (2) improved “bit population count” (popcount) implementations using SSE2-based algorithms for portability; (3) multithreading and optional distributed computation via a cluster-splitting flag; (4) memory efficiency by avoiding storing the full genotype matrix in RAM for many operations; and (5) algorithmic improvements in several statistical tests and analyses.
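The SSE2-based popcount the authors adopt builds on the classic SWAR (“SIMD within a register”) bit-counting technique surveyed by Walisch and others. As a rough illustration only (PLINK’s actual implementation is vectorized C, not Python), the 64-bit SWAR popcount works like this:

```python
# SWAR ("SIMD within a register") population count for a 64-bit word.
# PLINK's production code does this in C with SSE2 vectors; this Python
# version just shows the divide-and-conquer bit arithmetic.
M1 = 0x5555555555555555   # 0101... selects every other bit
M2 = 0x3333333333333333   # 0011... selects 2-bit groups
M4 = 0x0F0F0F0F0F0F0F0F   # 00001111... selects 4-bit groups
H01 = 0x0101010101010101  # multiplying sums all bytes into the top byte

def popcount64(x: int) -> int:
    """Count set bits in a 64-bit word without branching or per-bit loops."""
    x -= (x >> 1) & M1                 # 2-bit fields now hold bit-pair counts
    x = (x & M2) + ((x >> 2) & M2)     # 4-bit fields hold nibble counts
    x = (x + (x >> 4)) & M4            # 8-bit fields hold byte counts
    return ((x * H01) & 0xFFFFFFFFFFFFFFFF) >> 56  # byte sum lands in top byte
```

On 64-bit inputs this matches Python’s built-in `bin(x).count("1")`; the SSE2 variants cited in the paper apply the same masking idea to 128-bit registers.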

A key example of the bit-parallel approach is the identity-by-state (IBS) computation. The authors replace per-marker, per-pair loops with operations over blocks of markers using XOR and bit population counts, masking missing calls. They report that PLINK 1.9 can process a block of 960 markers in less than twice the time that PLINK 1.07 takes to handle a single marker—an illustration of how bit-parallelism changes the effective scaling of inner-loop work.
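To make the XOR-plus-popcount idea concrete, here is a minimal Python sketch on 2-bit packed genotypes. The encoding below (00 = homozygous minor, 10 = heterozygous, 11 = homozygous major, 01 = missing) is an assumption chosen so that the popcount of the XOR equals the summed IBS mismatch distance; it illustrates the technique rather than reproducing PLINK’s exact on-disk codes:

```python
# Illustrative 2-bit encoding (assumed for this sketch): chosen so that
# popcount(a XOR b) equals the per-marker IBS mismatch distance
# (0 for identical calls, 1 for hom-vs-het, 2 for opposite homozygotes).
CODE = {0: 0b00, 1: 0b10, 2: 0b11, None: 0b01}  # dosage 0/1/2, None = missing

def pack(genotypes):
    """Pack dosage calls into one big int (2 bits per marker), plus a mask
    holding 11 at every non-missing marker."""
    word = mask = 0
    for i, g in enumerate(genotypes):
        word |= CODE[g] << (2 * i)
        if g is not None:
            mask |= 0b11 << (2 * i)
    return word, mask

def ibs_mismatch(a, b):
    """Summed IBS distance over markers where both samples have calls:
    one XOR, one AND (missing-call mask), one popcount."""
    (wa, ma), (wb, mb) = a, b
    return bin((wa ^ wb) & ma & mb).count("1")
```

A whole block of markers is thus handled by a handful of word-wide operations instead of a per-marker loop, which is exactly why the inner-loop scaling changes so dramatically.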

They also introduce an early-termination strategy to reduce the computational complexity of Hardy–Weinberg equilibrium (HWE) exact tests and Fisher’s exact tests. For SNP-HWE, they argue that while a naive exact method scales as O(n) in the number of contingency table entries, only about O(√n) likelihood terms meaningfully affect the final p-value, because likelihoods decay super-geometrically away from the mode. They terminate when additional terms become too small to change the p-value under IEEE 754 double precision.
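As a hedged illustration, the sketch below ports the Wigginton et al. SNP-HWE likelihood recurrence to Python and bolts on a naive truncation rule: each tail stops once its terms fall below `tol` times the modal likelihood and the observed heterozygote count has been covered on that side. PLINK’s production rule for where truncation is provably safe is more careful; the function name and `tol` parameter here are my own.

```python
def hwe_exact_p(obs_het, obs_hom1, obs_hom2, tol=1e-16):
    """Two-sided SNP-HWE exact p-value (Wigginton-style recurrence) with a
    simple early-termination rule on both tails of the distribution."""
    rare = 2 * min(obs_hom1, obs_hom2) + obs_het   # rare-allele copies
    n = obs_het + obs_hom1 + obs_hom2              # total genotypes
    # Most probable heterozygote count given the allele counts:
    mid = rare * (2 * n - rare) // (2 * n)
    if mid % 2 != rare % 2:
        mid += 1
    probs = {mid: 1.0}                             # unnormalized likelihoods
    # Walk downward from the mode.
    het, hom_r, hom_c = mid, (rare - mid) // 2, n - mid - (rare - mid) // 2
    while het >= 2:
        nxt = probs[het] * het * (het - 1) / (4.0 * (hom_r + 1) * (hom_c + 1))
        if nxt < tol and het - 2 < obs_het:        # negligible and past obs
            break
        probs[het - 2] = nxt
        het, hom_r, hom_c = het - 2, hom_r + 1, hom_c + 1
    # Walk upward from the mode.
    het, hom_r, hom_c = mid, (rare - mid) // 2, n - mid - (rare - mid) // 2
    while het <= rare - 2:
        nxt = probs[het] * 4.0 * hom_r * hom_c / ((het + 2.0) * (het + 1.0))
        if nxt < tol and het + 2 > obs_het:
            break
        probs[het + 2] = nxt
        het, hom_r, hom_c = het + 2, hom_r - 1, hom_c - 1
    total = sum(probs.values())
    # target is 0.0 only when the observed term itself was truncated,
    # i.e. the p-value is already below ~tol in this sketch.
    target = probs.get(obs_het, 0.0)
    return min(1.0, sum(p for p in probs.values()
                        if p <= target * (1 + 1e-12)) / total)
```

Because the terms decay super-geometrically away from the mode, only on the order of √n of them survive the cutoff, which is where the claimed complexity reduction comes from.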

For haplotype block estimation, PLINK 1.9 accelerates the existing confidence-interval-based method (from Gabriel et al., via Haploview) by streamlining diplotype frequency computations, reducing the number of likelihood evaluations needed to classify variant pairs as “strong LD,” and avoiding large intermediate tables by updating only sliding-window counts. They also incorporate pruning logic that can skip classification of many variant pairs once it becomes impossible for a block to satisfy the required “strong LD vs recombination” ratio.

The paper’s performance comparisons are organized into tables. Across datasets and machines, the authors report speedups frequently exceeding two orders of magnitude and sometimes reaching three. In Table 1 (initialization and basic I/O), the PLINK 1.07/PLINK 1.90 runtime ratio ranges from modest to very large depending on dataset and machine, with substantial gains on the largest chromosome dataset (chr1) across the Mac-2, Mac-12, Linux64-512, and Win64-2 machines, and on the full 1000 Genomes phase 1 chromosome 1 SNP-only subset (chr1snp) on Linux32-8 and Win64-2.

More striking improvements appear in compute-heavy tasks. Table 2 (IBS matrices and clustering) shows order-of-magnitude reductions in IBS matrix calculation time between PLINK 1.07 and PLINK 1.90 for synth1p, synth2p, and chr1snp on Linux64-512, consistent with the authors’ claim of 1–4 orders of magnitude speedups for several operations.

Table 3 (GRM calculation) compares PLINK 1.9 to GCTA. On synth1p on Linux64-512, GCTA is faster than PLINK 1.90; on other platforms, however, PLINK 1.9 is much faster than GCTA or sidesteps platform limitations entirely (GCTA cannot run on OS X or Windows in the authors’ comparison), as in the synth1p benchmark on Win64-2.

Table 4 (LD-based pruning) shows that PLINK 1.9 can be dramatically faster, but also that some steps remained single-threaded in PLINK 1.9 as of the paper. Large speedup ratios over PLINK 1.07 are reported for synth1, synth2, and chr1 on Linux64-512.

Table 5 (haplotype block estimation) indicates that PLINK 1.9’s rewrite can turn previously infeasible runs into feasible ones. Many PLINK 1.07 runs are marked “nomem” (ran out of memory). Where PLINK 1.07 completes, PLINK 1.9 is faster, as for synth1 on Mac-12; for chr1 on Win64-2, PLINK 1.07 is “nomem” while PLINK 1.90 finishes (no ratio is applicable due to the failure).

Table 6 (association analysis permutation tests) demonstrates that PLINK 1.9 extends fast permutation strategies (building on PERMORY’s approach) and makes Fisher’s exact tests practical in permutation contexts. Large PLINK 1.07/PLINK 1.90 ratios are reported for case/control permutation on synth1 on both Linux64-512 and Win32-2, for Fisher’s exact permutation tests on synth1 on Linux64-512, and for quantitative-trait association permutation on Linux64-512.

The authors acknowledge limitations that follow from their engineering choices. The most important is that PLINK 1.9 still uses the PLINK 1 binary file format, which cannot represent probabilistic genotype calls, phase, or multiallelic variants. They therefore propose PLINK 2.0 with a new core file format that can represent probabilities, phase, and multiallelic variants efficiently, while also maintaining an equivalent representation to the PLINK 1 format to avoid performance regressions. They also note that PLINK is not designed for analyses where a single coordinate system is inadequate (e.g., structural variation in whole-exome/whole-genome sequencing) and recommend PLINK/SEQ for such tasks.

Practical implications are clear: users can upgrade to PLINK 1.9 to obtain major speedups and better scalability, including the ability to handle datasets too large for RAM. This benefits researchers running GWAS pipelines, population genetics analyses, and permutation-heavy association tests, especially those without access to high-end computing clusters. The paper also signals that future compatibility with modern imputation and sequencing outputs will require PLINK 2.0’s richer data model.

Overall, the paper’s contribution is not a new statistical method per se, but a comprehensive systems-and-algorithms update to a foundational genomics tool, with extensive empirical evidence that careful bit-level and memory-aware engineering can yield order-of-magnitude performance improvements while preserving usability as a drop-in replacement for most existing PLINK workflows.

Cornell Notes

Chang et al. present PLINK 1.9, a second-generation update to the widely used PLINK GWAS/population-genetics toolkit. They redesign core computations using bit-level parallelism, improved exact-test algorithms, multithreading/distributed execution, and memory-efficient data handling, reporting frequent 1–3+ order-of-magnitude speedups and RAM scalability; they also outline PLINK 2.0’s plan for probabilistic, phased, and multiallelic data support.

What problem does the paper target in PLINK’s current capabilities?

PLINK 1.x is too slow and insufficiently scalable for rapidly growing GWAS/population-genetics datasets, and its original file format cannot represent probabilistic genotype calls, phase information, or multiallelic variants from modern imputation/sequencing pipelines.

What is the main contribution of PLINK 1.9 in this paper?

A comprehensive performance, scaling, and usability update that accelerates many operations by 1–4 orders of magnitude via bit-level parallelism, improved popcount, multithreading/cluster support, and memory-efficient computation.

What study design is used to evaluate performance?

Empirical computational benchmarks: timing runs on seven machines across multiple synthetic and real datasets, comparing PLINK 1.9 to PLINK 1.07 and sometimes to tools like GCTA/Haploview/PERMORY.

How does PLINK 1.9 achieve major speedups for genotype-matrix computations?

By replacing per-call loops with bitwise operations on packed genotype data (e.g., XOR plus bit population count), processing many markers simultaneously and masking missing calls.

What algorithmic change improves Hardy–Weinberg equilibrium and Fisher’s exact tests?

An early-termination strategy that reduces expected complexity from O(n) to roughly O(√n) by stopping when remaining likelihood terms are too small to affect the p-value under double-precision arithmetic.

What does the paper report about performance on identity-by-state (IBS) and clustering?

Large speedups: IBS matrix calculation on synth1p and synth2p (Linux64-512) is orders of magnitude faster in PLINK 1.90 than in PLINK 1.07.

How does PLINK 1.9 handle memory constraints for large datasets?

It avoids keeping the full genomic data matrix in RAM for many functions, loading only single markers or small windows; for sample-by-sample matrix computations it uses distributed splitting (e.g., via --parallel) to compute GRM blocks separately.
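PLINK 1.9’s --parallel flag takes a job index and a job count so that pieces of a pairwise matrix can be computed on separate machines. The helper below is a hypothetical sketch of the load-balancing idea (function name and balancing rule are my own, not PLINK’s exact partition): row i of a lower-triangular pairwise matrix holds i pairs, so contiguous row ranges are sized to give each job roughly equal numbers of pairs.

```python
def parallel_row_ranges(n_samples, n_jobs):
    """Split the rows of a lower-triangular pairwise matrix (row i holds
    i pairs) into n_jobs contiguous ranges with roughly equal pair counts."""
    total_pairs = n_samples * (n_samples - 1) // 2
    ranges, start, done = [], 0, 0
    for job in range(1, n_jobs + 1):
        target = total_pairs * job // n_jobs   # pairs finished by this job
        end = start
        while end < n_samples and done < target:
            done += end                        # row `end` contributes `end` pairs
            end += 1
        ranges.append((start, end))
        start = end
    return ranges
```

Each job then streams only the genotype data needed for its row range, which is why the full sample-by-sample matrix never has to fit in one machine’s RAM.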

What is the key limitation that motivates PLINK 2.0?

PLINK 1.9 still relies on the PLINK 1 binary format, which cannot represent probabilistic calls, phase, or multiallelic variants; PLINK 2.0 introduces a new format and extends functions to support these data types.

What practical users should take away from the paper?

Most existing PLINK pipelines can upgrade to PLINK 1.9 for major speed and scalability gains; for analyses requiring probabilistic/phased/multiallelic data, users should anticipate PLINK 2.0’s richer format and translation layer.

Review Questions

  1. Which specific low-level technique (bitwise parallelism) is used to accelerate IBS/related computations, and why does it change runtime so dramatically?

  2. Explain the early-termination idea for exact tests: what mathematical property allows complexity to drop from O(n) to about O(√n)?

  3. From the benchmark tables, identify one operation where PLINK 1.9 turns an infeasible (out-of-memory) run into a feasible one, and cite the reported timing evidence.

  4. What design choice in PLINK 1.9 enables RAM scalability, and how does distributed splitting (e.g., --parallel) help with sample-by-sample matrices like GRMs?

  5. Why is PLINK 2.0 needed even though PLINK 1.9 improves speed substantially? What data types remain unsupported in PLINK 1.9?

Key Points

  1. The paper’s core goal is to make PLINK scalable for “larger and richer” genetic datasets by improving both performance and compatibility with modern data characteristics.

  2. PLINK 1.9 accelerates many operations by 1–4 orders of magnitude using bit-level parallelism (XOR + popcount on packed genotype data) and optimized inner-loop logic.

  3. Exact-test computations are sped up via early termination: SNP-HWE/Fisher’s exact tests avoid evaluating terms that cannot affect p-values in double precision, reducing expected complexity from O(n) to about O(√n).

  4. Memory scalability is achieved by avoiding loading the full genotype matrix for many tasks and by splitting large matrix computations across parallel runs (e.g., GRM block computation with --parallel).

  5. Benchmark evidence shows dramatic runtime reductions for IBS/IBD-related tasks, LD pruning, GRM computation, haplotype block estimation, and permutation-based association tests.

  6. PLINK 1.9 remains limited by the PLINK 1 binary format (no probabilities, phase, or multiallelic variants), motivating PLINK 2.0’s new multi-representation file format and translation layer.

  7. The authors explicitly position PLINK as not intended for structural-variation analyses requiring different coordinate-system handling, recommending PLINK/SEQ instead.

Highlights

“Replacement of these loops with bit-parallel logic is, by itself, enough to speed up numerous operations by more than one order of magnitude.”
For IBS matrix calculation on synth1p and synth2p (Linux64-512), PLINK 1.90 is orders of magnitude faster than PLINK 1.07.
The authors claim early termination reduces SNP-HWE/Fisher exact computations from O(n) to O(√n) “with no loss of precision.”
PLINK 2.0 will introduce “a new data format capable of efficiently representing probabilities, phase, and multiallelic variants,” plus function extensions to use that information.

Topics

  • Genome-wide association studies (GWAS)
  • Population genetics
  • Computational genomics
  • Exact statistical testing (HWE, Fisher’s exact test)
  • Linkage disequilibrium (LD) and haplotype block estimation
  • Identity-by-descent (IBD) and identity-by-state (IBS)
  • Genomic relationship matrices (GRM)
  • Permutation testing and multiple testing
  • High-performance computing (HPC) for bioinformatics
  • Data formats for genotype/variant representations

Mentioned

  • PLINK
  • GigaScience
  • GCTA
  • BEAGLE
  • IMPUTE2
  • GATK
  • VCFtools
  • BCFtools
  • Haploview
  • PERMORY
  • BOOST
  • pigz
  • SSE2
  • OpenMP/pthreads-style threading (implementation via C threading primitives)
  • HAPGEN2
  • simuRare
  • TopCoder (GWAS Speedup contest)
  • PLINK/SEQ
  • Christopher Chang
  • Carson C. Chow
  • Laurent CAM Tellier
  • Shashaank Vattikuti
  • Shaun Purcell
  • James J. Lee
  • Vincent (referenced in the popcount/IBS illustration; likely an internal example name)
  • Dalke, Harley, Lauradoux, Mathisen, Walisch (popcount-related prior work)
  • Wigginton, Cutler, Abecasis (SNP-HWE exact test)
  • Mehta and Patel (Algorithm 643 / FEXACT)
  • Requena and Martín Ciudad (network algorithm improvements)
  • Gabriel et al. (haplotype block model)
  • Wall and Pritchard (D′ performance assessment)
  • Friedman, Hastie, Höfling, Tibshirani (coordinate-descent LASSO reference)
  • Vattikuti et al. (compressed sensing to GWAS)
  • Loh, Baym, Berger (compressive genomics)
  • Browning and Browning (fastIBD reference)
  • Su, Marchini, Donnelly (Hapgen2)
  • Xu, Wu, Song, Zhang (simuRare)
  • Defays (complete-link method reference)
  • Ewens, Li, Spielman (QTDT robustness discussion)
  • GWAS - Genome-wide association studies
  • IBD - Identity-by-descent
  • IBS - Identity-by-state
  • GRM - Genomic relationship matrix
  • HWE - Hardy–Weinberg equilibrium
  • FEXACT/Fisher - Fisher’s exact test
  • LD - Linkage disequilibrium
  • MAF - Minor allele frequency
  • QTDT - Quantitative transmission disequilibrium test
  • QFAM - Quantitative family-based permutation procedure
  • SNP - Single-nucleotide polymorphism
  • VCF - Variant Call Format
  • BCF - Binary Call Format
  • SSE2 - Streaming SIMD Extensions 2
  • IBD report - PLINK’s --Z-genome output