Calculating sample size estimates for RNA sequencing data

Steven N Hart; Terry M Therneau; Yuji Zhang; Gregory A Poland; Jean-Pierre Kocher

doi:10.1089/cmb.2012.0283

J Comput Biol. 2013 Dec;20(12):970-8. doi: 10.1089/cmb.2012.0283. Epub 2013 Aug 20.

Authors

Steven N Hart¹, Terry M Therneau, Yuji Zhang, Gregory A Poland, Jean-Pierre Kocher

Affiliation

¹ Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota.

Abstract

Background: Given its high technical reproducibility and orders-of-magnitude greater resolution than microarrays, next-generation sequencing of mRNA (RNA-Seq) is quickly becoming the de facto standard for measuring gene expression levels in biological experiments. Two important questions must be considered when designing an experiment: 1) how deep does one need to sequence, and 2) how many biological replicates are necessary to observe a significant change in expression?

Results: Based on the gene expression distributions from 127 RNA-Seq experiments, we find evidence that 91% ± 4% of all annotated genes are sequenced at a frequency of 0.1 times per million bases mapped, regardless of sample source. Combining this observation with other parameters, such as biological variation and technical variation that we estimate empirically from our large datasets, we developed a model to estimate the statistical power needed to identify differentially expressed genes from RNA-Seq experiments.

Conclusions: Our results provide a needed reference for ensuring that RNA-Seq gene expression studies are conducted with the optimal sample size, power, and sequencing depth. We also make available both R code and an Excel worksheet for investigators to calculate these values for their own experiments.
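
Illustrative sketch (not part of the published abstract): the abstract does not state the form of the model, so the Python below is only a rough example of how sequencing depth, biological variation, and effect size can trade off in a per-group sample size estimate. It assumes a normal approximation on the log scale in which the per-gene variance is the sum of a counting term (1/depth) and a biological term (cv squared). The function name, parameter names, and default values are hypothetical and are not taken from the authors' R code or Excel worksheet.

```python
import math
from statistics import NormalDist


def samples_per_group(depth, cv, effect, alpha=0.05, power=0.9):
    """Approximate samples needed per group for one gene (illustrative only).

    depth  : average read count for the gene (assumed > 0)
    cv     : biological coefficient of variation within a group
    effect : fold change to detect (assumed > 0 and != 1)
    alpha  : two-sided significance level
    power  : desired statistical power
    """
    # Standard normal quantiles for the test size and the target power.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)

    # Assumed variance on the log scale: counting noise plus biological noise.
    variance = 1.0 / depth + cv ** 2

    # Two-group comparison of log expression; effect enters as ln(fold change).
    return 2 * (z_alpha + z_beta) ** 2 * variance / math.log(effect) ** 2


if __name__ == "__main__":
    # Round up, since a fractional sample is not possible.
    n = samples_per_group(depth=20, cv=0.4, effect=2.0)
    print(math.ceil(n))
```

Under these assumptions, increasing depth only shrinks the 1/depth term, so once that term is small relative to cv squared, additional replicates reduce the required effect size more than additional reads do.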

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Animals
Gene Expression Profiling / methods*
High-Throughput Nucleotide Sequencing / methods*
Humans
Models, Biological*
RNA, Messenger / genetics*
Sample Size

Substances

RNA, Messenger

Grants and funding

U01 AI089859/AI/NIAID NIH HHS/United States