Combining Batches of Microarray Data. ComBat is a widely used software tool for reducing batch effects when combining microarray data from different labs, experiments, hybridization batches, or technology platforms. It uses an empirical Bayes linear modeling approach to robustly account for technical variability across multiple high-throughput studies. ComBat is available for download through the sva Bioconductor package and on GitHub.
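The empirical Bayes idea can be illustrated with a small, self-contained Python sketch. This is not the ComBat implementation in sva: it adjusts batch location only (no scale adjustment, no covariates), uses a simple normal-normal shrinkage of each gene's batch offset toward the average offset of that batch across genes, and the function name `shrunken_batch_adjust` is hypothetical.

```python
from collections import defaultdict
from statistics import mean, pvariance


def shrunken_batch_adjust(expr, batches):
    """Remove shrunken per-batch location shifts from a genes x samples matrix.

    expr    -- list of rows, one per gene, each a list of floats (one per sample)
    batches -- list of batch labels, one per sample
    Assumes there are more samples than batches.
    """
    # Column indices belonging to each batch.
    cols = defaultdict(list)
    for j, label in enumerate(batches):
        cols[label].append(j)
    n_samples = len(batches)

    # Step 1: per-gene batch offsets and pooled within-batch variance.
    gamma_hat = []  # gamma_hat[g][batch] = batch mean minus gene mean
    sigma2 = []     # sigma2[g] = pooled residual variance for gene g
    for row in expr:
        grand = mean(row)
        offsets = {b: mean(row[j] for j in idx) - grand for b, idx in cols.items()}
        ss = sum((row[j] - grand - offsets[b]) ** 2
                 for b, idx in cols.items() for j in idx)
        gamma_hat.append(offsets)
        sigma2.append(ss / (n_samples - len(cols)))

    # Step 2: prior mean and variance of the offsets, pooled across genes.
    prior_mean = {b: mean(g[b] for g in gamma_hat) for b in cols}
    prior_var = {b: pvariance([g[b] for g in gamma_hat]) for b in cols}

    # Step 3: shrink each offset toward the prior mean, then subtract it.
    adjusted = []
    for row, offsets, s2 in zip(expr, gamma_hat, sigma2):
        new_row = list(row)
        for b, idx in cols.items():
            n_b = len(idx)
            denom = n_b * prior_var[b] + s2
            if denom == 0:
                shrunk = offsets[b]  # no information to shrink with
            else:
                shrunk = (n_b * prior_var[b] * offsets[b] + s2 * prior_mean[b]) / denom
            for j in idx:
                new_row[j] = row[j] - shrunk
        adjusted.append(new_row)
    return adjusted
```

Calling `shrunken_batch_adjust(expr, ["a", "a", "a", "b", "b", "b"])` on a matrix whose second batch is shifted upward should return values whose batch means nearly coincide for each gene. Pooling information across genes in this way is what stabilizes the estimates when batches are small.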

ComBat publications:

  1. Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007 Jan;8(1):118-27. PubMed PMID: 16632515.
  2. Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, Geman D, Baggerly K, Irizarry RA. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet. 2010 Oct;11(10):733-9. PubMed PMID: 20838408; PubMed Central PMCID: PMC3880143.
  3. Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics. 2012 Mar 15;28(6):882-3. PubMed PMID: 22257669; PubMed Central PMCID: PMC3307112.


PathoScope 2.0 is a complete bioinformatics framework for rapidly and accurately quantifying the proportions of reads from individual microbial strains present in metagenomic sequencing data from environmental or clinical samples. The pipeline performs all necessary computational analysis steps, including reference genome library extraction and indexing, read quality control and alignment, strain identification, and summarization and annotation of results. PathoScope 2.0 is a highly sensitive and efficient approach for metagenomic analysis that outperforms alternative approaches in scope, speed, and accuracy. The PathoScope 2.0 pipeline software is freely available for download.

PathoScope is now on GitHub!
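To give a feel for how read proportions can be estimated when some reads align equally well to several genomes, here is a minimal expectation-maximization sketch in Python. It is an illustration under simplifying assumptions, not the PathoScope code: the function name `estimate_proportions` and the input format (one dictionary of genome-to-alignment-likelihood per read) are hypothetical.

```python
def estimate_proportions(read_likelihoods, n_iter=100):
    """Estimate genome proportions from per-read alignment likelihoods.

    read_likelihoods -- list of dicts, one per read, mapping genome name to
                        the likelihood of that read's alignment to the genome
    """
    genomes = sorted({g for read in read_likelihoods for g in read})
    # Start from equal proportions.
    pi = {g: 1.0 / len(genomes) for g in genomes}

    for _ in range(n_iter):
        totals = dict.fromkeys(genomes, 0.0)
        # E-step: split each read across genomes in proportion to
        # (current genome proportion) x (alignment likelihood).
        for read in read_likelihoods:
            weights = {g: pi[g] * lik for g, lik in read.items()}
            norm = sum(weights.values())
            if norm == 0:
                continue  # read carries no usable information
            for g, w in weights.items():
                totals[g] += w / norm
        # M-step: new proportions are the normalized fractional read counts.
        assigned = sum(totals.values())
        if assigned == 0:
            break
        pi = {g: totals[g] / assigned for g in genomes}
    return pi


# Three reads unique to strain A, one unique to B, two ambiguous reads.
reads = [{"A": 1.0}] * 3 + [{"B": 1.0}] + [{"A": 0.5, "B": 0.5}] * 2
proportions = estimate_proportions(reads)
```

In this toy input the ambiguous reads are pulled toward strain A because the uniquely mapped reads already favor it, so the estimate for A settles at 0.75 rather than the 0.67 that an even split of ambiguous reads would give.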

PathoScope publications:

  1. Francis OE, Bendall M, Manimaran S, Hong C, Clement NL, Castro-Nallar E, Snell Q, Schaalje GB, Clement MJ, Crandall KA, Johnson WE. Pathoscope: species identification and strain attribution with unassembled sequencing data. Genome Res. 2013 Oct;23(10):1721-9. PubMed PMID: 23843222; PubMed Central PMCID: PMC3787268.
  2. Hong C, Manimaran S, Shen Y, Perez-Rogers JF, Byrd AL, Castro-Nallar E, Crandall KA, Johnson WE. PathoScope 2.0: a complete computational framework for strain identification in environmental or clinical sequencing samples. Microbiome. 2014;2:33. PubMed PMID: 25225611; PubMed Central PMCID: PMC4164323.
  3. Byrd AL, Perez-Rogers JF, Manimaran S, Castro-Nallar E, Toma I, McCaffrey T, Siegel M, Benson G, Crandall KA, Johnson WE. Clinical PathoScope: rapid alignment and filtration for accurate pathogen identification in clinical samples using unassembled sequencing data. BMC Bioinformatics. 2014 Aug 4;15:262. PubMed PMID: 25091138; PubMed Central PMCID: PMC4131054.
  4. Hong C, Manimaran S, Johnson WE. PathoQC: Computationally efficient read pre-processing and quality control for high throughput sequencing datasets. Cancer Informatics (to appear).


Adaptive Signature Selection and Identification in GeNome-wide profiling data: ASSIGN uses a Bayesian factor regression model to identify genomic biomarkers for applications in pathway profiling, drug responsiveness, environmental exposure, and infectious disease diagnosis. ASSIGN can estimate background signal and adapt biomarker signatures to different biological contexts. ASSIGN enables robust and context-specific pathway analyses by efficiently capturing pathway activity in heterogeneous sets of samples and across profiling technologies. The ASSIGN framework is based on a flexible Bayesian factor analysis approach that allows for simultaneous profiling of multiple correlated pathways and for the adaptation of pathway signatures to specific diseases. Software for our approach is available for download through the ASSIGN Bioconductor package and on GitHub.

ASSIGN publication:

  1. Shen Y, Rahman M, Piccolo SR, Gusenleitner D, El-Chaar NN, et al. ASSIGN: context-specific genomic profiling of multiple heterogeneous biological pathways. Bioinformatics. 2015 Jan 22; PubMed PMID: 25617415.


Single Channel Array Normalization (SCAN) is a microarray normalization method to facilitate personalized-medicine workflows. Rather than process microarray samples as groups, which can introduce biases and present logistical challenges (for example, if groups of samples had to be renormalized repeatedly in personalized-medicine workflows), SCAN normalizes each sample individually by modeling and removing probe- and array-specific background noise using only data from within each array. The Universal Probability Code (UPC) method is an extension of SCAN that produces “barcode” values that estimate the probability a given gene is active in a specific sample. This method can be applied not only to one-color microarrays but also to two-color microarrays and RNA-sequencing data.

SCAN-UPC publications:

  1. Piccolo SR, Sun Y, Campbell JD, Lenburg ME, Bild AH, Johnson WE. A single-sample microarray normalization method to facilitate personalized-medicine workflows. Genomics. 2012 Dec;100(6):337-44. PubMed PMID: 22959562; PubMed Central PMCID: PMC3508193.
  2. Piccolo SR, Withers MR, Francis OE, Bild AH, Johnson WE. Multiplatform single-sample estimates of transcriptional activation. Proc Natl Acad Sci U S A. 2013 Oct 29;110(44):17778-83. PubMed PMID: 24128763; PubMed Central PMCID: PMC3816418.


GNUMAP is a software suite for aligning next-generation sequencing data from DNA-seq, BS-seq, and RNA-seq (including small RNAs and RNA editing) experiments. It uses a highly accurate probabilistic alignment approach that incorporates base uncertainty into the alignment algorithm. The most recent version of GNUMAP, used in our BMC Bioinformatics publication (BS-Seq; Hong et al. 2013), is available for download.
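One way to see what incorporating base uncertainty means is the following Python sketch, which converts Phred quality scores into per-base probabilities and uses them to weigh candidate mapping locations. It is a deliberately simplified, ungapped illustration and not the GNUMAP algorithm itself; the function names and the assumption that a miscalled base is equally likely to be any of the other three nucleotides are ours.

```python
def base_probabilities(base, phred):
    """Probability of each true nucleotide given the called base and its Phred score."""
    p_err = 10 ** (-phred / 10)  # standard Phred definition of error probability
    return {b: (1 - p_err) if b == base else p_err / 3 for b in "ACGT"}


def window_likelihood(read, quals, window):
    """Likelihood that the read came from this reference window (no gaps)."""
    lik = 1.0
    for base, q, ref in zip(read, quals, window):
        lik *= base_probabilities(base, q)[ref]
    return lik


def location_posteriors(read, quals, reference):
    """Posterior probability of each ungapped start offset, assuming a flat prior."""
    n = len(read)
    liks = [window_likelihood(read, quals, reference[i:i + n])
            for i in range(len(reference) - n + 1)]
    total = sum(liks)
    return [lik / total if total > 0 else 0.0 for lik in liks]


reference = "TTACGTTTACGATT"
# Confident base calls: almost all probability goes to the exact match at offset 2.
confident = location_posteriors("ACGT", [40, 40, 40, 40], reference)
# A low-quality last base: the near match at offset 8 receives substantial weight.
uncertain = location_posteriors("ACGT", [40, 40, 40, 2], reference)
```

With confident calls the read maps almost entirely to a single location, whereas a low-quality final base spreads the probability across both candidate locations instead of forcing a single hard assignment.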

GNUMAP publications:

  1. Clement NL, Snell Q, Clement MJ, Hollenhorst PC, Purwar J, Graves BJ, Cairns BR, Johnson WE. The GNUMAP algorithm: unbiased probabilistic mapping of oligonucleotides from next-generation sequencing. Bioinformatics. 2010 Jan 1;26(1):38-45. PubMed PMID: 19861355.
  2. Clement NL, Clement MJ, Snell Q, Johnson WE. Parallel Mapping Approaches for GNUMAP. IPDPS. 2011; PubMed PMID: 23396612; PubMed Central PMCID: PMC3565456.
  3. Hong C, Clement NL, Clement S, Hammoud SS, Carrell DT, Cairns BR, Snell Q, Clement MJ, Johnson WE. Probabilistic alignment leads to improved accuracy and read coverage for bisulfite sequencing data. BMC Bioinformatics. 2013 Nov 21;14:337. PubMed PMID: 24261665; PubMed Central PMCID: PMC3924334.


Model-based Analysis of Tiling Arrays for ChIP-chip. MAT was developed for the analysis of data from Affymetrix tiling microarrays. It removes bias in microarray data attributable to probe and sample effects on each array individually. It also facilitates genomic profiling and identification of significantly enriched genome regions. The MA2C software takes a similar approach but is designed to analyze data from two-color tiling arrays.

MAT/MA2C publications:

  1. Johnson WE, Li W, Meyer CA, Gottardo R, Carroll JS, Brown M, Liu XS. Model-based analysis of tiling-arrays for ChIP-chip. Proc Natl Acad Sci U S A. 2006 Aug 15;103(33):12457-62. PubMed PMID: 16895995; PubMed Central PMCID: PMC1567901.
  2. Song JS, Johnson WE, Zhu X, Zhang X, Li W, Manrai AK, Liu JS, Chen R, Liu XS. Model-based analysis of two-color arrays (MA2C). Genome Biol. 2007;8(8):R178. PubMed PMID: 17727723; PubMed Central PMCID: PMC2375008.