A Hybrid Gene Selection Approach to Create the S1500+ Targeted Gene Sets for Use in High-Throughput Transcriptomics

Deepak Mav, Ruchir R. Shah, Brian E. Howard, Scott S. Auerbach, Pierre R. Bushel, Jennifer B. Collins, David L. Gerhold, Richard S. Judson, Agnes L. Karmaus, Elizabeth A. Maull, Donna L. Mendrick, B. Alex Merrick, Nisha S. Sipes, Daniel Svoboda, Richard S. Paules.
PLOS One. 2018 DOI: https://doi.org/10.1371/journal.pone.0191105 PMID: 29462216

Publication

Abstract

Changes in gene expression can help reveal the mechanisms of disease processes and the mode of action for toxicities and adverse effects on cellular responses induced by exposures to chemicals, drugs and environmental agents. The U.S. Tox21 Federal collaboration, which currently quantifies the biological effects of nearly 10,000 chemicals via quantitative high-throughput screening (qHTS) in in vitro model systems, is now making an effort to incorporate gene expression profiling into the existing battery of assays. Whole transcriptome analyses performed on large numbers of samples using microarrays or RNA-Seq are currently cost-prohibitive. Accordingly, the Tox21 Program is pursuing a high-throughput transcriptomics (HTT) method that focuses on the targeted detection of gene expression for a carefully selected subset of the transcriptome that potentially can reduce the cost by a factor of 10, allowing for the analysis of larger numbers of samples. To identify the optimal transcriptome subset, genes were sought that are (1) representative of the highly diverse biological space, (2) capable of serving as a proxy for expression changes in unmeasured genes, and (3) sufficient to provide coverage of well-described biological pathways. A hybrid method for gene selection is presented herein that combines data-driven and knowledge-driven concepts into one cohesive method. Our approach is modular, applicable to any species, and facilitates a robust, quantitative evaluation of performance. In particular, we were able to perform gene selection such that the resulting set of “sentinel genes” adequately represents all known canonical pathways from the Molecular Signatures Database (MSigDB v4.0) and can be used to infer expression changes for the remainder of the transcriptome.
The resulting computational model allowed us to choose a purely data-driven subset of 1500 sentinel genes, referred to as the S1500 set, which was then augmented using a knowledge-driven selection of additional genes to create the final S1500+ gene set. Our results indicate that the sentinel genes selected can be used to accurately predict pathway perturbations and biological relationships for samples under study.

Figures

Figure 1. S1500+ gene selection workflow.

To compile the S1500+ gene set, a combination of modular data-driven algorithms as well as manual crowd-sourced knowledge-based gene nominations was used to optimize for pathway coverage and the ability to extrapolate to the whole transcriptome.

Figure 1 (67 KB)

Figure 2. Dimension reduction plot.

The x-axis shows the percentage of the total principal components (eigengenes) and the y-axis shows the percentage of variability captured. The red line represents the expected relationship given statistically independent gene expression, whereas the blue curve shows the observed relationship.
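The two curves described in this caption can be illustrated with a minimal sketch. It assumes the principal component variances (eigenvalues) are already available, and it assumes that "statistically independent" corresponds to every component carrying an equal share of the variance, which gives a diagonal reference line. The function name and the eigenvalues below are hypothetical and are not taken from the paper.

```python
from itertools import accumulate


def cumulative_variance_curve(eigenvalues):
    """Return (percent of components, percent of variability captured) points."""
    ordered = sorted(eigenvalues, reverse=True)  # largest component first
    total = sum(ordered)
    n = len(ordered)
    return [
        (100.0 * (i + 1) / n, 100.0 * running / total)
        for i, running in enumerate(accumulate(ordered))
    ]


# Hypothetical eigenvalues: a few components dominate (correlated genes).
observed_curve = cumulative_variance_curve([6.0, 2.0, 1.0, 0.5, 0.5])

# Assumed reference: equal eigenvalues, i.e. no shared structure between genes.
independent_curve = cumulative_variance_curve([1.0] * 5)
```

The further the observed curve rises above the diagonal, the more of the total variability a small number of eigengenes captures.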

Figure 2 (179 KB)

Figure 3. Clustered experiments.

K-means clustering (k = 10) was used to cluster experiment data using the first 20 principal components. Fold change values shown are for the top 20 eigengenes. The columns denote principal component indices and percentage of captured variability in parentheses.

Figure 3 (612 KB)

Figure 4. Pathway Performance Analysis.

Pathway performance analysis for Follicular Lymphoma vs. Tonsillectomy case study comparison (concordance Venn diagrams).
All significantly enriched pathways, identified using an enrichment score > 0.5 and a Kolmogorov-Smirnov p-value < 0.001, were included in this analysis. Recall is the percentage of the observed up-/down-regulated genes (Obs-Up and Obs-Down) that were also correctly predicted as up-/down-regulated (Pred-Up/Down). Precision is the percentage of the predicted up- and down-regulated genes that were observed as up- and down-regulated.
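The recall and precision definitions above can be written as set operations. The following is a minimal sketch for one direction (up-regulated); the identifiers in the example sets are hypothetical placeholders, not results from the case study.

```python
def precision_recall(observed, predicted):
    """Return (precision, recall) as percentages for two sets of identifiers."""
    hits = len(observed & predicted)  # called in both the observed and predicted sets
    precision = 100.0 * hits / len(predicted) if predicted else float("nan")
    recall = 100.0 * hits / len(observed) if observed else float("nan")
    return precision, recall


# Hypothetical Obs-Up and Pred-Up sets.
obs_up = {"G1", "G2", "G3", "G4"}
pred_up = {"G2", "G3", "G5"}

precision, recall = precision_recall(obs_up, pred_up)
```

The same function applies unchanged to the Obs-Down and Pred-Down sets.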

Figure 4 (600 KB)

Tables

Table 1. Pathway Coverage.

Table 1 (41 KB)

Table 2. Summary of gene and pathway level extrapolation performance on cross-validated training set.

Table 2 (61 KB)

Table 3. Summary of gene and pathway level extrapolation performance using independent test set.

Table 3 (42 KB)

Toxicogenomics

Microarray Data

GEO Series: GSE66384

Supplemental Materials

S1 File. Sample Annotation Guideline and Additional Details.

This file provides a detailed description of:
1. the data curation process utilized
2. the steps involved in calculating the co-expression importance score
3. extrapolation using principal component regression.
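For readers unfamiliar with principal component regression, the following is a generic textbook sketch and not the authors' procedure, which is documented in S1 File. It assumes rows are differential expression profiles, `X` holds values for measured (sentinel) genes and `Y` holds values for the genes to be extrapolated. All names, dimensions and the synthetic data are hypothetical.

```python
import numpy as np


def fit_pcr(X, Y, n_components):
    """Fit principal component regression of targets Y on predictors X."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    # Rows of Vt are the principal axes of the centered predictors.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T          # (n_predictors, k)
    scores = Xc @ V                  # (n_profiles, k)
    # Score columns are orthogonal, so least squares reduces to a rescaling.
    coef = (scores.T @ Yc) / (s[:n_components] ** 2)[:, None]  # (k, n_targets)
    return x_mean, y_mean, V, coef


def predict_pcr(model, X_new):
    """Extrapolate target values for new profiles from their predictor values."""
    x_mean, y_mean, V, coef = model
    return (X_new - x_mean) @ V @ coef + y_mean


# Synthetic data: 3 latent factors drive 8 "measured" and 5 "unmeasured" genes.
rng = np.random.default_rng(0)
latent = rng.normal(size=(60, 3))
X = latent @ rng.normal(size=(3, 8))
Y = latent @ rng.normal(size=(3, 5))

model = fit_pcr(X[:50], Y[:50], n_components=3)
pred = predict_pcr(model, X[50:])
max_error = float(np.abs(pred - Y[50:]).max())
```

Because the synthetic data are noise-free and exactly rank 3, three components recover the held-out targets almost exactly; real expression data would require choosing the number of components by validation.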

S1 File (52 KB)

S2 File. Gene Descriptions.

This is a compressed tab delimited file consisting of the following three columns and one row for each of the 21,064 genes with a unique Affymetrix probe set annotation, constructed from the GPL570.txt file (Affymetrix Probe Annotation file downloaded from the NCBI GEO site on 01/30/2014).
“Gene_Name”–Gene symbols, separated by the “///” character, that map to a unique collection of Affymetrix probe sets.
“Probe Set ID”–Probe set identifiers that are mapped to the gene symbols, separated by the “;” character.
“Sentinel_Selection_Status”–Binary integer denoting whether the gene is in the “S1500+” sentinel gene set.
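A minimal sketch of reading this layout is shown below. It assumes the header row uses exactly the three column names listed above and that a status of 1 marks a sentinel gene; neither assumption has been checked against the file itself. The rows are hypothetical placeholders, not real S2 content.

```python
import csv
import io

# Hypothetical rows in the documented three-column layout.
SAMPLE = (
    "Gene_Name\tProbe Set ID\tSentinel_Selection_Status\n"
    "GENE_A\tprobe_1;probe_2\t1\n"
    "GENE_B///GENE_C\tprobe_3\t0\n"
)


def parse_gene_table(handle):
    """Parse rows into symbols, probe sets and a sentinel flag."""
    records = []
    for row in csv.DictReader(handle, delimiter="\t"):
        records.append({
            # Strip whitespace in case separators are padded with spaces.
            "symbols": [s.strip() for s in row["Gene_Name"].split("///")],
            "probe_sets": [p.strip() for p in row["Probe Set ID"].split(";")],
            # Assumption: "1" means the gene belongs to the S1500+ set.
            "is_sentinel": row["Sentinel_Selection_Status"].strip() == "1",
        })
    return records


records = parse_gene_table(io.StringIO(SAMPLE))
sentinels = [r["symbols"] for r in records if r["is_sentinel"]]
```

For the downloaded file, the same function could be given a handle from `gzip.open(path, "rt")` in place of the in-memory sample.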

S2 File (237 KB)

S3 File. Training Data Description.

This is a compressed tab delimited file which provides a listing of GEO accession identifiers for samples that are utilized to produce each of 31,577 differential expression profiles from the training dataset. This file contains the following columns:
“Series”–GEO series identifier
“ID”–Unique identifier for differential expression profile (A vs. B comparison)
“Samples (A)”–GEO accession identifiers of samples that were annotated to condition “A” during the curation process (separated by “;” character)
“Samples (B)”–GEO accession identifiers of samples that were annotated to condition “B” during the curation process (separated by “;” character).

S3 File (321 KB)

S4 File. Test Data Description.

This is a compressed tab delimited file which provides a listing of GEO accession identifiers for samples that are utilized to produce each of 4,089 differential expression profiles from the test dataset. This file contains the following columns:
“Series”–GEO series identifier
“ID”–Unique identifier for differential expression profile (A vs. B comparison)
“Samples (A)”–GEO accession identifiers of samples that were annotated to condition “A” during the curation process (separated by “;” character)
“Samples (B)”–GEO accession identifiers of samples that were annotated to condition “B” during the curation process (separated by “;” character).

S4 File (41 KB)

S5 File. Pathway Description.

This is a compressed tab delimited file containing descriptors for each of 1,320 canonical pathways from the Broad Institute’s Molecular Signatures Database (MSigDB) version 4.0. Note that the following 3 columns were added for convenience:
“Symbol.Aliases”–Number of pathway genes with aliases
“Entrez.Exp”–Number of Entrez gene identifiers mapped to pathway
“Symbols.Exp”–Number of unique gene symbols mapped to pathway.

S5 File (186 KB)

S6 File. Gene Level Extrapolated Signal Matrix for Test Data.

This is a compressed tab delimited file consisting of an 18,325 x 4,089 numeric matrix denoting extrapolated log2 fold-change signals for each of 18,325 genes and 4,089 differential expression profiles from the test data set. Note that the row names of this matrix match the “Gene_Name” column from the above gene description file (i.e. S2.txt.gz).

S6 File (604 MB)
S6 File Information (21 KB)

S7 File. Pathway Level Enrichment Score Matrix for Test Data.

This is a compressed tab delimited file consisting of a 1,320 x 4,089 numeric matrix denoting pathway enrichment scores for each of 1,320 canonical pathways and 4,089 differential expression profiles from the test data set. Note that the row names of this matrix match the “SYSTEMATIC_NAME” column from the above pathway description file (i.e. S5.txt.gz). Also note that these enrichment scores are computed using true log2 fold-change values for all 21,064 genes.

S7 File (42 MB)

S8 File. Pathway Level Extrapolated Enrichment Score Matrix for Test Data.

This is a compressed tab delimited file consisting of a 1,320 x 4,089 numeric matrix denoting pathway enrichment scores for each of 1,320 canonical pathways and 4,089 differential expression profiles from the test data set. Note that these enrichment scores utilize true observed log2 fold-change values for the 2,739 S1500+ sentinel genes and extrapolated signals for the other 18,325 non-sentinel genes.
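The mixed profile described here (observed values for sentinel genes, extrapolated values for all others) can be sketched as a simple merge. This only illustrates the bookkeeping. The function name, gene labels and values are hypothetical, and the enrichment scoring itself is not shown.

```python
def hybrid_profile(observed, extrapolated, sentinel_genes):
    """Use observed log2 fold changes for sentinels, extrapolated ones otherwise."""
    sentinel_genes = set(sentinel_genes)
    profile = {gene: observed[gene] for gene in sentinel_genes}
    profile.update(
        {gene: value for gene, value in extrapolated.items()
         if gene not in sentinel_genes}
    )
    return profile


# Hypothetical example: one sentinel gene (S1) and two non-sentinel genes.
observed = {"S1": 1.2, "N1": -0.5, "N2": 0.3}   # full measurement, for reference
extrapolated = {"N1": -0.4, "N2": 0.1}          # inferred from sentinel genes

profile = hybrid_profile(observed, extrapolated, ["S1"])
```

In the files described above, the two groups together account for the full gene list: 2,739 sentinel genes plus 18,325 non-sentinel genes gives 21,064.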

S8 File (42 MB)

S9 File. Supplementary Tables and Figures.

This file provides the following supplementary tables and figures:
Table A: GSE66384: Top 20 Pathways for Follicular Lymphoma vs. Tonsillectomy Comparison via S1500 genes and Random 1500 genes based transcriptomes
Table B: GSE66384: Top 20 Pathways for Follicular Lymphoma vs. Tonsillectomy Comparison via S1500+ genes and Random 2739 genes based transcriptomes
Figure A: Logit-transformed CIS and DIS value density plots
(a) Displays empirical density plot of logit-transformed Diversity Importance Score (DIS);
(b) Displays empirical density plot of logit-transformed Co-expression Importance Score (CIS);
(c) Displays scatter plots of logit-transformed CIS/DIS values (black points denote the top 1500 genes according to overall importance score and gray points denote the remaining genes).
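The logit transform referred to in these panels is the standard log-odds mapping. The sketch below assumes the importance scores lie strictly between 0 and 1 and that the natural logarithm is used; the input values are hypothetical.

```python
import math


def logit(p):
    """Map a score in the open interval (0, 1) to the real line (log-odds)."""
    if not 0.0 < p < 1.0:
        raise ValueError("logit is only defined for values strictly between 0 and 1")
    return math.log(p / (1.0 - p))


# Hypothetical importance scores.
transformed = [logit(p) for p in (0.2, 0.5, 0.8)]
```

The transform spreads out scores that are bunched near 0 or 1, which can make density and scatter plots of bounded scores easier to read.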

S9 File (140 KB)