Integrative analysis of molecular profiling data
ICAR 2011 Madison, Wisconsin, Thursday, June 23: 7:30 - 9:00 pm
Workshop Organizers: Pierre Hilson (VIB-Ghent University, Belgium) and Katja Baerenfaller (ETH Zurich, Switzerland)
The workshop will describe efforts towards the integration, combined analysis and interpretation of diverse molecular profiling datasets that characterize particular plant systems or processes. In this context, molecular profiles include transcript, protein, metabolite and polymer profiles, as well as enzymatic activities and metabolic fluxes. The aim of the workshop is primarily to highlight the bottlenecks encountered in integration projects together with the practical solutions implemented to solve them. The novel solutions for integrative analyses comprise specific databasing strategies, the definition of standards and data types, statistical approaches, visualization platforms and modeling tools. The workshop will present recent achievements therein, illustrated by original results furthering our understanding of the biological processes under study. It is primarily addressed at researchers performing or planning integrative analyses and will leave room for questions and open discussion.
Program
7:30 pm Pierre Hilson (VIB-Ghent University): Introduction
7:35 pm Sean Walsh (ETH Zurich): A database approach to the integration, analysis and visualization of diverse molecular and phenotypic profiling data
7:45 pm Katja Baerenfaller (ETH Zurich): Integrating proteomics data: pep2pro, MASCP Gator and combined analyses with transcript data, Abstract #21684
7:55 pm Rodrigo Gutierrez (Universidad Catolica de Chile): Discriminative expression signatures in microarray data for functional network inference, Abstract #21761
8:10 pm Jim Beynon (University of Warwick): Linking plant transcriptional response networks triggered by biotic and abiotic stress
8:25 pm Nicholas J. Provart (University of Toronto): Hypothesis Generation in Plant Biology Using Large Data Sets, Abstracts #21750 and #21754
8:45 pm Eva Huala (TAIR): Organizing categories, stumbling blocks and solutions for integration of molecular profiling data
8:55 pm Discussion
WORKSHOP ABSTRACTS
SEAN WALSH
A database approach to the integration, analysis and visualization of diverse molecular and phenotypic profiling data
Sean Walsh, Katja Baerenfaller, Matthias Hirsch-Hoffmann, The Agron-omics Consortium of Scientists, Pierre Hilson and Wilhelm Gruissem
A substantial part of the Agron-omics consortium resource is devoted to profiling the Arabidopsis leaf in a number of environmental conditions at four developmental and two diurnal time points during growth. A diverse range of quantitative molecular and phenotypic data is generated in this and other sub-projects. Integrating and interrogating this information is especially challenging when contributing partners are geographically dispersed. Vital to our analytical pipeline is a database integration that exploits standard and advanced features of the MySQL database engine (e.g. stored routines) and tools. This implementation is used for data capture, validation, documentation, the tracking of provenance, certain mathematical, statistical and structural data transformations, integration with R/Matlab and the storage of visualization routines. In addition, the system provides access-controlled user workspaces and the ability to run high-performance queries across multiple, high-volume datasets. Novel datasets also require the integration of pre-existing knowledge, and consequently a range of molecular annotations and classifications are included. Since the database engine and tools are freely available, the data and code can be simply and rapidly replicated for community dissemination and/or extension. This development provides a useful template for a computational platform that has analytical value during a project and beyond.
KATJA BAERENFALLER
Integrating Proteomics Data: pep2pro, MASCP Gator and Combined Analyses with Transcript Data
Katja Baerenfaller1, Matthias Hirsch-Hoffmann1, Sean Walsh1, Lars Hennig2, Daniel Stekhoven1, Sacha Baginsky3, Wilhelm Gruissem1 & the Agron-omics consortium
1 ETH Zurich, Switzerland, 2 Swedish University of Agricultural Sciences, Sweden 3 Martin-Luther-Universität Halle, Germany
Mass spectrometry-based proteomics has become an important tool for obtaining qualitative and quantitative protein information. Analysis of proteomics data usually requires integration of search results from different experiments and search algorithms. The results are then often combined with other proteome data or compared with additional datasets such as transcript data. To accomplish the integration of proteomics data at these different levels, specialised tools have been built; they are presented here.
pep2pro is a comprehensive proteome analysis database that offers solutions to the various challenges of developing a proteome data analysis database. Its capacity for the integration and analysis of large proteome datasets was demonstrated by creating the organ-specific Arabidopsis thaliana dataset containing 14'522 identified proteins based on more than 2.6 million spectra, which can be accessed at www.pep2pro.ethz.ch. The datasets in this database have been aggregated with other large-scale Arabidopsis proteome datasets in the MASCP Gator aggregation portal (http://gator.masc-proteomics.org/), which has been built as a proteomics community effort led by Joshua Heazlewood. For combined analyses of the Agron-omics quantitative proteomics data on leaf 6 with additional datasets, the final proteomics result data that had been processed in pep2pro have been integrated into the AgroRDB. Querying this database followed by statistical analyses allows effective data analysis and reveals new insights into the regulation of protein and transcript levels.
RODRIGO GUTIÉRREZ
Discriminative expression signatures in microarray data for functional network inference
Tomás Puelma1,2, Alvaro Soto2 and Rodrigo A Gutiérrez1
1. FONDAP Center for Genome Regulation. Millennium Nucleus for Plant Functional Genomics. Department of Molecular Genetics and Microbiology. 2. Computer Sciences Department. Pontificia Universidad Católica de Chile. Santiago, Chile.
Microarray technology is one of the most widely used approaches for profiling gene expression changes. There are thousands of publicly available microarray datasets, which provide information on the expression of thousands of genes under many experimental conditions. This tremendous resource can be used to predict new properties of Arabidopsis genes.
In this work we present a novel machine learning method, designed specifically to integrate large microarray datasets and predict gene functional networks for a biological process of interest. By using the existing knowledge available in Gene Ontology, our method finds local expression signatures that discriminate the biological process of interest. The method then uses these signatures to predict new genes linked to the process in functional networks. In contrast to state-of-the-art classification algorithms such as support vector machines (SVMs) or coexpression networks, our method exposes the data that is useful to make functional predictions for specific processes. Cross-validation and a year-2008 rollback analysis showed that our method performs better than coexpression networks and similarly to SVMs for most biological processes. Hence, the proposed method has the discriminative power of supervised methods like SVMs but, unlike them, provides valuable additional information that helps biologists understand the biological process and guide future experiments.
We applied our method to identify new components of the nitrogen response in A. thaliana. We integrated the output of the method with existing large interaction datasets as a functional network using VirtualPlant (www.virtualplant.org) and proposed novel regulatory networks involved in the nitrate response that were validated experimentally.
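For orientation, the coexpression baseline that this abstract compares against can be sketched as simple guilt-by-association scoring. This is not the signature-based method described above, whose details are not given here; it is a minimal illustration, assuming a toy expression table with hypothetical gene names, in which each unannotated gene is scored by its mean Pearson correlation with genes already annotated to the process of interest.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no coexpression information
    return cov / (sd_x * sd_y)


def rank_candidates(expression, known_genes):
    """Rank unannotated genes by mean correlation with the annotated genes."""
    known = [g for g in known_genes if g in expression]
    if not known:
        raise ValueError("none of the known genes has an expression profile")
    known_set = set(known)
    scores = {}
    for gene, profile in expression.items():
        if gene in known_set:
            continue
        total = sum(pearson(profile, expression[k]) for k in known)
        scores[gene] = total / len(known)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: four conditions, two annotated genes, two candidates.
expression = {
    "known_A": [1.0, 2.0, 3.0, 4.0],
    "known_B": [2.0, 4.0, 6.0, 8.5],
    "cand_1": [1.1, 2.0, 3.2, 3.9],
    "cand_2": [4.0, 3.0, 2.0, 1.0],
}
ranked = rank_candidates(expression, ["known_A", "known_B"])
for gene, score in ranked:
    print(gene, round(score, 3))
```

A global score of this kind uses every condition equally, which is the limitation that local, process-specific expression signatures are meant to address.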
JIM BEYNON
Linking plant transcriptional response networks triggered by biotic and abiotic stress
Jim Beynon and the PRESTA Consortium
School of Life Sciences and Systems Biology, Warwick University, Wellesbourne, Warwick, CV35 9EF, UK.
Global food security is the major challenge for biological sciences in the coming years. As plants cannot avoid environmental stress, they have developed networks of responses to alleviate its impact. These networks are likely to be complex, as they involve detection and response components as well as many feedback mechanisms that dampen their output to avoid extreme consequences for the plant. To complicate the issue, at any one time a plant may be exposed to multiple stresses, both biotic and abiotic, that will alter the output from the stress response networks induced. However, there is likely to be a significant overlap in the gene networks responding to the various stresses. We are taking a Systems Biology approach to begin to elucidate the high-level networks that control the common components of these stress responses.
NICHOLAS J. PROVART
Hypothesis Generation in Plant Biology Using Large Data Sets
Dept. of Cell and Systems Biology, University of Toronto. Toronto, ON. CANADA nicholas.provart@utoronto.ca, http://www.csb.utoronto.ca/faculty/provart-nicholas/
We have developed tools, available as part of the Bio-Array Resource at http://bar.utoronto.ca, for exploring large data sets from plants, to allow deeper insights into biological questions and to help guide lab-based research. An emerging theme in plant biology is that interactions, be they regulatory or protein-protein, create networks. In the former instance, coexpression networks can provide more robust support for inferred biological involvement than simple coexpression analyses alone. Coexpression networks developed using publicly-available gene expression data sets from dormant and germinating seeds have provided high-quality candidates for genes involved in regulating these two important processes (joint work with George Bassel - Division of Plant & Crop Sciences, University of Nottingham; and Hui Lan and Anthony Bonner - Department of Computer Science, University of Toronto). In the latter instance, the complex cellular functions of an organism frequently rely on physical interactions between proteins. A map of all protein-protein interactions, an interactome, is thus an invaluable tool. An interactome for Arabidopsis thaliana predicted from interacting orthologs in 7 organisms will be presented (joint work with Matt Geisler and Jane Geisler-Lee - Southern Illinois University Carbondale). These predictions can aid researchers by extending known complexes and pathways with candidate proteins. Finally, methods for integrating networks of coexpression, protein-protein interaction, and of other high-throughput data, can provide additional levels of support for novel function identification. An algorithm for doing so, called GeneMANIA, will be presented and discussed (joint work with Quaid Morris - CCBR, University of Toronto).
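The idea of predicting an interactome from interacting orthologs can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used for the Arabidopsis interactome: the species names, protein identifiers and ortholog tables are hypothetical, self-interactions are skipped for simplicity, and the only evidence recorded is the set of reference species that support each predicted pair.

```python
from collections import defaultdict
from itertools import product


def predict_interologs(interactions_by_species, orthologs_by_species):
    """Transfer known interactions to a target species through orthologs.

    interactions_by_species: {species: [(protein_a, protein_b), ...]}
    orthologs_by_species: {species: {protein: [target_protein, ...]}}
    Returns {(target_a, target_b): [supporting species, sorted]}.
    """
    support = defaultdict(set)
    for species, pairs in interactions_by_species.items():
        orthologs = orthologs_by_species.get(species, {})
        for a, b in pairs:
            targets_a = orthologs.get(a, ())
            targets_b = orthologs.get(b, ())
            # Every combination of orthologs is a candidate interaction.
            for ta, tb in product(targets_a, targets_b):
                if ta == tb:
                    continue  # simplification: ignore self-interactions
                support[tuple(sorted((ta, tb)))].add(species)
    return {pair: sorted(species) for pair, species in support.items()}


# Hypothetical toy input: two reference species, three target proteins.
interactions = {
    "yeast": [("y1", "y2")],
    "fly": [("f1", "f2"), ("f3", "f4")],
}
orthologs = {
    "yeast": {"y1": ["ath_A"], "y2": ["ath_B"]},
    "fly": {"f1": ["ath_B"], "f2": ["ath_A"], "f3": ["ath_C"]},
}
predicted = predict_interologs(interactions, orthologs)
for pair, species in predicted.items():
    print(pair, species)
```

In this toy input the pair (ath_A, ath_B) is supported by both reference species, while the f3-f4 interaction is not transferred because f4 has no ortholog in the target. Counting supporting species is one simple way to rank predictions; it is an assumption of this sketch rather than a description of the scoring used in the work presented.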
EVA HUALA
Organizing categories, stumbling blocks and solutions for integration of molecular profiling data
One or more organizing categories (e.g. genome position, metabolic pathway) are needed when integrating diverse kinds of molecular profiling datasets, and the choice of this organizing category has a strong effect on how the integrated data can be used. Data integration problems result from ambiguous mapping or failure to map to the chosen organizing categories, lack of standardized categories, lack of standardized data exchange formats, and inadequately described experiments. Efforts to develop standardized categories as ontologies are well underway in several areas.
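The mapping problems listed above can be made concrete with a small sketch. It assumes, purely for illustration, that the organizing category is the gene, that each dataset is a table of measurements keyed by its own identifiers (probes, peptides), and that a hypothetical lookup table maps each identifier to zero, one or several genes. Measurements that map uniquely are integrated; unmapped and ambiguous identifiers are reported rather than silently dropped.

```python
def integrate_by_category(datasets, id_to_category):
    """Group measurements from several datasets under a shared category.

    datasets: {dataset_name: {identifier: value}}
    id_to_category: {identifier: [category, ...]}
    Returns (integrated, problems), where integrated is
    {category: {dataset_name: value}} and problems lists the identifiers
    that could not be placed unambiguously.
    """
    integrated = {}
    problems = {"unmapped": [], "ambiguous": []}
    for name, measurements in datasets.items():
        for identifier, value in measurements.items():
            targets = id_to_category.get(identifier, [])
            if not targets:
                problems["unmapped"].append((name, identifier))
            elif len(targets) > 1:
                problems["ambiguous"].append((name, identifier))
            else:
                integrated.setdefault(targets[0], {})[name] = value
    return integrated, problems


# Hypothetical toy input: identifiers and values are invented.
datasets = {
    "transcript": {"probe_1": 8.2, "probe_2": 5.1, "probe_3": 6.0},
    "protein": {"pep_1": 120.0, "pep_9": 33.0},
}
id_to_gene = {
    "probe_1": ["gene_A"],
    "probe_2": ["gene_B", "gene_C"],  # maps to two genes: ambiguous
    "pep_1": ["gene_A"],
}
integrated, problems = integrate_by_category(datasets, id_to_gene)
print(integrated)
print(problems)
```

Choosing a different organizing category, for example metabolic pathway instead of gene, only changes the lookup table, but it also changes which measurements end up unmapped or ambiguous.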