Shmulevich Lab
  at ISB

Informatics Projects

The informatics group has two research focuses: adaptable informatics architectures and bioinformatics application development. Together, these areas allow us to develop the flexible tools required to support large-scale biomedical research investigations. Some results of this work are available as software packages in the download section.

ADAPTABLE INFORMATICS ARCHITECTURES

The rapid pace of research in the biological sciences requires new thinking in the way that software is developed. Architectures are required which are able to integrate data without the imposition of data models, and management systems are needed that can be rapidly reconfigured to present different facets of the same underlying data. Within the informatics group, we are producing distributed enterprise systems that can be rapidly adapted to meet the current and future demands of systems biology.

A flexible enterprise software architecture
The ISB Informatics Infrastructure, referred to as I3, is a modular, service-oriented research enterprise architecture that is capable of integrating emerging technologies. The I3 enterprise architecture is designed for interoperability and extensibility, and uses facets of both 'top-down' and 'bottom-up' design. In I3, developers can use their own evolving data models; formally defined, domain-specific data models are also provided through a number of common services. The architecture is designed to be flexible, interoperable and lightweight, while enabling the rapid development of new solutions and the integration of new technologies.

There are two sides to the architecture: data access and data analysis. The data access side uses LSIDs to provide an identity system for mapping data items to each other and to their RDF-encoded metadata. Relationship information is navigated through the RDF documents. The data analysis side is based around Web Services, with an ontology describing each Web Service stored in a registry service, so that resources can be reasoned over and discovered at run time. New services and data sources are integrated by writing lightweight wrappers. This is a "model free" architecture: no structured data model is imposed directly on clients (which can be written in a variety of languages). However, a standard ID mechanism coupled with the use of "meta models" and ontologies means that a formal, data-centric integration strategy is available to developers if they wish to use it.
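To illustrate the identity-driven data access described above, here is a minimal Python sketch. It assumes the standard LSID URN layout (urn:lsid:authority:namespace:object, with an optional revision) and uses a plain dictionary as a stand-in for RDF metadata documents. The authority example.org, the predicate names and the function names are hypothetical and are not taken from I3.

```python
from typing import List, NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> Lsid:
    """Split an LSID URN into its parts; raise ValueError if malformed."""
    parts = text.split(":")
    # Expected layout: urn:lsid:<authority>:<namespace>:<object>[:<revision>]
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    if any(not p for p in parts[2:5]):
        raise ValueError(f"empty LSID component in {text!r}")
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return Lsid(parts[2], parts[3], parts[4], revision)


# Hypothetical stand-in for RDF metadata: each LSID maps to
# (predicate, object) pairs, where an object may itself be an LSID.
METADATA = {
    "urn:lsid:example.org:experiment:42": [
        ("hasSample", "urn:lsid:example.org:sample:7"),
        ("assayType", "ChIP-chip"),
    ],
    "urn:lsid:example.org:sample:7": [("organism", "Saccharomyces cerevisiae")],
}


def related_items(lsid: str) -> List[str]:
    """Return the LSIDs that a data item links to through its metadata."""
    parse_lsid(lsid)  # validate the identifier before looking it up
    linked = []
    for _predicate, obj in METADATA.get(lsid, []):
        try:
            parse_lsid(obj)
        except ValueError:
            continue  # a literal value, not a link to another item
        linked.append(obj)
    return linked
```

In this sketch a client needs no shared data model: it only needs to understand identifiers and follow the links it finds in the metadata.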

Conceptually, I3 consists of a data access component, where data and associated metadata are identified using URNs, and a data analysis component, which uses interoperable web services. The system is loosely coupled and identity driven, so that services and data are dynamically discovered.
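Dynamic service discovery of this kind can be sketched as follows. This is an illustrative in-memory registry under simplifying assumptions: a service is described only by the type of data it consumes and produces, whereas the text describes richer ontology-based descriptions. All class, field and service names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ServiceDescription:
    name: str
    consumes: str  # label for the kind of input the service accepts
    produces: str  # label for the kind of output it returns
    invoke: Callable[[object], object]


class Registry:
    """Holds service descriptions so clients can look them up at run time."""

    def __init__(self) -> None:
        self._services: Dict[str, ServiceDescription] = {}

    def register(self, description: ServiceDescription) -> None:
        self._services[description.name] = description

    def discover(self, consumes: str, produces: str) -> List[ServiceDescription]:
        # Match on declared input/output types rather than on service names,
        # so clients stay decoupled from any particular implementation.
        return [
            s for s in self._services.values()
            if s.consumes == consumes and s.produces == produces
        ]


registry = Registry()
registry.register(
    ServiceDescription(
        name="mean-intensity",
        consumes="intensity-list",
        produces="summary-value",
        invoke=lambda values: sum(values) / len(values),
    )
)

# A client asks for a capability, not for a specific service.
matches = registry.discover("intensity-list", "summary-value")
result = matches[0].invoke([1, 2, 3])
```

A new service becomes visible to every client as soon as its wrapper registers a description; no client code has to change.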

To supplement the design we are also developing a number of horizontal services, which provide cross-domain functionality. These include a synonym service for mapping identifiers between different namespaces, and a generic statistical service which controls the life-cycle and type mapping for R scripts.
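The idea behind a synonym service can be sketched in a few lines of Python. The example below is only an assumption about how such a mapping might be organised: identifiers known to name the same entity are kept in one group, and a lookup translates an identifier from one namespace to another. The namespaces and identifiers are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Entry = Tuple[str, str]  # (namespace, identifier)


class SynonymService:
    def __init__(self) -> None:
        self._group_of: Dict[Entry, int] = {}
        self._members: Dict[int, Set[Entry]] = defaultdict(set)
        self._next_group = 0

    def add_synonyms(self, *entries: Entry) -> None:
        """Record that all given (namespace, identifier) pairs name one entity."""
        existing = sorted({self._group_of[e] for e in entries if e in self._group_of})
        if existing:
            target = existing[0]
        else:
            target = self._next_group
            self._next_group += 1
        # Fold any other groups touched by these entries into the target group.
        for group in existing[1:]:
            for member in self._members.pop(group):
                self._group_of[member] = target
                self._members[target].add(member)
        for entry in entries:
            self._group_of[entry] = target
            self._members[target].add(entry)

    def translate(self, namespace: str, identifier: str, target_namespace: str) -> List[str]:
        """Return the identifiers in target_namespace for the same entity."""
        group = self._group_of.get((namespace, identifier))
        if group is None:
            return []
        return sorted(i for ns, i in self._members[group] if ns == target_namespace)


service = SynonymService()
service.add_synonyms(("ns_a", "A1"), ("ns_b", "B7"))
service.add_synonyms(("ns_b", "B7"), ("ns_c", "C3"))  # links ns_c to the same entity
```

Because the second call shares an entry with the first, all three identifiers end up in one group, so a lookup from ns_a to ns_c succeeds without a direct mapping between those two namespaces.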

Adaptable data management system
Within research there is a continued introduction of new technologies and techniques. These are often high throughput and automated, and their usage is continually evolving. To support these requirements we have built a data management system that can be rapidly adapted for new usage.

The data management system is designed to support the seamless mining and analysis of biological experiment data that is commonly used in systems biology (e.g. ChIP-chip, gene expression, proteomics, imaging, FACS). We use different content graphs to represent different views upon the data. Links between these views are dynamic and resolved at runtime. This means that the management system allows for both the rapid introduction of new types of information and the evolution of the knowledge it represents.

Rather than build a system de novo, we have extended Jackrabbit, a standardized JCR implementation from Apache. These extensions are designed to ensure that the system integrates well within a research enterprise by using automated LSID bindings, customizing the system to give it richer semantics, and providing workflows so that the system works robustly with high-throughput instrumentation.

Information captured from high-throughput imaging experiments can accumulate on the terabyte/day or CD/minute scale. The imaging system at the ISB transforms images using a customisable state machine to provide resource management. The transformed data is stored within a data management system.

The management system is being extended to allow for multiple levels of integration, so that experimental results can simply be "dropped" into the system and immediately made available, and can later be migrated through a state machine to allow for more complex representations. The architecture is also being extended to provide for materialized views through dynamic data transformation, context searching, project working, history mechanisms and relationship navigation.
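Migrating an item through levels of integration by means of a state machine can be sketched as below. The three levels and their names are assumptions made for illustration; the source does not say which states the actual system uses.

```python
from enum import Enum
from typing import List


class Level(Enum):
    DROPPED = "dropped"        # raw result, available immediately
    ANNOTATED = "annotated"    # metadata attached
    INTEGRATED = "integrated"  # linked into richer representations


# Allowed transitions: each level may only move to the next one.
TRANSITIONS = {
    Level.DROPPED: {Level.ANNOTATED},
    Level.ANNOTATED: {Level.INTEGRATED},
    Level.INTEGRATED: set(),
}


class DataItem:
    def __init__(self, name: str) -> None:
        self.name = name
        self.level = Level.DROPPED
        self.history: List[Level] = [Level.DROPPED]

    def migrate(self, target: Level) -> None:
        """Move to the target level, rejecting transitions that are not allowed."""
        if target not in TRANSITIONS[self.level]:
            raise ValueError(
                f"cannot move {self.name} from {self.level.value} to {target.value}"
            )
        self.level = target
        self.history.append(target)


item = DataItem("plate-scan-001")
item.migrate(Level.ANNOTATED)
item.migrate(Level.INTEGRATED)
```

Keeping the allowed transitions in a table makes the workflow easy to reconfigure: adding a level means adding an entry, not rewriting the migration logic.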

BIOINFORMATICS APPLICATION DEVELOPMENT

The informatics group develops applications and algorithms for use in specific life science research areas.

Cytoscape
Members of the team work on the core development of Cytoscape, the leading network analysis and visualization tool. It is an open source, community-led software development project. Cytoscape was originally developed at the ISB, and is now maintained by the Cytoscape Consortium, which consists of members from ISB, UCSD, MSKCC, Pasteur and Agilent.

The functionality of Cytoscape can be extended through the inclusion of "plugins". Plugins are compiled extensions to a specific Cytoscape release. A registration/versioning system can be used to ensure plugin compatibility.
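Since a plugin is compiled against a specific release, a compatibility check has to compare the release a plugin was built for with the release that is running. The Python sketch below shows one possible rule, matching on the major and minor version numbers. It is not Cytoscape's actual plugin API; the record fields, function names and matching rule are all assumptions.

```python
from dataclasses import dataclass
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '2.6.1' into (2, 6, 1)."""
    return tuple(int(part) for part in text.split("."))


@dataclass(frozen=True)
class PluginRecord:
    name: str
    plugin_version: str
    built_for: Tuple[str, ...]  # host releases the plugin was compiled against


def is_compatible(plugin: PluginRecord, host_version: str) -> bool:
    """Assumed rule: compatible when major and minor numbers match a target release."""
    host = parse_version(host_version)[:2]
    return any(parse_version(v)[:2] == host for v in plugin.built_for)


# Hypothetical registration entry for an example plugin.
layout_plugin = PluginRecord(
    name="example-layout",
    plugin_version="1.0",
    built_for=("2.5", "2.6"),
)
```

With this record, a host running release 2.6.3 would load the plugin, while a host running 2.7.0 would refuse it.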

Analysis and ETL Pipelines
A number of componentized pipelines have been built to enable the flexible analysis and processing of experiment data. The majority of these tools have been built for the processing of genomic and microscopy data, and are run within the GenePattern toolset environment. The use of a toolset builder allows for the rapid development and customisation of toolsets by non-software engineers.

The toolsets that have been constructed include those for the analysis of various ChIP-chip tiling and gene expression arrays, as well as for image analysis.

GenePattern has been used to construct pipelines for the analysis of a number of different data types, as it offers a convenient method for publishing and versioning toolsets.
Bench scientists run analyses through a simple form which hides the complexity of how the tools in the toolset are chained together. The pipelines are linked to our systems through generic publish and querying modules.
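Chaining tools behind a single entry point can be sketched as below. This is plain Python rather than GenePattern's module format; the two steps and the run_pipeline helper are hypothetical stand-ins for real analysis tools.

```python
from functools import reduce
from typing import Callable, List, Sequence

Step = Callable[[List[float]], List[float]]


def normalize(values: List[float]) -> List[float]:
    """Scale values so that the largest becomes 1.0."""
    top = max(values)
    return [v / top for v in values]


def threshold(values: List[float], cutoff: float = 0.5) -> List[float]:
    """Keep only values at or above the cutoff."""
    return [v for v in values if v >= cutoff]


def run_pipeline(steps: Sequence[Step], data: List[float]) -> List[float]:
    """Feed the output of each step into the next, in order."""
    return reduce(lambda current, step: step(current), steps, data)


# The user supplies only the input; the chaining is hidden.
result = run_pipeline([normalize, threshold], [2.0, 4.0, 8.0])
```

Because a pipeline is just an ordered list of steps, a toolset can be rearranged or extended without changing the steps themselves.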

© 2008, Institute for Systems Biology, All Rights Reserved