The informatics group has two research focuses: adaptable
informatics architectures and bioinformatics application
development. Together, these two areas allow us to develop the flexible tools
required to support large-scale biomedical research investigations. Some results of this work are available
as software packages in the download section.
ADAPTABLE INFORMATICS ARCHITECTURES
The rapid pace of research in the biological sciences requires new thinking in
the way that software is developed. Architectures are required which are able
to integrate data without the imposition of data models, and management systems
are needed that can be rapidly reconfigured to present different facets of the
same underlying data. Within the informatics group, we are producing distributed
enterprise systems that can be rapidly adapted to meet the
current and future demands of systems biology.
A flexible enterprise software architecture
The ISB Informatics Infrastructure, referred to as I3, is a
modular, service-oriented research enterprise architecture which is capable of
integrating emerging technologies. The I3 enterprise architecture is
designed for interoperability and extensibility, and uses facets of both
'top-down' and 'bottom-up' design. In I3, developers can use their own
evolving data models, but formally defined domain-specific data models and
services are also available through a set of common services. The architecture
is designed to be flexible, interoperable and lightweight, while enabling the
rapid development of new solutions and the integration of new technologies.
There are two sides to the architecture: data access and data analysis. The
data access side uses LSIDs (Life Science Identifiers) to provide an identity
system for mapping data items to each other and to their RDF-encoded metadata.
Relationship information is navigated through the RDF documents. The data analysis architecture is based
around Web Services, with an ontology describing the Web Service being stored
in a registry service, so that resources can be reasoned over and discovered at
run time. New services and data access are integrated by writing lightweight
wrappers. This is a "model free" architecture, where there is no direct
imposition of a structured data model on clients (which can be written in a
variety of languages). However, a standard ID mechanism coupled with the use of
"meta models" and ontologies means that a formal data-centric integration
strategy is available to developers if they wish to use it.
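As an illustration of the identity-driven data access described above, the following is a minimal sketch, not the I3 implementation. It parses an LSID of the general form urn:lsid:authority:namespace:object[:revision] and follows relationships held as simple subject/predicate/object triples. The authority `example.org`, the `ex:` predicates and the sample data are all assumptions made for the example; a real deployment would read the triples from RDF documents.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LSID:
    """A parsed Life Science Identifier."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None

    @classmethod
    def parse(cls, text: str) -> "LSID":
        parts = text.split(":")
        # The "urn:lsid" prefix is matched case-insensitively.
        prefix_ok = [p.lower() for p in parts[:2]] == ["urn", "lsid"]
        if len(parts) not in (5, 6) or not prefix_ok:
            raise ValueError(f"not a valid LSID: {text!r}")
        if any(not p for p in parts[2:5]):
            raise ValueError(f"empty LSID component in: {text!r}")
        revision = parts[5] if len(parts) == 6 else None
        return cls(parts[2], parts[3], parts[4], revision)

    def __str__(self) -> str:
        base = f"urn:lsid:{self.authority}:{self.namespace}:{self.object_id}"
        return f"{base}:{self.revision}" if self.revision else base


# Hypothetical metadata, standing in for RDF documents keyed by LSID.
TRIPLES = [
    ("urn:lsid:example.org:array:42", "ex:derivedFrom",
     "urn:lsid:example.org:sample:7"),
    ("urn:lsid:example.org:sample:7", "ex:organism",
     "Saccharomyces cerevisiae"),
]


def related(triples, subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]
```

A client that holds only the array's LSID can reach the sample's metadata by calling `related` twice, without knowing anything about how either item is stored.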
I3 conceptually consists of: a data access component where data and
associated metadata are identified using URNs; and a data analysis component
which uses interoperable web services. The system is loosely coupled and
identity driven, so that services and data are dynamically discovered.
To supplement the design we are also developing a number of horizontal services,
which can be used to provide cross-domain functionality. These services include:
a synonym service for mapping identifiers between different namespaces; and a
generic statistical service which controls the life-cycle and type mapping for R
scripts.
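To make the idea of a synonym service concrete, here is a small sketch of one way identifiers could be mapped between namespaces. It is an assumption-laden illustration rather than the service itself: the class name `SynonymService`, its methods, and the namespaces and identifiers in the usage note are all hypothetical.

```python
from collections import defaultdict


class SynonymService:
    """Groups (namespace, identifier) pairs that name the same entity."""

    def __init__(self):
        self._group_of = {}                # (namespace, id) -> group number
        self._members = defaultdict(set)   # group number -> {(namespace, id)}
        self._next_group = 0

    def add_synonyms(self, *identifiers):
        """Declare that all given (namespace, id) pairs are synonyms."""
        existing = {self._group_of[i] for i in identifiers
                    if i in self._group_of}
        if existing:
            target = min(existing)
        else:
            target = self._next_group
            self._next_group += 1
        # Merge any other groups that these identifiers already belong to.
        for group in existing - {target}:
            for member in self._members.pop(group):
                self._group_of[member] = target
                self._members[target].add(member)
        for identifier in identifiers:
            self._group_of[identifier] = target
            self._members[target].add(identifier)

    def translate(self, namespace, identifier, target_namespace):
        """Return the synonyms of an identifier in another namespace."""
        group = self._group_of.get((namespace, identifier))
        if group is None:
            return []
        return sorted(i for ns, i in self._members[group]
                      if ns == target_namespace)
```

For example, after `add_synonyms(("ns_a", "A1"), ("ns_b", "B9"))` and `add_synonyms(("ns_b", "B9"), ("ns_c", "C3"))`, a call to `translate("ns_a", "A1", "ns_c")` finds `C3` even though the two identifiers were never registered together.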
Adaptable data management system
Research continually introduces new technologies and techniques. These are
often high-throughput and automated, and their usage is continually evolving.
To support these requirements we have built a data management system that can
be rapidly adapted for new uses.
The data management system is designed to support the seamless mining and
analysis of biological experiment data that is commonly used in systems biology
(e.g. ChIP-chip, gene expression, proteomics, imaging, FACS). We use different
content graphs to represent different views upon the data. Links between these
views are dynamic and resolved at runtime. This means that the management
system allows for both the rapid introduction of new types of information and
the evolution of the knowledge it represents.
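The effect of resolving links between views at runtime can be sketched as follows. This is a simplified illustration under our own assumptions, not the system's actual data structures: `ContentGraph`, `Repository` and the node names are hypothetical, and a link is modelled as a plain (graph name, node id) pair.

```python
class ContentGraph:
    """One view on the data: node ids mapped to their properties."""

    def __init__(self, name):
        self.name = name
        self.nodes = {}

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties


class Repository:
    """Holds the views and resolves links between them on demand."""

    def __init__(self):
        self.graphs = {}

    def add_graph(self, graph):
        self.graphs[graph.name] = graph

    def resolve(self, link):
        """Look up a (graph name, node id) link at call time.

        Returns None when the target view or node does not exist (yet).
        """
        graph_name, node_id = link
        graph = self.graphs.get(graph_name)
        return None if graph is None else graph.nodes.get(node_id)
```

Because a link is only a name until it is resolved, a node in one view can refer to a view that has not been introduced yet; once that view is added, the same link resolves without any change to the node that holds it.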
Rather than build a system de novo, we have extended Apache Jackrabbit, an
implementation of the Java Content Repository (JCR) standard. These extensions
are designed to ensure that the system integrates well within a research
enterprise by using automated LSID bindings; customizing the system to give it
richer semantics; and providing workflows to ensure the system can work
robustly with high-throughput instrumentation.
The volume of information captured from high-throughput imaging experiments can be measured on the terabyte/day or CD/minute scale.
The imaging system at the ISB transforms images using a customisable state machine to provide resource management.
The transformed data is stored within a data management system.
The management system is being extended to allow for multiple levels of
integration, so that experimental results can simply be "dropped" into the
system and immediately made available, and can later be migrated through a
state machine to allow for more complex representations. The architecture is
also being extended to provide for materialized views through dynamic data
transformation, context searching, project working, history mechanisms and
relationship navigation.
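The migration of results through levels of integration can be pictured as a small state machine. The sketch below is illustrative only: the three levels (`DROPPED`, `ANNOTATED`, `STRUCTURED`) and the allowed transitions are assumptions chosen for the example and do not describe the states used in the actual system.

```python
from enum import Enum


class Level(Enum):
    DROPPED = 1      # raw result, available immediately
    ANNOTATED = 2    # hypothetical intermediate level
    STRUCTURED = 3   # hypothetical richer representation


# Allowed migrations: each level may only move to the next one.
TRANSITIONS = {
    Level.DROPPED: {Level.ANNOTATED},
    Level.ANNOTATED: {Level.STRUCTURED},
    Level.STRUCTURED: set(),
}


class Item:
    """An experimental result tracked through its integration levels."""

    def __init__(self, name):
        self.name = name
        self.level = Level.DROPPED
        self.history = [Level.DROPPED]

    def migrate(self, target):
        """Move to `target` if the state machine allows it."""
        if target not in TRANSITIONS[self.level]:
            raise ValueError(
                f"{self.name}: cannot migrate from "
                f"{self.level.name} to {target.name}"
            )
        self.level = target
        self.history.append(target)
```

An item is usable as soon as it is created at the `DROPPED` level, and the `history` list keeps a record of each migration that follows.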
BIOINFORMATICS APPLICATION DEVELOPMENT
The informatics group develops applications and algorithms for use in specific
life science research areas.
Cytoscape
Members of the team work on the core development of Cytoscape, a leading network analysis and visualization tool. It is an open-source,
community-led software development project. Cytoscape was originally developed at the ISB,
and is now maintained by the Cytoscape Consortium, which consists of members from ISB, UCSD, MSKCC, Pasteur and Agilent.
The functionality of Cytoscape can be extended through the inclusion of "plugins". Plugins are compiled extensions
to a specific Cytoscape release. Shown above is the registration/versioning system that can be used to ensure plugin compatibility.
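Since a plugin is compiled against a specific release, a registry can record which releases each plugin build targets and offer only matching builds. The following is a generic sketch of that check and does not use or describe the Cytoscape plugin API; `PluginEntry`, `latest_compatible`, and the plugin names and version numbers in the usage note are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PluginEntry:
    name: str
    plugin_version: str
    built_for: frozenset  # host releases this build was compiled against


def parse_version(text):
    """Turn '1.10' into (1, 10) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def latest_compatible(registry, host_release):
    """Return the newest registered build of each plugin for a release."""
    best = {}
    for entry in registry:
        if host_release not in entry.built_for:
            continue
        current = best.get(entry.name)
        if (current is None or
                parse_version(entry.plugin_version)
                > parse_version(current.plugin_version)):
            best[entry.name] = entry
    return best
```

Comparing versions as tuples of integers means that a build numbered 1.10 is correctly treated as newer than 1.9, which a plain string comparison would get wrong.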
Analysis and ETL Pipelines
A number of componentized pipelines have been built to enable the flexible
analysis and processing of experiment data. The majority of these tools have
been built for the processing of genomic and microscopy data, and are run
within the GenePattern toolset environment. The use of a toolset builder allows
people who are not software engineers to rapidly develop and customise toolsets.
The toolsets that have been constructed include those for the analysis of various ChIP-chip tiling and gene expression arrays,
as well as for image analysis.
GenePattern has been used to construct pipelines for the analysis of a number of
different data types, as it offers a convenient method for publishing and versioning toolsets.
Bench scientists run analyses through a simple form which hides the complexity of how the tools in the toolset are chained together. The pipelines are linked to our systems through generic publish and querying modules.
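The way a single form can drive a chain of tools may be sketched as below. This is a generic illustration and does not use the GenePattern API: the step functions, the `TOOLSET` list and the form fields `values` and `threshold` are assumptions, with `query_step` and `publish_step` standing in for the generic querying and publish modules.

```python
PUBLISHED = []  # stand-in for the system that results are published to


def query_step(data, params):
    """Fetch the input data (here, simply read from the form values)."""
    return list(params["values"])


def normalise_step(data, params):
    """Scale every value by the largest value."""
    top = max(data)
    return [value / top for value in data]


def threshold_step(data, params):
    """Keep only values at or above the user's threshold."""
    return [value for value in data if value >= params["threshold"]]


def publish_step(data, params):
    """Record the final result and pass it through unchanged."""
    PUBLISHED.append(data)
    return data


# The toolset fixes the order of the tools; the user never sees it.
TOOLSET = [query_step, normalise_step, threshold_step, publish_step]


def run_toolset(toolset, form_values):
    """Run each tool in order, feeding its output to the next one."""
    data = None
    for step in toolset:
        data = step(data, form_values)
    return data
```

The user supplies only the form values, for example `run_toolset(TOOLSET, {"values": [2, 4, 8], "threshold": 0.5})`, while the ordering and wiring of the tools stay inside the toolset definition.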