KiPar is a computer application for (1) retrieval of textual documents given pathway
information and the required kinetic parameters; and (2) annotation of the retrieved
documents with the pathway- and kinetics-related concepts and the potential values
of these parameters. Based on the requirements of kinetic metabolic modelling, we
identified a set of concepts necessary to specify typical information needs occurring
during model construction. These include:
enzymes catalysing the reactions of interest,
a pathway (i.e. a specific part of a network) to which these reactions belong,
the organism studied as the biological context of the pathway, and
the parameters for which values are required for the kinetic aspects of the model.
Other types of information can be derived from these concepts using
public data resources. For example, given an enzyme one can retrieve the known
information from the KEGG ENZYME
database (Kanehisa, et al., 2008)
about the compounds acting as substrates/products of the reaction catalysed.
Similarly, we can obtain information on genes known to encode the enzyme.
Given the terminological variability
(Spasic, et al., 2005)
of biomedical sublanguages (Friedman, et al., 2002;
Harris, 2002), KiPar does not accept free-text
descriptions of the input concepts. Instead, input is acquired through browsing
relevant biomedical ontologies and databases (DBs), and selecting the concepts of
interest. By supplying widely recognised identifiers for the concepts, rather than
their possibly ambiguous names, we facilitate subsequent integration of information
acquired from disparate public data resources. The resources used to specify input
information are: KEGG ENZYME
(Kanehisa, et al., 2008;
KEGG, 2008) for enzymes,
the Gene Ontology (GO)
(Ashburner, et al., 2000)
for pathways and the Systems Biology Ontology (SBO)
(Le Novere, 2006)
for the required kinetic parameters.
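As an illustration of identifier-based input, the following minimal sketch shows one way such a specification could be represented and checked. The `InputSpec` class, its field names and the regular expressions are assumptions made for this example and are not part of KiPar; the identifiers shown (EC 2.7.1.1, GO:0006096, SBO:0000027) are only sample values.

```python
import re
from dataclasses import dataclass, field

# Identifier shapes for the three resources named in the text (assumed patterns).
EC_PATTERN = re.compile(r"^\d+\.\d+\.\d+\.\d+$")   # e.g. 2.7.1.1
GO_PATTERN = re.compile(r"^GO:\d{7}$")             # e.g. GO:0006096
SBO_PATTERN = re.compile(r"^SBO:\d{7}$")           # e.g. SBO:0000027


@dataclass
class InputSpec:
    """Hypothetical container for an identifier-based input specification."""
    organism: str
    enzymes: list = field(default_factory=list)      # EC numbers
    pathways: list = field(default_factory=list)     # GO identifiers
    parameters: list = field(default_factory=list)   # SBO identifiers

    def validate(self):
        """Return the malformed identifiers (an empty list if all are well formed)."""
        checks = [
            (self.enzymes, EC_PATTERN),
            (self.pathways, GO_PATTERN),
            (self.parameters, SBO_PATTERN),
        ]
        return [i for ids, pattern in checks for i in ids if not pattern.match(i)]


spec = InputSpec(
    organism="Saccharomyces cerevisiae",
    enzymes=["2.7.1.1"],          # hexokinase
    pathways=["GO:0006096"],      # glycolytic process
    parameters=["SBO:0000027"],   # Michaelis constant
)
```

Rejecting a free-text name such as "hexokinase" at this stage mirrors the design choice described above: only unambiguous identifiers are passed on to the later steps.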
The figure above depicts the structure of the KiPar system, with boxes numbered in the logical order
of the steps it performs. Given the high-level input specification (box 1),
KiPar employs a range of integrated bioinformatics strategies (relying heavily on web services)
to harvest reaction-specific terms (e.g. an enzyme, compounds acting as substrates/products,
and the genes encoding the enzyme) from publicly available biological DBs (box 2): KEGG,
PubChem (PubChem, 2008),
ChEBI (ChEBI, 2008;
Degtyarenko, et al., 2008),
SGD (Cherry, et al., 1998) and
CYGD (CYGD, 2008;
Güldener, et al., 2005).
The problem of terminological variability is further tackled by collecting additional
synonyms from UMLS (Bodenreider, 2004;
UMLS, 2008) (box 3).
The search terms collected are used to effect a transition from conceptual to textual space.
In order to query the literature for the information required for a kinetic model of a given
pathway (box 4), KiPar first indexes the literature with concepts of interest. The indexing
process involves mapping each concept to a search query based on the synonyms acquired in the
previous steps, e.g. the query used to search for information on enzyme with EC number
The query is passed to Entrez (Entrez, 2008),
a search and retrieval system that enables a user to access information from many
NCBI DBs (Wheeler, et al., 2008).
The search results effectively map a concept to a set of matching documents in
the NCBI literature DBs
(PubMed (PubMed, 2008) and
PubMed Central (PMC, 2008)).
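The mapping from a concept's synonyms to a search query can be sketched as follows. This is an illustrative assumption about how such a query might be assembled, not the query KiPar actually generates: the functions `build_query` and `esearch_url` are hypothetical, and the synonyms are sample values. The URL is that of the public NCBI E-utilities ESearch service; the sketch only builds the request and does not send it.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(synonyms):
    """Join synonyms into one OR query, quoting each so multi-word terms stay phrases."""
    unique = sorted({s.strip() for s in synonyms if s.strip()}, key=str.lower)
    return " OR ".join(f'"{s}"' for s in unique)


def esearch_url(query, db="pubmed", retmax=100):
    """Build (but do not send) an ESearch request URL for the given query."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": query, "retmax": retmax})


# Sample synonyms for one enzyme concept (illustrative only).
query = build_query(["hexokinase", "EC 2.7.1.1", "ATP:D-hexose 6-phosphotransferase"])
url = esearch_url(query)
```

Passing `db="pmc"` instead of the default would target PubMed Central rather than PubMed.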
A local DB is used to store information gathered about concepts, synonyms and documents.
This DB is queried for relevant information within the indexed documents. Each document
is assigned a score (S) using a weighted formula combining the number of matching
concepts of each type considered (i.e. enzymes, compounds, genes, pathways and kinetic parameters):
where EC is a set of selected enzymes, e is an enzyme from EC,
CPD_e is a set of compounds involved in the reaction catalysed by the
enzyme e, SCE_e is a set of S. cerevisiae genes
encoding the enzyme e, GO is a set of pathway-related concepts selected
from GO, SBO is a set of
kinetics-related concepts selected from SBO,
hits(S) is the percentage of concepts in the set S matching the
given document, and ω_EC, ω_CPD,
ω_SCE, ω_RN, ω_GO,
ω_PATH and ω_SBO are weights used for
enzymes, compounds, genes, reactions, GO terms, pathways and SBO terms respectively. The
weights used are configurable parameters of the system.
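To make the idea of a weighted score concrete, the sketch below combines per-type match rates linearly. It is a deliberately simplified stand-in and should not be read as KiPar's actual formula: the function names, the dictionary layout and the weight values are all assumptions, and `hits` returns a fraction between 0 and 1 rather than a percentage.

```python
def hits(concepts, matched):
    """Fraction of the concepts in a set that match the document (0.0 for an empty set)."""
    concepts = set(concepts)
    if not concepts:
        return 0.0
    return len(concepts & set(matched)) / len(concepts)


def score(document_matches, concept_sets, weights):
    """Weighted sum of hits() over concept types.

    A simplified linear combination used for illustration only.
    """
    return sum(
        weights.get(kind, 0.0) * hits(concepts, document_matches)
        for kind, concepts in concept_sets.items()
    )


# Illustrative concept sets for one enzyme and its reaction.
concept_sets = {
    "EC": {"2.7.1.1"},
    "CPD": {"ATP", "D-glucose", "ADP", "D-glucose 6-phosphate"},
    "SBO": {"SBO:0000027"},
}
weights = {"EC": 2.0, "CPD": 1.0, "SBO": 3.0}   # hypothetical values
matches = {"2.7.1.1", "ATP", "D-glucose"}       # concepts found in one document
s = score(matches, concept_sets, weights)
```

With these sample values the document matches the enzyme and half of the compounds but no kinetic parameter, so raising the hypothetical SBO weight would push documents that mention kinetic parameters higher in the ranking.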
By storing the mappings between concepts, terms and documents in a local DB, the querying
ability of the DB management system can be combined with that of Entrez, which is a practical
alternative to launching multiple Entrez queries searching for different combinations of
pathway-related concepts. The expressiveness of SQL
and the speed of a relational DB management system are used to perform semantically complex
searches in a less cumbersome way and with reduced execution time. Finally, the highest-ranked
documents are presented to the user in HTML
format (box 5). The results produced represent links to the original documents annotated with the matching
concepts (linked to their entries in the relevant DBs) and quantitative data. The annotation helps a user
determine which types of information each document contains so it can be incorporated into the model, which
can be formally represented in SBML (Hucka, et al., 2003;
SBML, 2008). In addition, all citations retrieved are exported
into BibTeX format, which can be used to import citation details into most reference management applications.
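The benefit of querying locally stored concept-to-document mappings can be illustrated with a small sketch. The table and column names below are hypothetical and do not reflect KiPar's actual schema, and an in-memory SQLite database is used here only so that the example is self-contained. A single SQL query finds the documents that match both an enzyme and a kinetic parameter, which would otherwise require a separate combined search against Entrez.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE concept (id INTEGER PRIMARY KEY, kind TEXT, name TEXT);
    CREATE TABLE concept_document (concept_id INTEGER, doc_id TEXT);
""")
conn.executemany("INSERT INTO concept VALUES (?, ?, ?)", [
    (1, "enzyme", "2.7.1.1"),
    (2, "parameter", "SBO:0000027"),
    (3, "pathway", "GO:0006096"),
])
# Each row records that a concept's indexing query matched a document.
conn.executemany("INSERT INTO concept_document VALUES (?, ?)", [
    (1, "doc-A"), (2, "doc-A"), (3, "doc-A"),
    (1, "doc-B"), (3, "doc-B"),
    (2, "doc-C"),
])

# Documents matched by at least one enzyme AND at least one kinetic parameter.
rows = conn.execute("""
    SELECT cd.doc_id
    FROM concept_document AS cd
    JOIN concept AS c ON c.id = cd.concept_id
    WHERE c.kind IN ('enzyme', 'parameter')
    GROUP BY cd.doc_id
    HAVING COUNT(DISTINCT c.kind) = 2
    ORDER BY cd.doc_id
""").fetchall()
both = [r[0] for r in rows]
```

Other combinations of concept types can be expressed by changing the `WHERE` and `HAVING` clauses, without issuing any further remote queries.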
OS Independent (Written in an interpreted language)
Text mining, Bio-Informatics
Follow these steps in order to install KiPar:
Make sure you have Java installed on your computer.
Make sure you have access to a relational database management system.
We tested KiPar with a database hosted on a PostgreSQL system, but
other DB management systems supporting SQL should work with KiPar by changing the
driver information in the configuration file.
Create two databases:
a local KiPar database (e.g. called kipar)
A local database needs to be installed to hold intermediary
data obtained automatically when running KiPar. Run
to create the tables
local PubChem database (e.g. called pubchem)
A local database needs to be installed and populated with a relevant subset
of data from the PubChem databases. After downloading the specified files
(unzip where necessary), create a new database and run the given SQL scripts
to create the tables and load them with the data from the corresponding files.
Download KiPar.zip and unzip the file.
This will create a folder KiPar from which KiPar will be run.
Different modules of KiPar are run in a pipeline from the <start>
option to the <end> option specified during configuration. The
following options are available:
1 - Get terms
Connects to external (SBO, GO, KEGG, ChEBI) and local (PubChem) databases to obtain
knowledge and terminologies related to the concepts specified as input data during
configuration.
2 - Initialize database
Empties the local KiPar database specified during configuration and populates it
with terms obtained in step 1.
3 - Expand with synonyms
Connects to UMLS to obtain additional synonyms for terms acquired in step 1 and
updates KiPar's database with them. Nota Bene: Running this option
without first obtaining a UMLS licence will raise a RemoteException
because of an invalid client IP address. Alternatively, you may choose to skip this
step.
4 - Query literature
Updates each concept in KiPar's database with its indexing queries.
5 - Index literature
Uses indexing queries generated in step 4 to map each concept to the
matching documents and stores this information in KiPar's database.
6 - Score documents
Uses indexing information obtained in step 5 to score all indexed documents.
7 - Retrieve document details
Retrieves citation details for the highest scored documents.
8 - Export results
Exports document details obtained in step 7, downloads full-text articles,
and annotates all exported documents with the matching concepts linked
to their entries in relevant databases. Exported information is available
in HTML format.
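The start-to-end behaviour of the pipeline described above can be sketched as follows. The step names are taken from the list of options, but the `run_pipeline` function and the `handlers` argument are hypothetical and serve only to show how a contiguous range of steps is selected; they do not reflect how KiPar is implemented.

```python
STEPS = {
    1: "Get terms",
    2: "Initialize database",
    3: "Expand with synonyms",
    4: "Query literature",
    5: "Index literature",
    6: "Score documents",
    7: "Retrieve document details",
    8: "Export results",
}


def run_pipeline(start, end, handlers=None):
    """Run steps start..end (inclusive) in order and return the names of the steps run.

    handlers optionally maps a step number to a zero-argument callable.
    """
    if not (1 <= start <= end <= len(STEPS)):
        raise ValueError(f"invalid step range: {start}..{end}")
    handlers = handlers or {}
    executed = []
    for number in range(start, end + 1):
        handler = handlers.get(number)
        if handler is not None:
            handler()   # perform the work of this step
        executed.append(STEPS[number])
    return executed


# Re-run only the querying, indexing and scoring steps.
executed = run_pipeline(4, 6)
```

Because each step stores its results in the local database, a later run can resume from an intermediate step, for example after changing the scoring weights, without repeating the earlier ones.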
These dependencies are described for documentation purposes only and no actions
other than the ones listed in the instructions above are required. KiPar makes
use of the tools listed below. The versions given refer to the ones used during
the development of KiPar. Later versions should generally work, but they have not
been tested.
More details on how the external resources are used in KiPar are given in resources.pdf.
All relevant files have been archived into KiPar.zip.
The links to some of the files given below are provided for illustration
and do not need to be downloaded separately. The only exception is the JPedal library,
which needs to be downloaded directly from http://www.jpedal.org/
due to its GPL licence. Once downloaded it should be placed into the jpedal subdirectory
of the lib directory. Previous usage of this library was invalid. We apologize for the
Java classes for managing the XML data acquired from GO. These classes were generated
automatically using JAXB (which is distributed with Java WSDP) based on the XML schema
ajaxGO.xsd given in
the schema folder.
Java classes used by configuration.java to manipulate the configuration
file config.xml. These classes
were generated automatically using JAXB (which is distributed with Java WSDP) based on
the XML schema config.xsd
given in the schema folder.
Java classes for managing the XML data acquired from UMLS. These classes were generated
automatically using JAXB (which is distributed with Java WSDP) based on the XML schema
UMLS.xsd given in
the schema folder.
A configuration of the tool to be used. This XML file conforms to the schema given in
annotations given in the XML Schema document explain the configurable parameters.
This file is automatically generated by from the XML files explained below:
default.xml and user.xml.
Java class for accessing KEGG database(s) and other cross-referenced
databases (SGD, CYGD, PubChem), used
here to retrieve information about an enzyme, compounds acting as substrates/products of
the catalysed reaction and the genes known to encode the enzyme.
Motivation: Quantitative models of metabolism (and of other bio-chemical processes)
require knowledge of enzyme kinetic parameters. Many of these are buried in the literature, and
manual search for the appropriate subset of papers is both time-consuming and likely to lead to
very partial results. We have developed a text mining approach that retrieves relevant scientific
publications and annotates them with information required for the kinetic modelling of a given pathway.
Results: We developed an integrative approach, combining publicly available data
and software resources, for the time- and cost-effective development of text mining tools for
retrieval of information relevant to systems biology applications. To demonstrate the conceptual
approach we implemented KiPar, a standalone Java application for the retrieval of textual documents
likely to contain information relevant for kinetic modelling of a given metabolic pathway. During
the document retrieval process, relevant semantic and lexical information is acquired from public
data resources. The use of S. cerevisiae as the biological context of the pathways modelled
influenced the choice and usage of these resources. The evaluation results show that KiPar can provide
valuable support to those interested in kinetic modelling of metabolism by providing quick access to
literature relevant for a particular pathway. The evaluation also points to the fact that full-text
articles are a much richer source of information on kinetic parameters than are their abstracts. Thus,
the greater availability of full papers will contribute substantially to the improvement of systems