Manchester Centre for Integrative Systems Biology

KiPar


KiPar is a standalone Java application for the retrieval of textual documents likely to contain information relevant for kinetic modelling of a given metabolic pathway.


Specification

KiPar is a computer application for (1) retrieval of textual documents given pathway information and the required kinetic parameters; and (2) annotation of the retrieved documents with the pathway- and kinetics-related concepts and the potential values of these parameters. Based on the requirements of kinetic metabolic modelling, we identified a set of concepts necessary to specify typical information needs occurring during model construction. These include:

  1. enzymes catalysing the reactions of interest,
  2. a pathway (i.e. a specific part of a network) to which these reactions belong,
  3. the organism studied as the biological context of the pathway, and
  4. the parameters for which values are required for the kinetic aspects of the model.


Other types of information can be derived from these concepts using public data resources. For example, given an enzyme, one can retrieve from the KEGG ENZYME database (Kanehisa, et al., 2008) the compounds known to act as substrates/products of the catalysed reaction. Similarly, one can obtain information on the genes known to encode the enzyme.

[Figure: metabolic pathway]

Given the terminological variability (Spasic, et al., 2005) of biomedical sublanguages (Friedman, et al., 2002; Harris, 2002), KiPar does not accept free-text descriptions of the input concepts. Instead, input is acquired by browsing relevant biomedical ontologies and databases (DBs) and selecting the concepts of interest. By supplying widely recognised identifiers for the concepts, rather than their possibly ambiguous names, we facilitate subsequent integration of information acquired from disparate public data resources. The resources used to specify input information are: KEGG ENZYME (Kanehisa, et al., 2008; KEGG, 2008) for enzymes, the Gene Ontology (GO) (Ashburner, et al., 2000; GO, 2008) for pathways and the Systems Biology Ontology (SBO) (Le Novere, 2006; SBO, 2008) for the required kinetic parameters.
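As a rough illustration of identifier-only input, the sketch below holds EC numbers, GO IDs and SBO IDs and rejects anything that looks like free text. The class name PathwayQuerySpec and the simplified identifier patterns are assumptions made for this example; they are not taken from KiPar's source, which reads its input from user.xml.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical container for identifier-based input (not part of KiPar).
final class PathwayQuerySpec {
    // Simplified patterns (assumption): four numeric EC fields,
    // and a prefix plus seven digits for GO and SBO accessions.
    private static final Pattern EC = Pattern.compile("\\d+\\.\\d+\\.\\d+\\.\\d+");
    private static final Pattern GO = Pattern.compile("GO:\\d{7}");
    private static final Pattern SBO = Pattern.compile("SBO:\\d{7}");

    final List<String> enzymes;     // EC numbers
    final List<String> pathways;    // GO IDs
    final List<String> parameters;  // SBO IDs

    PathwayQuerySpec(List<String> enzymes, List<String> pathways, List<String> parameters) {
        requireAll(enzymes, EC, "EC number");
        requireAll(pathways, GO, "GO ID");
        requireAll(parameters, SBO, "SBO ID");
        this.enzymes = List.copyOf(enzymes);
        this.pathways = List.copyOf(pathways);
        this.parameters = List.copyOf(parameters);
    }

    // Throws if any entry does not match the expected identifier shape.
    private static void requireAll(List<String> ids, Pattern pattern, String kind) {
        for (String id : ids) {
            if (!pattern.matcher(id).matches()) {
                throw new IllegalArgumentException("Not a valid " + kind + ": " + id);
            }
        }
    }

    public static void main(String[] args) {
        // Example identifiers chosen only to exercise the patterns.
        PathwayQuerySpec spec = new PathwayQuerySpec(
                List.of("3.1.3.12"), List.of("GO:0005992"), List.of("SBO:0000027"));
        System.out.println(spec.enzymes + " " + spec.pathways + " " + spec.parameters);
    }
}
```

Passing a name such as "trehalose phosphatase" instead of an EC number raises an IllegalArgumentException, which mirrors the decision to avoid ambiguous free-text input.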

[Figure: TM workflow]

The figure above depicts the structure of the KiPar system, with the boxes numbered in the logical order of the steps it performs. Given the high-level input specification (box 1), KiPar employs a range of integrated bioinformatics strategies (relying heavily on web services) to harvest reaction-specific terms (e.g. an enzyme, compounds acting as substrates/products, and the genes encoding the enzyme) from publicly available biological DBs (box 2): KEGG, PubChem (PubChem, 2008), ChEBI (ChEBI, 2008; Degtyarenko, et al., 2008), SGD (Cherry, et al., 1998; SGD, 2008) and CYGD (CYGD, 2008; Güldener, et al., 2005). The problem of terminological variability is further tackled by collecting additional synonyms from UMLS (Bodenreider, 2004; UMLS, 2008) (box 3). The search terms collected are used to effect a transition from conceptual to textual space. In order to query the literature for the information required for a kinetic model of a given pathway (box 4), KiPar first indexes the literature with the concepts of interest. The indexing process involves mapping each concept to a search query based on the synonyms acquired in the previous steps, e.g. the query used to search for information on the enzyme with EC number 3.1.3.12 is:


The query is passed to Entrez (Entrez, 2008), a search and retrieval system that enables a user to access information from many NCBI DBs (Wheeler, et al., 2008). The search results effectively map a concept to a set of matching documents in the NCBI literature DBs (PubMed (PubMed, 2008) and PubMed Central (PMC, 2008)). A local DB is used to store information gathered about concepts, synonyms and documents. This DB is queried for relevant information within the indexed documents. Each document is assigned a score (S) using a weighted formula combining the number of matching concepts of each type considered (i.e. enzymes, compounds, genes, pathways and kinetic parameters):

[Equation: document score S]

where EC is a set of selected enzymes, e is an enzyme from EC, CPDe is a set of compounds involved in the reaction catalysed by the enzyme e, SCEe is a set of S. cerevisiae genes encoding the enzyme e, GO is a set of pathway-related concepts selected from GO, SBO is a set of kinetics-related concepts selected from SBO, hits(S) is the percentage of concepts in the set S matching the given document, and ωEC, ωCPD, ωSCE, ωRN, ωGO, ωPATH and ωSBO are weights used for enzymes, compounds, genes, reactions, GO terms, pathway and SBO terms respectively. The weights used are configurable parameters of the system.
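The weighting idea can be sketched in a few lines of Java. The exact form of the scoring formula is not reproduced in the text here, so the sketch assumes a plain weighted sum of hits(S) over the concept sets and ignores the per-enzyme grouping of compounds and genes. The class name, the set labels used as map keys and the weight values are all hypothetical.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical scoring sketch; the weighted-sum form is an assumption.
final class DocumentScoreSketch {
    // hits(S): percentage of the concepts in S that match the document.
    static double hits(Set<String> concepts, Set<String> matchedInDocument) {
        if (concepts.isEmpty()) {
            return 0.0;
        }
        long matched = concepts.stream().filter(matchedInDocument::contains).count();
        return 100.0 * matched / concepts.size();
    }

    // Sums weight * hits over every concept type; missing weights count as 0.
    static double score(Map<String, Set<String>> conceptSets,
                        Map<String, Double> weights,
                        Set<String> matchedInDocument) {
        double total = 0.0;
        for (Map.Entry<String, Set<String>> entry : conceptSets.entrySet()) {
            double weight = weights.getOrDefault(entry.getKey(), 0.0);
            total += weight * hits(entry.getValue(), matchedInDocument);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> conceptSets = Map.of(
                "EC", Set.of("3.1.3.12"),
                "SBO", Set.of("SBO:0000027", "SBO:0000025"));
        Map<String, Double> weights = Map.of("EC", 2.0, "SBO", 1.0); // made-up weights
        Set<String> matched = Set.of("3.1.3.12", "SBO:0000027");
        System.out.println(score(conceptSets, weights, matched));
    }
}
```

Because the weights are passed in rather than hard-coded, the sketch keeps the property stated above that they are configurable parameters.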

By storing the mappings between concepts, terms and documents in a local DB, the querying ability of the DB management system can be combined with that of Entrez, which is a practical alternative to launching multiple Entrez queries searching for different combinations of pathway-related concepts. The expressiveness of SQL and the speed of a relational DB management system are used to perform semantically complex searches in a less cumbersome way and with reduced execution time. Finally, the highest-ranked documents are presented to the user in HTML format (box 5). The results produced represent links to the original documents annotated with the matching concepts (linked to their entries in the relevant DBs) and quantitative data. The annotation helps a user determine which types of information each document contains so it can be incorporated into the model, which can be formally represented in SBML (Hucka, et al., 2003; SBML, 2008). In addition, all citations retrieved are exported into BibTeX format, which can be used to import citation details into most reference management applications.
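The benefit of indexing each concept once and combining the results locally can be illustrated with an in-memory stand-in for the local DB. This is only a sketch: KiPar uses a relational DB and SQL, whereas the class below (a hypothetical ConceptIndexSketch) uses Java collections, and the document IDs are made up.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// In-memory stand-in for the concept-to-document mappings (assumption).
final class ConceptIndexSketch {
    private final Map<String, Set<Integer>> docsByConcept = new HashMap<>();

    // Records that a concept matched the given documents (one search per concept).
    void index(String conceptId, int... documentIds) {
        Set<Integer> docs = docsByConcept.computeIfAbsent(conceptId, k -> new TreeSet<>());
        for (int id : documentIds) {
            docs.add(id);
        }
    }

    // Documents matching every listed concept, found without any further remote query.
    Set<Integer> matchingAll(String... conceptIds) {
        Set<Integer> result = null;
        for (String conceptId : conceptIds) {
            Set<Integer> docs = docsByConcept.getOrDefault(conceptId, Set.of());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        ConceptIndexSketch index = new ConceptIndexSketch();
        index.index("3.1.3.12", 101, 102, 103);      // made-up document IDs
        index.index("SBO:0000027", 102, 103, 104);
        System.out.println(index.matchingAll("3.1.3.12", "SBO:0000027"));
    }
}
```

Each new combination of concepts is answered by a local set intersection, which is the role played by SQL queries over the local DB in the description above.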




Implementation

Project details

Project administrators: Dr Irena Spasić
Developers: 1
Development status: 4 - Beta
Intended audience: Developers, Science/Research
License: Academic Free License (AFL) v3.0
Operating system: OS Independent (Written in an interpreted language)
Programming language: Java
Database environment: PostgreSQL
Topics: Text mining, Bio-Informatics
User interface: Command-line

Installation instructions

Follow these steps in order to install KiPar:

  1. Make sure you have Java installed on your computer.
  2. Make sure you have access to a relational database management system. We tested KiPar with a database hosted on a PostgreSQL system, but other DB management systems supporting SQL should work with KiPar by changing the driver information in the configuration file.
  3. Create two databases:

    1. KiPar's local database (e.g. called kipar)

      A local database needs to be installed to hold intermediary data obtained automatically when running KiPar. Run kipar.sql to create the tables (schema diagram).

    2. local PubChem database (e.g. called pubchem)

      A local database needs to be installed and populated with a relevant subset of data from the PubChem databases. After downloading the specified files (CID-SID, CID-Synonym), create a new database and run the given SQL scripts to create (create_pubchem_tables.sql) the tables and load (load_pubchem_tables.sql) them with the data from the corresponding files.

  4. Download KiPar.zip and unzip the file. This will create a folder KiPar from which KiPar will be run.

Usage instructions

Follow these steps in order to run KiPar:

  1. Configure: input data and system parameters

    In order to configure KiPar, you can either:

    1. edit user.xml directly [example], or
    2. double-click RunConfiguration.bat, which will launch the Pedro interface to KiPar [example]. (Note that Pedro uses Java 1.4.1.)

  2. Run: double-click RunKiPar.bat

    Different modules of KiPar are run in a pipeline from the <start> option to the <end> option specified during configuration. The following options are available:

    Option | Effect
    1 - Get terms | Connects to external (SBO, GO, KEGG, ChEBI) and local (PubChem) databases to obtain knowledge and terminologies related to the concepts specified as input data during configuration.
    2 - Initialize database | Empties KiPar's local database specified during configuration and populates it with the terms obtained in step 1.
    3 - Expand with synonyms | Connects to UMLS to obtain additional synonyms for the terms acquired in step 1 and updates KiPar's database with them. Note: Running this option without first obtaining a UMLS licence will generate a RemoteException because of an invalid client IP address. Alternatively, you may choose to skip this option.
    4 - Query literature | Updates each concept in KiPar's database with its indexing queries.
    5 - Index literature | Uses the indexing queries generated in step 4 to map each concept to the matching documents and stores this information in KiPar's database [example].
    6 - Score documents | Uses the indexing information obtained in step 5 to score all indexed documents.
    7 - Retrieve document details | Retrieves citation details for the highest-scored documents.
    8 - Export results | Exports the document details obtained in step 7, downloads full-text articles, and annotates all exported documents with the matching concepts linked to their entries in the relevant databases. Exported information is available in HTML format [example].
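The start/end selection described above can be sketched as follows. The option numbers and labels come from the table; the class PipelineSketch, the enum and the range method are hypothetical names for this example and do not reflect how RunKiPar.java is actually organised.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the start..end option range (not KiPar's code).
final class PipelineSketch {
    // The eight options listed in the table above, in pipeline order.
    enum Step {
        GET_TERMS(1, "Get terms"),
        INITIALIZE_DATABASE(2, "Initialize database"),
        EXPAND_WITH_SYNONYMS(3, "Expand with synonyms"),
        QUERY_LITERATURE(4, "Query literature"),
        INDEX_LITERATURE(5, "Index literature"),
        SCORE_DOCUMENTS(6, "Score documents"),
        RETRIEVE_DOCUMENT_DETAILS(7, "Retrieve document details"),
        EXPORT_RESULTS(8, "Export results");

        final int option;
        final String label;

        Step(int option, String label) {
            this.option = option;
            this.label = label;
        }
    }

    // Returns the steps from <start> to <end> inclusive, in order.
    static List<Step> range(int start, int end) {
        if (start < 1 || end > Step.values().length || start > end) {
            throw new IllegalArgumentException("Invalid range: " + start + ".." + end);
        }
        List<Step> selected = new ArrayList<>();
        for (Step step : Step.values()) {
            if (step.option >= start && step.option <= end) {
                selected.add(step);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Lists the steps a run from option 4 to option 6 would cover.
        for (Step step : range(4, 6)) {
            System.out.println(step.option + " - " + step.label);
        }
    }
}
```

A range such as 4 to 6 only makes sense if the database already holds the terms produced by the earlier options, since each step builds on the output of the previous one.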

Dependencies

These dependencies are described for documentation purposes only; no actions other than the ones listed in the instructions above are required. KiPar makes use of the tools listed below. The versions given are the ones used during the development of KiPar. Higher versions should generally work, but they have not been tested.

Tool | Version | URL
Java 2 | 1.6.0 | http://java.sun.com/javase/
Java WSDP | 2.0 | http://java.sun.com/webservices/
PostgreSQL | 8.1 | http://www.postgresql.org/
PDFBox | 0.7.3 | http://www.pdfbox.org/
JPedal | | http://www.jpedal.org/


To take advantage of web service access to the following resources, the given jar files need to be available to run KiPar.

Resource | URL | jar
KEGG | http://www.genome.ad.jp/kegg/ ; http://www.genome.jp/kegg/soap/ | keggapi.jar
ChEBI | http://www.ebi.ac.uk/chebi/ ; http://www.ebi.ac.uk/chebi/webServices.do | WSChebiJAX-WS-1.1.jar
SBO | http://www.ebi.ac.uk/sbo/ ; http://www.ebi.ac.uk/sbo/SBOWSLib/ws.html | SBOWSLib-20070110.jar
GO | http://www.geneontology.org/ ; http://www.ebi.ac.uk/ontology-lookup/WSDLDocumentation.do | ols-client.jar
UMLS | http://umlsks.nlm.nih.gov/ ; http://kswebp1.nlm.nih.gov/DocPortlet/html/dGuide/Guide.htm | kss-api-5.0.jar (for this link to work you need to log in as a registered user)
Entrez | http://www.ncbi.nlm.nih.gov/Entrez/ ; http://eutils.ncbi.nlm.nih.gov/entrez/query/static/esoap_help.html | Follow the instructions to create the entrez.jar file needed to access Entrez E-Utilities.


A local database needs to be installed and populated with a relevant subset of data from the PubChem databases. After downloading the specified files (unzip where necessary), create a new database and run the given SQL scripts to create the tables and load them with the data from the corresponding files.

Resource | URL | files | scripts
PubChem | http://pubchem.ncbi.nlm.nih.gov/ ; ftp://ftp.ncbi.nih.gov/pubchem/Compound/Extras | CID-SID ; CID-Synonym | create_pubchem_tables.sql ; load_pubchem_tables.sql


The following resources are used in KiPar, but no downloads or any other actions are required for that.

Resource | URL
SGD | http://www.yeastgenome.org/
CYGD | http://mips.gsf.de/genre/proj/yeast/
MeSH | http://www.nlm.nih.gov/mesh
PubMed | http://www.ncbi.nlm.nih.gov/PubMed/
PubMed Central | http://www.pubmedcentral.nih.gov/


More details on how the external resources are used in KiPar are given in resources.pdf.


Downloads

All relevant files have been archived into KiPar.zip. The links to some of the files given below are provided for illustration and do not need to be downloaded separately. The only exception is the JPedal library, which needs to be downloaded directly from http://www.jpedal.org/ due to its GPL licence. Once downloaded it should be placed into the jpedal subdirectory of the lib directory. Previous usage of this library was invalid. We apologize for the inconvenience.

Description of the content:

Folder | Description
class | Java classes implementing KiPar.
class/ajaxGO | Java classes for managing the XML data acquired from GO. These classes were generated automatically using JAXB (which is distributed with Java WSDP) based on the XML schema ajaxGO.xsd given in the schema folder.
class/config | Java classes used by configuration.java to manipulate the configuration file config.xml. These classes were generated automatically using JAXB (which is distributed with Java WSDP) based on the XML schema config.xsd given in the schema folder.
class/UMLS | Java classes for managing the XML data acquired from UMLS. These classes were generated automatically using JAXB (which is distributed with Java WSDP) based on the XML schema UMLS.xsd given in the schema folder.
class/user | Java classes used to access XML data in the user.xml file.
data | Input and output data can be found in the corresponding sub-folders.
lib | This folder contains external Java libraries used by KiPar.
pedro | This folder contains the Pedro files used to create a configuration interface for KiPar.
schema | This folder contains:
  • ajaxGO.xsd (an XML schema, which describes the structure of XML data obtained from GO)
  • config.xsd (an XML schema, which describes how the KiPar tool can be configured)
  • user.xsd (an XML schema, which describes how the KiPar tool can be configured by the user)
  • pubchem.sql (a set of CREATE TABLE definitions for a local version of PubChem)
  • kipar.sql (a set of CREATE TABLE definitions for a local database used by KiPar: schema diagram)
  • UMLS.xsd (an XML schema, which describes the structure of XML data obtained from UMLS)

File | Description
config.xml | A configuration of the tool to be used. This XML file conforms to the schema given in config.xsd. The annotations given in the XML Schema document explain the configurable parameters. This file is automatically generated from the XML files described below: default.xml and user.xml.
default.xml | This is the default version of the config.xml file (see above).
user.xml | A user configuration of the tool to be used. This XML file conforms to the schema given in user.xsd. The annotations given in the XML Schema document explain the configurable parameters.
RunConfiguration.bat | A batch file used to configure KiPar using Pedro.
RunKiPar.bat | A batch file used to run KiPar.
BibTex.java | Java class for generating a BibTeX entry for a citation from PubMed given by its PubMed ID (PMID).
Citation.java | Data structure for exchanging the citation details.
Configuration.java | Java class used to retrieve the configuration parameters from config.xml.
Entrez.java | Java class for using the Entrez Utilities, used here to access PubMed and PubMed Central.
EnzymeKEGG.java | Java class for accessing KEGG database(s) and other cross-referenced databases (SGD, CYGD, PubChem), used here to retrieve information about an enzyme, compounds acting as substrates/products of the catalysed reaction and the genes known to encode the enzyme.
ExpertClientUMLS.java | Java class for accessing UMLS knowledge sources, used here to retrieve synonyms of a given term.
GO.java | Java class for accessing GO, used here to retrieve all terms given by accession numbers (i.e. GO IDs).
KiPar.java | Java class that implements the basic modules of KiPar.
Pdf2text.java | Java class for extracting ASCII text from a PDF file by combining PDFBox and JPedal.
PubChemLocal.java | Java class for accessing a local PubChem database.
RunKiPar.java | The main Java class for running KiPar.
SBO.java | Java class for accessing SBO, used here to retrieve all terms given by accession numbers (i.e. SBO IDs).
TermUMLS.java | Java class for extracting synonyms of a given term from an XML file retrieved from UMLS.
User.java | Java class for generating config.xml from default.xml and user.xml. It is used in RunKiPar.bat immediately before running KiPar.
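The relationship between default.xml, user.xml and the generated config.xml can be pictured as an overlay of user settings on defaults. The sketch below is a simplification under stated assumptions: it uses plain key/value maps instead of XML, it assumes that user values take precedence over defaults, and the keys shown are hypothetical. User.java itself works on the XML files and may behave differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical overlay of user settings on defaults (not User.java).
final class ConfigMergeSketch {
    // Returns the defaults overlaid with the user settings; user values win on conflict.
    static Map<String, String> merge(Map<String, String> defaults, Map<String, String> user) {
        Map<String, String> merged = new LinkedHashMap<>(defaults);
        merged.putAll(user);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> defaults = new LinkedHashMap<>();
        defaults.put("start", "1"); // hypothetical keys and values
        defaults.put("end", "8");

        Map<String, String> user = Map.of("start", "4");

        System.out.println(merge(defaults, user));
    }
}
```

Keeping the defaults in a separate file means a user configuration only needs to state what differs, which is consistent with config.xml being regenerated immediately before each run.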




Publications

Irena Spasić, Evangelos Simeonidis, Hanan L Messiha, Norman W Paton and Douglas B Kell (2008) KiPar, a tool for systematic information retrieval regarding parameters for kinetic modelling of yeast metabolic pathways. Bioinformatics, submitted

Motivation: Quantitative models of metabolism (and of other bio-chemical processes) require knowledge of enzyme kinetic parameters. Many of these are buried in the literature, and manual search for the appropriate subset of papers is both time consuming and likely to lead to very partial results. We have developed a text mining approach that retrieves relevant scientific publications and annotates them with information required for the kinetic modelling of a given pathway.

Results: We developed an integrative approach, combining publicly available data and software resources, for the time- and cost-effective development of text mining tools for retrieval of information relevant to systems biology applications. To demonstrate the conceptual approach we implemented KiPar, a standalone Java application for the retrieval of textual documents likely to contain information relevant for kinetic modelling of a given metabolic pathway. During the document retrieval process, relevant semantic and lexical information is acquired from public data resources. The use of S. cerevisiae as the biological context of the pathways modelled influenced the choice and usage of these resources. The evaluation results show that KiPar can provide valuable support to those interested in kinetic modelling of metabolism by providing quick access to literature relevant for a particular pathway. The evaluation also points to the fact that full-text articles are a much richer source of information on kinetic parameters than are their abstracts. Thus, the greater availability of full papers will contribute substantially to the improvement of systems biology modelling.

Availability: Source code and documentation are available at: http://www.mcisb.org/resources/kipar/

Supplementary material:

Supplementary data | Format | Description
1 | jpg | screenshot of the Pedro user interface of KiPar
2 | jpg | concept-based indexing
3 | pdf | screenshots of the output of KiPar
4 | xls | evaluation for a pathway: glycolysis
5 | xls | evaluation for a pathway: pentose phosphate pathway
6 | xls | evaluation for a pathway: citrate cycle
7 | xls | sensitivity analysis for weights used in the scoring formula




People

Person | Role
Dr Irena Spasić | IS designed and implemented the text mining application for retrieval of kinetic parameters.
Dr Evangelos Simeonidis | ES does the mathematical modelling of yeast metabolism and he helped evaluate the text mining results.
Dr Hanan L Messiha | HLM does the experiments to determine the kinetic parameters needed for yeast modelling and she helped evaluate the text mining results.
Prof. Norman W Paton | NWP supervises the bioinformatics integration aspects.
Prof. Douglas B Kell | DBK supervises the systems biology studies.




Contact details

General contacts
Area | Person | Contact
text mining | Dr Irena Spasić | I.Spasic-AT-manchester.ac.uk
general | Prof. Douglas B Kell | DBK-AT-manchester.ac.uk
