CV expand is a text mining tool for automatic expansion of controlled vocabularies
as a practical alternative to tailor-made named entity recognition methods.
We describe a text mining (TM) method for efficient corpus-based term acquisition
as a way of rapidly expanding a set of controlled vocabularies (CVs) with the
terms used in the scientific literature. We adopted an integrative approach,
combining relatively generic software and data resources for time- and
cost-effective development of a text mining tool for expansion of CVs
across various domains, as a practical alternative to both manual term
collection and tailor-made named entity recognition methods.
A set of relevant tasks regarding CV term acquisition has been identified,
including information retrieval, term recognition and term filtering. The given
figure summarises the main steps taken in our TM approach to CV expansion.
First, the information retrieval module is used to gather documents relevant
for a given CV from the literature databases. Once a domain-specific corpus
of documents has been assembled, it is searched for potential terms unaccounted
for in the initial CV. Automatic term recognition is performed to extract terms
as domain-specific lexical units, i.e. the ones that frequently occur in the
corpus and bear special meaning in the domain. To reduce the number
of terms not directly related to a given CV, we filter out terms of types that typically
co-occur with CV terms but belong to sub-domains with more established CVs. These existing
CVs can be exploited to recognise such terms using a dictionary-based approach.
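
The term recognition and term filtering steps described above can be illustrated with a minimal sketch. It is not the actual CV expand implementation: simple n-gram frequency counting stands in for automatic term recognition (the description does not say which method the tool uses), the information retrieval step is assumed to have already produced the corpus, and the class name, method names and the minFrequency parameter are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class CvExpansionSketch {

    // Term recognition stand-in: count every sequence of 1..maxWords words
    // in the corpus. Sequences never cross punctuation boundaries.
    static Map<String, Integer> countCandidates(List<String> corpus, int maxWords) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String doc : corpus) {
            for (String segment : doc.toLowerCase(Locale.ROOT).split("[.,;:()]+")) {
                String[] tokens = segment.trim().split("\\s+");
                for (int i = 0; i < tokens.length; i++) {
                    StringBuilder gram = new StringBuilder();
                    for (int n = 0; n < maxWords && i + n < tokens.length; n++) {
                        if (tokens[i + n].isEmpty()) {
                            break; // empty segment, nothing to count
                        }
                        if (n > 0) {
                            gram.append(' ');
                        }
                        gram.append(tokens[i + n]);
                        counts.merge(gram.toString(), 1, Integer::sum);
                    }
                }
            }
        }
        return counts;
    }

    // Term filtering: keep frequent candidates that are neither already in the
    // initial CV nor found in the dictionary built from a sub-domain CV.
    static List<String> expand(List<String> corpus, Set<String> initialCv,
                               Set<String> subDomainCv, int minFrequency) {
        List<String> newTerms = new ArrayList<>();
        for (Map.Entry<String, Integer> e : countCandidates(corpus, 2).entrySet()) {
            String term = e.getKey();
            if (e.getValue() >= minFrequency
                    && !initialCv.contains(term)
                    && !subDomainCv.contains(term)) {
                newTerms.add(term);
            }
        }
        return newTerms; // alphabetical order, because counts is a TreeMap
    }
}
```

Calling expand with a small corpus, the initial CV and a sub-domain dictionary returns the candidate terms that survive both filters. A realistic setup would also need stop-word handling and a proper termhood measure, which this sketch omits.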
Operating system
OS Independent (Written in an interpreted language)
Programming language
Java
Database environment
PostgreSQL
Topics
Text mining, Bio-Informatics
User interface
Command-line
Prerequisites
To use CV expand, the tools listed below need to be installed first.
The versions given are the ones used during the development of CV expand.
Later versions should generally work, but they have not been tested.
All relevant files have been archived into CVexpand.zip.
Links to some of the files listed below are provided for illustration only;
these files do not need to be downloaded separately.
The results of two case studies described in the given publications. This folder
is provided for illustration only and is not required to run the tool.
config
Java classes used by configuration.java. These classes were generated
automatically using JAXB (which is distributed with Java WSDP) based on the XML schema
config.xsd given in the schema folder.
data
Input and output data can be found in the corresponding sub-folders.
schema
This folder contains config.xsd (XML schema, which describes how the tool can be configured)
and tables.sql (SQL schema of the relational database, which should be installed locally).
A configuration of the tool to be used. This XML file conforms to the schema given in
config.xsd. The annotations given in the XML Schema document explain the configurable parameters.
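
Before running the tool, it can be useful to check that an edited configuration file still conforms to the schema. The sketch below shows one way to do this with the standard javax.xml.validation API rather than JAXB. The inline schema and the element names config and minFrequency are hypothetical placeholders, not the contents of the real config.xsd.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class ConfigValidationSketch {

    // Hypothetical stand-in for config.xsd: a single positive integer parameter.
    static final String HYPOTHETICAL_XSD =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xs:element name=\"config\"><xs:complexType><xs:sequence>"
        + "<xs:element name=\"minFrequency\" type=\"xs:positiveInteger\"/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:schema>";

    // Returns true if the XML document validates against the XSD, false otherwise.
    static boolean conforms(String xml, String xsd) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
            Validator validator = schema.newValidator();
            // With no custom error handler set, validation errors are thrown
            // as SAXException.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }
}
```

To check real files instead of inline strings, the two StreamSource objects can be built from java.io.File instances pointing at the configuration file and at config.xsd in the schema folder.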
Background: Many bioinformatics applications rely on controlled vocabularies
or ontologies to consistently interpret and seamlessly integrate information scattered across
public resources. Experimental data sets from metabolomics studies need to be integrated with
one another, but also with data produced by other types of omics studies in the spirit of
systems biology, hence the pressing need for vocabularies and ontologies in metabolomics.
However, it is time-consuming and non-trivial to construct these resources manually.
Results: We describe a methodology for rapid development of controlled
vocabularies, a study originally motivated by the need for vocabularies describing
metabolomics technologies. We present case studies involving two controlled vocabularies
(for nuclear magnetic resonance spectroscopy and gas chromatography) whose development
is currently underway as part of the Metabolomics Standards Initiative. The initial
vocabularies were compiled manually, providing a total of 243 and 152 terms, respectively. A total
of 5,699 and 2,612 new terms, respectively, were acquired automatically from the literature. The analysis
of the results showed that full-text articles (especially the Materials and Methods sections)
are the major source of technology-specific terms as opposed to paper abstracts.
Conclusions: We suggest a text mining method for efficient corpus-based
term acquisition as a way of rapidly expanding a set of controlled vocabularies with the
terms used in the scientific literature. We adopted an integrative approach, combining
relatively generic software and data resources for time- and cost-effective development
of a text mining tool for expansion of controlled vocabularies across various domains,
as a practical alternative to both manual term collection and tailor-made named entity
recognition methods.
Bioinformatics applications rely heavily on controlled vocabularies and ontologies to consistently interpret and seamlessly integrate information scattered across disparate public resources. Experimental data from metabolomics studies need to be integrated with one another, but also with data produced by other types of omics studies in the spirit of systems biology, hence the pressing need for vocabularies and ontologies in metabolomics. Here we describe the development of controlled vocabularies for metabolomics investigations. Manual term acquisition is time-consuming, labour-intensive and error-prone, especially in a rapidly developing domain such as metabolomics, where new analytical techniques emerge regularly and domain experts are often compelled to use non-standardised terms. We suggest a text mining method for efficient corpus-based term acquisition as a way of rapidly expanding a set of controlled vocabularies with the terms used in the scientific literature.