Markus J. Herrgård, Neil Swainston, Paul Dobson, Warwick B. Dunn, K. Yalçin Arga, Mikko Arvas,
Nils Blüthgen, Simon Borger, Roeland Costenoble, Matthias Heinemann, Michael Hucka, Peter Li,
Wolfram Liebermeister, Monica L. Mo, Ana Paula Oliveira, Dina Petranovic, Stephen Pettifer,
Evangelos Simeonidis, Kieran Smallbone, Irena Spasić, Dieter Weichart,
Roger Brent, David S. Broomhead, Hans V. Westerhoff, Betül Kirdar, Merja Penttilä, Edda Klipp,
Bernhard Ø. Palsson, Uwe Sauer, Stephen G. Oliver, Pedro Mendes, Jens Nielsen and Douglas B. Kell (2008)
A consensus yeast metabolic network reconstruction obtained from a community approach to systems biology.
Nature Biotechnology, Vol. 26, No. 10, pp. 1155-1160
[DOI: 10.1038/nbt1492]
Genomic data allow the large-scale manual or semi-automated assembly of
metabolic network reconstructions, which provide highly curated organism-specific
knowledge bases. Although several genome-scale network reconstructions describe
Saccharomyces cerevisiae metabolism, they differ in scope and content,
and use different terminologies to describe the same chemical entities. This makes
comparisons between them difficult and underscores the desirability of a consolidated
metabolic network that collects and formalizes the 'community knowledge' of yeast
metabolism. We describe how we have produced a consensus metabolic network
reconstruction for S. cerevisiae. In drafting it, we placed special
emphasis on referencing molecules to persistent databases or using
database-independent forms, such as SMILES or InChI strings, as this
permits their chemical structure to be represented unambiguously and
in a manner that permits automated reasoning. The reconstruction is
readily available via a publicly accessible database and in the
Systems Biology Markup Language
(http://www.comp-sys-bio.org/yeastnet).
It can be maintained as a resource that serves as a common denominator for
studying the systems biology of yeast. Similar strategies should benefit
communities studying genome-scale metabolic networks of other organisms.
Background:
Many bioinformatics applications rely on controlled vocabularies or ontologies to consistently
interpret and seamlessly integrate information scattered across public resources. Experimental
data sets from metabolomics studies need to be integrated with one another, but also with data
produced by other types of omics studies in the spirit of systems biology, hence the pressing
need for vocabularies and ontologies in metabolomics. However, it is time-consuming and
non-trivial to construct these resources manually.
Results:
We describe a methodology for rapid development of controlled vocabularies, a study originally
motivated by the need for vocabularies describing metabolomics technologies. We present case
studies involving two controlled vocabularies (for nuclear magnetic resonance spectroscopy and
gas chromatography) whose development is currently underway as part of the Metabolomics Standards
Initiative. The initial vocabularies were compiled manually, providing a total of 243 and 152 terms.
A total of 5,699 and 2,612 new terms were acquired automatically from the literature. The analysis
of the results showed that full-text articles (especially the Materials and Methods sections) are
the major source of technology-specific terms as opposed to paper abstracts.
Conclusions:
We suggest a text mining method for efficient corpus-based term acquisition as a way of rapidly
expanding a set of controlled vocabularies with the terms used in the scientific literature. We
adopted an integrative approach, combining relatively generic software and data resources for
time- and cost-effective development of a text mining tool for expansion of controlled vocabularies
across various domains, as a practical alternative to both manual term collection and tailor-made
named entity recognition methods.
Background: The stability of mammalian serum and urine in large metabolomic investigations is essential for accurate, valid and reproducible studies. The stability of mammalian serum and urine, either processed immediately by freezing at -80°C or stored at 4°C for 24 hours before being frozen, was compared in a pilot metabolomic study of samples from 40 separate healthy volunteers.
Methods: Metabolic profiling with GC-TOF-MS was performed for serum and urine samples collected from 40 volunteers and stored at -80°C or 4°C for 24 hours before being frozen. Subsequent Kruskal-Wallis and Principal Components Analysis methods were used to assess whether metabolomic differences were detected between samples stored at 4°C for 0 or 24 hours.
Results: More than 700 unique metabolite peaks were detected, with over 200 metabolite peaks detected in any one sample. Principal Components Analysis (PCA) of serum and urine data showed that the variance associated with the replicate analysis per sample (analytical variance) was of the same magnitude as the variance observed between samples stored at 4°C or -80°C for 24 hours (biological variance). From a functional point of view the metabolomic composition of samples did not change in a statistically significant manner when stored under the two different conditions.
Conclusions: Based on this small pilot study, the UK Biobank sampling, transport and fractionation protocols are considered suitable to provide samples which can produce scientifically robust and valid data in metabolomic studies.
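As an illustration of the Kruskal-Wallis comparison mentioned in the Methods, the following minimal sketch computes the H statistic for a single metabolite peak measured under two storage conditions. The function names and the example values are hypothetical and are not taken from the study. The sketch omits the tie correction and the p-value lookup, both of which a statistics library would normally provide.

```python
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks for values; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic (no tie correction) for two or more groups."""
    pooled = list(chain.from_iterable(groups))
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    weighted_rank_sums = 0.0
    start = 0
    for group in groups:
        size = len(group)
        rank_sum = sum(ranks[start:start + size])
        weighted_rank_sums += rank_sum ** 2 / size
        start += size
    return 12.0 / (n_total * (n_total + 1)) * weighted_rank_sums - 3 * (n_total + 1)


# Hypothetical peak intensities: frozen immediately vs. held at 4°C for 24 hours.
frozen_immediately = [10.1, 9.8, 10.4, 10.0]
stored_24_hours = [10.2, 9.9, 10.3, 10.5]
h_statistic = kruskal_wallis_h(frozen_immediately, stored_24_hours)
```

A small H value indicates that the ranks of the two groups are well mixed, which is the pattern one would expect if storage had no detectable effect on that peak.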
There is intense interest in the identification of novel biomarkers which
improve the diagnosis of heart failure. Serum samples from 52 patients with
systolic heart failure (EF<40% plus signs and symptoms of failure) and 57
controls were analyzed by gas chromatography - time of flight - mass
spectrometry and the raw data reduced to 272 statistically robust
metabolite peaks. 38 peaks showed a significant difference between
case and control (p<5×10⁻⁵). Two such metabolites
were pseudouridine, a modified nucleotide present in t- and rRNA and a
marker of cell turnover, as well as the tricarboxylic acid cycle intermediate
2-oxoglutarate. Three further compounds were also excellent
discriminators between patients and controls: 2-hydroxy, 2-methylpropanoic
acid, erythritol and 2,4,6-trihydroxypyrimidine. These findings demonstrate
the power of data-driven metabolomics approaches to identify such markers
of disease.
In this article we present the activities of the Ontology Working Group (OWG)
under the Metabolomics Standards Initiative (MSI) umbrella. Our endeavour aims
to synergise the work of several communities, where independent activities are
underway to develop terminologies and databases for metabolomics investigations.
We have joined forces to rise to the challenges associated with interpreting and
integrating experimental process and data across disparate sources (software and
databases, private and public). Our focus is to support the activities of the
other MSI working groups by developing a common semantic framework to enable
metabolomics-user communities to consistently annotate the experimental process
and to enable meaningful exchange of datasets. Our work is accessible via a public
webpage and a draft ontology has been posted under the Open Biological Ontology
umbrella. At the very outset, we have agreed to minimize duplications across
omics domains through extensive liaisons with other communities under the OBO
Foundry. This is work in progress and we welcome new participants willing to
volunteer their time and expertise to this open effort.
Background:
The genome sequencing projects have shown our limited knowledge regarding gene function, e.g. S. cerevisiae has 5-6,000 genes of which nearly 1,000 have an uncertain function. Their gross influence on the behaviour of the cell can be observed using large-scale metabolomic studies. The metabolomic data produced need to be structured and annotated in a machine-usable form to facilitate the exploration of the hidden links between the genes and their functions.
Description:
MeMo is a formal model for representing metabolomic data and the associated metadata. Two predominant platforms (SQL and XML) are used to encode the model. MeMo has been implemented as a relational database using a hybrid approach combining the advantages of the two technologies. It represents a practical solution for handling the sheer volume and complexity of the metabolomic data effectively and efficiently. The MeMo model and the associated software are available at http://dbkgroup.org/memo/.
Conclusions:
The maturity of relational database technology is used to support efficient data processing. The scalability and self-descriptiveness of XML are used to simplify the relational schema and facilitate the extensibility of the model necessitated by the creation of new experimental techniques. Special consideration is given to data integration issues as part of the systems biology agenda. MeMo has been physically integrated and cross-linked to related metabolomic and genomic databases. Semantic integration with other relevant databases has been supported through ontological annotation. Compatibility with other data formats is supported by automatic conversion.
A report on the third Genomes to Systems consortium conference, which portrayed
the breadth of the post-genome sciences including Genomics, Transcriptomics,
Proteomics, Metabolomics, Informatics, and integrative Systems Biology.
The volume of biomedical literature is increasing at such a rate that it is becoming difficult to locate, retrieve and manage the reported information without text mining, which aims to automatically distill information, extract facts, discover implicit links and generate hypotheses relevant to user needs. Ontologies, as conceptual models, provide the necessary framework for semantic representation of textual information. The principal link between text and an ontology is terminology, which maps terms to domain-specific concepts. In this article, we summarize different approaches in which ontologies have been used for text mining applications in biomedicine.
One element of classical systems analysis treats a system as a black or grey box, the inner structure and behaviour of which can be analysed and modelled by varying an internal or external condition, probing it from outside and studying the effect of the variation on the external observables. The result is an understanding of the inner make-up and workings of the system. The equivalent of this in biology is to observe what a cell or system excretes under controlled conditions - the 'metabolic footprint' or exometabolome - as this is readily and accurately measurable. Here, we review the principles, experimental approaches and scientific outcomes that have been obtained with this useful and convenient strategy.
Motivation: The sheer volume of textually described biomedical knowledge creates a need for natural language processing (NLP) applications in order to allow flexible and efficient access to relevant information. Specialised semantic networks (such as biomedical ontologies, terminologies or semantic lexicons) can significantly enhance these applications by supplying the necessary terminological information in a machine-readable form. Due to the explosive growth of bio-literature, new terms (representing newly identified concepts or variations of the existing terms) may not be explicitly described within the network and hence cannot be fully exploited by NLP applications. Linguistic and statistical clues can be used to extract many new terms from free text. The extracted terms still need to be correctly positioned relative to other terms in the network. Classification as a means of semantic typing represents the first step in updating a semantic network with new terms.
Results: The MaSTerClass system implements the case-based reasoning methodology for the classification of biomedical terms.
Availability: MaSTerClass is available at http://www.cbr-masterclass.org. It is distributed under an open source licence for educational and research purposes. The software requires Java, JWSDP, Ant, MySQL and X-hive to be installed and licences obtained separately where needed.
Metabolomics, like other omics methods, produces huge datasets of biological variables, along with the necessary metadata. However, regardless of the form in which these are produced they are merely the ground substance for assisting us in answering biological questions. In this short tutorial review and position paper we seek to set out some of the elements of 'best practice' in the optimal acquisition of such data, and in the means by which they may be turned into reliable knowledge. Many of these steps involve the solution of what amount to combinatorial optimization problems, and methods developed for these, especially those based on evolutionary computing, are proving valuable. This is done in terms of a 'pipeline' that goes from the design of good experiments, through instrumental optimization, data storage and manipulation, the chemometric data processing methods in common use, and the necessary means of validation and cross-validation for giving conclusions that are credible and likely to be robust when applied in comparable circumstances and to samples not used in their generation.
In this paper, we present an approach to term classification based on
verb selectional patterns (VSPs), where such a pattern is defined as a
set of semantic classes that could be used in combination with a given
domain-specific verb. VSPs have been automatically learnt based on the
information found in a corpus and an ontology in the biomedical domain.
Prior to the learning phase, the corpus is terminologically processed:
term recognition is performed by both looking up the dictionary of terms
listed in the ontology and applying the C/NC-value method for on-the-fly
term extraction. Subsequently, domain-specific verbs are automatically
identified in the corpus based on the frequency of occurrence and the
frequency of their co-occurrence with terms. VSPs are then learnt
automatically for these verbs. Two machine learning approaches are
presented. The first approach has been implemented as an iterative
generalisation procedure based on a partial order relation induced by
the domain-specific ontology. The second approach exploits the idea of
genetic algorithms. Once the VSPs are acquired, they can be used to
classify newly recognised terms co-occurring with domain-specific verbs.
Given a term, the most frequently co-occurring domain-specific verb is
selected. Its VSP is used to constrain the search space by focusing on
potential classes of the given term. A nearest-neighbour approach is then
applied to select a class from the constrained space of candidate classes.
The most similar candidate class is predicted for the given term. The
similarity measure used for this purpose combines contextual, lexical,
and syntactic properties of terms.
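The classification step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `classify_term`, the shape of the `vsps` mapping (verb to set of candidate classes) and the `class_similarity` callback are all hypothetical, and the similarity scores in the usage example are made up.

```python
from collections import Counter


def classify_term(term, cooccurring_verbs, vsps, class_similarity):
    """Pick a class for term using the VSP of its most frequent verb.

    cooccurring_verbs: list of domain-specific verbs observed with the term.
    vsps: hypothetical mapping from a verb to the set of classes in its pattern.
    class_similarity: hypothetical callback scoring (term, class) pairs.
    """
    # Step 1: select the most frequently co-occurring domain-specific verb.
    verb, _count = Counter(cooccurring_verbs).most_common(1)[0]
    # Step 2: the verb's selectional pattern constrains the candidate classes.
    candidates = sorted(vsps[verb])  # sorted for a deterministic tie-break
    # Step 3: nearest-neighbour choice within the constrained space.
    return max(candidates, key=lambda cls: class_similarity(term, cls))


# Usage with made-up data.
vsps = {"bind": {"protein", "dna"}, "activate": {"gene"}}
scores = {("p53", "protein"): 0.8, ("p53", "dna"): 0.3}
predicted = classify_term(
    "p53",
    ["bind", "activate", "bind"],
    vsps,
    lambda term, cls: scores.get((term, cls), 0.0),
)
```

Restricting the search to the classes licensed by the verb keeps the nearest-neighbour comparison small, which is the role the abstract assigns to the VSP.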
In this article we present an approach to the automatic discovery of term similarities, which may serve as a basis for a number of term-oriented knowledge mining tasks. The method for term comparison combines internal (lexical similarity) and two types of external criteria (syntactic and contextual similarities). Lexical similarity is based on sharing lexical constituents (i.e. term heads and modifiers). Syntactic similarity relies on a set of specific lexico-syntactic co-occurrence patterns indicating the parallel usage of terms (e.g. within an enumeration or within a term coordination/conjunction structure), while contextual similarity is based on the usage of terms in similar contexts. Such contexts are automatically identified by a pattern mining approach, and a procedure is proposed to assess their domain-specific and terminological relevance. Although automatically collected, these patterns are domain dependent and identify contexts in which terms are used. Different types of similarities are combined into a hybrid similarity measure, which can be tuned for a specific domain by learning optimal weights for individual similarities. The suggested similarity measure has been tested in the domain of biomedicine, and some experiments are presented.
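To make the notion of lexical similarity concrete, here is a minimal sketch that scores two terms by their shared head and shared modifiers. The scoring formula (a Dice coefficient over modifiers, blended with a head-match score) and the assumption that the head is the last word of the term are illustrative choices, not the measure defined in the article. The names `constituents`, `lexical_similarity` and `head_weight` are hypothetical.

```python
def constituents(term):
    """Split a non-empty term into (head, modifiers).

    Assumption: the head is the last word, as in simple English noun phrases.
    """
    words = term.lower().split()
    if not words:
        raise ValueError("term must contain at least one word")
    return words[-1], set(words[:-1])


def lexical_similarity(term_a, term_b, head_weight=0.5):
    """Blend a head-match score with a Dice coefficient over shared modifiers."""
    head_a, mods_a = constituents(term_a)
    head_b, mods_b = constituents(term_b)
    head_score = 1.0 if head_a == head_b else 0.0
    if mods_a or mods_b:
        mod_score = 2 * len(mods_a & mods_b) / (len(mods_a) + len(mods_b))
    else:
        # Neither term has modifiers, so only the heads can be compared.
        mod_score = head_score
    return head_weight * head_score + (1 - head_weight) * mod_score


similarity = lexical_similarity("nuclear receptor", "orphan nuclear receptor")
```

In a hybrid measure of the kind described above, a score like this would be one of several inputs, alongside syntactic and contextual similarity.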
In this paper we present an overview of an integrated framework for terminology-driven mining from biomedical literature. The framework integrates the following components: automatic term recognition, term variation handling, acronym acquisition, automatic discovery of term similarities and term clustering. The term variant recognition is incorporated into the terminology recognition process by taking into account orthographical, morphological, syntactic, lexico-semantic and pragmatic term variations. In particular, we address acronyms as a common way of introducing term variants in biomedical papers. Term clustering is based on the automatic discovery of term similarities. We use a hybrid similarity measure, where terms are compared by using both internal and external evidence. The measure combines lexical, syntactical and contextual similarity. Experiments on terminology recognition and structuring performed on a corpus of biomedical abstracts are presented.
In this paper we describe TIMS, an integrated knowledge management system for the domain of molecular biology and biomedicine, in which terminology-driven literature mining, knowledge acquisition, knowledge integration, and XML-based knowledge retrieval are combined using tag information management and ontology inference. The system integrates automatic terminology acquisition, term variation management, hierarchical term clustering, tag-based information extraction, and ontology-based query expansion. TIMS supports introducing and combining different types of tags (linguistic and domain-specific, manual and automatic). Tag-based interval operations and a query language are introduced in order to facilitate knowledge acquisition and retrieval from XML documents. Through knowledge acquisition examples, we illustrate the way in which literature mining techniques can be utilised for knowledge discovery from documents.
We present a measure of contextual similarity for biomedical terms.
The contextual features need to be explored, because newly coined
terms are not explicitly described and efficiently stored in
biomedical ontologies and their inner features (e.g. morphologic
or orthographic) do not always provide sufficient information
about the properties of the underlying concepts. The context of
each term can be represented as a sequence of syntactic elements
annotated with biomedical information retrieved from an ontology.
The sequences of contextual elements may be matched approximately
by edit distance defined as the minimal cost incurred by the
changes (including insertion, deletion and replacement) needed to
transform one sequence into the other. Our approach augments the
traditional concept of edit distance by elements of linguistic and
biomedical knowledge, which together provide flexible selection of
contextual features and their comparison.
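The edit distance over sequences of contextual elements can be sketched with the standard dynamic-programming recurrence and a pluggable replacement cost. The representation of an element as a (syntactic category, semantic class) pair and the cost values in `element_cost` are assumptions made for illustration; the abstract does not specify them.

```python
def edit_distance(seq_a, seq_b, substitution_cost, indel_cost=1.0):
    """Minimal total cost of insertions, deletions and replacements
    needed to transform seq_a into seq_b."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * indel_cost  # delete every element of the prefix of seq_a
    for j in range(1, cols):
        dist[0][j] = j * indel_cost  # insert every element of the prefix of seq_b
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + indel_cost,  # deletion
                dist[i][j - 1] + indel_cost,  # insertion
                dist[i - 1][j - 1]
                + substitution_cost(seq_a[i - 1], seq_b[j - 1]),  # replacement
            )
    return dist[-1][-1]


def element_cost(a, b):
    """Hypothetical replacement cost for (syntactic_category, semantic_class) pairs."""
    if a == b:
        return 0.0
    if a[0] == b[0]:
        return 0.5  # same syntactic category, different semantic class
    return 1.0


context_a = [("V", None), ("NP", "protein")]
context_b = [("V", None), ("NP", "dna")]
distance = edit_distance(context_a, context_b, element_cost)
```

Passing the replacement cost as a function is one way to let linguistic and biomedical knowledge influence the comparison without changing the recurrence itself.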
Goran Nenadić,
Irena Spasić and
Sophia Ananiadou
(2005)
Mining biomedical abstracts: what is in a term?
In K.Y. Su et al. (Eds.): Natural Language Processing - IJCNLP 2004.
LNAI 3248, Springer Verlag, pp. 797-806
In this paper we present a study of the usage of terminology in biomedical literature, with the main aim to indicate phenomena that can be helpful for automatic term recognition in the domain. Our comparative analysis is based on the terminology used in the Genia corpus. We analyse the usage of ordinary biomedical terms as well as their variants (namely inflectional and orthographic alternatives, terms with prepositions, coordinated terms, etc.), showing the variability and dynamic nature of terms used in biomedical abstracts. Term coordination and terms containing prepositions are analysed in detail. We show that there is a discrepancy between terms used in literature and terms listed in controlled dictionaries. We also evaluate the effectiveness of incorporating different types of term variation into an automatic term recognition system.
Irena Spasić,
Goran Nenadić and
Sophia Ananiadou
(2004)
Learning to classify biomedical terms through literature mining and genetic algorithms.
In Z.R. Yang et al. (Eds.): Intelligent Data Engineering and Automated Learning - IDEAL 2004.
LNCS 3177, Springer Verlag, pp. 345-351
We present an approach to classification of biomedical terms based on the information acquired automatically from the corpus of relevant literature. The learning phase consists of two stages: acquisition of terminologically relevant contextual patterns (CPs) and selection of classes that apply to terms used with these patterns. CPs represent a generalisation of similar term contexts in the form of regular expressions containing lexical, syntactic and terminological information. The most probable classes for the training terms co-occurring with the statistically relevant CP are learned by a genetic algorithm. Term classification is based on the learnt results. First, each term is associated with the most frequently co-occurring CP. Classes attached to such CP are initially suggested as the term's potential classes. Then, the term is finally mapped to the most similar suggested class.
Goran Nenadić,
Irena Spasić and
Sophia Ananiadou
(2003)
Reducing lexical ambiguity in Serbo-Croatian by using genetic algorithms.
In P. Kosta et al. (Eds.): Investigations into Formal Slavic Linguistics.
Linguistik International, Peter Lang, Frankfurt, pp. 287-298
This paper presents an approach to acquisition of some lexical and grammatical constraints from large corpora using genetic algorithms. The main aim is to use these constraints to automatically define local grammars that can be used to reduce lexical ambiguity usually found in an initially tagged text. A genetic algorithm for computation of the minimal representation of grammatical features of textual constituents is suggested. The algorithm incorporates two types of genes, dominant and recessive, which are specific for the features that are analysed. The resulting genetic structure describes the constraints that have to be fulfilled in order to form a correct utterance. As a case study, the suggested algorithm is applied to contexts of prepositional phrases, and features of corresponding noun phrases are obtained. The results obtained coincide with (theoretical) grammars that define the constraints for such noun phrases.
Irena Spasić,
Goran Nenadić,
Kostas Manios and
Sophia Ananiadou
(2002)
Supervised learning of term similarities.
In Hujun Yin et al. (Eds.): Intelligent Data Engineering and Automated Learning - IDEAL 2002.
LNCS 2412, Springer Verlag, pp. 429-434
In this paper we present a method for the automatic discovery
and tuning of term similarities. The method is based on the
automatic extraction of significant patterns in which terms
tend to appear. Besides that, we use lexical and functional
similarities between terms to define a hybrid similarity
measure as a linear combination of the three similarities.
We then present a genetic algorithm approach to supervised
learning of parameters that are used in this linear combination.
We used a domain specific ontology to evaluate the generated
similarity measures and set the direction of their convergence.
The approach has been tested and evaluated in the domain of
molecular biology.
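The hybrid measure described above, a linear combination of three similarities, can be written down in a few lines. The sketch below only shows the combination step; the function name, the normalisation of the weights and the example values are assumptions, and the genetic algorithm that learns the weights is not shown.

```python
def hybrid_similarity(contextual, lexical, functional, weights=(1.0, 1.0, 1.0)):
    """Linear combination of three similarity scores.

    The weights are normalised to sum to 1 so that the result stays in the
    same range as the inputs (an assumption made for this sketch).
    """
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    w_contextual, w_lexical, w_functional = weights
    combined = (
        w_contextual * contextual
        + w_lexical * lexical
        + w_functional * functional
    )
    return combined / total


# Made-up scores for one pair of terms, with contextual similarity weighted double.
score = hybrid_similarity(0.2, 0.8, 0.5, weights=(2.0, 1.0, 1.0))
```

A supervised learner, such as the genetic algorithm mentioned in the abstract, would search over the `weights` tuple and keep the setting that agrees best with the reference ontology.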
Goran Nenadić,
Irena Spasić and
Sophia Ananiadou
(2002)
Term clustering using a corpus-based similarity measure.
In P. Sojka et al. (Eds.): Text, Speech and Dialogue - TSD 2002.
LNAI 2448, Springer Verlag, pp. 151-154
In this paper we present a method for automatic term clustering.
The method uses a hybrid similarity measure to cluster terms
automatically extracted from a corpus by applying the C/NC-value
method. The measure comprises contextual, functional and lexical
similarity, and it is used to instantiate the cell values in a
similarity matrix. The clustering algorithm uses either the nearest
neighbour or Ward's method to calculate the distance between
clusters. The approach has been tested and evaluated in the domain
of molecular biology and the results are presented.
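As a minimal sketch of the nearest neighbour variant, the following code performs agglomerative clustering in which the link between two clusters is the similarity of their closest members. The stopping threshold, the function name and the toy similarity values are assumptions introduced for illustration; Ward's method is not shown.

```python
def nearest_neighbour_clustering(terms, similarity, threshold):
    """Agglomerative clustering with nearest neighbour (single) linkage.

    similarity: callback returning the similarity of two terms.
    threshold: hypothetical stopping point; merging ends once the best
    remaining link between clusters falls below it.
    """
    clusters = [[term] for term in terms]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Link strength is the similarity of the closest pair of members.
                link = max(
                    similarity(a, b) for a in clusters[i] for b in clusters[j]
                )
                if best is None or link > best[0]:
                    best = (link, i, j)
        if best[0] < threshold:
            break
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


# Toy similarity matrix stored as a dictionary keyed by unordered term pairs.
matrix = {
    frozenset(("a", "b")): 0.9,
    frozenset(("a", "c")): 0.2,
    frozenset(("b", "c")): 0.3,
}
clusters = nearest_neighbour_clustering(
    ["a", "b", "c"], lambda x, y: matrix[frozenset((x, y))], threshold=0.5
)
```

This brute-force version recomputes every link on each pass, which is adequate for small term sets; a clustering library would be the usual choice for larger ones.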
Irena Spasić and
Gordana Pavlović-Lažetić
(2001)
Syntactic structures in a sublanguage of Serbian for querying relational databases.
In G. Zybatow et al. (Eds.): Current Issues in Formal Slavic Linguistics.
Peter Lang, Frankfurt/Main, pp. 478-488
This paper deals with syntactic structures identified in a sublanguage of Serbian for querying relational databases. Three levels of syntactic description of the sublanguage are defined: word, syntagmatic, and sentence levels. An algorithm for complete syntactic analysis of a Serbian language query over relational database and its translation into a formal SQL query is presented. An example of partial parsing and translation is discussed.
Goran Nenadić and
Irena Spasić
(2000)
The recognition and acquisition of compound names from corpora.
In D. Christodoulakis (Ed.): Natural Language Processing - NLP 2000.
LNAI 1835, Springer Verlag, pp. 38-48
In this paper we present an approach to the acquisition of some classes
of compound words from large corpora, as well as a method for semi-automatic
generation of appropriate linguistic models that can be further used for
compound word recognition and for completion of compound word dictionaries.
The approach is intended for a highly inflective language such as Serbo-Croatian.
Generated linguistic models are represented by local grammars.
Goran Nenadić and
Irena Spasić
(1999)
The acquisition of some lexical constraints from corpora.
In V. Matousek et al. (Eds.): Text, Speech and Dialogue - TSD 1999.
LNAI 1692, Springer Verlag, pp. 115-120
This paper presents an approach to acquisition of some lexical and
grammatical constraints from large corpora. Constraints that are
discussed are related to grammatical features of a preposition and
the corresponding noun phrase that constitute a prepositional phrase.
The approach is based on the extraction of a textual environment of
a preposition from a corpus, which is then tagged using the system
of electronic dictionaries. An algorithm for computation of a
kind of minimal representation of grammatical features associated
with the corresponding noun phrases is suggested. The resulting set of
features describes the constraints that a noun phrase has to fulfil in
order to form a correct prepositional phrase with a given preposition.
This set can be checked against other corpora.
Bioinformatics applications heavily rely on controlled vocabularies and ontologies to consistently interpret and seamlessly integrate information scattered across disparate public resources. Experimental data from metabolomics studies need to be integrated with one another, but also with data produced by other types of omics studies in the spirit of systems biology, hence the pressing need for vocabularies and ontologies in metabolomics. Here we describe the development of controlled vocabularies for metabolomics investigations. Manual term acquisition approaches are time-consuming, labour-intensive and error-prone, especially in a rapidly developing domain such as metabolomics, where new analytical techniques emerge regularly so that the domain experts are often compelled to use non-standardised terms. We suggest a text mining method for efficient corpus-based term acquisition as a way of rapidly expanding a set of controlled vocabularies with the terms used in the scientific literature.
In this paper we discuss the performance of a text-based classification approach by comparing different types of features. We consider the automatic classification of gene names from the molecular biology literature, by using a support-vector machine method. Classification features range from words, lemmas and stems, to automatically extracted terms. Also, simple co-occurrences of genes within documents are considered. The preliminary experiments performed on a set of 3,000 S. cerevisiae gene names and 53,000 Medline abstracts have shown that using domain-specific terms can improve the performance compared to the standard bag-of-words approach, in particular for genes classified with higher confidence, and for under-represented classes.
In this paper we present an approach to term classification based on verb complementation patterns. The complementation patterns have been automatically learnt by combining information found in a corpus and an ontology, both belonging to the biomedical domain. The learning process is unsupervised and has been implemented as an iterative reasoning procedure based on a partial order relation induced by the domain-specific ontology. First, term recognition was performed by both looking up the dictionary of terms listed in the ontology and applying the C/NC-value method. Subsequently, domain-specific verbs were automatically identified in the corpus. Finally, the classes of terms typically selected as arguments for the considered verbs were induced from the corpus and the ontology. This information was used to classify newly recognised terms. The precision of the classification method reached 64%.
Goran Nenadić, Irena Spasić and Sophia Ananiadou (2003) Morpho-syntactic clues for terminological processing in Serbian, in Proceedings of EACL Workshop on Morphological Processing of Slavic Languages, Budapest, Hungary, pp. 79-86
In this paper we discuss morpho-syntactic clues that can be used to facilitate terminological processing in Serbian. A method (called srCe) for automatic extraction of multiword terms is presented. The approach incorporates a set of generic morpho-syntactic filters for recognition of term candidates, a method for conflation of morphological variants and a module for foreign word recognition. Morpho-syntactic filters describe general term formation patterns, and are implemented as generic regular expressions. The inner structure together with the agreements within term candidates are used as clues to discover the boundaries of nested terms. The results of the terminological processing of a textbook corpus in the domains of mathematics and computer science are presented.
In this paper we describe the X-TRACT workbench, which enables efficient term-based querying against a domain-specific literature corpus. Its main aim is to aid domain specialists in locating and extracting new knowledge from scientific literature corpora. Before querying, a corpus is automatically terminologically analysed by the ATRACT system, which performs terminology recognition based on the C/NC-value method enhanced by incorporation of term variation handling. The results of terminology processing are annotated in XML, and the produced XML documents are stored in an XML-native database. All corpus retrieval operations are performed against this database using an XML query language. We illustrate the way in which the X-TRACT workbench can be utilised for knowledge discovery, literature mining and conceptual information extraction.
Goran Nenadić, Irena Spasić and Sophia Ananiadou (2003) Terminology-driven mining of biomedical literature, in Proceedings of 18th Annual ACM Symposium on Applied Computing, Melbourne, Florida, USA
Motivation: With an overwhelming amount of textual information in molecular biology and biomedicine, there is a need for effective literature mining techniques that can help biologists to gather and make use of the knowledge encoded in text documents. Although the knowledge is organised around sets of domain-specific terms, few literature mining systems incorporate deep and dynamic terminology processing.
Results: In this paper, we present an overview of an integrated framework for terminology-driven mining from biomedical literature. The framework integrates the following components: automatic term recognition, term variation handling, acronym acquisition, automatic discovery of term similarities and term clustering. The term variant recognition is incorporated into the terminology recognition process by taking into account orthographical, morphological, syntactic, lexico-semantic and pragmatic term variations. In particular, we address acronyms as a common way of introducing term variants in biomedical papers. Term clustering is based on the automatic discovery of term similarities. We use a hybrid similarity measure, where terms are compared by using both internal and external evidence. The measure combines lexical, syntactical and contextual similarity. Experiments on terminology recognition and structuring performed on a corpus of biomedical abstracts recorded precision of 98% and 71%, respectively.
Term recognition and clustering are key topics in automatic knowledge acquisition and text mining. In this paper we present a novel approach to the automatic discovery of term similarities, which serves as a basis for both classification and clustering of domain-specific concepts represented by terms. The method is based on automatic extraction of significant patterns in which terms tend to appear. The approach is domain independent: it needs no manual description of domain-specific features and it is based on knowledge-poor processing of specific term features. However, automatically collected patterns are domain specific and identify significant contexts in which terms are used. Besides features that represent contextual patterns, we use lexical and functional similarities between terms to define a combined similarity measure. The approach has been tested and evaluated in the domain of molecular biology, and preliminary results are presented.
Irena Spasić, Goran Nenadić and Sophia Ananiadou (2002) Tuning context features with genetic algorithms, in Proceedings of 3rd International Conference on Language, Resources and Evaluation, Las Palmas, Spain, pp. 2048-2054
In this paper we present an approach to tuning of context features acquired from corpora. The approach is based on the idea of a genetic algorithm (GA). We analyse a whole population of contexts surrounding related linguistic entities in order to find a generic property characteristic of such contexts. Our goal is to tune the context properties so as not to lose any correct feature values, but also to minimise the presence of ambiguous values. The GA implements a crossover operator based on dominant and recessive genes, where a gene corresponds to a context feature. A dominant gene is the one that, when combined with another gene of the same type, is inevitably reflected in the offspring. Dominant genes denote the more suitable context features. In each iteration of the GA, the number of individuals in the population is halved, finally resulting in a single individual that contains context features tuned with respect to the information contained in the training corpus. We illustrate the general method by using a case study concerned with the identification of relationships between verbs and terms complementing them. More precisely, we tune the classes of terms that are typically selected as arguments for the considered verbs in order to acquire their semantic features.
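The population-halving scheme described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Gene` class, the pairing of neighbouring individuals, the tie-break in favour of the first parent, and the carrying over of an unpaired individual are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """One context feature value (hypothetical representation)."""
    value: str
    dominant: bool


def crossover(parent_a, parent_b):
    """Combine two individuals feature by feature.

    A dominant gene always appears in the offspring when paired with a
    recessive one. If both genes have the same dominance, the first
    parent's gene is kept (an assumption of this sketch).
    """
    child = {}
    for feature, gene_a in parent_a.items():
        gene_b = parent_b[feature]
        if gene_b.dominant and not gene_a.dominant:
            child[feature] = gene_b
        else:
            child[feature] = gene_a
    return child


def tune(population):
    """Halve the population in each iteration until one individual remains."""
    if not population:
        raise ValueError("population must not be empty")
    while len(population) > 1:
        next_generation = [
            crossover(population[i], population[i + 1])
            for i in range(0, len(population) - 1, 2)
        ]
        if len(population) % 2:  # an unpaired individual is carried over
            next_generation.append(population[-1])
        population = next_generation
    return population[0]


# Hypothetical contexts for one verb, each with a single feature.
contexts = [
    {"object_class": Gene("protein", False)},
    {"object_class": Gene("receptor", True)},
    {"object_class": Gene("entity", False)},
    {"object_class": Gene("thing", False)},
]
tuned = tune(contexts)
```

In this toy run the single dominant gene survives both rounds, so `tuned` holds the `receptor` value for `object_class`.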
Goran Nenadić, Irena Spasić and Sophia Ananiadou (2002) Automatic acronym acquisition and management within domain-specific texts, in Proceedings of 3rd International Conference on Language, Resources and Evaluation, Las Palmas, Spain, pp. 2155-2162
In this paper we present a framework for the effective management of terms and their variants that are automatically acquired from domain-specific texts. In our approach, the term variant recognition is incorporated in the automatic term retrieval process by taking into account orthographical, morphological, syntactic, lexico-semantic and pragmatic term variations. In particular, we address acronyms as a common way of introducing term variants in scientific papers. We describe a method for the automatic acquisition of newly introduced acronyms and the mapping to their 'meanings', i.e. the corresponding terms. The proposed three-step procedure is based on morpho-syntactic constraints that are commonly used in acronym definitions. First, acronym definitions containing an acronym and the corresponding term are retrieved. These two elements are matched in the second step by performing morphological analysis of words and combining forms constituting the term. The problems of acronym variation and acronym ambiguity are addressed in the third step by establishing classes of term variants that correspond to specific concepts. We present the results of the acronym acquisition in the domain of molecular biology: the precision of the method ranged from 94% to 99% depending on the size of the corpus used for evaluation, whilst the recall was 73%.
Irena Spasić, Goran Nenadić and Sophia Ananiadou (2002) A genetic algorithm approach to unsupervised learning of context features, in Proceedings of 5th National Colloquium for Computational Linguistics in the UK, Leeds, UK, pp. 12-19
We present an approach to unsupervised learning of some context features from corpora. The approach uses the idea of genetic algorithms. The algorithm operates on a collection of related linguistic entities as opposed to an isolated linguistic entity. Each of the entities encodes the values for a predefined set of context features obtained by automatic tagging. Our goal is to refine these features in order to find an interpretation that is optimal in the sense that it does not lose any correct feature values but, on the other hand, minimises the presence of feature values that are not applicable in a specific context. Our genetic algorithm implements a novel crossover operator based on two types of genes, dominant and recessive, where a gene corresponds to a context feature.
Dubravka Pavličić and Irena Spasić (2001) The effects of irrelevant alternatives on the results of the TOPSIS method, in Proceedings of XXVIII Yugoslav Symposium on Operational Research SYM-OP-IS 2001, Belgrade, Serbia
Irena Spasić and Gordana Pavlović-Lažetić (2001) Object-oriented modelling in natural language communication with a relational database, in Selected Papers from 10th Congress of Yugoslav Mathematicians, Belgrade, Serbia, pp. 343-347
This paper describes the problems of developing a natural language interface to a relational database (RDB). These problems depend on the particular database or, more precisely, on the specific semantic domain that is modelled by the RDB. The most obvious dependency is the one reflected in the structure of the RDB, that is, the actual tables, attributes and their relationships. This information is recorded in the RDB catalogue, which can be used for the automatic generation of an object-oriented (OO) model of the RDB. The classes of that model may serve the purpose of supporting the information extracted from a natural language query (NLQ). Possible ambiguities are gradually reduced by using the IsA relationships between the classes. If this still leaves the ambiguity unresolved, then it is possible to automatically generate a menu corresponding to the class that is the source of the ambiguity. The structure of the menu is in accordance with the OO model of the RDB.
Olgica Bošković and Irena Spasić (1999) Graph theory and log-linear models, in Proceedings of XXVI Yugoslav Symposium on Operational Research SYM-OP-IS '99, Belgrade, Serbia
Irena Spasić (1996) Automatic foreign words recognition in a Serbian scientific or technical text, in Proceedings of Conference on Standardization of Terminology, Serbian Academy of Arts and Sciences, Belgrade, Serbia
We present SOLD, a flexible similarity measure for biomedical terms (textual representations of biomedical concepts, e.g. genes, compounds, microorganisms), which combines term features on three different levels: syntactic, semantic (ontology-driven) and lexical. Since terms' inner features (e.g. morphologic or orthographic) do not always provide sufficient information about the properties of the underlying concepts, contextual features are explored as well. The context of each term is represented as a sequence of syntactic elements semantically annotated with the information retrieved from an ontology. The sequences of contextual elements are matched approximately through edit distance, defined as the minimal cost incurred by the changes (insertion, deletion and replacement) needed to transform one sequence into the other. Our approach augments the traditional concept of edit distance with elements of linguistic and biomedical knowledge. The SOLD measure has been incorporated into MaSTerClass, a case-based reasoning (CBR) system for term classification. CBR is based on remembering specific experiences similar to the problem (case) being solved. Such an approach to bio-text mining takes advantage of the growing body of available biomedical literature, used as a case base, since the coverage of the system automatically increases as more cases become available. Apart from term classification, the same approach can be used to mine other types of term associations, including recognition of term variants, clustering of similar terms and extraction of specific relations between terms.
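The edit distance over sequences of context elements mentioned in this abstract can be illustrated with the standard dynamic-programming formulation. This is a sketch only and does not reproduce the SOLD measure itself: the `edit_distance` signature, the `CATEGORY:label` token format and the `category_cost` function (which charges half price for replacing an element with another of the same syntactic category) are assumptions introduced for the example.

```python
def edit_distance(source, target, sub_cost=None, ins_cost=1.0, del_cost=1.0):
    """Minimal total cost of insertions, deletions and replacements
    needed to turn `source` into `target` (any two sequences)."""
    if sub_cost is None:
        # Default: classic unit-cost replacement.
        sub_cost = lambda x, y: 0.0 if x == y else 1.0

    # prev[j] is the cost of turning the processed prefix of source
    # into the first j elements of target.
    prev = [j * ins_cost for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        curr = [i * del_cost]
        for j, t in enumerate(target, start=1):
            curr.append(min(
                prev[j] + del_cost,           # delete s
                curr[j - 1] + ins_cost,       # insert t
                prev[j - 1] + sub_cost(s, t)  # replace s with t (or match)
            ))
        prev = curr
    return prev[-1]


def category_cost(x, y):
    """Hypothetical knowledge-aware replacement cost for 'CATEGORY:label' tokens."""
    if x == y:
        return 0.0
    if x.split(":")[0] == y.split(":")[0]:
        return 0.5  # same syntactic category, different label
    return 1.0


context_a = ["NP:protein", "V:bind"]
context_b = ["NP:gene", "V:bind"]
plain = edit_distance(context_a, context_b)
informed = edit_distance(context_a, context_b, sub_cost=category_cost)
```

With unit costs the two contexts are one replacement apart, while the category-aware cost treats the two noun phrases as closer. Passing the replacement cost as a function is what lets linguistic or ontological knowledge be plugged into an otherwise standard algorithm.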
Goran Nenadić, Irena Spasić and Sophia Ananiadou (2002) What can be learnt and acquired from non-disambiguated corpora: a case study in Serbian, 7th TELRI Seminar, Dubrovnik, Croatia
Dubravka Pavličić and Irena Spasić (2001) The effects of irrelevant alternatives on decision making results, The European Operational Research Conference EURO 2001, Rotterdam, The Netherlands
The paper deals with the effects of an irrelevant alternative on the results of Multiple Attribute Decision Making (MADM) methods. By an irrelevant alternative (IA) we denote an alternative which, although not dominated by any other alternative from the observed set, is worse than each of them in the binary comparisons made by a MADM method. We observe the problem of sequential choices from a fixed group of objects by using the same criteria with constant weights during the observed period. The effects of changes in the attribute values of an IA on the final choices are examined. Several conditions of consistent choice of the MADM methods concerning an IA are defined: Independence of Worsening of an IA, Independence of Completely Negligible Improvement of an IA, and Independence of Partially Negligible Improvement of an IA. The ELECTRE method is chosen as an illustration and it is shown that (when based on vector-normalised ratings, and not on utilities) the method violates the three conditions. Finally, we conclude that the main cause of inconsistent choices is vector normalisation of empirical data, conducted in the first step of the method.
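The role of vector normalisation in this argument can be seen in a small sketch. It assumes the usual definition, in which each rating is divided by the Euclidean norm of its attribute column; the function name `vector_normalise` and the example ratings are hypothetical and are not taken from the paper.

```python
import math


def vector_normalise(ratings):
    """Divide every rating by the Euclidean norm of its attribute column.

    `ratings` is a list of rows, one row per alternative and one column
    per attribute. Columns whose norm is zero are left as zeros.
    """
    n_attributes = len(ratings[0])
    norms = [
        math.sqrt(sum(row[j] ** 2 for row in ratings))
        for j in range(n_attributes)
    ]
    return [
        [row[j] / norms[j] if norms[j] else 0.0 for j in range(n_attributes)]
        for row in ratings
    ]


# Two alternatives rated on a single attribute (hypothetical data).
before = vector_normalise([[3.0], [4.0]])
# Only the second alternative is worsened; the first is untouched.
after = vector_normalise([[3.0], [0.0]])
```

Although the first alternative's raw rating is 3.0 in both cases, its normalised rating rises from 0.6 to 1.0 once the second alternative is worsened. Because every normalised value depends on the whole column, a change in one alternative shifts the ratings of all the others, which is the kind of dependence the abstract identifies as the source of inconsistent choices.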
Technical reports:
Irena Spasić (2003) Automatic term extraction in biomedicine, in Technical Reports in Computer Science, ISSN 1476-3060, Report No. 03/01, School of Sciences, University of Salford, p. 46
Irena Spasić (2002) An overview of case-based reasoning, in Technical Reports in Computer Science, ISSN 1476-3060, Report No. 02/01, School of Sciences, University of Salford, p. 75
Books:
Irena Spasić and Predrag Janičić (2000) Theory of Algorithms, Languages and Automata. Faculty of Mathematics, Belgrade, Serbia
Miodrag Ivović, Branislav Boričić, Dragan Azdejković and Irena Spasić (1998) Practice Book in Mathematics. Faculty of Economics, Belgrade, Serbia
Miodrag Ivović, Branislav Boričić, Velimir Pavlović, Dragan Azdejković and Irena Spasić (1996) Mathematics through Examples and Exercises with Elements of Theory. Faculty of Economics, Belgrade, Serbia
Search for publications in external sources:
DBLP Bibliography Server provides bibliographic information on major computer science journals and proceedings.