Dr Irena Spasić

Publications



Journal publications:

Markus J. Herrgård, Neil Swainston, Paul Dobson, Warwick B. Dunn, K. Yalçin Arga, Mikko Arvas, Nils Blüthgen, Simon Borger, Roeland Costenoble, Matthias Heinemann, Michael Hucka, Peter Li, Wolfram Liebermeister, Monica L. Mo, Ana Paula Oliveira, Dina Petranovic, Stephen Pettifer, Evangelos Simeonidis, Kieran Smallbone, Irena Spasić, Dieter Weichart, Roger Brent, David S. Broomhead, Hans V. Westerhoff, Betül Kirdar, Merja Penttilä, Edda Klipp, Bernhard Ø. Palsson, Uwe Sauer, Stephen G. Oliver, Pedro Mendes, Jens Nielsen and Douglas B. Kell (2008) A consensus yeast metabolic network reconstruction obtained from a community approach to systems biology. Nature Biotechnology, Vol. 26, No. 10, pp. 1155-1160 [DOI: 10.1038/nbt1492]
Genomic data allow the large-scale manual or semi-automated assembly of metabolic network reconstructions, which provide highly curated organism-specific knowledge bases. Although several genome-scale network reconstructions describe Saccharomyces cerevisiae metabolism, they differ in scope and content, and use different terminologies to describe the same chemical entities. This makes comparisons between them difficult and underscores the desirability of a consolidated metabolic network that collects and formalizes the 'community knowledge' of yeast metabolism. We describe how we have produced a consensus metabolic network reconstruction for S. cerevisiae. In drafting it, we placed special emphasis on referencing molecules to persistent databases or using database-independent forms, such as SMILES or InChI strings, as this permits their chemical structure to be represented unambiguously and in a manner that permits automated reasoning. The reconstruction is readily available via a publicly accessible database and in the Systems Biology Markup Language (http://www.comp-sys-bio.org/yeastnet). It can be maintained as a resource that serves as a common denominator for studying the systems biology of yeast. Similar strategies should benefit communities studying genome-scale metabolic networks of other organisms.
Irena Spasić, Daniel Schober, Susanna-Assunta Sansone, Dietrich Rebholz-Schuhmann, Douglas Kell and Norman Paton (2008) Facilitating the development of controlled vocabularies for metabolomics technologies with text mining. BMC Bioinformatics, Vol. 9, Suppl. 5, S5 [PMID: 18460187] [DOI: 10.1186/1471-2105-9-S5-S5]
Background: Many bioinformatics applications rely on controlled vocabularies or ontologies to consistently interpret and seamlessly integrate information scattered across public resources. Experimental data sets from metabolomics studies need to be integrated with one another, but also with data produced by other types of omics studies in the spirit of systems biology, hence the pressing need for vocabularies and ontologies in metabolomics. However, it is time-consuming and non-trivial to construct these resources manually.

Results: We describe a methodology for rapid development of controlled vocabularies, a study originally motivated by the needs for vocabularies describing metabolomics technologies. We present case studies involving two controlled vocabularies (for nuclear magnetic resonance spectroscopy and gas chromatography) whose development is currently underway as part of the Metabolomics Standards Initiative. The initial vocabularies were compiled manually, providing a total of 243 and 152 terms. A total of 5,699 and 2,612 new terms were acquired automatically from the literature. The analysis of the results showed that full-text articles (especially the Materials and Methods sections) are the major source of technology-specific terms as opposed to paper abstracts.

Conclusions: We suggest a text mining method for efficient corpus-based term acquisition as a way of rapidly expanding a set of controlled vocabularies with the terms used in the scientific literature. We adopted an integrative approach, combining relatively generic software and data resources for time- and cost-effective development of a text mining tool for expansion of controlled vocabularies across various domains, as a practical alternative to both manual term collection and tailor-made named entity recognition methods.
Warwick Dunn, David Broadhurst, David Ellis, Marie Brown, Anthony Halsall, Steven O'Hagan, Irena Spasić, Andrew Tseng and Douglas Kell (2008) A GC-TOF-MS study of the stability of serum and urine metabolomes during the UK Biobank sample collection and preparation protocols. International Journal of Epidemiology, Vol. 37, pp. i23-i30 [PMID: 18381390] [DOI: 10.1093/ije/dym281]
Background: The stability of mammalian serum and urine in large metabolomic investigations is essential for accurate, valid and reproducible studies. The stability of mammalian serum and urine, either processed immediately by freezing at -80°C or stored at 4°C for 24 hours before being frozen, was compared in a pilot metabolomic study of samples from 40 separate healthy volunteers.

Methods: Metabolic profiling with GC-TOF-MS was performed for serum and urine samples collected from 40 volunteers and stored at -80°C or 4°C for 24 hours before being frozen. Subsequent Kruskal-Wallis and Principal Components Analysis methods were used to assess whether metabolomic differences were detected between samples stored at 4°C for 0 or 24 hours.

Results: More than 700 unique metabolite peaks were detected, with over 200 metabolite peaks detected in any one sample. Principal Components Analysis (PCA) of serum and urine data showed that the variance associated with the replicate analysis per sample (analytical variance) was of the same magnitude as the variance observed between samples stored at 4°C or -80°C for 24 hours (biological variance). From a functional point of view the metabolomic composition of samples did not change in a statistically significant manner when stored under the two different conditions.

Conclusions: Based on this small pilot study, the UK Biobank sampling, transport and fractionation protocols are considered suitable to provide samples which can produce scientifically robust and valid data in metabolomic studies.

Keywords: metabolomics, metabolic profiling, GC-MS, univariate analysis, multivariate analysis, biofluid, serum, urine
Warwick Dunn, David Broadhurst, Sasalu Deepak, Mamta Buch, Garry McDowell, Irena Spasić, David Ellis, Nicholas Brooks, Douglas Kell and Ludwig Neyses (2007) Serum metabolomics reveals many novel metabolic markers of heart failure, including pseudouridine and 2-oxoglutarate. Metabolomics, Vol. 3, No. 4, pp. 413-426 [DOI: 10.1007/s11306-007-0063-5]
There is intense interest in the identification of novel biomarkers which improve the diagnosis of heart failure. Serum samples from 52 patients with systolic heart failure (EF<40% plus signs and symptoms of failure) and 57 controls were analyzed by gas chromatography - time of flight - mass spectrometry and the raw data reduced to 272 statistically robust metabolite peaks. 38 peaks showed a significant difference between case and control (p < 5×10⁻⁵). Two such metabolites were pseudouridine, a modified nucleotide present in t- and rRNA and a marker of cell turnover, as well as the tricarboxylic acid cycle intermediate 2-oxoglutarate. Three further compounds were also excellent discriminators between patients and controls: 2-hydroxy, 2-methylpropanoic acid, erythritol and 2,4,6-trihydroxypyrimidine. These findings demonstrate the power of data-driven metabolomics approaches to identify such markers of disease.

Keywords: heart failure, metabolomics, biomarkers, pseudouridine, 2-oxoglutarate.
Susanna-Assunta Sansone, Daniel Schober, Helen Atherton, Oliver Fiehn, Helen Jenkins, Philippe Rocca-Serra, Denis Rubtsov, Irena Spasić, Larisa Soldatova, Chris Taylor, Andy Tseng, Mark Viant and the Ontology Working Group Members (2007) Metabolomics Standards Initiative - Ontology Working Group: Work in progress. Metabolomics, Vol. 3, No. 3, pp. 249-256 [DOI: 10.1007/s11306-007-0069-z]
In this article we present the activities of the Ontology Working Group (OWG) under the Metabolomics Standards Initiative (MSI) umbrella. Our endeavour aims to synergise the work of several communities, where independent activities are underway to develop terminologies and databases for metabolomics investigations. We have joined forces to rise to the challenges associated with interpreting and integrating experimental process and data across disparate sources (software and databases, private and public). Our focus is to support the activities of the other MSI working groups by developing a common semantic framework to enable metabolomics-user communities to consistently annotate the experimental process and to enable meaningful exchange of datasets. Our work is accessible via a public webpage and a draft ontology has been posted under the Open Biological Ontology umbrella. At the very outset, we have agreed to minimize duplications across omics domains through extensive liaisons with other communities under the OBO Foundry. This is work in progress and we welcome new participants willing to volunteer their time and expertise to this open effort.

Keywords: controlled vocabulary, annotation, terminology, semantic, metadata, ontology, functional genomics, metabolomics, metabonomics, standard, Metabolomics Society, Metabolomics Standards Initiative, OBO.
Irena Spasić, Warwick Dunn, Giles Velarde, Andy Tseng, Helen Jenkins, Nigel Hardy, Stephen Oliver and Douglas Kell (2006) MeMo: a hybrid SQL/XML approach to metabolomic data management for functional genomics. BMC Bioinformatics, Vol. 7, 281 [PMID: 16753052] [DOI: 10.1186/1471-2105-7-281]
Background: The genome sequencing projects have shown our limited knowledge regarding gene function, e.g. S. cerevisiae has 5-6,000 genes of which nearly 1,000 have an uncertain function. Their gross influence on the behaviour of the cell can be observed using large-scale metabolomic studies. The metabolomic data produced need to be structured and annotated in a machine-usable form to facilitate the exploration of the hidden links between the genes and their functions.

Description: MeMo is a formal model for representing metabolomic data and the associated metadata. Two predominant platforms (SQL and XML) are used to encode the model. MeMo has been implemented as a relational database using a hybrid approach combining the advantages of the two technologies. It represents a practical solution for handling the sheer volume and complexity of the metabolomic data effectively and efficiently. The MeMo model and the associated software are available at http://dbkgroup.org/memo/.

Conclusions: The maturity of relational database technology is used to support efficient data processing. The scalability and self-descriptiveness of XML are used to simplify the relational schema and facilitate the extensibility of the model necessitated by the creation of new experimental techniques. Special consideration is given to data integration issues as part of the systems biology agenda. MeMo has been physically integrated and cross-linked to related metabolomic and genomic databases. Semantic integration with other relevant databases has been supported through ontological annotation. Compatibility with other data formats is supported by automatic conversion.
Stephen Wilkinson, Irena Spasić and David Ellis (2006) Genomes to Systems 3. Metabolomics, Vol. 2, No. 3, pp. 165-170 [DOI: 10.1007/s11306-006-0030-6]
A report on the third Genomes to Systems consortium conference, which portrayed the breadth of the post-genome sciences including Genomics, Transcriptomics, Proteomics, Metabolomics, Informatics, and integrative Systems Biology.
Irena Spasić, Sophia Ananiadou, John McNaught and Anand Kumar (2005) Text mining and ontologies in biomedicine: making sense of raw text. Briefings in Bioinformatics, Vol. 6, No. 3, pp. 239-251 [PMID: 16212772]
The volume of biomedical literature is increasing at such a rate that it is becoming difficult to locate, retrieve and manage the reported information without text mining, which aims to automatically distill information, extract facts, discover implicit links and generate hypotheses relevant to user needs. Ontologies, as conceptual models, provide the necessary framework for semantic representation of textual information. The principal link between text and an ontology is terminology, which maps terms to domain-specific concepts. In this article, we summarize different approaches in which ontologies have been used for text mining applications in biomedicine.
Douglas Kell, Marie Brown, Hazel Davey, Warwick Dunn, Irena Spasić and Stephen Oliver (2005) Metabolic footprinting and systems biology: the medium is the message. Nature Reviews Microbiology, Vol. 3, No. 7, pp. 557-565 [PMID: 15953932] [DOI: 10.1038/nrmicro1177]
One element of classical systems analysis treats a system as a black or grey box, the inner structure and behaviour of which can be analysed and modelled by varying an internal or external condition, probing it from outside and studying the effect of the variation on the external observables. The result is an understanding of the inner make-up and workings of the system. The equivalent of this in biology is to observe what a cell or system excretes under controlled conditions - the 'metabolic footprint' or exometabolome - as this is readily and accurately measurable. Here, we review the principles, experimental approaches and scientific outcomes that have been obtained with this useful and convenient strategy.
Irena Spasić, Sophia Ananiadou and Junichi Tsujii (2005) MaSTerClass: a case-based reasoning system for the classification of biomedical terms. Bioinformatics, Vol. 21, No. 11, pp. 2748-2758 [PMID: 15728115] [DOI: 10.1093/bioinformatics/bti338]
Motivation: The sheer volume of textually described biomedical knowledge creates a need for natural language processing (NLP) applications that allow flexible and efficient access to relevant information. Specialised semantic networks (such as biomedical ontologies, terminologies or semantic lexicons) can significantly enhance these applications by supplying the necessary terminological information in a machine-readable form. Due to the explosive growth of bio-literature, new terms (representing newly identified concepts or variations of the existing terms) may not be explicitly described within the network and hence cannot be fully exploited by NLP applications. Linguistic and statistical clues can be used to extract many new terms from free text. The extracted terms still need to be correctly positioned relative to other terms in the network. Classification as a means of semantic typing represents the first step in updating a semantic network with new terms.

Results: The MaSTerClass system implements the case-based reasoning methodology for the classification of biomedical terms.

Availability: MaSTerClass is available at http://www.cbr-masterclass.org. It is distributed under an open source licence for educational and research purposes. The software requires Java, JWSDP, Ant, MySQL and X-hive to be installed and licences obtained separately where needed.
Marie Brown, Warwick Dunn, David Ellis, Royston Goodacre, Julia Handl, Joshua Knowles, Steve O'Hagan, Irena Spasić and Douglas Kell (2005) A Metabolome pipeline: from concept to data to knowledge. Metabolomics, Vol. 1, No. 1, pp. 39-51 [DOI: 10.1007/s11306-005-1106-4]
Metabolomics, like other omics methods, produces huge datasets of biological variables, along with the necessary metadata. However, regardless of the form in which these are produced they are merely the ground substance for assisting us in answering biological questions. In this short tutorial review and position paper we seek to set out some of the elements of 'best practice' in the optimal acquisition of such data, and in the means by which they may be turned into reliable knowledge. Many of these steps involve the solution of what amount to combinatorial optimization problems, and methods developed for these, especially those based on evolutionary computing, are proving valuable. This is done in terms of a 'pipeline' that goes from the design of good experiments, through instrumental optimization, data storage and manipulation, the chemometric data processing methods in common use, and the necessary means of validation and cross-validation for giving conclusions that are credible and likely to be robust when applied in comparable circumstances and to samples not used in their generation.
Irena Spasić and Sophia Ananiadou (2004) Using automatically learnt verb selectional preferences for classification of biomedical terms. Journal of Biomedical Informatics, Special Issue on Named Entity Recognition in Biomedicine, Vol. 37, No. 6, pp. 483-497 [PMID: 15542021] [DOI: 10.1016/j.jbi.2004.08.002]
In this paper, we present an approach to term classification based on verb selectional patterns (VSPs), where such a pattern is defined as a set of semantic classes that could be used in combination with a given domain-specific verb. VSPs have been automatically learnt based on the information found in a corpus and an ontology in the biomedical domain. Prior to the learning phase, the corpus is terminologically processed: term recognition is performed by both looking up the dictionary of terms listed in the ontology and applying the C/NC-value method for on-the-fly term extraction. Subsequently, domain-specific verbs are automatically identified in the corpus based on the frequency of occurrence and the frequency of their co-occurrence with terms. VSPs are then learnt automatically for these verbs. Two machine learning approaches are presented. The first approach has been implemented as an iterative generalisation procedure based on a partial order relation induced by the domain-specific ontology. The second approach exploits the idea of genetic algorithms. Once the VSPs are acquired, they can be used to classify newly recognised terms co-occurring with domain-specific verbs. Given a term, the most frequently co-occurring domain-specific verb is selected. Its VSP is used to constrain the search space by focusing on potential classes of the given term. A nearest-neighbour approach is then applied to select a class from the constrained space of candidate classes. The most similar candidate class is predicted for the given term. The similarity measure used for this purpose combines contextual, lexical, and syntactic properties of terms.
Goran Nenadić, Irena Spasić and Sophia Ananiadou (2004) Mining term similarities from corpora. Terminology, Special Issue on Recent Trends in Computational Terminology, Vol. 10, No. 1, pp. 55-80
In this article we present an approach to the automatic discovery of term similarities, which may serve as a basis for a number of term-oriented knowledge mining tasks. The method for term comparison combines internal (lexical similarity) and two types of external criteria (syntactic and contextual similarities). Lexical similarity is based on sharing lexical constituents (i.e. term heads and modifiers). Syntactic similarity relies on a set of specific lexico-syntactic co-occurrence patterns indicating the parallel usage of terms (e.g. within an enumeration or within a term coordination/conjunction structure), while contextual similarity is based on the usage of terms in similar contexts. Such contexts are automatically identified by a pattern mining approach, and a procedure is proposed to assess their domain-specific and terminological relevance. Although automatically collected, these patterns are domain dependent and identify contexts in which terms are used. Different types of similarities are combined into a hybrid similarity measure, which can be tuned for a specific domain by learning optimal weights for individual similarities. The suggested similarity measure has been tested in the domain of biomedicine, and some experiments are presented.
Goran Nenadić, Irena Spasić and Sophia Ananiadou (2003) Terminology-driven mining of biomedical literature. Bioinformatics, Vol. 19, No. 8, pp. 938-943 [PMID: 12761055]
In this paper we present an overview of an integrated framework for terminology-driven mining from biomedical literature. The framework integrates the following components: automatic term recognition, term variation handling, acronym acquisition, automatic discovery of term similarities and term clustering. The term variant recognition is incorporated into the terminology recognition process by taking into account orthographical, morphological, syntactic, lexico-semantic and pragmatic term variations. In particular, we address acronyms as a common way of introducing term variants in biomedical papers. Term clustering is based on the automatic discovery of term similarities. We use a hybrid similarity measure, where terms are compared by using both internal and external evidence. The measure combines lexical, syntactical and contextual similarity. Experiments on terminology recognition and structuring performed on a corpus of biomedical abstracts are presented.
Goran Nenadić, Hideki Mima, Irena Spasić, Sophia Ananiadou and Junichi Tsujii (2002) Terminology-based literature mining and knowledge acquisition in biomedicine. International Journal of Medical Informatics, Vol. 67, No. 1-3, pp. 33-48 [PMID: 12460630] [DOI: 10.1016/S1386-5056(02)00055-2]
In this paper we describe TIMS, an integrated knowledge management system for the domain of molecular biology and biomedicine, in which terminology-driven literature mining, knowledge acquisition, knowledge integration, and XML-based knowledge retrieval are combined using tag information management and ontology inference. The system integrates automatic terminology acquisition, term variation management, hierarchical term clustering, tag-based information extraction, and ontology-based query expansion. TIMS supports introducing and combining different types of tags (linguistic and domain-specific, manual and automatic). Tag-based interval operations and a query language are introduced in order to facilitate knowledge acquisition and retrieval from XML documents. Through knowledge acquisition examples, we illustrate the way in which literature mining techniques can be utilised for knowledge discovery from documents.


Refereed book chapters:

Irena Spasić and Sophia Ananiadou (2005) A flexible measure of contextual similarity for biomedical terms. In R. Altman et al. (Eds.): Pacific Symposium on Biocomputing - PSB 2005. World Scientific Publishing Company, Singapore, pp. 197-208 [PMID: 15759626]
We present a measure of contextual similarity for biomedical terms. The contextual features need to be explored, because newly coined terms are not explicitly described and efficiently stored in biomedical ontologies and their inner features (e.g. morphologic or orthographic) do not always provide sufficient information about the properties of the underlying concepts. The context of each term can be represented as a sequence of syntactic elements annotated with biomedical information retrieved from an ontology. The sequences of contextual elements may be matched approximately by edit distance defined as the minimal cost incurred by the changes (including insertion, deletion and replacement) needed to transform one sequence into the other. Our approach augments the traditional concept of edit distance by elements of linguistic and biomedical knowledge, which together provide flexible selection of contextual features and their comparison.
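The edit distance described in this abstract can be illustrated with a minimal sketch. The paper's actual cost functions and context representation are not given here, so everything specific below is an assumption: context elements are modelled as hypothetical (tag, text) pairs, and the hypothetical `same_tag_discount` function charges less for replacing an element with another of the same syntactic tag.

```python
def edit_distance(a, b,
                  ins_cost=lambda x: 1.0,
                  del_cost=lambda x: 1.0,
                  sub_cost=lambda x, y: 0.0 if x == y else 1.0):
    """Minimal total cost of insertions, deletions and replacements
    needed to transform sequence a into sequence b."""
    m, n = len(a), len(b)
    # d[i][j] holds the cost of transforming a[:i] into b[:j].
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + del_cost(a[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins_cost(b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + del_cost(a[i - 1]),                # delete a[i-1]
                d[i][j - 1] + ins_cost(b[j - 1]),                # insert b[j-1]
                d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),  # replace or match
            )
    return d[m][n]


def same_tag_discount(x, y):
    """Hypothetical replacement cost: free for identical elements,
    cheaper when both elements share a syntactic tag."""
    if x == y:
        return 0.0
    return 0.5 if x[0] == y[0] else 1.0


# Two illustrative contexts as sequences of (tag, text) elements.
left = [("V", "activate"), ("TERM", "protein")]
right = [("V", "inhibit"), ("TERM", "protein")]
distance = edit_distance(left, right, sub_cost=same_tag_discount)
```

Passing the cost functions as parameters is one way to let linguistic or ontological knowledge influence the comparison without changing the dynamic-programming core.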
Goran Nenadić, Irena Spasić and Sophia Ananiadou (2005) Mining biomedical abstracts: what is in a term? In K.Y. Su et al. (Eds.): Natural Language Processing - IJCNLP 2004. LNAI 3248, Springer Verlag, pp. 797-806
In this paper we present a study of the usage of terminology in biomedical literature, with the main aim to indicate phenomena that can be helpful for automatic term recognition in the domain. Our comparative analysis is based on the terminology used in the Genia corpus. We analyse the usage of ordinary biomedical terms as well as their variants (namely inflectional and orthographic alternatives, terms with prepositions, coordinated terms, etc.), showing the variability and dynamic nature of terms used in biomedical abstracts. Term coordination and terms containing prepositions are analysed in detail. We show that there is a discrepancy between terms used in literature and terms listed in controlled dictionaries. We also evaluate the effectiveness of incorporating different types of term variation into an automatic term recognition system.
Irena Spasić, Goran Nenadić and Sophia Ananiadou (2004) Learning to classify biomedical terms through literature mining and genetic algorithms. In Z.R. Yang et al. (Eds.): Intelligent Data Engineering and Automated Learning - IDEAL 2004. LNCS 3177, Springer Verlag, pp. 345-351
We present an approach to classification of biomedical terms based on the information acquired automatically from the corpus of relevant literature. The learning phase consists of two stages: acquisition of terminologically relevant contextual patterns (CPs) and selection of classes that apply to terms used with these patterns. CPs represent a generalisation of similar term contexts in the form of regular expressions containing lexical, syntactic and terminological information. The most probable classes for the training terms co-occurring with the statistically relevant CP are learned by a genetic algorithm. Term classification is based on the learnt results. First, each term is associated with the most frequently co-occurring CP. Classes attached to such CP are initially suggested as the term's potential classes. Then, the term is finally mapped to the most similar suggested class.
Goran Nenadić, Irena Spasić and Sophia Ananiadou (2003) Reducing lexical ambiguity in Serbo-Croatian by using genetic algorithms. In P. Kosta et al. (Eds.): Investigations into Formal Slavic Linguistics. Linguistik International, Peter Lang, Frankfurt, pp. 287-298
This paper presents an approach to acquisition of some lexical and grammatical constraints from large corpora using genetic algorithms. The main aim is to use these constraints to automatically define local grammars that can be used to reduce lexical ambiguity usually found in an initially tagged text. A genetic algorithm for computation of the minimal representation of grammatical features of textual constituents is suggested. The algorithm incorporates two types of genes, dominant and recessive, which are specific for the features that are analysed. The resulting genetic structure describes the constraints that have to be fulfilled in order to form a correct utterance. As a case study, the suggested algorithm is applied to contexts of prepositional phrases, and features of corresponding noun phrases are obtained. The results obtained coincide with (theoretical) grammars that define the constraints for such noun phrases.
Irena Spasić, Goran Nenadić, Kostas Manios and Sophia Ananiadou (2002) Supervised learning of term similarities. In Hujun Yin et al. (Eds.): Intelligent Data Engineering and Automated Learning - IDEAL 2002. LNCS 2412, Springer Verlag, pp. 429-434
In this paper we present a method for the automatic discovery and tuning of term similarities. The method is based on the automatic extraction of significant patterns in which terms tend to appear. In addition, we use lexical and functional similarities between terms to define a hybrid similarity measure as a linear combination of the three similarities. We then present a genetic algorithm approach to supervised learning of parameters that are used in this linear combination. We used a domain specific ontology to evaluate the generated similarity measures and set the direction of their convergence. The approach has been tested and evaluated in the domain of molecular biology.
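A hybrid measure defined as a linear combination of three similarities can be sketched as follows. This is only an illustration: the function name, the default equal weights and the normalisation by the weight sum are assumptions, and the genetic algorithm that the paper uses to learn the weights is not reproduced.

```python
def hybrid_similarity(contextual, lexical, functional,
                      weights=(1.0, 1.0, 1.0)):
    """Weighted linear combination of three similarity scores.

    Dividing by the weight sum (an assumption made here) keeps the
    result in [0, 1] when each input score lies in [0, 1].
    """
    w_c, w_l, w_f = weights
    total = w_c + w_l + w_f
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return (w_c * contextual + w_l * lexical + w_f * functional) / total


# Illustrative scores for one pair of terms, with the contextual
# similarity weighted twice as heavily as the other two.
score = hybrid_similarity(0.8, 0.5, 0.2, weights=(2.0, 1.0, 1.0))
```

In a supervised setting, the `weights` tuple is the parameter vector that a learning procedure would tune against a reference such as an ontology.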
Goran Nenadić, Irena Spasić and Sophia Ananiadou (2002) Term clustering using a corpus-based similarity measure. In P. Sojka et al. (Eds.): Text, Speech and Dialogue - TSD 2002. LNAI 2448, Springer Verlag, pp. 151-154
In this paper we present a method for automatic term clustering. The method uses a hybrid similarity measure to cluster terms automatically extracted from a corpus by applying the C/NC-value method. The measure comprises contextual, functional and lexical similarity, and it is used to instantiate the cell values in a similarity matrix. The clustering algorithm uses either the nearest neighbour or Ward's method to calculate the distance between clusters. The approach has been tested and evaluated in the domain of molecular biology and the results are presented.
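Nearest-neighbour (single-linkage) clustering over a similarity matrix can be sketched as below. The stopping threshold, the function name and the toy matrix are assumptions for illustration; the sketch works directly on similarities rather than distances and does not cover Ward's method.

```python
def single_linkage(sim, threshold):
    """Agglomerative clustering on a symmetric similarity matrix.

    The similarity between two clusters is that of their most similar
    pair of members (nearest neighbour). Merging stops once the best
    available similarity falls below `threshold`.
    """
    clusters = [{i} for i in range(len(sim))]
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = max(sim[i][j] for i in clusters[x] for j in clusters[y])
                if best is None or s > best[0]:
                    best = (s, x, y)
        if best[0] < threshold:
            break
        _, x, y = best
        clusters[x] |= clusters[y]  # merge the two closest clusters
        del clusters[y]
    return [sorted(c) for c in clusters]


# Toy matrix for four terms: 0 and 1 are similar, as are 2 and 3.
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
groups = single_linkage(sim, threshold=0.5)
```

The cell values of `sim` would come from whichever term similarity measure is in use, so the clustering step stays independent of how the similarities are computed.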
Irena Spasić and Gordana Pavlović-Lažetić (2001) Syntactic structures in a sublanguage of Serbian for querying relational databases. In G. Zybatow et al. (Eds.): Current Issues in Formal Slavic Linguistics. Peter Lang, Frankfurt/Main, pp. 478-488
This paper deals with syntactic structures identified in a sublanguage of Serbian for querying relational databases. Three levels of syntactic description of the sublanguage are defined: word, syntagmatic, and sentence levels. An algorithm for complete syntactic analysis of a Serbian language query over a relational database and its translation into a formal SQL query is presented. An example of partial parsing and translation is discussed.
Goran Nenadić and Irena Spasić (2000) The recognition and acquisition of compound names from corpora. In D. Christodoulakis (Ed.): Natural Language Processing - NLP 2000. LNAI 1835, Springer Verlag, pp. 38-48
In this paper we present an approach to the acquisition of some classes of compound words from large corpora, as well as a method for semi-automatic generation of appropriate linguistic models that can be further used for compound word recognition and for completion of compound word dictionaries. The approach is intended for a highly inflective language such as Serbo-Croatian. Generated linguistic models are represented by local grammars.
Goran Nenadić and Irena Spasić (1999) The acquisition of some lexical constraints from corpora. In V. Matousek et al. (Eds.): Text, Speech and Dialogue - TSD 1999. LNAI 1692, Springer Verlag, pp. 115-120
This paper presents an approach to acquisition of some lexical and grammatical constraints from large corpora. The constraints discussed relate to grammatical features of a preposition and the corresponding noun phrase that constitute a prepositional phrase. The approach is based on the extraction of a textual environment of a preposition from a corpus, which is then tagged using the system of electronic dictionaries. An algorithm for computation of a minimal representation of grammatical features associated with the corresponding noun phrases is suggested. The resulting set of features describes the constraints that a noun phrase has to fulfil in order to form a correct prepositional phrase with a given preposition. This set can be checked against other corpora.


Refereed conference papers:

Irena Spasić, Daniel Schober, Susanna-Assunta Sansone, Dietrich Rebholz-Schuhmann, Douglas Kell, Norman Paton and the Ontology Working Group Members (2007) Facilitating the development of controlled vocabularies for metabolomics with text mining, in ISMB/ECCB Special Interest Group (SIG) Meeting Program Materials, Bio-Ontologies SIG Workshop, Vienna, Austria, pp. 103-106
Bioinformatics applications heavily rely on controlled vocabularies and ontologies to consistently interpret and seamlessly integrate information scattered across disparate public resources. Experimental data from metabolomics studies need to be integrated with one another, but also with data produced by other types of omics studies in the spirit of systems biology, hence the pressing need for vocabularies and ontologies in metabolomics. Here we describe the development of controlled vocabularies for metabolomics investigations. Manual term acquisition approaches are time-consuming, labour-intensive and error-prone, especially in a rapidly developing domain such as metabolomics, where new analytical techniques emerge regularly so that the domain experts are often compelled to use non-standardised terms. We suggest a text mining method for efficient corpus-based term acquisition as a way of rapidly expanding a set of controlled vocabularies with the terms used in the scientific literature.
Goran Nenadić, Simon Rice, Irena Spasić, Sophia Ananiadou and Benjamin Stapley (2003) Selecting text features for gene name classification: from documents to terms, in Proceedings of ACL Workshop on Natural Language Processing in Biomedicine, Sapporo, Japan, pp. 121-128
In this paper we discuss the performance of a text-based classification approach by comparing different types of features. We consider the automatic classification of gene names from the molecular biology literature, by using a support-vector machine method. Classification features range from words, lemmas and stems, to automatically extracted terms. Also, simple co-occurrences of genes within documents are considered. The preliminary experiments performed on a set of 3,000 S. cerevisiae gene names and 53,000 Medline abstracts have shown that using domain-specific terms can improve the performance compared to the standard bag-of-words approach, in particular for genes classified with higher confidence, and for under-represented classes.
Irena Spasić, Goran Nenadić and Sophia Ananiadou (2003) Using domain-specific verbs for term classification, in Proceedings of ACL Workshop on Natural Language Processing in Biomedicine, Sapporo, Japan, pp. 17-24
In this paper we present an approach to term classification based on verb complementation patterns. The complementation patterns have been automatically learnt by combining information found in a corpus and an ontology, both belonging to the biomedical domain. The learning process is unsupervised and has been implemented as an iterative reasoning procedure based on a partial order relation induced by the domain-specific ontology. First, term recognition was performed by both looking up the dictionary of terms listed in the ontology and applying the C/NC-value method. Subsequently, domain-specific verbs were automatically identified in the corpus. Finally, the classes of terms typically selected as arguments for the considered verbs were induced from the corpus and the ontology. This information was used to classify newly recognised terms. The precision of the classification method reached 64%.
Goran Nenadić, Irena Spasić and Sophia Ananiadou (2003) Morpho-syntactic clues for terminological processing in Serbian, in Proceedings of EACL Workshop on Morphological Processing of Slavic Languages, Budapest, Hungary, pp. 79-86
In this paper we discuss morpho-syntactic clues that can be used to facilitate terminological processing in Serbian. A method (called srCe) for automatic extraction of multiword terms is presented. The approach incorporates a set of generic morpho-syntactic filters for recognition of term candidates, a method for conflation of morphological variants and a module for foreign word recognition. Morpho-syntactic filters describe general term formation patterns, and are implemented as generic regular expressions. The inner structure together with the agreements within term candidates are used as clues to discover the boundaries of nested terms. The results of the terminological processing of a textbook corpus in the domains of mathematics and computer science are presented.
Irena Spasić, Goran Nenadić, Kostas Manios and Sophia Ananiadou (2003) An integrated term-based corpus query system, in Proceedings of 10th Conference of the European Chapter of the Association for Computational Linguistics, Budapest, Hungary, pp. 243-250
In this paper we describe the X-TRACT workbench, which enables efficient term-based querying against a domain-specific literature corpus. Its main aim is to aid domain specialists in locating and extracting new knowledge from scientific literature corpora. Before querying, a corpus is automatically terminologically analysed by the ATRACT system, which performs terminology recognition based on the C/NC-value method enhanced by incorporation of term variation handling. The results of terminology processing are annotated in XML, and the produced XML documents are stored in an XML-native database. All corpus retrieval operations are performed against this database using an XML query language. We illustrate the way in which the X-TRACT workbench can be utilised for knowledge discovery, literature mining and conceptual information extraction.
Goran Nenadić, Irena Spasić and Sophia Ananiadou (2003) Terminology-driven mining of biomedical literature, in Proceedings of 18th Annual ACM Symposium on Applied Computing, Melbourne, Florida, USA
Motivation: With an overwhelming amount of textual information in molecular biology and biomedicine, there is a need for effective literature mining techniques that can help biologists to gather and make use of the knowledge encoded in text documents. Although the knowledge is organised around sets of domain-specific terms, few literature mining systems incorporate deep and dynamic terminology processing.

Results: In this paper, we present an overview of an integrated framework for terminology-driven mining from biomedical literature. The framework integrates the following components: automatic term recognition, term variation handling, acronym acquisition, automatic discovery of term similarities and term clustering. The term variant recognition is incorporated into the terminology recognition process by taking into account orthographical, morphological, syntactic, lexico-semantic and pragmatic term variations. In particular, we address acronyms as a common way of introducing term variants in biomedical papers. Term clustering is based on the automatic discovery of term similarities. We use a hybrid similarity measure, where terms are compared by using both internal and external evidence. The measure combines lexical, syntactical and contextual similarity. Experiments on terminology recognition and structuring performed on a corpus of biomedical abstracts recorded a precision of 98% and 71%, respectively.
Goran Nenadić, Irena Spasić and Sophia Ananiadou (2002) Automatic discovery of term similarities using pattern mining, in Proceedings of Second International Workshop on Computational Terminology - CompuTerm 2002, Taipei, Taiwan, pp. 43-49
Term recognition and clustering are key topics in automatic knowledge acquisition and text mining. In this paper we present a novel approach to the automatic discovery of term similarities, which serves as a basis for both classification and clustering of domain-specific concepts represented by terms. The method is based on automatic extraction of significant patterns in which terms tend to appear. The approach is domain independent: it needs no manual description of domain-specific features and it is based on knowledge-poor processing of specific term features. However, automatically collected patterns are domain specific and identify significant contexts in which terms are used. Besides features that represent contextual patterns, we use lexical and functional similarities between terms to define a combined similarity measure. The approach has been tested and evaluated in the domain of molecular biology, and preliminary results are presented.
Sophia Ananiadou, Goran Nenadić, Dietrich Schuhmann and Irena Spasić (2002) Term-based literature mining from biomedical texts, ISMB Text Data Mining SIG, Edmonton, Canada

Irena Spasić, Goran Nenadić and Sophia Ananiadou (2002) Tuning context features with genetic algorithms, in Proceedings of 3rd International Conference on Language, Resources and Evaluation, Las Palmas, Spain, pp. 2048-2054
In this paper we present an approach to tuning of context features acquired from corpora. The approach is based on the idea of a genetic algorithm (GA). We analyse a whole population of contexts surrounding related linguistic entities in order to find a generic property characteristic of such contexts. Our goal is to tune the context properties so as not to lose any correct feature values, but also to minimise the presence of ambiguous values. The GA implements a crossover operator based on dominant and recessive genes, where a gene corresponds to a context feature. A dominant gene is the one that, when combined with another gene of the same type, is inevitably reflected in the offspring. Dominant genes denote the more suitable context features. In each iteration of the GA, the number of individuals in the population is halved, finally resulting in a single individual that contains context features tuned with respect to the information contained in the training corpus. We illustrate the general method by using a case study concerned with the identification of relationships between verbs and terms complementing them. More precisely, we tune the classes of terms that are typically selected as arguments for the considered verbs in order to acquire their semantic features.
Goran Nenadić, Irena Spasić and Sophia Ananiadou (2002) Automatic acronym acquisition and management within domain-specific texts, in Proceedings of 3rd International Conference on Language, Resources and Evaluation, Las Palmas, Spain, pp. 2155-2162
In this paper we present a framework for the effective management of terms and their variants that are automatically acquired from domain-specific texts. In our approach, the term variant recognition is incorporated in the automatic term retrieval process by taking into account orthographical, morphological, syntactic, lexico-semantic and pragmatic term variations. In particular, we address acronyms as a common way of introducing term variants in scientific papers. We describe a method for the automatic acquisition of newly introduced acronyms and the mapping to their 'meanings', i.e. the corresponding terms. The proposed three-step procedure is based on morpho-syntactic constraints that are commonly used in acronym definitions. First, acronym definitions containing an acronym and the corresponding term are retrieved. These two elements are matched in the second step by performing morphological analysis of words and combining forms constituting the term. The problems of acronym variation and acronym ambiguity are addressed in the third step by establishing classes of term variants that correspond to specific concepts. We present the results of the acronym acquisition in the domain of molecular biology: the precision of the method ranged from 94% to 99% depending on the size of the corpus used for evaluation, whilst the recall was 73%.
Irena Spasić, Goran Nenadić and Sophia Ananiadou (2002) A genetic algorithm approach to unsupervised learning of context features, in Proceedings of 5th National Colloquium for Computational Linguistics in the UK, Leeds, UK, pp. 12-19
We present an approach to unsupervised learning of some context features from corpora. The approach uses the idea of genetic algorithms. The algorithm operates on a collection of related linguistic entities as opposed to an isolated linguistic entity. Each of the entities encodes the values for a predefined set of context features obtained by automatic tagging. Our goal is to refine these features in order to find an interpretation that is optimal in the sense that it does not lose any correct feature values, but which, on the other hand, minimises the presence of feature values that are not applicable in a specific context. Our genetic algorithm implements a novel crossover operator based on two types of genes, dominant and recessive, where a gene corresponds to a context feature.
Dubravka Pavličić and Irena Spasić (2001) The effects of irrelevant alternatives on the results of the TOPSIS method, in Proceedings of XXVIII Yugoslav Symposium on Operational Research SYM-OP-IS 2001, Belgrade, Serbia

Irena Spasić and Gordana Pavlović-Lažetić (2001) Object-oriented modelling in natural language communication with a relational database, in Selected Papers from 10th Congress of Yugoslav Mathematicians, Belgrade, Serbia, pp. 343-347
This paper describes the problems of developing a natural language interface towards a relational database (RDB). These problems depend on a particular database, or, more precisely, on a specific semantic domain that is modeled by the RDB. The most obvious dependency is the one reflected in the structure of the RDB, that is - the actual tables, attributes and their relationships. This information is recorded in the RDB catalogue, which can be used for the automatic generation of an OO model of the RDB. The classes of that model may serve the purpose of supporting the information extracted from a natural language query (NLQ). Possible ambiguities are gradually reduced by using the IsA relationships between the classes. If this still leaves the ambiguity unresolved, then it is possible to automatically generate a menu corresponding to the class that is the source of the ambiguity. The structure of the menu is in accordance with the OO model of the RDB.
Olgica Bošković and Irena Spasić (1999) Graph theory and log-linear models, in Proceedings of XXVI Yugoslav Symposium on Operational Research SYM-OP-IS '99, Belgrade, Serbia

Irena Spasić (1996) Automatic foreign words recognition in a Serbian scientific or technical text, in Proceedings of Conference on Standardization of Terminology, Serbian Academy of Arts and Sciences, Belgrade, Serbia



Presentations:

Irena Spasić (2005) A flexible term similarity measure as a basis for intelligent mining of biomedical literature, a talk given at Data and Decision Engineering Group, School of Informatics, University of Manchester, UK
We present SOLD, a flexible similarity measure for biomedical terms (textual representations of biomedical concepts, e.g. genes, compounds, microorganisms), which combines term features on three different levels: syntactic, semantic (ontology-driven) and lexical. Since terms' inner features (e.g. morphologic or orthographic) do not always provide sufficient information about the properties of the underlying concepts, contextual features are explored as well. The context of each term is represented as a sequence of syntactic elements semantically annotated with the information retrieved from an ontology. The sequences of contextual elements are matched approximately through edit distance, defined as the minimal cost incurred by the changes (insertion, deletion and replacement) needed to transform one sequence into the other. Our approach augments the traditional concept of edit distance with elements of linguistic and biomedical knowledge. The SOLD measure has been incorporated into MaSTerClass, a case-based reasoning (CBR) system for term classification. CBR is based on remembering specific experiences similar to the problem (case) being solved. Such an approach to bio-text mining takes advantage of the growing body of available biomedical literature, used as a case-base, since the coverage of the system automatically increases as more cases become available. Apart from term classification, the same approach can be used to mine other types of term associations, including recognition of term variants, clustering of similar terms and extraction of specific relations between terms.
Irena Spasić (2005) Term associations: from ontologies to text mining and back, a tutorial presented at the 19th International Conference of the European Federation for Medical Informatics (MIE 2005), Workshop on Terminologies and Ontologies in Biomedicine: Can Text Mining Help?, Geneva, Switzerland [pdf]

Goran Nenadić, Irena Spasić and Sophia Ananiadou (2002) What can be learnt and acquired from non-disambiguated corpora: a case study in Serbian, 7th TELRI Seminar, Dubrovnik, Croatia

Dubravka Pavličić and Irena Spasić (2001) The effects of irrelevant alternatives on decision making results, The European Operational Research Conference EURO 2001, Rotterdam, The Netherlands
The paper deals with the effects of an irrelevant alternative on the results of the Multiple Attribute Decision Making (MADM) methods. By irrelevant alternative (IA) we denote an alternative, which, although not dominated by any other alternative from the observed set, in binary comparisons made by a MADM method is worse than any of them. We observe the problem of sequential choices from a fixed group of objects by using the same criteria with constant weights during the observed period. The effects of the changes of attributes' values of an IA on the final choices are examined. Several conditions of consistent choice of the MADM methods concerning an IA are defined: Independence of Worsening of an IA, Independence of Completely Negligible Improvement of an IA, and Independence of Partially Negligible Improvement of an IA. The ELECTRE method is chosen as an illustration and it is shown that (when based on vector-normalised ratings, and not on utilities) the method violates the three conditions. Finally, we conclude that the main cause of inconsistent choices is vector normalisation of empirical data, conducted in the first step of the method.


Technical reports:

Irena Spasić (2003) Automatic term extraction in biomedicine, in Technical Reports in Computer Science, ISSN 1476-3060, Report No. 03/01, School of Sciences, University of Salford, p. 46

Irena Spasić (2002) An overview of case-based reasoning, in Technical Reports in Computer Science, ISSN 1476-3060, Report No. 02/01, School of Sciences, University of Salford, p. 75



Books:

Irena Spasić and Predrag Janičić (2000) Theory of Algorithms, Languages and Automata. Faculty of Mathematics, Belgrade, Serbia

Miodrag Ivović, Branislav Boričić, Dragan Azdejković and Irena Spasić (1998) Practice Book in Mathematics. Faculty of Economics, Belgrade, Serbia

Miodrag Ivović, Branislav Boričić, Velimir Pavlović, Dragan Azdejković and Irena Spasić (1996) Mathematics through Examples and Exercises with Elements of Theory. Faculty of Economics, Belgrade, Serbia



Search for publications in external sources:




DBLP

DBLP Bibliography Server provides bibliographic information on major computer science journals and proceedings.



PubMed

PubMed is a service of the National Library of Medicine providing access to over 12 million MEDLINE citations.



Google Scholar
Google Scholar provides a simple way to broadly search for scholarly literature.



