Tuesday, February 17, 2015

“The importance of being semantic”: Annotations vs Semantic Annotations

The focal point of many of Net7’s projects is semantic annotation. What does the term really mean, and why is the word “semantic” so important in this context?

Annotations can simply be seen as “attaching data to some other piece of data”, e.g. documents; the advantage of a semantic annotation is to have this data formally defined and machine-understandable. This offers better possibilities for searching, reusing and exploiting the annotations performed by users.

In [Oren 2006] a formal definition of annotation has been proposed. Simply put, an annotation consists of a tuple with:
  • A. the information that is annotated inside the text
  • B. the annotation itself
  • C. a predicate, which establishes a relationship between the two items above
  • D. the context in which the annotation has been made (who made it, when, its reliability, a possible limit for its validity, etc.).
This holds true for all kinds of annotation, including handwritten notes beside paper documents.
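
To make the structure concrete, here is a minimal sketch of the [Oren 2006] tuple as a plain Python data structure; the field names, the sample document and the “talksAbout” predicate are of course just assumptions made for the example:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Annotation:
    """Illustrative rendering of the [Oren 2006] annotation tuple."""
    target: str      # A. the piece of information annotated inside the text
    body: str        # B. the annotation itself (a comment, a tag, an entity URI, ...)
    predicate: str   # C. the relationship between A and B
    context: dict    # D. who made it, when, reliability, validity limits, ...

# A hypothetical annotation stating that a text fragment talks about Dante Alighieri
note = Annotation(
    target="the fragment 'Dante' at characters 120-125 of document #42",
    body="http://dbpedia.org/resource/Dante_Alighieri",
    predicate="talksAbout",
    context={"author": "mary", "created": datetime(2015, 2, 17), "reliability": 0.9},
)
print(note)
```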

A simple definition of semantic annotation is proposed in [Kiryakov 2004], which presents a vision in which “...named entities constitute an important part of the semantics of the document they are mentioned in… In a nutshell, semantic annotation is about assigning to the entities in text links to their semantic descriptions”.

A very effective definition of semantic annotation can be found on the Ontotext website: “Annotation, or tagging, is about attaching names, attributes, comments, descriptions, etc. to a document or to a selected part in a text. It provides additional information (metadata) about an existing piece of data. … Semantic Annotation helps to bridge the ambiguity of the natural language when expressing notions and their computational representation in a formal language. By telling a computer how data items are related and how these relations can be evaluated automatically, it becomes possible to process complex filter and search operations.”

This is of course a step forward with respect to traditional Information Retrieval techniques, in which documents are managed (and indexed) as a disarranged “bag of words”, with no attention to their meaning and no ability to resolve ambiguities due to synonymy or polysemy.

Ontologies (or, more simply speaking, “vocabularies”) provide a formalization of a knowledge domain. Some of them are generic (for example OpenCyc or Schema.org) and can be used to provide meaningful, albeit not domain-specific, descriptions of common facts and events.

A simple example of a "movie" ontology, taken from the Internet, is depicted below: it presents entity types/classes (Movie, Person, Character), their attributes/metadata (title, name, gender) and relationships among entities and attributes (HasCast, Is directedBy, etc). Of course for a professional use is hugely important to provide very specific vocabularies, that describe in great detail a certain domain.
Movie Ontology Example - source: dev.simantics.org/
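
To give an idea of how such a vocabulary can be written down formally, the fragment below sketches a tiny version of this movie ontology in RDF/OWL using the Python rdflib library; the namespace URI and the class/property names are made up for the example and do not correspond to any published ontology:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS, OWL

# A made-up namespace for this toy movie vocabulary
MOVIE = Namespace("http://example.org/movie-ontology#")

g = Graph()
g.bind("movie", MOVIE)

# Entity types/classes: Movie, Person, Character
for cls in (MOVIE.Movie, MOVIE.Person, MOVIE.Character):
    g.add((cls, RDF.type, OWL.Class))

# An attribute/metadata field: title (a datatype property of Movie)
g.add((MOVIE["title"], RDF.type, OWL.DatatypeProperty))
g.add((MOVIE["title"], RDFS.domain, MOVIE.Movie))

# Relationships among entities: hasCast, isDirectedBy
g.add((MOVIE.hasCast, RDF.type, OWL.ObjectProperty))
g.add((MOVIE.hasCast, RDFS.domain, MOVIE.Movie))
g.add((MOVIE.hasCast, RDFS.range, MOVIE.Character))

g.add((MOVIE.isDirectedBy, RDF.type, OWL.ObjectProperty))
g.add((MOVIE.isDirectedBy, RDFS.domain, MOVIE.Movie))
g.add((MOVIE.isDirectedBy, RDFS.range, MOVIE.Person))

print(g.serialize(format="turtle"))
```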

The key for providing effective “semantic descriptions” through annotations therefore lies in:
  • the careful definition of ontologies, that is, the vocabulary of terms, classes, predicates and properties according to which semantic annotations are performed by users;
  • the use, as much as possible, of standard ontologies, whose meaning is therefore well known and accepted. This allows the automatic interpretation of the metadata defined using them;
  • the exploitation of Linked Data in annotations. It’s quite convenient to use standard, well-known and formally defined web datasets in the annotation process. Datasets such as Wikipedia (or better, DBpedia, its Semantic Web counterpart) or Freebase provide both huge, general-purpose vocabularies of entities and terms, which can be referred to through “linking” in the annotation process, and a set of well-known ontologies, which are so common as to be considered standard and as such easily understandable by semantic-aware software agents (a minimal sketch of such a link-based annotation follows this list).
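
As a minimal sketch of the last point, the snippet below (again with rdflib) annotates an invented document fragment by linking it, through the standard dcterms:subject predicate of Dublin Core, to a DBpedia entity; any Linked Data-aware agent can then follow that link and retrieve the entity’s formal description:

```python
from rdflib import Graph, Namespace, URIRef

# A standard vocabulary (Dublin Core Terms) plus the DBpedia resource namespace
DCTERMS = Namespace("http://purl.org/dc/terms/")
DBR = Namespace("http://dbpedia.org/resource/")

g = Graph()
g.bind("dcterms", DCTERMS)
g.bind("dbr", DBR)

# Invented document fragment URI, annotated with a well-known predicate and a DBpedia entity
fragment = URIRef("http://example.org/documents/42#char=120,125")
g.add((fragment, DCTERMS.subject, DBR.Dante_Alighieri))

print(g.serialize(format="turtle"))
```
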
Given all that has been said so far, semantic annotations provide several advantages; for example:
  • they are machine understandable since, reusing the definition of [Oren 2006] presented above:
    • the predicate C is formally defined in an ontology
    • the type of the annotation B is formally defined in an ontology
    • the annotation B itself can be an entity formally defined (in an ontology or in a public dataset, e.g. Wikipedia)
    • the context D may be formally described with terms, types and entities from ontologies and public datasets.
  • they precisely define the context of the annotated document, identifying in detail the nature of the information that is under study. This can be exploited in searches, classification and more generally in every possible reuse of annotated data.
  • they open the door to inferences, that is, automatically deducing other data that relates to the annotated document and complements/enriches the annotations originally performed by a user (a toy sketch follows below).
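
As a toy illustration of this last point, once annotations are expressed as formal triples it becomes easy to derive new statements from them; real systems would use an RDFS/OWL reasoner or SPARQL rules, but the hand-written rule below (with entirely invented namespaces and data) conveys the idea:

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import XSD

EX = Namespace("http://example.org/annotations#")

g = Graph()
doc = EX.document42
g.add((doc, EX.mentionsBirthDate, Literal("1940-09-11", datatype=XSD.date)))

# Toy inference rule: if a document mentions a birth date,
# then it talks about a person's birth and refers to that year.
for subject, _, date in list(g.triples((None, EX.mentionsBirthDate, None))):
    g.add((subject, EX.topic, EX.Birth))
    g.add((subject, EX.referencedYear, Literal(str(date)[:4], datatype=XSD.gYear)))

for triple in sorted(g):
    print(triple)
```
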
The importance of semantic annotations in text can be better appreciated through an example. Consider the following three sentences:
  • On the morning of September 11 Australian swimming legend Ian Thorpe was on his way to the World Trade Centre for a business meeting.
  • Born on September 11 writer/director Brian De Palma’s career began to take off in the 1970s with the horror classic Carrie, based on a Stephen King novel.
  • On September 11, President Salvador Allende of Chile was deposed in a violent coup led by General Augusto Pinochet.
Even if they all contain a reference to the same day (September 11), its actual semantics is very different in each: the first is an Event (in 2001), the second a Birth date (1940), and the last one refers both to a Death date (Allende’s) and to an Event (Pinochet’s coup in Chile in 1973).

A simple textual indexing of these three texts would identify the “September 11” fragments, without any understanding of their meaning or of the actual years they refer to. On the contrary, by simply annotating them with a link to the corresponding Wikipedia entities (namely September 11 attacks, Brian De Palma, 1973 Chilean coup d'état) and the use of specific predicates (e.g. dbpedia-owl:eventDate, dbpedia-owl:birthDate, dbpedia-owl:deathDate), one can automatically infer the correct contexts, plus much more data.
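
Here is a hedged sketch of what such annotations could look like as RDF triples, and of how software could then tell the three dates apart (rdflib again; the document URIs are invented and the modelling is deliberately simplified to follow the description above rather than the exact ranges of the DBpedia ontology):

```python
from rdflib import Graph, Namespace

DBO = Namespace("http://dbpedia.org/ontology/")   # the "dbpedia-owl" prefix
DBR = Namespace("http://dbpedia.org/resource/")
EX = Namespace("http://example.org/news/")        # invented document URIs

g = Graph()
g.bind("dbo", DBO)
g.bind("dbr", DBR)

# Sentence 1: "September 11" is the date of an event (the 2001 attacks)
g.add((EX.sentence1, DBO.eventDate, DBR.September_11_attacks))
# Sentence 2: "September 11" is Brian De Palma's birth date
g.add((EX.sentence2, DBO.birthDate, DBR.Brian_De_Palma))
# Sentence 3: "September 11" is both Allende's death date and the date of the 1973 coup
g.add((EX.sentence3, DBO.deathDate, DBR.Salvador_Allende))
g.add((EX.sentence3, DBO.eventDate, DBR["1973_Chilean_coup_d'état"]))

# Unlike a plain "bag of words" index, the annotated graph can be queried by meaning:
query = "SELECT ?doc ?entity WHERE { ?doc dbo:birthDate ?entity }"
for doc, entity in g.query(query, initNs={"dbo": DBO}):
    print(doc, "mentions the birth date of", entity)
```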

Finally, annotations can be performed either manually by users or automatically, using software services that can identify terms in a text and associate them, through a specific predicate, with the entities of a controlled dictionary or of a linked dataset.
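
Just to illustrate the idea (and not the actual behaviour of any real service), an automatic annotator can be sketched as a function that scans a text against a small gazetteer of known surface forms and emits an annotation for every match; the dictionary, predicate and offsets below are all invented:

```python
# Toy entity spotter: a simplified stand-in for a real NER / entity-linking service.
GAZETTEER = {
    "Brian De Palma": "http://dbpedia.org/resource/Brian_De_Palma",
    "Salvador Allende": "http://dbpedia.org/resource/Salvador_Allende",
    "Augusto Pinochet": "http://dbpedia.org/resource/Augusto_Pinochet",
}

def annotate(text, predicate="talksAbout"):
    """Return (surface form, offset, predicate, entity URI) tuples for every known term."""
    annotations = []
    for surface, uri in GAZETTEER.items():
        start = text.find(surface)
        if start != -1:
            annotations.append((surface, start, predicate, uri))
    return annotations

sentence = ("On September 11, President Salvador Allende of Chile was deposed "
            "in a violent coup led by General Augusto Pinochet.")
for ann in annotate(sentence):
    print(ann)
```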

Why all this talk about semantic annotations? Well, the reason has a name: Pundit!

Pundit is a web tool that allows users to create semantic annotations on web pages and fragments of text through an easy-to-use interface. Pundit is the foundation stone of the PunditBrain service in the StoM project and of many other Net7 initiatives, both research-oriented and commercial.

Although Pundit is mainly used for manual annotations, it already supports automatic entity recognition through several software services, including DataTXT, a commercial service by SpazioDati whose main development was carried out in the SenTaClAus research project, in which Net7 was also involved.

The 2.0 version of Pundit, currently (February 2015) under development and in "alpha", can be tested on the project web site. Hopefully a demo version of PunditBrain will also be released to the public soon.


Bibliography

[Kiryakov 2004] A. Kiryakov, B. Popov, I. Terziev, D. Manov, D. Ognyanoff, “Semantic Annotation, Indexing, and Retrieval”, Journal of Web Semantics, 2(1), 2004.
[Oren 2006] E. Oren, K. Möller, S. Scerri, S. Handschuh, M. Sintek, “What are Semantic Annotations?”, Technical Report, DERI Galway, 2006.

2 comments:

Amit Sheth said...

If you want to see earlier robust semantic annotation efforts (one million documents semantically annotated per hour per server - in 2000-2002 timeframe in a commercial system), check out: http://knoesis.org/library/resource.php?id=00113

lucadex said...

Thank you Amit!
Looks like a very powerful approach. I particularly appreciated the "pipelined" architecture, which should ensure, as you were saying, a remarkable throughput.
At the same time it seems to me that, in order to obtain the best results, a careful design of the ontology is needed to perform effective Named Entity Recognition (NER).
While this of course holds true if you want to analyse texts in specific domains (finance, law, medicine, science, ...), for "generic" NER needs the use of Wikipedia as a controlled vocabulary/general ontology seems quite effective. This is exactly the case of the NER service DataTXT, which was in part refactored/developed in a research project that we at Net7 had the chance to participate in.
Thanks again and keep up your good work!
L.