I'm just back from Bologna, where my colleagues and I attended the Italian Drupal Day conference. We at Net7 are working on several Drupal-based projects (the latest I've managed being the website of Scuola Sant'Anna, one of the most prestigious universities in Italy).
We decided to give a presentation on the Innolabsplus.eu project, in which we exploited Dandelion's semantic API to completely automate the work of an editorial team. Software services fetch articles from more than 40 websites (in Italian, English and French) and analyse their texts using Dandelion's Named Entity Extraction and Automatic Classification services. If an article matches the portal's topics of interest, it is automatically published; otherwise it is discarded.
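For readers curious about the mechanics, here is a minimal Python sketch of the publish-or-discard decision. It is not the Innolabsplus.eu code: the endpoint and response shape are assumptions based on Dandelion's public documentation, the token and the topic list are invented, and the real pipeline also uses the Automatic Classification service.

```python
import requests

# NOTE: endpoint, parameters and response shape are assumptions taken from the
# public Dandelion documentation; check https://dandelion.eu before relying on them.
DANDELION_NEX_URL = "https://api.dandelion.eu/datatxt/nex/v1"
DANDELION_TOKEN = "YOUR_TOKEN"                                 # hypothetical credential
TOPICS_OF_INTEREST = {"Innovation", "Research", "Start-up"}    # hypothetical portal topics

def extract_entity_labels(text):
    """Return the set of entity labels the service finds in the text."""
    resp = requests.post(
        DANDELION_NEX_URL,
        data={"text": text, "token": DANDELION_TOKEN},
        timeout=30,
    )
    resp.raise_for_status()
    return {ann.get("label", "") for ann in resp.json().get("annotations", [])}

def should_publish(article_text):
    """Publish only if the article mentions at least one topic of interest."""
    return bool(extract_entity_labels(article_text) & TOPICS_OF_INTEREST)

if __name__ == "__main__":
    sample = "A new start-up incubator has opened in Pisa to support research spin-offs."
    print("publish" if should_publish(sample) else "discard")
```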
The site has been in production for several months now, publishing hundreds and hundreds of selected articles, in three languages, without a hiccup and without any manual intervention.
The Drupal Day slides follow (in Italian). Enjoy!
Saturday, December 05, 2015
Wednesday, August 05, 2015
My Nerdie Bookshelf - "Linked Data - Structured data on the web" by David Wood, Marsha Zaidman, Luke Ruth and Michael Hausenblas
This book has been a bit of a disappointment to me, the first I've had from Manning Publications.
Despite being published in 2014, it gives the impression that the information it provides is stale. Only the final chapter ("The evolving Web") offers a comprehensive, well-written and up-to-date viewpoint on Linked Data (and the Semantic Web), although in concise form.
The foreword by Tim Berners-Lee and the collaboration with Michael Hausenblas lured me into buying the book sight unseen. I was looking for insights into what, in my view, is a powerful use case for Linked Data that hasn't been addressed enough, namely Semantic Enterprise Data Integration, hoping to get, as is common for Manning books, a lot of advanced technical information. In particular I was, and still am, looking for technical advice, integration patterns and product reviews that can guide me in using Semantics to effectively interconnect enterprise data silos.
The book, on the other hand, revolves around a different perspective, that of a data publisher with little (if any) notion of the technology behind Linked Data. It therefore presents all the basic concepts at a quite simple level.
This is of course a legitimate editorial choice, but what annoyed me most was that the information provided is often outdated. There is no mention of JSON-LD or of the Linked Data Platform principles; CKAN, a widely used platform for creating open data repositories, is cited only in connection with the DataHub site. Moreover, the motivations, advantages, pros and cons of working with Linked Data are presented in a very basic, if not superficial, way.
The mention of Callimachus, the "Linked Data application server" created by the authors, left me unimpressed as well, even if it is correct to say that it has been used in interesting projects.
Labels: Linked Open Data, My Nerdie Bookshelf, Semantic Web
Friday, June 05, 2015
Introducing Social Proxy
I've finally published on SlideShare the presentation of Social Proxy, a project I've been working on since 2010.
If you ask, "why this platform and not HootSuite or Radian6?", well, I think it still has some strengths, even though our (huge!) competitors have received tons of VC funding over the years, while Social Proxy has basically been developed through a series of orders (some very small) from our customers. In fact:
1. Social Proxy offers, in a single SaaS offering, plenty of features that you would otherwise only get by acquiring multiple services: Social Media management (à la Hootsuite) and Social Media analysis (see Radian6). It is certainly less advanced than these famous competitors but... it still performs more than nicely!
2. For Net7, Social Proxy is a framework that can be easily extended when a customer needs a new, custom feature. For example, the Drupal website Innolabsplus.eu doesn't have an editorial team behind it. It presents content automatically fetched and "cleaned" through the Social Proxy: dozens of RSS feeds are scanned, the linked pages are retrieved and their content is extracted, keeping the main text and removing all the decorative parts (a minimal sketch of this step follows below). Through web services, the Drupal site fetches and publishes the curated content.
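To give an idea of the "cleaning" step mentioned above, here is a minimal, purely illustrative Python sketch of a naive boilerplate remover. The real Social Proxy pipeline is considerably more sophisticated; the tag list and the length threshold below are arbitrary assumptions.

```python
from html.parser import HTMLParser

class MainTextExtractor(HTMLParser):
    """Naive boilerplate remover: keep text outside 'decoration' tags,
    drop scripts, styles, navigation, headers and footers."""
    SKIP = {"script", "style", "nav", "header", "footer", "aside", "form"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only reasonably long fragments: short ones are usually menus or labels.
        if self.skip_depth == 0 and len(text) > 80:
            self.chunks.append(text)

def extract_main_text(html):
    parser = MainTextExtractor()
    parser.feed(html)
    return "\n\n".join(parser.chunks)

if __name__ == "__main__":
    page = "<html><nav>Home | About</nav><p>" + "A long paragraph " * 10 + "</p></html>"
    print(extract_main_text(page))
```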
Of course competitors provide APIs but the amount of things that you can do with them is limited.
Anyway, here are the slides: enjoy the reading! Other information on Social Proxy (in Italian) can be read here.
Monday, April 27, 2015
There’s Semantics in this Web
I was asked by Dr. Serena Pezzini of the CTL department of the Scuola Normale Superiore of Pisa to give a presentation on the Semantic Web on April 16th (beside is a photo of me taken at the event). The slides, in Italian, are available on SlideShare: preparing them was quite an interesting process, so I thought I would share it here on the blog, this time in English.
This presentation was for me like opening up the legendary Pandora's Box: it ignited a reflection on what we as a company actually do with the Semantic Web. Net7 has in fact always characterized itself as a "Semantic Web Company".
At about the same time I was contacted both by CTL and by a partner company to talk about this subject. On the one hand, CTL expected suggestions and stimuli for using Semantic Web technologies in their Digital Humanities work; on the other, the partner company was looking for professional training on these topics.
For five seconds I went into autopilot mode and started to think about explaining the Semantic Web in the standard fashion (RDF, ontologies, triple stores, SPARQL, RDFS, OWL... you get the idea). Then three questions sprang to mind.
The first: do these people really need this kind of information? Are they really going to use all of it in their daily jobs?
The second question is a bit discomforting: do we at Net7 really use the whole stack of Semantic Web technologies fully and, above all, consciously?
The third is even more serious: what is the current state of the art of the Semantic Web? Is it still an important technology, with practical uses even for medium and small-sized projects, or should it remain confined to the Empyrean of research and huge knowledge management initiatives?
So it was really important for me to try to answer these questions in the presentation, to cover topics that could be interesting and useful to the audience and, at the same time, to put my own knowledge of the field in a new perspective.
The presentation therefore came out as a reflection on the possible uses and advantages of "Semantics in the web", first and foremost for myself, to put my thoughts in order, in the hope that it can be useful to others as well. I tried to take a step back in order, hopefully, to move further forward.
To prepare it I read a great deal of material (see the bibliography at the end) and was heavily influenced by the presentations and articles of Jim Hendler (not to mention the fantastic book "Semantic Web for the Working Ontologist" that he co-authored). So, even if you will never read these lines, thank you very much Dr. Hendler for your insightful thoughts!
Coming back to my presentation, it is no accident that I used the phrase "Semantics in the Web" rather than "Semantic Web" in the title. In the light of all the reading I did, Semantics seems to me more important than the technology behind it.
I started the presentation with a small historical digression, from the very first vision of the World Wide Web in Tim Berners-Lee's original 1989 proposal, up to the seminal 2001 article in Scientific American in which Berners-Lee, James Hendler and Ora Lassila presented the Semantic Web.
I continued by explaining the key concepts of the Semantic Web, which served to show how Semantics, despite the Semantic Web's vocal critics, can still claim huge success stories on today's web.
The funny thing is that the Semantic Web's vision didn't exactly materialize as its inventors expected. On the one hand, it is fundamental to understand that things in web history often happen through serendipity. On the other, it is crucial to always keep in mind Jim Hendler's motto, "a little semantics goes a long way". Indeed, just a small portion of the Semantic Web "pyramid" (see slide 42 of my presentation, taken from a Jim Hendler keynote) sees recurring use, while the rest (inference and the most sophisticated OWL constructs included) still has limited diffusion or is relegated to high-end research initiatives.
So the Semantic Web hasn't failed, but it materialized a bit differently than expected. One should therefore really think of Semantics first, that is, exploit the knowledge that can be extracted from documents, linked data repositories and machine-readable annotations in web pages (SEO metadata included), before worrying about the orthodox application of the complete stack of Semantic Web technologies.
The Semantic Web remains, on the other hand, a promising and in some respects unexplored territory. While I honestly don't see it as a key technology for powering web portals (there are plenty of more mature technologies, even open source ones, think of Drupal or Django, that fit this purpose better), the idea of managing information through graphs makes a lot of sense in several areas, including:
- knowledge management with highly interconnected data (think of Social Network relationships). Here the capacity of triple stores to handle big graph data will really make the difference, especially if an open source product can be used for this purpose (we at Net7 have recently bet on Blazegraph and, while we have been quite satisfied so far, it must also be said that our graphs are not exactly "that big"). There is no doubt that solid open source products are fundamental to boosting the adoption of specific technologies and software architectures (think of LAMP). A minimal query sketch follows this list.
- extraction of structured data from text: a great classic Semantic Web use case indeed
- linking independent repositories of information, implemented with traditional technologies in multiple legacy systems (another Semantic Web classic).
- raw data management and dissemination, where data should be both formally described in great detail and openly distributed, after a specific anonymization process to remove "sensitive information".
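As a concrete taste of what querying such a graph looks like, here is a minimal Python sketch that sends a query over the standard SPARQL HTTP protocol. The endpoint URL is an assumption (a local Blazegraph instance, for example, typically exposes something similar), and the FOAF-style social graph in the query is invented.

```python
import requests

# Hypothetical endpoint of a local triple store (e.g. a Blazegraph namespace).
SPARQL_ENDPOINT = "http://localhost:9999/blazegraph/sparql"

# Example query over an assumed FOAF-style social graph:
# "who are the friends of the friends of a given person?"
QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?fof WHERE {
  <http://example.org/people/alice> foaf:knows ?friend .
  ?friend foaf:knows ?fof .
}
LIMIT 20
"""

def run_query(endpoint, query):
    # SPARQL 1.1 Protocol: POST the query and ask for JSON results.
    resp = requests.post(
        endpoint,
        data={"query": query},
        headers={"Accept": "application/sparql-results+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    results = run_query(SPARQL_ENDPOINT, QUERY)
    for binding in results["results"]["bindings"]:
        print(binding["fof"]["value"])
```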
I concluded my slides by noting that Semantics is more and more becoming a commodity, offered through specialized cloud services. Named Entity Recognition SaaS offerings, SpazioDati's DataTXT and AlchemyAPI included, are a consolidated reality, and cloud Machine Learning services are becoming mainstream (see this insightful article on ZDNet in this regard). Developers can therefore enjoy "a little semantics" in their applications without embracing the Semantic Web in full. As Jim Hendler says... a little semantics goes a long way!
Bibliography
- Tim Berners-Lee, James Hendler and Ora Lassila: The Semantic Web, Scientific American May 2001
- Dean Allemang, James Hendler: Semantic Web for the Working Ontologist 2nd Edition, Morgan Kaufmann, 2011
- James Hendler: The Semantic Web: It’s for real http://www.slideshare.net/jahendler/semantic-web-what-it-is-and-why-you-should-care
- Dominiek ter Heide: Three reasons why the Semantic Web has failed https://gigaom.com/2013/11/03/three-reasons-why-the-semantic-web-has-failed/
- Seth Grimes: Semantic Web Business: Going Nowhere Slowly http://www.informationweek.com/software/information-management/semantic-web-business-going-nowhere-slowly/d/d-id/1113323
- Clay Shirky: Ontology is Overrated: Categories, Links, and Tags http://www.shirky.com/writings/ontology_overrated.html
- Michela Finizio: Il miraggio dell’anagrafe unica: più di 54mila banche dati gestite dalla Pa http://www.infodata.ilsole24ore.com/2015/03/11/il-miraggio-dellanagrafe-unica-piu-di-54mila-banche-dati-gestite-dalla-pa/
- James Hendler: “Why the Semantic Web will Never Work” (note the quote marks!) http://www.slideshare.net/jahendler/why-the-semantic-web-will-never-work
- James Hendler: Semantic Web: The Inside Story http://www.slideshare.net/jahendler/semantic-web-the-inside-story
- James Hendler: The Dark Side of the Semantic Web, IEEE Intelligent Systems, Jan/Feb 2007
- Tim Berners-Lee: Raw data, now http://www.wired.co.uk/news/archive/2012-11/09/raw-data
- Neelie Kroes: Digital Agenda and Open Data http://europa.eu/rapid/press-release_SPEECH-12-149_en.htm
- Google: Introducing the Knowledge Graph https://www.youtube.com/watch?v=mmQl6VGvX-c
- Kevan Lee: What Really Happens When Someone Clicks Your Facebook Like Button https://blog.bufferapp.com/facebook-like-button
- Vestforsk.no: Semantic Markup Report http://www.vestforsk.no/filearchive/semantic_markup_report.pdf
- European Commission: Guidelines on Open Access to Scientific Publications and Research Data in Horizon 2020 http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-pilot-guide_en.pdf
Tuesday, February 17, 2015
“The importance of being semantic”: Annotations vs Semantic Annotations
The focal point of many of Net7's projects is semantic annotation. What does it really mean, and why is the term "semantic" so important in this context?
Annotations can simply be seen as "attaching data to some other piece of data", e.g. documents; the advantage of a semantic annotation is that this data is formally defined and machine-understandable. This offers better possibilities for searching, reusing and exploiting the annotations performed by users.
In [Oren 2006] a formal definition of annotation is proposed. Simply speaking, it consists of a tuple with (a minimal code sketch follows the list):
- A. the information that is annotated inside the text
- B. the annotation itself
- C. a predicate, that establishes a relationship amongst the two points above
- D. the context in which the annotation has been made (who made it, when, its reliability, a possible limit for its validity, etc).
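As a purely illustrative aid, here is a minimal Python sketch of how such a tuple could be represented in code; the field names are my own and simply map onto points A to D above, they do not come from [Oren 2006].

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Annotation:
    """A semantic annotation as a tuple, loosely following [Oren 2006]:
    A. the annotated fragment, B. the annotation body,
    C. the predicate linking them, D. the context of the annotation act."""
    target: str     # A. the information that is annotated inside the text
    body: str       # B. the annotation itself (e.g. a URI or a comment)
    predicate: str  # C. the relationship between target and body
    author: str     # D. context: who made the annotation
    created: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# Example: linking a text fragment to a DBpedia entity.
note = Annotation(
    target="Yellen",
    body="http://dbpedia.org/resource/Janet_Yellen",
    predicate="http://www.w3.org/2002/07/owl#sameAs",
    author="luca",
)
print(note)
```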
A simple definition of semantic annotation is proposed in [Kiryakov 2004], which presents a vision whereby "...named entities constitute an important part of the semantics of the document they are mentioned in… In a nutshell, semantic annotation is about assigning to the entities in text links to their semantic descriptions".
A very effective definition of semantic annotation can be found on the Ontotext web site: "Annotation, or tagging, is about attaching names, attributes, comments, descriptions, etc. to a document or to a selected part in a text. It provides additional information (metadata) about an existing piece of data. … Semantic Annotation helps to bridge the ambiguity of the natural language when expressing notions and their computational representation in a formal language. By telling a computer how data items are related and how these relations can be evaluated automatically, it becomes possible to process complex filter and search operations."
This is of course a step forward with respect to traditional Information Retrieval techniques, in which documents are managed (and indexed) as a disarranged "bag of words", with no attention to their meaning and no ability to resolve ambiguities due to synonymy or polysemy.
Ontologies (or more simply speaking “vocabularies”) provide a formalization of a knowledge domain. Some of them are generic (like for example OpenCyc or Schema.org) and can be used to provide meaningful, albeit not domain-specific, descriptions of common facts and events.
A simple example of a "movie" ontology, taken from the Internet, is depicted below: it presents entity types/classes (Movie, Person, Character), their attributes/metadata (title, name, gender) and the relationships among entities and attributes (hasCast, isDirectedBy, etc.). Of course, for professional use it is hugely important to provide very specific vocabularies that describe a certain domain in great detail.
[Figure: example of a "movie" ontology. Source: dev.simantics.org]
The key for providing effective “semantic descriptions” through annotations therefore lies in:
- the careful definition of ontologies, that is the vocabulary of terms, classes, predicates and properties according to which semantic annotations are performed by users;
- the use, as much as possible, of standard ontologies, whose meaning is therefore well-known and accepted. This allows the automatic interpretation of the metadata defined using them;
- the exploitation of Linked Data in annotations. It is quite convenient to use standard, well-known and formally defined web datasets in the annotation process. Datasets such as Wikipedia (or better, DBpedia, its Semantic Web version) or Freebase provide both huge, general-purpose vocabularies of entities and terms, which can be referred to through "linking" in the annotation process, and a set of well-known ontologies that are so common as to be considered standard and, as such, easily understandable by semantic-aware software agents.
Annotations created this way are truly "semantic" because:
- they are machine-understandable since, reusing the definition of [Oren 2006] presented above:
- the predicate C is formally defined in an ontology
- the type of the annotation B is formally defined in an ontology
- the annotation B itself can be an entity formally defined (in an ontology or in a public dataset, eg Wikipedia)
- the context D may be formally described with terms, types and entities from ontologies and public datasets.
- they precisely define the context of the annotated document, identifying in detail the nature of the information that is under study. This can be exploited in searches, classification and more generally in every possible reuse of annotated data.
- they open the door to inferences, that is, automatically deducing other data that relates to the annotated document and complements/enriches the annotations originally performed by a user.
Consider, for example, these three sentences:
- On the morning of September 11 Australian swimming legend Ian Thorpe was on his way to the World Trade Centre for a business meeting.
- Born on September 11 writer/director Brian De Palma’s career began to take off in the 1970s with the horror classic Carrie, based on a Stephen King novel.
- On September 11, President Salvador Allende of Chile was deposed in a violent coup led by General Augusto Pinochet.
A simple textual indexing of these three texts would identify the "September 11" fragments without any understanding of their meaning or of the actual years they refer to. On the contrary, by simply annotating them with links to the corresponding Wikipedia entities (namely September 11 attacks, Brian De Palma and 1973 Chilean coup d'état) and using specific predicates (e.g. dbpedia-owl:eventDate, dbpedia-owl:birthDate, dbpedia-owl:deathDate), one can automatically infer the correct contexts, plus much more data.
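As a sketch of what those annotations could look like in practice, here is a minimal example built with the rdflib library and the Web Annotation (oa) vocabulary; the document fragment URIs are invented for illustration only.

```python
from rdflib import BNode, Graph, Namespace, URIRef

OA = Namespace("http://www.w3.org/ns/oa#")   # Web Annotation vocabulary
DBR = "http://dbpedia.org/resource/"
DOC = "http://example.org/doc/"              # hypothetical document fragment URIs

g = Graph()
g.bind("oa", OA)

# Each "September 11" fragment is linked to the entity it actually refers to.
links = [
    (DOC + "thorpe-article#sept-11", DBR + "September_11_attacks"),
    (DOC + "depalma-profile#sept-11", DBR + "Brian_De_Palma"),
    (DOC + "allende-article#sept-11", DBR + "1973_Chilean_coup_d'état"),
]
for target, entity in links:
    ann = BNode()                                  # one annotation per fragment
    g.add((ann, OA.hasTarget, URIRef(target)))     # the annotated text fragment
    g.add((ann, OA.hasBody, URIRef(entity)))       # the DBpedia entity it refers to

print(g.serialize(format="turtle"))
```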
Finally, annotations can be performed manually by users or automatically, using software services that identify terms in a text and associate them, through a specific predicate, with entities of a controlled dictionary or of a linked dataset.
Why all this talking about Semantic Annotations? Well, the reason has a name: Pundit!
Pundit is a web tool that allows users to create semantic annotations on web pages and fragments of text, with an easy-to-use interface. Pundit is the foundation stone of the PunditBrain service in the StoM project and of many other Net7 initiatives, both research-oriented and commercial.
Although Pundit is mainly used for manual annotations, it already supports automatic entity recognition by using several software services, including DataTXT, a commercial service of SpazioDati whose main development has been carried out in the SenTaClAus research project in which Net7 was also involved.
The 2.0 version of Pundit, currently (February 2015) under development and in "alpha", can be tested on the project web site. Hopefully a demo version of PunditBrain will soon be released to the public as well.
Bibliography
- [Oren 2006] What are Semantic Annotations?, Oren and others, 2006 - http://www.siegfried-handschuh.net/pub/2006/whatissemannot2006.pdf
- [Kiryakov 2004] Semantic annotation, indexing, and retrieval, Kiryakov and others, 2004 - http://infosys3.elfak.ni.ac.rs/nastava/attach/SemantickiWebKurs/sdarticle.pdf
Sunday, October 05, 2014
Semantic Web and Semantic Annotation for newbies: a very introductory presentation of the StoM project
I was asked to write a simple presentation of the StoM project and the technologies behind it. What follows is the final result: I decided to post it here, so that it can hopefully be reused when my friends ask me (as they do very often indeed) what the Semantic Web is and, especially, what the heck a semantic annotation is!
I dearly hope my attempt to explain these concepts clearly has succeeded: to paraphrase one of Einstein's most famous quotes, if you can explain a concept in a simple way, it means you have really understood it... Did I?
StoM reuses the results of a previous EU funded research project named SemLib, in which two main outcomes were produced: an annotation system and a semantic recommender. Both exploit Semantic Web technologies in their internal mechanisms.
First of all, the Semantic Web is about adding explicit "meaning" to objects and services on the web: this meaning is expressed through metadata that can be unambiguously interpreted by software agents. It is about making software systems automatically understand each other.
This should not be confused with Artificial Intelligence: there is no "imitation" of the cognitive processes of the human brain, but simply the use of formal metadata to "better describe" a certain context in which a piece of software operates. This "better description" consists of multiple ingredients:
1. using a specific convention to express information. This can be seen as the syntax for describing things. In particular, the Semantic Web has at its foundation a model based on "triples", that is, assertions of the form:
< Subject > < Predicate > < Object >

Some examples:
< X > < is a > < Document >
< X > < has author > < Luca De Santis >
< Luca De Santis > < is a > < Person >
< Luca De Santis > < works for > < Net7 >
< Net7 > < is a > < Company >

Albeit very simple in nature, this model is incredibly expressive and results in the creation of "a graph of facts". "Deduction" by software agents is obtained by navigating and querying this graph. To better explain this concept I borrowed the following two images, both taken from the brilliant presentation Semantic Data Management in Graph Databases by Maribel Acosta: they show a graph that models authors, papers, conferences and the logical relationships amongst these concepts, and how queries are resolved by navigating the graph structure.
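As a small, concrete illustration of this "graph of facts", here is a minimal Python sketch (using the rdflib library, with an invented example namespace) that stores the assertions above and answers a question by navigating the graph.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import FOAF, RDF

EX = Namespace("http://example.org/")     # invented namespace for the example

g = Graph()
doc, luca, net7 = EX.X, EX.LucaDeSantis, EX.Net7

g.add((doc, RDF.type, EX.Document))       # < X > < is a > < Document >
g.add((doc, EX.hasAuthor, luca))          # < X > < has author > < Luca De Santis >
g.add((luca, RDF.type, FOAF.Person))      # < Luca De Santis > < is a > < Person >
g.add((luca, EX.worksFor, net7))          # < Luca De Santis > < works for > < Net7 >
g.add((net7, RDF.type, EX.Company))       # < Net7 > < is a > < Company >

# "Deduction" by navigation: for which companies do the authors of documents work?
query = """
PREFIX ex: <http://example.org/>
SELECT ?doc ?company WHERE {
  ?doc ex:hasAuthor ?author .
  ?author ex:worksFor ?company .
}
"""
for row in g.query(query):
    print(row.doc, "was written by someone working for", row.company)
```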
2. using a standard formalism to express triples, in particular the RDF (Resource Description Framework) modeling language. This can be seen as the grammar for describing things.
3. using "vocabularies" to express concepts, possibly those accepted as standards. This is incredibly important: for software agents to "understand" concepts and assertions, they must share knowledge of the domain (isn't it the same for humans?).
Multiple standard vocabularies are available for representing specific knowledge domains and use cases. For example:
- Dublin Core can describe document metadata (eg author, title, date of creation etc);
- FOAF and SIOC can describe persons and relationships amongst them;
- SKOS can be used to define classification rules (lists of tags, taxonomies, etc);
- GoodRelations formally describes e-commerce scenarios;
- Schema.org is a very rich vocabulary that can describe a huge number of "concepts" and their "properties", including Creative Works (Book, Movie, MusicRecording, Recipe, TVSeries, ...), Events, Organizations, Persons, etc. It was in fact introduced by big search vendors like Google, Yahoo and Microsoft/Bing to support Search Engine Optimization (SEO).
5. linking information/metadata together, to enrich the amount of available information by exploiting what has already been specified elsewhere.
For example, the triple:
< http://www.netseven.it/persone#LucaDeSantis >
< cito:likes >
< http://dbpedia.org/resource/Nine_Inch_Nails >
contains two real, navigable sets of information in its Subject and Object (together with the property "likes" from the Citation Typing Ontology vocabulary). One can exploit the semantic information at the two specified URLs to infer facts about the Subject (Luca De Santis is a Person, works for Net7 srl, his job title is IT Consultant and his contact point is his Twitter account, https://twitter.com/#!/lucadex) and about the Object (Nine Inch Nails is an Industrial Rock music group).
The Semantic Web has been one of the most discussed technologies for more than a decade (the term was introduced in a seminal article by Tim Berners-Lee in Scientific American in 2001!). After the hype cooled down (so much that many people completely lost interest in it), it has gained wide adoption in several fields, including search (see below), SEO and the integration of content with Social Networks. The Open Data movement has also boosted interest in the Semantic Web, because it is obviously important to describe free datasets in terms that can be easily understood by automatic software agents. Likewise, the widespread adoption of API services by software vendors could in the near future be strictly linked to the Semantic Web, since its technologies can provide better descriptions of application services and of the data they ingest and produce.
Very often the Semantic Web is associated with search: albeit this sometimes happens a bit too lightly (when the term is confused with natural language processing), it is also true that the Semantic Web can really empower enterprise search. While normal search engines can only recognize word occurrences, with semantics it becomes possible to identify "concepts" and disambiguate synonyms (e.g. FED vs Federal Reserve System; Wall Street vs the New York Stock Exchange) or words with multiple meanings (Rock: music or geology? Wall Street: the NY district? The movie? The stock exchange?). It is no accident that Google is investing a lot in its Knowledge Graph, which exploits semantic information to enhance the search engine's results.
The products at the basis of StoM exploit Semantic Web technologies at their very core.
In particular, the Semantic Annotation System allows users to add annotations to textual documents published on the web. Annotations can be seen as the digital equivalent of applying textual notes, underlining and highlights to paper documents or books (the so-called "marginalia").
Annotations can be applied to a web page using the familiar metaphor of adding marginalia to documents (see, in this regard, the brilliant W3C Web Annotation Architecture animation). They can be extremely beneficial for those who must manually process a great many documents (think of students, Digital Humanities scholars or professional categories like lawyers).
Making use of Semantic technologies, it is possible to enrich the meaning of annotations and describe them, at least “internally”, through formal assertions. This way annotations are no longer simple textual comments but become statements (that is, triples) that are stored in a central repository. Formal, structured data can be therefore assigned to textual documents, which are naturally an unstructured form of data.
It is crucial that the annotation process is simple and intuitive for users: for example, one should be able to apply formal semantics to a web page by simply highlighting a piece of text and declaring its type in a few steps (e.g. "Yellen" is a Person), or by linking it to an entry of a public dataset like DBpedia/Wikipedia (e.g. "Yellen" "is the same as" http://dbpedia.org/resource/Janet_Yellen). The latter action is very useful: this way the system can automatically "import" all the meaningful references stored in the remote dataset (like the fact that Janet Yellen is a woman and that she is the Chair of the FED) to enrich the available knowledge.
Annotations can help semantic search engines provide more suitable results. For example, "power searches" that these systems can perform include:
- fetch all documents that talk about women;
- fetch all documents that refer the Federal Reserve System.
The document in the example above will be returned in both cases if the search engine can make use of the annotation on the word "Yellen", even if the rest of the text doesn't mention Janet Yellen's gender or her role at the FED.
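A minimal, purely illustrative sketch of the idea behind such "power searches": documents carry entity annotations, entities carry facts imported from a public dataset, and the search filters on those facts rather than on the words of the text. All data and property names below are invented.

```python
# Facts "imported" from a public dataset about the annotated entities
# (values are illustrative, in the spirit of the Janet Yellen example).
ENTITY_FACTS = {
    "dbpedia:Janet_Yellen": {"type": "Person", "gender": "female",
                             "affiliation": "dbpedia:Federal_Reserve_System"},
    "dbpedia:Federal_Reserve_System": {"type": "Organisation"},
}

# Documents together with the entities they have been annotated with.
DOCUMENTS = {
    "doc-1": {"text": "Yellen hinted at a rate change...",
              "entities": ["dbpedia:Janet_Yellen"]},
    "doc-2": {"text": "The quarterly report is out...", "entities": []},
}

def search(predicate):
    """Return documents whose annotated entities satisfy a predicate on their facts."""
    hits = []
    for doc_id, doc in DOCUMENTS.items():
        if any(predicate(ENTITY_FACTS.get(e, {})) for e in doc["entities"]):
            hits.append(doc_id)
    return hits

# "Fetch all documents that talk about women"
print(search(lambda facts: facts.get("gender") == "female"))
# "Fetch all documents that refer to the Federal Reserve System"
print(search(lambda facts: facts.get("affiliation") == "dbpedia:Federal_Reserve_System"))
```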
Semantic Annotations therefore allow better management of the knowledge that can be extracted from digital documents and open the way to a more fine-grained reuse of this knowledge.
The efforts in StoM for the Annotation System will be mainly concentrated on four areas:
1. Usability: providing an annotation tool that is very simple and intuitive to use. Users should be able to apply semantic annotations with the same conceptual effort as highlighting sentences in a textual document with different colours.
2. Facilitating the annotation process by transparently interpreting metadata in the page (title, language, SEO descriptions, ...) and by integrating entity extraction services, like SpazioDati's DataTXT. This way annotations can be automatically created and presented to the user for approval or rejection.
3. Providing a comprehensive environment to manage the annotation of web documents. Users can create “notebooks” to store annotations, share them with friends and colleagues, search amongst them and export their data in various formats (even Office Documents) for further reuse.
4. Proposing predefined use case scenarios, to address the needs of specific categories of users. For example, scholars or lawyers, when logged in, will find preloaded all the vocabularies that specifically refer to their professions.
StoM's Semantic Recommender, on the other hand, uses Semantic Web technologies to provide better recommendations to users. This can be valuable in multiple scenarios, from e-commerce ("You bought this: try that") to news sites ("You read this: check also that").
The system uses the information of public semantic web data sources (like Wikipedia/DBpedia, Europeana, Jamendo, etc.) to enrich the descriptions of the items on which the recommendations should be based and to create links and references amongst them. The engine navigates the resulting graph to find the items that can be interesting for a user according to her profile, the history of her purchases or the web pages she most liked.
In the example before it was known that Luca De Santis likes the band Nine Inch Nails. By navigating the DBPedia graph from the Nine Inch Nails entry, the system can automatically create a list of all associated bands and propose it to the user (“Luca, you like Nine Inch Nails, try also Skinny Puppy and Cabaret Voltaire!”).
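A minimal sketch of that navigation step, querying DBpedia's public SPARQL endpoint for artists associated with Nine Inch Nails. The dbo:associatedMusicalArtist property is my assumption about the relevant DBpedia predicate, and the real recommender is of course far more elaborate.

```python
import requests

DBPEDIA_SPARQL = "https://dbpedia.org/sparql"

# Assumption: dbo:associatedMusicalArtist is the DBpedia property linking related acts.
QUERY = """
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?band WHERE {
  { dbr:Nine_Inch_Nails dbo:associatedMusicalArtist ?band }
  UNION
  { ?band dbo:associatedMusicalArtist dbr:Nine_Inch_Nails }
}
LIMIT 10
"""

def related_bands():
    resp = requests.get(
        DBPEDIA_SPARQL,
        params={"query": QUERY, "format": "application/sparql-results+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return [b["band"]["value"] for b in resp.json()["results"]["bindings"]]

if __name__ == "__main__":
    for band in related_bands():
        print("You like Nine Inch Nails, try also:", band)
```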
In SemLib, the previous research project, this technology was basically in a very prototypal state. In StoM it will be re-engineered to make it more robust, efficient and general-purpose, so that it suits multiple real-life scenarios. Other recommendation techniques (e.g. collaborative filtering) will also be tested and, if effective, implemented in the system. A lot of attention will be devoted to performance: in our vision the recommender should become a cloud-based service, so it is essential to guarantee linear scalability in the amount of data it can manage, while also ensuring consistent processing times, both for updating its internal indexes and for responding to recommendation queries. The choice of the most suitable recommendation algorithms must also take these aspects into consideration.
Despite the increasing adoption of Semantic Web technologies, we think (hope!) that the products we are focusing on in StoM address a niche market that hasn't been fully satisfied yet. For example, while several big vendors provide annotation services nowadays (think of Evernote), none of them presents a model based on semantics, limiting annotations to textual comments. This greatly hinders the chances of reusing the information, making these services sometimes too limited for power users.
In StoM we are concentrating on identifying business needs that can be satisfied by our Semantic Web-powered technologies and on designing services that can hopefully generate real interest in the market.
Thanks to Francesca Di Donato (aka @ederinita) for reviewing this post and for providing precious hints, and to Natalia Mielech (aka @nmielech), who sparked the need to explain what we are trying to achieve in StoM to everybody, and not only to hopeless nerds like me!
Sunday, September 14, 2014
My Nerdie Bookshelf - "Big Data" by Nathan Marz and James Warren
Undoubtedly Big Data is the hype of 2014: if, on the one hand, the availability of cheap cloud resources, together with the huge increase in data sources, has paved the way for the incredible rise of interest in this subject, it is also true that behind it there is a lot of marketing.
Real use cases are abundant, but Big Data has become a buzzword now used for everything regarding data analytics. Interesting viewpoints on this can be read in selected articles like "'Big Data' Is One Of The Biggest Buzzwords In Tech That No One Has Figured Out Yet" and "If Big Data Is Anything at All, This Is It". Moreover, the term "Big" is often used improperly: 100 GB of overall data, which can be processed in RAM on affordable machines (see for example this offer from Hetzner), can hardly be called Big Data...
The strongest point of the Big Data book I'm reviewing here has been, for me, its ability to present a clear definition of the problem. Given the hype, I wanted to learn more about Big Data and chose this book thanks to the very good reputation of its author (well, co-author, to tell the truth): Nathan Marz was a technical architect at Twitter and founded several exciting open source projects (above all Storm, a sort of ultra-scalable, high-performance service bus).
Marz presents here his vision and recipes for dealing with the Big Data problem, namely the Lambda Architecture. It is definitely a high-end solution, to be used when great amounts of data must be processed with the lowest possible latency. For this purpose the Lambda Architecture consists of two layers, the batch layer and the speed layer, the latter, as the name implies, processing the most recent data.
Although the book is very practical and goes straight to the technical solutions and the open source technologies the Lambda Architecture is based on, it also clearly describes the characteristics of Big Data processing. At the heart of it all is the idea that analytics is just a "function" of all the available data and that the master dataset is immutable: new information should only be accumulated, never deleted. This makes it possible to reprocess the whole universe of information whenever needed, which brings several advantages, such as the ability to easily correct errors introduced in previous processing and to compute new indicators, or refine existing ones, over the whole set of information. This data must in fact be processed to produce (or to enrich, if processing is incremental) the indexes used for queries (e.g. for business intelligence): in the author's words, "to make queries on precomputed views rather than directly on the master dataset".
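A toy sketch of this principle, not taken from the book: the master dataset is an append-only list of immutable facts, and a "batch view" is just a pure function recomputed from scratch over all of them (names and data are illustrative).

```python
from collections import Counter

# Append-only master dataset of immutable facts (e.g. page-view events).
MASTER_DATASET = [
    {"page": "/home", "user": "u1", "ts": 1},
    {"page": "/home", "user": "u2", "ts": 2},
    {"page": "/about", "user": "u1", "ts": 3},
]

def batch_view(facts):
    """Precomputed view: page-view counts, recomputed from the whole dataset.
    Reprocessing everything makes it easy to fix bugs or add new indicators."""
    return Counter(fact["page"] for fact in facts)

print(batch_view(MASTER_DATASET))   # queries hit this view, not the raw facts
```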
The key technology here is of course Hadoop, both for the distributed management of big data and for processing through map-reduce-powered algorithms. Quite interesting is the chapter about modeling, which presents a data schema that revolves around atomic facts, logically linked to form a graph of structures: in the end this is not too dissimilar from the star/snowflake schemas found in traditional OLAP data warehouses.
For the real-time layer Marz proposes the use of the aforementioned Storm, in this case to update short-lived indexes, destined to be removed after a few hours, once the batch layer has been able to ingest the recent data they were based on.
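And a companion sketch, again purely illustrative, of how the speed layer's short-lived view can be merged with the batch view at query time, so that recent events are visible before the next batch run absorbs them.

```python
from collections import Counter

batch_view = Counter({"/home": 2, "/about": 1})   # produced by the last batch run
realtime_view = Counter()                         # updated as new events arrive

def on_new_event(event):
    """Speed layer: incrementally index events not yet covered by the batch view."""
    realtime_view[event["page"]] += 1

def query(page):
    """Query-time merge of the two layers."""
    return batch_view[page] + realtime_view[page]

on_new_event({"page": "/home", "user": "u3", "ts": 4})
print(query("/home"))   # 3: two from the batch view, one from the speed layer
```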
While there is a hint of theory here and there, Big Data is very practical and highly focused on presenting the Lambda Architecture (a more suitable title would have been A reference architecture for Big Data processing). I wouldn't advise the book to developers totally new to the Big Data ecosystem: it presents several specific technical solutions (Thrift, Pail, JCascalog, Cassandra, ElephantDB, Trident, ...) that should help and simplify the problems described, but which can probably be appreciated only by those already skilled in the subject. In my view it would be beneficial to first get your hands dirty with the main technologies behind it (Hadoop to start with, and Storm just in case) before jumping straight to the full Lambda Architecture as it is presented here.
Overall, anyway, I found the book interesting and stimulating; the fact that MEAP books can easily be bought at big discounts (just follow the Manning Twitter account to get promotional coupons) is also a plus worth mentioning here.
More information on Big Data can be found on the Manning web site.
Labels: Analytics, Apache Storm, Big Data, Hadoop, Java, My Nerdie Bookshelf