Sunday, October 05, 2014

Semantic Web and Semantic Annotation for newbies: a very introductory presentation of the StoM project

I was asked to write a simple presentation of the StoM project and the technologies behind it. What follows is the final result: I decided to post it here, so that it can hopefully be reused when some of my friends ask me (very often indeed) what the Semantic Web is and especially what the heck a semantic annotation is!
I dearly hope my attempt to explain these concepts clearly has succeeded: to paraphrase one of Einstein’s most famous quotes, if you can explain a concept in a simple way, it means you have really understood it... Did I???

StoM reuses the results of a previous EU-funded research project named SemLib, which produced two main outcomes: an annotation system and a semantic recommender. Both exploit Semantic Web technologies in their internal mechanisms.

First of all, the Semantic Web is about adding explicit “meaning” to objects and services on the web: this meaning is expressed through metadata that can be unambiguously interpreted by software agents. It is about making software systems automatically understand each other.

This should not be confused with Artificial Intelligence: there’s no “imitation” of the cognitive processes of the human brain but simply the use of formal metadata to “better describe” the context in which a piece of software operates. This “better description” consists of multiple ingredients:

1. using a specific convention to express information. This can be seen as the syntax for describing things. In particular, the Semantic Web is founded on a model based on “triples”, that is, assertions of the form:
< Subject > < Predicate > < Object >
Some examples:
< X > < is a > < Document >
< X > < has author > < Luca De Santis >
< Luca De Santis > < is a > < Person >
< Luca De Santis > < works for > < Net7 >
< Net7 > < is a > < Company >
Albeit very simple in nature, this model is incredibly expressive and results in “a graph of facts”. “Deduction” by software agents is obtained by navigating and querying this graph. To better explain this concept I borrowed the two following images, both taken from the brilliant presentation Semantic Data Management in Graph Databases by Maribel Acosta: they show a graph that models authors, papers and conferences, the logical relationships amongst these concepts, and how queries are resolved by navigating the graph structure.

2. using a standard formalism to express triples, in particular the RDF (Resource Description Framework) modeling language. This can be seen as the grammar for describing things.

3. using “vocabularies” to express concepts, possibly those accepted as standards. This is incredibly important: for software agents to “understand” concepts and assertions, they must share knowledge of the domain (isn’t it the same for humans?).

Multiple standard vocabularies are available for representing specific knowledge domains and use cases. For example:

  • Dublin Core can describe document metadata (eg author, title, date of creation etc);
  • FOAF and SIOC can describe persons and relationships amongst them;
  • SKOS can be used to define classification rules (lists of tags, taxonomies, etc);
  • GoodRelations formally describes e-commerce scenarios;
  • Schema.org is a very rich vocabulary that can describe a huge number of “concepts” and their “properties”, including Creative works (Book, Movie, MusicRecording, Recipe, TVSeries, …), Events, Organizations, Persons, etc. It was in fact introduced by big search vendors like Google, Yahoo and Microsoft/Bing to support Search Engine Optimization (SEO).
4. publishing descriptions of objects on the web, so that they can be identified by their URI/URL.

5. linking information and metadata together, to enrich the information available by exploiting what has already been specified elsewhere.

For example, the triple:
< URL describing Luca De Santis >
    < cito:likes >
< URL describing Nine Inch Nails >
contains two real, navigable sets of information in the Subject and the Object (together with the property “likes” of the “Citation Typing Ontology” vocabulary). One can exploit the semantic information published at the two URLs to infer information about the Subject (Luca De Santis is a Person, works for Net7 srl, his job title is IT Consultant and his contact point is his Twitter account -!/lucadex) and the Object (Nine Inch Nails is a Music Group whose genre is Industrial Rock).
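
To make the “graph of facts” idea a bit more concrete, here is a minimal sketch in plain Python (deliberately not real RDF tooling). It stores the example triples of point 1 as tuples and answers a question by navigating them. The `match` and `employers_of_authors` helpers are hypothetical names invented for this illustration; they are not part of any Semantic Web standard or of StoM.

```python
# Triples from the example in point 1, stored as (subject, predicate, object).
TRIPLES = [
    ("X", "is a", "Document"),
    ("X", "has author", "Luca De Santis"),
    ("Luca De Santis", "is a", "Person"),
    ("Luca De Santis", "works for", "Net7"),
    ("Net7", "is a", "Company"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def employers_of_authors(triples, document):
    """Navigate the graph in two hops: document -> author -> employer."""
    employers = []
    for _, _, author in match(triples, subject=document, predicate="has author"):
        for _, _, company in match(triples, subject=author, predicate="works for"):
            employers.append(company)
    return employers


print(employers_of_authors(TRIPLES, "X"))  # ['Net7']
```

Nothing in the data says directly that document X comes from Net7: the answer is “deduced” by following two edges of the graph, which is the kind of navigation described above.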

The Semantic Web has been one of the most discussed technologies for more than a decade (the term was introduced in a seminal article by Tim Berners-Lee in Scientific American in 2001!). After the hype cooled down (so much that a lot of people completely lost interest in it), it has gained wide adoption in several fields, including search (see below), SEO and the integration of content with Social Networks. The Open Data movement has also boosted interest in the Semantic Web, because it is obviously important to describe free datasets in terms that can be easily understood by automatic software agents. Likewise, the widespread adoption of API services by software vendors could in the near future be closely linked to the Semantic Web, since its technologies can provide better descriptions of application services and of the data they ingest and produce.

Very often the Semantic Web is associated with search: albeit sometimes this happens a bit too lightly (when the term is confused with natural language processing), it is also true that the Semantic Web can really empower enterprise search. In fact, while normal search engines can only recognize word occurrences, with semantics it becomes possible to identify "concepts" and to handle synonyms (eg FED vs Federal Reserve System; Wall Street vs the New York Stock Exchange) or words with multiple meanings (Rock: Music or Geology? Wall Street: the NY District? The Movie? The Stock Exchange Market?). It is no coincidence that Google is investing a lot in its Knowledge Graph, which exploits semantic information to enhance the search engine’s results.
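
As a rough illustration of the difference between matching words and matching concepts, the following Python sketch uses a tiny hand-written lookup table. The concept identifiers, the table and the three documents are all hypothetical, made up only for this example; a real system would rely on a proper knowledge base.

```python
# Hypothetical lookup table: surface form (lowercase) -> candidate concepts.
LABELS = {
    "fed": ["FederalReserveSystem"],
    "federal reserve system": ["FederalReserveSystem"],
    "rock": ["RockMusic", "RockGeology"],
}

# Hypothetical documents, already tagged with the concepts they mention.
DOCS = {
    "doc1": {"FederalReserveSystem"},  # the text says "FED"
    "doc2": {"FederalReserveSystem"},  # the text says "Federal Reserve System"
    "doc3": {"RockGeology"},
}


def candidates(term):
    """Concepts a word may refer to: one for synonyms, several if ambiguous."""
    return LABELS.get(term.lower(), [])


def search(concept):
    """Find documents by concept, whatever wording they actually use."""
    return sorted(doc for doc, concepts in DOCS.items() if concept in concepts)


print(candidates("FED"))               # ['FederalReserveSystem']
print(candidates("Rock"))              # ['RockMusic', 'RockGeology']
print(search("FederalReserveSystem"))  # ['doc1', 'doc2']
```

Two different wordings resolve to one concept, while an ambiguous word yields several candidates that the system (or the user) must choose between.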

The products at the basis of StoM exploit in their very nature Semantic Web technologies.

In particular the Semantic Annotation System allows users to add annotations on textual documents, published on the web. Annotations can be seen as the equivalent, for digital documents, of applying textual notes, underlines, highlights on paper documents or books (the so-called “marginalia”).

Annotations can be applied on a web page by using the familiar metaphor of adding marginalia on documents (see in this regard the brilliant W3C’s Web Annotation Architecture animation). They can be extremely beneficial for those who must manually process a large number of documents (think of Students, Digital Humanities Scholars or Professional Categories like Lawyers).

By making use of Semantic technologies, it is possible to enrich the meaning of annotations and describe them, at least “internally”, through formal assertions. This way annotations are no longer simple textual comments but become statements (that is, triples) that are stored in a central repository. Formal, structured data can therefore be attached to textual documents, which are by nature an unstructured form of data.

It is crucial that the annotation process is simple and intuitive for users: for example, one should be able to apply formal semantics to a web page by simply highlighting a piece of text and declaring its type in a few steps (eg “Yellen” is a Person) or by linking it to an entry of a public dataset, like DBPedia/Wikipedia (eg “Yellen” “is the same as” the corresponding DBPedia entry). The latter action is very useful: this way the system can automatically “import” all the meaningful references stored in the remote dataset (like the fact that Janet Yellen is a woman and that she is the Chair of the FED) to enrich the available knowledge.

Annotations can help semantic search engines provide more suitable results. For example, some “power searches” that these systems can perform are:

  • fetch all documents that talk about women;
  • fetch all documents that refer to the Federal Reserve System.

The document of the example above will be returned in both cases if the search engine can make use of the annotation on the word “Yellen”, even if the rest of the text doesn’t include any mention of Janet Yellen’s gender or of her role at the FED.
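
The two “power searches” can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the document id, the `dbpedia:Janet_Yellen` identifier, the property names and the `documents_about` helper are all hypothetical, and the “imported” facts are hand-written here instead of being fetched from a real dataset.

```python
# Facts "imported" from a public dataset (hand-written, hypothetical names).
REMOTE_FACTS = {
    "dbpedia:Janet_Yellen": {
        ("is a", "Person"),
        ("gender", "female"),
        ("chair of", "Federal Reserve System"),
    },
}

# User annotations: (document, highlighted text, linked entity).
ANNOTATIONS = [
    ("doc-42", "Yellen", "dbpedia:Janet_Yellen"),
]


def documents_about(predicate, value):
    """Documents with an annotation whose linked entity has the given fact."""
    return sorted({
        doc
        for doc, _text, entity in ANNOTATIONS
        if (predicate, value) in REMOTE_FACTS.get(entity, set())
    })


print(documents_about("gender", "female"))                      # ['doc-42']
print(documents_about("chair of", "Federal Reserve System"))    # ['doc-42']
```

The text of the document only contains the word “Yellen”; both searches succeed because the annotation links that word to an entity whose facts live elsewhere.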

Semantic Annotations therefore allow better management of the knowledge that can be extracted from digital documents and open the way for a more fine-grained reuse of this knowledge.

The efforts in StoM for the Annotation System will be mainly concentrated on four areas:

1. Usability: providing an annotation tool that is very simple and intuitive to use. Users should be able to apply semantic annotations with the same conceptual effort as highlighting sentences of a textual document in different colours.

2. Facilitating the annotation process by transparently interpreting the metadata in the page (title, language, SEO descriptions, ...) and through the integration of entity extraction services, like SpazioDati's DataTXT. This way annotations can be automatically created and presented to the user, for approval or rejection.

3. Providing a comprehensive environment to manage the annotation of web documents. Users can create “notebooks” to store annotations, share them with friends and colleagues, search amongst them and export their data in various formats (even Office Documents) for further reuse.

4. Proposing predefined use case scenarios, to address the needs of specific categories of users. For example, scholars or lawyers will find, once logged in, all the vocabularies that specifically refer to their professions already preloaded.

StoM's Semantic Recommender, on the other hand, uses Semantic Web technologies to provide better recommendations to users. This can be valuable in multiple scenarios, from E-Commerce (“You bought this: try that”) to news sites (“You read this: check also that”).

The system uses the information of public semantic web data sources (like Wikipedia/DBPedia, Europeana, Jamendo, etc) to enrich the descriptions of the items on which the recommendation should be based and to create links and references amongst them. The engine navigates the resulting graph to find the items that can be interesting for a user according to her profile, the history of her purchases or the web pages she liked most.

In the earlier example it was known that Luca De Santis likes the band Nine Inch Nails. By navigating the DBPedia graph from the Nine Inch Nails entry, the system can automatically create a list of all associated bands and propose it to the user (“Luca, you like Nine Inch Nails, try also Skinny Puppy and Cabaret Voltaire!”).
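
Here is a minimal sketch of this kind of graph navigation, in Python. The `ASSOCIATED` dictionary is a hand-written stand-in for the “associated bands” links that the real system would read from DBPedia, and `recommend` is a hypothetical helper, not the algorithm actually used in SemLib or StoM: it is just a breadth-first walk limited to a number of hops.

```python
from collections import deque

# Hypothetical, hand-written "associated band" links (stand-in for DBPedia).
ASSOCIATED = {
    "Nine Inch Nails": ["Skinny Puppy", "Cabaret Voltaire"],
    "Skinny Puppy": ["Nine Inch Nails"],
    "Cabaret Voltaire": [],
}


def recommend(liked, graph, max_hops=1):
    """Breadth-first walk from the liked items, up to max_hops links away."""
    seen = set(liked)
    queue = deque((item, 0) for item in liked)
    suggestions = []
    while queue:
        item, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the allowed distance
        for neighbour in graph.get(item, []):
            if neighbour not in seen:
                seen.add(neighbour)
                suggestions.append(neighbour)
                queue.append((neighbour, hops + 1))
    return suggestions


print(recommend(["Nine Inch Nails"], ASSOCIATED))
# ['Skinny Puppy', 'Cabaret Voltaire']
```

Items the user already likes are never suggested back, and raising `max_hops` widens the search to bands that are only indirectly connected.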

In SemLib, the previous research project, this technology was basically at a very early prototype stage. In StoM it will be re-engineered to make it more robust, efficient and general-purpose (so that it is suitable for multiple real-life scenarios). Other recommendation techniques (eg collaborative filtering) will also be tested and, if effective, implemented in the system. A lot of attention will in fact be devoted to performance: in our vision the recommender should become a cloud-based service, so it is essential to guarantee linear scalability in terms of the amount of data that it can manage, while also ensuring consistent processing times, both for updating its internal indexes and for responding to recommendation queries. The choice of the most suitable recommendation algorithms must also take these aspects into consideration.

Despite the increase in adoption of Semantic Web technologies, we think (hope!) that the products that we are focusing on in StoM address a niche market that hasn’t been fully satisfied yet. For example, while several big vendors provide annotation services nowadays (think of Evernote), none of them presents a model based on semantics, limiting the annotations to textual comments. This greatly hinders the chances of reusing the information, making these services sometimes too limited for power users.
In StoM we are concentrating on identifying business needs that can be satisfied by our Semantic Web-powered technologies and on designing services that can hopefully generate real interest in the market.

Thanks to Francesca Di Donato (aka @ederinita) for reviewing this post and for providing precious hints, and to Natalia Mielech (aka @nmielech), who sparked the need to explain what we are trying to achieve in StoM to everybody, and not only to hopeless nerds like me!

Sunday, September 14, 2014

My Nerdie Bookshelf - "Big Data" by Nathan Marz and James Warren

Undoubtedly Big Data is the hype of 2014: while the availability of cheap cloud resources, together with the huge increase in data sources, has paved the way for the incredible rise of interest in this subject, it is also true that there is a lot of marketing behind it.

Real use cases are abundant, but Big Data has become a buzzword now used for everything regarding data analytics. Interesting viewpoints on this can be read in selected articles like "'Big Data' Is One Of The Biggest Buzzwords In Tech That No One Has Figured Out Yet" and "If Big Data Is Anything at All, This Is It". Moreover, the term "Big" is often used improperly: 100 GB of overall data, which can be processed in RAM on affordable machines (see for example this offer from Hetzner), definitely can't be called Big Data...

For me the strongest point of the Big Data book I'm reviewing here is its ability to present a clear definition of the problem. Given the hype, I wanted to study more about Big Data and chose this Nathan Marz book thanks to the very good reputation of its author (well, co-author to tell the truth): Marz was a technical architect at Twitter and founded several exciting open source projects (Storm above all, a sort of ultra scalable, high-performance service bus).

Marz presents here his vision and recipes to deal with the Big Data problem, namely the Lambda Architecture. It is definitely a high-end solution, to be used when large amounts of data must be processed with the lowest possible latency. For this purpose the Lambda Architecture is built around two processing layers, the batch layer and the speed layer; the latter, as the name implies, processes the most recent data.

Although the book is very practical and directly describes the technical solutions and the open source technologies the Lambda Architecture is based on, it also clearly expresses the characteristics of Big Data processing. At the heart of it all there is the idea that analytics is just a "function" of all the available data and that the master dataset is immutable: new information should only be accumulated and never deleted. This way it is possible to reprocess the whole universe of information when needed. This brings several advantages, like the ability to easily correct errors introduced in previous processing and the ability to compute new indicators, or to refine existing ones, on the whole set of information. These data in fact must be processed to produce (or to enrich, if processing is incremental) indexes used for queries (eg for business intelligence): in the author’s words, “to make queries on precomputed views rather than directly on the master dataset”.
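
The idea of “views as a function of all the data” can be sketched in a few lines of Python. This is a toy, in-memory illustration under my own assumptions (page-view facts, the `record`, `page_views` and `unique_visitors` names): it is not code from the book, which relies on Hadoop-based tools for this job.

```python
from collections import Counter

# Master dataset: an append-only list of immutable facts (url, user).
master_dataset = []


def record(url, user):
    """New information is only accumulated, never updated or deleted."""
    master_dataset.append((url, user))


def page_views(facts):
    """A precomputed view: number of visits per URL."""
    return Counter(url for url, _user in facts)


def unique_visitors(facts):
    """A new indicator, added later and computed on the whole history."""
    visitors = {}
    for url, user in facts:
        visitors.setdefault(url, set()).add(user)
    return {url: len(users) for url, users in visitors.items()}


for url, user in [("/home", "anna"), ("/home", "bob"),
                  ("/home", "anna"), ("/about", "bob")]:
    record(url, user)

views = page_views(master_dataset)
visitors = unique_visitors(master_dataset)
print(views["/home"], visitors["/home"])  # 3 2
```

Because the raw facts are never thrown away, a buggy view function can be fixed and simply run again, and a brand new indicator like `unique_visitors` can be computed over the entire history.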

The key technology here is of course Hadoop, both for the distributed management of big data and for processing through map-reduce powered algorithms. Quite interesting is the chapter about modeling, which presents a data schema that revolves around atomic facts, logically linked to form a graph of structures: in the end this is not too dissimilar to the star/snowflake schemas that can be found in traditional OLAP data warehouses.

For the real time layer Marz proposes the use of the aforementioned Storm, in this case to update short-lived indexes, destined to be removed after a few hours when the batch layer is able to ingest the recent data they were based on.
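
How the two layers cooperate at query time can be sketched as follows, again in toy Python with made-up numbers. The `query` function and the dictionaries are hypothetical stand-ins for the batch and real-time indexes, not the Storm-based implementation described in the book.

```python
def query(url, batch_view, realtime_view):
    """Answer a query by merging the batch view with the real-time view."""
    return batch_view.get(url, 0) + realtime_view.get(url, 0)


# The batch layer has processed everything up to its last run...
batch_view = {"/home": 100}
# ...while the speed layer only tracks what arrived since then.
realtime_view = {"/home": 3}
before = query("/home", batch_view, realtime_view)

# A few hours later the batch layer has ingested the recent data,
# so the short-lived real-time index can be thrown away.
batch_view = {"/home": 103}
realtime_view = {}
after = query("/home", batch_view, realtime_view)

print(before, after)  # 103 103
```

The answer stays the same before and after the batch run: the real-time index only has to cover the gap left by the latency of the batch layer.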

While there is a hint of theory here and there, Big Data is very much a practical book, highly focused on presenting the Lambda Architecture (a much more suitable title would have been A reference architecture for Big Data processing). I wouldn't recommend the book to developers totally new to the Big Data ecosystem: it presents in fact several specific technical solutions (Thrift, Pail, JCascalog, Cassandra, ElephantDB, Trident,…) that should help simplify the problems described, but can probably be fully appreciated only by those already skilled in the subject. In my view it would be beneficial to first get one's hands dirty with the main technologies behind it (Hadoop to start with, and possibly Storm) before jumping straight into the final Lambda Architecture as it is presented here.

Overall anyway I found the book interesting and stimulating: the fact that MEAP books can be easily bought with big discounts (just follow the Manning Twitter account to get promotional coupons) is also a plus worth mentioning here.

More information on Big Data can be found on the Manning web site.

Monday, July 07, 2014

My Nerdie Bookshelf - "Getting things done - The art of stress-free productivity" by David Allen

I don't remember exactly how I heard about this book, but it immediately caught my attention when I learnt that it suggests productivity tricks based on the use of to-do lists. I was and still am a great fan of lists: they have really helped me to untangle the huge mass of tasks and problems I have dealt with in the last ten years. I use lists for almost everything, including when I'm packing for a trip.

I must admit that the author's method is quite interesting and effective. To say that it is simply based on lists is a huge understatement. David Allen proposes a process to deal with everyday activities in which a "next action" is defined for every task that must be faced. Lists and calendars serve to keep the mind free, instead of trying to hold schedules, appointments, deadlines, notes and things to do all in one's memory.

Most of the advice proposed in Getting things done is purely based on common sense, but it is strangely often forgotten or dismissed, especially in work environments.
I've learnt a few interesting tricks from this book. Three are worth mentioning here: the 2 minutes rule (if you need less than 2 minutes to do something, don't defer it and do it now), the importance of always identifying a next action for each task, and the habit of defining a clear purpose at the beginning of a meeting and a list of actions (and responsibilities) at the end.

The book was first published in 2001 and there isn't much emphasis on the exploitation of technology (besides some random mentions of PDAs - that was way before smartphones! - and personal productivity features of desktop software like Outlook or Lotus Notes). This should not be held against the book: its value is in the process, not in the suggested (low tech) implementation.

Despite these good things there are also some annoying facts about Getting things done. To start with, its messianic style, the constantly repeated mantra that your life can deeply change if you follow the advice of this book. Moreover, while the method can really apply to everyone (and, to tell the truth, this is often repeated throughout the pages), the examples in Getting things done always refer to a VIP audience. All this gives the impression that David Allen is constantly promoting his activity as a consultant: this is legitimate of course, but it is frankly annoying to get this feeling from a book you've bought.

Finally, although the book is not very long, concepts are often repeated over and over again. You could actually get the same value from it even if it had 40% fewer pages.

Despite these flaws I strongly recommend Getting things done to those who struggle to keep pace with business and personal events. I honestly learnt some useful strategies here, which in the end is the only thing that matters in a book like this.

Getting things done - The art of stress-free productivity, David Allen, Penguin Books 2001


Introducing My Nerdie Bookshelf

I love reading: this is a thing in the family, since between my wife, our 10-year-old son and me our house is really flooded with books.

I enjoy keeping track of the books I read, both in paper and electronic form, but I have never liked mixing my leisure time with my professional reading. Therefore even at home I try to keep my technical bookshelf (My nerdy bookshelf) separate from the other (huge…) ones used to store everything, from gothic novels to Italian thrillers, from art books to comics.

I enjoy using Anobii to keep track of the “leisure time” books I read and buy but, for the reasons above, I don’t feel like using it for my nerdy reading as well. At the same time, I really do feel the need to write some comments after I’ve read a professional book. That’s why I thought about reactivating this very neglected blog to start the My Nerdy Bookshelf series of posts. Here I’ll report comments on the technical books I read, mostly related to my present IT profession of Project Manager and Software Architect (Ok, let’s also add Entrepreneur, since I’m an associate of Net7 srl).

I’m going to write these posts mostly for myself, and of course if they turn out to be valuable for others I’ll be more than happy. At the same time it’s going to be a personal thing, not a “professional” blogger activity. I’ll report my personal impressions, with humility and honesty: my 2 cents, hopefully with attitude!