Introducing Semantic Content Management
Article first published on Zaizi’s blog.
In 2001, in a now-famous article in Scientific American, Tim Berners-Lee, the inventor of the World Wide Web, coined the term Semantic Web. In it, Berners-Lee wrote:
The Semantic Web will bring structure to the meaningful content of Web pages, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users […]. The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation
Certainly, the “classic” Web has always suffered from some limitations, mainly related to its lack of knowledge expressiveness. Since its beginnings, most of the Web’s content has been created and published to be consumed by humans rather than processed by machines, even though today’s machines are perfectly capable of handling large amounts of data. This kind of unstructured content is difficult to process, mine, relate and categorise; in short, it is difficult to extract value from it. The Semantic Web (also known as the Web of Data) is all about making content machine-readable, allowing software agents to automatically accomplish complex tasks with that data and helping search engines better understand what they index, so they can later provide more accurate results to users.
Towards the Web of Data
Structured content is not a brand-new term, although you might now find it everywhere. Any content represented using a predefined model is structured: a database, an XML file or a spreadsheet are good examples. Structured content is easy for machines to process because every single fact is perfectly defined, as are its relationships with other facts. Presentation standards on the Web (HTML, CSS…) don’t structure the content; they only define the way the content is going to be displayed. Nor does the hyperlinked structure of the Web convey any sense of why two web pages are connected. The current Web is an enormous graph of pages connected through unspecified links. When a web page links to another, a machine can see that the first references the second, but it can’t know why.
In the Semantic Web, each node of the graph is unambiguously identified, as are its relationships with the rest of the nodes. Thinking about web content, these nodes may be viewed as concepts or entities: not just words in a document, but terms tied to a unique definition that everyone can find on the Web. To make the Web of Data a reality, the huge amount of data on the Web must be available in a standard format. Furthermore, the Semantic Web needs access not only to the data itself but also to the relationships among data.
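This idea can be pictured as plain RDF-style triples, where every node and every relationship is identified by a URI that anyone can look up. A minimal sketch (using real DBpedia identifiers purely as an illustration):

```python
# Each fact is a (subject, predicate, object) triple whose terms are URIs,
# so any agent on the Web can find out exactly what each node and link means.
triples = [
    # "Tim Berners-Lee is known for the World Wide Web", stated unambiguously:
    ("http://dbpedia.org/resource/Tim_Berners-Lee",
     "http://dbpedia.org/ontology/knownFor",
     "http://dbpedia.org/resource/World_Wide_Web"),
    ("http://dbpedia.org/resource/Tim_Berners-Lee",
     "http://www.w3.org/2000/01/rdf-schema#label",
     "Tim Berners-Lee"),
]

def objects(subject, predicate):
    """Return all objects linked to `subject` via `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects("http://dbpedia.org/resource/Tim_Berners-Lee",
              "http://dbpedia.org/ontology/knownFor"))
```

Unlike an untyped hyperlink, the predicate here tells a machine *why* the two nodes are connected.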
Linked Data has been the natural evolution of the right-hand graph in the above figure. According to linkeddata.org, the term Linked Data refers to a set of best practices for publishing and connecting structured data on the Web. In practical terms, Linked Data is a huge graph of interrelated datasets from different domains on the Web; in other words, an enormous database of the world’s knowledge. Some well-known examples of such datasets are DBpedia, a machine-readable version of Wikipedia; Freebase, a database of well-known people, places and things; and Geonames, a worldwide geographical database.
Why is creating structured content important?
As we will see in this post, structuring content is not a trivial task, so it is worth spending some time highlighting the benefits of embracing this approach:
- Make sense of your content: content is finally put in context thanks to its underlying knowledge, with each piece of information labelled as an entity. Besides annotating your content with your custom datasets, linking it to the Linked Data cloud allows your site to automatically take advantage of the information contained there.
- Data interoperability: in the Web of Data, common standards are used for structuring content. Think, for example, of content aggregators or product-comparison sites. Nowadays, e-commerce companies need to pay to include their product catalogues in such sites. If all e-commerce frameworks published those catalogues using the same standard, a simple query would be enough to compare the prices of the same product.
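As a sketch of that interoperability claim, imagine two (entirely hypothetical) shops publishing their catalogues with the same schema.org-style Product/Offer vocabulary; comparing prices then reduces to a trivial query:

```python
import json

# Two hypothetical shops using a shared schema.org-style vocabulary.
# Because both identify products by GTIN, their data is directly comparable.
shop_a = json.loads("""[
  {"@type": "Product", "gtin13": "0012345678905",
   "offers": {"@type": "Offer", "price": 19.99, "priceCurrency": "EUR"}}
]""")
shop_b = json.loads("""[
  {"@type": "Product", "gtin13": "0012345678905",
   "offers": {"@type": "Offer", "price": 17.49, "priceCurrency": "EUR"}}
]""")

def cheapest(gtin, *catalogues):
    """Find the lowest offered price for a product identified by its GTIN."""
    prices = [product["offers"]["price"]
              for catalogue in catalogues
              for product in catalogue if product["gtin13"] == gtin]
    return min(prices)

print(cheapest("0012345678905", shop_a, shop_b))  # → 17.49
```

With no shared standard, the same comparison would require a bespoke scraper per shop.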
- Content Discovery: semantic information can help search engines better understand what the content they index is about. The major search engines promote the use of embedded markup (microdata, RDFa and microformats) for annotating content with entities and concepts; these formats represent semantic data by embedding it directly in HTML and XHTML, where search engines can parse it to better understand the meaning of web pages. For example, schema.org is a joint effort by Google, Bing, Yahoo! and Yandex to build a common vocabulary that all sites can use to mark up their content. Schema.org provides a way to describe and categorise “real world” things like Persons, Organizations, Places, Movies, Events and so on within the content. Search engines use this information to improve and contextualise results for users, for example through Rich Snippets.
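To make the markup idea concrete, here is a sketch of a schema.org description serialised as JSON-LD, one of the embedded formats search engines accept; the event itself is invented, and a site would place the resulting JSON inside a `<script type="application/ld+json">` tag:

```python
import json

# A hypothetical event described with the schema.org vocabulary as JSON-LD.
event = {
    "@context": "https://schema.org",
    "@type": "Event",
    "name": "Semantic Web Meetup",
    "startDate": "2014-05-20",
    "location": {"@type": "Place", "name": "London"},
}

markup = json.dumps(event, indent=2)
print(markup)
```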
How do you structure your content?
Nowadays, despite the adoption of these technologies by Google, Facebook, IBM and many other big tech players, over 80% of published content is still unstructured. There are several reasons for this, but the most important one is that human curation is far too expensive: structuring content at scale is a job that only machines can handle. Redlink was founded to solve this problem and to democratise semantic technologies, providing a Publishing Knowledge Graph solution for enterprises and online publishers (just as Google, Facebook and Bing have their own Knowledge Graphs, we believe every company should create its own). Built entirely on top of three open source pillars from the Apache Software Foundation (Apache Stanbol, Apache Marmotta and Apache Solr), the Redlink Platform provides a wide range of cloud services for Content Enrichment, Linked Data Publishing and Semantic Search.
Redlink can structure your content by extracting entities and concepts and linking them with Linked Data datasets or your custom datasets. The extracted data can be represented using any of the Semantic Web standard formats, including embedded markup such as RDFa and the already mentioned schema.org vocabulary for improving SEO. Redlink enhances the content by applying different analysis processes that can include:
- Language Detection
- Named Entity Recognition
- Nouns Recognition
- Entity Linking
- Topic Annotation
- Sentiment Analysis
- Content Classification
- Third-party analysis engines like Machine Linking
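To illustrate what an enrichment step like Entity Linking produces, here is a deliberately naive sketch: the gazetteer and example sentence are invented, and the real services rely on trained NLP models rather than dictionary lookup:

```python
# Toy entity linking: find known surface forms in a text and annotate them
# with the URI of the entity they refer to. Real engines use NLP models;
# this dictionary-based version only illustrates the output shape.
GAZETTEER = {
    "London": "http://dbpedia.org/resource/London",
    "Apache Solr": "http://dbpedia.org/resource/Apache_Solr",
}

def enrich(text):
    """Return (surface form, entity URI) annotations found in `text`."""
    return [(name, uri) for name, uri in GAZETTEER.items() if name in text]

annotations = enrich("Our London office runs search on Apache Solr.")
print(annotations)
```

The output is the kind of contextual annotation that later feeds the semantic indexes described below.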
Linking your content with datasets from the Linked Data cloud also allows you to use, at your convenience, the information contained in those resources.
Linked Data Publishing
Redlink helps you manage your company’s datasets as semantic data and publish them as part of the Linked Data cloud, ensuring proper connections with other datasets and that your data is reachable and can be queried efficiently. Even if you are not planning to publish your data in that way, managing it as semantic data provides benefits such as using it for Content Enrichment, Categorisation and Semantic Search.
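As a minimal sketch of what "publishing as Linked Data" means at the wire level, a dataset can be serialised as N-Triples, one of the standard RDF serialisations; the product URIs below are hypothetical, and the `owl:sameAs` link is the conventional way to connect a local resource to an external dataset:

```python
# Serialise a small hypothetical dataset as N-Triples: one RDF statement
# per line, terminated by " .", with every term written as an absolute URI.
triples = [
    ("http://example.org/product/42",
     "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     "http://schema.org/Product"),
    ("http://example.org/product/42",
     "http://www.w3.org/2002/07/owl#sameAs",
     "http://dbpedia.org/resource/Example"),  # hypothetical external link
]

def to_ntriples(triples):
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)

print(to_ntriples(triples))
```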
Redlink provides an easy-to-use Semantic Search solution built on top of the well-known information retrieval framework Apache Solr. Besides applying all the features already provided by Solr (like faceting, autocompletion, highlighting, spell checking and so on) over both enterprise content and custom datasets, Redlink’s Search services can also take advantage of the entities and concepts extracted from your content to automatically create semantic indexes where documents are annotated with contextual information. This information can be used to improve content organisation and discovery.
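A semantic index of this kind can be pictured as documents stored alongside their entity annotations, so a query can match on meaning rather than on literal keywords. A minimal sketch with made-up documents (the real indexes are Solr-backed, not Python lists):

```python
# Documents indexed together with the entity URIs extracted from them.
# A "semantic" query matches on entities, so it finds document 1 even
# though the word "London" never appears in its text.
index = [
    {"id": 1, "text": "The match was played in the capital.",
     "entities": {"http://dbpedia.org/resource/London"}},
    {"id": 2, "text": "Solr powers the site search.",
     "entities": {"http://dbpedia.org/resource/Apache_Solr"}},
]

def search_by_entity(entity_uri):
    """Return ids of documents annotated with the given entity."""
    return [doc["id"] for doc in index if entity_uri in doc["entities"]]

print(search_by_entity("http://dbpedia.org/resource/London"))  # → [1]
```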
A simple use case: Semantic e-Commerce
An e-commerce site may be interested in Semantic Publishing using the Redlink technology. Product catalogues can be imported as a semantic knowledge base and published using Linked Data principles, which would allow third-party agents to aggregate them easily; for example, they can feed the Google Knowledge Graph. The datasets can then be used for enriching and indexing editorial copy, product reviews or social network content: extracting mentions of the products and, for instance, analysing what customers are saying about them. A powerful Semantic Search over the catalogues can also be provided.
How we can help you!
After several years spent on research and development of the core technologies, we are setting up a cloud infrastructure providing semantic technologies to help online publishers move from web pages to datasets.
In the next blog post we will introduce a Redlink-based solution developed by Zaizi for enabling Semantic Publishing and Semantic Content Management on Drupal.