The Australian Powerhouse Museum houses some 101,068 objects collected from 1880 to the present day, from steam engines to fine glassware and from postage stamps to robot dogs, of which 66,303 were present in its online collection database. Creating structured data for these items manually would be a humongous task, demanding a great deal of time, staff and money.
The Powerhouse Museum tried people-powered tagging of these items for 2 years and then employed the OpenCalais Web Service to complement the curation process. The OpenCalais API automatically creates rich semantic metadata for content using natural language processing (NLP), machine learning and other methods. It analyzes a document, identifies the entities in it, then extracts and annotates them. Its functions go well beyond classic entity identification and even return the facts and events hidden within the text.
The API performs semantic markup on unstructured documents, recognizing people, places, companies, and events. These tags can be incorporated into other applications for search, news aggregation, catalogs, etc. The metadata makes it possible to build maps (or graphs or networks) linking documents to people to companies to places to products to events.
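The linking idea can be illustrated with a minimal sketch. It does not call the Calais service; it assumes the tagging step has already produced a list of (entity type, entity name) pairs for each document. The document ids, the sample data and the function names `build_entity_index` and `related_documents` are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical tagger output: document id -> list of (entity type, entity name).
# The ids and entities are made-up sample data, not real Calais results.
tagged_documents = {
    "doc-1": [("Company", "Speedo"), ("City", "Perth"), ("City", "Sydney")],
    "doc-2": [("City", "Sydney")],
    "doc-3": [("Company", "Speedo")],
}


def build_entity_index(tagged):
    """Map each entity to the set of documents that mention it."""
    index = defaultdict(set)
    for doc_id, entities in tagged.items():
        for entity in entities:
            index[entity].add(doc_id)
    return index


def related_documents(doc_id, tagged, index):
    """Return other documents sharing an entity with doc_id, with the shared entities."""
    related = defaultdict(set)
    for entity in tagged[doc_id]:
        for other in index[entity]:
            if other != doc_id:
                related[other].add(entity)
    return dict(related)


entity_index = build_entity_index(tagged_documents)
related = related_documents("doc-1", tagged_documents, entity_index)
for other, shared in sorted(related.items()):
    print(other, sorted(shared))
```

Here the entity index plays the role of the graph's edges: two documents are linked whenever they mention the same person, place or company, which is the kind of cross-linking a search or catalog application could build on.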
There are quite a few interesting applications for this technology, ranging from the creation and analysis of social graphs to news services and even government data. Knowing the kinds of entities in a text allows developers to build intelligent search engines that look for related content. Similarly, Calais could enable links to the countries and cities mentioned in a document, thereby enabling better search and cross-linking.
Calais can also bring text analysis into browsers. A browser could call Calais on document load and obtain a list of the people, places, companies, etc. mentioned in the document. With this information the browser could create a more interesting, contextual and relevant experience, thereby enabling more intelligent browsing.
The ReadWriteWeb article linked below describes a case where the API automatically generated tags for some swimwear designed by Speedo for the 1991 Australian swimming team that competed at the World Swimming Championships in Perth. Calais correctly identified some important locations in the document, i.e. Perth and Sydney, as well as an important corporation, Speedo. It also picked up the name of the designer and the name of the person who owned the suits before the museum acquired them. However, it made classification errors, such as tagging "World Championships" as a company, and mistook the general phrase "international swimming organisation" for an actual organized body. Because Calais is built around people, places, and companies, general information about items may be lost on it. Tags that would be obvious to humans, such as swimming, swimwear, Olympics, or the year 1991, are beyond the scope of Calais.
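Since machine tagging complements rather than replaces human tagging, the two sets of tags can be combined while keeping track of where each tag came from. The sketch below is only an illustration under stated assumptions: the function `merge_tags`, the sample tag lists and the idea of queuing machine-only tags for review are hypothetical, not the museum's actual workflow or part of the Calais API.

```python
def merge_tags(human_tags, machine_entities):
    """Combine human tags and machine entities into {normalized tag: set of sources}."""
    merged = {}
    for tag in human_tags:
        merged.setdefault(tag.strip().lower(), set()).add("human")
    for _entity_type, name in machine_entities:
        merged.setdefault(name.strip().lower(), set()).add("machine")
    return merged


# Hypothetical sample data for the swimwear record described above.
human_tags = ["swimming", "swimwear", "1991", "Speedo"]
machine_entities = [
    ("Company", "Speedo"),
    ("City", "Perth"),
    ("City", "Sydney"),
    ("Company", "World Championships"),  # the misclassification noted in the article
]

merged = merge_tags(human_tags, machine_entities)

# Tags that only the machine produced could be queued for a curator to confirm.
needs_review = sorted(tag for tag, sources in merged.items() if sources == {"machine"})
print(needs_review)
```

Keeping the source of each tag means that topical tags supplied only by people (swimming, 1991) are preserved, while machine-only tags such as the mislabeled "World Championships" can be checked by a person before being published.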
How can Calais improve? Calais is based on AI techniques, so it can get better through further training and better algorithms. With each new document submitted to Calais, the database gets richer and more complete - a growing semantic database of people, places, companies and events.
This is a roadmap to becoming a semantic business powerhouse, which is clearly a strong position to hold in today's technology landscape. Powerhouse's use of Calais is the first large-scale deployment of the technology across a large public data set. It will be interesting to see what such technologies deliver as they evolve.
http://www.opencalais.com
http://www.readwriteweb.com/archives/australian_museum_uses_open_calais.php