Linked Open Data in the Cultural Heritage World: Issues for Information Creators and Users
By Patricia Harpring
Since returning to sunny Los Angeles from lovely but cold Milwaukee, I’ve been contemplating the bright potential future for cultural heritage information in the Linked Open Data (LOD) environment. In Milwaukee, I spoke at a session titled “Brave New World: Using RDF and Linked Open Data for the Semantic Web” at the annual conference of the Visual Resources Association. (For an introduction to LOD, see Europeana’s video “Linked Open Data: What Is It?”). LOD is a way of publishing information so that it can be connected with other information (=linked) and freely used (=open). LOD promises to revolutionize information retrieval by disambiguating searches and connecting relevant information from different sources, all updated dynamically within the same global Web space.
Our session was well attended, and the audience was engaged as each speaker described LOD projects. I spoke about the issues involved in preparing the Getty vocabularies for publication in the LOD cloud. The Getty recently announced an Open Content Program, currently focused on free and unrestricted access to high-resolution digital images. LOD takes this open content policy one step further, making large, structured datasets freely available in machine-readable form. Controlled vocabularies expressed as LOD allow the synonyms, terms in various languages, and thesaural relationships to be used to significantly enhance discovery and retrieval. The Art & Architecture Thesaurus (AAT) was published as LOD in February 2014. Plans are in place to release the other Getty vocabularies as LOD: TGN (Thesaurus of Geographic Names) in July 2014, ULAN (Union List of Artist Names) in January 2015, and CONA (Cultural Objects Name Authority) in July 2015.
Publishing existing data as LOD is neither simple nor easy. We had to determine the open data license best suited for our data and our institution. The project to publish our electronic thesauri, developed over decades according to rigorous technical and intellectual standards, involved intense analysis and long hours of work by the Getty technical team (particularly Joan Cobb, Gregg Garcia, and consultant Vladimir Alexiev), working with a Getty LOD steering committee and external advisors. Technical issues required resolution. For example, although we used standard ontologies for LOD whenever possible, a Getty Vocabulary Program (GVP) ontology was required for certain classes and properties to express the richness of the AAT.
My presentation at the conference described how the Getty Vocabulary Program prepared our controlled vocabulary data for release as LOD. The Getty vocabularies lend themselves well to linking: records and key elements within records, such as terms and names, are identified by unique, persistent numerical IDs. In addition, the vocabularies already contain links: thesaural relationships (equivalence, hierarchical, and associative), and implied conceptual links between vocabularies. (CONA, the newest Getty vocabulary, with records for works of art, architecture, and material culture, is the exception: from the beginning it has been both intellectually and electronically linked to the other vocabularies.)
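To make the idea of ID-based records and thesaural relationships concrete, here is a minimal Python sketch. It is only an illustration and not the Getty's actual data model: the class name, the field names, every numeric ID, and the associative link to "incense" are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Concept:
    """One vocabulary record, keyed by a persistent numeric ID (hypothetical)."""
    concept_id: int
    preferred_term: str
    # Equivalence relationships: variant terms for the same concept.
    variant_terms: List[str] = field(default_factory=list)
    # Hierarchical relationship: ID of the broader (parent) concept, if any.
    broader_id: Optional[int] = None
    # Associative relationships: IDs of related, non-hierarchical concepts.
    related_ids: List[int] = field(default_factory=list)


# All IDs below are invented for illustration; they are not real AAT IDs.
records: Dict[int, Concept] = {
    1001: Concept(1001, "incense burners", variant_terms=["incense burner"]),
    1002: Concept(1002, "censers", variant_terms=["censer"],
                  broader_id=1001, related_ids=[1003]),
    1003: Concept(1003, "incense"),
}


def broader_chain(concept_id: int, records: Dict[int, Concept]) -> List[str]:
    """Follow hierarchical links upward and return the preferred terms found."""
    chain = []
    current = records[concept_id].broader_id
    while current is not None:
        chain.append(records[current].preferred_term)
        current = records[current].broader_id
    return chain


print(broader_chain(1002, records))
```

Because every link is stored as an ID rather than as a text label, a term can be corrected or translated without breaking the relationships that point to its record.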
An example of the editorial work needed to create truly linked data is the process of converting the implied conceptual links into actual links. For instance, the nationality/culture controlled list within ULAN should now map to terms in the AAT. Much of this mapping could be done algorithmically by comparing each ULAN nationality term to AAT terms, but the results had to be vetted by the editorial staff. “East German,” for example, was a historical nationality in the ULAN list but did not exist in the AAT; the term was added to the AAT so that the link could be made. In other cases, there were false or ambiguous matches. “Merovingian” in the ULAN list matched two terms in the AAT (with different qualifiers), so editors had to indicate which “Merovingian” should be mapped to the ULAN nationality term. Extensive cleanup and research were also done for languages and associative relationships in the AAT.
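The triage described above can be sketched in a few lines of Python. This is an assumption about how such a comparison might be structured, not the process the Getty technical team actually used: the function name, the IDs, the "French" entry, and the placeholder qualifiers are all invented for illustration.

```python
from collections import defaultdict


def map_nationalities(ulan_terms, aat_terms):
    """Sort each ULAN nationality into matched, ambiguous, or missing.

    aat_terms is a list of (aat_id, term, qualifier) tuples.
    """
    # Index AAT terms by case-insensitive label; one label may have several IDs.
    by_term = defaultdict(list)
    for aat_id, term, qualifier in aat_terms:
        by_term[term.casefold()].append((aat_id, qualifier))

    matched, ambiguous, missing = {}, {}, []
    for term in ulan_terms:
        candidates = by_term.get(term.casefold(), [])
        if len(candidates) == 1:
            matched[term] = candidates[0][0]   # single candidate, still needs vetting
        elif candidates:
            ambiguous[term] = candidates       # editor must choose among qualifiers
        else:
            missing.append(term)               # term must first be added to the AAT
    return matched, ambiguous, missing


# Invented IDs and placeholder qualifiers; not real AAT data.
aat_terms = [
    (2001, "French", None),
    (2002, "Merovingian", "qualifier A"),
    (2003, "Merovingian", "qualifier B"),
]
ulan_terms = ["French", "East German", "Merovingian"]

matched, ambiguous, missing = map_nationalities(ulan_terms, aat_terms)
print("matched:", matched)
print("ambiguous:", ambiguous)
print("missing:", missing)
```

Even the "matched" group is only a list of candidates: a label can match exactly and still be a false match, which is why editorial vetting remains part of the process.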
For the VRA session, I laid out a use-case scenario in which the Getty vocabularies could aid research and discovery in a future LOD environment. Let’s imagine that a researcher finds an interesting article online about the historical use of incense burners in Mexico. Exploring the topic further today would require many hours or days of research; however, LOD will enable a new generation of search engines to follow the links between data sources to deliver more complete answers in much less time. In this use case, the AAT could provide variant spellings, synonyms in other languages for “incense burners,” and the narrower concept “censers” with its variant terms, enabling the researcher to instantaneously discover numerous museum sites and articles on this topic. The AAT hierarchy could also focus the search on censers attributed to Pre-Columbian cultures. The user could explore geographic regions where these censers were created through TGN place names, hierarchies, and linked maps. The names and biographies in ULAN could lead the user to pertinent information about artists and patrons associated with the creation of the censers. CONA, which ideally will have subject indexing, could provide links to photographs, paintings, or even YouTube videos portraying usage of censers (see an entertaining video of a “monster censer” at Santiago de Compostela, Spain).
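The first step of this scenario, widening a search with variant terms and narrower concepts, might look roughly like the following Python sketch. The thesaurus fragment is illustrative and was not copied from the AAT, and the function name and data layout are assumptions; a real LOD application would retrieve this information from the published vocabulary rather than from a hard-coded dictionary.

```python
# Hypothetical thesaurus fragment: each concept lists its variant terms
# (including a Spanish synonym) and its narrower concepts.
thesaurus = {
    "incense burners": {
        "variants": ["incense burner", "incensarios"],
        "narrower": ["censers"],
    },
    "censers": {
        "variants": ["censer"],
        "narrower": [],
    },
}


def expand_query(term, thesaurus):
    """Return the term, its variants, and all terms below it in the hierarchy."""
    expanded = []
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against revisiting a concept
        seen.add(current)
        expanded.append(current)
        # Unknown terms are kept as-is, with no variants or narrower concepts.
        entry = thesaurus.get(current, {"variants": [], "narrower": []})
        expanded.extend(entry["variants"])
        stack.extend(entry["narrower"])
    return expanded


print(expand_query("incense burners", thesaurus))
```

A search engine could then submit every term in the expanded list, so that a page that mentions only “censer” is still found by a researcher who typed “incense burners.”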
Subsequent phases of Getty vocabulary LOD development will focus on using LOD on our own Web sites, collaboration with external sites, data harvesting, data visualization, etc. The world of LOD holds the promise of a marvelous environment where research and discovery possibilities are nearly infinite. However, while the number of LOD resources grows exponentially, LOD is so new that there are still relatively few examples where end-users can see the benefit now. The audience at the VRA session brought up questions and issues that merit further discussion. What can institutions do now to prepare for LOD? How can a visual resources department get administrative buy-in for an LOD project? What is the best way to deal with unreliable object data, already apparent in certain LOD resources where the data is not carefully vetted or “scrubbed” before publication? Could the Getty’s work authority, CONA, become a hub for cleaner, more reliable object data, accurately linked to the other Getty vocabularies? Stay tuned.
Patricia Harpring is the Managing Editor of the Getty Vocabulary Program, an operating unit of the Getty Research Institute (GRI) in Los Angeles.