Friday, March 21, 2014

Announcing the Perseus Lexical Inventory

Announcing the Perseus Lexical Inventory – an open linked data set
Many linguistic services and tools depend on lexical information of the kind commonly found in Latin and Greek dictionaries. Most of these applications rely on their own implementations of dictionaries, stem databases, etc., but there is no centralized open-access resource on which these services can draw for supporting data. The Perseus Digital Library is releasing its lexical data as an open linked data set, starting with Latin and to be followed by Greek, in the hope that it may eventually become such a resource. Producing this data set has been a collaborative effort, and it would not have been possible without the guidance of Neel Smith of Holy Cross and Helma Dik of the University of Chicago.

The core of the Perseus Lexical Inventory is a CITE collection of Lexical Entity URIs. Each Lexical Entity identifier has associated properties, including a normalized form of the lexical entity (or lemma) and a short definition. The accompanying linked data set includes links between the Lexical Entity URIs, Morpheus lemmas, and entries in the Lewis and Short lexicons on Perseus, Alpheios and Logeion. A VoID file describing the data set is available at http://data.perseus.org/ds/lexical/void and a SPARQL endpoint for querying the data set is at http://services.perseus.tufts.edu/fuseki/sparql.html. There is also a simple demonstration query form that looks up entries by their Latin form at http://perseids.org/tools/lexical/query.html. The Tufts Morphology Service (currently available at http://services.perseids.org/bsp/morphologyservice) also supplies the corresponding Lexical Entity URIs for lemmas returned by Morpheus.
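For readers who want to script against a SPARQL endpoint rather than use the query form, here is a minimal Python sketch. It rests on several assumptions: the post does not list the predicates used for lemmas or definitions, so the query simply matches any literal value equal to the given form; `ENDPOINT` is a hypothetical placeholder, since the address above is an HTML form page and the machine-readable query URL should be taken from the VoID description or that page; and the function names are ours, not part of any Perseus API.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical placeholder: substitute the service's actual query URL.
ENDPOINT = "http://example.org/sparql"


def build_query(lemma, limit=10):
    """Build a vocabulary-agnostic query: any resource with `lemma` as a literal value."""
    # Escape backslashes and double quotes so the lemma is a valid SPARQL string literal.
    escaped = lemma.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?entity ?property WHERE { "
        "?entity ?property ?value . "
        f'FILTER(isLiteral(?value) && str(?value) = "{escaped}") '
        f"}} LIMIT {int(limit)}"
    )


def build_request_url(endpoint, query):
    """Encode the query as a GET request, as described in the SPARQL 1.1 Protocol."""
    return endpoint + "?" + urlencode({"query": query})


def parse_bindings(response_text):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    data = json.loads(response_text)
    return [
        {name: cell["value"] for name, cell in row.items()}
        for row in data["results"]["bindings"]
    ]


def run_query(lemma, endpoint=ENDPOINT):
    """Send the query and return parsed rows (performs a network request)."""
    request = Request(
        build_request_url(endpoint, build_query(lemma)),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urlopen(request, timeout=30) as response:
        return parse_bindings(response.read().decode("utf-8"))
```

Once the real query URL is known, `run_query("amo", endpoint=...)` would return one dict per match, with the matching resource under `entity` and the predicate that carried the literal under `property`. Inspecting those predicates is one way to discover the data set's vocabulary before writing narrower queries.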

Subsequent updates to the data set will include links to ontologies and other collections of uniquely identifiable entities, including parts of speech, lexical tokens or forms, stems, prefixes and suffixes, morphological analyses, metrical data, orthographical variants, and named entities. The lexical entities and tokens will also be linked to their occurrences in dictionaries and other lexica, texts (e.g. those of the Perseus corpus), treebanks, etc. Finally, we expect to link to other established and emerging data sets, including the Pleiades Gazetteer and the SNAP dataset of ancient prosopography.

Our ultimate goal is for the lexical data sets to be completely open, with various channels, including both user interfaces and service-based APIs, through which people and systems can contribute new data and corrections.
In keeping with the approach we have been taking with the release of our data (see the Perseus Catalog’s Roadmap towards Linked Data standards compliance), we are releasing the data knowing that we still have much work to do, and we will make progress towards the larger vision in incremental steps. Our next steps will include the release of a companion Greek Lexical Inventory, followed by the addition of the stem and lexical token data sets and the development of APIs and interfaces for using and contributing to the data.
