Labs No-Longer-Secret Ingredient

the JSTOR Thesaurus

Over the past year you may have noticed a quiet but powerful feature appear in many of the JSTOR Labs projects: article-specific terms, or keywords. For example, in both Understanding Shakespeare and Classroom Readings, you’ll find them underneath article titles, separated by pipes.

In this post, I’d like to give you some background on this feature and describe how we’ve used it specifically on the Sustainability prototype.

Introducing the JSTOR Thesaurus

JSTOR, as you likely know, has content across virtually all of the academic disciplines from hundreds of different publishers.  If we had a way of classifying all of that disparate content in a consistent way, it could help people better find the content that they’re looking for.  These article-specific terms that crop up in a variety of ways in Labs projects are one approach we’ve been exploring to achieve this. They are generated by something called the JSTOR Thesaurus.

The JSTOR Thesaurus is a semantic index, or a rules-based hierarchical list of terms. Let’s split that definition into its two parts:

A hierarchical list of terms - The Thesaurus is a hierarchy where the top terms are in that position because they are the most inclusive, and, at every subsequent level, a narrower term is a part/subset/instance/example of its broader term (i.e., a parent/child relationship).
Rules for applying those terms - Terms all by themselves can be ambiguous, so we create rules that define how and when a term is applied to a document by our indexing. For example, the word herring can refer either to the type of fish or to a red herring as used in literature. We use a rule like the one below to tag only the fishy uses.

IF (MENTIONS “fish*” OR MENTIONS “clupea*” OR WITH “salted” OR WITH “smoked” OR WITH “pickled” OR AROUND “spawn*”) USE Herring
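To make the idea concrete, here is a minimal Python sketch of how a rule like this might be evaluated. It is not the actual indexing software, and the operator meanings are assumptions made for illustration: MENTIONS is read as "appears anywhere in the document", WITH as "appears in the same sentence as herring", and AROUND as "appears within a few words of herring". The function names and the five-word window are hypothetical too.

```python
import re

def _tokens(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def _matches(pattern, token):
    """Compare a token to a pattern; a trailing * acts as a prefix wildcard."""
    if pattern.endswith("*"):
        return token.startswith(pattern[:-1])
    return token == pattern

def herring_rule(text, around_window=5):
    """Return True if the document should be tagged with the term Herring.

    Assumed operator semantics (not confirmed by the source):
      MENTIONS - pattern occurs anywhere in the document
      WITH     - pattern occurs in the same sentence as "herring"
      AROUND   - pattern occurs within `around_window` words of "herring"
    """
    sentences = [_tokens(s) for s in re.split(r"[.!?]+", text)]
    all_tokens = [t for s in sentences for t in s]
    if "herring" not in all_tokens:
        return False

    # MENTIONS "fish*" OR MENTIONS "clupea*"
    if any(_matches(p, t) for p in ("fish*", "clupea*") for t in all_tokens):
        return True

    # WITH "salted" OR WITH "smoked" OR WITH "pickled"
    for sentence in sentences:
        if "herring" in sentence and any(
            t in ("salted", "smoked", "pickled") for t in sentence
        ):
            return True

    # AROUND "spawn*"
    herring_positions = [i for i, t in enumerate(all_tokens) if t == "herring"]
    spawn_positions = [i for i, t in enumerate(all_tokens) if _matches("spawn*", t)]
    return any(
        abs(i - j) <= around_window
        for i in herring_positions
        for j in spawn_positions
    )
```

Under these assumptions, a sentence about smoked herring on rye would be tagged, while a detective story that dismisses a clue as a red herring would not.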

To build our initial list of terms, we combined over twenty discipline- and subject-specific taxonomies. These came from sources as disparate as NASA and ERIC, the U.S. Department of Education’s online repository for research literature on education. This combination helped us achieve the broad and encompassing view we needed of the entire JSTOR corpus. We currently have over 57,000 terms in the Thesaurus, but we never stop adding, editing and pruning terms, or adjusting and improving the rules they operate under.

The Sustainability Prototype

As Alex described last week, with the Sustainability Prototype our goal was to create a site smart enough that the content was greater than the sum of its parts. To achieve this, we needed to invest in building out a more comprehensive set of terms for the interdisciplinary sections of the Thesaurus that Sustainability covers. To do that, we worked with a group of experts within the fields associated with Sustainability, ranging from Industrial Economics to Environmental History, to identify over 1,500 key terms from across the Thesaurus that are specifically linked to the study of Sustainability. Those same experts then helped us extend the set of synonyms and non-preferred terms that appear for each term. The terms appear throughout the prototype, including on individual article pages, in the search results, and as a filter within search, helping users to find specific articles and content. They also appear on topic pages like the one below, which you can browse through to create a mental map of the interdisciplinary terrain that Sustainability covers.
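As a rough illustration of why synonyms and non-preferred terms matter for a search filter, the Python sketch below maps whatever label a user types onto a single preferred term before filtering. Everything in it is hypothetical: the sample terms, the article records, and the function names are invented for the example and are not taken from the Thesaurus or the prototype.

```python
# Invented sample data: non-preferred labels mapped to a preferred term.
NON_PREFERRED = {
    "clean energy": "Renewable energy",
    "green energy": "Renewable energy",
}

# Invented article records, each tagged with preferred terms only.
ARTICLES = [
    {"title": "Wind Power on the Plains", "terms": {"Renewable energy", "Wind power"}},
    {"title": "A History of Coal Towns", "terms": {"Coal mining"}},
]

def preferred_term(label, non_preferred):
    """Resolve a label to its preferred term; unknown labels pass through."""
    cleaned = label.strip()
    return non_preferred.get(cleaned.lower(), cleaned)

def filter_by_term(articles, query, non_preferred):
    """Return titles of articles tagged with the preferred form of `query`."""
    term = preferred_term(query, non_preferred)
    return [a["title"] for a in articles if term in a["terms"]]
```

With this sample data, filtering on "Green Energy" and filtering on "Renewable energy" would return the same article, because both resolve to the same preferred term.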

The Thesaurus is always improving and continues to grow, expand and deepen with each new project. The Sustainability Prototype gave us the opportunity to showcase the potential of the Thesaurus, and we’re eager to see if it’s as impactful as we think it will be. If, as you explore the site, you have suggestions or questions about the Thesaurus, we’d love to hear them! Just shoot us an email at labs@ithaka.org.