This is the third post in our Understanding Shakespeare series. In the first two posts Alex described the partnership and process behind the project. In this post we’ll take a peek under the hood to look at the approach used in the content matching.
The core idea that we wanted to explore in this project was whether users would find value in being able to use a primary text as a portal for locating secondary literature, specifically journal content available from JSTOR. The basic premise was validated in the exploratory interviews we conducted with scholars and students. While the exact form of the capability still needed to be fleshed out, we knew that we needed a means for connecting play content to journal articles at a fairly granular level.
So we had some preliminary user validation of our core idea. Now we wanted to put this into a high-fidelity prototype for our next round of user testing. For that we needed data because, to quote a line from King Lear, “Nothing will come of nothing.” We had a lot of dots that needed connecting. In our vision for the tool we not only wanted to match article references with the passage in the play but also connect both of these to physical artifacts that provide visual representation of the article and play for web-based display and navigation. Specifically, we wanted to highlight and link the passages in the Folger Digital Text with the specific regions on the JSTOR scanned page images.
Our initial plan was to find explicit play references in JSTOR articles and create bi-directional links between the play and article using the mined references. This seemed a reasonable (and potentially straightforward) approach, as many of the articles we looked at used a convention where a referenced passage from a play was annotated with the act, scene, and line number in the referencing article. However, initial attempts at using these references proved problematic, as multiple plays were often referenced in the same article and the play text referenced could be from an edition other than the Folger edition we were using. In addition to these challenges we also had to contend with text recognition errors in the article text, which is produced using optical character recognition (OCR). While these were likely tractable problems, we concluded that a fuzzy text matching approach would likely provide a more robust solution.
Our data scientist, Yuhang Wang, was tasked with designing and implementing an algorithm for performing the fuzzy text matching. Both block and inline quotes were to be considered in the matching process. Block quotes were identified by using OCR coordinates to find text passages offset from the surrounding text. Inline quotes were identified as text bounded by quotation marks (“ ”). After normalizing the text in the extracted quotes and play text, candidate matches were found using a fuzzy text matching process based on the Levenshtein distance measure. Levenshtein edit distance measures how different two texts are by counting the minimum number of operations (the removal or insertion of a single character or the substitution of one character for another) required to transform one text into the other. Using this approach we found, for each candidate quote, the substring from the play text with the smallest Levenshtein edit distance.
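To make the matching step concrete, here is a minimal sketch in Python. It is not the code used in the project. The function names, the normalization rules (lowercasing, stripping punctuation, collapsing whitespace) and the quote-detection pattern are all assumptions made for illustration. The matcher uses a standard dynamic-programming variant of Levenshtein distance in which a match may start anywhere in the play text, so it returns the smallest distance between the quote and any substring.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace (assumed rules)."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def extract_inline_quotes(article_text: str) -> list[str]:
    """Return text bounded by straight or curly double quotation marks."""
    return re.findall(r'[“"]([^“”"]+)[”"]', article_text)


def best_substring_match(quote: str, text: str) -> tuple[int, int]:
    """Return (distance, end) for the substring of `text` closest to `quote`.

    `end` is the index just past the last character of the best substring.
    """
    # Row 0 is all zeros: an empty prefix of the quote matches an empty
    # substring at any position, so a match may start anywhere in the text.
    prev = [0] * (len(text) + 1)
    for i, qc in enumerate(quote, start=1):
        # Column 0: matching quote[:i] against nothing costs i deletions.
        curr = [i] + [0] * len(text)
        for j, tc in enumerate(text, start=1):
            cost = 0 if qc == tc else 1
            curr[j] = min(
                prev[j] + 1,         # drop a character from the quote
                curr[j - 1] + 1,     # absorb an extra character of the text
                prev[j - 1] + cost,  # match or substitute
            )
        prev = curr
    best_end = min(range(len(prev)), key=prev.__getitem__)
    return prev[best_end], best_end
```

The running time is proportional to the length of the quote multiplied by the length of the play text, which is why the full job over every quote and every play becomes expensive.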
The matching of article text to play passages required some significant computation as the Levenshtein edit distance had to be calculated for each extracted article quote and all possible substrings in the play text. For that we used our in-house Hadoop cluster and some carefully crafted MapReduce programs. It’s safe to say that prior to the advent of technologies such as Hadoop and MapReduce permitting highly parallelized text processing this project would not have been practical.
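The post does not go into how the MapReduce jobs were structured, so the following is only a sketch of the general shape such a job could take, written in plain Python rather than against the Hadoop API. The chunking of the play text, the identifiers and the function names are all hypothetical. The map phase scores every quote against every chunk, and the reduce phase keeps the best-scoring chunk per quote.

```python
from itertools import groupby
from operator import itemgetter


def substring_distance(quote: str, text: str) -> int:
    """Smallest Levenshtein distance between `quote` and any substring of `text`."""
    prev = [0] * (len(text) + 1)
    for i, qc in enumerate(quote, start=1):
        curr = [i] + [0] * len(text)
        for j, tc in enumerate(text, start=1):
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (qc != tc))
        prev = curr
    return min(prev)


def map_phase(quotes: dict[str, str], chunks: dict[str, str]):
    """Emit (quote_id, (distance, chunk_id)) for every quote/chunk pair."""
    for quote_id, quote in quotes.items():
        for chunk_id, chunk in chunks.items():
            yield quote_id, (substring_distance(quote, chunk), chunk_id)


def reduce_phase(pairs) -> dict[str, tuple[int, str]]:
    """Keep the lowest-distance (distance, chunk_id) for each quote_id."""
    best = {}
    by_quote = sorted(pairs, key=itemgetter(0))  # stands in for the shuffle step
    for quote_id, group in groupby(by_quote, key=itemgetter(0)):
        best[quote_id] = min(value for _, value in group)
    return best
```

Because each quote/chunk pair is scored independently, the map phase can be spread across as many workers as are available. One caveat with this layout is that a quote spanning a chunk boundary would be missed unless neighboring chunks overlap.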
This fuzzy text matching approach worked well overall, identifying nearly 800,000 candidate matches in the 6 plays analyzed. After applying filtering thresholds to reduce the false hits we ended up with 26,237 matches for the initial version of the prototype. As might be expected, the matching accuracy was much better for longer quotes but tended to include a good number of false hits on smaller passages (15-20 characters or fewer) when the quote consisted of commonly used words and phrases. A future refinement of the filtering process will likely include a measurement of how common a phrase is in modern usage. This would enable us to keep an 11-character quote like “hurly burly” but inhibit matches for something that occurs more frequently like “is not true”.
Overall, we are pretty happy with the approach used and the results. We have started thinking about how to improve the robustness and accuracy of the approach and also how to generalize it for use with other texts. Stay tuned as this looks to be an area in which more interesting tools may well emerge from the Labs team.
We are also planning to make the data generated in this project available as a downloadable dataset for use by other scholars and researchers. The dataset will include all of the candidate matches. More to come on this soon… in the meantime here are a few extracts from the first 6 plays we’ve incorporated into the prototype.
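As a rough illustration of how such a filter might look, here is a small Python sketch. The thresholds, the function name and the frequency table are hypothetical. The frequency values in particular are made-up placeholders rather than measured data, and the actual thresholds used for the prototype are not given in this post.

```python
# Hypothetical relative frequencies in modern usage.
# These are illustrative placeholders, not real measurements.
PHRASE_FREQUENCY = {
    "hurly burly": 1e-9,
    "is not true": 1e-5,
}


def keep_match(
    quote: str,
    distance: int,
    phrase_frequency: dict[str, float],
    max_error_rate: float = 0.2,   # assumed cap on edits per quote character
    short_length: int = 20,        # quotes this short get the frequency check
    max_frequency: float = 1e-7,   # assumed cutoff for "too common"
) -> bool:
    """Decide whether a candidate match survives filtering."""
    if not quote:
        return False
    # Reject matches that needed too many edits relative to quote length.
    if distance / len(quote) > max_error_rate:
        return False
    # Short quotes are kept only if the phrase is rare in modern usage.
    # Phrases missing from the table are treated as rare (frequency 0.0).
    if len(quote) <= short_length:
        return phrase_frequency.get(quote, 0.0) <= max_frequency
    return True
```

With these placeholder values, `keep_match("hurly burly", 0, PHRASE_FREQUENCY)` keeps the match while `keep_match("is not true", 0, PHRASE_FREQUENCY)` rejects it. Treating unknown phrases as rare is a design choice that favors recall, and a stricter filter could reject them instead.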
Matched quotes by play: