"How Have Plants Shaped Societies?"
Blog Post about Plant Humanities Initiative Published at Scientific American
Tue 13 Nov 2018
Announcing the Plant Humanities Initiative
Fri 09 Nov 2018
I couldn't be more thrilled to announce, in collaboration with Dumbarton Oaks, the Plant Humanities Initiative. This three-year project will combine scholarly programming from Dumbarton Oaks with a new research tool from JSTOR Labs to help foster the development of this exciting new academic field.
Last December, Dumbarton Oaks and JSTOR hosted a daylong workshop to discuss botany, primary sources and digitization. During the workshop, JSTOR Labs led a "design jam" to brainstorm potential ways to help plant researchers: you can see a short video of the ideas that emerged here. To get a sense of which of those ideas seemed most exciting, we performed some quick paper-prototype user-testing at Columbia University's Center for Science and Society. Here are the two ideas (still far from fleshed out -- they're just concepts) that, based on that user-testing, seem most exciting:
Now, in collaboration with Dumbarton Oaks, and with generous funding from The Andrew W. Mellon Foundation, we're going to take one of these concepts and make it a reality. We will select and refine the concept to build based on the input from an amazing Advisory Committee.
What's most exciting, to me, about this project, is that this new tool will be developed in close collaboration with its users. Dumbarton Oaks will host fellows whose work will include incorporating the tool into their research. (You might consider applying for these fellowships.) Taking an iterative approach and including the users in our design activities will help us to make sure that what we build is useful both for these specific researchers, and others as well.
I look forward to sharing our progress with you as we go, and most especially to sharing the tool when it's developed!
Conceiving New Tools for Public Health Researchers
with Rutgers University
Thu 09 Aug 2018
In April 2018, JSTOR and Rutgers University convened a workshop of scholars, librarians and students to brainstorm new ways to support Public Health researchers. Using a series of design thinking activities, they conceived a number of new tools and services, which were subsequently user-tested with both students and faculty. This video describes that work, shares the ideas that emerged and presents the findings from the user tests.
Introducing Cultural History Baseball Cards
Wed 08 Aug 2018
As the JSTOR Labs team began its collaboration with the amazing folks at Library of Congress Labs, and we interviewed a series of baseball historians, researchers and journalists, I kept returning to the concept of baseball cards. I used to collect them, as a kid, nearly cracking my teeth on that gum they came with. I remember practicing the different batting stances shown in the pictures on the front, and studying the stories in the statistics on the back. Each card was a tiny window onto the drama of a season of baseball, told one player at a time.
The challenge we set for ourselves with this project was to find a way to bring together collections that were alike in subject matter -- they were all related to baseball -- but diverse in form and structure, ranging from academic articles to pictures of baseball jerseys. How to organize the material so that users could find their way through it? We held a collaborative flash build with the LC Labs team to explore our options. The LC Labs team ended up organizing the material geographically; you can see the results of their work here: Mapping America’s Pastime.
We at JSTOR Labs were interested in organizing the material around the players, managers, and other figures at the center of the history of baseball, and that’s where the idea of baseball cards popped up. (Credit where it’s due: I think the idea first popped up in a conversation the LC and JSTOR Labs teams were having with Alex Stinson and Ben Vershbow from Wikimedia). If a “normal” baseball card told the story of a single season of baseball one player at a time, then ours would tell the history of baseball more broadly. It would focus less on the play on the field as represented in baseball statistics and instead shine a light on baseball in society.
And so, I am thrilled to announce the release of Cultural History Baseball Cards. Pick a player, view their card, browse through primary and secondary resources about that player from JSTOR, the Library of Congress, Wikimedia and the Smithsonian’s National Museum of African American History & Culture. The story of baseball, as seen through the people contributing to it – not just vibrant players like Jackie Robinson, Ted Williams and Willie Mays, but others, too, like Marvin Miller, the head of the players union, or Bill James, who spearheaded the statistical revolution in baseball.
We see the site more as a proof of concept than a finished product: certainly there are more collections that could and should be incorporated. There could also be value in connecting out to resources like Baseball Reference that look at these historical figures from different angles. One can also imagine expanding the idea beyond baseball. And those are just our ideas: we are eager to hear what you think. So, I encourage you to explore Cultural History Baseball Cards – with or without a stick of incredibly stale gum.
A Collaboration with Library of Congress Labs
Fri 06 Jul 2018
I am madly in love with baseball. There are few things I enjoy more than a summer evening at a ballpark, although listening to a game on the radio while sitting around a campfire comes pretty close. I love the ebb and flow of the game, I love the big and little stories it can tell, and I love its deep connection with our history. So, when the Library of Congress Labs team approached us about a collaboration centering on their Baseball Americana exhibition, I could hardly contain my excitement.
For the past few months, the two Labs teams have been laying the foundation for the work of building a tool to help baseball researchers. First, both teams have been preparing the content and data that we’ll use to build a proof of concept. Second, we’ve been interviewing baseball researchers of all sorts -- including established sports historians, cultural studies scholars, students, and sports journalists -- to better understand the tasks, goals and hurdles that they currently face when studying baseball and our cultural heritage.
Now, during the week of July 9th, the two teams will meet in Washington DC for a weeklong flash build. We don’t know yet precisely what we’ll build -- that’s part of the fun of these design-thinking-based methods -- but we are planning to share the results with the world on the last day, Friday. Please join the two Labs teams for a free webinar on Friday, at which we’ll describe the process we’ve gone through, review the data and content we’ve had to work with, and unveil the proof of concept tool that we develop. Register now, here: https://www.eventbrite.com/e/inside-baseball-baseball-collections-as-data-tickets-46442562956?aff=es2.
We’ll see you on Friday, July 13th. Play ball!
Text Analyzer Wins SSP's First Ever Previews Session People's Choice Award
Mon 02 Jul 2018
Good news alert! Here's the press release: https://www.sspnet.org/community/news/jstors-text-analyzer-product-wins-first-ever-ssp-previews-session-peoples-choice-award/
And here's a video of the 5-minute preview session itself: https://youtu.be/a66tTylPfhY?t=43m33s
Botany and the Humanities
Workshop at Dumbarton Oaks
Mon 19 Mar 2018
The goal of this workshop, cohosted by JSTOR and Dumbarton Oaks and held at Dumbarton Oaks in Washington D.C. in December 2017, was to foster an exchange about extending the primary sources available online and increasing their usefulness to researchers and students, with a focus on the social and historical contexts of botany and horticulture. Informed by user research conducted prior to the meeting, we discussed how individuals currently discover and use botanical primary sources, brainstormed and sketched possible new approaches to meet their needs, and considered what current and future approaches and source materials held the greatest promise for enhancing the value of this content for a broader array of research communities. This video captures the structured brainstorming and describes a handful of the most promising ideas that emerged at the workshop.
The Journal of Electronic Publishing
Has Published Our White Paper
Wed 28 Feb 2018
Sprechen Sie Textanalysierer?
Creating a Multilingual Text Analyzer
Thu 18 Jan 2018
We released Text Analyzer a little less than a year ago. Since then, we’ve made any number of adjustments to it based on your feedback (improving the recommendations, adding search filters, My List and MyJSTOR integration, etc.), and meanwhile its usage has grown along with our belief that this new way to search holds great promise. One of the reasons people tell us they come to Text Analyzer is that it helps to find related content in unfamiliar disciplines or subject areas. In doing so, it breaks them out of the disciplinary or citation-based siloes that they’d been working in, and we feel this is critical for multidisciplinary work and for scholarship more generally.
Today, I’m pleased to announce a new, experimental feature for Text Analyzer that hopefully will help to break people out of a different kind of silo, one based on language. You can now upload or point Text Analyzer at a document in any of a dozen languages, and it will help you to find related English-language content. This means that if you’re a researcher whose first language isn’t English, you can search with material in your native language. You’ll still have to translate or read the English-language results, but hopefully this makes it easier and more natural to find the right articles and chapters for you to read.
Currently, this experimental feature works for the following languages: English (but you knew that already), Arabic, (simplified) Chinese, Dutch, French, German, Hebrew, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish and Turkish.
Anyone who has used Google Translate will know that algorithmic translation can be a tricky game. Sometimes, it’s good enough to give you the gist of a text, but sometimes it stumbles and just produces gibberish. That’s why we think the approach we’ve taken is an interesting and potentially promising one.
Text Analyzer, you may recall, works using a topic model. Each topic is composed of a cluster of terms – if enough of those terms are found in a text, then it’s more likely that that topic is being discussed. What we’ve done is to create topic models in multiple languages, and link them at the level of high-level topics. We’ll post the technical details of this in a future blog post, but what this means is that when you upload a document in Russian, Text Analyzer uses its Russian topic model to figure out what the text is about (in Russian). Each topic inferred in a native language is associated with the corresponding topic in the English topic model, and those English topics are then used by Text Analyzer to do what it’s done since it launched – find articles and chapters in JSTOR that are about the same topics.
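To make the linking idea a little more concrete, here is a toy sketch in Python. It is not the Text Analyzer implementation (those details are not covered in this post): the tiny hand-written term clusters, the shared topic ids, the `min_hits` threshold and the placeholder document ids are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical, hand-made "topic models": topic id -> cluster of terms.
# Real topic models are learned from large corpora; these are illustrative only.
TOPIC_MODELS = {
    "de": {
        "botany": {"pflanze", "blatt", "wurzel", "blüte"},
        "baseball": {"baseball", "werfer", "schläger", "spielfeld"},
    },
    "fr": {
        "botany": {"plante", "feuille", "racine", "fleur"},
        "baseball": {"baseball", "lanceur", "batte", "terrain"},
    },
}

# Hypothetical English-side index: shared topic id -> placeholder document ids.
ENGLISH_INDEX = {
    "botany": ["english-doc-1"],
    "baseball": ["english-doc-2"],
}


def infer_topics(text, language, min_hits=2):
    """Return the topic ids whose term clusters occur at least min_hits times."""
    words = Counter(re.findall(r"\w+", text.lower()))
    topics = []
    for topic_id, terms in TOPIC_MODELS[language].items():
        hits = sum(words[term] for term in terms)
        if hits >= min_hits:
            topics.append(topic_id)
    return topics


def find_english_documents(text, language):
    """Infer topics in the source language, then look up English documents
    through the shared topic ids. No translation of the text is involved."""
    results = []
    for topic_id in infer_topics(text, language):
        results.extend(ENGLISH_INDEX.get(topic_id, []))
    return results


german_text = "Die Pflanze hat ein Blatt und eine Wurzel."
print(find_english_documents(german_text, "de"))
```

The point of the sketch is that the only thing crossing the language boundary is the topic id, which is why no sentence-level translation is needed.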
The practical upshot of this is, we think, pretty exciting: cross-language content discovery, while avoiding both the expense of manual translation and the robotic inaccuracy of algorithmic translation.
I should signal that this translational feature is still very much experimental! First, it currently only works in one direction – from these languages to English. JSTOR is primarily an English-language database, but it does have many documents in other languages. Ideally, this functionality would help English-speakers find those as well, but that is not yet developed. Second, at this point Text Analyzer only identifies a single language per uploaded document – we hope to be able to handle documents with multiple languages in the future. Last and most importantly, the non-English topic models are still in their early stages of development (in fact, if you’re a digital humanities practitioner interested in helping us develop one of our language-specific topic models, we’d love to hear from you!).
We’ll keep working on all these things. In the meantime, I hope you’ll give this new feature a try and let us know what you think! You may not be a polyglot, but now with Text Analyzer, you can research like one.
Cited Loci of the Aeneid
Searching through JSTOR's content the classicists' way
Mon 06 Nov 2017
The sheer amount of data contained in JSTOR raises the question of what is the most effective way for scholars to search it. The answer to this question is inevitably going to be discipline-specific as scholars in different fields do have different strategies for retrieving bibliographic information.
For students and scholars in Classics, for example, the ability to search for articles that quote or refer to a specific text passage (or range of passages) is of essential importance. An index of cited passages, which is normally found at the end of monographs or edited volumes, serves precisely this purpose. Yet, when it comes to searching through archives of journal articles like JSTOR, full text search is often the only functionality offered. And, in many cases, it's not sufficient, or not the most effective way of retrieving bibliographic information.
Cited Loci of the Aeneid is a proof-of-concept aimed at showing how technologies that are being developed rather independently in the field of Digital Humanities can, if combined, enable whole new ways of searching through large electronic archives. It's more a hack than it is a real project, and it is the result of an online conversation that went on for almost a year. This conversation brought together various research groups working on different yet intertwined topics: Neil Coffee, Chris Forstall, Caitlin Diddams and James Gawley, who had been investigating the automatic detection of intertextual parallels within classical texts in the Tesserae project; Ron Snyder from JSTOR Labs, who had been working on the extraction of quotations of primary texts (e.g. Dante's Commedia, Virgil's Aeneid, etc.); and Matteo Romanello, who had developed a system to capture canonical references to classical texts from the full text of journal articles.
Cited Loci of the Aeneid: a screenshot of the search interface.
2 Close, distant and scalable reading
The visualization of data in the Cited Loci of the Aeneid's interface was very much informed by two concepts, coming from the field of Digital Humanities: distant and scalable reading.
Distant reading, introduced by English literature scholar Franco Moretti, stands for the data-driven analysis of literary phenomena by means of quantitative methods. This idea, which has gained some momentum in recent years, has also sparked a debate among humanities scholars on whether reading literary texts is still necessary now that we can apply distant reading techniques to many of our literary corpora.
In an attempt to go beyond this dichotomy, Martin Müller has proposed the notion of scalable reading, a type of reading which moves constantly back and forth between these two perspectives on the text. Scalable reading essentially describes what happens when scholars have distant reading tools at their disposal. The patterns that emerge only when looking at a corpus of text from afar force the scholars to go back and read the texts in order to understand and make sense of these patterns.
The visual index (i.e. heatmap) displayed on the left side of the interface provides a bird's-eye view of the distribution of more than 11,000 quotations and roughly the same number of references across the various books of the Aeneid. At a glance, the reader can detect some patterns, such as the fact that Book 6 -- where the story of Aeneas' descent into the underworld is told -- has been quoted and referred to a great deal by scholars. The "scalable element" of the interface consists in the ability to select a text chunk of interest from this visual map and then examine the contexts where these quotations and references appear.
3 Behind the scenes: extracting quotations and references
On a technical level, quotations and references are extracted from JSTOR articles using two quite different approaches. Quotations are captured by means of Matchmaker, software developed at JSTOR Labs that employs fuzzy matching techniques to find possible quotations (i.e. reuses) of a target text -- the Aeneid in this specific case -- within a set of documents. While finding quotations requires us to identify the sequence of words from the original text that is quoted, the extraction of (canonical) references consists in identifying the piece of text that refers to an ancient work (e.g. "Virg, Aen. 6.264" refers to line 264 in Book 6 of the Virgilian poem). The references that are visualised in the Cited Loci of the Aeneid's interface were captured using a method and software developed by Matteo Romanello as part of his doctoral dissertation in digital humanities. This software is able to detect references written in a variety of languages, and following different citation styles, with a certain degree of accuracy.
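The difference between the two tasks can be illustrated with a small Python sketch. This is neither Matchmaker nor the reference extraction software described above: the regular expression covers only the single citation style shown in the example, and the sliding-window comparison based on the standard library's `difflib.SequenceMatcher`, along with the 0.85 threshold, is a simple stand-in chosen for illustration.

```python
import re
from difflib import SequenceMatcher

# Assumed pattern for one citation style only: an optional author
# abbreviation, then "Aen.", then book.line (e.g. "Virg, Aen. 6.264").
REFERENCE = re.compile(r"\b(?:Virg(?:il)?[.,]?\s+)?Aen\.\s*(\d{1,2})\.(\d{1,4})")


def extract_references(text):
    """Return (book, line) pairs for every Aeneid reference found in text."""
    return [(int(book), int(line)) for book, line in REFERENCE.findall(text)]


def normalize(text):
    """Lowercase, drop everything except a-z and spaces, split into words."""
    return re.sub(r"[^a-z ]", "", text.lower()).split()


def find_quotation(target_line, document, threshold=0.85):
    """Slide a window the length of the target over the document and return
    (score, matched_words) for the best window, or None if the best score
    is below the threshold. Tolerates small spelling or OCR differences."""
    target = " ".join(normalize(target_line))
    size = len(target.split())
    words = normalize(document)
    best_score, best_window = 0.0, None
    for start in range(len(words) - size + 1):
        window = " ".join(words[start:start + size])
        score = SequenceMatcher(None, target, window).ratio()
        if score > best_score:
            best_score, best_window = score, window
    return (best_score, best_window) if best_score >= threshold else None


article = ("As the poet writes, arma uirumque cano Troiae qui primus ab oris; "
           "compare Virg, Aen. 6.264 and Aen. 1.1.")
print(extract_references(article))
print(find_quotation("Arma virumque cano, Troiae qui primus ab oris", article))
```

Note that the quotation is found even though the article prints "uirumque" rather than "virumque", while the reference is found without any of the poem's words appearing at all.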
4 Explore, read, annotate
Cited Loci of the Aeneid is a proof-of-concept of a new generation of user interfaces that will allow users to access large archives of publications like JSTOR in a whole new way. The main uses it enables are: the interactive exploration of thousands of JSTOR articles; the ability to read the parts of an article that match certain search criteria without leaving the interface; and the possibility of annotating what is visually displayed and saving those annotations to a personal space.
A visual index (heatmap), which occupies the left side of the interface, provides an overview of the frequency and distribution of extracted quotations and references. Each cell represents a chunk of the text. The darker a cell, the higher the density of quotations and references that were found for that specific chunk of text.
Matching articles from JSTOR are shown in the right-hand panel, together with the snippet of text where the quotation/reference was found and a link to the article in JSTOR. The results can be filtered so as to show only articles with either quotations of or references to the Virgilian text.
Thanks to the integration with the platform hypothes.is, it is possible to annotate the visualisation (either privately or publicly). This way, you can take notes while you discover new articles related to the Aeneid.
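The counting behind such a heatmap can be sketched in a few lines of Python. The chunk size of 50 lines, the five shade levels and the function names are assumptions for illustration; the actual chunking and colour scale used by the interface are not specified here.

```python
from collections import Counter


def heatmap_counts(citations, chunk_size=50):
    """Count citations per (book, chunk) cell.

    citations: iterable of (book, line) pairs. Chunk indexes are 0-based,
    so with chunk_size=50 lines 1-50 fall in chunk 0, lines 51-100 in chunk 1.
    """
    cells = Counter()
    for book, line in citations:
        cells[(book, (line - 1) // chunk_size)] += 1
    return cells


def shade(count, max_count, levels=5):
    """Map a count to a shade from 0 (lightest) to levels - 1 (darkest)."""
    if max_count == 0:
        return 0
    return (count * (levels - 1)) // max_count


# Illustrative input only: (book, line) pairs, not real extraction results.
cells = heatmap_counts([(6, 264), (6, 270), (6, 1), (1, 1)])
darkest = max(cells.values())
for cell, count in sorted(cells.items()):
    print(cell, count, shade(count, darkest))
```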
5 Current use and future developments
The tool's first use was in the context of a graduate course on Latin intertextuality taught by Neil Coffee and his collaborators Caitlin Diddams and James Gawley at the University of Buffalo in the fall semester of 2016. Students rapidly understood the potential of the tool and made use of it during the course. The tool was found to be especially useful by a group of students who sorted through a list of parallels (i.e. similar passages) between the Aeneid and Homer. In fact, it allowed them to quickly gather up scholarship around a given passage of the Aeneid, and in doing so assess whether the passage had been previously recognised as bearing some degree of similarity with passages in the Iliad and the Odyssey.
As for future developments, the natural way forward seems to be to explore the possibility of extending the interface to cover other works of poetry beyond the Aeneid, such as the Homeric poems. However, the detection of quotations of works of Greek literature constitutes a considerable challenge, as the varying quality of OCR in the JSTOR corpus prevents the Matchmaker tool from identifying existing quotations.
You Can't Rollerskate in a Buffalo Herd
And You Can't Innovate When You're Afraid
Tue 17 Oct 2017
Here at JSTOR Labs, we think a lot about what it takes to create innovative tools and products. In this blog, we’ve touched on a few of the ingredients that foster a truly creative environment, including iterating, iterating again and then iterating still more, creating a deep understanding of the user, and developing open, collaborative partnerships. Today, I’d like to discuss one of the most important ingredients, and one that can get overlooked pretty easily: overcoming the fear of failure.
But first let’s talk about Country & Western music (1), which I just love. These days, when I’m in the mood for it, I’ll turn to the old standbys of Willie, Waylon, Merle and Johnny. When I was a kid, though, the country songs I liked were the completely silly ones with ludicrous names, like “A Boy Named Sue” and “May the Bird of Paradise Fly Up Your Nose.” One of my faves was a silly Roger Miller song called “You Can’t Rollerskate in a Buffalo Herd.” Here, have a listen.
(1) A risky transition! What if it doesn’t work? (Scared, yet?)
It’s a great song. It’s not a perfect country song, though, and not just because it doesn’t have much to say about Mama, or trains, or trucks, or prison, or gettin’ drunk (like the last, perfect verse of this song does). No, it’s not perfect because it never really explores the idea of rollerskating in a buffalo herd, which, c’mon, that’s some rich topical territory to mine, right there. I mean, why can’t you rollerskate in a buffalo herd? The song doesn’t tell you.
Having never tried it myself, I’ll have to conjecture. First, I imagine that the jostling and the horns and the fact that you’re on wheels makes it a bit tough to stay upright and safe. Second, if (no: when!) at some point you DO fall over, the combination of trampling, stampeding hooves and the wearing-rollerskates-thing means you may not be able to get up again. Scary stuff! Better not to do it in the first place. Thanks, Roger, for the heads-up.
Which, you’ll be relieved to hear, brings me to my point. If you fail at rollerskating-with-buffalo, the consequences are severe enough that it’s entirely rational to be afraid to try it at all. And that brings us back to innovation and its mortal enemy, the fear of failure. I’ve certainly heard lots about how you can’t innovate when you’re afraid, but the conclusion I hear most often is a disappointingly unhelpful, “So don’t be afraid.” What that mindset overlooks is that sometimes it’s entirely rational to be fearful. If you’re making major changes to the business model that sustains your enterprise, or if you’re making a bet that will affect your entire career (2), or you’re changing something that’s got thousands or even millions of current users, then you should be afraid. You’d be crazy and heedless to ignore that fear and just upend everything. Thing is, as Clay Christensen tells us, we also can’t get paralyzed by the fear and just not change. So what do you do? You create a space where it’s safe to fail.
(2) For instance, if you’re deciding to forego traditional publications and instead embrace digital scholarship that may or may not help you get tenure or promotion.
That’s the whole point of this post, so I’m going to go ahead and repeat it: You create a space where it’s safe to fail. This can be done in dozens of ways. Let’s explore some of the specific ways that work for JSTOR Labs:
Prototyping: There are many, many different types of prototypes, and ways to prototype. These range in effort required to produce them from napkin sketches and paper prototypes at one end to Wizard-of-Oz-tests and low-fidelity prototypes (i.e. wireframes) in the middle to high-fidelity, live-data prototypes at the other end. At heart, the point of a prototype is to learn as much as possible about your idea while investing as little effort as possible (making it less heartbreaking when it fails). Prototyping is particularly powerful when paired with the next two items on this list.
Rapid iterations: Innovation seldom comes in a single, big Eureka-moment. Instead, it’s the product of a thousand little realizations. The faster you can go through those learning iterations, the faster you can get to something really great, allowing you to make many small, less fear-inducing bets rather than one very large, scary one. Earlier in my career as a product developer, we’d test one or two different designs for a new product. Using various forms of prototyping and the next item on this list, JSTOR Labs tests dozens and sometimes even hundreds of designs in a matter of days or weeks.
In-person user-testing: Big releases are scary because the failure is so terribly public (hello, new Coke). But if instead you’re just showing it to a single person and they don’t like it, it’s just not much of a big deal. In fact, it’s kind of great, because then you can ask questions to better understand what would make this particular person stand up and go, “Wow.” We use guerilla testing when we need short, informal feedback from lots of people, and longer, scheduled tests when we need to dive deeper.
A/B testing and limited releases: In-person user-testing is essentially a very, very small release of the product. This idea can be expanded into still-limited releases through, for example, a pilot program. With a bit of tech, you can also do this programmatically, by releasing, say, to only 1% of your users (this, then, allows for A/B testing). Our colleagues at ITHAKA’s Build Smarter blog wrote about how ITHAKA and JSTOR use these kinds of tools.
Labeling: Setting expectations with users with a Beta label can be very powerful – it makes it easier to release early in order to get user feedback. We at JSTOR Labs tend to marry this label with a “Help us make this better” feedback link. The Beta label changes the relationship your users have with the tool and you; it brings them under the tent. With Text Analyzer, for example, we’ve received feedback from literally thousands of users through this simple form and social media. That feedback has been invaluable to us in finding and prioritizing issues.
Fun: The importance of this one really can’t be overstated. A sense of fun and play contributes to creativity. It also brings people together, making it safer to fail because others will pick you up. During our “design jam” workshops, participants have to come up with eight ideas in eight minutes and then share them. Then participants do it again, stealing each other’s ideas. Participants are encouraged to write down everything, even terrible ideas, because, first, you only have eight minutes, but also because your terrible idea might spark someone’s great one. When we run these workshops, we always send in advance a visual agenda that conveys to participants that they are in for some “serious fun.” For instance, we used this when we held a workshop with the Dante Society:
Using all these techniques doesn’t mean that there are no risks, or that fear completely goes away. But they do make that fear manageable. I’m still a little nervous, for example, that you’ll think it’s silly to organize this whole post around country music and a fifty-year-old Roger Miller song. But at least I had fun writing it. And if it didn’t work? I can deal with that failure. It’ll help me write a better post next time.
SXSWedu 2018 PanelPicker
5 Reasons to Vote for JSTOR Labs!
Mon 07 Aug 2017
JSTOR Labs needs your support! We’ve proposed to host a workshop entitled “Design Thinking: New Ways to Teach Old Texts” at SXSWedu 2018 in Austin, TX. To be accepted, however, we need your votes! Why, you may ask, should you bother to vote for JSTOR Labs’ session?
- Because the kind of entrepreneurial, project-based learning that JSTOR Labs is expert in will help students and teachers tackle complex problems.
- Because there are still so many more things we can do to improve how we teach literary and historical classic texts!
- Because sometimes we need to re-affirm our love of democracy.
- Because it is a universal truth that the creation of silly Lego videos should be rewarded.
- Because everyone who tweets a screencap of their vote to @abhumphreys will receive a very nice, custom surprise.
Vote for JSTOR Labs here: http://panelpicker.sxsw.com/vote/74783. Voting ends Friday, August 25.
I Am Lean QA
And You Can Too
Wed 02 Aug 2017
Greetings! I’m Aparna Bankston, one of the newest members of the JSTOR Labs team here in Ann Arbor, Michigan, and I’m your resident Quality Assurance (QA) Engineer, aka your “Guardian of Excellence!” Today, I’m going to share with you what it’s like to play this role on a team that works in the lean, agile way that JSTOR Labs does.
To do that, let’s first back up a second to talk about the role of the testing engineer on a team. When I first started in QA, we tested everything manually, going through a set of steps to test what the new code does to make sure that it did what the product managers and developers said it should. In recent years, we’ve automated much if not all of this testing. We write testing scripts and scenarios that test the different anticipated user workflows (the screenshot below shows an example of these scripts). Those tests are set up in a test suite which runs every time code is deployed, and tells developers whether or not their new code breaks something. This allows developers to get instant feedback on their work. It also saves me time, allowing me to move up in the process and work alongside the developer, thus getting tests in place earlier in the process.
But here’s the catch. In order to decide whether a new line of code is working, you need to know what it should be doing. That’s pretty easy on a fairly typical development team, where the team understands the product and works against a well-defined backlog of work. But JSTOR Labs doesn’t work that way. The Labs team runs constant experiments and tests in an effort to figure out how a product or feature should work – we don’t know it in advance because we’re working in uncharted territory. When we’re working on a project, the technology and interaction design are constantly changing as the team iterates towards a design that’s useful and useable to scholars and students. That rapid change makes it tough to figure out when to write automation to ensure quality is in place.
The question is no longer “can it be automated?”, but, "should it be automated and when?”
A great example of this is the release of JSTOR Labs’ newest tool, Text Analyzer. When I joined the team, I worked with the team to figure out what functionality was less likely to change. A team like Labs is always learning, and part of that learning is understanding what is working. We quickly identified some core functionality, and I created automated tests of that. Those tests turned into a basic regression suite that runs regularly and that I monitor. This is important because as new developments are released, we want to ensure that core functionality continues to work. A team like Labs likes to “move fast and break things.” The work that I do helps the team move fast by helping it know what it’s breaking, when.
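To give a flavor of what a small regression suite over stable core functionality can look like, here is a minimal sketch using Python's built-in `unittest` module. It is not the actual Text Analyzer suite: `top_terms` is a hypothetical stand-in for a piece of core functionality, and the test names and inputs are invented for illustration.

```python
import unittest
from collections import Counter


def top_terms(text, n=3):
    """Hypothetical stand-in for a core feature: the n most frequent words."""
    words = [word.strip(".,;:!?").lower() for word in text.split()]
    counts = Counter(word for word in words if word)
    # Sort by descending count, then alphabetically, so results are stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


class CoreRegressionTests(unittest.TestCase):
    """Checks for behavior that is not expected to change between releases."""

    def test_terms_are_ranked_by_frequency(self):
        self.assertEqual(
            top_terms("plants and plants and people", n=2), ["and", "plants"]
        )

    def test_empty_input_returns_no_terms(self):
        self.assertEqual(top_terms(""), [])


def run_suite():
    """Run the suite and report whether every test passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CoreRegressionTests)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()


if __name__ == "__main__":
    print("core functionality intact:", run_suite())
```

A function like `run_suite` could be called from whatever deployment process a team uses, so that a failing core test is flagged as soon as new code goes out; the experimental parts of the product are deliberately left out until they settle down.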
Livingstone's Zambezi Expedition:
Tue 25 Jul 2017
I was honored to present a poster featuring the JSTOR Labs-JSTOR Plants project, Livingstone’s Zambezi Expedition, as part of the 15th Smithsonian Botanical Symposium on May 19, 2017. With the theme, “Exploring the Natural World: Plants, People and Places,” the symposium featured a range of fascinating presentations on the history of botanical exploration (you can read more about the symposium here). Given the Symposium’s focus, this was a perfect venue to share the work we did to bring together specimens and historical materials in a new way and get feedback from scholars and teachers actively working in this field. Download the poster here and, as always, please contact us if you would like to share any comments or questions about the project.
Reimagining the Digital Monograph
Mon 12 Jun 2017
In December, when we first announced our Reimagining the Monograph project, one of the outputs of that project was a white paper. We released that paper as a draft for comment in the hopes that the community’s feedback would help to strengthen it. I couldn’t be more impressed with how thoughtful, insightful and GENEROUS you were with your feedback. Thank you.
Now, I’m pleased to announce we’ve incorporated that feedback into a final version of the white paper. The updated version is now available for download on the Reimagining the Monograph site. This new version retains much of the structure of the previous one, and it still includes the ethnographic user profiles showing how six scholars do research with print and digital monographs. It also includes minor changes throughout addressing both specific and general questions we received and clarifying many points. We have gone from twelve to thirteen principles for the reimagined monograph. Most significantly, we have added as an appendix a new landscape review of related projects, which helps to situate our work on this project amongst a number of other important initiatives. I hope that you’ll find these changes constructive, but, as ever, I am eager to hear any and all feedback you might have.
Reimagining long-form digital scholarship isn’t going to happen in one project, prototype or paper. We hope that the updated white paper contributes to this important ongoing conversation, and are eager to work with the community to take further steps.
Oh, the Places this App Will Go
Graduating JSTOR Snap
Mon 22 May 2017
It’s graduation season. Robes are being donned, caps thrown in the air, inspirational and/or trite speeches being given. Here at JSTOR Labs, we’ve got a graduation of our own to celebrate.
A few years ago, we developed a little web app called JSTOR Snap, which let you search for articles in JSTOR by taking a picture of any page of text. It was an interesting concept, and we learned a great deal from it. For example:
- Even though researchers say they don’t want to do research on their phones, when presented with a new experience that’s really optimized for the phone, some will.
- Designing for a mobile experience requires sequencing users’ needs into a very simple workflow, teasing out individual steps that might not have been explicitly defined in a desktop experience.
- It’s technically feasible to perform searches using inputs other than keywords, and on-the-fly OCR – even of pages using a phone’s camera – works just fine in most cases.
- And much more…
In many ways, these findings led directly to the development of our most recent project, Text Analyzer. Snap and Text Analyzer present new ways of doing scholarly searches, and they do so not with keywords but using large text files. They both allow users to take a picture of a page of text and search with that. They both use OCR services to extract text from image files. Heck, we even recycled Snap’s icon in the Text Analyzer mobile design:
With Text Analyzer, we were also able to improve upon Snap: it accepts more kinds of input, better fitting within a researcher’s workflow; we improved the recommendation algorithm significantly; and users of Text Analyzer have sophisticated tools for refining their search.
Because of these improvements, we feel it’s the right time to redirect users of Snap to the Text Analyzer tool. Snap's designs will be archived here in this blog, and its functionality will continue to be refined as we keep working on Text Analyzer and other projects.
This evolution represents what we most hope for when we at JSTOR Labs release these projects: we want these ideas to go somewhere, to inform future generations of work, be it ours or others’. We want them to graduate.
And so, I hope you’ll join me today in celebrating Snap’s graduation. Our tiny little web app is all grown up now. It's throwing its cap in the air, and (sniff) we couldn't be prouder.
Is That Even a Thing
Fri 31 Mar 2017
How does a heavyweight research technique like ethnography fit into a lean user research process?
Hint: It doesn’t. We cheated. #worthit #sorrynotsorry
Tasked with the broad initiative of reimagining the monograph, we at JSTOR Labs knew we needed a lot of rich, detailed information about how scholars currently use monographs. After all, you can’t solve complex problems for someone until you have a deep understanding of those problems.
Normally, to prepare for a project, JSTOR Labs does a handful of hour-long video chats. That’s leaner and quicker than ethnographic research, but still provides some of the contextual information about our users that helps us understand the users and their problems. For this project, though, the scale of the question we were asking was a great deal larger than we’ve typically faced. “How are books used in research?” is considerably broader than something like “How do undergrads find articles and books about Shakespeare?” We didn’t think our usual approach would be rich enough.
Luckily for us, the Labs team works within the larger JSTOR and ITHAKA organizations. We were able to partner with the wonderful Christina Spencer, JSTOR’s User Research Manager, to craft and conduct the user research the project demanded.
A full-fledged ethnographic study may take years, perhaps with dozens of participants, and days, weeks, or months with each participant. This allows for time to build trust and observe the full spectrum of activities and interactions that those participants have. That methodology is extraordinarily valuable, but it surely isn’t lean. With Christina’s guidance, we found a happy medium. Our approach was lean enough to let us move forward quickly but in-depth enough to collect nuances from scholars’ environments and workflows that remain hidden in a Skype interview. Over the course of a couple weeks, Christina spent a day each with six history scholars, identified for us by AHA, who use monographs in their research. She documented everything they did in their regular work -- reading, taking notes, searching stacks at the library, writing, flipping through books, photocopying pages -- even taking breaks to play PokemonGo and watch TV. The goal of this type of observation is to learn how this work really happens and why. People are often unaware of the details of their processes, particularly workarounds they have developed over time; you have to watch them, and not just ask, to get that information.
We wanted to share what she learned both at a workshop we had planned and within the white paper we planned to draft. The trouble was, there was so much information! We spent days reading and re-reading notes to see where similarities, differences, themes, and surprises appeared. After significant reflection and discussion, the findings for the study were consolidated into profile infographics for dissemination. The profiles present a portrait of each scholar [names and identifying details changed for privacy], with particular emphasis on how books fit into their daily work. To see them in more detail, check out the appendix of Reimagining the Monograph (.pdf).
I won’t go through every piece of information in the profiles here, but I will use one as an example. It was a case of a “simple” question having a complex answer that was a challenge to represent visually. Before Christina observed scholars, we were curious for which tasks scholars preferred print vs. digital books. What she observed was that there was a spectrum of preference for each task performed with a book. We created a graphic with accompanying text to represent where each scholar fell on the spectrum for each book task and why.
This scholar, for example, preferred digital books for most tasks, but she found close reading easier with a print book. It’s more reliable to infer the type of book someone is likely to use for citation mining after a couple of hours of watching them mine citations and confirm with them what you saw, than to explicitly ask them in 20 seconds if they use digital or print books to mine citations.
We won’t conduct ethnographic studies for all, or even most, of JSTOR Labs’ new projects going forward. They take a lot of time and effort, and are often overkill for the types of questions we’re attempting to answer. But, I’m glad to know ethnography is in our toolkit for the next time we come across a vast, murky problem we need to make sense of. Heaven knows there are a lot of these kinds of problems out there that need solving. #ethnographyftw #evenifitsnotlean
Text Analyzer Video
Amy & Amir
Fri 17 Mar 2017
Hear about how Amy and Amir find research they need with this new type of search called Text Analyzer.
Under the Hood of Text Analyzer
Tue 07 Mar 2017
The Text Analyzer (https://www.jstor.org/analyze) is a new way to search JSTOR: upload your own text or document, and Text Analyzer processes it to find the most significant topics and named entities (persons, locations, organizations), then recommends similar or related content from JSTOR.
In this post I'll provide a peek under the hood of the analyzer, describing the analysis processes and the tools and technologies used. I'll also describe some features that have been incorporated into the processing and interface to remove much of the black-box feel that often accompanies a tool such as this.
How it does what it does
The Text Analyzer performs its processing in 3 main steps.
The first step involves the extraction of text from a submitted document, unless raw text is provided via direct entry or copy/paste, in which case it is simply passed to step 2.
- Text is extracted from documents using server-side processing. (Note that a submitted document is only retained on our servers long enough for the text to be extracted, and when completed the document is removed from our system.) Text can be extracted from most any document type including PDFs and MS Word documents.
- Text can also be extracted from images using OCR, which is especially useful when using the Text Analyzer on a mobile phone with a built-in camera.
After raw text has been obtained, up to 3 separate text analyses are performed in parallel.
- Topics explicitly mentioned in the text are identified using the JSTOR thesaurus (a controlled vocabulary containing more than 40,000 concepts) and a human-curated set of rules in the MAIstro product from Access Innovations. Using MAIstro, concepts/terms from the JSTOR thesaurus are identified in unstructured text.
- Latent topics are inferred using an LDA (Latent Dirichlet allocation) topic model trained with JSTOR content associated with the terms in our controlled vocabulary. Our application of LDA topic modeling takes advantage of the controlled vocabulary and rules-based document tagging described above. With these tagged documents we are able to use the Labeled LDA (as described here) variant of LDA to train a model using our thesaurus as a predefined set of labels for the model topics.
- Named entities (persons, locations, organizations) are identified using multiple entity recognition services and tools. These include Alchemy (part of IBM's Watson services), OpenCalais (from Thomson Reuters), the Stanford Named Entity Recognizer, and Apache OpenNLP. Entities recognized by the individual tools/services are aggregated and ranked using a voting scheme and a basic TF-IDF calculation to estimate the relative importance of each entity to the source document.
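To make the voting and TF-IDF idea in the last bullet more concrete, here is a minimal sketch in Python. It is not the Text Analyzer's actual code: the function name, the `min_votes` threshold, the smoothed IDF formula, and the way the vote share is multiplied into the score are all assumptions made for illustration.

```python
import math
from collections import Counter


def rank_entities(tool_outputs, source_text, doc_freq, num_docs, min_votes=2):
    """Aggregate entities from several recognizers and rank them.

    tool_outputs: dict mapping a (hypothetical) tool name to the set of
                  entity strings that tool returned.
    doc_freq:     dict mapping a lowercased entity to the number of
                  background documents that mention it.
    num_docs:     size of the background collection.
    """
    # Voting: count how many tools reported each entity (case-insensitive).
    votes = Counter()
    for entities in tool_outputs.values():
        votes.update({entity.lower() for entity in entities})

    text = source_text.lower()
    ranked = []
    for entity, n_votes in votes.items():
        if n_votes < min_votes:
            continue  # too few tools agree on this entity
        tf = text.count(entity)  # naive substring count as term frequency
        # Smoothed inverse document frequency (an assumed formula).
        idf = math.log((1 + num_docs) / (1 + doc_freq.get(entity, 0))) + 1
        # Assumed combination: share of tools in agreement times TF-IDF.
        score = (n_votes / len(tool_outputs)) * tf * idf
        ranked.append((entity, score))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    outputs = {
        "tool_a": {"Kennedy", "Berlin"},
        "tool_b": {"kennedy"},
        "tool_c": {"Kennedy", "Reagan"},
    }
    text = "Kennedy spoke in Berlin. Kennedy later returned to Washington."
    print(rank_entities(outputs, text, {"kennedy": 50}, num_docs=1000))
```

Requiring agreement between tools filters out one-off false positives, while the TF-IDF factor favors entities that are frequent in the submitted text but rare elsewhere.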
The 5 most significant topics (both those explicitly mentioned in the text and those inferred) and recognized named entities are used in a query to identify similar content from the JSTOR archive.
- A document is selected if at least one of the terms (topics or entities) in the query is found in the document. With this 'OR'ing approach, adding more terms increases the number of documents selected.
- Selected documents are ranked using a scoring calculation that considers both the weight of the term from the input text and the importance of the term in the selected document.
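The selection and ranking steps above can be sketched roughly as follows. This is an illustration only: the post says the score considers both the term's weight and its importance in the document, and multiplying the two and summing across terms is an assumption here, as are the function and variable names.

```python
def search(query_weights, documents):
    """Select documents with OR semantics and rank them.

    query_weights: {term: weight of the term from the input text}
    documents:     {doc_id: {term: importance of the term in that document}}
    Returns a list of (doc_id, total_score, per_term_share), best first.
    """
    results = []
    for doc_id, doc_terms in documents.items():
        # Contribution of each query term that also appears in the document.
        contributions = {
            term: weight * doc_terms[term]
            for term, weight in query_weights.items()
            if term in doc_terms
        }
        if not contributions:
            continue  # OR semantics: at least one term must match
        total = sum(contributions.values())
        # Share of the total score owed to each matched term.
        shares = {
            term: (value / total if total > 0 else 0.0)
            for term, value in contributions.items()
        }
        results.append((doc_id, total, shares))
    return sorted(results, key=lambda result: result[1], reverse=True)


if __name__ == "__main__":
    query = {"illegal immigration": 5, "immigration policy": 3}
    docs = {
        "doc-1": {"illegal immigration": 10, "immigration policy": 2},
        "doc-2": {"immigration policy": 4},
        "doc-3": {"botany": 9},
    }
    for doc_id, total, shares in search(query, docs):
        print(doc_id, total, shares)
```

Keeping the per-term shares alongside the total is what makes it possible to explain a result's position in the list, one term at a time.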
The weight of the query terms used (from the PRIORITIZED TERMS section of the UI) can be adjusted using the slider widgets. The preselected terms can also be removed and new terms can be added as needed.
- The Text Analyzer uses the top 5 terms identified in the source document based on weights calculated in the 3 analyses described above. These should be viewed as a starting point and adjusted as needed to match a specific area of interest. The Text Analyzer is designed to be used in an iterative process wherein the initial seed terms are augmented, removed or have their weights adjusted.
- A new query is automatically performed after any change in the query terms. After each new query the IDENTIFIED TERMS are updated to include related (co-occurring) topics found in the top results. In that way, the palette of terms available for use in a query is continually updated to reflect current user preferences. A user is also able to enter ad-hoc terms directly using the provided input box.
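One simple way to picture the "related (co-occurring) topics" update described above is to count which topics show up most often among the top results, leaving out terms already in the query. The sketch below assumes exactly that; the function name, the `limit` parameter, and the plain frequency count are illustrative guesses rather than the analyzer's real logic.

```python
from collections import Counter


def cooccurring_terms(top_results, query_terms, limit=5):
    """Suggest topics that co-occur in the top-ranked documents.

    top_results: list of topic lists, one list per top-ranked document.
    query_terms: set of terms already used in the query.
    """
    counts = Counter(
        topic
        for topics in top_results
        for topic in set(topics)  # count each topic once per document
        if topic not in query_terms
    )
    return [topic for topic, _ in counts.most_common(limit)]


if __name__ == "__main__":
    results = [
        ["illegal immigration", "border control", "immigration policy"],
        ["illegal immigration", "border control"],
        ["border control", "labor markets"],
        ["immigration legislation", "labor markets", "border control"],
    ]
    print(cooccurring_terms(results, {"illegal immigration"}, limit=2))
```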
How it shows its work
First, the most significant topics and entities found during the text analysis phase are displayed in the IDENTIFIED TERMS section of the UI. The weight and origin of each term can be seen by hovering your cursor over the term. The tooltip shown will identify whether the term was obtained from the source text or whether it was derived from co-occurring topics found in similar documents. If the term was from the source document the tooltip will also identify whether it was a topic explicitly found in the document or was a latent topic inferred using the topic model. In all cases the value of the calculated weight is provided.
Here's an example of a tooltip for a term that was identified in the text of the source document. In this example, the term "National security" was found in the source document.
This is an example of a tooltip for a latent topic that was inferred from the source text using an LDA topic model. In this example, the text "Exportation" was not found in the source document, but the topic was inferred from the frequency of other words/phrases that the LDA model commonly associates with Exportation.
And, here's an example of a term ("Immigration legislation") that was neither found in nor inferred from the source document but may be related in some way based on the frequency of co-occurrence of the term in similar documents.
In the search to identify similar documents the analyzer uses the terms displayed in the PRIORITIZED TERMS section of the UI. The seed terms that were automatically selected from the IDENTIFIED TERMS can be easily customized (removed, added, weight changed) in the interface to match a user's preference. The analyzer attempts to pick the most significant terms, but with a vocabulary of more than 40,000 possibilities and the difficulties inherent in natural language processing it will inevitably have some misses and/or questionable selections.
The best recommendations are obtained when the terms are tweaked to better align with the content desired. This is done with the controls in the PRIORITIZED TERMS section of the UI. Clicking the 'X' icon next to the term will remove it. Clicking on a term from the IDENTIFIED TERMS section will add it to the list. Ad-hoc terms may also be entered using the "Add your own term" input box. The weight for any term in the list can be adjusted using the slider widget.
The results returned by the search include detailed information enabling a user to see why each document was selected and how its relevancy score was calculated resulting in its relative position in the search results list.
Any terms that were matched in the selection process are listed in the PRIORITIZED TERMS section for each search results item.
Hovering over the term will reveal a tooltip that shows the specific fields that were matched and the relative contribution of each matched field to the term's overall score. If multiple occurrences of a term were matched in a single field, the number of occurrences is shown in parentheses. The tooltip also provides two values: the relative importance of the term to the document, and the calculated relevance for the term. The calculated term relevance is based on its importance to the document and the term weight specified by the PRIORITIZED TERMS slider.
In the example below the term "Illegal immigration" was recognized as a topic in this document. One instance of the term was found in the abstract and four in the document body. The overall contribution of the term to the document's relevancy score was 307. This term also had a document weight of 10 (on a scale of 1-10), indicating that it was extremely important to this document.
The factors contributing to the document's overall relevancy score (and thus its position in the results list) can be seen by hovering over the document title in the search results. This will show the overall score and the relative contribution of each matched term. If the I prefer recent content checkbox is selected, a Publication date boost will also be used in the relevancy calculation and shown in the tooltip.
In the example below the document matched 3 terms, "Illegal immigration", "Immigration policy", and "Economic migration". Of these, "Illegal immigration" was the most significant, equal to 45% of the document's overall relevancy score. Changing the value of the slider associated with this term in the PRIORITIZED TERMS section would increase or decrease the calculated value for this term, possibly resulting in a different ranking in the results list.
In this next example we see the effect of checking the I prefer recent content checkbox. A moderate boost is given to documents with more recent publication dates. In this instance, the document shown received a boost that represents 23% of the document's total relevancy score.
That does it for now. Watch this space for more information on the Text Analyzer. We're in the early stages of exploring how this tool might be used and will be continually improving it based on user feedback. We are also working on improvements to the analyses that are performed and will share more about that as things progress.
On Beyond Keyword Search
Introducing Text Analyzer
Mon 06 Mar 2017
Keyword search is pretty darned magical. With just a few typed words and maybe a judiciously-applied Boolean or two, you can sift quickly through mind-bogglingly-enormous libraries of content. It's simple, it's powerful and it's ubiquitous. It's been around, seemingly, forever, and it's going to be around for a while to come.
But it's not perfect. Thinking only of keyword search within an academic context: junior researchers sometimes flail and thrash as they figure out the right keywords for their search – they know what they want, but what set of jargon-y terms will help them find it? At the other end of the spectrum, more experienced researchers can find themselves caught in discipline- or citation-based siloes, unaware of what they are unaware of (until the peer review feedback comes in…). I think JSTOR Labs might have something to help with these problems.
I'm thrilled to announce the beta release of Text Analyzer, a new way to search for articles and books on JSTOR. Upload a document – the paper you're writing, an article you’re reading, anything – and Text Analyzer inspects it, devises a set of terms it "thinks" the paper is about, and then recommends other scholarship from JSTOR based on those terms. To home in on the content you need, you can add and remove terms, increase the relative importance of some terms over others, or flag that you're more interested in current content, or only want to see content you have access to within JSTOR.
It's pretty flexible. It'll accept most kinds of documents: PDFs, Word, html, etc. You can cut-and-paste text into it. If you paste or drag a URL into what we've been calling "the magic box," the tool will go to the web page and analyze the text of that page (this works for Google Docs, too). If you access the tool on your phone, it will encourage you to use your phone's camera to take a picture of a page of text, which it will read and then search based on that text. Heck, if you upload a picture without any text in it, it will try to recognize what's in that picture and search on that (with, to be honest, varying degrees of success – when I uploaded my headshot, the tool searched for the terms "bald hill person").
Text Analyzer is still very much a beta! We think it will be useful to students and scholars, but we need your feedback to make it even better. This is why this is the first JSTOR Labs project to be developed directly within the user experience of the primary JSTOR site – so that we can refine it where it can have the biggest possible impact. I hope you’ll try it out and let us know what you think.
Imagining the Big Picture
...and Filling in the Details One Piece at a Time
Thu 23 Feb 2017
JSTOR Labs, as you may have heard, likes to work quickly. We brainstorm hundreds of ideas in a few-hour-long design jam, we test those ideas quickly with guerilla testing, and then we develop them in week-long flash builds. We do all of this because our experience is that moving fast keeps the team focused on the true value. In doing so, it helps us escape the terrible gravity of mediocrity (sometimes!).
But these techniques are not magic. Design thinking and lean startup approaches are not magic. In order to go fast, we scope our ideas so that they can be built quickly. We limit ourselves to simple user workflows. We build prototypes that don’t need to scale or integrate. We do this because we learn so much more by actually getting a “finished” idea in front of users.
The problem comes when the idea you want to explore doesn’t fit neatly into a small box. We faced just this challenge on our most recent project, Reimagining the Monograph. The idea behind the project was the belief that something was getting lost in the transition of long-form scholarly arguments into “journal-ized” collections of chapters available online at sites like Project Muse and JSTOR. There have been many many many many wonderful efforts to rethink the monograph from either the author’s perspective -- changing the form and shape of the monograph -- or the ecosystem – changing the business model that supports the activity. We wanted to tackle the problem from the researchers’ perspective.
But that’s still a very big canvas on which to paint! Monographs are an integral part of many disciplines’ discourse, and they are used in many and diverse ways by very many and diverse kinds of users. How could we still use gravity-defying design thinking techniques while avoiding premature tunnel vision?
The approach we ended up with was akin to an artist sketching out in pencil their full vision across a big canvas, and then choosing one area in which to paint in the detail. We conducted our initial user research – observational ethnographies of six historians making use of print and digital monographs – with the broadest possible definition of scope. Similarly, when we conducted an ideation workshop with expert scholars, publishers, librarians and technologists, we inspired them to think as openly as possible about ways to improve the digital experience of monographs. The output of this big-picture thinking – or, to return to our metaphor, the sketch of our vision for a reimagined monograph – is a white paper. In the white paper, we describe a set of principles for the reimagined monograph that emerged from the workshop and research.
We then took just one of the ideas that emerged from the workshop – a way for researchers to better understand the topics covered within a monograph – and developed it into a fully designed, fully realized prototype, Topicgraph. Returning to our metaphor, this is the painted in portion, illustrating, we hope, the quality and kind that we want to see realized across the full canvas.
Of the dozen principles outlined in the white paper, Topicgraph only touches on two or three. The rest of this big canvas needs to be filled in. JSTOR Labs is eager to do its part – in fact, we’d welcome your suggestions for where to turn our monograph-reimagining attention to next. But the full canvas can’t be painted by just us – others in the community will need to contribute to it. We look forward to working with and cheering on anyone taking on this important task.
Understanding the US Constitution
Tue 13 Sep 2016
Understanding the US Constitution is a free app from JSTOR Labs that enables you to find every article on JSTOR that quotes a given part of the Constitution.
This short video shows you how it works.
An Iterative UX-Design Process
Wed 07 Sep 2016
When you perform a typical search on JSTOR, you often get a lot of results. A search for the “commerce clause,” for example, will get you 60,000 results, which is too many to go through individually. Still, with some search term kung-fu, you can ensure that the top results returned are the ones you really care about. When we were building the Understanding the US Constitution app, however, we realized that this paradigm doesn’t quite work. The long list of articles would be every article that quotes or mentions the Commerce Clause of the Constitution; they can’t be sorted by similarity to search terms because there are none. Our mission: We had to find another way to sort or filter results, and it needed to be easily done on a smartphone.
For more than a year, we’d been mulling over a search results idea we’d been calling “the Equalizer” and looking for an opportunity to use it. We wondered, Was this our chance? The concept was a sort or filter based on the ability to say “I want more of this and less of that.” Let’s say, for example, that you’re researching the 5th Amendment. You could get every article that quotes it, then use the Equalizer to specify that you want more articles about interrogations and cross examination and fewer articles about acquittals and attorney-client privilege.
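To give a feel for the "more of this and less of that" idea, here is a small Python sketch of one way such a re-ranking could work. It is not the app's implementation: the slider range of -1.0 to 1.0, the per-article topic strengths, and the weighted-sum score are all assumptions for illustration.

```python
def equalize(articles, sliders):
    """Re-rank articles according to topic sliders.

    articles: list of (title, {topic: strength between 0 and 1})
    sliders:  {topic: value between -1.0 (less) and 1.0 (more)}
    Returns the titles, most wanted first.
    """
    def score(item):
        _, topics = item
        # Topics without a slider are treated as neutral (0.0).
        return sum(
            sliders.get(topic, 0.0) * strength
            for topic, strength in topics.items()
        )

    ordered = sorted(articles, key=score, reverse=True)
    return [title for title, _ in ordered]


if __name__ == "__main__":
    articles = [
        ("Article A", {"interrogations": 0.9}),
        ("Article B", {"acquittals": 0.8}),
        ("Article C", {"interrogations": 0.4, "acquittals": 0.1}),
    ]
    sliders = {"interrogations": 1.0, "acquittals": -1.0}
    print(equalize(articles, sliders))
```

Because every article stays in the list and only its position changes, this is technically a sort, even though on a long list it feels like a filter.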
The Equalizer we had in mind was controlled by sliders. Sliders are an easily understandable and directly manipulatable UI control for touchscreens. When we showed it to potential users of the app, they seemed excited about the concept. The first version of the app had an equalizer that looked like the image on the left.
We wanted to use the design on the right instead, since the shape of the slider bars adds another cue to their purpose, but it wasn’t technically feasible within the Ionic framework we used for development.
We tested it with university students at a coffee shop in Ann Arbor, MI and with historians at the 2016 American Historical Association conference. Both groups liked the app and how easy it was to get articles related to a particular part of the constitution. They were excited about the Equalizer… once we explained it to them. Uh-oh! Unfortunately, it was nigh impossible for someone to figure out what the Equalizer was for and what it did from looking at it. One person wouldn’t even dare to venture a guess. Back to the drawing board!
The first step was to eliminate the tabbed distinction between sorting and filtering. Strictly speaking, the Equalizer sorts results, but with thousands of results, a sort might as well be a filter. No one is checking out result #4,617.
As a team, we brainstormed six alternative ways to present the Equalizer, focusing on making it clear and intuitive without tabs and with only minimal distinctions between sorting and filtering. Because we had so many ideas, we got some quick feedback on them via a Question Test on Usability Hub to weed out the ones that were least clear. We showed the images below to 84 people – 14 people for each of the 6 designs – with the question “What would happen if you moved the slider[/item/toggle] up for ‘Social Philosophy’?”
The test gave us superficial answers that didn’t dive deep into users’ understanding of the Equalizer, but that’s all we needed; it was enough to rule out three of the designs as being unclear. The “winners” have red borders, above.
Now with only three options to choose between (or mashup), we headed back to the coffee shop to talk to more students. We talked to 7 people, and at first we weren’t recognizing any trends. Then, we had a revelation. We’d been showing the three options in a random order, for good experimental protocol. They always understood the third one best! It was a matter of cumulative understanding: vaguely understanding the first design, making more sense of the second, then everything “clicking” by the time they saw the third design. To avoid that bias, we tried something unorthodox and started showing them all three at the same time. Then, some patterns started to emerge.
We considered many labels, and evolved the design further with more rounds of user testing. Finally, we came up with a successful “Prioritize results by topic” message (note that “Prioritize” sidesteps the whole sort vs. filter issue), and we added smaller notes pointing out that the sliders increase and decrease relative importance.
Our path to refining the Equalizer was an instance of a powerful interaction concept that proved difficult to present intuitively. It took us many iterations to get to something that worked well. Since the last post about Understanding the US Constitution, an Android version of the app has been released. Analytics tell us that about one out of every six uses of the app includes moving sliders on the Equalizer. Considering that this is a completely new way of manipulating a search results list (at least, we couldn’t find anyone else who had done this!), that’s not half bad!
Rembrandt Project Image Matching
Fri 12 Aug 2016
In the Exploring Rembrandt project conducted earlier this year by the JSTOR Labs and Artstor Labs teams we looked at the feasibility of using image analysis to match images in the respective repositories. The JSTOR corpus contains more than 8 million embedded images and the Artstor Digital Library has more than 2 million high quality images. There is unquestionably a significant number of works in common between these image sets, especially in articles from the JSTOR Art and Art History disciplines. Matching these images manually would be impractical so we needed to determine whether this could be done using automation.
A key element that we wanted to incorporate into the Exploring Rembrandt prototype was the linking of images in JSTOR articles to a high-resolution counterpart in Artstor where these existed. This would allow a user to click on an image or link in a JSTOR search result and invoke the Artstor image viewer on the corresponding image in Artstor. For instance, when selecting the Night Watch in the Exploring Rembrandt prototype a list of JSTOR articles associated with the Night Watch is displayed and if the article includes an embedded image of the painting it is linked to the version in Artstor as can be seen here.
To accomplish this we needed to generate bi-directional links for the 5 Rembrandt works selected.
For the image matching we first identified a set of candidate images (query set) in JSTOR by searching for images in articles in which the text ‘Rembrandt’ occurred in either the article title or in an image caption. This yielded approximately 9,000 images. Similarly in Artstor, 420 candidate images associated with Rembrandt were identified and an image set was created for them. JSTOR images were compared against Artstor images. For this we used OpenCV, an open source image-processing library supporting a wide range of uses and programming languages and operating systems.
For the matching of images we designed a pipeline process as illustrated below.
Artstor images were the training set and JSTOR images were the query set.
The first phase ensures that the training and query sets are similar in colorspace and longside. The images are resized to 300px longside and converted to grayscale, and keypoints and descriptors are extracted. Images that do not have more than 20 keypoints are discarded, since they end up creating excess false positives. The 300px size and 20-keypoint threshold were determined through experimentation and reflect a sweet spot for accuracy and computational efficiency in our process.
The second phase performs the matching. The process uses the AKAZE algorithm, which is efficient for matching images that contain rich features. The keypoints generated for the two images are compared, and the nearest distance between keypoints is measured as an indication of similarity.
The final phase collects the data and exports the result as a csv file.
The tuning parameters for the Rembrandt set:
- Max long side: 300px
- Keypoints > 20
- Matches > 30
- Inliers > outliers.
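To make the parameters above concrete, here is a minimal sketch of how they could fit together in Python with OpenCV's `cv2` bindings. It is not the code used for the prototype. The function names, the ratio test, and the use of a RANSAC homography to count inliers are assumptions, since the post names only AKAZE and the four thresholds.

```python
MAX_LONG_SIDE = 300   # px, from the tuning parameters above
MIN_KEYPOINTS = 20    # an image needs more than this many keypoints
MIN_MATCHES = 30      # a pair needs more than this many matches


def scaled_size(width, height, max_long_side=MAX_LONG_SIDE):
    """Return (width, height) scaled so the longer side equals max_long_side."""
    scale = max_long_side / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))


def is_probable_match(kp_query, kp_train, matches, inliers):
    """Apply the four thresholds; outliers are the matches that are not inliers."""
    outliers = matches - inliers
    return (kp_query > MIN_KEYPOINTS and kp_train > MIN_KEYPOINTS
            and matches > MIN_MATCHES and inliers > outliers)


def describe(path):
    """Load as grayscale, resize, and extract AKAZE keypoints and descriptors."""
    import cv2  # imported lazily so the helpers above work without OpenCV
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError(path)
    height, width = img.shape[:2]
    img = cv2.resize(img, scaled_size(width, height),
                     interpolation=cv2.INTER_AREA)
    return cv2.AKAZE_create().detectAndCompute(img, None)


def compare(query_path, train_path, ratio=0.75):
    """Return True when two image files look like the same work."""
    import cv2
    import numpy as np
    kp_q, des_q = describe(query_path)
    kp_t, des_t = describe(train_path)
    if des_q is None or des_t is None:
        return False
    # AKAZE descriptors are binary by default, so Hamming distance applies.
    pairs = cv2.BFMatcher(cv2.NORM_HAMMING).knnMatch(des_q, des_t, k=2)
    # Ratio test (an assumption): keep only clearly-best nearest neighbours.
    good = [p[0] for p in pairs
            if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    if len(good) < 4:  # findHomography needs at least four point pairs
        return False
    src = np.float32([kp_q[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_t[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    _, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    inliers = int(mask.sum()) if mask is not None else 0
    return is_probable_match(len(kp_q), len(kp_t), len(good), inliers)
```

Keeping the threshold check in its own function means the cut-offs can be re-tuned for a different corpus without touching the OpenCV calls.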
This image matching process identified 431 images from the 9,000 JSTOR candidates with probable matches to one or more of the 420 Artstor images. These 431 images were then incorporated into the Exploring Rembrandt prototype.
While the number of JSTOR and Artstor images used in this proof of concept was relatively small, 9,000 and 420, respectively, the technical feasibility of automatically matching images was validated.
The main lesson learned was how to find tuning parameters that work for matching across two vastly different types of corpora. Further work should start from the same tuning parameters, visualize the results, and build up from there.
Future work will include further experimentation with and tuning of the myriad parameters used by OpenCV, and scaling the processing so that it can be performed on much larger image sets, eventually encompassing the complete 10 million plus images in the two repositories.
Fri 05 Aug 2016
In the midst of an election year, I’m reminded of the power of the political speech. It can inspire, spur people to action and even alter the course of history. Many of us are familiar from history class with John F. Kennedy’s appeal to Americans to ask what they can do for their country or Ronald Reagan’s call for the Soviet Union to “Tear down this wall!” But what is the context behind these often mythic speeches? Why were they given in the first place? Did they achieve their intended effect and how do they continue to impact us today?
JSTOR provides a rich corpus of scholarship to help answer these questions. For example, in The Volunteering Decision: What Prompts It? What Sustains It? by Paul C. Light, I discovered that President Kennedy’s soaring call for volunteerism prompted subsequent presidents to promote service and even create new government programs to support it. With this in mind, Labs sought to create a prototype tool that could help students and the general public discover the context behind some of the most important presidential speeches in U.S. history.
An early prototype matching quotes from some presidential speeches can be seen at http://labs.jstor.org/presidential-speeches
Using the JSTOR Labs Matchmaker algorithm, we matched famous lines in important presidential speeches to content in the JSTOR corpus. With the data in hand, we sought to try something different interface wise – we are labs after all! Our idea was to use the algorithmically created data and present it in a way that told more of a story than would be possible with simply a list of results. This presentation included incorporating high quality images of each president, the date and location of the speech and visual emphasis on the quote itself. Additionally, to support the narrative of a story, we limited the set of results to fewer than ten and included a featured article. The philosophy behind this choice was to emphasize rapid discovery and contextual learning of the scholarship rather than pure research. The choice to include a featured article is especially important as it gives the user one focal point to approach the scholarship that can transform into greater interest.
Narrowing the matched articles took some experimentation. In one method, we ranked articles by how closely their top-weighted key terms matched the aggregated top-weighted key terms of a speech’s full body of related articles; in other words, by how similar an article was to the most prevalent themes and topics (e.g. political protest) of the greater body of articles. Another method ranked the articles for a speech by a similarity score representing how close the matched text is to the quoted line. We also experimented with logistic regression, using a training dataset hand-labeled for relevancy. The end result of these efforts is a curated set of articles that provides solid context for each presidential speech.
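As a rough illustration of the first ranking method, the sketch below scores each article by the cosine similarity between its key-term weights and the aggregated key-term weights of the whole set. The choice of cosine similarity, the function names, and the sample weights are all assumptions made for illustration; the post does not say how the comparison was actually computed.

```python
from collections import Counter
from math import sqrt


def aggregate_terms(articles, top_n=10):
    """Sum term weights across all articles and keep the top_n terms."""
    totals = Counter()
    for terms in articles.values():
        totals.update(terms)
    return dict(totals.most_common(top_n))


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_articles(articles, top_n=10):
    """Order article ids from most to least similar to the shared themes."""
    theme = aggregate_terms(articles, top_n)
    scores = {aid: cosine(terms, theme) for aid, terms in articles.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical key-term weights for three articles matched to one speech.
sample = {
    "a1": {"protest": 0.9, "civil rights": 0.8},
    "a2": {"protest": 0.7, "economy": 0.2},
    "a3": {"baseball": 0.9, "economy": 0.1},
}
ranking = rank_articles(sample)
```

With these made-up weights the off-topic article "a3" lands last, which is the behavior the method is after.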
An important next step for this project is user-testing – both for the interface as well as the hypothesis behind the story-driven design. Users provide valuable feedback and can help us understand better how to design stories and rich interfaces using the JSTOR corpus of millions of scholarly articles.
Introducing "Understanding the U.S. Constitution"
Thu 07 Jul 2016
I am thrilled to announce the release of Understanding the U.S. Constitution, a free research app for your phone or tablet that lets you use the text of the Constitution as a pathway into the scholarship about it. We can't wait to see how high school and college-level teachers and students use it to enhance their study of our democracy.
To get an idea why we created the app, let’s create a new measurement. We’ll call it the quotable quotient, or QQ, and we’ll use it to gauge the extent to which a given author or document is quotation-worthy.* Shakespeare, for instance, I think it’s safe to say, would have a pretty high QQ. In fact, you could look at JSTOR Labs’ Understanding Shakespeare project, and its accompanying API, as an attempt to calculate Shakespeare’s QQ. Many of his lines have been quoted over a hundred times in the articles of JSTOR, and the most quoted – no surprises here: Hamlet’s “to be or not to be” speech – was quoted over 750 times.** Pretty quotable! Way to go, Will.
Well, as far as QQ goes, the U.S. Constitution leaves Shakespeare in its dust. To pick only the most extreme example, the Fourteenth Amendment, including the Equal Protection Clause, has been quoted almost 2,500 times in the articles in JSTOR. Article I’s Elastic Clause has nearly 1,000.
So when we decided to expand on the idea of using the primary text as a portal into the scholarship about it, it only made sense for us to turn to the U.S. Constitution. Our goal for this was to create something as useful as Understanding Shakespeare, but designed for a mobile experience. That required a rethinking of the basic user experience and design. It also required powerful filtering and sorting functionality to help zero in on, within the 1,000+ articles quoting your clause, the precise set of analysis that you need.
The app is currently available for download from the iOS App Store. And don’t worry, Android users: an Android version of the app is in the works and should be complete later this summer – we’ll let you know when it’s available!
The app, like all JSTOR Labs projects, is a work in progress, released in part to help us test a new idea and to learn how we can better realize it. I hope you try it out and, if you do, we’re eager to hear your thoughts on it by email or Twitter. Who knows? A pithy summary and a retweet or two might help improve your own personal QQ.
* Because we’re JSTOR we’ll focus our specific QQ on how quotable documents are in a scholarly context. One could imagine the same measurement applying to pop culture more broadly, in which case someone would have to publish some giant QQ bracket, the inevitable result of which would be The Godfather going to the mattresses against The Simpsons.
** Roughly. The quotation-matching algorithm that powers both Understanding Shakespeare and Understanding the U.S. Constitution uses fuzzy-text matching, and the precise number depends on how we calibrate the algorithm, including setting a minimum string size and percent confidence. The QQ numbers listed in this post are all based on the default settings we use.
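For readers curious how the two calibration settings in the footnote above might interact, here is a minimal sketch using Python's standard-library `difflib`. It is not the Labs quotation-matching algorithm. The function name, the sliding-window approach, and the default values for `min_chars` and `min_confidence` are assumptions chosen only to show why the count moves when the settings do.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so layout differences are ignored."""
    return " ".join(text.lower().split())


def count_quotations(quote, passages, min_chars=20, min_confidence=0.9):
    """Count passages that contain a near-verbatim copy of quote."""
    quote = normalize(quote)
    if len(quote) < min_chars:
        return 0  # too short to match reliably
    hits = 0
    for passage in passages:
        text = normalize(passage)
        best = 0.0
        # Slide a quote-sized window over the passage and keep the best score.
        for start in range(max(1, len(text) - len(quote) + 1)):
            window = text[start:start + len(quote)]
            best = max(best, SequenceMatcher(None, quote, window).ratio())
        if best >= min_confidence:
            hits += 1
    return hits


passages = [
    "Hamlet asks: To be, or not to be, that is the question.",
    "As Hamlet puts it, 'to be or not to be that is the question'",
    "The weather in Denmark was cold that year.",
]
found = count_quotations("To be, or not to be, that is the question", passages)
```

Raising `min_confidence` toward 1.0 would drop the second passage, which omits the commas, and that is the sense in which the totals quoted in this post are approximate.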
You Got Your Chocolate in My Peanut Butter
Tue 07 Jun 2016
When Artstor and ITHAKA announced earlier this year that they were joining forces, James Shulman sent this tweet:
If you’re a child of the seventies and eighties like me, this calls to mind a series of ads for peanut butter cups. I could describe the basic plot, but it’s much more fun to watch one. Here, enjoy: https://www.youtube.com/embed/O7oD_oX-Gio.
Pretty glorious, right?
The thing is, think a bit about the gap between the a-ha moment shown in the ad – you got your chocolate in my peanut butter! – and the actual product being advertised. That gap – why a “cup?” why that particular shape and size? etc. – is the gap faced by ITHAKA and Artstor right now. We believe strongly that combining these two organizations will lead to great things, and we have any number of ideas to consider, but how do we figure out which ones will be best for us and the community? Well, you’re on the JSTOR Labs blog, so you’ve probably guessed the answer already: by talking to users. By experimenting. By partnering. By doing.
I’m pleased to introduce Exploring Rembrandt, a proof of concept brought to you by JSTOR Labs and Artstor Labs. With it, students can start with five canonical Rembrandt paintings and discover the scholarship in JSTOR about each painting in a way that’s easier and more powerful than just typing “Rembrandt AND ‘Night Watch’” into a JSTOR or Google search.
When we interviewed art history teachers at the start of this project, they described the challenges in going from the works of art to scholarship about that art: it can be overwhelming, especially for undergrads. It can be difficult to move across disciplines, or to excite non-majors, despite the multidisciplinary aspect of art history. When we designed and developed the site together during a one-week flash-build with the Artstor Labs team, we tried to address these challenges.
Exploring Rembrandt, I’m sure, isn’t perfect – I’d wager that peanut butter cups weren’t designed perfectly the first time around either. It’s the first in a series of experiments exploring how to combine the forces and flavors of Artstor and ITHAKA. We’re eager to hear from you – #artstorjstormashup – whether we were successful, and what experiments you’d like to see next!
My Year as a JSTOR Labs Intern
Thu 02 Jun 2016
My name is Jake Silva and I’m a master’s student from the University of Michigan’s School of Information specializing in Human Computer Interaction and Data Science. For the past nine months, I’ve worked on the JSTOR Labs team in Ann Arbor, Michigan where I’ve observed and participated in the team’s use of cutting edge technologies and methodologies in UX Design and Research, Data Science and Product Development. These include guerilla user-testing, week-long design sprints, fuzzy text-matching, geospatial tagging and optical-character recognition technology. While the use of these technologies and methodologies is exciting and innovative, they serve as tools to support Labs in fulfilling its core mission: exploring ways to make scholarship richer and more accessible for researchers, teachers and students as well as the general public.
This mission became very evident during our development of a mobile app focused on constitutional scholarship. To test our design ideas, we set up inside a coffee shop across from the University of Michigan and recruited students to participate in 10-minute usability tests. We used paper prototypes to observe their interactions with our design ideas and gauged what needed to change. During these tests, one political science student became so excited by the idea of the app that he offered to recruit other students for user testing and asked us to contact him once we released it. It’s this excitement and usefulness for users that Labs ultimately strives to achieve with its projects. This example also illustrates one of the principal methodologies Labs uses to test, validate, and build its ideas: the design sprint.
The design sprint, which we also refer to as a “flash build” or Labs week, is inspired by the “Lean Startup” movement, a model for iterating on ideas rapidly based on user input and data. For the sprints, the team meets in a single location and uses the lean methodology to produce a working prototype in a week. I participated first hand in this process during our Labs week to develop a constitutional scholarship app. On Monday, we brought in folks outside the Labs team and focused on brainstorming ideas for the app. This involved multiple individual sketching rounds, brief design presentations to the room and consensus building by vote to come up with a few design solutions. We then turned these into paper prototypes to test with users the next day. Once a clear design solution emerged from the tests, we began to refine each interaction in the app as well as create the necessary technological infrastructure to build it. At each iteration of our design, we went out into the field to test and validate with users. As a participant in these tests, I learned the incredible value they provide in informing the design of a product. As a team, we each bring our own biases, assumptions, and ideas to a design solution and risk overlooking design flaws as we spend more time with the product. User testing helps mitigate these problems and is an unbiased way to solve conflicting design ideas within a team. As an aspiring user-centered design professional, I was happy to see and learn from our team’s emphasis on placing the user first during our sprints.
In addition to honing my UX, data, design, and development skills, I also learned what it takes for a team to be successful. The Labs team is a small but diverse group, incorporating individuals with engineering, UX design, visual design, and product and project management skills. We met for daily standup meetings and reflected throughout the year on how we could improve as a team, where we succeeded and failed and how we fit into the greater JSTOR team. This is important as tools, methodologies, and priorities rapidly shift in a mission-driven technology company. I’ve learned so much from this fantastic team over the last nine months and greatly appreciate their support and interest in my development. I’m also happy to have worked on making scholarship richer and more accessible. Finally, I feel lucky to have had the opportunity to work in such an innovative space and look forward to returning to JSTOR after this summer for the 2016-17 academic year.
Seeking Teachers of Poetry to Test our New Annotation Tool
Tue 12 Jan 2016
For the past few months, JSTOR Labs has been working on Annotation Space, a tool to help teachers of poetry. Annotation Space lets teachers share a poem with their class to annotate and discuss, informed by scholarship from JSTOR about the poem. It was developed in partnership with two incredible organizations: Hypothesis and the Poetry Foundation. Now, we’re looking for a handful of teachers to test-drive this cool tool in their classrooms during this spring semester. For the test-drive, we’ll set up access to the tool for you and your students -- all we’d ask is that you construct an assignment for your class around this tool, and that it be possible for us to gather feedback afterward from you and your students. If you might be interested, let us know!
But wait, how do you know if you’d be interested if you haven’t seen it? Let me step you through the tool to give you an idea what you might expect:
First, you the teacher choose a poem (see the selection we have to work with, below). We’ll set up a private site for just you and your class, which will look like this:
You and your students can make both personal annotations as well as ones that are seen by the entire class. These annotations appear on separate “tabs.”
To make an annotation, just highlight text in the poem and click the icon.
When your students want to bolster their annotations with scholarship, they hop over to the "JSTOR" tab, where they can browse through articles that quote each line of the poem.
If you’re interested in testing this with your class, shoot us an email at email@example.com. Your input while we’re still polishing and refining this site will be invaluable. Thanks.
Here’s the list of poems you’ll be able to choose from:
The Soldier – Rupert Brooke
Concerning a Nobleman – Caroline Dudley
The Love Song of J. Alfred Prufrock – T. S. Eliot
Snow – Robert Frost
The Witch of Coos – Robert Frost
Night Piece – James Joyce
First Fig – Edna St. Vincent Millay
To Whistler, American – Ezra Pound
In a Station of the Metro – Ezra Pound
Eros Turannos – Edwin Arlington Robinson
Chicago – Carl Sandburg
Sunday Morning – Wallace Stevens
Anecdote of the Jar – Wallace Stevens
The Snow Man – Wallace Stevens
Spring – William Carlos Williams
Chinese New Year – Edith Wyatt
The Fisherman – William Butler Yeats
The Scholars – William Butler Yeats
A Prayer for My Daughter – William Butler Yeats
A Heckuva Year
Tue 22 Dec 2015
With 2015 packing its bags to go and 2016 knocking on the door, I’d like to take a brief moment to step back and ruminate on how far Labs has come in the past twelve months. My apologies in advance to those of you with a low tolerance for navel-gazing…
A year ago at this time, we had three live projects: Classroom Readings, the first version of Understanding Shakespeare, and the first version of JSTOR Sustainability. In the past year we substantially updated both Understanding Shakespeare and Sustainability. We redesigned Shakespeare’s home page and jumped from six plays to the full set of thirty-eight. We added Topic Pages and Influential Articles to Sustainability, helping “scholars in interdisciplinary fields understand and navigate literature outside of their core areas of expertise.” We released JSTOR Snap, which lets you take a picture of any page of text and discover content in JSTOR about the same topics. We built the ReflowIt proof of concept to test out a potential method for handling page-scan content on small mobile screens. With the JSTOR Global Plants team, we built Livingstone’s Zambezi Expedition, which lets you browse primary and secondary materials both chronologically and geographically. We created an open, public API on top of the Understanding Shakespeare data and used it to create oodles of visualizations of Shakespearean scholarship. And we completely overhauled and expanded this Labs site.
This litany of projects doesn’t even include those that we’re still working on, such as a U.S. Constitution mobile app and a tool that allows teachers to share a poem with their class for them to annotate and discuss, informed by scholarship on JSTOR about the poem. It’s been a productive year.
All of this would not have been possible without our partners, who throughout the year have been open, collaborative and creative. We started the year with just one: the Folger Shakespeare Library, and that partnership remains strong and fruitful. Since then, we’ve worked with the great Eigenfactor team at the University of Washington’s DataLab. We are working on an exciting annotation project with both Hypothesis and the Poetry Foundation. We’ve begun one exploration with the Anxiety of Democracy program of the Social Science Resource Center and another with University of Richmond’s Digital Scholarship Lab. I am grateful for the opportunity to work with such enthusiastic partners and excited about the partnerships to come.
Speaking of gratitude: every day, I count my blessings to be working with the talented, committed, fun and just-plain-awesome individuals within the Labs team. Ron, Jessica, Kate and Beth are each veritable rock stars in their respective fields, and that kinda makes JSTOR Labs a supergroup. (I’ll leave it to you to decide whether we’re The Traveling Wilburys or Cream. Maybe Atoms for Peace?) I’m lucky to be a member of this group, and I can’t wait to see what they create next.
Tue 20 Oct 2015
JSTOR Snap lets you take a picture of any page of text with your smartphone's camera and discover articles in JSTOR about the same topic. This short video shows you how it works.
So Many Ideas! So Many Questions!
Thu 15 Oct 2015
Ron, Jessica, Alex, and my other Labs teammates have many posts about how we nurse a loosely-defined project into becoming one of the prototypes you see here at JSTOR Labs. But we often get asked how an idea goes from, well, an idea to a project.
Taking an idea from seed to sapling, so to speak, starts with us getting a handle on it—asking simply: what do we know about this idea? Is it the result of a colleague’s brainstorm? Is it a strategic direction for JSTOR that needs incubation and testing? Is it a piece of functionality we think would be really useful?
Then we ask: can we do this? What would it take? Would we benefit from a partnership with an organization outside of JSTOR?
Turns out, we have many ideas. To keep track of them all, we created a visual backlog:
Each Post-It note represents an idea: a piece of functionality (purple), a product-line (orange), a product (raspberry—yes, raspberry), a potential collaboration with a partner (teal), and since all categorizations need one, there are ideas categorized as “other” (chartreuse).
We organize—or, maybe, orient is a better word—the potential work along two axes:
The ‘x’ axis defines how much we are working on that idea, bounded by the descriptions “Not working on it” on the left and “Working on it!” on the right.
The ‘y’ axis defines how much we understand about an idea from “We don’t know what it is!” at the bottom to “We understand it!” at the top.
Why do we use these two axes instead of a more typical backlog, which might be organized either by Priority (that is, the items we believe will have the biggest impact are ranked higher) or by Readiness (that is, we won’t start working on a thing until we understand what it is)? As a labs team, we don’t know the potential impacts of our ideas–we have guesses and interests, but our purpose is to learn about this impact by doing. That eliminates Readiness as a ranking approach: we need to sometimes work on ideas that we don’t understand yet on the assumption that it is only by working with them that we can understand them. So, as an idea first starts firming up—we have a concept taking shape, the timing for a partnership is just right—we pick up its representative stickie and move it to a place approximating how much we understand about it and how close we are to beginning work on it.
Next, we start forming the central hypothesis we’ll test. Though, to borrow a phrase from Jeopardy, please put your hypothesis in the form of a question:
What would it look like to use a primary text as an anchor for (or portal to) secondary text found in JSTOR?
What would it look like to use a picture as the starting point for a search?
What would an alternate way of engaging with a primary source collection look like?
How can we provide researchers with better topical finding aids and keys to influential articles within a given field?
Can we create a better reading and display experience for page scan content on mobile devices?
With an idea and a hypothesis in hand, we’ve got a project. If we’re partnering with an organization, we start by meeting with them and brainstorming around the idea. If we’re not, we usually brainstorm in-house and then we start by interviewing scholars and users to validate both our hypothesis and our approach… but that’s a whole other blog post. Interested in learning more about our process? Let us know!
The (Rapid) Evolution of an Idea
Wed 30 Sep 2015
JSTOR Sustainability, a digital library of academic research covering issues related to environmental stress and its challenges for human society, features an innovative method of browsing scholarly articles that have most influenced fields of study. That method was conceived and built in a week-long "flash build" in Seattle with JSTOR Labs and the DataLab team from the University of Washington. This video demonstrates how they rapidly went from an initial napkin-sketch through to the final product.
The Lifestory of JSTOR Labs' Website
Mon 31 Aug 2015
I am Jane Mengtian Zhang, an intern at JSTOR Labs in Ann Arbor for the summer of 2015. As a web developer and UX designer, I have been working on the new JSTOR Labs website and have witnessed the site gradually coming into form. In about three months’ time, the site was designed, developed, and continuously revised to provide a smooth, unique, and responsive user experience. Here, I would like to share the building process of the website, as well as my own internship experience at JSTOR Labs.
The new JSTOR Labs website aims to convey the fact that JSTOR Labs is an innovative and forward-thinking team. Based on the initial ideas and modular designs for the site, my first step was to implement the Pattern Library, which served as a style guide and reference tool throughout the development process. Working with Kate, our visual designer, we envisioned the initial outline and interactive elements of the site. With the help of Jessica, my mentor, as well as the JSTOR UX team, we conducted coffee shop guerrilla user tests (a quick informal way of user testing with on-the-spot recruiting) with paper prototypes, as well as an online 5-second test, which allowed us to see the opportunities and issues with the current design both in terms of usability and branding effects.
coding and design, going hand-in-hand
The feedback from this round of user testing led to the first build of the website. Using Django CMS, I worked on models, views and templates to construct, store, and present website data. The CMS provided a user-friendly front-end content management interface, while also allowing customized design and creation. Combining HTML5, CSS, and jQuery with third-party plugins such as Foundation, MixItUp, AddThis, and Google Analytics, I crafted the user interface and realized the pre-designed interactions step by step, gradually bringing our visions of the site to life.
design alternatives of module interactions
After deploying a working version of the site, we carried out another round of user testing on the University of Michigan campus. By letting users freely explore the site and then asking them to complete a series of tasks, we were able to see if the website was easy to learn and to use within a short amount of time. I also took part in user interviews with librarians and publishers, which granted us valuable feedback on the content and structure of the website.
The final step of developing the site involved lots of fine-tuning work on overall consistency, accessibility, and cross-platform compatibility. We nearly pulled our hair out positioning and balancing the site’s layout and debating over content presentation on every single page, striving to find the best design solutions through our iterations of experimenting and revising. With the new Labs site soon coming out, it is my hope that the values and visions of JSTOR Labs can be communicated to our users through an enjoyable browsing experience.
While spending most of my time doodling, coding, and debugging the Labs site, I was able to explore other JSTOR Labs projects as a member of the team, such as brainstorming designs for the Understanding Shakespeare API, guerrilla testing for the JSTOR Sustainability site with cupcake incentives, interviews with researchers and teachers, and so on. As a student specializing in library and information science, I especially enjoyed the interviews with librarians and publishers, which granted me a deeper understanding of the existing issues and concerns in online research and learning. The lively conversations and debates around these issues, which take place almost every day in JSTOR Labs, opened my mind to new ways of thinking in both UX and digital librarianship. This learning experience, both rich and diverse in nature, is sure to be valuable for years to come.
At the end of my internship, I would like to express my thanks to the JSTOR Labs team, with whom I have spent one of the greatest summers ever. I hope you enjoyed exploring this site, and stay tuned in with JSTOR Labs’ projects in the future, for their magic is real. :)
JSTOR Sustainability Topic Pages
Mon 31 Aug 2015
JSTOR Sustainability Topic Pages provide road signs to help researchers navigate an interdisciplinary terrain. This short video walks you through the features and content of JSTOR Sustainability.
My Summer at JSTOR Labs
Wed 12 Aug 2015
My name is Xinyu Sheng and I’m a master’s student from the School of Information at University of Michigan. This summer, I worked as an intern at JSTOR Labs in Ann Arbor, Michigan. It’s an innovative environment and I gained invaluable experience from a very nice team. My job involved both user research and web application development, and I’d like to share a bit about my experience here.
In the eleven short weeks I spent here, I worked on a number of different projects, applying a variety of user research methods to different products in various development phases, capturing user needs and translating findings into design trade-offs. I assisted with user testing on the Sustainability site that Alex wrote about a few weeks ago. I also helped with a redesign of the Labs Site you’re currently reading. For this redesign, which you should see on the site soon, I helped the team as it brainstormed design ideas and conducted various user tests to identify usability issues and gather user preferences. We conducted five-second tests, cupcake testing (a quick version of a usability test with a cupcake as the reward), and A/B testing. These tests allowed me to observe the context and user behavior, and to get firsthand user feedback that could be consolidated in our next iterations. They also helped me better understand the priority of each issue so as to make reasonable decisions with limited time and resources.
I spent most of my time creating an open and public API to the data within Labs’ popular Understanding Shakespeare. I conducted a series of five interviews with Digital Humanities scholars to collect detailed information about use cases for the API and to gather design ideas for the visualizations I would build on top of that API. Working closely with Ron, Labs’ lead technologist, I started with a data cleaning task to understand the process of "quote matching" between the text of the plays and JSTOR articles. Then I added the rest of the plays: the site now has all 38 Shakespeare plays (available on our redesigned website, here). I also did some data processing with XML, parsing it to rebuild the data structure and generate extra fields for indexing. To further prepare for the API, I learned how to use Django and the Django REST framework.
With these preparations behind us, we began to develop the API. The Labs team plans to release the API soon, but I can give you a sneak preview of how it was built and share a visualization I made using it. The API uses a REST framework combining indexed data with a SOLR search engine. The SOLR server enables customized data queries that allow data retrieval for every possible need. The data it returns has a clear nested structure in JSON format, which makes it easy to manipulate and efficiently helps users with data mining, or building their own applications/visualizations. For example, I built a visualization, a pack of zoomable circles that you can see here, using D3.js to show a hierarchy of all the plays.
To navigate the visualization, think of the largest circle as a representation of the “universe” of Shakespeare’s plays and the smaller circles labeled Tragedy, Comedy, History, and Romance as “galaxies” within that universe.
Within a galaxy, there are plays and within plays, there are circles representing the acts in that play.
Finally, within the acts are circles named for characters, with the size of circle used to indicate the number of times each character’s lines are referenced in an article on JSTOR.
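The universe, galaxy, play, act, and character nesting described above maps naturally onto the name/children/value tree that hierarchy layouts such as D3's circle packing typically consume. The sketch below shows one way to build such a tree in Python from flat rows. The row layout, the function name, and the reference counts are hypothetical and are not taken from the Understanding Shakespeare API.

```python
def build_hierarchy(rows, root_name="Shakespeare"):
    """Nest flat (genre, play, act, character, count) rows into a tree."""
    root = {"name": root_name, "children": []}
    for *path, count in rows:
        node = root
        for name in path:
            # Reuse an existing child with this name, or create one.
            child = next(
                (c for c in node["children"] if c["name"] == name), None
            )
            if child is None:
                child = {"name": name, "children": []}
                node["children"].append(child)
            node = child
        # Leaves (characters) carry a value instead of children.
        node.pop("children", None)
        node["value"] = node.get("value", 0) + count
    return root


# Made-up reference counts, for illustration only.
rows = [
    ("Tragedy", "Hamlet", "Act 3", "Hamlet", 120),
    ("Tragedy", "Hamlet", "Act 3", "Ophelia", 30),
    ("Comedy", "Twelfth Night", "Act 1", "Viola", 15),
]
tree = build_hierarchy(rows)
```

Serializing `tree` with `json.dumps` would give a payload in which each character's `value` can drive the size of its circle.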
The API will be available soon, and I look forward to seeing what others create using it!
In addition to honing my technical and UX skills, I’ve also learned a great deal during the internship about how organizations function in the real world, such as how different departments cooperate and support each other. More importantly, despite the high level of diversity in Labs team members, we collaborated effectively with nice team chemistry. I really enjoyed working with the whole team who are so professional, interesting, and super supportive. It’s been a great summer at JSTOR Labs. Thank you all!
Xinyu, left, with Kate, Jessica, Beth, and Ron.
I wish JSTOR Labs every success in the future.
Under the Hood of JSTOR Snap
Wed 05 Aug 2015
In this installment of the JSTOR Labs blog we take a long-overdue look under the hood of the JSTOR Snap photo app.
We developed and launched Snap way back in February. If you’ve been following the blog you’ll remember that Snap is a proof of concept mobile app that was developed to test a core hypothesis – that a camera-based content discovery tool would provide value to users and was something they would actually use. Our findings on that question were somewhat mixed back in February. Users were definitely intrigued by the concept but the app hasn’t received a ton of use since. However, user feedback and reactions in demos of the app that we’ve conducted since February continue to suggest there is some real potential here. Where we ultimately go with this is hard to say right now, but the possibilities continue to intrigue both users and us. Additional background on the user testing approach can be found here, including a short “how we did it” video.
In addition to testing the core user hypothesis we also wanted to see what it would take to actually build this thing. While doing this quickly was important (because that’s what we do) we also wanted to see if a solution could be developed that produced quality recommendations with reasonable response times. So it wasn’t just a question of technology plumbing to support user testing. We were really interested in seeing if our topic modeling, key term extraction, and recommender capabilities were up to the task of generating useful recommendations from a phone camera-generated text image.
The technical solution involved three main areas of work: the development of the mobile app itself, the extraction of text from photos, and the generation of recommendations based on concepts inferred from the extracted text. I’ll describe the technology approach employed in each of these three areas and share some general findings and impressions of each.
First, the mobile app: this represented Labs’ first project involving mobile development so we took a fairly simple approach. No native app development for us this time (although we’d briefly considered it). We decided to go with a basic mobile web application, but do so in such a way that it could be transitioned into a hybrid mobile app capable of running natively on the major phone operating systems, if needed. For the web app framework we decided on jQuery Mobile after conducting a quick survey of possible approaches. There were many good candidates to choose from but we ultimately selected jQuery Mobile based on its general popularity (figuring it would be easy to find plugins and get advice) and perceived ease of learning. All in all we were satisfied with the jQuery Mobile approach. As we’d hoped, the learning curve was rather modest and the near ubiquity of jQuery made this a good choice for our initial mobile project.
Going into the project my single biggest worry was whether we’d be able to do on-the-fly OCR processing with acceptable quality and response times. I’d initially considered developing a custom back-end OCR service based on the Tesseract or OCRopus OCR engines. After some initial prototyping it quickly became apparent that this approach, while technically feasible, would take more time and effort to get right than we could afford on this short project. Based on that we decided to go with an online service. There were a number of options to choose from here but we ended up going with OCR Web Service. We’ve been very happy with this choice. The OCR accuracy is excellent, response times are relatively good, and the price is quite reasonable. The only real work involved here was the development of a SOAP client for our Python/Django-based backend service to use.
Our last challenge involved the generation of recommendations from the OCR text. For this we first needed to identify the key terms and concepts from the extracted text. This is a two-part process. The first part involves key term identification using a rule-based tagger that identifies terms from a controlled vocabulary that JSTOR has been developing for a couple of years now (more on that can be found here). The second part involves topic inference (using LDA topic models generated from the full JSTOR corpus). The key terms and topics associated with the OCR text were then used to find other documents in JSTOR with similar characteristics.
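To give a feel for that last step, here is a minimal Python sketch of ranking documents by a blend of key-term overlap and topic similarity. The scoring is an assumption made for illustration (Jaccard overlap for terms, cosine similarity for topic distributions, equal weights); the post does not describe how Snap actually scores candidates, and the sample documents are made up.

```python
import math


def jaccard(a, b):
    """Share of key terms two documents have in common."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(u, v):
    """Cosine similarity between two topic-weight vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def recommend(query, docs, term_weight=0.5, top_n=2):
    """Rank docs by a weighted blend of term overlap and topic similarity."""
    scored = []
    for doc in docs:
        score = (term_weight * jaccard(query["terms"], doc["terms"])
                 + (1 - term_weight) * cosine(query["topics"], doc["topics"]))
        scored.append((score, doc["id"]))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_n]]


# Hypothetical tagger and topic-model output for the OCR text and three articles.
query = {"terms": {"herring", "fisheries"}, "topics": [0.7, 0.2, 0.1]}
docs = [
    {"id": "A", "terms": {"herring", "fisheries", "spawning"}, "topics": [0.6, 0.3, 0.1]},
    {"id": "B", "terms": {"sonnets"}, "topics": [0.05, 0.05, 0.9]},
    {"id": "C", "terms": {"fisheries"}, "topics": [0.5, 0.4, 0.1]},
]

ranked = recommend(query, docs)
print(ranked)
```

In this toy data the two fisheries articles outrank the unrelated one, which is the behavior you would want from any scoring scheme of this kind.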
We haven’t performed any formal testing of the generated recommendations yet, but feedback from users has been pretty good, at least in cases where the input is good. This is a situation where the expression “garbage-in, garbage-out” really applies. If a dark or blurry image is used (or even one with sparse text) the recommendations produced are much less targeted than when we have sharp images with text rich in relevant terms and concepts. I’d encourage you to give it a try for yourself. Go to http://labs.jstor.org/snap using your smartphone and try this on some text you’re familiar with. We’d love to hear what you think about the app and the recommendations.
Labs' No-Longer-Secret Ingredient
the JSTOR Thesaurus
Wed 22 Jul 2015
Over the past year you may have noticed a quiet but powerful feature appear in many of the JSTOR Labs projects: article-specific terms, or keywords. For example, in both Understanding Shakespeare and Classroom Readings, you’ll find them underneath article titles separated by pipes:
In this post, I’d like to give you a background on this feature and describe how we’ve used it specifically on the Sustainability prototype.
Introducing the JSTOR Thesaurus
JSTOR, as you likely know, has content across virtually all of the academic disciplines from hundreds of different publishers. If we had a way of classifying all of that disparate content in a consistent way, it could help people better find the content that they’re looking for. These article-specific terms that crop up in a variety of ways in Labs projects are one approach we’ve been exploring to achieve this. They are generated by something called the JSTOR Thesaurus.
The JSTOR Thesaurus is a semantic index, or a rules-based hierarchical list of terms. Let’s split that definition into its two parts:
A hierarchical list of terms - The Thesaurus is a hierarchy where the top terms are in that position because they are the most inclusive, and, at every subsequent level, a narrower term is a part/subset/instance/example of its broader term (i.e., a parent/child relationship).
Rules for applying those terms - Terms all by themselves can be ambiguous, so we create rules that define how and when a term is applied to a document by our indexing. For example: the word herring can refer either to the type of fish or to a red herring as used in literature. We use a rule like the one below to tag only the fishy uses.
IF (MENTIONS “fish*” or MENTIONS “clupea*” OR WITH “salted” OR WITH “smoked” OR WITH “pickled” OR AROUND “spawn*”) USE Herring
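For readers who think in code, here is a minimal Python sketch of how a rule like this might be evaluated. The operator semantics are assumptions made for illustration: MENTIONS is treated as "appears anywhere in the document", WITH as "appears within a few words of the term", AROUND as "appears within a slightly larger window", and a trailing * as a prefix wildcard. The window sizes are also made up; the actual indexing software may define all of these differently.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def matches(token, pattern):
    """Treat a trailing * as a prefix wildcard, as in "fish*"."""
    if pattern.endswith("*"):
        return token.startswith(pattern[:-1])
    return token == pattern


def mentions(tokens, pattern):
    """True if the pattern matches any token in the document."""
    return any(matches(t, pattern) for t in tokens)


def near(tokens, trigger, pattern, window):
    """True if the pattern matches a token within `window` words of the trigger."""
    for i, tok in enumerate(tokens):
        if tok != trigger:
            continue
        lo, hi = max(0, i - window), i + window + 1
        if any(matches(t, pattern)
               for j, t in enumerate(tokens[lo:hi], lo) if j != i):
            return True
    return False


def tag_herring(text, with_window=5, around_window=10):
    """Apply the (assumed) rule semantics and decide whether to use 'Herring'."""
    tokens = tokenize(text)
    if "herring" not in tokens:
        return False
    return (mentions(tokens, "fish*")
            or mentions(tokens, "clupea*")
            or any(near(tokens, "herring", w, with_window)
                   for w in ("salted", "smoked", "pickled"))
            or near(tokens, "herring", "spawn*", around_window))


print(tag_herring("Smoked herring was a staple of the coastal diet."))
print(tag_herring("The clue turned out to be a red herring in the novel's plot."))
```

The first sentence is tagged because "smoked" sits next to "herring"; the second is not, because nothing in it points to the fish.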
To build our initial list of terms, we combined over twenty discipline- and subject-specific taxonomies. These came from sources as disparate as NASA and ERIC, the U.S. Department of Education’s online repository for research literature on education. This combination helped us achieve the broad and encompassing view we needed of the entire JSTOR corpus. We currently have over 57 thousand terms in the Thesaurus, but we never stop adding, editing and pruning terms, or adjusting and improving the rules they operate under.
The Sustainability Prototype
As Alex described last week, with the Sustainability Prototype our goal was to create a site smart enough that the content was greater than the sum of its parts. To achieve this, we needed to make an investment in building out a more comprehensive set of terms for the interdisciplinary sections of the Thesaurus which Sustainability covers. To do that we worked with a group of experts within the fields associated with Sustainability – ranging from Industrial Economics to Environmental History – to identify over 1500 key terms from across the Thesaurus that are specifically linked to the study of Sustainability. Those same experts then helped us extend the set of synonyms and non-preferred terms that appear for each term. The terms appear throughout the prototype, including on individual article pages, in the search results, and as a filter within search, helping users to find specific articles and content. They also appear on topic pages like the one below, which you can browse through to create a mental map of the interdisciplinary terrain that Sustainability covers.
The Thesaurus is always improving and continues to grow, expand and deepen with each new project. The Sustainability Prototype gave us the opportunity to showcase the potential of the Thesaurus, and we’re eager to see if it’s as impactful as we think it will be. If, as you explore the site, you have suggestions or questions about the Thesaurus, we’d love to hear them! Just shoot us an email firstname.lastname@example.org.
A Tour Guide in a Multidisciplinary World
Tue 14 Jul 2015
I’ve been thinking about the impact of a truly great tour guide. Maybe it’s because I just took my family to Italy for vacation, and the difference between last week’s tour through the Vatican and my previous one could not have been greater. Chalk some of that difference up to the heat wave currently choking Europe, and to the fact that this time my wife and I had two (lovely, tired) children to lug around. But much of the difference is that the last time we were there we had a tremendous tour guide leading us through. I think his name was Michael and he was both deeply knowledgeable and also quite charming. In a museum overflowing with masterpieces, he led us to those works most worthy of our attention. He told the stories of works we didn’t know and he deepened our appreciation of known works like the ceiling of the Sistine Chapel. We walked away edified and inspired, all because of his expert guidance.
All of which brings me to what JSTOR Labs has been trying to do in a new prototype we’re building for people working in or interested in the subject of Sustainability. Whether they’re a student just learning a topic that could be their future specialty or an established scholar branching out into new territory, researchers of all kinds have to introduce themselves to new topics and fields. When they do, researchers use tour guides just like I did in Rome. The tour guide might be a thesis advisor or it might be a colleague familiar with the topic. Whoever it is, they perform many of the same functions as Michael at the Vatican: they point out the key works (articles, books); they provide context for and connections between these works; they tell the story of the topic. With JSTOR Sustainability, our goal was to augment these tour guides.
To achieve this, we’ve combined a number of methods:
Content Selection: We used a series of topic modeling exercises to identify over 250,000 articles currently available in JSTOR that are relevant to Sustainability, environmental stress and the related challenges for human society. Since Sustainability is an inherently interdisciplinary study, topic modeling allowed us to discover relevant content across the sciences, social sciences and humanities. This content set formed the base for us to build upon.
Keywords: JSTOR has been developing a corpus-wide semantic index that can algorithmically associate all of our documents with subject keywords – in fact, many of the Labs projects have taken advantage of this new index. With Sustainability, we worked with scholars and subject matter experts to review the keywords for their areas of expertise in order to both broaden and deepen the index and make it even more valuable to researchers.
Topic Pages: For Sustainability-related keywords, we’ve created individual pages to serve as quick overviews and jumping-off-places. Using Linked Open Data, we’ve connected DBpedia overviews with JSTOR-specific information about each topic, such as Key Authors and Related Topics.
Influential Articles: Last, we partnered with Jevin West and the University of Washington DataLab – the folks behind the citation-network-based metric known as Eigenfactor – to understand article networks and to incorporate a timeline on each topic page showing the topic’s most influential articles.
I encourage you to see the results for yourself by checking out the prototype at labs.jstor.org/sustainability. On that site, you can search and browse through all the topic pages you want openly, checking out the keywords and the influential articles timeline. If you want to access individual articles through the site, you or your institution will have to sign up for our beta program. If you’re interested, please shoot an email to email@example.com with Sustainability Prototype in the subject line or reach out on Twitter or Facebook.
Over the coming weeks, we’ll shine the Labs Blog spotlight on each of these methods, much like, oh, a tour guide might introduce you to one painting after another. There are a lot of exciting stories to tell, and I can’t wait to share them with you. In the meantime, enjoy the site, something that will be easier if you can avoid hundred-degree temperatures and two cranky kids…
Content Curation for Livingstone's Zambezi Expedition
Thu 28 May 2015
When we began brainstorming with the JSTOR Labs team about this project, we, the JSTOR Primary Source content team, knew that Global Plants contained a wealth of materials from Livingstone’s Zambezi Expedition that included letters, specimens, and botanical illustrations. Armed with the idea to build a map connecting the specimens with the other materials, we began the hunt for more Zambezi Expedition materials in Global Plants and the JSTOR archive.
Step 1: Content Discovery
We started on the Global Plants live site searching for: letters to and from Livingstone, Kirk, Baines, and Meller; specimens collected by Livingstone and Kirk; paintings completed by Baines. We searched for anything with the keyword “Zambezi” or “Zambesi” to pull in related works. Finding content from the JSTOR Archive started with a similar approach. This allowed us to find relevant content within the journals of the Royal Geographical Society of London, the very organization that funded the expedition. The Journal of the Royal Geographical Society of London includes letters and updates about the expedition before, during and after the event.
Step 2: Export Content
In both instances we worked with a “wide net” theory: capture all the records that could be relevant, export those records, and then trim out the unrelated ones. We compiled a list of content across many collections, from different institutions, contributed over many years. With the help of our awesome Software Development Manager, Chakradhar Sreeramoju, we exported the content from Global Plants and the JSTOR archive.
Step 3: Standardization: Augmenting and Refining Data
We then got to work refining and cleaning the data, so that we had more uniform data that could be used to help drive functionality and display. Some work was relatively automated, for example standardizing dates and personal names, excluding paintings completed after Baines was dismissed in 1860, or removing specimens Kirk collected after the Expedition ended. Other tasks were more difficult, and required a close read to confidently include or exclude. Identifying dates and localities often required examining the objects individually; a sample of lichen and wood that Dr. Kirk donated to the Royal Botanic Gardens, Kew might have its place of collection listed in the description, or a letter might state in the title the location from where it was written.
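The more automated part of that clean-up can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the scripts we actually ran: the record fields (`creator`, `date`) are hypothetical, dates are reduced to a four-digit year, and the expedition's end year is set to 1864 as an assumption that does not come from this post. Records with no recognizable year return `None` so they can be routed to a person for the close read described above.

```python
import re

# From the post: Baines was dismissed in 1860, so his later paintings are excluded.
CREATOR_CUTOFF_YEAR = {"Baines": 1860}
# Assumption for illustration only; the post does not state the end year.
EXPEDITION_END_YEAR = 1864


def extract_year(raw_date):
    """Pull a four-digit 19th-century year out of a free-text date, if any."""
    match = re.search(r"\b(18\d{2})\b", raw_date or "")
    return int(match.group(1)) if match else None


def keep(record):
    """Return True/False for clear cases, or None when a human should review."""
    year = extract_year(record.get("date"))
    if year is None:
        return None
    limit = CREATOR_CUTOFF_YEAR.get(record.get("creator"), EXPEDITION_END_YEAR)
    return year <= limit


# Hypothetical records showing each outcome.
samples = [
    {"creator": "Baines", "date": "12 Mar. 1859"},
    {"creator": "Baines", "date": "1862"},
    {"creator": "Kirk", "date": "circa 1866"},
    {"creator": "Kirk", "date": "n.d."},
]
for rec in samples:
    print(rec["creator"], rec["date"], "->", keep(rec))
```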
Step 4: Adding Geographic Context
By the end of this work we had identified 118 unique localities, with only 100 objects missing a locality. We assigned these materials a generic locality near the mouth of the Zambezi, labelled “Zambezi Expedition”. To find coordinates for the 118 localities we utilized a combination of resources, including the Getty Thesaurus of Geographic Names Online, Harvard University’s AfricaMap, the Columbia Gazetteer, and Google Maps. We also used the functionality built by Labs that overlaid the historical maps from the expedition on top of current maps. This was particularly helpful for working with locality place names that were identified on the historic maps but are no longer in use.
The work to identify localities is definitely a challenge to replicating this project for other expeditions, as it did require time to locate and verify an identified locality. But the value it adds to understanding the expedition and its impact is enormous. For example, you can see what kinds of plants grew in the same areas, or compare the Combretum imberbe Wawra specimen that Kirk collected in 1859 to the sample of Combretum imberbe Wawra wood the expedition donated to Royal Botanic Gardens, Kew in the following year. Or, examine the Dasystachys drimiopsis Baker illustrations that Matilda Smith created for Curtis’ Botanical Magazine in 1898 alongside the original specimen, collected by Kirk in 1859, shown in the image below.
These sorts of connections are certainly possible to uncover in Global Plants, but it is difficult; the 800 objects collected or created during the Zambezi Expedition represent just 0.033% of the 2.4 million objects in Global Plants.
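If you want to check that percentage, the arithmetic is a one-liner. The two counts below are the rounded figures quoted above.

```python
zambezi_objects = 800
global_plants_objects = 2_400_000

# Share of Global Plants represented by the Zambezi Expedition, as a percentage.
share_percent = zambezi_objects / global_plants_objects * 100
print(f"{share_percent:.3f}%")
```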
A major benefit of Global Plants is that plants that were collected in the same locality or on the same expedition can be reconnected and discoverable in one place. But the size of Global Plants can make it difficult to uncover these connections if you don’t already know you’re looking for them. We see projects like the Zambezi Expedition as a way to help strengthen these connections and increase the types of discovery Global Plants facilitates to better encompass browsing and exploring. As we knew we would, we learned a lot from the process of doing, both about the Zambezi Expedition and the multitude of uses such a rich dataset provides. And we’re very happy Labs gave us the technical support, enthusiasm, and opportunity to uncover so much about the Zambezi Expedition anew.
Livingstone's Zambezi Expedition
Thu 14 May 2015
Livingstone’s Zambezi Expedition lets you explore material related to Dr. David Livingstone’s mid-nineteenth century exploration of southeastern Africa. This short video walks you through the features and content on this, our latest Labs project. We've also posted a high def 1080p version of the video.
Exploring with Livingstone
Introducing Livingstone's Zambezi Expedition
Wed 13 May 2015
JSTOR Global Plants is big. It contains some 2,171,000 plant specimens and roughly 240,000 primary sources contributed by herbaria from all over the world, and is still growing. That enormity has helped it to become an indispensable resource for plant taxonomists and botanists. But that sheer number can also be overwhelming to non-specialists, making it harder to find the cool tidbits and eye-opening factoids it contains. What’s the old line? “There are eight million stories in the naked city?” Well, there are some two and a half million stories in JSTOR Global Plants, and we were interested in ways that would help researchers, teachers and students discover those stories.
We started by gathering together all the content we could find related to a single expedition: David Livingstone’s expedition up the Zambezi River to Lake Nyassa and beyond in east Africa. JSTOR’s completely-awesome Plants team did this work, finding nearly a thousand items, including maps drawn by Livingstone himself, letters between members of the expedition party, reports to the Royal Geographical Society on the expedition’s progress (and gruff dismissals of their requests for more money), and, of course, hundreds of plant specimens Livingstone’s botanist John Kirk brought back to the Kew Herbarium in London. With these treasures collected, we then looked for ways to link all of these items to make it easier to find the connections between them, and, by extension, all those hidden stories.
(Of course, the Labs team did this in yet another flash build.)
Inspired by the New York Public Library Labs’ work with New York City historic maps, we started to tinker with Map Warper, an open source tool to overlay historic maps on top of today’s geo-precise maps. Onto this layered map, we pinned all the content we’d found from Global Plants, JSTOR and beyond. On top of all this, we added an animated timeline to help people discover how the expedition unfolded over time.
We think the combo of being able to explore both geographically and chronologically is exciting and powerful. We hope it will help researchers, teachers and students to discover all the important stories contained inside of this collection, stories like Livingstone’s dawning awareness of the horrors of slavery, the conflict and correspondence with the expedition funders, and personal stories like the tragic death of Livingstone’s wife. We also hope it might inspire librarians, archivists and publishers – anyone who is a custodian of such rich historical and cultural material – to find ways to share and enrich that content. In doing so, they’ll make it possible for researchers to get a bit of the same thrill of discovery as Livingstone. But with fewer mosquitoes.
Building JSTOR Snap
Fri 20 Feb 2015
JSTOR Snap enables you to take a picture of any page of text and get a list of research articles from JSTOR on the same topic. Watch this video to learn more about how our team created JSTOR Snap as part of a "flash build" with participation from University of Michigan students and faculty.
Tue 17 Feb 2015
As described in the previous post, the JSTOR Labs team held another weeklong flash build in December. When we talked to students and faculty who use JSTOR and similar products for research on Monday of that week, they had no desire to use a phone to conduct a search for academic literature. Knowing that their concerns may have been based on how things are instead of how they could be, we persisted.
Later in the week, we zeroed in on a search that allowed researchers to take a photo of a page of text with their smartphone camera. The photo is run through OCR and the resulting text has topic modeling applied and then the app presents articles from JSTOR about the same topic. The photo-as-search concept was well received, but users were still unsure that the search results could be made manageable on a smartphone.
One of the biggest challenges we faced was displaying search results on a tiny screen. For each article in the list, users wanted to see display title, author(s), journal, year, as well as a way to save it to a list. They also said that keywords and abstracts are extraordinarily helpful in quickly gauging the value of an article. That's a lot to try to fit in even on a large screen, and phone screens, while growing, aren't large!
We mocked up two fundamental ways to show results. First, we had a more-typical "list-view:" that is, the results arranged vertically in a scrolling, numbered list (the first two phones below show variations on this first approach). Second, we experimented with displaying one result at a time. That gave us a reasonable amount of space in which to display all the necessary information about an article – in some cases even more than is usually seen in search results – while also working with a phone-friendly swipe-motion to navigate through the results. It also, unfortunately, meant that a user had to take an action (swiping) to see more than one result. (The second two phones below are two variations on this approach.)
4 ways we tested to display search results on a small screen.
We showed all of these ideas to students and faculty and heard that, while they hated the way the list-view looked, that was the way it would have to be: one item per screen simply wouldn’t work for them. Then, an interesting thing happened that highlights the importance of not only listening to what users want, but also searching for the reasons why they want what they want.
We asked why the results needed to be shown together and not one at a time, and that uncovered a hidden step in their workflow. When they search for articles, they scan through the list twice, each time seeking to answer a different question. First, they scan over the whole list to answer the question: "Are my search parameters getting me what I want?" Then, once they’ve assessed that their search is on the right track, they go item-by-item within the results answering for each the question, “Is this specific result worthy of further investigation?”
This led us to ask: What if we showed a results summary to answer that first question before showing any actual results? If it worked, we could use the single-item view and users wouldn't have to swipe through individual list items before finding out these weren’t even the results they wanted to see. We came up with the sketch shown on the right.
It was a great compromise and users were pleased with it. There was sufficient screen real estate to show all the desired article information without requiring extra swiping between articles just to see if they had used the right search.
With the right questions and insights, it's possible to meet user needs better than the best solution they could envision themselves.
Sketch of a search results summary screen.
Final designs for summary screen and an article.
Wed 11 Feb 2015
When we started this most recent JSTOR Labs project, we interviewed a number of our users about their mobile research habits, hurdles and desires. What we heard was a resounding chorus:
“Research on my phone? Never. Not if I can avoid it.”
“The screen’s too small.”
“I use so many tabs – it wouldn’t work on my phone.”
And yet, they also all talked about how much they relied on their phone for other actions – email, texting, paying friends, etc. – which was more in line with all the data points we’ve all been reading showing that “mobile is eating the world.”
Our goal with this prototype was to poke at this paradox. We wanted to create a phone experience that wasn’t a stripped-down, small-screen version of the desktop experience, but instead was one that you couldn’t replicate on a larger screen. We wanted to provide functionality not available on a desktop or laptop, in an experience that took advantage of the way users interact with a phone.
To achieve that goal, we did another flash build. The Labs team spent a week holed up at the Espresso Royale coffeehouse just off the UMich campus. We hung these signs up:
By the end of the week, we’d spoken with almost twenty students and faculty. We showed them a series of paper and digital prototypes, and then designed and built an experience based on their input. The prototype we built – which we’re calling JSTOR Snap – lets a user take a picture of any page of text (say, a page of a textbook, or a syllabus, or a class assignment, or the first page of an in-progress essay) and it will return research articles about the same topic. Users can then swipe through the articles to evaluate them and then save a list for reading later. When we put in their hands a phone with that experience on it, we heard:
“This could be better than doing research on my laptop.”
It’s still just a prototype – more of a concept car than something you’d want to ride across country – but I hope you’ll give Snap a try. Just point the browser of your Android or iOS smartphone at http://labs.jstor.org/snap, and then share your thoughts and reactions on Twitter (#jstorsnap), email (firstname.lastname@example.org) or in the comments below.
NYPL Open E-book Hackathon
Reflowing Page Scans for Small Screens
Fri 23 Jan 2015
On January 14, the JSTOR Labs team took part in New York Public Library Labs’ Open Book Hack Day. Hoo-boy, what a great day. We were inspired by a dozen awesome projects, and we met oodles of smart, creative, like-minded people, all of whom are working to increase access to knowledge. Our project for the day was an experiment in improving the reading experience of page-scan content on a mobile device. Dubbed ReflowIt and best experienced using a smartphone, our project re-flows page scan content for a handful of articles so that it renders more easily on a small screen, and does so without having to rely on imperfect OCR text.
The image below, a page scan of one of the articles in JSTOR’s open Early Journal Content (EJC) collection, demonstrates the problem. This content, like many digitized-print collections, consists of scans of the original page with associated OCR’ed text and metadata. This works fine for discovery and display on a large screen, but when a user tries to view this content on a phone’s small screen, the text is small, and they need to pan and zoom and pinch in order to read it.
One way to solve this problem is to present not the image of the page but the OCR’ed text. The challenge with this approach is that OCR is imperfect and especially so with some of the historical fonts found in the EJC content. ReflowIt works by reflowing the images of the words for better reading on a small device:
We accomplished this by working with a couple of open-source tools, the venerable imagemagick and a tool that we’ve only recently discovered called k2pdfopt. k2pdfopt was developed to convert PDFs for use with Kindle e-readers. It has a ton of configuration options and can be coerced into converting PDFs for other types of device. Once we have a mobile-optimized PDF from k2pdfopt we then use imagemagick to extract and resample the individual page images for use with mobile apps (either web apps or native).
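As a rough sketch of that two-step pipeline, the Python below just assembles the two command lines. The file names, target screen size and density are made-up examples, and the flags shown (`-ui-`, `-x`, `-w`, `-h`, `-o` for k2pdfopt; `-density` for imagemagick's `convert`) reflect our understanding of the tools' options rather than the exact settings used at the hack day, so check them against the versions you have installed before relying on them.

```python
import shlex


def k2pdfopt_cmd(src_pdf, out_pdf, width_px=480, height_px=800):
    """Build a k2pdfopt call that reflows a PDF for a small (assumed) screen size."""
    return [
        "k2pdfopt", src_pdf,
        "-ui-",                # assumed: skip the interactive menu
        "-x",                  # assumed: exit when conversion finishes
        "-w", str(width_px),   # target page width
        "-h", str(height_px),  # target page height
        "-o", out_pdf,         # output file name
    ]


def imagemagick_cmd(pdf, out_pattern, density=150):
    """Build an imagemagick call that renders each PDF page to its own image."""
    return ["convert", "-density", str(density), pdf, out_pattern]


# Hypothetical file names for illustration.
cmds = [
    k2pdfopt_cmd("article.pdf", "article_mobile.pdf"),
    imagemagick_cmd("article_mobile.pdf", "page-%03d.png"),
]
for cmd in cmds:
    print(shlex.join(cmd))
```

To actually run the steps, each list can be passed to `subprocess.run(cmd, check=True)` on a machine where both tools are installed.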
The reflowing of text regions from scanned images works surprisingly well, especially for those documents with simple layouts and modern typefaces. However, when trying to reflow image text in documents with more complex formats the results are spottier, and in a few cases downright awful. It’s likely that the handling of these more difficult cases can be improved with some pre-processing and configuration tuning.
While it's clear that this sort of approach will never be perfect, this quick proof of concept has shown that it is possible to perform automated reformatting of PDFs generated from scanned images with acceptable levels of accuracy in many cases. Based on this initial exploration we believe both the general approach and the specific tools used hold promise as a way to impact a lot of content economically. This should be viewed as but one in a suite of techniques that could be used for making this type of content more mobile-friendly. We will provide a more in-depth discussion of the approach used, our detailed findings, and possible areas for future research and development in an upcoming blog post. We’re interested in hearing the community’s interest in this approach, and suggestions for how it might be used. For example, one idea we had was to use this to improve the accessibility of page scan content by increasing the display-size of the rendered pages. We’d love to hear more ideas! Email us or toss in a comment below.
Wed 17 Dec 2014
This is the third post in our Understanding Shakespeare series. In the first two posts Alex described the partnership and process behind the project. In this post we’ll take a peek under the hood to look at the approach used in the content matching.
The core idea that we wanted to explore in this project was whether users would find value in being able to use a primary text as a portal for locating secondary literature, specifically journal content available from JSTOR. The basic premise was validated in the exploratory interviews we conducted with scholars and students. While the exact form of the capability still needed to be fleshed out we knew that we needed a means for connecting play content to journal articles at a pretty granular level.
So we had some preliminary user validation of our core idea. Now we wanted to put this into a hi-fidelity prototype for our next round of user testing. For that we needed data, because, to quote a line from King Lear: “Nothing will come of nothing.” We had a lot of dots that needed connecting. In our vision for the tool we not only wanted to match article references with the passage in the play but also connect both of these to physical artifacts that provide visual representation of the article and play for web-based display and navigation. Specifically, we wanted to highlight and link the passages in the Folger Digital Text with the specific regions on the JSTOR scanned page images.
Our initial plan was to find explicit play references in JSTOR articles and create bi-directional links between the play and article using the mined references. This seemed a reasonable (and potentially straightforward) approach, as many of the articles we looked at used a convention where a referenced passage from a play was annotated with the act, scene, and line number in the referencing article. However, initial attempts at using these references proved problematic, as multiple plays were often referenced in the same article and the play text referenced could be from an edition other than the Folger edition we were using. In addition to these challenges we also had to contend with text recognition errors in the article text, which is produced using optical character recognition (OCR). While these were likely tractable problems, we concluded that a fuzzy text matching approach would likely provide a more robust solution.
Our data scientist, Yuhang Wang, was tasked with designing and implementing an algorithm for performing the fuzzy text matching. Both block and inline quotes were to be considered in the matching process. Block quotes were identified by using OCR coordinates to identify text passages offset from surrounding text. Inline quotes were identified by text bounded by quotation (“) characters. After normalizing the text in the extracted quotes and play text, candidate matches were found using a fuzzy text matching process based on the Levenshtein distance measure. Levenshtein edit distance is a similarity measure of two texts that counts the minimum number of operations (the removal or insertion of a single character or the substitution of one character for another) required to transform one text into the other. Using this approach we found the substring from the play text with the smallest Levenshtein edit distance for each candidate quote.
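To make the matching step concrete, here is a minimal sketch of the idea in Python. It is not the algorithm used in the project: the normalization rules, the quote-extraction regex and the function names are assumptions for illustration. Instead of enumerating every substring, it uses a well-known variant of the Levenshtein dynamic program in which a match may start anywhere in the play text at no cost, so the smallest value in the final row is the smallest edit distance between the quote and any substring.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, drop accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text).lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def inline_quotes(article_text):
    """Return passages bounded by straight or curly double quotes."""
    return re.findall(r'["“]([^"“”]+)["”]', article_text)


def best_substring_match(quote, play_text):
    """Return (distance, end) for the play substring closest to quote.

    distance is the Levenshtein edit distance to the best substring and
    end is the index in play_text just past that substring.
    """
    # Row 0 is all zeros: a match may begin at any position for free.
    prev = [0] * (len(play_text) + 1)
    for i, qc in enumerate(quote, start=1):
        curr = [i] + [0] * len(play_text)
        for j, pc in enumerate(play_text, start=1):
            curr[j] = min(
                prev[j] + 1,               # quote character missing from the play
                curr[j - 1] + 1,           # extra play character
                prev[j - 1] + (qc != pc),  # exact match or substitution
            )
        prev = curr
    end = min(range(len(prev)), key=prev.__getitem__)
    return prev[end], end
```

A candidate quote such as “when the battel's lost”, with an OCR-style error in it, would be normalized and then scored against the normalized play text. The work grows with the product of the quote length and the play length, which is why running this for every extracted quote adds up quickly.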
The matching of article text to play passages required some significant computation as the Levenshtein edit distance had to be calculated for each extracted article quote and all possible substrings in the play text. For that we used our in-house Hadoop cluster and some carefully crafted MapReduce programs. It’s safe to say that prior to the advent of technologies such as Hadoop and MapReduce permitting highly parallelized text processing this project would not have been practical.
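The shape of such a job can be sketched in a few lines of plain Python. This is only an in-process imitation of the map, shuffle and reduce phases, not our Hadoop code, and the chunking scheme and function names are assumptions: each map call scores one quote against one chunk of play text, and the reduce step keeps the closest chunk for each quote.

```python
from collections import defaultdict


def min_substring_distance(quote, text):
    """Smallest Levenshtein distance between quote and any substring of text."""
    prev = [0] * (len(text) + 1)
    for i, qc in enumerate(quote, start=1):
        curr = [i] + [0] * len(text)
        for j, tc in enumerate(text, start=1):
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (qc != tc))
        prev = curr
    return min(prev)


def mapper(quote_id, quote, chunk_id, chunk):
    """Emit (key, value): key is the quote id, value is (distance, chunk id)."""
    yield quote_id, (min_substring_distance(quote, chunk), chunk_id)


def reducer(quote_id, values):
    """Keep the closest chunk for one quote."""
    return quote_id, min(values)


def run_job(quotes, chunks):
    """Run map, group by key, then reduce, all in one process."""
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for quote_id, quote in quotes.items():
        for chunk_id, chunk in chunks.items():
            for key, value in mapper(quote_id, quote, chunk_id, chunk):
                grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())
```

Because every (quote, chunk) pair is scored independently, the map calls can be spread across as many machines as are available. In a real job the chunks would need to overlap so that a quote straddling a chunk boundary is not missed.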
This fuzzy text matching approach worked well overall, identifying nearly 800,000 candidate matches in the 6 plays analyzed. After applying filtering thresholds to reduce the false hits we ended up with 26,237 matches for the initial version of the prototype. As might be expected, the matching accuracy was much better for longer quotes but tended to include a good number of false hits on smaller passages (20 characters or fewer) when the quote consisted of commonly used words and phrases. A future refinement of the filtering process will likely include a measurement of how common a phrase is in modern usage. This would enable us to keep an 11-character quote like “hurly burly” but inhibit matches for something that occurs more frequently like “is not true”.
Overall, we are pretty happy with the approach used and the results. We have started thinking about how to improve the robustness and accuracy of the approach and also how to generalize it for use with other texts. Stay tuned as this looks to be an area in which more interesting tools may well emerge from the Labs team.
We are also planning to make the dataset generated in this project available for download by other scholars and researchers. It will include all of the candidate matches. More to come on this soon… in the meantime here are a few extracts from the first 6 plays we’ve incorporated into the prototype.
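Here is a rough sketch of what such a filter might look like. The thresholds, the function name and the phrase-frequency table are all hypothetical: the frequencies below are invented purely to illustrate the idea and are not measurements, and the actual filtering thresholds used for the prototype are not shown here.

```python
def keep_match(quote, distance, phrase_frequency,
               max_error_rate=0.2, short_length=20, max_short_frequency=1e-7):
    """Decide whether a candidate match survives filtering.

    All thresholds are illustrative assumptions, not the project's values.
    """
    if not quote:
        return False
    # Reject matches that needed too many edits relative to their length.
    if distance / len(quote) > max_error_rate:
        return False
    # Longer quotes are distinctive enough to keep on edit distance alone.
    if len(quote) > short_length:
        return True
    # Short quotes survive only if the phrase is rare in modern usage.
    return phrase_frequency.get(quote, 0.0) <= max_short_frequency


# Invented relative frequencies, for illustration only.
modern_usage = {"hurly burly": 2e-9, "is not true": 5e-6}
keep_match("hurly burly", 0, modern_usage)   # kept: short but rare
keep_match("is not true", 0, modern_usage)   # dropped: short and common
```

One design choice to note: a short phrase missing from the frequency table is treated as rare and kept. Flipping that default would trade recall for precision.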
Matched quotes by play:
Fri 14 Nov 2014
We had a partner to work with in the Folger Shakespeare Library. And we had the idea: give scholars and students better links between the plays and the scholarship. The question was: how should we go about exploring and developing this idea? For our answer, we took our inspiration from the Nordstrom Innovation Lab, who test out new ideas using a “flash build” – you can see a video of one of their builds here.
You might have heard of a flash mob, where a group of people converge on a place at the same time, put on some performance (pillow fight!), and then disband as quickly as they formed. A flash build is similar, but with, sadly, less dancing and fewer pillows. The team meets in a single location, focuses on a single effort, and, informed by regular access to end users, comes away with a piece of working software.
With this inspiration, we decided to organize ourselves around a flash build, which the Labs team would hold onsite at the Folger the week of September 29th. To make that work, we had some preparation to get through.
We started with a series of interviews of Shakespeare scholars and students, whom we contacted with the help of the Folger. These interviews were exploratory by nature. They gave us a picture of who we were trying to help and how they currently moved between the play and scholarship about the play.
Informed by these interviews and the possibilities they suggested, we embarked on the work to create the links between the play and the scholarship. We’ll describe that work in more detail in a future post, but in the meantime suffice it to say that Ron Snyder, the Labs team’s lead technologist, and Yuhang Wang, JSTOR’s Data Scientist, worked some magic to get a deep level of granularity that made this site possible. We then connected that data into a relatively simple front end, on which we could iterate quickly during the flash build.
Our preparatory work complete, the Labs team arrived at the Folger on September 29th ready for the flash build. We started the week with a “design jam” with Folger staff. A design jam is an organized brainstorming technique in which participants seek to generate many possible approaches to a particular design problem; in this case, we were looking for ways we might link the play to the secondary literature. You can see an example of the sort of material created in this picture.
By Wednesday, we had a good idea of what would be both valuable to users and feasible technically. By Thursday, we had a working site and, informed by another interview or four, we’d polished it to the point where it did a pretty good job of demonstrating the possibility of this approach. On Friday, we demonstrated the working site to the employees and fellows at the Folger.
In the few weeks following the flash build, we did do some polishing of the site, and we added five more plays to the site to increase the likelihood that people might find it useful. You can see what we came up with at Understanding Shakespeare. We’re interested in hearing what you think of it. We also would love to hear your stories of how you’ve used the site. Toss a comment in below, or send us an email at email@example.com.
Mon 10 Nov 2014
Let's dive into our latest effort – Understanding Shakespeare – in a series of posts, each one dealing with a particular aspect of the project. In this first post, we’ll look at the project’s genesis and the collaboration with the Folger Shakespeare Library. In later posts, we’ll walk through the process we used to build the site and look at how we created the data that underlies the site.
This project began with the partnership. The Folger Shakespeare Library in Washington, DC, is an incredible institution.
It houses one of the world’s largest and most important collections of Shakespeare-related materials, including a large number of the earliest printed editions of Shakespeare's plays.
In addition, they are the publisher of Shakespeare Quarterly, an anchor journal in Shakespeare studies, World Shakespeare Bibliography, a core resource in the field, the Folger Editions, the best-selling critical editions of Shakespeare’s works in North America, and the Folger Digital Texts, openly available electronic versions of those same editions. Last but by no means least, they connect to a network of scholars and students working in the field.
We first began to speak with the Folger about a possible partnership this past summer. In those conversations, we crafted an approach that fostered open, exploratory collaboration, focused on innovation. Perhaps this is easier to describe by saying what we didn’t do: we didn’t discuss what services one team could provide for the other. Nor did we try to specify in great detail from the start what deliverables would be met by each party.
Instead, we discussed what each team could bring to the collaborative table. In Folger’s case, that’s the list two paragraphs back. On JSTOR’s side, we had the full digital archives of Shakespeare Quarterly (or, SQ) along with 2,000 other journals, and we had a recently-formed Labs team. Given that set of assets, we discussed what opportunities were most worth exploring. Together, we settled on the idea of linking – in some way – the primary texts of Shakespeare’s plays with the secondary scholarship of SQ and, ideally, other content in the JSTOR archive.
Those three words “in some way” are important. When we started this project, we did not know how we would link the plays with the scholarship, either technically or as a user experience. We just knew that we wanted to. The plan we constructed helped us answer that “how” in a lean and efficient manner, informed by user input and technical exploration. I’ll describe the details of that plan in the next post.
In the meantime, you can see the outcome of this collaboration at the Understanding Shakespeare site. We hope that this open and exploratory partnership with Folger is the first of many such partnerships for the Labs team.
Helping Teachers Find Reading Assignments
Sun 05 Oct 2014
It started with the data. We at JSTOR had ample evidence – from emails, tweets, and untold conversations – that our content was being used by teachers as part of their classes. But of the 10 million documents contained within JSTOR, which content? We decided to find out. Our Data Scientist looked at the usage profiles of all our articles over the past few years, and discovered a pattern that looked right: at a single institution, relatively flat usage of a document on both sides of a two-week spike of usage. That profile led to a collection of over 9,000 articles that we believed had been used in a course. It was an impressive set of content, ranging from the evolution of the bicycle to The Death of Ivan Ilych. It led us to wondering how teachers find and select this content, and whether there was something JSTOR could do to make it easier for them.
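For the curious, the usage pattern described above can be expressed as a simple check. This is a sketch of the idea rather than the analysis our Data Scientist ran: the weekly granularity, the thresholds and the function name are assumptions. It takes one document's weekly usage counts at one institution and asks whether there is a two-week burst with flat usage on either side.

```python
from statistics import median


def looks_like_course_reading(weekly_uses, spike_weeks=2,
                              spike_factor=5.0, flat_factor=2.0):
    """Return True if usage is flat apart from one spike_weeks-long burst.

    weekly_uses is a list of weekly usage counts for one document at one
    institution. The factors are illustrative, not tuned values.
    """
    n = len(weekly_uses)
    if n < spike_weeks + 2:
        return False
    # Slide the candidate spike window, leaving at least one week on each side.
    for start in range(1, n - spike_weeks):
        spike = weekly_uses[start:start + spike_weeks]
        rest = weekly_uses[:start] + weekly_uses[start + spike_weeks:]
        baseline = max(median(rest), 1)
        spike_is_high = min(spike) >= spike_factor * baseline
        rest_is_flat = max(rest) <= flat_factor * baseline
        if spike_is_high and rest_is_flat:
            return True
    return False
```

A series like `[1, 2, 1, 1, 30, 42, 2, 1, 1]` fits the profile, while steadily popular documents and documents with several separate bursts do not.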
The approach we took to find out sounds a bit like the setup to a reality TV show: we gave ourselves one week and only one week. Inspired by those 9,000 articles, the Labs team decided to spend one intensive week in our Ann Arbor offices finding out how teachers identify this content and how JSTOR could make it easier. We worked with JSTOR’s User Experience Researchers to find a variety of teachers willing to work with us. Each day during one week in late June, two teachers came in and spoke with us. By the end of the week, we’d spoken with ten high school, community college, undergraduate and graduate-level teachers, in a variety of disciplines, including English and Language Arts, History, and Psychology. These conversations were our daily check-in to see if we were on track toward what would be most helpful for teachers.
At the start of the week, we were in information-gathering mode, trying to learn as much as possible about what teachers looked for when selecting articles, what their process for discovery was, and whether and how it differed from the processes they might use for their own research (for those that did it). By Tuesday, we had some theories about what might be helpful and tested these by showing teachers hand-drawn “paper prototypes” like the one to the right, which the teachers interacted with by “clicking” with their finger.
By the end of the week, we had migrated from paper prototypes to a fully-functioning and designed website, the site that we’re pleased and eager to share with you today: JSTOR Classroom Readings. We hope you like it, and cannot wait to hear from you.
The Bubble of an Idea
Wed 01 Oct 2014
Here at ITHAKA, we’re a creative, think-y bunch, and there are dozens – hundreds! – of ideas bubbling around. Sit in one of our kitchens and you’ll see idea-bubbles forming with every casual chat. We have a lot of similarly creative and thoughtful partners, and our meetings with them lead to still more ideas floating out of conference rooms and into the hallways. Sometimes the air is so thick with them it can be hard to see.
Sometimes, we select one of these ideas and decide to turn it from a substance-less bubble into an actual thing. New features and programs like JSTOR Daily, Register & Read and, heck, JSTOR itself were all just ideas at one point. The challenge comes when we have to choose which of these bubbles floating around has the most potential to help advance scholarly research, teaching and learning.
So we created a team – Labs – to help us solve this problem. (You can breathe easier now: I’m done with precious bubble metaphors.) Labs seeks out new concepts and opportunities for JSTOR, Ithaka S+R and Portico. We refine and validate new ideas through research and experimentation. Since we have SO many ideas, it's important that we can evaluate ideas quickly – to that end, we're using “Lean Startup” methodology. If you’re curious about the methodology, Eric Ries’ book of the same name is a great place to start, but so is Steve Blank’s latest, the Business Model Generation, Marty Cagan’s blog, or a dozen other sources.
We hope that this small team using this methodology will be a powerful combination. We hope that it will open up new forms of partnership with others seeking to learn with us. We hope that this will give us a better way to choose between all these ideas. On this blog, we’ll share what we learn along the way. I hope you’ll join us for the voyage.