Rembrandt Project Image Matching
Fri 12 Aug 2016
In the Exploring Rembrandt project conducted earlier this year by the JSTOR Labs and Artstor Labs teams we looked at the feasibility of using image analysis to match images in the respective repositories. The JSTOR corpus contains more than 8 million embedded images and the Artstor Digital Library has more than 2 million high quality images. There is unquestionably a significant number of works in common between these image sets, especially in articles from the JSTOR Art and Art History disciplines. Matching these images manually would be impractical so we needed to determine whether this could be done using automation.
A key element that we wanted to incorporate into the Exploring Rembrandt prototype was the linking of images in JSTOR articles to a high-resolution counterpart in Artstor where these existed. This would allow a user to click on an image or link in a JSTOR search result and invoke the Artstor image viewer on the corresponding image in Artstor. For instance, when selecting the Night Watch in the Exploring Rembrandt prototype a list of JSTOR articles associated with the Night Watch is displayed and if the article includes an embedded image of the painting it is linked to the version in Artstor as can be seen here.
To accomplish this we needed to generate bi-directional links for the 5 Rembrandt works selected.
For the image matching we first identified a set of candidate images (query set) in JSTOR by searching for images in articles in which the text ‘Rembrandt’ occurred in either the article title or in an image caption. This yielded approximately 9,000 images. Similarly in Artstor, 420 candidate images associated with Rembrandt were identified and an image set was created for them. JSTOR images were compared against Artstor images. For this we used OpenCV, an open source image-processing library that supports a wide range of uses, programming languages and operating systems.
For the matching of images we designed a pipeline process as illustrated below.
Artstor images were the training set and JSTOR images were the query set.
The first phase ensures that the training and query sets are similar in colorspace and long side. The images are resized to a 300px long side and converted to grayscale, and keypoints and descriptors are extracted. Images that do not have more than 20 keypoints are discarded, since they end up creating excess false positives. The 300px size and 20-keypoint threshold were determined through experimentation and reflect a sweet spot for accuracy and computational efficiency in our process.
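As a rough illustration of the size normalization and keypoint filter described above, here is a minimal sketch in plain Python. The function names are hypothetical, and the choice to scale every image (including small ones) to exactly 300px is an assumption; the actual pipeline used OpenCV for the resizing and keypoint extraction.

```python
from typing import Tuple

MAX_LONG_SIDE = 300  # pixels, the value quoted in the post
MIN_KEYPOINTS = 20   # an image must have MORE than this many keypoints


def target_size(width: int, height: int, long_side: int = MAX_LONG_SIDE) -> Tuple[int, int]:
    """Scale (width, height) so the longer side equals long_side, keeping the aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longest = max(width, height)
    # Round to the nearest pixel, but never let a side collapse to zero.
    new_width = max(1, round(width * long_side / longest))
    new_height = max(1, round(height * long_side / longest))
    return new_width, new_height


def has_enough_keypoints(keypoint_count: int, minimum: int = MIN_KEYPOINTS) -> bool:
    """Keep an image only if it has more than `minimum` keypoints."""
    return keypoint_count > minimum
```

For example, a 1200x800 scan would be scaled to 300x200 before keypoints are extracted, and an image yielding exactly 20 keypoints would be dropped.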
The second phase performs the matching. The process uses the AKAZE algorithm, which is efficient for matching images that contain rich features. The keypoints generated for the two images are compared, and the nearest distance between each pair of keypoints is measured as an indicator of similarity.
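To make the "nearest distance" idea concrete, here is a small sketch of brute-force nearest-neighbour matching in plain Python. It assumes binary descriptors compared by Hamming distance (AKAZE's default descriptors in OpenCV are binary); the post does not say which matcher was used, and the `max_distance` cutoff and function names are hypothetical.

```python
from typing import List, Tuple


def hamming(a: bytes, b: bytes) -> int:
    """Count the differing bits between two equal-length binary descriptors."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def nearest_matches(
    query: List[bytes], train: List[bytes], max_distance: int
) -> List[Tuple[int, int, int]]:
    """For each query descriptor, find the closest train descriptor.

    Returns (query_index, train_index, distance) tuples, keeping only
    matches whose distance is at most max_distance (a hypothetical cutoff).
    """
    matches = []
    for qi, q in enumerate(query):
        best = None  # (train_index, distance) of the closest descriptor so far
        for ti, t in enumerate(train):
            d = hamming(q, t)
            if best is None or d < best[1]:
                best = (ti, d)
        if best is not None and best[1] <= max_distance:
            matches.append((qi, best[0], best[1]))
    return matches
```

A smaller distance means the two keypoints look more alike; the number of surviving matches is what a threshold such as "matches > 30" would then be applied to.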
The final phase collects the data and exports the result as a csv file.
The tuning parameters for the Rembrandt set:
- Max long side: 300px
- Keypoints > 20
- Matches > 30
- Inliers > outliers
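The thresholds listed above, together with the CSV export from the final phase, can be sketched as follows. This is only an illustration: the column names and the reading of "outliers" as matches minus inliers are assumptions, not details taken from the project.

```python
import csv
import io
from typing import Any, Dict, List

MIN_KEYPOINTS = 20
MIN_MATCHES = 30
FIELDS = ["query_id", "train_id", "keypoints", "matches", "inliers"]  # hypothetical columns


def is_probable_match(keypoints: int, matches: int, inliers: int) -> bool:
    """Apply the tuning thresholds; outliers are assumed to be matches - inliers."""
    outliers = matches - inliers
    return keypoints > MIN_KEYPOINTS and matches > MIN_MATCHES and inliers > outliers


def to_csv(rows: List[Dict[str, Any]]) -> str:
    """Write only the rows that pass the thresholds to a CSV string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        if is_probable_match(row["keypoints"], row["matches"], row["inliers"]):
            writer.writerow(row)
    return buffer.getvalue()
```

Keeping the thresholds as named constants makes it easy to rerun the same comparison with different values when tuning for another corpus.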
This image matching process identified 431 images from the 9,000 JSTOR candidates with probable matches to one or more of the 420 Artstor images. These 431 images were then incorporated into the Exploring Rembrandt prototype.
While the number of JSTOR and Artstor images used in this proof of concept was relatively small, 9,000 and 420, respectively, the technical feasibility of automatically matching images was validated.
The main lesson learned was how to find tuning parameters that work for matching between two vastly different types of corpora. Further work should start from the same tuning parameters, visualize the results and build up from there.
Future work will include further experimentation with and tuning of the myriad parameters used by OpenCV, and scaling the processing so that it can be performed on much larger image sets, eventually encompassing the complete 10 million plus images in the two repositories.
Fri 05 Aug 2016
In the midst of an election year, I’m reminded of the power of the political speech. It can inspire, spur people to action and even alter the course of history. Many of us are familiar from history class with John F. Kennedy’s appeal to Americans to ask what they can do for their country or Ronald Reagan’s call for the Soviet Union to “Tear down this wall!” But what is the context behind these often mythic speeches? Why were they given in the first place? Did they achieve their intended effect and how do they continue to impact us today?
JSTOR provides a rich corpus of scholarship to help answer these questions. For example, in The Volunteering Decision: What Prompts It? What Sustains It? by Paul C. Light, I discovered that President Kennedy’s soaring call for volunteerism prompted subsequent presidents to promote service and even create new government programs to support it. With this in mind, Labs sought to create a prototype tool that could help students and the general public discover the context behind some of the most important presidential speeches in U.S. history.
An early prototype matching quotes from some presidential speeches can be seen at http://labs.jstor.org/presidential-speeches
Using the JSTOR Labs Matchmaker algorithm, we matched famous lines in important presidential speeches to content in the JSTOR corpus. With the data in hand, we sought to try something different interface-wise – we are Labs after all! Our idea was to use the algorithmically created data and present it in a way that told more of a story than would be possible with simply a list of results. This presentation included incorporating high quality images of each president, the date and location of the speech and visual emphasis on the quote itself. Additionally, to support the narrative of a story, we limited the set of results to fewer than ten and included a featured article. The philosophy behind this choice was to emphasize rapid discovery and contextual learning of the scholarship rather than pure research. The choice to include a featured article is especially important as it gives the user one focal point to approach the scholarship that can transform into greater interest.
Narrowing the matched articles took some experimentation. In one method, we ranked articles based on how their top-weighted key terms compared to the aggregated top-weighted key terms of a speech’s full body of related articles, or in other words, how similar an article was to the most prevalent themes and topics (e.g. political protest) of the greater body of articles. Another method ranked articles for a speech by a similarity score which represents how close the matched text is to the quoted line. We also experimented with logistic regression using a training dataset explicitly labeled with relevancy by hand. The end result of these efforts is a curated set of articles that provides solid context to each presidential speech.
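The first ranking method can be pictured with a short sketch. It assumes each article is represented as a mapping from key term to weight and uses cosine similarity against the summed weights of all articles for a speech; the post does not name the similarity measure, so cosine similarity and every name below are illustrative assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

TermWeights = Dict[str, float]


def aggregate_terms(articles: Dict[str, TermWeights]) -> TermWeights:
    """Sum the key-term weights across every article matched to a speech."""
    total: Counter = Counter()
    for weights in articles.values():
        total.update(weights)
    return dict(total)


def cosine(a: TermWeights, b: TermWeights) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_articles(articles: Dict[str, TermWeights]) -> List[Tuple[str, float]]:
    """Order articles by how closely they follow the speech's prevalent themes."""
    themes = aggregate_terms(articles)
    scored = [(article_id, cosine(weights, themes)) for article_id, weights in articles.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

An article whose key terms overlap heavily with the aggregated terms rises to the top, while one about an unrelated topic falls to the bottom.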
An important next step for this project is user-testing – both for the interface as well as the hypothesis behind the story-driven design. Users provide valuable feedback and can help us understand better how to design stories and rich interfaces using the JSTOR corpus of millions of scholarly articles.
Introducing "Understanding the U.S. Constitution"
Thu 07 Jul 2016
I am thrilled to announce the release of Understanding the U.S. Constitution, a free research app for your phone or tablet that lets you use the text of the Constitution as a pathway into the scholarship about it. We can't wait to see how high school and college-level teachers and students use it to enhance their study of our democracy.
To get an idea why we created the app, let’s create a new measurement. We’ll call it the quotable quotient, or QQ, and we’ll use it to gauge the extent to which a given author or document is quotation-worthy.* Shakespeare, for instance, I think it’s safe to say, would have a pretty high QQ. In fact, you could look at JSTOR Labs’ Understanding Shakespeare project, and its accompanying API, as an attempt to calculate Shakespeare’s QQ. Many of his lines have been quoted over a hundred times in the articles of JSTOR, and the most quoted – no surprises here: Hamlet’s “to be or not to be” speech – was quoted over 750 times.** Pretty quotable! Way to go, Will.
Well, as far as QQ goes, the U.S. Constitution leaves Shakespeare in its dust. To pick only the most extreme example, the Fourteenth Amendment, including the Equal Protection Clause, has been quoted almost 2,500 times in the articles in JSTOR. Article I’s Elastic Clause has nearly 1,000.
So when we decided to expand on the idea of using the primary text as a portal into the scholarship about it, it only made sense for us to turn to the U.S. Constitution. Our goal for this was to create something as useful as Understanding Shakespeare, but designed for a mobile experience. That required a rethinking of the basic user experience and design. It also required powerful filtering and sorting functionality to help zero in on, within the 1,000+ articles quoting your clause, the precise set of analyses that you need.
The app is currently available for download from the iOS App Store. And don’t worry, Android users: an Android version of the app is in the works and should be complete later this summer – we’ll let you know when it’s available!
The app, like all JSTOR Labs projects, is a work in progress, released in part to help us test a new idea and to learn how we can better realize it. I hope you try it out and, if you do, we’re eager to hear your thoughts on it by email or Twitter. Who knows? A pithy summary and a retweet or two might help improve your own personal QQ.
* Because we’re JSTOR we’ll focus our specific QQ on how quotable documents are in a scholarly context. One could imagine the same measurement applying to pop culture more broadly, in which case someone would have to publish some giant QQ bracket, the inevitable result of which would be The Godfather going to the mattresses against The Simpsons.
** Roughly. The quotation-matching algorithm that powers both Understanding Shakespeare and Understanding the U.S. Constitution uses fuzzy-text matching, and the precise number depends on how we calibrate the algorithm, including setting a minimum string size and percent confidence. The QQ numbers listed in this post are all based on the default settings we use.
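To show how a minimum string size and a percent confidence might interact, here is a small sketch built on Python's standard `difflib`. It is not the quotation-matching algorithm behind these projects; the sliding-window approach, the function names and the default thresholds are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences do not matter."""
    return " ".join(text.lower().split())


def best_ratio(quote: str, passage: str) -> float:
    """Slide a quote-sized window across the passage and return the best similarity."""
    q, p = _normalize(quote), _normalize(passage)
    best = 0.0
    for start in range(max(1, len(p) - len(q) + 1)):
        window = p[start:start + len(q)]
        best = max(best, SequenceMatcher(None, q, window).ratio())
    return best


def is_quoted(quote: str, passage: str, min_chars: int = 20, min_confidence: float = 0.9) -> bool:
    """Treat the passage as quoting the line if the match is long and close enough."""
    if len(_normalize(quote)) < min_chars:
        return False  # too short to match reliably
    return best_ratio(quote, passage) >= min_confidence
```

Lowering `min_confidence` or `min_chars` would count more passages as quotations, which is why the totals depend on calibration.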
You Got Your Chocolate in My Peanut Butter
Tue 07 Jun 2016
When Artstor and ITHAKA announced earlier this year that they were joining forces, James Shulman sent this tweet:
If you’re a child of the seventies and eighties like me, this calls to mind a series of ads for peanut butter cups. I could describe the basic plot, but it’s much more fun to watch one. Here, enjoy: https://www.youtube.com/embed/O7oD_oX-Gio.
Pretty glorious, right?
The thing is, think a bit about the gap between the a-ha moment shown in the ad – you got your chocolate in my peanut butter! – and the actual product being advertised. That gap – why a “cup?” why that particular shape and size? etc. – is the gap faced by ITHAKA and Artstor right now. We believe strongly that combining these two organizations will lead to great things, and we have any number of ideas to consider, but how do we figure out which ones will be best for us and the community? Well, you’re on the JSTOR Labs blog, so you’ve probably guessed the answer already: by talking to users. By experimenting. By partnering. By doing.
I’m pleased to introduce Exploring Rembrandt, a proof of concept brought to you by JSTOR Labs and Artstor Labs. With it, students can start at five canonical Rembrandt paintings and discover the scholarship in JSTOR about each painting in a way that’s easier and more powerful than just typing “Rembrandt AND ‘Night Watch’” into a JSTOR or Google search.
When we interviewed art history teachers at the start of this project, they described the challenges in going from the works of art to scholarship about that art: it can be overwhelming, especially for undergrads. It can be difficult to move across disciplines, or to excite non-majors, despite the multidisciplinary aspect of art history. When we designed and developed the site together during a one-week flash-build with the Artstor Labs team, we tried to address these challenges.
Exploring Rembrandt, I’m sure, isn’t perfect – I’d wager that peanut butter cups weren’t designed perfectly the first time around either. It’s the first in a series of experiments exploring how to combine the forces and flavors of Artstor and ITHAKA. We’re eager to hear from you – #artstorjstormashup – whether we were successful, and what experiments you’d like to see next!
My Year as a JSTOR Labs Intern
Thu 02 Jun 2016
My name is Jake Silva and I’m a master’s student from the University of Michigan’s School of Information specializing in Human Computer Interaction and Data Science. For the past nine months, I’ve worked on the JSTOR Labs team in Ann Arbor, Michigan where I’ve observed and participated in the team’s use of cutting edge technologies and methodologies in UX Design and Research, Data Science and Product Development. These include guerrilla user-testing, week-long design sprints, fuzzy text-matching, geospatial tagging and optical character recognition technology. While the use of these technologies and methodologies is exciting and innovative, they serve as tools to support Labs in fulfilling its core mission: exploring ways to make scholarship richer and more accessible for researchers, teachers and students as well as the general public.
This mission became very evident during our development of a mobile app focused on constitutional scholarship. To test our design ideas, we set up inside a coffee shop across from the University of Michigan and recruited students to participate in 10-minute usability tests. We used paper prototypes to observe their interactions with our design ideas and gauged what needed to change. During these tests, one political science student became so excited by the idea of the app that he offered to recruit other students for user testing and asked us to contact him once we released it. It’s this excitement and usefulness for users that Labs ultimately strives to achieve with its projects. This example also illustrates one of the principal methodologies Labs uses to test, validate, and build its ideas: the design sprint.
The design sprint, which we also refer to as a “flash build” or Labs week, is inspired by the “Lean Startup” movement, a model for iterating on ideas rapidly based on user input and data. For the sprints, the team meets in a single location and uses the lean methodology to produce a working prototype in a week. I participated firsthand in this process during our Labs week to develop a constitutional scholarship app. On Monday, we brought in folks outside the Labs team and focused on brainstorming ideas for the app. This involved multiple individual sketching rounds, brief design presentations to the room and consensus building by vote to come up with a few design solutions. We then turned these into paper prototypes to test with users the next day. Once a clear design solution emerged from the tests, we began to refine each interaction in the app as well as create the necessary technological infrastructure to build it. At each iteration of our design, we went out into the field to test and validate with users. As a participant in these tests, I learned the incredible value they provide in informing the design of a product. As a team, we each bring our own biases, assumptions, and ideas to a design solution and risk overlooking design flaws as we spend more time with the product. User testing helps mitigate these problems and is an unbiased way to resolve conflicting design ideas within a team. As an aspiring user-centered design professional, I was happy to see and learn from our team’s emphasis on placing the user first during our sprints.
In addition to honing my UX, data, design, and development skills, I also learned what it takes for a team to be successful. The Labs team is a small but diverse group, incorporating individuals with engineering, UX design, visual design, and product and project management skills. We met for daily standup meetings and reflected throughout the year on how we could improve as a team, where we succeeded and failed and how we fit into the greater JSTOR team. This is important as tools, methodologies, and priorities rapidly shift in a mission-driven technology company. I’ve learned so much from this fantastic team over the last nine months and greatly appreciate their support and interest in my development. I’m also happy to have worked on making scholarship richer and more accessible. Finally, I feel lucky to have had the opportunity to work in such an innovative space and look forward to returning to JSTOR after this summer for the 2016-17 academic year.
Seeking Teachers of Poetry to Test our New Annotation Tool
Tue 12 Jan 2016
For the past few months, JSTOR Labs has been working on Annotation Space, a tool to help teachers of poetry. Annotation Space lets teachers share a poem with their class to annotate and discuss, informed by scholarship from JSTOR about the poem. It was developed in partnership with two incredible organizations: Hypothesis and the Poetry Foundation. Now, we’re looking for a handful of teachers to test-drive this cool tool in their classrooms during this spring semester. For the test-drive, we’ll set up access to the tool for you and your students -- all we’d ask is that you construct an assignment for your class around this tool, and that it be possible for us to gather feedback afterward from you and your students. If you might be interested, let us know!
But wait, how do you know if you’d be interested if you haven’t seen it? Let me step you through the tool to give you an idea what you might expect:
First, you the teacher choose a poem (see the selection we have to work with, below). We’ll set up a private site for just you and your class, which will look like this:
You and your students can make both personal annotations as well as ones that are seen by the entire class. These annotations appear on separate “tabs.”
To make an annotation, just highlight text in the poem and click the icon.
When your students want to bolster their annotations with scholarship, they hop over to the "JSTOR" tab, where they can browse through articles that quote each line of the poem.
If you’re interested in testing this with your class, shoot us an email at firstname.lastname@example.org. Your input while we’re still polishing and refining this site will be invaluable. Thanks.
Here’s the list of poems you’ll be able to choose from:
The Soldier – Rupert Brooke
Concerning a Nobleman – Caroline Dudley
The Love Song of J. Alfred Prufrock – T. S. Eliot
Snow – Robert Frost
The Witch of Coos – Robert Frost
Night Piece – James Joyce
First Fig – Edna St. Vincent Millay
To Whistler, American – Ezra Pound
In a Station of the Metro – Ezra Pound
Eros Turannos – Edwin Arlington Robinson
Chicago – Carl Sandburg
Sunday Morning – Wallace Stevens
Anecdote of the Jar – Wallace Stevens
The Snow Man – Wallace Stevens
Spring – William Carlos Williams
Chinese New Year – Edith Wyatt
The Fisherman – William Butler Yeats
The Scholars – William Butler Yeats
A Prayer for My Daughter – William Butler Yeats
A Heckuva Year
Tue 22 Dec 2015
With 2015 packing its bags to go and 2016 knocking on the door, I’d like to take a brief moment to step back and ruminate on how far Labs has come in the past twelve months. My apologies in advance to those of you with a low tolerance for navel-gazing…
A year ago at this time, we had three live projects: Classroom Readings, the first version of Understanding Shakespeare, and the first version of JSTOR Sustainability. In the past year we substantially updated both Understanding Shakespeare and Sustainability. We redesigned Shakespeare’s home page and jumped from six plays to the full set of thirty-eight. We added Topic Pages and Influential Articles to Sustainability, helping “scholars in interdisciplinary fields understand and navigate literature outside of their core areas of expertise.” We released JSTOR Snap, which lets you take a picture of any page of text and discover content in JSTOR about the same topics. We built the ReflowIt proof of concept to test out a potential method for handling page-scan content on small mobile screens. With the JSTOR Global Plants team, we built Livingstone’s Zambezi Expedition, which lets you browse primary and secondary materials both chronologically and geographically. We created an open, public API on top of the Understanding Shakespeare data and used it to create oodles of visualizations of Shakespearean scholarship. And we completely overhauled and expanded this Labs site.
This litany of projects doesn’t even include those that we’re still working on, such as a U.S. Constitution mobile app and a tool that allows teachers to share a poem with their class for them to annotate and discuss, informed by scholarship on JSTOR about the poem. It’s been a productive year.
All of this would not have been possible without our partners, who throughout the year have been open, collaborative and creative. We started the year with just one: the Folger Shakespeare Library, and that partnership remains strong and fruitful. Since then, we’ve worked with the great Eigenfactor team at the University of Washington’s DataLab. We are working on an exciting annotation project with both Hypothesis and the Poetry Foundation. We’ve begun one exploration with the Anxiety of Democracy program of the Social Science Resource Center and another with University of Richmond’s Digital Scholarship Lab. I am grateful for the opportunity to work with such enthusiastic partners and excited about the partnerships to come.
Speaking of gratitude: every day, I count my blessings to be working with the talented, committed, fun and just-plain-awesome individuals within the Labs team. Ron, Jessica, Kate and Beth are each veritable rock stars in their respective fields, and that kinda makes JSTOR Labs a supergroup. (I’ll leave it to you to decide whether we’re The Traveling Wilburys or Cream. Maybe Atoms for Peace?) I’m lucky to be a member of this group, and I can’t wait to see what they create next.
Tue 20 Oct 2015
JSTOR Snap lets you take a picture of any page of text with your smartphone's camera and discover articles in JSTOR about the same topic. This short video shows you how it works.
So Many Ideas! So Many Questions!
Thu 15 Oct 2015
Ron, Jessica, Alex, and my other Labs teammates have many posts about how we nurse a loosely-defined project into becoming one of the prototypes you see here at JSTOR Labs. But we often get asked how an idea goes from, well, an idea to a project.
Taking an idea from seed to sapling, so to speak, starts with us getting a handle on it—asking simply: what do we know about this idea? Is it the result of a colleague’s brainstorm? Is it a strategic direction for JSTOR that needs incubation and testing? Is it a piece of functionality we think would be really useful?
Then we ask: can we do this? What would it take? Would we benefit from a partnership with an organization outside of JSTOR?
Turns out, we have many ideas. To keep track of them all, we created a visual backlog:
Each Post-It note represents an idea: a piece of functionality (purple), a product-line (orange), a product (raspberry—yes, raspberry), a potential collaboration with a partner (teal), and since all categorizations need one, there are ideas categorized as “other” (chartreuse).
We organize—or, maybe, orient is a better word—the potential work along two axes:
The ‘x’ axis defines how much we are working on that idea, bounded by the descriptions “Not working on it” on the left and “Working on it!” on the right.
The ‘y’ axis defines how much we understand about an idea from “We don’t know what it is!” at the bottom to “We understand it!” at the top.
Why do we use these two axes instead of a more typical backlog, which might be organized either by Priority (that is, the items we believe will have the biggest impact are ranked higher) or by Readiness (that is, we won’t start working on a thing until we understand what it is)? As a labs team, we don’t know the potential impacts of our ideas–we have guesses and interests, but our purpose is to learn about this impact by doing. That eliminates Readiness as a ranking approach: we need to sometimes work on ideas that we don’t understand yet on the assumption that it is only by working with them that we can understand them. So, as an idea first starts firming up—we have a concept taking shape, the timing for a partnership is just right—we pick up its representative stickie and move it to a place approximating how much we understand about it and how close we are to beginning work on it.
Next, we start forming the central hypothesis we’ll test. Though, to borrow a phrase from Jeopardy, please put your hypothesis in the form of a question:
What would it look like to use a primary text as an anchor for (or portal to) secondary text found in JSTOR?
What would it look like to use a picture as the starting point for a search?
What would an alternate way of engaging with a primary source collection look like?
How can we provide researchers with better topical finding aids and keys to influential articles within a given field?
Can we create a better reading and display experience for page scan content on mobile devices?
With an idea and a hypothesis in hand, we’ve got a project. If we’re partnering with an organization, we start by meeting with them and brainstorming around the idea. If we’re not, we usually brainstorm in-house and then we start by interviewing scholars and users to validate both our hypothesis and our approach… but that’s a whole other blog post. Interested in learning more about our process? Let us know!
The (Rapid) Evolution of an Idea
Wed 30 Sep 2015
JSTOR Sustainability, a digital library of academic research covering issues related to environmental stress and its challenges for human society, features an innovative method of browsing scholarly articles that have most influenced fields of study. That method was conceived and built in a week-long "flash build" in Seattle with JSTOR Labs and the DataLab team from the University of Washington. This video demonstrates how they rapidly went from an initial napkin-sketch through to the final product.
JSTOR Sustainability Topic Pages
Mon 31 Aug 2015
JSTOR Sustainability Topic Pages provide road signs to help researchers navigate an interdisciplinary terrain. This short video walks you through the features and content of JSTOR Sustainability.
The Lifestory of JSTOR Labs' Website
Mon 31 Aug 2015
I am Jane Mengtian Zhang, an intern at JSTOR Labs in Ann Arbor for the summer of 2015. As a web developer and UX designer, I have been working on the new JSTOR Labs website and have witnessed the site gradually coming into form. In about three months’ time, the site was designed, developed, and continuously revised to provide a smooth, unique, and responsive user experience. Here, I would like to share the building process of the website, as well as my own internship experience at JSTOR Labs.
The new JSTOR Labs website aims to convey the fact that JSTOR Labs is an innovative and forward-thinking team. Based on the initial ideas and modular designs for the site, my first step was to implement the Pattern Library, which served as a style guide and reference tool throughout the development process. Working with Kate, our visual designer, we envisioned the initial outline and interactive elements of the site. With the help of Jessica, my mentor, as well as the JSTOR UX team, we conducted coffee shop guerrilla user tests (a quick informal way of user testing with on-the-spot recruiting) with paper prototypes, as well as an online 5-second test, which allowed us to see the opportunities and issues with the current design both in terms of usability and branding effects.
coding and design, going hand-in-hand
The feedback from this round of user testing led to the first build of the website. Using Django CMS, I worked on models, views and templates to construct, store, and present website data. The CMS provided a user-friendly front-end content management interface, while also allowing customized design and creation. Combining HTML5, CSS, and jQuery with third-party plugins such as Foundation, MixItUp, AddThis, and Google Analytics, I crafted the user interface and realized the pre-designed interactions step-by-step, gradually bringing our visions of the site to life.
design alternatives of module interactions
After deploying a working version of the site, we carried out another round of user testing on the University of Michigan campus. By letting users freely explore the site and then asking them to complete a series of tasks, we were able to see if the website was easy to learn and to use within a short amount of time. I also took part in user interviews with librarians and publishers, which granted us valuable feedback on the content and structure of the website.
The final step of developing the site involved lots of fine-tuning work on overall consistency, accessibility, and cross-platform compatibility. We nearly pulled our hair out positioning and balancing the site’s layout and debating over content presentation on every single page, striving to find the best design solutions through our iterations of experimenting and revising. With the new Labs site soon coming out, it is my hope that the values and visions of JSTOR Labs can be communicated to our users through an enjoyable browsing experience.
While spending most of my time doodling, coding, and debugging the Labs site, I was able to explore other JSTOR Labs projects as a member of the team, such as brainstorming designs for the Understanding Shakespeare API, guerrilla testing for the JSTOR Sustainability site with cupcake incentives, interviews with researchers and teachers, and so on. As a student specializing in library and information science, I especially enjoyed the interviews with librarians and publishers, which granted me a deeper understanding of the existing issues and concerns in online research and learning. The lively conversations and debates around these issues, which take place almost every day in JSTOR Labs, opened my mind to new ways of thinking in both UX and digital librarianship. This learning experience, both rich and diverse in nature, is sure to be valuable for years to come.
At the end of my internship, I would like to express my thanks to the JSTOR Labs team, with whom I have spent one of the greatest summers ever. I hope you enjoyed exploring this site, and stay tuned in with JSTOR Labs’ projects in the future, for their magic is real. :)
My Summer at JSTOR Labs
Wed 12 Aug 2015
My name is Xinyu Sheng and I’m a master’s student from the School of Information at University of Michigan. This summer, I worked as an intern at JSTOR Labs in Ann Arbor, Michigan. It’s an innovative environment and I gained invaluable experience from a very nice team. My job involved both user research and web application development, and I’d like to share a bit about my experience here.
In the eleven short weeks I spent here, I worked on a number of different projects, applying a variety of user research methods to different products in various development phases, capturing user needs and translating findings into design trade-offs. I assisted with user testing on the Sustainability site that Alex wrote about a few weeks ago. I also helped with a redesign of the Labs Site you’re currently reading. For this redesign, which you should see on the site soon, I helped the team as it brainstormed design ideas and conducted a various user tests to identify usability issues and gather user preferences. We conducted five-second tests, cupcake testing (a quick version of a usability test with a cupcake as the reward), and A/B testing. These tests allowed me to observe the context and user behavior, and to get firsthand user feedback that could be consolidated in our next iterations. They also helped me better understand the priority of each issue so as to make reasonable decisions with limited time and resources.
I spent most of my time creating an open and public API to the data within Labs’ popular Understanding Shakespeare. I conducted a series of five interviews with Digital Humanities scholars to collect detailed information about use cases for the API and inspire me with design ideas for the visualizations I would build on top of that API. Working closely with Ron, Labs’ lead technologist, I started with a data cleaning task to understand the process of "quote matching" between the text of the plays and JSTOR articles. Then I added the rest of the plays: the site now has all 38 Shakespeare plays (available on our redesigned website, here). I also did some data processing with XML, parsing it to rebuild the data structure and generate extra fields for indexing. To further prepare for the API, I learned how to use Django and the Django REST framework.
With these preparations behind us, we began to develop the API. The Labs team plans to release the API soon, but I can give you a sneak preview of how it was built and share a visualization I made using it. The API uses a REST framework combining indexed data with a SOLR search engine. The SOLR server enables customized data queries that allow data retrieval for every possible need. The data it returns has a clear nested structure in JSON format, which makes it easy to manipulate and efficiently helps users with data mining, or building their own applications/visualizations. For example, I built a visualization, a pack of zoomable circles that you can see here, using D3.js to show a hierarchy of all the plays.
To navigate the visualization, think of the largest circle as a representation of the “universe” of Shakespeare’s plays and the smaller circles labeled Tragedy, Comedy, History, and Romance as “galaxies” within that universe.
Within a galaxy, there are plays and within plays, there are circles representing the acts in that play.
Finally, within the acts are circles named for characters, with the size of circle used to indicate the number of times each character’s lines are referenced in an article on JSTOR.
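The nesting described above (universe, galaxies, plays, acts, characters) maps naturally onto the name/children/value structure that D3's hierarchy layouts commonly consume. Here is a minimal Python sketch of that reshaping. The flat record layout, the counts, and the function name build_hierarchy are all hypothetical, invented for illustration; the real API's field names may differ.

```python
import json
from collections import defaultdict

# Hypothetical flat records: (genre, play, act, character, times_quoted).
# The counts are invented for illustration only.
RECORDS = [
    ("Tragedy", "Hamlet", "Act 1", "Hamlet", 120),
    ("Tragedy", "Hamlet", "Act 1", "Horatio", 30),
    ("Tragedy", "Hamlet", "Act 3", "Hamlet", 200),
    ("Comedy", "Twelfth Night", "Act 1", "Viola", 45),
]


def build_hierarchy(records, root_name="Shakespeare"):
    """Nest flat records into a {"name", "children"} tree with "value" leaves."""
    # genre -> play -> act -> {character: count}
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))
    for genre, play, act, character, count in records:
        chars = tree[genre][play][act]
        chars[character] = chars.get(character, 0) + count

    def node(name, children):
        return {"name": name, "children": children}

    return node(root_name, [
        node(genre, [
            node(play, [
                node(act, [
                    # Leaf circles are sized by how often the lines are quoted.
                    {"name": character, "value": count}
                    for character, count in chars.items()
                ])
                for act, chars in acts.items()
            ])
            for play, acts in plays.items()
        ])
        for genre, plays in tree.items()
    ])


if __name__ == "__main__":
    print(json.dumps(build_hierarchy(RECORDS), indent=2))
```

The resulting JSON can be handed to a circle-packing layout, with each leaf's value driving the size of a character's circle.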
The API will be available soon, and I look forward to seeing what others create using it!
In addition to honing my technical and UX skills, I’ve also learned a great deal during the internship about how organizations function in the real world, such as how different departments cooperate and support each other. More importantly, despite the high level of diversity in Labs team members, we collaborated effectively with nice team chemistry. I really enjoyed working with the whole team who are so professional, interesting, and super supportive. It’s been a great summer at JSTOR Labs. Thank you all!
Xinyu, left, with Kate, Jessica, Beth, and Ron.
I wish JSTOR Labs every success in the future.
Under the Hood of JSTOR Snap
Wed 05 Aug 2015
In this installment of the JSTOR Labs blog we take a long-overdue look under the hood of the JSTOR Snap photo app.
We developed and launched Snap way back in February. If you’ve been following the blog you’ll remember that Snap is a proof of concept mobile app that was developed to test a core hypothesis – that a camera-based content discovery tool would provide value to users and was something they would actually use. Our findings on that question were somewhat mixed back in February. Users were definitely intrigued by the concept but the app hasn’t received a ton of use since. However, user feedback and reactions in demos of the app that we’ve conducted since February continue to suggest there is some real potential here. Where we ultimately go with this is hard to say right now, but the possibilities continue to intrigue both users and us. Additional background on the user testing approach can be found here, including a short “how we did it” video.
In addition to testing the core user hypothesis we also wanted to see what it would take to actually build this thing. While doing this quickly was important (because that’s what we do) we also wanted to see if a solution could be developed that produced quality recommendations with reasonable response times. So it wasn’t just a question of technology plumbing to support user testing. We were really interested in seeing if our topic modeling, key term extraction, and recommender capabilities were up to the task of generating useful recommendations from a phone camera-generated text image.
The technical solution involved three main areas of work – the development of the mobile app itself, the extraction of text from photos, and the generation of recommendations based on concepts inferred from the extracted text. I’ll describe the technology approach employed in each of these three areas and share some general findings and impressions of each.
First, the mobile app: this represented Labs’ first project involving mobile development so we took a fairly simple approach. No native app development for us this time (although we’d briefly considered it). We decided to go with a basic mobile web application, but do so in such a way that it could be transitioned into a hybrid mobile app capable of running natively on the major phone operating systems, if needed. For the web app framework we decided on JQuery Mobile after conducting a quick survey of possible approaches. There were many good candidates to choose from but we ultimately selected JQuery Mobile based on its general popularity (figuring it would be easy to find plugins and get advice) and perceived ease of learning. All-in-all we were satisfied with the JQuery Mobile approach. As we’d hoped, the learning curve was rather modest and the near ubiquity of JQuery made this a good choice for our initial mobile project.
Going into the project my single biggest worry was whether we’d be able to do on-the-fly OCR processing with acceptable quality and response times. I’d initially considered developing a custom back-end OCR service based on the Tesseract or OCRopus OCR engines. After some initial prototyping it quickly became apparent that this approach, while technically feasible, would take more time and effort to get right than we could afford on this short project. Based on that we decided to go with an online service. There were a number of options to choose from here but we ended up going with OCR Web Service. We’ve been very happy with this choice. The OCR accuracy is excellent, response times are relatively good, and the price is quite reasonable. The only real work involved here was the development of a SOAP client for our Python/Django-based backend service to use.
Our last challenge involved the generation of recommendations from the OCR text. For this we first needed to identify the key terms and concepts from the extracted text. This is a two-part process, one involving key term identification using a rule-based tagger that identifies terms from a controlled vocabulary that JSTOR has been developing for a couple years now (more on that can be found here). The second part of this process involved topic inference (using LDA topic models generated from the full JSTOR corpus). The key terms and topics associated with the OCR text were then used to find other documents in JSTOR with similar characteristics.
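The post does not say how the key terms and topics were combined to find similar documents, so the following is only one plausible sketch, not the Snap implementation. It assumes each document carries a topic distribution (as an LDA model would produce) and a set of thesaurus terms, and it blends cosine similarity over the topics with Jaccard overlap over the terms. The corpus, the weights, and all names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length topic vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Share of key terms two documents have in common."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def recommend(query, corpus, topic_weight=0.5, top_n=3):
    """Rank corpus documents by a weighted blend of topic and term similarity."""
    scored = []
    for doc_id, doc in corpus.items():
        score = (topic_weight * cosine(query["topics"], doc["topics"])
                 + (1 - topic_weight) * jaccard(query["terms"], doc["terms"]))
        scored.append((score, doc_id))
    scored.sort(reverse=True)
    return [doc_id for _score, doc_id in scored[:top_n]]


# Invented three-topic corpus, for illustration only.
CORPUS = {
    "article-a": {"topics": [0.8, 0.1, 0.1], "terms": {"herring", "fisheries"}},
    "article-b": {"topics": [0.1, 0.8, 0.1], "terms": {"sonnets", "drama"}},
    "article-c": {"topics": [0.7, 0.2, 0.1], "terms": {"fisheries", "oceans"}},
}
# Stand-in for the topics and terms inferred from the OCR text of a photo.
QUERY = {"topics": [0.9, 0.05, 0.05], "terms": {"herring", "fisheries", "oceans"}}

if __name__ == "__main__":
    print(recommend(QUERY, CORPUS))
```

A blend like this also illustrates the "garbage-in, garbage-out" effect: sparse or noisy OCR text yields few matching terms and a flat topic distribution, so every candidate scores about the same.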
We haven’t performed any formal testing of the generated recommendations yet, but feedback from users has been pretty good, at least in cases where the input is good. This is a situation where the expression “garbage-in, garbage-out” really applies. If a dark or blurry image is used (or even one with sparse text) the recommendations produced are much less targeted than when we have sharp images with text rich in relevant terms and concepts. I’d encourage you to give it a try for yourself. Go to http://labs.jstor.org/snap using your smartphone and try this on some text you’re familiar with. We’d love to hear what you think about the app and the recommendations.
Labs' No-Longer-Secret Ingredient
the JSTOR Thesaurus
Wed 22 Jul 2015
Over the past year you may have noticed a quiet but powerful feature appear in many of the JSTOR Labs projects: article-specific terms, or keywords. For example, in both Understanding Shakespeare and Classroom Readings, you’ll find them underneath article titles separated by pipes:
In this post, I’d like to give you a background on this feature and describe how we’ve used it specifically on the Sustainability prototype.
Introducing the JSTOR Thesaurus
JSTOR, as you likely know, has content across virtually all of the academic disciplines from hundreds of different publishers. If we had a way of classifying all of that disparate content in a consistent way, it could help people better find the content that they’re looking for. These article-specific terms that crop up in a variety of ways in Labs projects are one approach we’ve been exploring to achieve this. They are generated by something called the JSTOR Thesaurus.
The JSTOR Thesaurus is a semantic index, or a rules-based hierarchical list of terms. Let’s split that definition into its two parts:
A hierarchical list of terms - The Thesaurus is a hierarchy where the top terms are in that position because they are the most inclusive, and, at every subsequent level, a narrower term is a part/subset/instance/example of its broader term (i.e., a parent/child relationship).
Rules for applying those terms - Terms all by themselves can be ambiguous, so we create rules that define how and when a term is applied to a document by our indexing. For example, the word herring can refer either to the type of fish or to a red herring as used in literature. We use a rule like the one below to tag only the fishy uses.
IF (MENTIONS “fish*” OR MENTIONS “clupea*” OR WITH “salted” OR WITH “smoked” OR WITH “pickled” OR AROUND “spawn*”) USE Herring
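To make the rule concrete, here is a rough Python sketch of how such a rule might be evaluated. The post does not define the operators, so the semantics here are guesses: MENTIONS is treated as "appears anywhere in the document", WITH as "appears in the same sentence as herring", and AROUND as "appears within a few words of herring". The window size and all function names are hypothetical, and this is not the indexing software JSTOR uses.

```python
import re

WINDOW = 5  # assumed reach of AROUND, in words (a guess, not from the post)


def _words(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def _matches(word, pattern):
    """Match a word against a pattern, honoring a trailing * wildcard."""
    if pattern.endswith("*"):
        return word.startswith(pattern[:-1])
    return word == pattern


def mentions(doc, pattern):
    """Assumed MENTIONS: the pattern occurs anywhere in the document."""
    return any(_matches(w, pattern) for w in _words(doc))


def with_(doc, target, pattern):
    """Assumed WITH: the pattern occurs in the same sentence as the target."""
    for sentence in re.split(r"[.!?]", doc):
        words = _words(sentence)
        if target in words and any(_matches(w, pattern) for w in words):
            return True
    return False


def around(doc, target, pattern, window=WINDOW):
    """Assumed AROUND: the pattern occurs within `window` words of the target."""
    words = _words(doc)
    for i, word in enumerate(words):
        if word == target:
            nearby = words[max(0, i - window): i + window + 1]
            if any(_matches(n, pattern) for n in nearby):
                return True
    return False


def tag_herring(doc):
    """Return True when the document should receive the Herring term."""
    target = "herring"
    if target not in _words(doc):
        return False
    return (mentions(doc, "fish*") or mentions(doc, "clupea*")
            or any(with_(doc, target, p) for p in ("salted", "smoked", "pickled"))
            or around(doc, target, "spawn*"))


if __name__ == "__main__":
    print(tag_herring("Smoked herring was a staple of the coastal diet."))
    print(tag_herring("The plot twist was a red herring."))
```

Under these assumptions a sentence about smoked herring is tagged, while a literary red herring with no fishy context is not.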
To build our initial list of terms, we combined over twenty discipline- and subject-specific taxonomies. These came from as disparate sources as NASA and ERIC, the U.S. Department of Education’s online repository for research literature on education. This combination helped us achieve the broad and encompassing view we needed of the entire JSTOR corpus. We currently have over 57 thousand terms in the Thesaurus, but we never stop adding, editing and pruning terms, or adjusting and improving the rules they operate under.
The Sustainability Prototype
As Alex described last week, with the Sustainability Prototype our goal was to create a site smart enough that the content was greater than the sum of its parts. To achieve this, we needed to make an investment in building out a more comprehensive set of terms for the interdisciplinary sections of the Thesaurus which Sustainability covers. To do that we worked with a group of experts within the fields associated with Sustainability – ranging from Industrial Economics to Environmental History – to identify over 1500 key terms from across the Thesaurus that are specifically linked to the study of Sustainability. Those same experts then helped us extend the set of synonyms and non-preferred terms that appear for each term. The terms appear throughout the prototype, including on individual article pages, in the search results, and as a filter within search, helping users to find specific articles and content. They also appear on topic pages like the one below, which you can browse through to create a mental map of the interdisciplinary terrain that Sustainability covers.
The Thesaurus is always improving and continues to grow, expand and deepen with each new project. The Sustainability Prototype gave us the opportunity to showcase the potential of the Thesaurus, and we’re eager to see if it’s as impactful as we think it will be. If, as you explore the site, you have suggestions or questions about the Thesaurus, we’d love to hear them! Just shoot us an email at email@example.com.
A Tour Guide in a Multidisciplinary World
Tue 14 Jul 2015
I’ve been thinking about the impact of a truly great tour guide. Maybe it’s because I just took my family to Italy for vacation, and the difference between last week’s tour through the Vatican and my previous one could not have been greater. Chalk some of that difference up to the heat wave currently choking Europe, and to the fact that this time my wife and I had two (lovely, tired) children to lug around. But much of the difference is that the last time we were there we had a tremendous tour guide leading us through. I think his name was Michael and he was both deeply knowledgeable and also quite charming. In a museum overflowing with masterpieces, he led us to those works most worthy of our attention. He told the stories of works we didn’t know and he deepened our appreciation of known works like the ceiling of the Sistine Chapel. We walked away edified and inspired, all because of his expert guidance.
All of which brings me to what JSTOR Labs has been trying to do in a new prototype we’re building for people working in or interested in the subject of Sustainability. Whether they’re a student just learning a topic that could be their future specialty or an established scholar branching out into new territory, researchers of all kinds have to introduce themselves to new topics and fields. When they do, researchers use tour guides just like I did in Rome. The tour guide might be a thesis advisor or it might be a colleague familiar with the topic. Whoever it is, they perform many of the same functions as Michael at the Vatican: they point out the key works (articles, books); they provide context for and connections between these works; they tell the story of the topic. With JSTOR Sustainability, our goal was to augment these tour guides.
To achieve this, we’ve combined a number of methods:
Content Selection: We used a series of topic modeling exercises to identify over 250,000 articles currently available in JSTOR that are relevant to Sustainability, environmental stress and the related challenges for human society. Since Sustainability is an inherently interdisciplinary study, topic modeling allowed us to discover relevant content across the sciences, social sciences and humanities. This content set formed the base for us to build upon.
Keywords: JSTOR has been developing a corpus-wide semantic index that can algorithmically associate all of our documents with subject keywords – in fact, many of the Labs projects have taken advantage of this new index. With Sustainability, we worked with scholars and subject matter experts to review the keywords for their areas of expertise in order to both broaden and deepen the index and make it even more valuable to researchers.
Topic Pages: For Sustainability-related keywords, we’ve created individual pages to serve as quick overviews and jumping-off-places. Using Linked Open Data, we’ve connected DBpedia overviews with JSTOR-specific information about each topic, such as Key Authors and Related Topics.
Influential Articles: Last, we partnered with Jevin West and the University of Washington DataLab – the folks behind the citation-network-based metric known as Eigenfactor – to understand article networks and to incorporate a timeline on each topic page showing the topic’s most influential articles.
I encourage you to see the results for yourself by checking out the prototype at labs.jstor.org/sustainability. On that site, you can search and browse through all the topic pages you want openly, checking out the keywords and the influential articles timeline. If you want to access individual articles through the site, you or your institution will have to sign up for our beta program. If you’re interested, please shoot an email to firstname.lastname@example.org with Sustainability Prototype in the subject line or reach out on Twitter or Facebook.
Over the coming weeks, we’ll shine the Labs Blog spotlight on each of these methods, much like, oh, a tour guide might introduce you to one painting after another. There are a lot of exciting stories to tell, and I can’t wait to share them with you. In the meantime, enjoy the site, something that will be easier if you can avoid hundred-degree temperatures and two cranky kids…
Content Curation for Livingstone's Zambezi Expedition
Thu 28 May 2015
When we began brainstorming with the JSTOR Labs team about this project, we, the JSTOR Primary Source content team, knew that Global Plants contained a wealth of materials from Livingstone’s Zambezi Expedition that included letters, specimens, and botanical illustrations. Armed with the idea to build a map connecting the specimens with the other materials, we began the hunt for more Zambezi Expedition materials in Global Plants and the JSTOR archive.
Step 1: Content Discovery
We started on the Global Plants live site searching for: letters to and from Livingstone, Kirk, Baines, and Meller; specimens collected by Livingstone and Kirk; paintings completed by Baines. We searched for anything with the keyword “Zambezi” or “Zambesi” to pull in related works. Finding content from the JSTOR Archive started with a similar approach. This allowed us to find relevant content within the journals of the Royal Geographical Society of London, the very organization that funded the expedition. The Journal of the Royal Geographical Society of London includes letters and updates about the expedition before, during and after the event.
Step 2: Export Content
In both instances we worked with a “wide net” theory: capture all the records that could be relevant, export those records, and then trim out the unrelated ones. We compiled a list of content across many collections, from different institutions, contributed over many years. With the help of our awesome Software Development Manager, Chakradhar Sreeramoju, we exported the content from Global Plants and the JSTOR archive.
Step 3: Standardization: Augmenting and Refining Data
We then got to work refining and cleaning the data, so that we had more uniform data that could be used to help drive functionality and display. Some work was relatively automated, for example standardizing dates and personal names, excluding paintings completed after Baines was dismissed in 1860, or removing specimens Kirk collected after the Expedition ended. Others were more difficult, and required a close read to confidently include or exclude. Identifying dates and localities often required examining the objects individually; a sample of lichen and wood that Dr. Kirk donated to the Royal Botanic Gardens, Kew might have its place of collection listed in the description, or a letter might state in the title the location from where it was written:
Step 4: Adding Geographic Context
By the end of this work we had identified 118 unique localities, with only 100 objects missing a locality. We assigned these materials a generic locality near the mouth of the Zambezi, labelled “Zambezi Expedition”. To find coordinates for the 118 localities we utilized a combination of resources, including the Getty Thesaurus of Geographic Names Online, Harvard University’s AfricaMap, the Columbia Gazetteer, and Google Maps. We also used the functionality built by Labs that overlaid the historical maps from the expedition on top of current maps. This was particularly helpful for working with locality place names that were identified on the historic maps but are no longer in use.
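The locality step can be pictured as a lookup with a fallback. The short Python sketch below is an illustration under stated assumptions, not the process we actually ran: the record layout, the function name geocode, and the coordinates (rough placeholders) are all hypothetical. Objects with no locality fall back to the generic “Zambezi Expedition” pin, and named localities that the gazetteer does not know are set aside for manual research.

```python
DEFAULT_LOCALITY = "Zambezi Expedition"

# Illustrative gazetteer; coordinates are rough placeholders, not project data.
GAZETTEER = {
    "Tete": (-16.16, 33.59),
    DEFAULT_LOCALITY: (-18.8, 36.3),  # roughly the mouth of the Zambezi
}


def geocode(objects, gazetteer=GAZETTEER):
    """Attach coordinates to each object; return (located, unresolved)."""
    located, unresolved = [], []
    for obj in objects:
        # Objects with no recorded locality get the generic expedition pin.
        locality = obj.get("locality") or DEFAULT_LOCALITY
        if locality in gazetteer:
            lat, lon = gazetteer[locality]
            located.append({**obj, "locality": locality, "lat": lat, "lon": lon})
        else:
            # A named place we cannot place yet: needs a manual lookup.
            unresolved.append(obj)
    return located, unresolved


if __name__ == "__main__":
    sample = [
        {"id": "letter-1", "locality": "Tete"},
        {"id": "specimen-2", "locality": None},
        {"id": "painting-3", "locality": "Unidentified camp"},
    ]
    located, unresolved = geocode(sample)
    print(located)
    print(unresolved)
```

Keeping unresolved names separate from the default pin preserves the distinction between "no locality recorded" and "locality recorded but not yet found on a map".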
The work to identify localities is definitely a challenge to replicating this project for other expeditions, as it did require time to locate and verify an identified locality. But the value it adds to understanding the expedition and its impact is enormous. For example, you can see what kinds of plants grew in the same areas, or compare the Combretum imberbe Wawra specimen that Kirk collected in 1859 to the sample of Combretum imberbe Wawra wood the expedition donated to Royal Botanic Gardens, Kew in the following year. Or, examine the Dasystachys drimiopsis Baker illustrations that Matilda Smith created for Curtis’ Botanical Magazine in 1898 alongside the original specimen, collected by Kirk in 1859, shown in the image below.
These sorts of connections are certainly possible to uncover in Global Plants, but it is difficult; the 800 objects collected or created during the Zambezi Expedition represent just 0.033% of the 2.4 million objects in Global Plants.
A major benefit of Global Plants is that plants that were collected in the same locality or on the same expedition can be reconnected and made discoverable in one place. But the size of Global Plants can make it difficult to uncover these connections if you don’t already know you’re looking for them. We see projects like the Zambezi Expedition as a way to help strengthen these connections and increase the types of discovery Global Plants facilitates to better encompass browsing and exploring. As we knew we would, we learned a lot from the process of doing, both about the Zambezi Expedition and the multitude of uses such a rich dataset provides. And we’re very happy Labs gave us the technical support, enthusiasm, and opportunity to uncover so much about the Zambezi Expedition anew.
Livingstone's Zambezi Expedition
Thu 14 May 2015
Livingstone’s Zambezi Expedition lets you explore material related to Dr. David Livingstone’s mid-nineteenth century exploration of southeastern Africa. This short video walks you through the features and content on this, our latest Labs project. We've also posted a high def 1080p version of the video.
Exploring with Livingstone
Introducing Livingstone's Zambezi Expedition
Wed 13 May 2015
JSTOR Global Plants is big. It contains some 2,171,000 plant specimens and roughly 240,000 primary sources contributed by herbaria from all over the world, and is still growing. That enormity has helped it to become an indispensable resource for plant taxonomists and botanists. But that sheer number can also be overwhelming to non-specialists, making it harder to find the cool tidbits and eye-opening factoids it contains. What’s the old line? “There are eight million stories in the naked city?” Well, there are some two and a half million stories in JSTOR Global Plants, and we were interested in ways that would help researchers, teachers and students discover those stories.
We started by gathering together all the content we could find related to a single expedition: David Livingstone’s expedition up the Zambezi River to Lake Nyassa and beyond in east Africa. JSTOR’s completely-awesome Plants team did this work, finding nearly a thousand items, including maps drawn by Livingstone himself, letters between members of the expedition party, reports to the Royal Geographical Society on the expedition’s progress (and gruff dismissals of their requests for more money), and, of course, hundreds of plant specimens Livingstone’s botanist John Kirk brought back to the Kew Herbarium in London. With these treasures collected, we then looked for ways to link all of these items to make it easier to find the connections between them, and, by extension, all those hidden stories.
(Of course, the Labs team did this in yet another flash build.)
Inspired by the New York Public Library Labs’ work with New York City historic maps, we started to tinker with Map Warper, an open source tool to overlay historic maps on top of today’s geo-precise maps. Onto this layered map, we pinned all the content we’d found from Global Plants, JSTOR and beyond. On top of all this, we added an animated timeline to help people discover how the expedition unfolded over time.
We think the combo of being able to explore both geographically and chronologically is exciting and powerful. We hope it will help researchers, teachers and students to discover all the important stories contained inside of this collection, stories like Livingstone’s dawning awareness of the horrors of slavery, the conflict and correspondence with the expedition funders, and personal stories like the tragic death of Livingstone’s wife. We also hope it might inspire librarians, archivists and publishers – anyone who is a custodian of such rich historical and cultural material – to find ways to share and enrich that content. In doing so, they’ll make it possible for researchers to get a bit of the same thrill of discovery as Livingstone. But with fewer mosquitoes.
Building JSTOR Snap
Fri 20 Feb 2015
JSTOR Snap enables you to take a picture of any page of text and get a list of research articles from JSTOR on the same topic. Watch this video to learn more about how our team created JSTOR Snap as part of a "flash build" with participation from University of Michigan students and faculty.
Tue 17 Feb 2015
As described in the previous post, the JSTOR Labs team held another weeklong flash build in December. When we talked to students and faculty who use JSTOR and similar products for research on Monday of that week, they had no desire to use a phone to conduct a search for academic literature. Knowing that their concerns may have been based on how things are instead of how they could be, we persisted.
Later in the week, we zeroed in on a search that allowed researchers to take a photo of a page of text with their smartphone camera. The photo is run through OCR and the resulting text has topic modeling applied and then the app presents articles from JSTOR about the same topic. The photo-as-search concept was well received, but users were still unsure that the search results could be made manageable on a smartphone.
One of the biggest challenges we faced was displaying search results on a tiny screen. For each article in the list, users wanted to see display title, author(s), journal, year, as well as a way to save it to a list. They also said that keywords and abstracts are extraordinarily helpful in quickly gauging the value of an article. That's a lot to try to fit in even on a large screen, and phone screens, while growing, aren't large!
We mocked up two fundamental ways to show results. First, we had a more-typical "list-view:" that is, the results arranged vertically in a scrolling, numbered list (the first two phones below show variations on this first approach). Second, we experimented with displaying one result at a time. That gave us a reasonable amount of space in which to display all the necessary information about an article – in some cases even more than is usually seen in search results – while also working with a phone-friendly swipe-motion to navigate through the results. It also, unfortunately, meant that a user had to take an action (swiping) to see more than one result. (The second two phones below are two variations on this approach.)
4 ways we tested to display search results on a small screen.
We showed all of these ideas to students and faculty and heard that, while they hated the way the list-view looked, that was the way it would have to be: one item per screen simply wouldn’t work for them. Then, an interesting thing happened that highlights the importance of not only listening to what users want, but also searching for the reasons why they want what they want.
We asked why the results needed to be shown together and not one at a time, and that uncovered a hidden step in their workflow. When they search for articles, they scan through the list twice, each time seeking to answer a different question. First, they scan over the whole list to answer the question: "Are my search parameters getting me what I want?" Then, once they’ve assessed that their search is on the right track, they go item-by-item within the results answering for each the question, “Is this specific result worthy of further investigation?”
This led us to ask: What if we showed a results summary to answer that first question before showing any actual results? If it worked, we could use the single-item view and users wouldn't have to swipe through individual list items before finding out these weren’t even the results they wanted to see. We came up with the sketch shown on the right.
It was a great compromise and users were pleased with it. There was sufficient screen real estate to show all the desired article information without requiring extra swiping between articles just to see if they had used the right search.
With the right questions and insights, it's possible to meet user needs better than the best solution they could envision themselves.
Sketch of a search results summary screen.
Final designs for summary screen and an article.
Wed 11 Feb 2015
When we started this most recent JSTOR Labs project, we interviewed a number of our users about their mobile research habits, hurdles and desires. What we heard was a resounding chorus:
“Research on my phone? Never. Not if I can avoid it.”
“The screen’s too small.”
“I use so many tabs – it wouldn’t work on my phone.”
And yet, they also all talked about how much they relied on their phone for other actions – email, texting, paying friends, etc. – which was more in line with all the data points we’ve all been reading showing that “mobile is eating the world.”
Our goal with this prototype was to poke at this paradox. We wanted to create a phone experience that wasn’t a stripped-down, small-screen version of the desktop experience, but instead was one that you couldn’t replicate on a larger screen. We wanted to provide functionality not available on a desktop or laptop, in an experience that took advantage of the way users interact with a phone.
To achieve that goal, we did another flash build. The Labs team spent a week holed up at the Espresso Royale coffeehouse just off the UMich campus. We hung these signs up:
By the end of the week, we’d spoken with almost twenty students and faculty. We showed them a series of paper and digital prototypes, and then designed and built an experience based on their input. The prototype we built – which we’re calling JSTOR Snap – lets a user take a picture of any page of text (say, a page of a textbook, or a syllabus, or a class assignment, or the first page of an in-progress essay) and it will return research articles about the same topic. Users can then swipe through the articles to evaluate them and then save a list for reading later. When we put in their hands a phone with that experience on it, we heard:
“This could be better than doing research on my laptop.”
It’s still just a prototype – more of a concept car than something you’d want to ride across country – but I hope you’ll give Snap a try. Just point the browser of your Android or iOS smartphone at http://labs.jstor.org/snap, and then share your thoughts and reactions on Twitter (#jstorsnap), email (email@example.com) or in the comments below.
NYPL Open E-book Hackathon
Reflowing Page Scans for Small Screens
Fri 23 Jan 2015
On January 14, the JSTOR Labs team took part in New York Public Library Labs’ Open Book Hack Day. Hoo-boy, what a great day. We were inspired by a dozen awesome projects, and we met oodles of smart, creative, like-minded people, all of whom are working to increase access to knowledge. Our project for the day was an experiment in improving the reading experience of page-scan content on a mobile device. Dubbed ReflowIt and best experienced using a smartphone, our project re-flows page scan content for a handful of articles so that it renders more easily on a small screen, and does so without having to rely on imperfect OCR text.
The image below, a page scan of one of the articles in JSTOR’s open Early Journal Content (EJC) collection, demonstrates the problem. This content, like many digitized-print collections, consists of scans of the original page with associated OCR’ed text and metadata. This works fine for discovery and display on a large screen, but when a user tries to view this content on a phone’s small screen, the text is small, and they need to pan and zoom and pinch in order to read it.
One way to solve this problem is to present not the image of the page but the OCR’ed text. The challenge with this approach is that OCR is imperfect and especially so with some of the historical fonts found in the EJC content. ReflowIt works by reflowing the images of the words for better reading on a small device:
We accomplished this by working with a couple of open-source tools, the venerable imagemagick and a tool that we’ve only recently discovered calledk2pdfopt. k2pdfopt was developed to convert PDFs for use with Kindle e-readers. It has a ton of configuration options and can be coerced into converting PDFs for other types of device. Once we have a mobile-optimized PDF from k2pdfopt we then use imagemagick to extract and resample the individual page images for use with mobile apps (either web apps or native).
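As a rough illustration of the two-step pipeline described above, here is a minimal Python sketch that builds the two command lines. The post does not say which options we actually used, so everything here is an assumption: the function name, the output file names, the target screen size, and the choice of the k2pdfopt options -ui-, -x, -w, -h and -o and the ImageMagick options -density and -resize.

```python
import subprocess
from pathlib import Path


def reflow_commands(src_pdf, out_dir, width_px=480, height_px=800, density=150):
    """Build (but do not run) the commands for a reflow-then-rasterize pipeline.

    All option values are illustrative assumptions, not the settings used
    in the actual ReflowIt experiment.
    """
    out_dir = Path(out_dir)
    reflowed = out_dir / "reflowed.pdf"
    # Step 1: k2pdfopt reflows the scanned page regions for a small screen.
    # -ui- turns off the GUI, -x exits when done, -w/-h give the device size.
    k2pdfopt_cmd = [
        "k2pdfopt", "-ui-", "-x",
        "-w", str(width_px), "-h", str(height_px),
        "-o", str(reflowed), str(src_pdf),
    ]
    # Step 2: ImageMagick rasterizes each page of the reflowed PDF and
    # scales it to the target width, writing one PNG per page.
    imagemagick_cmd = [
        "convert", "-density", str(density), str(reflowed),
        "-resize", f"{width_px}x", str(out_dir / "page-%03d.png"),
    ]
    return [k2pdfopt_cmd, imagemagick_cmd]


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    for cmd in reflow_commands("article.pdf", "out"):
        print(" ".join(cmd))
```

Building the commands separately from running them makes it easy to inspect and tune the options per document, which is where most of the effort goes for pages with complex layouts.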
The reflowing of text regions from scanned images works surprisingly well, especially for those documents with simple layouts and modern typefaces. However, when trying to reflow image text in documents with more complex formats the results are spottier, and in a few cases downright awful. It’s likely that the handling of these more difficult cases can be improved with some pre-processing and configuration tuning.
While it's clear that this sort of approach will never be perfect, this quick proof of concept has shown that it is possible to perform automated reformatting of PDFs generated from scanned images with acceptable levels of accuracy in many cases. Based on this initial exploration we believe both the general approach and the specific tools used hold promise as a way to impact a lot of content economically. This should be viewed as but one in a suite of techniques that could be used for making this type of content more mobile-friendly. We will provide a more in-depth discussion of the approach used, our detailed findings, and possible areas for future research and development in an upcoming blog post. We’re interested in hearing the community’s interest in this approach, and suggestions for how it might be used. For example, one idea we had was to use this to improve the accessibility of page scan content by increasing the display-size of the rendered pages. We’d love to hear more ideas! Email us or toss in a comment below.
Wed 17 Dec 2014
This is the third post in our Understanding Shakespeare series. In the first two posts Alex described the partnership and process behind the project. In this post we’ll take a peek under the hood to look at the approach used in the content matching.
The core idea that we wanted to explore in this project was whether users would find value in being able to use a primary text as a portal for locating secondary literature, specifically journal content available from JSTOR. The basic premise was validated in the exploratory interviews we conducted with scholars and students. While the exact form of the capability still needed to be fleshed out we knew that we needed a means for connecting play content to journal articles at a pretty granular level.
So we had some preliminary user validation of our core idea; now we wanted to put this into a hi-fidelity prototype for our next round of user testing. For that we needed data, because, to quote a line from King Lear, “Nothing will come of nothing.” We had a lot of dots that needed connecting. In our vision for the tool we not only wanted to match article references with the passage in the play but also connect both of these to physical artifacts that provide visual representation of the article and play for web-based display and navigation. Specifically, we wanted to highlight and link the passages in the Folger Digital Text with the specific regions on the JSTOR scanned page images.
Our initial plan was to find explicit play references in JSTOR articles and create bi-directional links between the play and article using the mined references. This seemed a reasonable (and potentially straightforward) approach as many of the articles we looked at used a convention where a referenced passage from a play was annotated with the act, scene, and line number in the referencing article. However, initial attempts at using these references proved problematic as multiple plays were often referenced in the same article and the play text referenced could be from an edition other than the Folger edition we were using. In addition to these challenges we also had to contend with text recognition errors in the article text, which is produced using optical character recognition (OCR). While these were likely tractable problems we concluded that a fuzzy text matching approach would likely provide a more robust solution.
Our data scientist, Yuhang Wang, was tasked with designing and implementing an algorithm for performing the fuzzy text matching. Both block and inline quotes were to be considered in the matching process. Block quotes were identified by using OCR coordinates to identify text passages offset from surrounding text. Inline quotes were identified by text bounded by quotation (“) characters. After normalizing the text in the extracted quotes and play text, candidate matches were found using a fuzzy text matching process based on the Levenshtein distance measure. Levenshtein edit distance is a similarity measure of two texts that counts the minimum number of operations (the removal or insertion of a single character or the substitution of one character for another) required to transform one text into the other. Using this approach we found the substring from the play text with the smallest Levenshtein edit distance for each candidate quote.
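To make the idea concrete, here is a minimal Python sketch of finding the play substring with the smallest Levenshtein edit distance to a quote. This is not the production algorithm. The function names and the normalization rules (lowercasing, dropping punctuation) are assumptions for illustration. It uses the standard edit-distance table with one change: the first row is all zeros, so a match may start at any position in the play text at no cost.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return " ".join(text.split())


def best_substring_match(quote: str, play_text: str) -> tuple[int, int, int]:
    """Return (distance, start, end) for the closest substring of play_text.

    play_text[start:end] is a substring whose Levenshtein distance to
    quote is minimal. Runs in O(len(quote) * len(play_text)) time.
    """
    m = len(quote)
    # Column for "no play text consumed": quote[:i] vs. empty costs i edits.
    dist = list(range(m + 1))
    start = [0] * (m + 1)
    best = (m, 0, 0)  # fall back to the empty substring
    for j, ch in enumerate(play_text, start=1):
        # Row 0 is always zero: a match may begin at offset j for free.
        new_dist = [0] * (m + 1)
        new_start = [j] * (m + 1)
        for i in range(1, m + 1):
            options = (
                (dist[i - 1] + (quote[i - 1] != ch), start[i - 1]),  # match or substitute
                (dist[i] + 1, start[i]),                  # extra character in the play text
                (new_dist[i - 1] + 1, new_start[i - 1]),  # character missing from the play text
            )
            new_dist[i], new_start[i] = min(options)
        if new_dist[m] < best[0]:
            best = (new_dist[m], new_start[m], j)
        dist, start = new_dist, new_start
    return best


if __name__ == "__main__":
    play = normalize("How? Nothing will come of nothing. Speak again.")
    ocr_quote = normalize("Nothing wil come of nothlng")  # two simulated OCR errors
    distance, s, e = best_substring_match(ocr_quote, play)
    print(distance, repr(play[s:e]))
```

Because the cost grows with the product of the quote length and the play length, running this for every extracted quote against every play is what makes the job computationally heavy.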
The matching of article text to play passages required some significant computation as the Levenshtein edit distance had to be calculated for each extracted article quote and all possible substrings in the play text. For that we used our in-house Hadoop cluster and some carefully crafted MapReduce programs. It’s safe to say that prior to the advent of technologies such as Hadoop and MapReduce permitting highly parallelized text processing this project would not have been practical.
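The shape of that parallel computation can be sketched in plain Python. This is a stand-in for illustration only: it does not use the Hadoop API, and the function names, chunk sizes, and data layout are all assumptions. The map step scores every quote against every chunk of play text, and the reduce step keeps the best-scoring chunk for each quote. Chunks overlap so that a passage straddling a chunk boundary is still seen whole by at least one mapper.

```python
from itertools import product


def substring_distance(quote: str, text: str) -> int:
    """Smallest Levenshtein distance between quote and any substring of text."""
    prev = list(range(len(quote) + 1))
    best = prev[-1]
    for ch in text:
        cur = [0]  # a match may start at any offset for free
        for i, q in enumerate(quote, start=1):
            cur.append(min(prev[i - 1] + (q != ch), prev[i] + 1, cur[i - 1] + 1))
        best = min(best, cur[-1])
        prev = cur
    return best


def split_with_overlap(text: str, size: int, overlap: int):
    """Cut text into (offset, chunk) pairs; size must be larger than overlap."""
    step = size - overlap
    last = max(len(text) - overlap, 1)
    return [(off, text[off:off + size]) for off in range(0, last, step)]


def map_phase(quotes: dict, chunks: list):
    """Emit (quote_id, (distance, chunk_offset)) for every quote/chunk pair."""
    for (qid, quote), (off, chunk) in product(quotes.items(), chunks):
        yield qid, (substring_distance(quote, chunk), off)


def reduce_phase(pairs):
    """Keep the lowest-distance chunk for each quote."""
    best = {}
    for qid, value in pairs:
        if qid not in best or value < best[qid]:
            best[qid] = value
    return best


if __name__ == "__main__":
    play = "how nothing will come of nothing speak again"
    quotes = {"q1": "nothing wil come of nothlng", "q2": "speak agan"}
    chunks = split_with_overlap(play, size=40, overlap=32)
    print(reduce_phase(map_phase(quotes, chunks)))
```

Each quote/chunk pair is independent of every other pair, which is why this kind of job spreads so well across a cluster. The overlap needs to be at least as long as the longest passage you expect to match.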
This fuzzy text matching approach worked well overall, identifying nearly 800,000 candidate matches in the 6 plays analyzed. After applying filtering thresholds to reduce the false hits we ended up with 26,237 matches for the initial version of the prototype. As might be expected, the matching accuracy was much better for longer quotes but tended to include a good number of false hits on smaller passages (15-20 characters or fewer) when the quote consisted of commonly used words and phrases. A future refinement of the filtering process will likely include a measurement of how common a phrase is in modern usage. This would enable us to keep an 11-character quote like “hurly burly” but inhibit matches for something that occurs more frequently like “is not true”.
Overall, we are pretty happy with the approach used and the results. We have started thinking about how to improve the robustness and accuracy of the approach and also how to generalize it for use with other texts. Stay tuned as this looks to be an area in which more interesting tools may well emerge from the Labs team.
We are also planning to make the dataset generated in this project available for use by other scholars and researchers. It will include all of the candidate matches in downloadable form. More to come on this soon… in the meantime here are a few extracts from the first 6 plays we’ve incorporated into the prototype.
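A filter along these lines might look like the following Python sketch. The thresholds and the phrase frequencies are made-up values for illustration, not the ones used for the prototype, and the names Match, PHRASE_FREQUENCY and keep are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Match:
    quote: str      # normalized quote text extracted from the article
    distance: int   # Levenshtein distance to the best play substring


# Hypothetical relative frequencies of phrases in modern usage (invented numbers).
PHRASE_FREQUENCY = {"is not true": 1e-5, "hurly burly": 1e-8}


def keep(match: Match, max_error_ratio: float = 0.2, short_len: int = 20,
         max_common: float = 1e-6, freq: dict = PHRASE_FREQUENCY) -> bool:
    """Return True if a candidate match survives filtering."""
    # Reject matches that needed too many edits relative to their length.
    if match.distance / max(len(match.quote), 1) > max_error_ratio:
        return False
    # Short quotes are kept only if the phrase is rare in modern usage.
    if len(match.quote) <= short_len and freq.get(match.quote, 0.0) > max_common:
        return False
    return True


if __name__ == "__main__":
    candidates = [Match("hurly burly", 0), Match("is not true", 0)]
    print([m.quote for m in candidates if keep(m)])
```

Scaling the allowed distance by quote length means a long quote can absorb a few OCR errors while a short one has to match almost exactly.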
Matched quotes by play:
Fri 14 Nov 2014
We had a partner to work with in the Folger Shakespeare Library. And we had the idea: give scholars and students better links between the plays and the scholarship. The question was: how should we go about exploring and developing this idea? For our answer, we took our inspiration from the Nordstrom Innovation Lab, who test out new ideas using a “flash build” – you can see a video of one of their builds here.
You might have heard of a flash mob, where a group of people converge on a place at the same time, put on some performance (pillow fight!), and then disband as quickly as they formed. A flash build is similar, but with, sadly, less dancing and fewer pillows. The team meets in a single location, focuses on a single effort, and, informed by regular access to end users, comes away with a piece of working software.
With this inspiration, we decided to organize ourselves around a flash build, which the Labs team would hold onsite at the Folger the week of September 29th. To make that work, we had some preparation to get through.
We started with a series of interviews of Shakespeare scholars and students, whom we contacted with the help of the Folger. These interviews were exploratory by nature. They gave us a picture of who we were trying to help and how they currently moved between the play and scholarship about the play.
Informed by these interviews and the possibilities they suggested, we embarked on the work to create the links between the play and the scholarship. We’ll describe that work in more detail in a future post, but in the meantime suffice it to say that Ron Snyder, the Labs team’s lead technologist, and Yuhang Wang, JSTOR’s Data Scientist, worked some magic to get a deep level of granularity that made this site possible. We then connected that data into a relatively simple front end, on which we could iterate quickly during the flash build.
Our preparatory work complete, the Labs team arrived at the Folger on September 29th ready for the flash build. We started the week with a “design jam” with Folger staff. A design jam is an organized brainstorming technique in which participants seek to generate many possible approaches to a particular design problem; in this case, we were looking for ways we might link the play to the secondary literature. You can see an example of the sort of material created in this picture.
By Wednesday, we had a good idea of what would be both valuable to users and feasible technically. By Thursday, we had a working site and, informed by another interview or four, we’d polished that to the point where it did a pretty good job of demonstrating the possibility of this approach. On Friday, we demonstrated the working site to the employees and fellows at the Folger.
In the few weeks following the flash build, we did do some polishing of the site, and we added five more plays to the site to increase the likelihood that people might find it useful. You can see what we came up with at Understanding Shakespeare. We’re interested in hearing what you think of it. We also would love to hear your stories of how you’ve used the site. Toss a comment in below, or send us an email at firstname.lastname@example.org.
Mon 10 Nov 2014
Let's dive into our latest effort – Understanding Shakespeare – in a series of posts, each one dealing with a particular aspect of the project. In this first post, we’ll look at the project’s genesis and the collaboration with the Folger Shakespeare Library. In later posts, we’ll walk through the process we used to build the site and look at how we created the data that underlies the site.
This project began with the partnership. The Folger Shakespeare Library in Washington, DC, is an incredible institution.
It houses one of the world’s largest and most important collections of Shakespeare-related materials, including a large number of the earliest printed editions of Shakespeare's plays.
In addition, they are the publisher of Shakespeare Quarterly, an anchor journal in Shakespeare studies, World Shakespeare Bibliography, a core resource in the field, the Folger Editions, the best-selling critical editions of Shakespeare’s works in North America, and the Folger Digital Texts, openly available electronic versions of those same editions. Last but by no means least, they connect to a network of scholars and students working in the field.
We first began to speak with the Folger about a possible partnership this past summer. In those conversations, we crafted an approach that fostered open, exploratory collaboration, focused on innovation. Perhaps this is easier to describe by saying what we didn’t do: we didn’t discuss what services one team could provide for the other. Nor did we try to specify in great detail from the start what deliverables would be met by each party.
Instead, we discussed what each team could bring to the collaborative table. In Folger’s case, that’s the list two paragraphs back. On JSTOR’s side, we had the full digital archives of Shakespeare Quarterly (or, SQ) along with 2,000 other journals, and we had a recently-formed Labs team. Given that set of assets, we discussed what opportunities were most worth exploring. Together, we settled on the idea of linking – in some way – the primary texts of Shakespeare’s plays with the secondary scholarship of SQ and, ideally, other content in the JSTOR archive.
Those three words “in some way” are important. When we started this project, we did not know how we would link the plays with the scholarship, either technically or as a user experience. We just knew that we wanted to. The plan we constructed helped us answer that “how” in a lean and efficient manner, informed by user input and technical exploration. I’ll describe the details of that plan in the next post.
In the meantime, you can see the outcome of this collaboration at the Understanding Shakespeare site. We hope that this open and exploratory partnership with Folger is the first of many such partnerships for the Labs team.
Helping Teachers Find Reading Assignments
Sun 05 Oct 2014
It started with the data. We at JSTOR had ample evidence – from emails, tweets, and untold conversations – that our content was being used by teachers as part of their classes. But of the 10 million documents contained within JSTOR, which content? We decided to find out. Our Data Scientist looked at the usage profiles of all our articles over the past few years, and discovered a pattern that looked right: at a single institution, relatively flat usage of a document on both sides of a two-week spike of usage. That profile led to a collection of over 9,000 articles that we believed had been used in a course. It was an impressive set of content, ranging from the evolution of the bicycle to The Death of Ivan Ilych. It led us to wondering how teachers find and select this content, and whether there was something JSTOR could do to make it easier for them.
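The usage pattern described above can be expressed as a simple check on weekly usage counts for one document at one institution. This Python sketch is an illustration only: the post does not describe how the pattern was actually detected, so the function name, the use of the median as the baseline, and every threshold are assumptions.

```python
from statistics import median


def looks_like_course_reading(weekly_counts, spike_weeks=2, ratio=5.0, min_spike=10):
    """Return True if usage is flat apart from one run of `spike_weeks` high weeks.

    weekly_counts: usage counts per week for one document at one institution.
    A week is "high" if it reaches `ratio` times the median, and at least `min_spike`.
    """
    if len(weekly_counts) <= spike_weeks:
        return False
    baseline = median(weekly_counts)
    threshold = max(baseline * ratio, min_spike)
    high = [i for i, count in enumerate(weekly_counts) if count >= threshold]
    if len(high) != spike_weeks:
        return False
    consecutive = high[-1] - high[0] == spike_weeks - 1
    # Require ordinary usage on both sides of the spike.
    flanked = high[0] > 0 and high[-1] < len(weekly_counts) - 1
    return consecutive and flanked


if __name__ == "__main__":
    print(looks_like_course_reading([2, 3, 2, 40, 55, 3, 2, 2]))
    print(looks_like_course_reading([2, 3, 2, 3, 2, 3]))
```

Using the median as the baseline keeps the two spike weeks themselves from inflating the measure of normal usage.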
The approach we took to find out sounds a bit like the setup to a reality TV show: we gave ourselves one week and only one week. Inspired by those 9,000 articles, the Labs team decided to spend one intensive week in our Ann Arbor offices finding out how teachers identify this content and how JSTOR could make it easier. We worked with JSTOR’s User Experience Researchers to find a variety of teachers willing to work with us. Each day during one week in late June, two teachers came in and spoke with us. By the end of the week, we’d spoken with ten high school, community college, undergraduate and graduate-level teachers, in a variety of disciplines, including English and Language Arts, History, and Psychology. These conversations were our daily check-in to see if we were on track towards what would be most helpful for teachers.
At the start of the week, we were in information-gathering mode, trying to learn as much as possible about what teachers looked for when selecting articles, what their process for discovery was, and whether and how it differed from the processes they might use for their own research (for those that did it). By Tuesday, we had some theories about what might be helpful and tested these by showing teachers hand-drawn “paper prototypes” like the one to the right, which the teachers interacted with by “clicking” with their finger.
By the end of the week, we had migrated from paper prototypes to a fully-functioning and designed website, the site that we’re pleased and eager to share with you today: JSTOR Classroom Readings. We hope you like it, and cannot wait to hear from you.
The Bubble of an Idea
Wed 01 Oct 2014
Here at ITHAKA, we’re a creative, think-y bunch, and there are dozens – hundreds! – of ideas bubbling around. Sit in one of our kitchens and you’ll see idea-bubbles forming with every casual chat. We have a lot of similarly creative and thoughtful partners, and our meetings with them lead to still more ideas floating out of conference rooms and into the hallways. Sometimes the air is so thick with them it can be hard to see.
Sometimes, we select one of these ideas and decide to turn it from a substance-less bubble into an actual thing. New features and programs like JSTOR Daily, Register & Read and, heck, JSTOR itself were all just ideas at one point. The challenge comes when we have to choose which of these bubbles floating around has the most potential to help advance scholarly research, teaching and learning.
So we created a team – Labs – to help us solve this problem. (You can breathe easier now: I’m done with precious bubble metaphors.) Labs seeks out new concepts and opportunities for JSTOR, Ithaka S+R and Portico. We refine and validate new ideas through research and experimentation. Since we have SO many ideas, it's important that we can evaluate ideas quickly – to that end, we're using “Lean Startup” methodology. If you’re curious about the methodology, Eric Ries’ book of the same name is a great place to start, but so is Steve Blank’s latest, the Business Model Generation, Marty Cagan’s blog, or a dozen other sources.
We hope that this small team using this methodology will be a powerful combination. We hope that it will open up new forms of partnership with others seeking to learn with us. We hope that this will give us a better way to choose between all these ideas. On this blog, we’ll share what we learn along the way. I hope you’ll join us for the voyage.