Sprechen Sie Textanalysierer?

Creating a Multilingual Text Analyzer

Alex Humphreys

We released Text Analyzer a little less than a year ago. Since then, we’ve made a number of adjustments to it based on your feedback (improving the recommendations, adding search filters, My List and MyJSTOR integration, etc.), and meanwhile its usage has grown along with our belief that this new way to search holds great promise. One of the reasons people tell us they come to Text Analyzer is that it helps them find related content in unfamiliar disciplines or subject areas. In doing so, it breaks them out of the disciplinary or citation-based silos that they’d been working in, and we feel this is critical for multidisciplinary work and for scholarship more generally.

Today, I’m pleased to announce a new, experimental feature for Text Analyzer that will hopefully help to break people out of a different kind of silo, one based on language. You can now upload or point Text Analyzer at a document in any of more than a dozen languages, and it will help you to find related English-language content. This means that if you’re a researcher whose first language isn’t English, you can search with material in your native language. You’ll still have to translate or read the English-language results, but hopefully this makes it easier and more natural to find the right articles and chapters for you to read.

Currently, this experimental feature works for the following languages: English (but you knew that already), Arabic, (simplified) Chinese, Dutch, French, German, Hebrew, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish and Turkish.

Anyone who has used Google Translate will know that algorithmic translation can be a tricky game. Sometimes, it’s good enough to give you the gist of a text, but sometimes it stumbles and just produces gibberish. That’s why we think the approach we’ve taken is an interesting and potentially promising one.

Text Analyzer, you may recall, works using a topic model. Each topic is composed of a cluster of terms: if enough of those terms are found in a text, then it’s more likely that that topic is being discussed. What we’ve done is to create topic models in multiple languages and link them at the level of the high-level topic. We’ll post the technical details of this in a future blog post, but what this means is that when you upload a document in Russian, Text Analyzer uses its Russian topic model to figure out what the text is about (in Russian). Each topic inferred in a native language is associated with the corresponding topic in the English topic model, and those English topics are then used by Text Analyzer to do what it’s done since it launched: find articles and chapters in JSTOR that are about the same topics.
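To make the idea a little more concrete, here is a toy sketch in Python of how linked topic models might fit together. It is only an illustration under stated assumptions, not Text Analyzer's actual implementation: the topics, terms, article titles, the `min_hits` threshold and all function names are hypothetical, and the simple term counting stands in for a real topic model.

```python
from collections import Counter

# Hypothetical toy data, invented for illustration (not JSTOR's real models).
# A tiny "Russian topic model": each topic is a cluster of terms.
RUSSIAN_TOPICS = {
    "ru_climate": {"климат", "потепление", "выбросы"},
    "ru_economics": {"рынок", "инфляция", "торговля"},
}

# The link between each native-language topic and its English counterpart.
TOPIC_LINKS = {
    "ru_climate": "climate change",
    "ru_economics": "economics",
}

# English-language items, each tagged with English topics.
ENGLISH_INDEX = {
    "Article A": {"climate change", "energy policy"},
    "Article B": {"economics"},
    "Article C": {"climate change", "economics"},
}


def infer_topics(text, topic_terms, min_hits=2):
    """Score each topic by how many of its terms occur in the text.

    A topic is kept only if at least `min_hits` term occurrences are found.
    Tokenization is a plain whitespace split, which is a simplification.
    """
    words = Counter(text.lower().split())
    scores = {}
    for topic, terms in topic_terms.items():
        hits = sum(words[term] for term in terms)
        if hits >= min_hits:
            scores[topic] = hits
    return scores


def to_english_topics(native_scores, links):
    """Map native-language topic scores onto the linked English topics."""
    return {links[t]: s for t, s in native_scores.items() if t in links}


def recommend(english_scores, index):
    """Rank English items by the summed score of the topics they share."""
    ranked = []
    for title, topics in index.items():
        score = sum(s for t, s in english_scores.items() if t in topics)
        if score > 0:
            ranked.append((title, score))
    # Highest score first; ties broken alphabetically by title.
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    russian_text = "климат потепление выбросы рынок"
    native = infer_topics(russian_text, RUSSIAN_TOPICS)
    english = to_english_topics(native, TOPIC_LINKS)
    for title, score in recommend(english, ENGLISH_INDEX):
        print(title, score)
```

Note that no sentence of the Russian text is ever translated here: only the inferred topics cross the language boundary, which is the point of the approach described above.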

The practical upshot of this is, we think, pretty exciting: cross-language content discovery, while avoiding both the expense of manual translation and the robotic inaccuracy of algorithmic translation.

I should signal that this translational feature is still very much experimental! First, it currently works in only one direction: from these languages to English. JSTOR is primarily an English-language database, but it does have many documents in other languages. Ideally, this functionality would help English-speakers find those as well, but that is not yet developed. Second, at this point Text Analyzer only identifies a single language per uploaded document; we hope to be able to handle documents with multiple languages in the future. Last and most importantly, the non-English topic models are still in their early stages of development. (In fact, if you’re a digital humanities practitioner interested in helping us develop one of our language-specific topic models, we’d love to hear from you!)

We’ll keep working on all these things. In the meantime, I hope you’ll give this new feature a try and let us know what you think! You may not be a polyglot, but now with Text Analyzer, you can research like one.

Photo credit: Andrew Butler on Unsplash