Text & Data Mining Service

Create a dataset, start text mining.


Text mining, the process of deriving new information from pattern and trend analysis of the written word, has the potential to revolutionize research across disciplines. There is, however, a massive hurdle facing those eager to unleash its power: the coding skills and statistical knowledge that text mining requires can take years to develop. All too often, researchers learn about the promise of text mining, only to discover that it can be realized solely by the select few with the necessary technical skills. Ted Underwood, Professor of English at the University of Illinois, likens this scenario to researchers being presented with a “deceptively gentle welcome mat, followed by a trapdoor.”


ITHAKA is addressing this problem by building a text & data mining (TDM) platform aimed at teaching and enabling a generation of researchers to text mine. Two of ITHAKA’s services, JSTOR and Portico, are the initial sources of content for the new platform. Our TDM service supports teaching text analysis, learning it, and carrying it out.

At the core of our service is a multifaceted platform with two key elements. The first combines a dataset builder and visualizer with a web-based learning and research environment built around a hosted software development environment (such as JupyterHub). This learning and research environment is populated with text mining analysis code that instructors can use as tutorials and students can use as templates. The second is a space for open educational resources, where we and others in the community can collaborate on the creation, distribution, and reuse of documentation and code, which can in turn be used within our learning and research environment.
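To give a sense of what a template in such an environment might look like, here is a minimal sketch in Python that counts term frequencies across a small dataset. It is an illustration only: the `documents` list, the `STOPWORDS` set, and the `tokenize` and `term_frequencies` helpers are hypothetical names made up for this example, and the in-memory list merely stands in for a dataset produced by the dataset builder, whose actual format is not described here.

```python
import re
from collections import Counter

# Hypothetical stand-in for a dataset built with the dataset builder.
# The platform's real data format is not described in this document.
documents = [
    {"id": "doc-1", "text": "Text mining reveals patterns in text."},
    {"id": "doc-2", "text": "Patterns and trends emerge from large collections of text."},
]

# A tiny, illustrative stopword list; real analyses use much larger ones.
STOPWORDS = {"a", "and", "from", "in", "of", "the"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(docs):
    """Count how often each non-stopword term occurs across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(t for t in tokenize(doc["text"]) if t not in STOPWORDS)
    return counts


frequencies = term_frequencies(documents)
print(frequencies.most_common(2))
```

A student could adapt a template like this by swapping in their own dataset and extending the stopword list, which is the kind of reuse the tutorials and templates are meant to encourage.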

The TDM platform provides text mining access to a variety of content, initially drawn from JSTOR and Portico. It is embedded within our broader TDM service, which provides workshops and classes to participating institutions. Instructors, junior text miners, and expert text miners use the service to teach, learn, and conduct research. Participating libraries work with interested faculty to introduce them to the TDM service and to teach workshops to students.

Diagram of the full TDM service

The content available for mining and analysis will include all of JSTOR and the content from those Portico publishers who choose to participate (currently 39 publishers, including John Wiley & Sons, Inc., Project Muse, Thieme Publishing Group, Hindawi, and Cambridge University Press). In addition, we are in discussions with third-party content providers about contributing content, and the service will allow researchers to upload their own content for analysis. We have also included the commercial-use subset of CORD-19, the COVID-19 Open Research Dataset.

The JSTOR & Portico TDM service will provide both free tools and tools available exclusively to institutional participants. As a not-for-profit, our sole aim is for the service to become self-sustaining.

We are working with a set of ten reference institutions from late 2019 through 2020 to identify and build all of the necessary features, with the aim of releasing the service in late 2020.


A series of professional development webinars aimed at teaching text mining will run through the late spring and summer of 2020.