TAPI is led by JSTOR Labs, developed alongside Constellate, a text analytics platform.

Courses

TAP Institute 2022

Week 1 (June 20, 22, 24)


All times are Pacific Daylight Time

Time      | Monday (6/20)   | Wednesday (6/22) | Friday (6/24)
9-10:30am | Python Basics A | Python Basics A  | Python Basics A
12-1:30pm | Python Basics B | Python Basics B  | Python Basics B
2-3:30pm  | Twitter Data    | Twitter Data     | Twitter Data

Python Basics

Instructor: Nathan Kelber

If you've never programmed before, this course is a great introduction. Taught from a humanist perspective, this course will help you start writing your first code and unlock the potential of text analysis. This course is offered in identical A and B sections.

This course is open to complete beginners who have never programmed or done text analysis before.

Working with Twitter Data

Instructor: Melanie Walsh

This workshop will prepare students to collect, analyze, and visualize Twitter data. Students will learn how to work with the Twitter API and with the Python library twarc, one of the most popular tools for Twitter data. We will also introduce basic text analysis methods that are appropriate for short documents like tweets. Participants who are eligible for the Academic Research Track of the Twitter API will have the opportunity to work with the entire historical archive of tweets (2006-2022).

This course will be most accessible to students who have some prior experience with Python and/or the command line, though it is not strictly required.




Week 2 (June 27, 29, July 1)


All times are Pacific Daylight Time

Time         | Monday (6/27)      | Wednesday (6/29)   | Friday (7/1)
9-10:30am    | Text Data Curation | Text Data Curation | Text Data Curation
10:30am-12pm | NLP with spaCy     | NLP with spaCy     | NLP with spaCy

A Practical Guide to Text Data Curation

Instructor: Xanda Schofield

No matter how exciting your research question is or how fancy your models are, all text analysis projects depend on having text data that is tidy enough to analyze. This course surveys practices of text data curation for filtering out irrelevant text, refining a corpus vocabulary, and identifying text artifacts in real-world text collections. We will explore how to approach these tasks using Python libraries such as NLTK and spaCy, and how some text models, like LDA topic models, can serve as tools for diagnosing recurring corpus issues.

This course is open to those with a basic level of proficiency in Python. Taking the Python Basics course the week before is sufficient.
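For a rough sense of the curation tasks described above, here is a minimal sketch in plain Python. It is not taken from the course materials: the toy corpus, the boilerplate pattern, the stopword list, and the `min_df` threshold are all assumptions chosen for illustration, and the course itself works with libraries such as NLTK and spaCy.

```python
import re
from collections import Counter

# Hypothetical toy corpus; the "Page N of M" lines stand in for text artifacts.
docs = [
    "Page 1 of 3 -- Downloaded from example archive",
    "The whale surfaced near the ship at dawn.",
    "Page 2 of 3 -- Downloaded from example archive",
    "At dawn the crew sighted the whale again.",
]

# Assumed pattern for boilerplate lines that should be filtered out.
BOILERPLATE = re.compile(r"^Page \d+ of \d+")

# Assumed, deliberately tiny stopword list.
STOPWORDS = {"the", "at", "a", "of"}


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Step 1: filter out irrelevant text (here, boilerplate lines).
kept = [d for d in docs if not BOILERPLATE.match(d)]

# Step 2: count how many documents each word appears in.
doc_freq = Counter()
for d in kept:
    doc_freq.update(set(tokenize(d)))

# Step 3: refine the vocabulary by dropping stopwords and rare words.
min_df = 2  # assumed threshold: keep words found in at least two documents
vocab = sorted(
    w for w, n in doc_freq.items() if n >= min_df and w not in STOPWORDS
)

print(vocab)
```

On a real collection, the pattern and thresholds would come from inspecting the corpus rather than being fixed in advance.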

Introduction to NLP with spaCy

Instructor: William Mattingly

This course will introduce the key concepts of natural language processing (NLP) and an NLP Python library, spaCy. spaCy lets users build robust pipelines for text analysis. On Day 1, we will learn about NLP concepts and how to install and use the spaCy library. On Day 2, we will learn how to use spaCy to identify linguistic features within a document. On Day 3, we will learn how to apply those features to solve real-world information extraction problems.

Students are expected to have basic Python skills. Taking the Python Basics course in week one will be sufficient.




Week 3 (July 11, 13, 15)


All times are Pacific Daylight Time

Time         | Monday (7/11)          | Wednesday (7/13)       | Friday (7/15)
10:30am-12pm | Multilingual NER       | Multilingual NER       | Multilingual NER
1-2:30pm     | Bilingual Social Media | Bilingual Social Media | Bilingual Social Media
3-4:30pm     | Machine Learning       | Machine Learning       | Machine Learning

Introduction to Multilingual Named Entity Recognition

Instructor: William Mattingly

This course will introduce students to named entity recognition (NER), with an emphasis on multilingual documents. On Day 1, we will address some of the common issues one faces in handling multilingual documents, such as inconsistent text encoding and text standardization, and look at some of the current state-of-the-art transformer-based language models. We will also cover some of the key features of spaCy’s NER pipelines. On Day 2, we will jump into rules-based NER with spaCy. On Day 3, we will explore machine learning (ML) based NER in spaCy. Here, we will learn the essentials of creating good datasets for training NER models.

Students are expected to have basic Python skills. While previous knowledge of spaCy is not essential, it will help. Consider taking Introduction to NLP with spaCy the week prior.

Web Scraping and Text Analysis in Bilingual Social Media

Instructors: Sylvia Fernández Quintanilla and Rubria Rocha De Luna

This course is designed for attendees to learn how to scrape social media posts from the web, download the information in CSV format, clean it, and do basic analysis such as word frequency counts. To achieve this, we will rely on exercises with posts in Spanish, English, or Spanglish, taken from Facebook pages belonging to organizations of migrants returned to Mexico. We will use tools such as Facepager, Notepad, Word, and RStudio.

This course is for beginners who would like to learn how to download social media posts for text analysis. Participants do not need to have prior knowledge of programming.

Machine Learning

Instructor: Grant Glass

This course will introduce students to the variety of machine learning (ML) algorithms available for textual analysis. Throughout the three days of the course, we will address how ML can be used to solve text-based problems. Day 1 will focus on the basics of ML, and students will use supervised learning to work through a research question. Day 2 will be dedicated to a common ML technique: topic modeling. Day 3 will focus on more advanced techniques, such as using language models to classify text. Every day, students will be provided with a workflow for using these techniques on their own research questions.

Students should have a firm grasp of Python, specifically loops, lists, and arrays. Knowledge of Python libraries such as Pandas is preferred.




Week 4 (July 18, 20, 22)


All times are Pacific Daylight Time

Time         | Monday (7/18)           | Wednesday (7/20)        | Friday (7/22)
10:30am-12pm | Pandas                  | Pandas                  | Pandas
1-2:30pm     | Multilingual Newspapers | Multilingual Newspapers | Multilingual Newspapers

Introduction to Pandas

Instructor: William Mattingly

This course introduces students to working with tabular data in Python through the Pandas library. On Day 1, you will learn how to install and import Pandas; you will also learn about some of its basic features, such as the DataFrame. Day 2 will focus on finding, organizing, and sorting data. Day 3 will focus on advanced searching methods, such as filtering, querying, and grouping with GroupBy. A few additional lessons will be provided on plotting data in Pandas.

Students are expected to have basic Python skills. Taking the Python Basics course in week one will be sufficient.
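To give a flavor of the operations named in the course description, here is a minimal sketch using Pandas. It is not drawn from the course materials: the table, its column names, and its values are hypothetical placeholders invented for illustration.

```python
import pandas as pd

# Hypothetical toy table; titles, authors, years, and page counts are made up.
df = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D"],
        "author": ["X", "Y", "Y", "X"],
        "year": [1851, 1815, 1817, 1846],
        "pages": [600, 450, 250, 300],
    }
)

# Sorting: order the rows by publication year.
by_year = df.sort_values("year")

# Filtering: keep rows that satisfy a boolean condition.
recent = df[df["year"] > 1816]

# Querying: the same kind of selection written as a query string.
long_by_y = df.query("author == 'Y' and pages > 300")

# Grouping: total pages per author using GroupBy.
pages_by_author = df.groupby("author")["pages"].sum()

print(by_year)
print(recent)
print(long_by_y)
print(pages_by_author)
```

Boolean filtering and `query` often express the same selection; which reads better is largely a matter of taste.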

Multilingual Newspaper Data and Visualizations

Instructors: Sylvia Fernández Quintanilla and Rubria Rocha De Luna

This course is designed for attendees to learn close-reading text analysis with bilingual (Spanish and English) newspapers hosted in various digital repositories. Attendees will create and clean bilingual datasets, select and edit images from the newspapers, and adapt these datasets for visualizations (maps, timelines, and networks) that approach the material through time, space, and cultural and historical contexts. We will use tools like Excel, OpenRefine, Carto, TimelineJS, and Graph Commons.

This course is for beginners who would like to learn how to visualize humanities data from archival primary sources. Participants do not need to have prior knowledge of programming.
