Digital Humanities

class: middle

# Text Analysis Methods Workshop 6:

### Working with Corpora and Datasets

<hr>
Matthew J. Lavin

Clinical Assistant Professor of English and Director of Digital Media Lab

University of Pittsburgh

Fall 2018

---
class: middle

# Corpus or Dataset
<hr>

- ### Often used interchangeably, or corpus describes a dataset of text documents
- ### In some scholarly communities, the distinction can be important
- ### In computational linguistics, corpora are large, made of "real world language" and often annotated

---
class: middle

# Where to Find Data

<hr>

- #### Many datasets are publicly available and can be found with a simple Google search (e.g., "Amazon product reviews dataset").
- #### You can also browse large dataset sharing and indexing platforms
  - #### Try datahub.io, figshare.com, kaggle.com, zenodo.org, dataverse.harvard.edu, and now toolbox.google.com/datasetsearch
- #### Published articles will often announce a new corpus or dataset
  - #### Try keyword searches on PittCat+, Google Scholar, and arxiv.org
- #### I've already mentioned large-scale collections like HathiTrust, archive.org, and Chronicling America, which often have APIs or data downloads

---
class: middle

# Helpful Resources to Know About
<hr>
- #### The CMU Pronouncing Dictionary 
- #### WordNet, VerbNet
- #### Wordlist corpora (stopwords, names)
- #### LIWC (costs $)
- #### SensEval

---
class: middle

# Data Structures and Formats
<hr>

- #### Plain text files with metadata
- #### XML files (e.g., American Periodicals Series)
- #### JSON (especially from APIs)
- #### Databases
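
As a rough illustration of the JSON case, here is a minimal sketch that parses an API-style response and groups titles by year. The `SAMPLE_RESPONSE` string and its `items`, `title`, and `date` keys are invented for this example and do not come from any real API; check the documentation of whichever API you use for its actual structure.

```python
import json

# Invented sample data standing in for an API response (not from a real API).
SAMPLE_RESPONSE = """{
  "items": [
    {"title": "The Daily Example", "date": "1905-03-01", "text": "Local news and notices."},
    {"title": "The Weekly Sample", "date": "1911-07-15", "text": "Market reports."}
  ]
}"""

# json.loads turns the JSON string into nested Python dicts and lists.
data = json.loads(SAMPLE_RESPONSE)

# Group titles by publication year (the first four characters of the date).
titles_by_year = {}
for item in data["items"]:
    year = item["date"][:4]
    titles_by_year.setdefault(year, []).append(item["title"])

print(titles_by_year)
```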

---
class: middle

# Some examples

<hr>

- #### Take a look at https://github.com/dh-fall-2018/working-with-corpora-fall-2018
- #### You can clone the repo and run ExploreCorpora.ipynb yourself

---
class: middle

# Plain Text Files with Metadata
<hr>

- #### We can use the metadata.csv to grab txt files and manipulate them
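
One way this might look in Python is sketched below. It assumes the CSV has a `filename` column that points to each text file inside a folder; that column name, the `load_corpus` and `word_count` helpers, and the folder layout are assumptions for illustration, not necessarily how the workshop repo is organized.

```python
import csv
from pathlib import Path


def load_corpus(metadata_path, text_dir):
    """Return a list of (metadata_row, text) pairs, one per row in the CSV."""
    corpus = []
    with open(metadata_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # "filename" is an assumed column name; adjust it to match your CSV.
            text_path = Path(text_dir) / row["filename"]
            corpus.append((row, text_path.read_text(encoding="utf-8")))
    return corpus


def word_count(text):
    """Count whitespace-separated tokens, a deliberately crude measure."""
    return len(text.split())
```

With files in place, something like `for row, text in load_corpus("metadata.csv", "texts"): print(row["filename"], word_count(text))` would print one count per document, and any other metadata column could then be used to filter or group the texts.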

---
class: middle

# XML Files
<hr>
- #### We can go through all the tags in the document and retrieve only the paragraphs of text (skipping table of contents, chapter headings, etc.)

- #### Or we can split the file into chapters, pages, or even character dialogue, if those units are tagged in the XML

- #### What would you want to count besides "ly" words, and how would we tweak the code?
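
A minimal sketch of the idea, using Python's built-in `xml.etree.ElementTree`. The `SAMPLE_XML` snippet, the `<p>` tag name, and the `paragraphs` and `count_suffix` helpers are assumptions for illustration; real files often use namespaces and different tag names, so this is not necessarily the code in the workshop notebook.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample; real corpus files will be larger and may use namespaces.
SAMPLE_XML = """<text>
  <head>Chapter I</head>
  <p>She walked slowly and spoke quietly.</p>
  <p>He answered quickly, almost rudely.</p>
</text>"""


def paragraphs(xml_string, tag="p"):
    """Return the text of every <tag> element, skipping headings and the like."""
    root = ET.fromstring(xml_string)
    return ["".join(el.itertext()).strip() for el in root.iter(tag)]


def count_suffix(texts, suffix="ly"):
    """Count lowercase words that end with the given suffix."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word.endswith(suffix):
                counts[word] += 1
    return counts


ly_counts = count_suffix(paragraphs(SAMPLE_XML))
print(ly_counts.most_common())
```

To count something else, one option is to change the `suffix` argument (for example `"ing"`) or to swap the `endswith` test for a different condition, such as membership in a word list.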

---
class: middle

# Reading documentation

<hr>

- #### Traversing complex data can be challenging, and sometimes even looking at the source files isn't enough.

- #### Luckily, people who share data often publish documentation that describes the data and how to use it

- #### Let's take a look at https://wiki.dlib.indiana.edu/display/vwwp/Home