class: middle

# Text Analysis Methods Workshop 6:
### Working with Corpora and Datasets
Matthew J. Lavin
Clinical Assistant Professor of English and Director of Digital Media Lab
University of Pittsburgh
Fall 2018

---
class: middle

# Corpus or Dataset
- ### The terms are often used interchangeably, or "corpus" describes a dataset of text documents
- ### In some scholarly communities, the distinction might be important
- ### In computational linguistics, corpora are large, made of "real world language," and often annotated

---
class: middle

# Where to Find Data
- #### Many datasets are publicly available and can be found with a simple Google search (e.g., "Amazon product reviews dataset")
- #### You can also search large dataset sharing/indexing platforms
- #### Try datahub.io, figshare.com, kaggle.com, zenodo.org, dataverse.harvard.edu, and now toolbox.google.com/datasetsearch
- #### Published articles will often announce a new corpus or dataset
- #### Try keyword searches on PittCat+, Google Scholar, and arxiv.org
- #### I've already mentioned large-scale collections like HathiTrust, archive.org, and Chronicling America, which often have APIs or data downloads

---
class: middle

# Helpful Resources to Know About
- #### The CMU Pronouncing Dictionary
- #### WordNet, VerbNet
- #### Wordlist corpora (stopwords, names)
- #### LIWC (costs $)
- #### SensEval

---
class: middle

# Data Structures and Formats
- #### Plain text files with metadata
- #### XML files ... American Periodicals Series
- #### JSON (especially from APIs)
- #### Databases

---
class: middle

# Some Examples
- #### Take a look at https://github.com/dh-fall-2018/working-with-corpora-fall-2018
- #### You can clone the repo and run ExploreCorpora.ipynb yourself

---
class: middle

# Plain Text Files with Metadata
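
A minimal sketch of this pattern, not the notebook's actual code. It assumes the CSV has a `filename` column pointing at `.txt` files in one folder; `load_corpus` and `word_counts` are hypothetical helper names, and the real metadata.csv may use different columns.

```python
import csv
from collections import Counter
from pathlib import Path


def load_corpus(metadata_path, text_dir):
    """Yield (metadata_row, text) pairs for every file listed in the CSV.

    Assumption: the CSV has a 'filename' column naming a .txt file
    inside text_dir. Adjust the column name to match the real metadata.
    """
    text_dir = Path(text_dir)
    with open(metadata_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            text = (text_dir / row["filename"]).read_text(encoding="utf-8")
            yield row, text


def word_counts(text):
    """Lowercase the text, split on whitespace, and count the tokens."""
    return Counter(text.lower().split())
```

Looping over `load_corpus("metadata.csv", "texts")` would then give each document's metadata row alongside its text, so counts can be grouped by any metadata field (the path and folder name here are placeholders).
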
- #### We can use the metadata.csv to grab txt files and manipulate them

---
class: middle

# XML Files
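
A minimal sketch of pulling paragraph text out of an XML document and counting "ly" words, using Python's standard `xml.etree.ElementTree`. It assumes paragraphs are tagged `p`; the function names are hypothetical, and the workshop notebook may take a different approach.

```python
import re
import xml.etree.ElementTree as ET


def paragraph_texts(xml_string, tag="p"):
    """Return the text of every element whose local tag name matches `tag`.

    Assumption: paragraphs are tagged <p>. Any namespace prefix such as
    '{http://...}p' is stripped before comparing.
    """
    root = ET.fromstring(xml_string)
    texts = []
    for elem in root.iter():
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == tag:
            # itertext() also collects text inside nested tags
            texts.append("".join(elem.itertext()).strip())
    return texts


def count_words_ending_with(texts, suffix="ly"):
    """Count the words across all texts that end with `suffix`."""
    total = 0
    for text in texts:
        words = re.findall(r"[A-Za-z']+", text.lower())
        total += sum(1 for word in words if word.endswith(suffix))
    return total
```

Changing the `suffix` or `tag` argument is one way to tweak the count, for example `suffix="ing"`, or `tag="head"` to look at chapter headings instead of paragraphs.
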
- #### We can go through all the tags in the document and retrieve only the paragraphs of text (skipping the table of contents, chapter headings, etc.)
- #### Or we can split the file into chapters, pages, or even character dialogue, if those units are tagged in the XML
- #### What would you want to count besides "ly" words, and how would we tweak the code?

---
class: middle

# Reading Documentation
- #### Traversing complex data can be challenging, and sometimes even looking at the source files isn't enough
- #### Luckily, people who share data often describe their data and how to use it, and then publish those descriptions in the form of documentation
- #### Let's take a look at https://wiki.dlib.indiana.edu/display/vwwp/Home