class: middle

# Text Analysis Methods

Workshop 7:

### Dictionaries and Lexicons
Matthew J. Lavin

Clinical Assistant Professor of English and Director of Digital Media Lab

University of Pittsburgh

Fall 2018

---

class: middle

# Reminder: Corpus vs. Lexicon or Dictionary
#### A corpus collects "real-world language" and is often annotated. Many corpora are large and make a claim of representativeness (like the Corpus of Contemporary American English).
---

class: middle

# Reminder: Corpus vs. Lexicon or Dictionary
#### A lexicon or dictionary is a list of words or phrases that match some criteria, and/or a wordlist with information about those words.
#### For example, SocialSent includes .tsv files, by decade, covering the years 1850-2000.
#### Each .tsv has the format `<word>\t<mean_sentiment>\t<std_sentiment>\n`, where "mean_sentiment is the averaged inferred sentiment across bootstrap-sampled SentProp runs and std_sentiment is the standard deviation of these samples."
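A decade file with this layout can be read into a Python dictionary with the standard library, for instance (the file path and helper name below are illustrative, not part of SocialSent itself):

```python
import csv

def load_socialsent(path):
    """Read a SocialSent decade .tsv into {word: (mean_sentiment, std_sentiment)}."""
    lexicon = {}
    with open(path, encoding="utf-8") as tsv:
        for word, mean, std in csv.reader(tsv, delimiter="\t"):
            lexicon[word] = (float(mean), float(std))
    return lexicon

# e.g., lexicon = load_socialsent("socialsent/1990.tsv")  # hypothetical path
```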
---

class: middle

# Helpful Resources to Know About
- #### The CMU Pronouncing Dictionary
- #### WordNet, VerbNet
- #### Wordlist corpora (stopwords, names)
- #### LIWC (costs $)
- #### SensEval
- #### SocialSent
- #### The North Atlantic Population Project (NAPP)
- #### And many more!

---

class: middle

# Common Formats
### As with corpora, you are most likely to see .txt, .csv, .xml, and .json files.
---

class: middle

# Importing in Python
### Working with a lexicon or dictionary in Python almost always involves loading the structured data as a _list_, a _dictionary_, or a pandas _dataframe_.
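As one sketch of the dataframe route, a SocialSent-style .tsv could be loaded with pandas (the column names follow the file format described earlier; the helper name is my own):

```python
import pandas as pd

def load_as_dataframe(path_or_buffer):
    """Load a tab-separated sentiment lexicon as a pandas dataframe."""
    # SocialSent decade files have no header row, so supply column names
    return pd.read_csv(path_or_buffer, sep="\t",
                       names=["word", "mean_sentiment", "std_sentiment"])
```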
---

class: middle

# An Activity
### Let's break into groups of 2-3 students and try to load SocialSent adjective data and use it to generate a 'positivity score' for some texts.
---

class: middle

# Downloading the Repo
#### I've created a starter repo with the SocialSent adjective data and a set of texts to analyze. Clone it here: https://github.com/dh-fall-2018/social-sent-activity
#### The repository contains approximately 6,400 op-eds from _The New York Times_, all published between January and September 2018. There's a metadata.csv file with columns `id`, `url`, `word_count`, `snippet`, `source`, `byline`, `byline_parsed`, `pub_date`, and `inferred_gender`. With the exception of `byline_parsed` and `inferred_gender`, the fields are all taken from _The New York Times_ Article API. `byline_parsed` attempts to isolate the author's name, and `inferred_gender` assigns a gender label using the R package `gender`, which returns a gender probability given a name input and a date. The inference is made by looking up census data.
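A minimal sketch of loading the metadata, assuming only the column names listed above (the helper function is illustrative, not part of the repo):

```python
import pandas as pd

def load_metadata(path_or_buffer):
    """Read metadata.csv, parsing publication dates as datetimes."""
    return pd.read_csv(path_or_buffer, parse_dates=["pub_date"])

# e.g., op-eds per label: load_metadata("metadata.csv")["inferred_gender"].value_counts()
```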
---

class: middle

# Breaking Down the Problem
- #### Loading a decade file from SocialSent into Python (as a dictionary)
- #### Loading and tokenizing fifty texts (using spaCy?)
- #### Looping over each text and checking each token against our dictionary
- #### Designing the "positivity score"
- #### Saving an aggregated score for each file
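The scoring step might look like the sketch below. A simple regex tokenizer stands in for spaCy so the example is self-contained, and the "positivity score" here is just the mean lexicon sentiment over matched tokens (one possible design among many):

```python
import re

def positivity_score(text, lexicon):
    """Mean sentiment of tokens found in the lexicon; 0.0 if none match.

    lexicon maps word -> mean_sentiment. re.findall is a stand-in
    tokenizer; a real pipeline might use spaCy instead.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = [lexicon[t] for t in tokens if t in lexicon]
    return sum(hits) / len(hits) if hits else 0.0
```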
#### This is obviously a "challenge by choice" type of thing, but I want you to try, _really try_, to do each step. The goal here isn't necessarily to complete the task; pushing yourself to think through the code will be a valuable learning experience. As you go, think about how this kind of coding gets done, as well as what it means to work with others on these kinds of tasks.
---

class: middle

# Sharing Your Process
- #### How did it go? Did you complete the challenge?
- #### Were some tasks more difficult than others? What kinds of concessions did you make?
- #### What strategies were successful? Did you consult code from previous weeks' workshops?
- #### What did effective communication and teamwork feel like for this activity?