Digital Humanities

class: middle

# Text Analysis Methods Workshop 9: Collocations and N-Grams


<hr>
Matthew J. Lavin

Clinical Assistant Professor of English and Director of Digital Media Lab

University of Pittsburgh

Fall 2018

---
class:middle

# N-Grams
<hr>
1. ### Defined as "a contiguous sequence of n items from a given sample of text or speech"
2. ### Named by size: bigrams (pairs of words), trigrams (sets of three), then 4-grams, 5-grams, and so on

---
class:middle

# N-Grams use a sliding window
<hr>

- #### The window advances one word at a time, so most words appear in more than one N-gram. That is expected: each window position yields a distinct N-gram.

- #### Example: "I ran away" has two bigrams: "I ran" and "ran away"
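
---
class:middle

# Sliding Window: A Sketch
<hr>

#### The sliding window described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library function: the name `sliding_ngrams` is made up for this slide, and splitting on whitespace stands in for real tokenization.

```python
def sliding_ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    # The window can start at any index that still leaves room for n tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I ran away".split()
bigrams = sliding_ngrams(tokens, 2)
print(bigrams)  # [('I', 'ran'), ('ran', 'away')]
```

#### Note that "ran" shows up in both bigrams, once at the end of a window and once at the start.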

---
class:middle

# An Example
<hr>

<div class="highlight"><pre>mytext = """I traveled to Afghanistan in 2015 to film 
a documentary about the United States drone war. 
When my production partner and I inquired about 
kidnapping insurance, we were told that it 
would cost more than $20,000 to cover the director 
of photography and me. We couldn’t afford 
such a high premium and declined the offer."""

import spacy
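# spaCy 3 and later dropped the 'en' shortcut,
# so load the English model by its full name
nlp = spacy.load('en_core_web_sm')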
doc = nlp(mytext)
</pre></div>

---
class:middle

# Trigrams by Hand
<hr>
<div class="highlight"><pre>#omit punctuation and extra spaces, 
#and make everything lowercase
words = [tok.text.lower() for tok in doc
         if tok.pos_ not in ('PUNCT', 'SPACE')]

#we create trigrams, an empty list
trigrams = []
for i in range(len(words)):
    # i here will be a number between 0 and 56
    # tri is a list of up to three words:
    # the current word and the two words after it
    tri = words[i:i+3]
    # near the end of the list, words[i:i+3] comes up short:
    # 1. the second-to-last word yields a two-word pair,
    #    since only one word comes after it
    # 2. the very last word yields itself alone,
    #    since nothing comes after it
    # we want to ignore those two cases, so ...
    if len(tri) == 3:
        trigrams.append(tri)
 
#trigrams is now a list of lists
trigrams
</pre></div>

---
class:middle

# N-grams in nltk
<hr>

<div class="highlight"><pre>#to do this we need nltk installed
from nltk.util import ngrams
trigrams = ngrams(words, 3)
# ngrams() returns a generator; list() turns it into
# a list of tuples instead of a list of lists
list(trigrams)
</pre></div>

#### To install nltk, check out this link: https://anaconda.org/anaconda/nltk

---
class:middle

# Collocations
<hr>

#### Collocation scoring by pointwise mutual information (PMI), based on Church and Hanks, 1990 (http://www.aclweb.org/anthology/J90-1003)
<hr> 
#### I(x,y) = log2( P(x,y) / (P(x) P(y)) )
<hr>
#### "Word probabilities P(x) and P(y) are estimated by counting the number of observations of x and y in a corpus, f(x) and f(y), and normalizing by N, the size of the corpus. Joint probabilities, P(x,y), are estimated by counting the number of times that x is followed by y in a window of w words, fw (x,y), and normalizing by N." (23)
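
---
class:middle

# PMI by Hand: A Sketch
<hr>

#### To make the formula concrete, here is a rough sketch that estimates P(x), P(y), and P(x,y) from raw counts and applies the log2 ratio. It assumes the narrowest window (y immediately follows x), and the function name `pmi_scores` and the toy sentence are invented for illustration. NLTK's own implementation may differ in its details.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Score adjacent pairs with I(x,y) = log2(P(x,y) / (P(x) * P(y)))."""
    n = len(tokens)                                   # N, the corpus size
    word_counts = Counter(tokens)                     # f(x) and f(y)
    pair_counts = Counter(zip(tokens, tokens[1:]))    # f(x,y) for adjacent words
    scores = {}
    for (x, y), f_xy in pair_counts.items():
        p_xy = f_xy / n
        p_x = word_counts[x] / n
        p_y = word_counts[y] / n
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

toy = "new york is big and new york is old".split()
scores = pmi_scores(toy)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

#### In this toy text the one-off pair "big and" scores log2(9), higher than the repeated pair "new york" at log2(4.5). Rare words inflate PMI, which is one reason the measure behaves better on larger corpora.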
---
class:middle

# Example Usage
<hr>

<div class="highlight"><pre>import nltk
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
finder.nbest(bigram_measures.pmi, 10)
</pre></div>

#### Note: This works much better with more text!

---
class:middle

# Window Size
<hr>

#### "The window size parameter allows us to look at different scales. Smaller window sizes will identify fixed expressions (idioms such as bread and butter) and other relations that hold over short ranges; larger window sizes will highlight semantic concepts and other relationships that hold over larger scales." (23)
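
---
class:middle

# Window Size: A Sketch
<hr>

#### The effect of the window can be sketched by counting which pairs become visible as it widens. This example assumes the window counts the first word, so a size of 2 means adjacent pairs only; that convention, the name `window_pairs`, and the sample phrase are assumptions made for this slide.

```python
from collections import Counter

def window_pairs(tokens, window_size=2):
    """Count (x, y) pairs where y follows x inside the window."""
    counts = Counter()
    for i, x in enumerate(tokens):
        # Pair x with each of the next window_size - 1 tokens
        for y in tokens[i + 1:i + window_size]:
            counts[(x, y)] += 1
    return counts

tokens = "bread and butter on the table".split()
narrow = window_pairs(tokens, window_size=2)
wide = window_pairs(tokens, window_size=3)

print(("bread", "butter") in narrow)  # False
print(("bread", "butter") in wide)    # True
```

#### With the narrow window only neighbors are paired; widening it by one word is enough to connect "bread" and "butter" across "and".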

---
class:middle

# Example Usage with window_size
<hr>
<div class="highlight"><pre>import nltk
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words, window_size=6)
finder.nbest(bigram_measures.pmi, 10)
</pre></div>

---
class:middle

# Discussion of Keyness Measures
<hr>
### http://www.thegrammarlab.com/?p=193