Digital Humanities

class: middle

# Text Analysis Methods Workshop 11:

### TF-IDF and Clustering

<hr>
Matthew J. Lavin

Clinical Assistant Professor of English and Director of Digital Media Lab

University of Pittsburgh

Fall 2018

---
class: middle

# What is TF-IDF?

<hr>
#### TF-IDF stands for Term Frequency - Inverse Document Frequency. Instead of representing a term in a document by its raw frequency (number of occurrences) or its relative frequency (term count divided by document length), TF-IDF weights each term by how rare it is across the corpus: the term frequency is multiplied by a log-scaled inverse of the number of documents that contain the term. Terms found in nearly every document are weighted down, and terms concentrated in a few documents are weighted up.
<hr>

---
class: middle

# How Does the Math Work?

<hr>

- #### The formula is: tf-idf(d, t) = tf(t) * idf(d, t)
- #### _tf(t)_ is the frequency of the term _t_ in the document _d_, either a raw count or a normalized value
- #### _idf(d, t)_ is log [ n / df(d, t) ] + 1, where _log_ is the natural logarithm
- #### _n_ is the total number of documents
- #### _df(d, t)_ is the number of documents containing the term _t_

<hr>
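
The formula above can be checked by hand. The following is a plain-Python sketch, not scikit-learn's own code; the function names `idf` and `tfidf` and the example numbers (a term used 3 times in one document and found in 10 of 100 documents) are assumptions made up for illustration.

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency as defined on this slide: ln(n / df) + 1."""
    return math.log(n_docs / doc_freq) + 1

def tfidf(term_count, n_docs, doc_freq):
    """Raw term count multiplied by the IDF weight."""
    return term_count * idf(n_docs, doc_freq)

# Hypothetical numbers: a term used 3 times in one document,
# found in 10 of the 100 documents in the corpus.
print(round(idf(100, 10), 4))       # 3.3026
print(round(tfidf(3, 100, 10), 4))  # 9.9078

# A term that occurs in every document gets the minimum weight of 1.
print(idf(100, 100))                # 1.0
```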

---
class: middle

# An Example: One Obituary among Many

<div>
<table border="1" class="dataframe">
<thead>
 <tr style="text-align: right;">
 <th title="Term">Term</th>
 <th title="Count">Count</th>
 <th title="DF">DF</th>
 <th title="Smoothed-IDF">Smoothed-IDF</th>
 <th title="TF-IDF">TF-IDF</th>
 
 </tr>
</thead>
<tbody>
<tr>
<td>afternoon</td>
<td>1</td>
<td>66</td>
<td>2.70066923</td>
<td>2.70066923</td>

</tr>
<tr>
<td>against</td>
<td>1</td>
<td>189</td>
<td>1.65833778</td>
<td>1.65833778</td>

</tr>
<tr>
<td>age</td>
<td>1</td>
<td>224</td>
<td>1.48926145</td>
<td>1.48926145</td>

</tr>
<tr>
<td>ago</td>
<td>1</td>
<td>161</td>
<td>1.81776551</td>
<td>1.81776551</td>

</tr>
<tr>
<td>air</td>
<td>1</td>
<td>80</td>
<td>2.51091269</td>
<td>2.51091269</td>

</tr>
<tr>
<td>all</td>
<td>1</td>
<td>310</td>
<td>1.16556894</td>
<td>1.16556894</td>

</tr>
<tr>
<td>american</td>
<td>1</td>
<td>277</td>
<td>1.27774073</td>
<td>1.27774073</td>

</tr>
<tr>
<td>an</td>
<td>1</td>
<td>352</td>
<td>1.03889379</td>
<td>1.03889379</td>

</tr>
<tr>
<td>and</td>
<td>13</td>
<td>364</td>
<td>1.00546449</td>
<td>13.07103843</td>

</tr>
<tr>
<td>around</td>
<td>2</td>
<td>149</td>
<td>1.89472655</td>
<td>3.78945311</td>
</tr>

<tr>
<td>ascension</td>
<td>1</td>
<td>6</td>
<td>4.95945170</td>
<td>4.95945170</td>
</tr>
</tbody>
</table>
</div>

Download the complete Excel file from [GitHub](https://github.com/dh-fall-2018/tfidf-explore/blob/master/bly_tfidf_all.xlsx?raw=true)
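
The Smoothed-IDF column appears to follow the smoothed variant that scikit-learn applies by default, ln((1 + n) / (1 + df)) + 1, rather than the unsmoothed formula on the previous slide. The sketch below recomputes three rows. The corpus size of 366 documents is an assumption: it is not stated in the slides, but it is the value that reproduces the table's figures. `smoothed_idf` is an illustrative helper name.

```python
import math

N_DOCS = 366  # assumption: inferred from the table's values, not stated in the slides

def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# (term, count in this obituary, document frequency) copied from the table
rows = [("afternoon", 1, 66), ("and", 13, 364), ("ascension", 1, 6)]

for term, count, doc_freq in rows:
    weight = smoothed_idf(N_DOCS, doc_freq)
    print(f"{term:10s} idf={weight:.4f} tf-idf={count * weight:.4f}")
```

Note that "and" ends up with the highest score here only because it occurs 13 times and the scores are not length-normalized; its IDF weight is barely above the minimum of 1.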

---
class: middle

# In Python

<hr>
<div class="highlight"><pre># all_docs can be any list of strings,
# each item representing a document 
a = 'The once ruddy face was puffy and pale'
b = 'The gray hair was straight and thin'
c = 'His dark brown eyes looked fixed, and he seemed \
to be daydreaming'
d = 'his figure was trim and erect'

all_docs = [a,b,c,d]
</pre></div>
<hr>

---
class: middle

# In Python

<hr>

<div class="highlight"><pre>from sklearn.feature_extraction.text import TfidfVectorizer
# TfidfVectorizer is a class, so I instantiate it
# with specific parameters as 'vectorizer'.
# I then run the object's fit_transform()
# method on my list of strings (all_docs).
# The stored variable X is the output of the
# fit_transform() method.
vectorizer = TfidfVectorizer(max_df=.65, min_df=1, 
 stop_words=None, use_idf=True, norm=None)
X = vectorizer.fit_transform(all_docs)
</pre></div>

<hr>
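
To see what the `max_df=.65` setting does to the four example sentences, the following plain-Python sketch imitates the document-frequency filter. It is an approximation written for illustration, not scikit-learn's implementation: the `tokenize` helper is hypothetical and only mimics the vectorizer's default behavior of lowercasing and keeping tokens of two or more word characters.

```python
import re
from collections import Counter

docs = [
    "The once ruddy face was puffy and pale",
    "The gray hair was straight and thin",
    "His dark brown eyes looked fixed, and he seemed to be daydreaming",
    "his figure was trim and erect",
]

def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in docs for term in set(tokenize(doc)))

# Terms found in more than 65% of the documents are dropped.
max_df = 0.65
dropped = sorted(t for t, df in doc_freq.items() if df / len(docs) > max_df)
print(dropped)  # ['and', 'was']
```

"The" and "his" each occur in two of the four sentences (50%), so they stay in the vocabulary.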

---
class: middle

# In Python

<hr>
<div class="highlight"><pre># The fit_transform() method converts the list of 
# strings to a sparse matrix of TF-IDF values
# The toarray() method converts the sparse matrix to a NumPy array,
# which makes it easier to inspect every value, including the zeros
myarray = X.toarray()
# prints the first row of results
print(myarray[0])
</pre></div>
<hr>

---
class: middle

# In Python

<hr>
<div class="highlight"><pre># You can merge the results for each doc
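# get_feature_names_out() requires scikit-learn 1.0 or later;
# older versions use get_feature_names() instead
list(zip(vectorizer.get_feature_names_out(), myarray[0]))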
</pre></div>
<hr>

---
class: middle

# Exploratory Applications

<hr>
#### TF-IDF is often used simply to re-rank term lists for a large group of documents, so that each document's distinctive terms rise above the common ones
<hr>
See https://github.com/dh-fall-2018/tfidf-explore/blob/master/ExploreFiles.ipynb
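
Re-ranking amounts to sorting a document's terms by TF-IDF score instead of raw count. A minimal sketch, using five rows copied from the obituary table earlier in the deck; the variable names are illustrative.

```python
# term -> (raw count, TF-IDF score), copied from the obituary table
scores = {
    "afternoon": (1, 2.70066923),
    "all": (1, 1.16556894),
    "an": (1, 1.03889379),
    "around": (2, 3.78945311),
    "ascension": (1, 4.95945170),
}

# Sort terms from highest to lowest TF-IDF score
ranked = sorted(scores, key=lambda term: scores[term][1], reverse=True)
print(ranked)
# ['ascension', 'around', 'afternoon', 'all', 'an']
```

Four of these terms have the same raw count, yet the rare word "ascension" ranks well above "all" and "an".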

---
class: middle

# Using It As a Preprocessing Step

<hr>
- #### Many machine learning workflows begin by vectorizing text and applying TF-IDF weighting
- #### For example, I could next run k-means clustering and split the obits into three groups to see what lands where

<hr>
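
The two steps above might be chained as follows. This is a sketch under stated assumptions: the six short strings are hypothetical stand-ins for the obituaries, and the `n_init` and `random_state` values are arbitrary choices made so the run is repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for the obituary texts
docs = [
    "senator served three terms in congress",
    "the senator led the committee in congress",
    "novelist published her first book of poems",
    "the poet and novelist wrote a dozen books",
    "the pitcher won twenty games that season",
    "he played baseball for nine seasons as a pitcher",
]

# Step 1: documents -> sparse matrix of TF-IDF values
X = TfidfVectorizer().fit_transform(docs)

# Step 2: split the documents into three groups
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)  # one cluster label per document

for label, doc in zip(labels, docs):
    print(label, doc)
```

The cluster numbers themselves are arbitrary; what matters is which documents share a label.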

---
class: middle

# Using It to Make Feature Lists

<hr>
- #### To derive a feature set, I might run TF-IDF on a group of documents about a subject
- #### For example, obits in one pile, non-obits in another
- #### Then I can take the top N TF-IDF terms (say 50) and see which novels use those words the most

<hr>
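
One way to sketch the steps above in plain Python. All of the data here is hypothetical: the scores in `obit_scores` and the two one-sentence "novels" are invented for illustration, and `feature_rate` is an illustrative helper rather than a library function.

```python
import re
from collections import Counter

# Hypothetical TF-IDF scores for the obituary pile
obit_scores = {"died": 4.1, "survived": 3.8, "funeral": 3.5, "the": 1.0, "and": 1.0}

# Hypothetical stand-ins for novels
novels = {
    "novel_a": "The funeral was held at noon and she died that winter",
    "novel_b": "The ship sailed and the crew sang",
}

top_n = 3  # the slide suggests something like 50 for a real corpus
features = sorted(obit_scores, key=obit_scores.get, reverse=True)[:top_n]

def feature_rate(text, features):
    """Share of a text's tokens that belong to the feature list."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    return sum(counts[f] for f in features) / len(tokens)

for name, text in novels.items():
    print(name, round(feature_rate(text, features), 3))
```

Dividing by the number of tokens keeps long novels from ranking first merely because they contain more words.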

---
class: middle

# Clustering

<hr>
- #### Clustering is typically unsupervised (it groups documents without labeled examples), though semi-supervised variants exist
- #### Algorithms use either flat or non-flat geometry to measure vectors' relationships with one another
- #### See https://scikit-learn.org/stable/modules/clustering.html for explanations of many algorithms

<hr>

---
class: middle

# K-means Clustering

<hr>
- #### User supplies a number of clusters
- #### The model assigns each document to the nearest "centroid" (a multidimensional center), recalculates each centroid as the mean of its cluster, and repeats until the groupings stop changing
- #### More coherent groups have lower "inertia" (the sum of squared distances from each point to its centroid)

<hr>
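
Inertia can be computed by hand for a toy example. The sketch below is plain Python rather than scikit-learn's implementation; the two-dimensional points and the two alternative groupings are hypothetical, and real TF-IDF vectors would have one dimension per term.

```python
def centroid(points):
    """Mean of each coordinate across a group of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def inertia(clusters):
    """Sum of squared distances from every point to its cluster's centroid."""
    total = 0.0
    for points in clusters:
        center = centroid(points)
        for point in points:
            total += sum((x - c) ** 2 for x, c in zip(point, center))
    return total

# Two ways of splitting the same four hypothetical points into two clusters
tight = [[(0, 0), (0, 2)], [(10, 10), (10, 12)]]
loose = [[(0, 0), (10, 10)], [(0, 2), (10, 12)]]

print(inertia(tight))  # 4.0
print(inertia(loose))  # 200.0
```

The grouping that keeps nearby points together has the far lower inertia, which is the quantity k-means tries to minimize.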

See https://github.com/dh-fall-2018/tfidf-explore/blob/master/ExploreFiles.ipynb