class: middle

# Text Analysis Methods Workshop 14

### Machine Learning Classification: Logistic and Linear Regression

Matthew J. Lavin

Clinical Assistant Professor of English and Director of Digital Media Lab

University of Pittsburgh

Fall 2018

---
class: middle

# Regression

- #### Statistical model for inferring the relationships among variables
- #### Builds a model or estimator by constructing a "line of regression" from observed samples
- #### Supervised learning, requires training data
- #### One or more dependent variables
- #### One or more independent variables

---
class: middle

# Logistic regression

- #### Models a binary dependent variable given one or more independent variables
- #### For example, given someone's income, IQ, number of pets, etc., I want to predict "did they go to college?" (y/n)
- #### Fits a coefficient for each variable from the training data, then applies those coefficients to the observed values in the test case to produce a logit (log-odds)
- #### Converts the logit back to a probability using the logistic function (the inverse of the logit)

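
The logit-to-probability step can be sketched in a few lines of plain Python. The intercept, coefficients, and test case here are hypothetical values chosen only for illustration, not output from a fitted model.

```python
import math


def logit_to_probability(logit):
    """Apply the logistic (sigmoid) function, which is the inverse of the logit."""
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical fitted values (assumptions for illustration only)
intercept = -4.0
coefficients = {"income_in_10k": 0.5, "number_of_pets": -0.2}

# One hypothetical test case with an observed value for each variable
person = {"income_in_10k": 6.0, "number_of_pets": 2.0}

# The logit is the intercept plus each coefficient times its observed value
logit = intercept + sum(coefficients[name] * person[name] for name in coefficients)

# Convert the log-odds into a probability between 0 and 1
probability = logit_to_probability(logit)
print(round(logit, 2), round(probability, 3))
```

A logit of zero maps to a probability of 0.5, so a negative logit means the model leans toward "no" and a positive one toward "yes".
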
---
class: middle

# Logistic regression

- #### Good for DH because coefficient lists (each variable and its effect on the model) are easy to read
- #### You don't need to know which features are important ahead of time
- #### Need ample training data
- #### Term frequencies are the most common features (vector space model)
- #### Could theoretically work on any data expressed as vectors

---
class: middle

# Linear regression

- #### Models a scalar dependent variable given one or more independent variables
- #### For example, given a person's heart rate, weight, etc., what is their most likely age?
- #### Need training data but too much might be bad
- #### Calculates a line of best fit
- #### Can return feature coefficients like logistic regression
- #### Can also return an intercept, the expected value of Y when every X is zero
- #### Independent variables can be term frequencies here as well

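
As a minimal sketch of the line of best fit, the following uses ordinary least squares with a single independent variable. The function name `fit_line` and the data points are made up for illustration; real data would not fall exactly on a line.

```python
def fit_line(x_values, y_values):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(x_values)
    mean_x = sum(x_values) / n
    mean_y = sum(y_values) / n
    # Slope is the covariance of x and y divided by the variance of x
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_values, y_values))
    variance = sum((x - mean_x) ** 2 for x in x_values)
    slope = covariance / variance
    # The fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up observations (assumption for illustration only)
x_values = [0.0, 1.0, 2.0, 3.0, 4.0]  # independent variable
y_values = [1.0, 3.0, 5.0, 7.0, 9.0]  # dependent variable

slope, intercept = fit_line(x_values, y_values)

# Predict Y for an unseen X by plugging it into the fitted line
predicted = intercept + slope * 5.0
print(slope, intercept, predicted)
```

The slope plays the role of the feature coefficient, and the intercept is the value the line predicts for Y when X is zero.
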
---
class: middle

# In Python

- #### Both are done the same way using scikit-learn
- #### To build on the k-means clustering example, we need labels (stored in a Python list in the same order as our texts)
- #### We also want to divide our data into training and test sets
- #### See https://github.com/dh-fall-2018/jupyter-notebooks-text-analysis/blob/master/Regression.ipynb
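
The steps above might look roughly like this in scikit-learn. This is a sketch rather than the code from the linked notebook: the toy texts, the labels, and the variable names are invented for illustration, and term counts stand in for term frequencies.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy corpus (invented for illustration); labels are in the same order as texts
texts = [
    "the whale and the sea and the ship",
    "the ship sailed the stormy sea",
    "a whale surfaced near the ship",
    "the sea was calm around the whale",
    "the garden and the ball and the letter",
    "a letter arrived before the ball",
    "she walked in the garden with a letter",
    "the ball was held near the garden",
]
labels = ["sea", "sea", "sea", "sea", "home", "home", "home", "home"]

# Divide the data into training and test sets, keeping both labels in each set
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Build term-count vectors; the vocabulary is learned from the training texts only
vectorizer = CountVectorizer()
train_features = vectorizer.fit_transform(train_texts)
test_features = vectorizer.transform(test_texts)

# Fit the model on the training set, then predict and score on the test set
model = LogisticRegression()
model.fit(train_features, train_labels)
predictions = model.predict(test_features)
accuracy = model.score(test_features, test_labels)

# Pair each term with its coefficient; positive values favor model.classes_[1]
coefficient_by_term = {
    term: model.coef_[0][index] for term, index in vectorizer.vocabulary_.items()
}
print(predictions, accuracy)
```

Swapping in `LinearRegression` from `sklearn.linear_model`, with numeric values in place of the category labels, follows the same `fit` and `predict` pattern.
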