NLP Basics with Python

NLP Basics

Applications:

Spam filter
Auto-complete when searching
Auto-Correct

Naural Language Toolkit

NLTK for short. Suite of open-source tools created to make NLP processes in Python easier to build.

Structured Data and Unstructured Data

80% of business relevant information originates in unstructured form, primarily text.

Unstructured means:

Binary data
No delimiters
No indication of rows

Regular Expressions

Text string for describing a search pattern.

[j-q] searches all character between j and q.

[j-q]+ searches all characters between j and q, but can return multiple characters.

[0-9]+ searches all numbers between 0 and 9, can return multiple.

[j-q0-9]+ searches all numbers between 0 and 9 and characters bewteen j and q, can return multiple.

Why RegEx?

Identify whitespace between words/tokens
Identify delimiters or end-of-line escape charaters
Removing punctuation or numbers from your text
Cleaning HTML tags from text

Essentially, we want to use RegEx to tokenize/split the text into list that Python can understand.

Takeaways for re package

methods for tokenizing
- findall()
- split()
regexes for tokenizing
- ‘\w’ ‘\W’ words
- ‘\s’ ‘\S’ whitespaces

Machine Learning Pipeline

Raw text - model can’t distinguish words
Tokenize - tell the model what to look at
Clean text - remove stop words/punctuation, stemming, etc.
Vectorize - convert to numeric form
Machine learning algorithm - fit/train model
Spam filter - system to filter emails

Stemming

Process of reducing inflected words (or sometimes derived) to their word stem or root

Crudely chopping off the end of the word to leave only the base

Examples:

stemming/stemmed –> stem
electricity/electrical –> electr
meanness/meaningful –> mean (wrong but this is the fact)

correct in most cases but NOT perfect.

Why?

reduce the corpus of words
explicitly correlates words with similar meanings

Stemmers

Porter Stemmer (in this study)
Snowball
Lancaster
Regex-based

Lemmatizing

Process of grouping together the inflected forms of a word so they can be analyzed as a single term, identified by the word’s lemma.

How is it different from stemming?

both are to condense derived words into their base forms
stemming is typically faster
lemmatizing is typically more accurate

Vectorizing

Process of encoding text as integers to create feature vectors

Feature vector

An n-dimensional vector of numerical features that represent some object.

Why?

Python only see strings, don’t know what the word represents for
Raw text needs to be converted to numbers so that Python/algorithms can understand

Different types

Count Vectorization
N-grams
Term frequency - inverse document frequency (TF-IDF)

Feature Engineering

Creating new features or transforming your existing features to get the most out of your data.

Creating New Features

Length of text field
percentage of characters that are pucntuation in the text
percentage of characters that are capitalized

Transformations

Power transformations (square, square root, etc…)
standardizing data

Transformation Process

Determine what range of exponents to test
Apply each transformation to each value of your chosen feature
use some criteria to determine which of the transformations yield the best distribution

Machine learning Algorithms

The field of study that gives computers the ability to learn without being explicitly programmed. - Arthur Samuel, 1959

Practice of using algorithms to parse data, learn from it, and then make a determination or prediction about something in the world. - NVIDA, 2016

Supervised learning
- labeled, predict
Unsupervised learning
- no label, derive structure

Cross-validation and evaluation

Holdout test set
K-fold cross-validation
- more robust than just 1 holdout test set

Evaluations

accuracy = # predicted correctly / # total observations
precision = # predicted as spam that are actually spam / # predicted as spam
recall = # predicted as spam that are actually spam / # actually spam

Random forest

A type of ensemble method, which is a technique that creates multiple models and then combines them to produce better results than any of the single models individually.

Random forest combines a collection of decision trees and then aggregate the predictions of each tree to determine the final prediction.

Benefits:

used for classification and regression
easily handles outliers, missing values, etc
accept various data type
less likely to overfit
output feature importance

Gradient Boosting

Also an ensemble method, which takes an iterative approach to combing weak learners to create a strong learner by focusing on mistakes of prior iterations.

Both are ensemble methods and dicision-tree-based.
Difference:
- GB: boosting (sampled with increased weight on those it got wrong previously); training done iteratively; weighted voting for final prediction; harder to tune, easier to overfit (however, the truth is when GB is tuned properly, it has better performance than RF)
- RF: bagging (sampled randomly); training done in parallel; unweighted voting for final prediction; easier to tune, harder to overfit

Speaking of GB itself:

Pros: extremly powerful; various input types; classification and regression; output feature importance
Cons: longer to train; more likely to overfit; more difficult to properly tune

Recap - Machine Learning Pipeline

Read in raw text
Clean text and tokenize
Feature engineering
Fit simple model
Tune hyperparameters and evaluate with GridSearchCV
Select the best model

Two final points

further evaluation
- slice test set
- examine cases where model gets wrong
Results of trade-off - consider business context
- focus on the business request
- precision / recall
  - spam filter - optimize for recall
  - antivirus software - optimize for recall