NLP Basics
Applications:
- Spam filter
- Auto-complete when searching
- Auto-Correct
Related Topics:
- Sentiment analysis
- Topic Modeling
- Text classification
- Sentence segmentation or part-of-speech tagging
Natural Language Toolkit
NLTK for short. Suite of open-source tools created to make NLP processes in Python easier to build.
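A minimal sketch of getting started with NLTK (the resource names are real NLTK downloads; the printed slice is just for illustration):

```python
import nltk

# One-time downloads of common NLTK resources
nltk.download('stopwords')
nltk.download('punkt')

from nltk.corpus import stopwords

# English stop words are a common cleaning resource in later steps
print(stopwords.words('english')[:5])
```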
Structured Data and Unstructured Data
80% of business-relevant information originates in unstructured form, primarily text.
Unstructured means:
- Binary data
- No delimiters
- No indication of rows
Regular Expressions
A text string that describes a search pattern.
[j-q]
matches any single character between j and q.
[j-q]+
matches one or more consecutive characters between j and q as a single match.
[0-9]+
matches one or more consecutive digits.
[j-q0-9]+
matches one or more consecutive characters that are either letters between j and q or digits 0-9.
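A quick sketch of these patterns using Python's built-in re module (the sample string is made up):

```python
import re

sample = 'The 3 quick foxes jumped over 12 lazy dogs'

print(re.findall(r'[j-q]', sample))      # individual characters j through q
print(re.findall(r'[j-q]+', sample))     # runs of one or more such characters
print(re.findall(r'[0-9]+', sample))     # runs of digits: ['3', '12']
print(re.findall(r'[j-q0-9]+', sample))  # runs mixing j-q letters and digits
```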
Why RegEx?
- Identify whitespace between words/tokens
- Identify delimiters or end-of-line escape characters
- Remove punctuation or numbers from your text
- Clean HTML tags from text
Essentially, we want to use RegEx to tokenize/split the text into a list of tokens that Python can understand.
Takeaways for re package
- methods for tokenizing
- findall()
- split()
- regexes for tokenizing
- ‘\w’ matches a word character, ‘\W’ a non-word character
- ‘\s’ matches a whitespace character, ‘\S’ a non-whitespace character
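A minimal sketch of both tokenizing approaches (the sentence is made up):

```python
import re

text = 'This is a made-up sentence, for tokenizing!'

# findall(): grab runs of word characters (\w+)
print(re.findall(r'\w+', text))
# ['This', 'is', 'a', 'made', 'up', 'sentence', 'for', 'tokenizing']

# split(): break on runs of non-word characters (\W+);
# note it can leave an empty string at the edges (trailing '!')
print(re.split(r'\W+', text))

# splitting on whitespace (\s+) keeps punctuation attached to tokens
print(re.split(r'\s+', text))  # [..., 'made-up', 'sentence,', ..., 'tokenizing!']
```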
Machine Learning Pipeline
- Raw text - model can’t distinguish words
- Tokenize - tell the model what to look at
- Clean text - remove stop words/punctuation, stemming, etc.
- Vectorize - convert to numeric form
- Machine learning algorithm - fit/train model
- Spam filter - the deployed system that filters emails
Stemming
Process of reducing inflected (and sometimes derived) words to their word stem or root
Crudely chopping off the end of the word to leave only the base
Examples:
- stemming/stemmed –> stem
- electricity/electrical –> electr
- meanness/meaningful –> mean (an incorrect conflation, but that is what the stemmer produces)
Correct in most cases, but NOT perfect.
Why?
- reduce the corpus of words
- explicitly correlates words with similar meanings
Stemmers
- Porter Stemmer (used in these notes; see the sketch after this list)
- Snowball
- Lancaster
- Regex-based
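A minimal sketch of NLTK's PorterStemmer on the examples above:

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

for word in ['stemming', 'stemmed', 'electricity', 'electrical',
             'meanness', 'meaningful']:
    # crude suffix chopping, e.g. stemming/stemmed -> stem,
    # electricity/electrical -> electr
    print(word, '->', ps.stem(word))
```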
Lemmatizing
Process of grouping together the inflected forms of a word so they can be analyzed as a single term, identified by the word’s lemma.
How is it different from stemming?
- both condense inflected or derived words into their base forms
- stemming is typically faster
- lemmatizing is typically more accurate
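A minimal sketch contrasting the two with NLTK's WordNetLemmatizer (requires nltk.download('wordnet'); the word pair is illustrative):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

ps = PorterStemmer()
wn = WordNetLemmatizer()

for word in ['goose', 'geese']:
    # the stemmer leaves two different tokens ('goos'/'gees');
    # the lemmatizer maps both to the dictionary form 'goose'
    print(word, '| stem:', ps.stem(word), '| lemma:', wn.lemmatize(word))
```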
Vectorizing
Process of encoding text as integers to create feature vectors
Feature vector
An n-dimensional vector of numerical features that represent some object.
Why?
- Python only sees strings; it doesn’t know what a word represents
- Raw text needs to be converted to numbers so that Python/algorithms can understand
Different types
- Count Vectorization
- N-grams
- Term frequency - inverse document frequency (TF-IDF)
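A minimal sketch of all three using scikit-learn (the library choice and the toy corpus are assumptions, not from these notes):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ['free prize now', 'call me now', 'free free call']

# Count vectorization: raw term counts per document
counts = CountVectorizer().fit_transform(corpus)

# N-grams: count contiguous word pairs instead of single tokens
bigrams = CountVectorizer(ngram_range=(2, 2)).fit_transform(corpus)

# TF-IDF: term frequency down-weighted by document frequency
tfidf = TfidfVectorizer().fit_transform(corpus)

print(counts.shape, bigrams.shape, tfidf.shape)  # sparse matrices: docs x features
```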
Feature Engineering
Creating new features or transforming your existing features to get the most out of your data.
Creating New Features
- Length of text field
- Percentage of characters that are punctuation in the text
- Percentage of characters that are capitalized
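A minimal sketch of these three features, assuming a pandas DataFrame with a hypothetical 'text' column:

```python
import string
import pandas as pd

df = pd.DataFrame({'text': ['FREE PRIZE!!! Call now!',
                            'see you at lunch tomorrow']})

# Length of the text field
df['body_len'] = df['text'].apply(len)

# Percentage of characters that are punctuation
df['punct_pct'] = df['text'].apply(
    lambda s: sum(c in string.punctuation for c in s) / len(s) * 100)

# Percentage of characters that are capitalized
df['caps_pct'] = df['text'].apply(
    lambda s: sum(c.isupper() for c in s) / len(s) * 100)

print(df)
```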
Transformations
- Power transformations (square, square root, etc…)
- Standardizing data
Transformation Process
- Determine what range of exponents to test
- Apply each transformation to each value of your chosen feature
- Use some criterion (e.g., closeness to a normal distribution) to determine which transformation yields the best distribution
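A minimal sketch of that process; using skewness as the selection criterion is an assumption (the notes leave the criterion open):

```python
import numpy as np
from scipy import stats

x = np.random.exponential(scale=2.0, size=1000)  # a skewed stand-in feature

# Apply each candidate exponent and keep the one closest to zero skew
for power in [0.25, 0.5, 1, 2, 3]:
    transformed = x ** power
    print(power, round(stats.skew(transformed), 3))
```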
Machine learning Algorithms
The field of study that gives computers the ability to learn without being explicitly programmed. - Arthur Samuel, 1959
Practice of using algorithms to parse data, learn from it, and then make a determination or prediction about something in the world. - NVIDIA, 2016
- Supervised learning
- labeled, predict
- Unsupervised learning
- no label, derive structure
Cross-validation and evaluation
- Holdout test set
- K-fold cross-validation
- more robust than a single holdout test set
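A minimal sketch of k-fold CV with scikit-learn (model and data are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=42)  # stand-in data

# 5-fold CV: train on 4 folds, score on the held-out fold, repeat 5 times
scores = cross_val_score(RandomForestClassifier(), X, y, cv=5)
print(scores, scores.mean())
```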
Evaluations
- accuracy = # predicted correctly / # total observations
- precision = # predicted as spam that are actually spam / # predicted as spam
- recall = # predicted as spam that are actually spam / # actually spam
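A minimal sketch of the three metrics with sklearn.metrics (labels are made up; 1 = spam):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions

print('accuracy: ', accuracy_score(y_true, y_pred))   # 6/8 predicted correctly
print('precision:', precision_score(y_true, y_pred))  # 3 of 4 predicted-spam were spam
print('recall:   ', recall_score(y_true, y_pred))     # 3 of 4 actual-spam were caught
```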
Random forest
A type of ensemble method, which is a technique that creates multiple models and then combines them to produce better results than any of the single models individually.
Random forest combines a collection of decision trees and then aggregates the predictions of each tree to determine the final prediction.
Benefits:
- used for both classification and regression
- easily handles outliers, missing values, etc.
- accepts various data types
- less likely to overfit
- outputs feature importances
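A minimal sketch of fitting a random forest and reading feature importances (stand-in data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=42)

# n_estimators = number of trees; predictions are aggregated across all of them
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

print(rf.feature_importances_)  # one importance score per feature
```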
Gradient Boosting
Also an ensemble method; it takes an iterative approach to combining weak learners into a strong learner by focusing on the mistakes of prior iterations.
- Both are ensemble methods and decision-tree-based.
- Difference:
- GB: boosting (samples weighted toward examples it previously got wrong); trained iteratively; weighted voting for the final prediction; harder to tune and easier to overfit (though a properly tuned GB often outperforms RF)
- RF: bagging (random sampling); trees trained in parallel; unweighted voting for the final prediction; easier to tune and harder to overfit
On GB by itself:
- Pros: extremely powerful; accepts various input types; handles classification and regression; outputs feature importances
- Cons: longer to train; more likely to overfit; more difficult to properly tune
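A minimal sketch with scikit-learn's GradientBoostingClassifier, showing the main knobs that make GB harder to tune (stand-in data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=42)

# Trees are added one at a time, each fit to the ensemble's current mistakes;
# learning_rate scales each tree's contribution
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                max_depth=3).fit(X, y)

print(gb.feature_importances_)
```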
Recap - Machine Learning Pipeline
- Read in raw text
- Clean text and tokenize
- Feature engineering
- Fit simple model
- Tune hyperparameters and evaluate with GridSearchCV
- Select the best model
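A minimal sketch of the tuning step with GridSearchCV (the parameter grid is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=42)

param_grid = {'n_estimators': [50, 150], 'max_depth': [None, 10, 30]}

# 5-fold CV over every parameter combination; keeps the best-scoring model
gs = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
gs.fit(X, y)

print(gs.best_params_, gs.best_score_)
```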
Two final points
- further evaluation
- slice test set
- examine cases the model gets wrong
- Trade-offs in the results - consider the business context
- focus on the business request
- precision / recall
- spam filter - optimize for precision (sending a real email to spam is costly)
- antivirus software - optimize for recall (missing a threat is costly)