In this blog:
- make predictions, such as predicting the right drug using decision trees…
- recommendation systems
- use Python libraries to build ML models
- use the built-in lab environment
Skills: regression, classification, clustering, scikit-learn, SciPy …
Projects: cancer detection, predicting economic trends, customer churn, recommendation systems …
Table of Contents
- Intro to ML
- Intro to Regression
- Intro to Classification
- Intro to Clustering
- Intro to Recommender Systems
Intro to ML
Project: is this cell benign or malignant?
To make predictions, we need to collect the data and train our models to understand the pattern.
Machine learning is the subfield of computer science that gives “computers the ability to learn without being explicitly programmed.” - Arthur Samuel
So how does machine learning work?
Examples
- Recommendation systems: Amazon, Netflix
- Loan application
- Telecommunication companies predicting whether customers will unsubscribe next month
Major tech in ML:
- Regression/estimation: predict continuous values
- Classification: predict the item class/category of a case
- Clustering: finding the structure of data; summarization
- Associations: associating frequent co-occurring items/events
- Anomaly detection: discovering abnormal and unusual cases
- Sequence mining: predicting next events; click-stream (Markov Model, HMM)
- Dimension Reduction: Reducing the size of data (PCA)
- Recommendation Systems: recommending items
Difference between AI, ML and DL
Python for ML
Libraries for ML
Scikit-learn functions
This is just to show that an ML algorithm can be implemented in a few lines of code. Don’t worry if you don’t understand them yet.
```python
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import confusion_matrix
import pickle

# Standardize the features (zero mean, unit variance)
X = preprocessing.StandardScaler().fit(X).transform(X)

# Hold out 33% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

# Train a support vector classifier, then predict on the test set
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(X_train, y_train)
yhat = clf.predict(X_test)

# Evaluate the predictions
print(confusion_matrix(y_test, yhat, labels=[1, 0]))

# Serialize the trained model
s = pickle.dumps(clf)
```
Supervised vs unsupervised
Supervised: teach the machine with labeled data.
Speaking of the data, each row (aka observation) has several columns (aka attributes).
Two types of supervised learning: classification (to predict discrete values) and regression (to predict continuous values).
Unsupervised: no labels anymore. The model works on its own to discover information.
Techniques in Unsupervised learning:
- Dimension reduction (feature selection)
- Density estimation
- Market basket analysis
- Clustering
Clustering groups data points or objects that are somehow similar. It is used for discovering structure, summarization, and anomaly detection.
Summary
Intro to Regression
What’s regression?
Regression is the process of predicting a continuous value using other variables.
X: Independent variable, Y: Dependent variable.
Types of regression
- Simple regression (one independent variable)
  - Simple linear regression
  - Simple non-linear regression
- Multiple regression
  - Multiple linear regression
  - Multiple non-linear regression
Applications of regression
- Sales forecasting
- Satisfaction analysis
- Price estimation
- Employment income
Algorithms
- Ordinal regression
- Poisson regression
- Fast forest quantile regression
- Linear, Polynomial, Lasso, Stepwise, Ridge regression
- Bayesian linear regression
- Neural network regression
- Decision forest regression
- Boosted decision tree regression
- KNN (K-nearest neighbors)
Simple Linear Regression
Then fit a line through the data:
To do predictions, just find the corresponding Y-axis value given the X-axis value.
Model Expression
So how to find the best fit?
Firstly, the best-fit line should minimize the MSE (mean squared error):

MSE = (1/n) Σ (y_i - y_hat_i)^2

So to find the best estimates of θ_0 and θ_1, we have two options: 1. a math (closed-form) method; 2. an optimization method.
- Math (see the sketch below)
- Optimization (skip)
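A minimal sketch of the math option with NumPy, using made-up x/y arrays; the closed-form least-squares estimates come directly from the data:

```python
import numpy as np

# Toy data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# theta_1 = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
theta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# theta_0 = y_mean - theta_1 * x_mean
theta_0 = y.mean() - theta_1 * x.mean()

y_hat = theta_0 + theta_1 * x  # predictions from the fitted line
```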
Pros
- Very fast
- No parameter tuning
- Easy to understand and highly interpretable
Model Evaluation
- Train and test on the same dataset
- Train/test split
How to measure the accuracy of our model?
Compare the actual (dependent) values with predicted ones.
We could use the one-norm error to measure this difference:
What’s training & out-of-sample accuracy?
K-fold Cross Validation
Evaluation Metrics
- MAE (mean absolute error)
- MSE (mean squared error)
- RMSE (root mean squared error)
Their definitions are (with y_i the actual value, y_hat_i the prediction, ȳ the mean of the actual values, and n the number of samples):

MAE = (1/n) Σ |y_i - y_hat_i|

MSE = (1/n) Σ (y_i - y_hat_i)^2

RMSE = √MSE

Relative absolute error: RAE = Σ |y_i - y_hat_i| / Σ |y_i - ȳ|

Relative squared error: RSE = Σ (y_i - y_hat_i)^2 / Σ (y_i - ȳ)^2

R-squared error: R² = 1 - RSE
Multiple LR
Two applications:
- Independent variables' effectiveness on prediction
  - Do revision time, test anxiety, lecture attendance, and gender have any effect on students' exam performance?
- Predicting impacts of changes
  - How much does blood pressure go up or down for every unit increase (or decrease) in a patient's BMI?
Mathematically,

y_hat = θ_0 + θ_1 x_1 + θ_2 x_2 + … + θ_n x_n

Or, in vector form,

y_hat = θ^T X

where θ = (θ_0, θ_1, …, θ_n) is the vector of parameters and X = (1, x_1, …, x_n) is the vector of features.
Optimal parameters: minimize the error of our model, measured by, e.g., MSE or MAE.
After finding good estimates of the parameters, prediction is just a piece of cake.
How many independent variables should we use?
- Adding too many independent variables without theoretical justification will make our model overfit (not general enough for prediction)
Should the independent variable be continuous?
- Categorical variables can be converted into numeric variables (e.g., dummy coding) to work with linear regression
How to check the linear regression relationship?
- A scatter plot is one of the easiest ways.
Intro of Non-Linear Regression
We could use a scatter plot to check whether the relationship is linear or not.
The regression above could be called polynomial regression in general.
Polynomial regression fits curvy data.
However, a polynomial regression can be transformed into a linear regression model.
Then we could apply least squares to the transformed "linear" regression, as in the sketch below.
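As a minimal sketch (with toy quadratic data), scikit-learn's PolynomialFeatures performs exactly this transformation, after which ordinary linear regression applies:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 1 + 2 * x.ravel() + 0.5 * x.ravel() ** 2  # curvy (quadratic) toy data

# Transform x into [1, x, x^2]; the model is now linear in these features
x_poly = PolynomialFeatures(degree=2).fit_transform(x)
model = LinearRegression().fit(x_poly, y)
y_hat = model.predict(x_poly)
```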
Intro to Classification
- Supervised-learning
- Target attribute is a category/discrete
Eg, the bank can classify a user into the default group (value = 1) or the NOT-default group (value = 0) based on the user's information.
The target/dependent variable could also have multiple (>2) values.
Classification algorithms in ML:
- Decision trees (ID3, C4.5, C5.0)
- Naive Bayes
- Linear Discriminant Analysis
- K-Nearest Neighbor (KNN)
- Logistic Regression
- Neural Networks
- Support Vector Machines (SVM)
KNN
Given an example,
What’s KNN?
- A method for classifying cases based on their similarity to other cases
- Cases that are near each other are said to be neighbors
- Based on the observation that similar cases with the same class labels tend to be near each other
Algo
Then we might have two questions: 1. how to select the parameter k; 2. how to measure the distance between two cases.
For Q2, we could use the Euclidean distance as follows, but there are also many other methods we could use.
For Q1, let's take income and age as the two independent variables as an example: we can run the model with different values of k and keep the one that gives the best accuracy (see the sketch below).
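A minimal KNN sketch with scikit-learn; the income/age features and labels here are randomly generated stand-ins:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(100, 2)        # hypothetical features: income, age
y = np.random.randint(0, 2, 100)  # hypothetical class labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Euclidean distance is the default metric; k (n_neighbors) is the main knob
neigh = KNeighborsClassifier(n_neighbors=4).fit(X_train, y_train)
yhat = neigh.predict(X_test)
```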
Evaluation metrics in classification
How to measure the accuracy?
Jaccard index
J(y, y_hat) = |y ∩ y_hat| / (|y| + |y_hat| - |y ∩ y_hat|)

If J(y, y_hat) = 1, it means a perfect prediction. The higher the index is, the more accurate our model is.
F1-score
Confusion matrix (which shows the model's ability to correctly classify each class):
Further, with precision prc = TP / (TP + FP) and recall rec = TP / (TP + FN),
F1-score = 2 * (prc * rec) / (prc + rec)
The higher the F1-score is, the more accurate our model is.
Comment: the Jaccard index and F1-score can be used for multi-class problems as well.
Log loss
Log loss measures the performance of a classifier whose predicted output is a probability value between 0 and 1.
Use the following formula to calculate the log loss:

LogLoss = -(1/n) Σ [ y · log(y_hat) + (1 - y) · log(1 - y_hat) ]

where y_hat is the predicted probability.
The lower the logloss is, the more accurate our model is.
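A minimal sketch of the three metrics with scikit-learn, using made-up labels and probabilities:

```python
from sklearn.metrics import jaccard_score, f1_score, log_loss

y_test = [1, 0, 1, 1, 0]               # actual labels (made up)
yhat = [1, 0, 0, 1, 0]                 # predicted labels
yhat_prob = [0.9, 0.2, 0.4, 0.8, 0.1]  # predicted P(class = 1)

print(jaccard_score(y_test, yhat))  # 1.0 would mean a perfect prediction
print(f1_score(y_test, yhat))       # higher is better
print(log_loss(y_test, yhat_prob))  # lower is better
```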
Decision Trees (DT)
What’s a decision tree?
The basic intuition is to map out all possible decision paths in the form of a tree.
For example, say we need to decide which drug patient p15 needs to take.
Suppose we have the following decision tree:
- Sex here is called an internal node; each internal node corresponds to a test
- M here is called a branch; each branch corresponds to a result of the test
- B here is called a leaf node; each leaf node assigns a classification
So how to build a DT?
Let’s talk more about this process. How to choose the attribute we used to split the tree? That is, which attribute is the best?
Let's say we used cholesterol as the attribute; then we have

Apparently it's not a good choice. If we used the sex attribute, we would have

Compared with cholesterol, using the sex attribute gives us more predictiveness, less impurity, and lower entropy. But the result for men is still not very good, so we take a further step:
A node with only one class is called a pure node.
As the internal nodes increase, the impurity decreases. The impurity is measured by entropy.
So what is entropy?
- Measure of randomness or uncertainty
- The lower the entropy, the less uniform the distribution, and the purer the node.
Mathematically,
Entropy = -p(A)log(p(A)) - p(B)log(p(B))
where p(A) and p(B) are the proportions of drug A and drug B.
For example, we could calculate the entropy before splitting. With 9 B, 5 A, and 14 in total:

E = -(9/14)·log2(9/14) - (5/14)·log2(5/14) ≈ 0.94
If we use cholesterol to split, then we have the entropy as

But if we use sex, then
Now back to our question: which one is the better choice?
Answer: we choose the split with the higher information gain.
So what is information gain?
- The information that can increase the level of certainty after splitting
- IG(information gain) = entropy before split - weighted entropy after split
- The weighted entropy ⬆️, the IG ⬇️
Therefore, we should go with the sex attribute. As we can imagine, the next step is to repeat the same process as above until we find the purest leaf nodes, as below.
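A small sketch of the entropy/information-gain arithmetic from this section; the per-branch counts after the sex split are hypothetical stand-ins for the figures in the original:

```python
import math

def entropy(counts):
    # Entropy of a node given the class counts in it
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

e_before = entropy([9, 5])  # 9 drug B, 5 drug A -> about 0.94

# Hypothetical split: sex = F -> [3 B, 4 A], sex = M -> [6 B, 1 A]
e_f, e_m = entropy([3, 4]), entropy([6, 1])
weighted = (7 / 14) * e_f + (7 / 14) * e_m  # weighted entropy after split
info_gain = e_before - weighted             # IG = before - weighted after
```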
Intro to Logistic Regression
- what is LR (logistic regression, same as below, not linear regression)?
- what problems can we solve with LR?
- in which situations should we use LR?
What is LR?
- A classification algorithm for categorical variables
- Applications:
When should we use it?
- if your data is binary: 1/0, T/F, Y/N
- if you need probabilistic results (because LR computes the probability of an object belonging to a specific class and then maps it to a class type)
- when you need a linear decision boundary (a line, plane, or hyperplane)
- if you need to understand the impact of a feature (attribute/independent variable)
Logistic Regression vs Linear Regression
Can we use linear regression to predict categorical data?
Our job is to predict whether a point is red or blue given its X (age) value. Because the output is only 0/1, we may end up with a stepwise function like this:
Note that in this case, no matter how large the input value is, the output is always 1 as long as the predicted value exceeds 0.5.
Therefore it would be nice if we could find some method to map the value of y_hat into the range [0, 1].
Fortunately, the sigmoid function can help us achieve this.
So what is sigmoid function?
- also called the logistic function: σ(z) = 1 / (1 + e^(-z))
- As θ^T x increases, e^(-θ^T x) approaches 0 and the value of the sigmoid function approaches 1; on the other hand, as θ^T x becomes very small (large and negative), the sigmoid value approaches 0
So now how to explain the output of our model if we use the sigmoid function?
Having understood the output, how to train our model to find the θ is the only question left.
- Initialize θ;
- Calculate y_hat = σ(θ^T X) for a customer;
- Compare y_hat with the customer's actual label y, and record the difference as an error;
- Calculate the error for all customers;
- Change the θ to reduce the cost;
- Go back to step 2.
Now let’s talk more details in this training process, such as how to change the parameter to have a better outcome, as well as the cost function.
To find the best parameter, we should formulate the expression of our cost, and by finding its derivative, we could know how to change θ to have a better outcome.
A general cost function could be the squared difference between the prediction and the actual value; because of its non-negativity and simplicity, we take cost(y_hat, y) = (1/2)(σ(θ^T x) - y)^2 as our per-case cost. Thus for all customers, our final cost is

J(θ) = (1/m) Σ cost(y_hat_i, y_i)

Given its complexity, it's not easy to find the global minimum of this function using derivatives. But if we could find its GM (global minimum), then we could find the best estimates of our parameters. ➡️ Gradient Descent
So we choose another cost function for which the GM is easier to find. Let's see how to find such a cost function:
Since we know y_hat always lies between 0 and 1, if we plot the desired cost as a function of y_hat, we find that -log(y_hat) provides a good fit for this curve when y = 1 (and -log(1 - y_hat) when y = 0).
Therefore, we could replace the per-case cost and the total cost function as follows:

J(θ) = -(1/m) Σ [ y_i log(y_hat_i) + (1 - y_i) log(1 - y_hat_i) ]
Now we could minimize the total cost function to find the best estimate of parameters.
And here are some Q&A regarding minimizing the cost function:
How does gradient descent work?
Important comments:
- With the new total cost function, compute its derivative with respect to each parameter separately; together these form the gradient vector
- Given a learning rate, we update the parameters at a specific speed, just like a step width (see the sketch below)
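Putting the steps together, a minimal gradient-descent sketch for logistic regression on toy data (the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.random.randn(100, 3)       # hypothetical features
y = np.random.randint(0, 2, 100)  # hypothetical 0/1 labels
theta = np.zeros(3)               # step 1: initialize theta
lr = 0.1                          # learning rate (the step width)

for _ in range(1000):
    y_hat = sigmoid(X @ theta)         # step 2: predictions for everyone
    grad = X.T @ (y_hat - y) / len(y)  # gradient of the log-loss cost
    theta -= lr * grad                 # change theta to reduce the cost
```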
Training Algorithm Recap
Intro to SVM (Support Vector Machine)
Now we have two new challenges: 1. How to transform the data into a higher dimension space? 2. How to find the best/optimized separator?
Data Transformation
The function we use to transform the data is called a kernel function. The four common types are linear, polynomial, radial basis function (RBF), and sigmoid (see the sketch below).
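As a sketch, scikit-learn's SVC exposes these four kernels directly; the toy dataset here is generated just for illustration:

```python
from sklearn import svm
from sklearn.datasets import make_classification

X_train, y_train = make_classification(n_samples=100, n_features=4)

# The kernel function is chosen with the `kernel` argument
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = svm.SVC(kernel=kernel).fit(X_train, y_train)
```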
Using SVM to find the hyperplane
Our goal is to find a hyperplane with as big a margin as possible.
The points on the boundary are called support vectors.
Mathematically, our goal is to minimize (1/2)·||w||^2 subject to y_i (w^T x_i + b) ≥ 1 for all i (the standard hard-margin formulation).
Not surprisingly, we still need to use the gradient descent. (skip for now)
Pros & Cons of SVM
Therefore when should we use SVM?
- image recognition
- text category assignment
- detecting spam
- sentiment analysis
- gene expression classification
- regression, outlier detection and clustering
Intro to Clustering
Say we have a dataset like
We need to derive segments and groups from large datasets. That is, how to use this unlabelled dataset to understand and identify how customers are similar to each other.
Clustering
Client segmentation is one of the most popular uses of clustering, but clustering also has many other applications in different domains.
What’s a cluster?
A group of objects that are similar to other objects in the cluster and dissimilar to data points in other clusters.
What’s the difference between clustering and classification?
Classification is a supervised learning algorithm to predict the label of objects.
However, clustering is an unsupervised learning algorithm, to group similar objects and assign them to different clusters.
More Applications
- Retail/marketing
  - identifying buying patterns of customers
  - recommending new books or movies to new customers
- Banking
  - fraud detection in credit card use
  - identifying clusters of customers
- Insurance
  - fraud detection in claims analysis
  - insurance risk of customers
- Publication/media
  - auto-categorizing news based on their content
  - recommending similar news articles
- Medicine
  - characterizing patient behavior
- Biology
  - clustering genetic markers to identify family ties
Why clustering?
- Exploratory data analysis
- summary generation
- outlier detection
- finding duplicates
- pre-processing step
Clustering algorithms
- Partition-based clustering (for medium/large databases)
  - relatively efficient
  - e.g., K-means, K-median, Fuzzy c-Means
- Hierarchical clustering (intuitive; good for small datasets)
  - produces trees of clusters
  - e.g., agglomerative, divisive
- Density-based clustering (good for spatial datasets or datasets with noise)
  - produces arbitrarily shaped clusters
  - e.g., DBSCAN
Intro to K-means
- Partitioning Clustering
- K-means divides the data into non-overlapping subsets without any cluster-internal structure
- objects within a cluster are very similar
- objects within different clusters are very different
The distance between samples is used to shape the clusters: K-means minimizes the intra-cluster distances and maximizes the inter-cluster distances.
So how can we compute the dissimilarity/distance? Let's look at the following examples with the Euclidean distance.
We could also use cosine similarity, average distance, etc.
Now let’s suppose our data has only two features and looks as following:
- Then to start the process, we need to initialize k points as the cluster centers. These points could be random. (We will talk more about how to select k later.)
- Distance calculation
Compute the distance between each point and all cluster centers, and put them together to form the distance matrix.
- Assign points to its closest centroid
With the SSE (sum of squared errors), we can easily compare which clustering is better. In other words, our job is to find a clustering with minimal SSE. Apparently the initial points are not good enough, so we need to move them.
- Compute the new centroids for each cluster
Use the mean of points in each cluster as the new centroids.
Repeat this process until the centroids no longer move.
Note: there is no guarantee that it will converge to the global optimum and the result may depend on the initial clusters.
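A minimal K-means sketch with scikit-learn on toy blob data; n_init re-runs the algorithm from different initial centroids, which mitigates the local-optimum issue just noted:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # toy data

k_means = KMeans(n_clusters=4, init="k-means++", n_init=10).fit(X)
labels = k_means.labels_              # cluster assignment of each point
centroids = k_means.cluster_centers_  # final centroid positions
```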
K-means Accuracy and Characteristics
Algo summary:
1. Randomly place k centroids, one for each cluster
2. Calculate the distance of each point from each centroid
3. Assign each data point to its closest centroid, creating a cluster
4. Recalculate the positions of the k centroids
5. Repeat steps 2-4 until the centroids no longer move
How to measure its accuracy?
- External: compare the clusters with the ground truth, if it is available
- Internal: average the distance between data points within a cluster
How to choose K?
Run the algorithm with different values of K and compare their accuracy (this accuracy could be the average distance of points to their centroid).
However, the truth is, as K increases, the average distance will always decrease (so this accuracy always increases). Therefore, we mainly need to find the elbow point, where the rate of change in accuracy shifts sharply. This method is called the elbow method (see the sketch below).
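A minimal elbow-method sketch: run K-means over a range of K and plot the inertia (the sum of squared distances of points to their closest centroid), then look for the elbow:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # toy data

ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")  # the bend in this curve is the elbow
plt.xlabel("K")
plt.ylabel("inertia")
plt.show()
```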
Intro to Hierarchical Clustering
Agglomerative would be our focus in this video. As the picture above shows, agglomerative starts with each observation as one cluster and then combine them one by one.
So how do we do agglomerative /əˈɡlɑːməˌreɪtɪv/ clustering?
Say we want to cluster these 6 cities in Canada, the numbers are distances between each city.
Start with 6 clusters. Then we select the minimal distance and merge those two cities, which is OT-MO at 167.
We then treat OT/MO as one cluster and re-compute the distances.
As for how to compute the distance to OT/MO, one sensible choice is to use the midpoint between the two cities as the location of OT/MO.
And then repeat this process until we reach the threshold. If there is no threshold, all of the initial clusters will eventually be merged into a single cluster.
What’s the threshold? The hierarchical clustering typically can be visualized by a dendrogram.
If we pre-define a cut level y on the dendrogram, then the hierarchy is cut into different disjoint clusters.
More about HC
Let's say our dataset has n points. Now let's summarize the agglomerative algorithm. (Note that the final "until" statement could also be "until we reach the threshold".)
If we look at this summary, we will find that our real problems are
- Which distance should we use?
- How to compute the distance, especially after merging part of them?
- How to find the lowest distance?
Take the following patient info as an example; our data is then 3-d, and here we take the Euclidean distance.
The example here shows how to compute the distance between two points, so how about the distance between two clusters?
There are a lot of choices, as shown below. Generally it depends on the data type, the dimensionality, and, most importantly, the domain knowledge of the dataset. In fact, these different distance definitions distinguish the different algorithms. A minimal example with one such choice follows.
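A minimal agglomerative sketch with SciPy on hypothetical patient features, ending in the dendrogram discussed above; the linkage method and cut threshold are arbitrary choices:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.rand(6, 3)  # e.g., 6 patients with 3-d feature vectors

# The distance choices discussed above live in `method` and `metric`
Z = linkage(X, method="average", metric="euclidean")
clusters = fcluster(Z, t=0.8, criterion="distance")  # cut at a threshold
dendrogram(Z)
plt.show()
```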
Pros VS Cons
Hierarchical Clustering vs K-means
DBSCAN Clustering
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
Density-based clustering
K-Means vs Density-based clustering
About DBSCAN
- DBSCAN is one of the density-based clustering methods.
- DBSCAN: Density-Based Spatial Clustering of Applications with Noise
- It can find arbitrarily shaped clusters without being affected by noise
- Based on two parameters: R (radius) and M (minimum number of neighbors)

Points that meet the requirements of R and M are called core points; points that can be reached by a core point but have fewer than M neighbors themselves are called border points. A point that is neither a core point nor a border point is called an outlier.
Advantages of DBSCAN
- Arbitrarily shaped clusters
- Robust to outliers
- Does not require specification of the number of clusters
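A minimal DBSCAN sketch with scikit-learn; eps plays the role of R and min_samples the role of M, and the moon-shaped toy data shows off the arbitrary-shape advantage:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05)  # arbitrarily shaped toy clusters

db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_  # cluster index per point; -1 marks outliers (noise)
```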
Intro to Recommender Systems
What are recommender systems?
Recommender systems capture the pattern of peoples’ behavior and use it to predict what else you might like.
Applications
- what to buy?
- where to eat?
- which job to apply for?
- whom to be friends with?
- personalize your experience on the web
Advantages
- Broader exposure
- Continual usage/purchase of products
- provide a better experience
Content-based recommender system
- Recommend based on user’s profile
- based on similarity of user’s liked/disliked items
Take the movie recommendation as an example:
Then normalize the weights to define the user's profile.
Then use the user's profile to find the most suitable movie for the user, as in the sketch below.
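A minimal content-based sketch with NumPy; the genre matrix and ratings are made up for illustration:

```python
import numpy as np

# Rows: 3 movies the user has rated; columns: genres (comedy, adventure, drama)
genres = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [0, 1, 1]])
ratings = np.array([8, 6, 9])  # the user's ratings of those movies

profile = ratings @ genres         # weighted genre totals
profile = profile / profile.sum()  # normalized user profile

# Score unseen candidate movies by how well they match the profile
candidates = np.array([[0, 1, 1],
                       [1, 0, 0]])
scores = candidates @ profile  # recommend the highest-scoring movie
```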
But this method doesn't work if the user has never seen a specific type of movie, such as drama. That is, a content-based RS won't recommend types of movies that the user has never watched before.
Therefore, in this case we need to introduce other methods such as collaborative filtering.
Collaborative Filtering
Two types: user-based collaborative filtering (based on users' similarity) and item-based collaborative filtering (based on items' similarity).
Suppose we already know that the correlations between the target user and three other users are 0.7, 0.9, and 0.4, and we want to calculate the expected scores for that target user.
Then calculate the weighted rating matrix for the movie of interest:
Then we divide the sum of the weighted ratings for each movie by the sum of the weights of the users who rated it (because not all movies have every user's weighted rating, we need to eliminate this influence).
Actually both collaborative filtering types work in a similar way; the difference is that user-based filtering relies on users' similarity while item-based filtering relies on items' similarity, as in the sketch below.
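A minimal user-based sketch with NumPy following the weighted-rating steps above; the similarities and ratings are made up:

```python
import numpy as np

sims = np.array([0.7, 0.9, 0.4])    # similarity of 3 users to the target user
ratings = np.array([[9.0, np.nan],  # each row: one user's ratings of 2 movies
                    [8.0, 4.0],     # NaN = that user has not rated the movie
                    [np.nan, 7.0]])

weighted = sims[:, None] * ratings  # weighted rating matrix
rated = ~np.isnan(ratings)          # who actually rated each movie
expected = np.nansum(weighted, axis=0) / (sims[:, None] * rated).sum(axis=0)
```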