In this blog:
- make predictions, such as predicting the right drug using decision trees…
- recommendation systems
- use Python libraries to build ML models
- use the built-in lab environment
Skills: regression, classification, clustering, scikit-learn, SciPy …
Projects: cancer detection, predicting economic trends, customer churn, recommendation systems …
Table of Contents
- Intro to ML
- Intro to Regression
- Intro to Classification
- Intro to Clustering
- Intro to Recommender Systems
Intro to ML
Project: is this cell benign or malignant?
To make predictions, we need to collect the data and train our models to understand the pattern.
Machine learning is the subfield of computer science that gives “computers the ability to learn without being explicitly programmed.” - Arthur Samuel
So how does machine learning work?
Examples
- Recommendation systems: Amazon, Netflix
- Loan application
- Telecommunication companies predicting whether customers will unsubscribe next month
Major tech in ML:
- Regression/estimation: predict continuous values
- Classification: predict the item class/category of a case
- Clustering: finding the structure of data; summarization
- Associations: associating frequent co-occurring items/events
- Anomaly detection: discovering abnormal and unusual cases
- Sequence mining: predicting next events; click-stream (Markov Model, HMM)
- Dimension Reduction: Reducing the size of data (PCA)
- Recommendation Systems: recommending items
Difference between AI, ML and DL
Python for ML
Libraries for ML
Scikit-learn functions
This is just to show that an ML algorithm can be implemented in a few lines of code. Don’t worry if you don’t understand them yet.
```python
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import confusion_matrix
import pickle

# Standardize the features (zero mean, unit variance)
X = preprocessing.StandardScaler().fit(X).transform(X)

# Hold out 33% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

# Train a support vector classifier, then predict on the test set
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(X_train, y_train)
yhat = clf.predict(X_test)

# Evaluate the predictions
print(confusion_matrix(y_test, yhat, labels=[1, 0]))

# Serialize the trained model
s = pickle.dumps(clf)
```
Supervised vs unsupervised
Supervised: teach the machine with labeled data.
Speaking of the data, each row (aka observation) has several columns (aka attributes).
Two types of supervised learning: classification (to predict discrete values) and regression (to predict continuous values).
Unsupervised: no labels anymore. The model works on its own to discover information.
Techniques in Unsupervised learning:
- Dimension reduction (feature selection)
- Density estimation
- Market basket analysis
- Clustering
Clustering groups data points or objects that are somehow similar. It is used for discovering structure, summarization, and anomaly detection.
Summary
Intro to Regression
What’s regression?
Regression is the process of predicting a continuous value using other variables.
X: Independent variable, Y: Dependent variable.
Types of regression
- Simple regression (one independent variable)
  - Simple linear regression
  - Simple non-linear regression
- Multiple regression
  - Multiple linear regression
  - Multiple non-linear regression
Applications of regression
- Sales forecasting
- Satisfaction analysis
- Price estimation
- Employment income
Algorithms
- Ordinal regression
- Poisson regression
- Fast forest quantile regression
- Linear, Polynomial, Lasso, Stepwise, Ridge regression
- Bayesian linear regression
- Neural network regression
- Decision forest regression
- Boosted decision tree regression
- KNN (K-nearest neighbors)
Simple Linear Regression
Then fit a line through the data:
To do predictions, just find the corresponding Y-axis value given the X-axis value.
Model Expression
So how to find the best fit?
Firstly, the best-fit line should minimize the MSE (mean squared error):

MSE = (1/n) Σ (y_i - y_hat_i)^2

So to find the best estimates of θ_0 and θ_1, we have two options: 1. a math (closed-form) method; 2. an optimization method.
- Math (see the sketch below)
- Optimization (skip)
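A minimal sketch of the math option with NumPy, using made-up x/y arrays; the closed-form least-squares estimates come directly from the data:

```python
import numpy as np

# Toy data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# theta_1 = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
theta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# theta_0 = y_mean - theta_1 * x_mean
theta_0 = y.mean() - theta_1 * x.mean()

y_hat = theta_0 + theta_1 * x  # predictions from the fitted line
```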
Pros
- Very fast
- No parameter tuning
- Easy to understand and highly interpretable
Model Evaluation
- Train and test on the same dataset
- Train/test split
How to measure the accuracy of our model?
Compare the actual (dependent) values with predicted ones.
We could use the one-norm error to measure this difference:
What’s training & out-of-sample accuracy?
K-fold Cross Validation
Evaluation Metrics
- MAE (mean absolute error)
- MSE (mean squared error)
- RMSE (root mean squared error)
Their definitions are (with y_i the actual value, y_hat_i the prediction, ȳ the mean of the actual values, and n the number of samples):

MAE = (1/n) Σ |y_i - y_hat_i|

MSE = (1/n) Σ (y_i - y_hat_i)^2

RMSE = √MSE

Relative absolute error: RAE = Σ |y_i - y_hat_i| / Σ |y_i - ȳ|

Relative squared error: RSE = Σ (y_i - y_hat_i)^2 / Σ (y_i - ȳ)^2

R-squared error: R² = 1 - RSE
Multiple LR
Two applications:
- Independent variables' effectiveness on prediction
  - Do revision time, test anxiety, lecture attendance, and gender have any effect on students' exam performance?
- Predicting impacts of changes
  - How much does blood pressure go up or down for every unit increase (or decrease) in a patient's BMI?
Mathematically,

y_hat = θ_0 + θ_1 x_1 + θ_2 x_2 + … + θ_n x_n

Or, in vector form,

y_hat = θ^T X

where θ = (θ_0, θ_1, …, θ_n) is the vector of parameters and X = (1, x_1, …, x_n) is the vector of features.
Optimal parameters: minimize the error of our model, measured by, e.g., MSE or MAE.
After finding good estimates of the parameters, prediction is just a piece of cake.
How many independent variables should we use?
- Adding too many independent variables without theoretical justification will make our model overfit (not general enough for prediction)
Should the independent variable be continuous?
- Categorical variables can be converted into numeric variables (e.g., dummy coding) to work with linear regression
How to check the linear regression relationship?
- A scatter plot is one of the easiest ways.
Intro of Non-Linear Regression
We could use a scatter plot to check whether the relationship is linear or not.
The regression above could be called polynomial regression in general.
Polynomial regression fits curvy data.
However, a polynomial regression can be transformed into a linear regression model.
Then we could apply least squares to the transformed "linear" regression, as in the sketch below.
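As a minimal sketch (with toy quadratic data), scikit-learn's PolynomialFeatures performs exactly this transformation, after which ordinary linear regression applies:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 1 + 2 * x.ravel() + 0.5 * x.ravel() ** 2  # curvy (quadratic) toy data

# Transform x into [1, x, x^2]; the model is now linear in these features
x_poly = PolynomialFeatures(degree=2).fit_transform(x)
model = LinearRegression().fit(x_poly, y)
y_hat = model.predict(x_poly)
```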
Intro to Classification
- Supervised-learning
- Target attribute is a category/discrete
Eg, the bank can classify a user into the default group (value = 1) or the NOT-default group (value = 0) based on the user's information.
The target/dependent variable could also have multiple (>2) values.
Classification algorithms in ML:
- Decision trees (ID3, C4.5, C5.0)
- Naive Bayes
- Linear Discriminant Analysis
- K-Nearest Neighbor (KNN)
- Logistic Regression
- Neural Networks
- Support Vector Machines (SVM)
KNN
Given an example,
What’s KNN?
- A method for classifying cases based on their similarity to other cases
- Cases that are near each other are said to be neighbors
- Based on the observation that similar cases with the same class labels tend to be near each other
Algo
Then we might have two questions: 1. how to select the parameter k; 2. how to measure the distance between two cases.
For Q2, we could use the Euclidean distance as follows, but there are also many other methods we could use.
For Q1, let's take income and age as the two independent variables as an example: we can run the model with different values of k and keep the one that gives the best accuracy (see the sketch below).
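A minimal KNN sketch with scikit-learn; the income/age features and labels here are randomly generated stand-ins:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(100, 2)        # hypothetical features: income, age
y = np.random.randint(0, 2, 100)  # hypothetical class labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Euclidean distance is the default metric; k (n_neighbors) is the main knob
neigh = KNeighborsClassifier(n_neighbors=4).fit(X_train, y_train)
yhat = neigh.predict(X_test)
```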
Evaluation metrics in classification
How to measure the accuracy?
Jaccard index
J(y, y_hat) = |y ∩ y_hat| / (|y| + |y_hat| - |y ∩ y_hat|)

If J(y, y_hat) = 1, it means a perfect prediction. The higher the index is, the more accurate our model is.
F1-score
Confusion matrix (which shows the model's ability to correctly classify each class):
Further, with precision prc = TP / (TP + FP) and recall rec = TP / (TP + FN),
F1-score = 2 * (prc * rec) / (prc + rec)
The higher the F1-score is, the more accurate our model is.
Comment: the Jaccard index and F1-score can be used for multi-class problems as well.
Log loss
Log loss measures the performance of a classifier whose predicted output is a probability value between 0 and 1.
Use the following formula to calculate the log loss:

LogLoss = -(1/n) Σ [ y · log(y_hat) + (1 - y) · log(1 - y_hat) ]

where y_hat is the predicted probability.
The lower the logloss is, the more accurate our model is.
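A minimal sketch of the three metrics with scikit-learn, using made-up labels and probabilities:

```python
from sklearn.metrics import jaccard_score, f1_score, log_loss

y_test = [1, 0, 1, 1, 0]               # actual labels (made up)
yhat = [1, 0, 0, 1, 0]                 # predicted labels
yhat_prob = [0.9, 0.2, 0.4, 0.8, 0.1]  # predicted P(class = 1)

print(jaccard_score(y_test, yhat))  # 1.0 would mean a perfect prediction
print(f1_score(y_test, yhat))       # higher is better
print(log_loss(y_test, yhat_prob))  # lower is better
```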
Decision Trees (DT)
What’s a decision tree?
The basic intuition is to map out all possible decision paths in the form of a tree.
For example, say we need to decide which drug patient p15 needs to take.
Suppose we have the following decision tree:
- Sex here is called an internal node; each internal node corresponds to a test
- M here is called a branch; each branch corresponds to a result of the test
- B here is called a leaf node; each leaf node assigns a classification
So how to build a DT?
Let’s talk more about this process. How to choose the attribute we used to split the tree? That is, which attribute is the best?
Let's say we used cholesterol as the attribute; then we have

Apparently it's not a good choice. If we used the sex attribute, we would have

Compared with cholesterol, using the sex attribute gives us more predictiveness, less impurity, and lower entropy. But the result for men is still not very good, so we take a further step:
A node with only one class is called a pure node.
As the internal nodes increase, the impurity decreases. The impurity is measured by entropy.
So what is entropy?
- Measure of randomness or uncertainty
- The lower the entropy, the less uniform the distribution, and the purer the node.
Mathematically,
Entropy = -p(A)log(p(A)) - p(B)log(p(B))
where p(A) and p(B) are the proportions of drug A and drug B.
For example, we could calculate the entropy before splitting. With 9 B, 5 A, and 14 in total:

E = -(9/14)·log2(9/14) - (5/14)·log2(5/14) ≈ 0.94
If we use cholesterol to split, then we have the entropy as

But if we use sex, then
Now back to our question: which one is the better choice?
Answer: we choose the split with the higher information gain.
So what is information gain?
- The information that can increase the level of certainty after splitting
- IG(information gain) = entropy before split - weighted entropy after split
- The weighted entropy ⬆️, the IG ⬇️
Therefore, we should go with the sex attribute. As we can imagine, the next step is to repeat the same process as above until we find the purest leaf nodes, as below.
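A small sketch of the entropy/information-gain arithmetic from this section; the per-branch counts after the sex split are hypothetical stand-ins for the figures in the original:

```python
import math

def entropy(counts):
    # Entropy of a node given the class counts in it
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

e_before = entropy([9, 5])  # 9 drug B, 5 drug A -> about 0.94

# Hypothetical split: sex = F -> [3 B, 4 A], sex = M -> [6 B, 1 A]
e_f, e_m = entropy([3, 4]), entropy([6, 1])
weighted = (7 / 14) * e_f + (7 / 14) * e_m  # weighted entropy after split
info_gain = e_before - weighted             # IG = before - weighted after
```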
Intro to Logistic Regression
- what is LR (logistic regression, same as below, not linear regression)?
- what problems can we solve with LR?
- in which situations should we use LR?
What is LR?
- A classification algorithm for categorical variables
- Applications:
When should we use it?
- if your data is binary: 1/0, T/F, Y/N
- if you need probabilistic results (because LR computes the probability of an object belonging to a specific class and then maps it to a class type)
- when you need a linear decision boundary (a line, plane, or hyperplane)
- if you need to understand the impact of a feature (attribute/independent variable)
Logistic Regression vs Linear Regression
Can we use linear regression to predict categorical data?
Our job is to predict whether a point is red or blue given its X (age) value. Because the output is only 0/1, we may end up with a stepwise function like this:
Note that in this case, no matter how large the input value is, the output is always 1 as long as the predicted value exceeds 0.5.
Therefore it would be nice if we could find some method to map the value of y_hat into the range [0, 1].
Fortunately, the sigmoid function can help us achieve this.
So what is sigmoid function?
- also called the logistic function: σ(z) = 1 / (1 + e^(-z))
- As θ^T x increases, e^(-θ^T x) approaches 0 and the value of the sigmoid function approaches 1; on the other hand, as θ^T x becomes very small (large and negative), the sigmoid value approaches 0
So now how to explain the output of our model if we use the sigmoid function?
Having understood the output, how to train our model to find the θ is the only question left.
- Initialize θ;
- Calculate y_hat = σ(θ^T X) for a customer;
- Compare y_hat with the customer's actual label y, and record the difference as an error;
- Calculate the error for all customers;
- Change the θ to reduce the cost;
- Go back to step 2.
Now let’s talk more details in this training process, such as how to change the parameter to have a better outcome, as well as the cost function.
To find the best parameter, we should formulate the expression of our cost, and by finding its derivative, we could know how to change θ to have a better outcome.
A general cost function could be the squared difference between the prediction and the actual value; because of its non-negativity and simplicity, we take cost(y_hat, y) = (1/2)(σ(θ^T x) - y)^2 as our per-case cost. Thus for all customers, our final cost is

J(θ) = (1/m) Σ cost(y_hat_i, y_i)

Given its complexity, it's not easy to find the global minimum of this function using derivatives. But if we could find its GM (global minimum), then we could find the best estimates of our parameters. ➡️ Gradient Descent
So we choose another cost function for which the GM is easier to find. Let's see how to find such a cost function:
Since we know y_hat always lies between 0 and 1, if we plot the desired cost as a function of y_hat, we find that -log(y_hat) provides a good fit for this curve when y = 1 (and -log(1 - y_hat) when y = 0).
Therefore, we could replace the per-case cost and the total cost function as follows:

J(θ) = -(1/m) Σ [ y_i log(y_hat_i) + (1 - y_i) log(1 - y_hat_i) ]
Now we could minimize the total cost function to find the best estimate of parameters.
And here are some Q&A regarding minimizing the cost function:
How does gradient descent work?
Important comments:
- With the new total cost function, compute its derivative with respect to each parameter separately; together these form the gradient vector
- Given a learning rate, we update the parameters at a specific speed, just like a step width (see the sketch below)
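Putting the steps together, a minimal gradient-descent sketch for logistic regression on toy data (the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.random.randn(100, 3)       # hypothetical features
y = np.random.randint(0, 2, 100)  # hypothetical 0/1 labels
theta = np.zeros(3)               # step 1: initialize theta
lr = 0.1                          # learning rate (the step width)

for _ in range(1000):
    y_hat = sigmoid(X @ theta)         # step 2: predictions for everyone
    grad = X.T @ (y_hat - y) / len(y)  # gradient of the log-loss cost
    theta -= lr * grad                 # change theta to reduce the cost
```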
Training Algorithm Recap
Intro to SVM (Support Vector Machine)
Now we have two new challenges: 1. How to transform the data into a higher dimension space? 2. How to find the best/optimized separator?
Data Transformation
The function we use to transform the data is called a kernel function. The four common types are linear, polynomial, radial basis function (RBF), and sigmoid (see the sketch below).
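As a sketch, scikit-learn's SVC exposes these four kernels directly; the toy dataset here is generated just for illustration:

```python
from sklearn import svm
from sklearn.datasets import make_classification

X_train, y_train = make_classification(n_samples=100, n_features=4)

# The kernel function is chosen with the `kernel` argument
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = svm.SVC(kernel=kernel).fit(X_train, y_train)
```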
Using SVM to find the hyperplane
Our goal is to find a hyperplane with as big a margin as possible.
The points on the boundary are called support vectors.
Mathematically, our goal is to minimize (1/2)·||w||^2 subject to y_i (w^T x_i + b) ≥ 1 for all i (the standard hard-margin formulation).
Not surprisingly, we still need to use the gradient descent. (skip for now)
Pros & Cons of SVM
Therefore when should we use SVM?
- image recognition
- text category assignment
- detecting spam
- sentiment analysis
- gene expression classification
- regression, outlier detection and clustering
Intro to Clustering
Say we have a dataset like
We need to derive segments and groups from large datasets. That is, how to use this unlabelled dataset to understand and identify how customers are similar to each other.
Clustering
Client segmentation is one of the most popular uses of clustering, but clustering also has many other applications in different domains.
What’s a cluster?
A group of objects that are similar to other objects in the cluster and dissimilar to data points in other clusters.
What’s the difference between clustering and classification?
Classification is a supervised learning algorithm to predict the label of objects.
However, clustering is an unsupervised learning algorithm, to group similar objects and assign them to different clusters.
More Applications
- Retail/marketing
  - identifying buying patterns of customers
  - recommending new books or movies to new customers
- Banking
  - fraud detection in credit card use
  - identifying clusters of customers
- Insurance
  - fraud detection in claims analysis
  - insurance risk of customers
- Publication/media
  - auto-categorizing news based on their content
  - recommending similar news articles
- Medicine
  - characterizing patient behavior
- Biology
  - clustering genetic markers to identify family ties
Why clustering?
- Exploratory data analysis
- summary generation
- outlier detection
- finding duplicates
- pre-processing step
Clustering algorithms
- Partition-based clustering (for medium/large databases)
  - relatively efficient
  - e.g., K-means, K-median, Fuzzy c-Means
- Hierarchical clustering (intuitive; good for small datasets)
  - produces trees of clusters
  - e.g., agglomerative, divisive
- Density-based clustering (good for spatial datasets or datasets with noise)
  - produces arbitrarily shaped clusters
  - e.g., DBSCAN
Intro to K-means
- Partitioning Clustering
- K-means divides the data into non-overlapping subsets without any cluster-internal structure
- objects within a cluster are very similar
- objects within different clusters are very different
The distance between samples is used to shape the clusters: K-means minimizes the intra-cluster distances and maximizes the inter-cluster distances.
So how can we compute the dissimilarity/distance? Let's look at the following examples with the Euclidean distance.
We could also use cosine similarity, average distance, etc.
Now let’s suppose our data has only two features and looks as following:
- Then to start the process, we need to initialize k points as the cluster centers. These points could be random. (We will talk more about how to select k later.)
- Distance calculation
Compute the distance between each point and all cluster centers, and put them together to form the distance matrix.
- Assign points to its closest centroid
With the SSE (sum of squared errors), we can easily compare which clustering is better. In other words, our job is to find a clustering with minimal SSE. Apparently the initial points are not good enough, so we need to move them.
- Compute the new centroids for each cluster
Use the mean of points in each cluster as the new centroids.
Repeat this process until the centroids no longer move.
Note: there is no guarantee that it will converge to the global optimum and the result may depend on the initial clusters.
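A minimal K-means sketch with scikit-learn on toy blob data; n_init re-runs the algorithm from different initial centroids, which mitigates the local-optimum issue just noted:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # toy data

k_means = KMeans(n_clusters=4, init="k-means++", n_init=10).fit(X)
labels = k_means.labels_              # cluster assignment of each point
centroids = k_means.cluster_centers_  # final centroid positions
```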
K-means Accuracy and Characteristics
Algo summary:
1. Randomly place k centroids, one for each cluster
2. Calculate the distance of each point from each centroid
3. Assign each data point to its closest centroid, creating a cluster
4. Recalculate the positions of the k centroids
5. Repeat steps 2-4 until the centroids no longer move
How to measure its accuracy?
- External: compare the clusters with the ground truth, if it is available
- Internal: average the distance between data points within a cluster
How to choose K?
Run the algorithm with different values of K and compare their accuracy (this accuracy could be the average distance of points to their centroid).
However, the truth is, as K increases, the average distance will always decrease (so this accuracy always increases). Therefore, we mainly need to find the elbow point, where the rate of change in accuracy shifts sharply. This method is called the elbow method (see the sketch below).
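A minimal elbow-method sketch: run K-means over a range of K and plot the inertia (the sum of squared distances of points to their closest centroid), then look for the elbow:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # toy data

ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")  # the bend in this curve is the elbow
plt.xlabel("K")
plt.ylabel("inertia")
plt.show()
```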
Intro to Hierarchical Clustering
Agglomerative would be our focus in this video. As the picture above shows, agglomerative starts with each observation as one cluster and then combine them one by one.
So how do we do agglomerative /əˈɡlɑːməˌreɪtɪv/ clustering?
Say we want to cluster these 6 cities in Canada, the numbers are distances between each city.
Start with 6 clusters. Then we select the minimal distance and merge those two cities, which is OT-MO at 167.
We then treat OT/MO as one cluster and re-compute the distances.
As for how to compute the distance to OT/MO, one sensible choice is to use the midpoint between the two cities as the location of OT/MO.
And then repeat this process until we reach the threshold. If there is no threshold, all of the initial clusters will eventually be merged into a single cluster.
What’s the threshold? The hierarchical clustering typically can be visualized by a dendrogram.
If we pre-define a cut level y on the dendrogram, then the hierarchy is cut into different disjoint clusters.
More about HC
Let's say our dataset has n points. Now let's summarize the agglomerative algorithm. (Note that the final "until" statement could also be "until we reach the threshold".)
If we look at this summary, we will find that our real problems are
- Which distance should we use?
- How to compute the distance, especially after merging part of them?
- How to find the lowest distance?
Take the following patient info as an example; our data is then 3-d, and here we take the Euclidean distance.
The example here shows how to compute the distance between two points, so how about the distance between two clusters?
There are a lot of choices, as shown below. Generally it depends on the data type, the dimensionality, and, most importantly, the domain knowledge of the dataset. In fact, these different distance definitions distinguish the different algorithms. A minimal example with one such choice follows.
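A minimal agglomerative sketch with SciPy on hypothetical patient features, ending in the dendrogram discussed above; the linkage method and cut threshold are arbitrary choices:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.rand(6, 3)  # e.g., 6 patients with 3-d feature vectors

# The distance choices discussed above live in `method` and `metric`
Z = linkage(X, method="average", metric="euclidean")
clusters = fcluster(Z, t=0.8, criterion="distance")  # cut at a threshold
dendrogram(Z)
plt.show()
```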
Pros VS Cons
Hierarchical Clustering vs K-means
DBSCAN Clustering
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
Density-based clustering
K-Means vs Density-based clustering
About DBSCAN
- DBSCAN is one of the density-based clustering methods.
- DBSCAN: Density-Based Spatial Clustering of Applications with Noise
- It can find arbitrarily shaped clusters without being affected by noise
- Based on two parameters: R (radius) and M (minimum number of neighbors)

Points that meet the requirements of R and M are called core points; points that can be reached by a core point but have fewer than M neighbors themselves are called border points. A point that is neither a core point nor a border point is called an outlier.
Advantages of DBSCAN
- Arbitrarily shaped clusters
- Robust to outliers
- Does not require specification of the number of clusters
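A minimal DBSCAN sketch with scikit-learn; eps plays the role of R and min_samples the role of M, and the moon-shaped toy data shows off the arbitrary-shape advantage:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05)  # arbitrarily shaped toy clusters

db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_  # cluster index per point; -1 marks outliers (noise)
```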
Intro to Recommender Systems
What are recommender systems?
Recommender systems capture the pattern of peoples’ behavior and use it to predict what else you might like.
Applications
- what to buy?
- where to eat?
- which job to apply for?
- whom to be friends with?
- personalize your experience on the web
Advantages
- Broader exposure
- Continual usage/purchase of products
- provide a better experience
Content-based recommender system
- Recommend based on user’s profile
- based on similarity of user’s liked/disliked items
Take the movie recommendation as an example:
Then normalize the weights to define the user's profile.
Then use the user's profile to find the most suitable movie for the user, as in the sketch below.
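A minimal content-based sketch with NumPy; the genre matrix and ratings are made up for illustration:

```python
import numpy as np

# Rows: 3 movies the user has rated; columns: genres (comedy, adventure, drama)
genres = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [0, 1, 1]])
ratings = np.array([8, 6, 9])  # the user's ratings of those movies

profile = ratings @ genres         # weighted genre totals
profile = profile / profile.sum()  # normalized user profile

# Score unseen candidate movies by how well they match the profile
candidates = np.array([[0, 1, 1],
                       [1, 0, 0]])
scores = candidates @ profile  # recommend the highest-scoring movie
```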
But this method doesn't work if the user has never seen a specific type of movie, such as drama. That is, a content-based RS won't recommend types of movies that the user has never watched before.
Therefore, in this case we need to introduce other methods such as collaborative filtering.
Collaborative Filtering
Two types: user-based collaborative filtering (based on users' similarity) and item-based collaborative filtering (based on items' similarity).
Suppose we already know that the correlations between the target user and three other users are 0.7, 0.9, and 0.4, and we want to calculate the expected scores for that target user.
Then calculate the weighted rating matrix for the movie of interest:
Then we divide the sum of the weighted ratings for each movie by the sum of the weights of the users who rated it (because not all movies have every user's weighted rating, we need to eliminate this influence).
Actually both collaborative filtering types work in a similar way; the difference is that user-based filtering relies on users' similarity while item-based filtering relies on items' similarity, as in the sketch below.
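A minimal user-based sketch with NumPy following the weighted-rating steps above; the similarities and ratings are made up:

```python
import numpy as np

sims = np.array([0.7, 0.9, 0.4])    # similarity of 3 users to the target user
ratings = np.array([[9.0, np.nan],  # each row: one user's ratings of 2 movies
                    [8.0, 4.0],     # NaN = that user has not rated the movie
                    [np.nan, 7.0]])

weighted = sims[:, None] * ratings  # weighted rating matrix
rated = ~np.isnan(ratings)          # who actually rated each movie
expected = np.nansum(weighted, axis=0) / (sims[:, None] * rated).sum(axis=0)
```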