In this blog:

  • make predictions, such as predicting the right drug using decision trees…
  • recommendation systems
  • use Python libraries to build ML models
  • use the built-in lab environment

Skills: regression, classification, clustering, scikit-learn, SciPy …

Projects: cancer detection, predicting economic trends, customer churn, recommendation systems …

Table of Contents

  1. Intro to ML
  2. Intro to Regression
  3. Intro to Classification
  4. Intro to Clustering
  5. Intro to Recommender Systems

Intro to ML

Project: is this cell benign or malignant?

image

To make predictions, we need to collect the data and train our models to understand the pattern.

image

Machine learning is the subfield of computer science that gives “computers the ability to learn without being explicitly programmed.” - Arthur Samuel

So how does machine learning work?

image

Examples

  1. Recommendation systems: Amazon, Netflix
  2. Loan application
  3. Telecommunication companies predict whether customers will unsubscribe next month

Major techniques in ML:

  1. Regression/estimation: predict continuous values
  2. Classification: predict the item class/category of a case
  3. Clustering: finding the structure of data; summarization
  4. Associations: associating frequent co-occurring items/events
  5. Anomaly detection: discovering abnormal and unusual cases
  6. Sequence mining: predicting next events; click-stream (Markov Model, HMM)
  7. Dimension Reduction: Reducing the size of data (PCA)
  8. Recommendation Systems: recommending items

Difference between AI, ML and DL

image

Python for ML

Libraries for ML

image

image

Scikit-learn functions

This is just to show that an ML algorithm can be implemented in a few lines of code. Don't worry if you don't understand them yet.

from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X)  # standardize features

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100.)  # support vector classifier

clf.fit(X_train, y_train)

yhat = clf.predict(X_test)

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, yhat, labels=[1, 0]))

import pickle
s = pickle.dumps(clf)  # serialize the trained model

Supervised vs unsupervised

Supervised: teach the machine with labeled data.

Speaking of the data, each row (aka observation) has several columns (aka attributes).

Two types of supervised learning: classification (to predict discrete values) and regression (to predict continuous values).

Unsupervised: no labels anymore. Models work on their own to discover information.

Techniques in Unsupervised learning:

  1. Dimension reduction (feature selection)
  2. density estimation
  3. market basket analysis
  4. clustering

Clustering is the grouping of data points or objects that are somehow similar. It is used for discovering structure, summarization, and anomaly detection.

Summary

image

Intro to Regression

What’s regression?

image

Regression is the process of predicting a continuous value using other variables.

X: Independent variable, Y: Dependent variable.

image

Types of regression

  1. Simple regression (one independent variable)
     • Simple linear regression
     • Simple non-linear regression
  2. Multiple regression
     • Multiple linear regression
     • Multiple non-linear regression

Applications of regression

  1. Sales forecasting
  2. Satisfaction analysis
  3. Price estimation
  4. Employment income

Algorithms

  1. Ordinal regression
  2. Poisson regression
  3. Fast forest quantile regression
  4. Linear, Polynomial, Lasso, Stepwise, Ridge regression
  5. Bayesian linear regression
  6. Neural network regression
  7. Decision forest regression
  8. Boosted decision tree regression
  9. KNN (K-nearest neighbors)

Simple Linear Regression

image

Then fit a line through the data:

image

To do predictions, just find the corresponding Y-axis value given the X-axis value.

Model Expression

So how to find the best fit?

Firstly, the best fit line should minimize the MSE (mean square error):
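
For reference, the standard definition is MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)², where ŷᵢ = θ₀ + θ₁xᵢ is the value predicted by the line.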

So to find the best estimates of theta_0 and theta_1, we have two options: 1. a math (closed-form) method; 2. an optimization method.

  1. Math

image

image

  2. Optimization (skip)

Pros

  • Very fast
  • No parameter tuning
  • Easy to understand and highly interpretable

Model Evaluation

  • Train and test on the same dataset
  • Train/test split

How to measure the accuracy of our model?

Compare the actual (dependent) values with predicted ones.

We could use the one-norm error to measure this difference:

image

What’s training & out-of-sample accuracy?

image

image

K-fold Cross Validation

image

Evaluation Metrics

  • MAE (mean absolute error)
  • MSE (mean squared error)
  • RMSE (root mean squared error)

Their definitions are

image

image

image
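
In plain text, the standard definitions are MAE = (1/n) Σ |yᵢ − ŷᵢ|, MSE = (1/n) Σ (yᵢ − ŷᵢ)², and RMSE = √MSE.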

Relative absolute error: image

Relative squared error: image

R-squared error: image
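
In plain text: RAE = Σ|yᵢ − ŷᵢ| / Σ|yᵢ − ȳ|, RSE = Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)², and R² = 1 − RSE; the higher the R², the better the fit.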

Multiple LR

image

Two applications:

  • Independent variables effectiveness on prediction
    • Do revision time, test anxiety, lecture attendance and gender have any effect on the exam performance of students?
  • Predict impacts of changes
    • How much does blood pressure go up or down for every unit increase (or decrease) in the BMI of a patient?

Mathematically,

ŷ = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ

Or, in vector form,

ŷ = θᵀX

where θ is the parameter (weight) vector and X is the feature vector (with x₀ = 1).

Optimal parameters: minimize the error of our model, such as MSE or MAE.

image

After finding good estimation of parameters, prediction is just a piece of cake.
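
To make this concrete, here is a minimal scikit-learn sketch of a multiple linear regression; the data is made up for illustration:

from sklearn.linear_model import LinearRegression
import numpy as np

# hypothetical data: each row is an observation, each column a feature
X = np.array([[2.0, 1.0], [3.5, 0.0], [1.5, 1.0], [4.0, 0.0]])
y = np.array([10.0, 12.5, 9.0, 14.0])

lr = LinearRegression().fit(X, y)
print(lr.intercept_, lr.coef_)    # theta_0 and theta_1..theta_n
print(lr.predict([[3.0, 1.0]]))   # predict a new observation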

How many independent variables should we use?

  • Adding too many independent variables without theoretical justification will cause our model to overfit (not general enough for prediction)

Should the independent variable be continuous?

  • Categorical variables can be converted into numeric variables (e.g., dummy/one-hot encoding) to work with linear regression

How to check the linear regression relationship?

  • A scatter plot is one of the easiest ways.

Intro to Non-Linear Regression

We can use a scatter plot to check whether the relationship is linear or not.

image

The regression above could be called polynomial regression in general.

Polynomial: curvy data.

However, a polynomial regression can be transformed into a linear regression model.

image

Then we could apply the least square to the transformed ‘linear’ regression.

image
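
As a sketch of that transformation (hypothetical data, single feature x), PolynomialFeatures expands x into [1, x, x²], after which ordinary least squares applies:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import numpy as np

x = np.array([[1.0], [2.0], [3.0], [4.0]])    # hypothetical data
y = np.array([2.1, 4.9, 10.2, 17.8])          # roughly quadratic

poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(x)                # columns: 1, x, x^2
model = LinearRegression().fit(x_poly, y)     # now it's linear regression
print(model.predict(poly.transform([[5.0]])))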

Intro to Classification

  • Supervised-learning
  • Target attribute is a category/discrete

E.g., the bank can classify a user into the default group (value = 1) or the NOT-default group (value = 0) based on the user's information.

image

The target/dependent variable could also have multiple (>2) values.

Classification algorithms in ML:

  • Decision trees (ID3, C4.5, C5.0)
  • Naive Bayes
  • Linear Discriminant Analysis
  • K-Nearest Neighbor (KNN)
  • Logistic Regression
  • Neural Networks
  • Support Vector Machines (SVM)

KNN

Given an example,

image

What’s KNN?

  • A method for classifying cases based on their similarity to other cases
  • Cases that are near each other are said to be neighbors
  • Based on the idea that similar cases with the same class labels are near each other

Algo

image

Then we might have two questions: 1. how to select parameter k; 2. how to measure the distance between two cases.

For Q2, we can use the Euclidean distance as follows, but there are also many other methods we could use.

image

For Q1, let’s take the income and age as two independent variables as an example:

image
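
As a minimal sketch of both questions in scikit-learn (X and y are assumed to hold the features and labels), we can loop over candidate k values and compare held-out accuracy:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# X, y assumed: e.g. income/age features and the class labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

for k in range(1, 10):
    knn = KNeighborsClassifier(n_neighbors=k)  # Euclidean distance by default
    knn.fit(X_train, y_train)
    print(k, accuracy_score(y_test, knn.predict(X_test)))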

Evaluation metrics in classification

How to measure the accuracy?

image

Jaccard index

image
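
In plain text, J(y, ŷ) = |y ∩ ŷ| / |y ∪ ŷ| = |y ∩ ŷ| / (|y| + |ŷ| − |y ∩ ŷ|).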

If J(y, y_hat) = 1 then it means a perfect prediction.

The higher the index is, the more accurate our model is.

F1-score

Confusion matrix: (which shows the model’s ability to correctly classify the object)

image

Further,

image
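
Here precision = TP / (TP + FP) and recall = TP / (TP + FN), computed per class from the confusion matrix. Then: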

F1-score = 2 * (prc * rec) / (prc + rec)

The higher the F1-score is, the more accurate our model is.

Comment: the Jaccard index and F1-score can be used for multi-class problems as well.

Log loss

Log loss measures the performance of a classifier whose predicted output is a probability value between 0 and 1.

Using the following formula to calculate the log loss:

LogLoss = −(1/n) Σ [ y·log(ŷ) + (1 − y)·log(1 − ŷ) ]

where ŷ is the predicted probability and y is the actual label (0 or 1).

The lower the logloss is, the more accurate our model is.

Decision Trees (DT)

What’s a decision tree?

The basic intuition is to map out all possible decision paths in the form of a tree.

For example, say we need to decide which drug the patient p15 need to take.

image

Suppose we have the following decision tree:

image

  • Sex here is called an internal node, and each internal node corresponds to a test
  • M here is called a branch, and each branch corresponds to a result of the test
  • B here is called a leaf node, and each leaf node assigns a classification

So how to build a DT?

image

Let’s talk more about this process. How to choose the attribute we used to split the tree? That is, which attribute is the best?

Let’s say we used cholesterol as the attribute, then we have

image

Apparently it’s not a good choice. If we used the sex attribute, we have

image

Compared with cholesterol, the sex attribute gives us more predictiveness, less impurity, and lower entropy. But the result for men is still not very good, so we take a further step:

image

A node with only one class is called a pure node.

As the internal nodes increase, the impurity decreases. And the impurity is measured by entropy.

So what is entropy?

  • Measure of randomness or uncertainty
  • The lower the entropy, the less uniform the distribution, and the purer the node.

image

Mathematically,

Entropy = −p(A)·log₂(p(A)) − p(B)·log₂(p(B))

where p is the proportion of drug A/B.

For example, we can calculate the entropy before splitting.

image

9 B, 5 A, 14 in total:
E = −(9/14)·log₂(9/14) − (5/14)·log₂(5/14) = 0.94

If we use the cholesterol to split, then we have entropy as

image

But if we use the sex, then

image

Now back to our question: which one is a better choice?

Ans: we choose the split with the higher information gain.

So what is information gain?

  • The information that can increase the level of certainty after splitting
  • IG(information gain) = entropy before split - weighted entropy after split
  • The weighted entropy ⬆️, the IG ⬇️

image

Therefore, we should go with the sex attribute. As we can imagine, the next step is to repeat the same process as above until we find the purest leaf node as below.

image
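
To make the entropy and information-gain arithmetic concrete, here is a small sketch that reproduces the 0.94 figure from the example above (the post-split counts are illustrative):

import math

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

before = entropy([9, 5])                  # 9 drug B, 5 drug A -> 0.940
# weighted entropy after splitting on sex (counts are illustrative):
after = (7 / 14) * entropy([3, 4]) + (7 / 14) * entropy([6, 1])
print(before, before - after)             # the second value is the IG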

Intro to Logistic Regression

  • what is LR (logistic regression, same as below, not linear regression)?
  • what problems can we solve with LR?
  • in which situations should we use LR?

What is LR?

  • A classification algorithm for categorical variables
  • Applications: image

When should we use it?

  • if your data is binary: 1/0, T/F, Y/N
  • if you need probabilistic results (LR computes the probability of an object belonging to a specific class, then maps it to a class label)
  • when you need a linear decision boundary (line, plane, or hyperplane) image
  • if you need to understand the impact of a feature (attribute/independent variable)

Logistic Regression vs Linear Regression

Can we use linear regression to predict categorical data?

image

image

Our job is to predict whether a point is red or blue given its X (age) value. Because the output is only 0/1, we may end up with a step function like this

image

Note that in this case, no matter how large the input value is, the output is always 1 as long as the prediction is greater than 0.5.

Therefore it would be nice if we could find some method to map the value of y_hat into the range [0,1].

Fortunately, sigmoid function could help us achieve this.

image

So what is sigmoid function?

  • also called the logistic function
  • σ(θᵀx) = 1 / (1 + e^(−θᵀx)) image
  • As θᵀx ⬆️, e^(−θᵀx) goes to 0 and the sigmoid value goes to 1; on the other hand, if θᵀx is very small (very negative), the sigmoid value goes to 0 image

So now how to explain the output of our model if we use the sigmoid function?

image

Having understood the output, how to train our model to find the θ is the only question left.

  1. Initialize θ;
  2. Calculate y_hat = σ(θ^T X) for a customer;
  3. Compare the output of y_hat with actual output of customer, y, and record it as error;
  4. Calculate the error for all customers;
  5. Change the θ to reduce the cost;
  6. Go back to step 2.

Now let’s talk more details in this training process, such as how to change the parameter to have a better outcome, as well as the cost function.

To find the best parameter, we should formulate the expression of our cost, and by finding its derivative, we could know how to change θ to have a better outcome.

A general cost function could be image

But to keep the error non-negative and simple, we take image as our cost function. Thus, for all customers, our final cost is image

Given its complexity, it's not easy to find the global minimum of this function using derivatives. But if we could find its GM (global minimum), we could find the best estimates of our parameters. ➡️ Gradient Descent

So we choose another cost function whose GM is easier to find. Let's see how to find such a cost function:

Since we know image

if we plot the cost function image

we find that the function −log(ŷ) provides a good fit for this curve.

Therefore, we can replace the cost function and the total cost function as follows:

image
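
Written out, the per-customer cost is −log(ŷ) when y = 1 and −log(1 − ŷ) when y = 0, so the total (cross-entropy) cost is J(θ) = −(1/m) Σ [ y·log(ŷ) + (1 − y)·log(1 − ŷ) ].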

Now we could minimize the total cost function to find the best estimate of parameters.

And here are some Q&A regarding minimizing the cost function:

image

How does gradient descent work?

Important

image

COMMENT

  • With the new total cost function, compute its partial derivative with respect to each parameter; together these derivatives form the gradient vector
  • Given a learning rate, we update the parameters at a specific speed, just like a step size (see the sketch below)
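
Here is a minimal NumPy sketch of that loop, with made-up data and an arbitrary learning rate and iteration count:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# hypothetical data: leading column of 1s so theta[0] acts as theta_0
X = np.array([[1, 0.5], [1, 2.3], [1, 2.9], [1, 0.1]])
y = np.array([0, 1, 1, 0])

theta = np.zeros(X.shape[1])           # step 1: initialize theta
lr = 0.1                               # learning rate (step size)
for _ in range(1000):
    y_hat = sigmoid(X @ theta)         # step 2: compute predictions
    grad = X.T @ (y_hat - y) / len(y)  # gradient of the cross-entropy cost
    theta -= lr * grad                 # step 5: update theta
print(theta)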

Training Algorithm Recap

image

Intro to SVM (Support Vector Machine)

image

Now we have two new challenges: 1. How to transform the data into a higher dimension space? 2. How to find the best/optimized separator?

Data Transformation

image

The function used to transform the data is called a kernel function. The four common types are linear, polynomial, radial basis function (RBF), and sigmoid.
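
A small sketch of trying the four common kernels in scikit-learn (X_train etc. are assumed to come from an earlier split):

from sklearn import svm

for kernel in ('linear', 'poly', 'rbf', 'sigmoid'):
    clf = svm.SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, clf.score(X_test, y_test))   # held-out accuracy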

Using SVM to find the hyperplane

Our goal is to find a hyperplane with as big a margin as possible.

image

The points on the boundary are called support vectors.

Mathematically, our goal is

image

Not surprisingly, we still need to use gradient descent. (skip for now)

Pros & Cons of SVM

image

Therefore when should we use SVM?

  • image recognition
  • text category assignment
  • detecting spam
  • sentiment analysis
  • gene expression classification
  • regression, outlier detection and clustering

Intro to Clustering

Say we have a dataset like

image

We need to derive segments and groups from large datasets. That is, how can we use this unlabelled dataset to understand and identify how customers are similar to each other?

Clustering

Client segmentation is one of the most popular uses of clustering. But clustering also has many other applications in different domains.

What’s a cluster?

A group of objects that are similar to other objects in the cluster and dissimilar to data points in other clusters.

image

What’s the difference between clustering and classification?

Classification is a supervised learning algorithm to predict the label of objects.

However, clustering is an unsupervised learning algorithm, to group similar objects and assign them to different clusters.

image

More Applications

  • Retail/marketing
    • identifying buying patterns of customers
    • recommending new books or movies to new customers
  • Banking
    • Fraud detection in credit card use
    • identifying clusters of customers
  • Insurance
    • fraud detection in claims analysis
    • insurance risk of customers
  • Publication Media
    • auto-categorizing news based on their content
    • recommending similar news articles
  • Medicine
    • Characterizing patient behavior
  • Biology
    • Clustering genetic markers to identify family ties

Why clustering?

  • Exploratory data analysis
  • summary generation
  • outlier detection
  • finding duplicates
  • pre-processing step

Clustering algorithms

  • Partitioned-based clustering (for medium/large databases)
    • Relatively efficient
    • eg, K-means, K-median, Fuzzy c-Means
  • Hierarchical Clustering (intuitive, small sets)
    • Produces trees of clusters
    • eg, agglomerative, divisive
  • Density-based clustering (good for spatial datasets or datasets with noise)
    • produces arbitrary shaped clusters
    • eg DBSCAN

Intro to K-means

image

  • Partitioning Clustering
  • K-means divides the data into non-overlapping subsets without any cluster-internal structure
  • objects within a cluster are very similar
  • objects within different clusters are very different

The distance between samples is used to shape the clusters. K-means tries to minimize the intra-cluster distance and maximize the inter-cluster distance.

So how can we compute the dissimilarity/distance? Let's look at the following examples with Euclidean distance.

image

image

We could also use cosine similarity, average distance, etc.

Now let’s suppose our data has only two features and looks as following:

image

  1. To start the process, we initialize k points as the cluster centers. These points could be random.

We will talk more about how to select k later

image

  2. Distance calculation

Compute the distance between each point and all cluster centers, and put them together to form the distance matrix.

image

  3. Assign points to their closest centroid

With SSE (sum of squared errors), we can easily compare which clustering is better. In other words, our job is to find a clustering with minimal SSE. Apparently the initial points are not good enough, so we need to move them.

image

  4. Compute the new centroids for each cluster

Use the mean of points in each cluster as the new centroids.

image

Repeat this process until the centroids no longer move.

Note: there is no guarantee that it will converge to the global optimum and the result may depend on the initial clusters.

K-means Accuracy and Characteristics

Algo summary (a scikit-learn sketch follows the list):

  1. randomly placing k centroids, one for each cluster
  2. calculate the distance of each point from each centroid
  3. assign each data point to its closest centroid, creating a cluster
  4. recalculate the position of the k centroids
  5. Repeat the steps 2-4, until the centroids no longer move
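
In scikit-learn this whole loop is a single call; a minimal sketch (X assumed, with n_init restarts to soften the bad-initialization issue noted earlier):

from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, n_init=10).fit(X)
print(km.labels_)            # cluster assignment for each point
print(km.cluster_centers_)   # final centroid positions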

How to measure its accuracy?

  • External: compare the clusters with the ground truth, if it is available
  • Internal: average the distance between data points within a cluster

How to choose K?

Run the algorithm with different K, and then compare their accuracy (this accuracy could be the average distance of points to its centroid).

However, the truth is that as K increases, the average distance will always decrease (accuracy increases). Therefore, we mainly need to find the elbow point, where the rate of improvement changes sharply. This method is called the elbow method.

image
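
A sketch of the elbow method using inertia (scikit-learn's within-cluster sum of squared distances):

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertias = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_
            for k in range(1, 10)]
plt.plot(range(1, 10), inertias, marker='o')   # look for the elbow
plt.xlabel('K'); plt.ylabel('inertia'); plt.show()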

Intro to Hierarchical Clustering

image

image

Agglomerative would be our focus in this video. As the picture above shows, agglomerative starts with each observation as one cluster and then combine them one by one.

So how to do agglomerative /əˈɡlɑːməˌreɪtɪv/ clustering?

Say we want to cluster these 6 cities in Canada; the numbers are the pairwise distances between cities.

image

Start with 6 clusters, and then we select the minimal distance and put them together, which is OT-MO at 167.

image

And then we take OT/MO as one cluster and re-compute the distances:

As for how to compute the distance to OT/MO, one sensible choice is to use the midpoint as the location of OT/MO.

image

And then repeat this process until we reach the threshold. If there is no threshold, all of the initial clusters will eventually be merged into a single cluster.

What’s the threshold? The hierarchical clustering typically can be visualized by a dendrogram.
image If we pre-defined a y, then the hierarchy would be cutted into different disjoint clusters. image
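
A minimal SciPy sketch (assuming dist_matrix is the pairwise city-distance matrix in condensed form; the cutoff of 300 is illustrative):

from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

Z = linkage(dist_matrix, method='average')   # agglomerative merge history
dendrogram(Z)                                # visualize the hierarchy
plt.show()
labels = fcluster(Z, t=300, criterion='distance')  # cut at threshold y = 300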

More about HC

Let’s say our dataset has n points. Now let’s summarize the agglomerative algorithm: (Note that the final until statement could also be until we reach the threshold.)

image

If we look at this summary, we find that the real questions are

  1. Which distance should we use?
  2. How to compute the distance, especially after merging part of them?
  3. How to find the lowest distance?

Take the following patient info as an example; the data is 3-dimensional, and here we use the Euclidean distance.

The example here shows how to compute the distance between two points, so how about the distance between two clusters?

image

There are a lot of choices, as shown below. Generally it depends on the data type, the dimensionality and, most importantly, domain knowledge of the dataset. In fact, these different distance definitions distinguish different algorithms.

image

Pros vs Cons

image

Hierarchical Clustering vs K-means

image

DBSCAN Clustering

DBSCAN: Density-Based Spatial Clustering of Applications with Noise

Density-based clustering

image

K-Means vs Density-based clustering

image

About DBSCAN

  • DBSCAN is one of the density-based clustering methods.
  • DBSCAN: Density-Based Spatial Clustering of Applications with Noise
  • It can find arbitrarily shaped clusters without being affected by noise
  • Based on two parameters: R (radius) and M (minimum number of neighbors)

image

Points that satisfy the R and M requirements are called core points; points that can be reached by a core point but have fewer than M neighbors are called border points. A point that is neither a core point nor a border point is called an outlier.
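
In scikit-learn, R and M correspond to eps and min_samples; a minimal sketch (X assumed, parameter values illustrative):

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5).fit(X)  # R = 0.5, M = 5
print(db.labels_)   # cluster ids; -1 marks outliers (noise)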

Advantages of DBSCAN

  1. Arbitrarily shaped clusters
  2. Robust to outliers
  3. Does not require specification of the number of clusters

Intro to Recommender Systems

What are recommender systems?

Recommender systems capture patterns in people's behavior and use them to predict what else you might like.

Applications

  • what’s to buy?
  • where to eat?
  • which job to apply?
  • who you should be friends with?
  • personalize your experience on the web

image

Advantages

  • Broader exposure
  • Continual usage/purchase of products
  • Provides a better experience

image

image

Content-based recommender system

  • Recommend based on user’s profile
  • based on similarity of user’s liked/disliked items

Take the movie recommendation as an example:

image

Then normalize the weights to define the user's profile.

image

And then use the user’s profile to find the most suitable movie for user.

image
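
A tiny NumPy sketch of the profile computation, with a made-up movie-genre matrix and ratings:

import numpy as np

ratings = np.array([8.0, 9.0, 6.0])   # user's ratings of 3 watched movies
genres = np.array([[1, 0, 1],         # movies x genres (made up)
                   [0, 1, 1],
                   [1, 1, 0]])
weights = ratings @ genres            # weighted genre totals
profile = weights / weights.sum()     # normalized user profile
candidate = np.array([1, 1, 0])       # genre vector of an unseen movie
print(profile @ candidate)            # predicted interest score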

But this method doesn’t work if the user has never seen some specific types of movie, such as drama. That is, the content-based RS won’t recommend those types of movie that user has never watched before.

Therefore, in this case we need to introduce other methods such as collaborative filtering.

Collaborative Filtering

Two types:

image

image

Suppose we already know that the correlations between our target user and three other users are 0.7, 0.9 and 0.4, and we want to calculate the expected scores for this user.

image

Then calculate the weighted rating matrix for the movies of interest:

image

Then we divide the sum of weighted ratings for each movie by the total weight (because not all movies have been rated by all users, we need to eliminate this influence).

image
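
A sketch of this weighted-average step with the 0.7/0.9/0.4 similarities above (the ratings matrix is made up; NaN marks movies a neighbor hasn't rated):

import numpy as np

sims = np.array([0.7, 0.9, 0.4])       # similarity to each neighbor
R = np.array([[9.0, np.nan, 8.0],      # neighbors x movies (made up)
              [7.0, 8.0, np.nan],
              [8.0, np.nan, 9.0]])
mask = ~np.isnan(R)
weighted = np.where(mask, R, 0.0) * sims[:, None]
pred = weighted.sum(axis=0) / (mask * sims[:, None]).sum(axis=0)
print(pred)                            # expected rating per movie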

Actually, both collaborative filtering types work similarly; the difference is that their objectives are transposed: user-based CF looks for similar users, while item-based CF looks for similar items.

image

image