The Prelude: Statistics
Before getting into real statistical machine learning, I'll have to recap some basic statistics… The main objective is to give an intuitive explanation of the basic concepts that machine learning requires.
As we know, statistics has two branches: descriptive and inferential. Descriptive statistics summarizes a set of data, while inferential statistics uses data to draw conclusions. Both are covered in statistical machine learning, and a large part of the foundation is probability theory, which is where we are going to start.
Random Process
Random Variables
A variable is something whose value may change, but a random variable seems a little trickier, because it can take on a bunch of different values at the same time, and we can never solve for its value. We can, however, develop something to describe how it behaves.
A random variable can be treated as a map from the outcome of a random process to a number. For example, let a random variable represent whether it rains tomorrow: it is $1$ if it rains tomorrow, otherwise $0$.
In order to describe the behavior of a random variable, we assign a probability to each value it can take, written as
$$P(X = x_1)$$
The expression above indicates the probability for $X$ to have the value $x_1$.
Random variables can be either discrete or continuous.
Binomial Distribution
The core of the Binomial distribution is a series of random experiments, each of which has exactly two possible outcomes. For example, we flip a fair coin $5$ times and let the random variable $X$ be the number of heads after the $5$ flips, so $X$ can be $0$, $1$, $2$, $3$, $4$, or $5$. We can calculate the probability of the first three cases, and the other three cases are symmetric:
$$P(X = k) = \binom{5}{k}\left(\frac{1}{2}\right)^5$$
This is quite intuitive: each possible combination of $5$ coin results has probability $(\frac{1}{2})^5$, and to calculate the probability of $k$ heads we are actually choosing which $k$ of the $5$ coins are heads.
Now let's make the problem a little more complex. Suppose we have an unfair coin, which has probability $0.3$ of showing heads. In this case the calculation becomes
$$P(X = k) = \binom{N}{k}(0.3)^k(1-0.3)^{N-k}$$
- Select $k$ from $N$ flips to be heads
- Multiply by the probability of heads to the power of $k$
- Multiply by the probability of tails to the power of $(N-k)$
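The three steps above can be sketched in a few lines. This is an illustrative snippet, and the name `binom_pmf` is my own, not something from the text:

```python
# Minimal binomial pmf sketch; binom_pmf is a hypothetical helper name.
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Fair coin, 5 flips: P(X = 2) = C(5,2) / 2^5 = 10/32
print(binom_pmf(2, 5, 0.5))   # 0.3125
# Unfair coin with P(head) = 0.3, 5 flips
print(binom_pmf(2, 5, 0.3))
```

Summing the pmf over all $k$ from $0$ to $N$ gives $1$, which is a quick sanity check that the three factors above are right.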
The expectation of the binomial distribution is
$$E[X] = np$$
where $n$ is the number of experiments and $p$ is the probability of success in a single trial. Let's prove that. We know that
$$P(X = k) = \binom{n}{k}p^k(1-p)^{n-k}$$
and the expectation is the probability-weighted sum of outcomes
$$E[X] = \sum_{k=0}^{n} k\binom{n}{k}p^k(1-p)^{n-k} = np\sum_{k=1}^{n}\binom{n-1}{k-1}p^{k-1}(1-p)^{n-k}$$
The part after $np$ in the equation above is actually the sum of the probabilities of getting $k-1$ successes in $n-1$ trials, with each trial succeeding with probability $p$. That sum is $1$, so the expected value is $np$.
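We can check $E[X] = np$ numerically by computing the weighted sum directly; a quick sketch:

```python
# Numeric check that the probability-weighted sum of outcomes equals n * p.
from math import comb

n, p = 5, 0.3
expectation = sum(k * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))
print(expectation)  # ~1.5, i.e. n * p
```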
Poisson Process
You might find the Poisson process quite tricky, just like I did, because the math teacher sucks. Think about this scenario: we want to look at how many vehicles pass a crossroad in one hour. So we have the random variable $X = \#\text{cars passing in 1 hour}$, and we make the following assumptions:
- Every hour is identical
- Hours are independent
How are we going to solve this? Well, let's assume the problem is binomial: we divide one hour into $60$ minutes and do one trial per minute, in which either one car passes by or no car does. So we have
$$E[X] = \lambda$$
where $\lambda$ is the expected number of cars per $60$ minutes. For each minute, the probability of a car passing by is $\frac{\lambda}{60}$, so using the conclusion from the binomial distribution, we have
$$P(X = k) = \binom{60}{k}\left(\frac{\lambda}{60}\right)^k\left(1 - \frac{\lambda}{60}\right)^{60-k}$$
But the problem is that multiple cars might pass by in one minute, which means we need a finer-grained approximation. The idea is to divide the hour into more and more intervals, in the limit infinitely many.
Before we get into that, let's look at a couple of facts that will be helpful later on:
$$\lim_{n\to\infty}\left(1 + \frac{a}{n}\right)^n = e^a, \qquad \lim_{n\to\infty}\frac{n!}{(n-k)!\,n^k} = 1$$
It is quite straightforward to get the first one: simply let $m = \frac{n}{a}$. Then the equation becomes
$$\lim_{m\to\infty}\left(\left(1 + \frac{1}{m}\right)^m\right)^a = e^a$$
Our probability is then written as
$$P(X = k) = \lim_{n\to\infty}\binom{n}{k}\left(\frac{\lambda}{n}\right)^k\left(1-\frac{\lambda}{n}\right)^{n-k} = \frac{\lambda^k e^{-\lambda}}{k!}$$
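The limiting argument can be watched happening numerically: as the hour is chopped into more intervals, the binomial probability converges to the Poisson one. A small sketch (the function names are mine):

```python
# Binomial-to-Poisson convergence sketch; poisson_pmf / binom_approx are hypothetical names.
from math import comb, exp, factorial

lam, k = 4.0, 2  # e.g. 4 cars per hour on average; probability of seeing exactly 2

def poisson_pmf(k, lam):
    return lam**k * exp(-lam) / factorial(k)

def binom_approx(k, lam, n):
    p = lam / n  # probability of one car in each of the n sub-intervals
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(poisson_pmf(k, lam))
for n in (60, 600, 6000):  # finer and finer intervals
    print(n, binom_approx(k, lam, n))
```

Each extra factor of 10 in the number of intervals brings the binomial value visibly closer to the Poisson pmf.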
Bernoulli Distribution
Bernoulli is a special case of the Binomial distribution with a single trial. For example, we have a random variable $X$ that represents the result of flipping a coin once, which can be either Head or Tail.
Then the pmf of the distribution is simply
$$P(X = k) = p^k(1-p)^{1-k}$$
where $k$ can be either $0$ or $1$.
Beta distribution
The probability density of the beta distribution is simply given by
$$P(\theta) = \frac{\theta^{a-1}(1-\theta)^{b-1}}{B(a, b)}$$
where $\theta$ is a random variable whose value runs from $0$ to $1$. The denominator $B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$ is a normalization function of $a$ and $b$ that makes the pdf integrate to $1$ over the random variable.
The following graph shows the different pdf shapes for different combinations of $a$ and $b$.
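To convince yourself the normalization works, the density can be evaluated with the Gamma function from the standard library; a small sketch (`beta_pdf` is my own name):

```python
# Beta pdf sketch using the Gamma-function form of 1 / B(a, b).
from math import gamma

def beta_pdf(theta: float, a: float, b: float) -> float:
    norm = gamma(a + b) / (gamma(a) * gamma(b))  # 1 / B(a, b)
    return norm * theta**(a - 1) * (1 - theta)**(b - 1)

# a = b = 1 gives the uniform distribution on [0, 1]
print(beta_pdf(0.3, 1, 1))  # 1.0

# crude midpoint-rule check that the pdf integrates to 1 for a = 2, b = 5
n = 100000
total = sum(beta_pdf((i + 0.5) / n, 2, 5) for i in range(n)) / n
print(total)  # ~1.0
```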
Frequentist Approach vs Bayesian Approach
The difference between the frequentist and the Bayesian approach lies in what is assumed to vary: the data or the parameters. The frequentist assumes the parameters of a given problem are fixed; for example, in a coin-flipping trial, the parameter $p$ representing the probability of Heads is always $0.5$. Inferences are made by applying mathematics directly to the fixed parameters.
The Bayesian approach, on the other hand, treats the data as fixed and lets the parameters vary. In the coin-flipping case, the probability of getting a Head may be related to some priors, for example the gravity, the angle of the coin… As long as those conditions are fixed, we would always get the same result; the uncertainty is introduced by the initial conditions.
Bayes’ rule
Bayes' rule states that the posterior of the parameter of a given distribution is given by
$$P(\theta \mid x) = \frac{P(x \mid \theta)P(\theta)}{P(x)}$$
where $P(x \mid \theta)$ is the likelihood and $P(\theta)$ is the prior distribution. Because the observed data is assumed fixed, $P(x)$ is a constant, and the posterior is proportional to the likelihood multiplied by the prior:
$$P(\theta \mid x) \propto P(x \mid \theta)P(\theta)$$
This relationship is quite useful if we assume the prior to be uniformly distributed, because we can maximize the posterior by maximizing the likelihood.
Likelihood isn’t a distribution
The likelihood $P(x \mid \theta)$ can also be written as $L(\theta \mid x)$: in the Bayesian view, the data is fixed and the parameter is the variable. If we integrate the likelihood over the parameter,
$$\int L(\theta \mid x)\, d\theta = \int P(x \mid \theta)\, d\theta$$
we are actually summing up the probability of the data given the parameter for all possible parameters. There is no reason for this integral to equal $1$; it can take any value, even an infinite one when $\theta$ ranges from $-\infty$ to $\infty$, so the likelihood does not satisfy the properties of a distribution. That's why we always say likelihood function instead of likelihood distribution.
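A concrete check: for a single Bernoulli observation $x = 1$, the likelihood of $\theta$ is just $L(\theta \mid x = 1) = \theta$, and its integral over $[0, 1]$ is $\frac{1}{2}$, not $1$. A minimal midpoint-rule sketch:

```python
# Integrate the Bernoulli likelihood L(theta | x = 1) = theta over [0, 1].
n = 100000
integral = sum((i + 0.5) / n for i in range(n)) / n
print(integral)  # ~0.5, not 1 -- the likelihood is not a density over theta
```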
Conjugate Distribution
If the prior and the posterior distributions are in the same probability distribution family, the prior is called the conjugate prior to the likelihood function.
For example, the Beta distribution is conjugate to the Bernoulli/Binomial distribution. To understand this, we need to introduce the Gamma function, which is like the factorial defined on the real numbers. For a nonnegative integer $\alpha$, we have
$$\Gamma(\alpha + 1) = \alpha!$$
Therefore, we can write the choose function in terms of the Gamma function:
$$\binom{N}{k} = \frac{\Gamma(N+1)}{\Gamma(k+1)\Gamma(N-k+1)}$$
Now we have a posterior distribution
$$P(\theta \mid z, N) \propto P(z \mid N, \theta)P(\theta)$$
where $z$ is the number of successes in $N$ trials. From the Binomial distribution, we have
$$P(z \mid N, \theta) = \binom{N}{z}\theta^z(1-\theta)^{N-z}$$
and from the Beta distribution we have
$$P(\theta) = \frac{\theta^{a-1}(1-\theta)^{b-1}}{B(a, b)}$$
Letting $a' = a + z$ and $b' = b + N - z$, we can simply get
$$P(\theta \mid z, N) \propto \theta^{a'-1}(1-\theta)^{b'-1}$$
which is again a Beta distribution, $\text{Beta}(a', b')$.
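The conjugate update can be verified numerically: likelihood times prior differs from the $\text{Beta}(a', b')$ density only by a constant factor (the evidence). A sketch, with my own function names and arbitrary example numbers:

```python
# Beta-Binomial conjugacy check; beta_pdf / binom_pmf are hypothetical helper names.
from math import comb, gamma

def beta_pdf(theta, a, b):
    return gamma(a + b) / (gamma(a) * gamma(b)) * theta**(a - 1) * (1 - theta)**(b - 1)

def binom_pmf(z, N, theta):
    return comb(N, z) * theta**z * (1 - theta)**(N - z)

a, b = 2.0, 2.0             # prior Beta(a, b)
N, z = 10, 7                # observed 7 successes in 10 trials
a2, b2 = a + z, b + N - z   # posterior parameters a', b'

# unnormalized posterior vs the Beta(a', b') density at a few grid points:
for theta in (0.2, 0.5, 0.8):
    unnorm = binom_pmf(z, N, theta) * beta_pdf(theta, a, b)
    print(theta, unnorm, beta_pdf(theta, a2, b2))
# the two columns differ only by a constant factor, the evidence P(z)
```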
Machine Learning Basics
Regression as Probabilistic Model
Linear Regression with Gaussian Noise
Suppose that we have a linear model with Gaussian noise, given by
$$y = w'x + \epsilon$$
with
$$\epsilon \sim \mathcal{N}(0, \sigma^2)$$
Combining the linear model and the Gaussian distribution, we get the likelihood
$$P(y \mid x, w) = \mathcal{N}(y \mid w'x, \sigma^2)$$
Assuming a uniform prior, the posterior of the parameter $w$ given data $\{(x_i, y_i)\}_{i=1}^{n}$ is proportional to
$$\prod_{i=1}^{n}\exp\left(-\frac{(y_i - w'x_i)^2}{2\sigma^2}\right)$$
Taking the logarithm, we have
$$\log P(w \mid X, y) \propto -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - w'x_i)^2 + \text{const}$$
It is quite obvious that the log-likelihood of the parameter $w$ is proportional to the negative sum-of-squares loss, so maximizing it is equivalent to minimizing the squared error.
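For a one-dimensional model without intercept, the least-squares (and hence maximum-likelihood) solution has a simple closed form, $\hat{w} = \frac{\sum x_i y_i}{\sum x_i^2}$. A toy sketch, with made-up data:

```python
# MLE = least squares for y = w * x with Gaussian noise; synthetic toy data.
import random

random.seed(0)
w_true = 2.0
xs = [i / 10 for i in range(100)]
ys = [w_true * x + random.gauss(0, 0.1) for x in xs]

# closed-form least-squares solution for y = w * x (no intercept)
w_hat = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
print(w_hat)  # close to the true slope 2.0
```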
Logistic Regression
Perceptron Learning Algorithm
Multilayer perceptron and back propagation
SVM and Kernel Method
A support vector machine is a binary classification tool: it takes a data point and outputs a value that is either $1$ or $-1$. What we want to do with an SVM is find the support vectors, the data instances closest to the hyperplane that divides the dataset, and then find the hyperplane that maximizes that distance.
Hard margin SVM
Suppose we have a data point $X$. The distance between the point and the hyperplane $P: w'x + b = 0$ is calculated as follows:
- We find a vector $r$ such that $X + r = X_p$, where $X_p$ is a point on the hyperplane and $r \perp P$
- It is obvious that $r$ is parallel to $w$, so $r = \pm\frac{\|r\|}{\|w\|}w$
- We then have $X \pm \frac{\|r\|}{\|w\|}w = X_p$
- Multiply both sides by $w'$ and add $b$; since $X_p$ is on the hyperplane, $w'X_p + b = 0$, so we finally get $\|r\| = \pm\frac{w'X + b}{\|w\|}$
- Because the label $y_i$ is either $1$ or $-1$, the distance can be written without the sign ambiguity as $\|r\| = \frac{y_i(w'x_i + b)}{\|w\|}$

The SVM maximizes the smallest such distance:
$$\max_{w, b}\min_{i=1,\dots,n} \frac{y_i(w'x_i + b)}{\|w\|}$$
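The signed distance formula is easy to sanity-check numerically; a sketch with a hand-picked hyperplane (the `distance` name is mine):

```python
# Signed distance from a point to the hyperplane w'x + b = 0.
from math import sqrt

def distance(x, w, b):
    wx = sum(wi * xi for wi, xi in zip(w, x))
    norm_w = sqrt(sum(wi * wi for wi in w))
    return (wx + b) / norm_w

w, b = [3.0, 4.0], -5.0                 # ||w|| = 5
print(distance([0.0, 0.0], w, b))       # (0 - 5) / 5 = -1.0
print(distance([3.0, 4.0], w, b))       # (9 + 16 - 5) / 5 = 4.0
```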
The problem is that there are infinitely many equivalent choices of $w$ and $b$, namely $\alpha w$ and $\alpha b$ for any $\alpha > 0$. To resolve the ambiguity, we add the following constraint:
$$y_*(w'x_* + b) = 1$$
where $*$ denotes the closest data point to the hyperplane. Then the objective becomes
$$\max_{w, b} \frac{1}{\|w\|} \quad \text{s.t.} \quad y_i(w'x_i + b) \ge 1, \; i = 1, \dots, n$$
which is equivalent to minimizing $\frac{1}{2}\|w\|^2$ under the same constraints.
Training hard margin SVM
To train the hard margin SVM, we need to apply Lagrangian duality. The canonical form of a constrained problem is shown as follows:
$$\min_x f(x) \quad \text{s.t.} \quad g_i(x) \le 0, \; h_j(x) = 0$$
Introducing the auxiliary variables $\lambda$ and $\nu$, we have the following expression, the Lagrangian:
$$L(x, \lambda, \nu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \nu_j h_j(x)$$
For the hard margin SVM, the Lagrangian is shown as follows:
$$L(w, b, \lambda) = \frac{1}{2}\|w\|^2 - \sum_i \lambda_i\left(y_i(w'x_i + b) - 1\right)$$
Setting the derivatives to zero gives $\frac{\partial L}{\partial b} = -\sum_i \lambda_i y_i = 0$ and $\frac{\partial L}{\partial w} = w - \sum_i \lambda_i y_i x_i = 0$.
Substituting these into the original expression, we get
$$L'(\lambda) = \sum_i \lambda_i - \frac{1}{2}\sum_i\sum_j \lambda_i\lambda_j y_i y_j x_i'x_j$$
where $L'$ is the dual of the primal problem. To predict, just calculate the sign
$$\hat{y} = \text{sign}\left(\sum_i \lambda_i y_i x_i'x_j + b\right)$$
where $x_j$ is the point to predict.
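For a toy problem with one point per class, the dual can be maximized by hand: the constraint $\sum_i \lambda_i y_i = 0$ forces $\lambda_1 = \lambda_2 = \lambda$, the dual reduces to $2\lambda - \frac{1}{2}\lambda^2\|x_1 - x_2\|^2$, and the maximizer is $\lambda = \frac{2}{\|x_1 - x_2\|^2}$. A sketch under those assumptions (all names and data points are my own):

```python
# Two-point hard margin SVM solved analytically via the dual; toy example.
x1, x2 = [2.0, 2.0], [0.0, 0.0]   # x1 labeled +1, x2 labeled -1

# lambda_1 = lambda_2 = lam maximizes 2*lam - 0.5 * lam^2 * ||x1 - x2||^2
d2 = sum((a - b) ** 2 for a, b in zip(x1, x2))
lam = 2.0 / d2

# w = sum_i lambda_i y_i x_i; b from the support-vector condition y_1 (w'x_1 + b) = 1
w = [lam * (a - b) for a, b in zip(x1, x2)]
b = 1.0 - sum(wi * xi for wi, xi in zip(w, x1))

def predict(x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

print(w, b)                                      # [0.5, 0.5] -1.0
print(predict([3.0, 3.0]), predict([-1.0, -1.0]))  # 1 -1
```

Both training points sit exactly on the margin, $y_i(w'x_i + b) = 1$, as the hard margin constraints require.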
Bayesian Network
Discrete variable joint probability distribution
When we do inferences like
$$P(x \mid \vec{e})$$
where $\vec{e}$ is the evidence, i.e. the observed variables, it's quite straightforward to think of Bayes' theorem:
$$P(x \mid \vec{e}) = \frac{P(x, \vec{e})}{P(\vec{e})}$$
But there might be other variables, called $H$, that are neither $x$ nor $\vec{e}$; they can be marginalized out like this:
$$P(x \mid \vec{e}) \propto \sum_{H} P(x, \vec{e}, H)$$
so the posterior is related to the full joint probability distribution.
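Marginalizing hidden variables out of a joint table is just a nested sum; a sketch over a made-up three-variable joint (the table values are arbitrary, chosen only for illustration):

```python
# Inference by enumeration over a hypothetical joint P(X, E, H); values are arbitrary.
from itertools import product

joint = {}
for x, e, h in product([0, 1], repeat=3):
    joint[(x, e, h)] = 0.1 + 0.05 * x + 0.1 * e + 0.02 * h
total = sum(joint.values())
joint = {k: v / total for k, v in joint.items()}  # normalize into a distribution

# P(X = 1 | E = 1): marginalize H out, then normalize over X
num = sum(joint[(1, 1, h)] for h in (0, 1))
den = sum(joint[(x, 1, h)] for x in (0, 1) for h in (0, 1))
print(num / den)
```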
A Bayesian network helps reduce the number of parameters required to calculate the joint probability by encoding dependencies between the random variables, as in the figure below:
where $A$, $B$, $C$, $D$, $E$ are random variables with two possible values, $1$ and $0$.
If we write down the joint probability while ignoring the graph, it needs $2^5$ parameters, because we need a probability value for every combination of the $5$ random variables taking each of their $2$ possible values.
After taking the graph into consideration, the dependencies change. For example, the random variables $A$ and $B$ don't have any parents. The joint distribution then factorizes into one factor per node, each conditioned only on that node's parents,
and the number of parameters becomes
much smaller than before. In fact, the number of parameters grows as $O(nd^k)$, where $d$ is the number of possible values of each random variable, $k$ is the average number of parents, and $n$ is the number of nodes.