Face Recognition Development Notes



TensorFlow is a graph-based computation framework. Expressing computation as a graph provides better readability, parallelism, and opportunities for optimization at pre-compile time.

In declarative programming, the outputs of functions are independent of external state; they depend only on the inputs, which eliminates most of the trouble caused by side effects. In TensorFlow, each function or operation is a single node in the graph.

The graph nature of TensorFlow also provides the capability of pre-compiling. Instead of executing the code line by line like an interpreted language (Python, for example), a graph is generated by pre-compiling the code. Optimizations such as parallelizing independent logic, removing useless logic, and extracting shared logic become available, making execution more efficient.

Key elements in tensorflow graph

Model design

The primary problem of the project is to find selfies among the photos uploaded by Twitter users. This can be treated as an object detection problem. However, the challenging part is that the user's avatar is our only reference for what the user looks like. Therefore, it is impossible to train a network to classify a photo as a particular person. Moreover, we want the model to generalize so it can detect users it has never seen before, so training the network with user IDs as classes is infeasible in this case.

Instead of detecting the user directly, we introduce two separate steps to accomplish this task. The first is to detect faces using RCNN-like networks or YOLO (You Only Look Once). The second is to compare each detected face to the face in the user's avatar, using a one-shot algorithm such as a Siamese network.


The dataset is a collection of loosely cropped human faces. The filename structure is of the form “n{id}/(unknown).jpg”. Photos of the same person are placed under the same directory.


RCNN is one of the simplest approaches to object detection. It works as follows:

1. Generate region proposals (candidate bounding boxes, e.g. via selective search).
2. Warp each proposed region to a fixed size and run it through a CNN to extract features.
3. Classify each region's features and refine the bounding boxes.

Advantage: Very simple

Disadvantage: running the CNN repeatedly on every region makes it very slow and wastes a lot of computation.


Fast RCNN improves efficiency with the following changes: the CNN is run only once over the whole image, and per-region features are then extracted from the shared feature map via RoI pooling.

Faster RCNN

Faster RCNN makes it even faster by proposing bounding boxes with a neural network (the Region Proposal Network) instead of an external algorithm.


YOLO is even faster than Faster RCNN, but it trades accuracy for speed to make real-time detection in video streams possible.

The main idea of YOLO is to divide the image into a grid, say 7 by 7. Each grid cell predicts B bounding boxes and their corresponding confidences.
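Each predicted box consists of five numbers (x, y, w, h and a confidence score), and each cell additionally predicts C class probabilities, so the output layer has S·S·(B·5 + C) values. A minimal sketch, using the settings from the original YOLO paper (S = 7, B = 2, C = 20):

```python
def yolo_output_size(S, B, C):
    """Each of the S*S grid cells predicts B boxes with 5 numbers each
    (x, y, w, h, confidence) plus C class probabilities."""
    return S * S * (B * 5 + C)

# Settings from the original YOLO paper: 7x7 grid, 2 boxes, 20 classes.
print(yolo_output_size(7, 2, 20))  # 1470
```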

At prediction time, the model infers the class-wise confidence for each grid cell's boxes. This indicates both how likely the object belongs to a class and how well the box fits the object.

The limitation of YOLO is that it can hardly operate well when objects overlap or multiple objects fall into the same grid cell.

Neural network basics

Gradient descent

Gradient descent methods are usually applied to iteratively optimize linear or non-linear functions (the non-linear case using the Jacobian). The main idea is to keep taking steps in the direction opposite to the gradient at the current point, which requires calculating the partial derivative of the loss function with respect to each variable.
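As a minimal illustration of the update rule (the quadratic function, starting point and step size here are arbitrary choices, not from the project):

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly step opposite to the gradient at the current point."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(round(x_min, 4))  # converges to 3.0
```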

Back propagation

The main idea of this section comes from Wikipedia.

Back propagation is the application of the gradient descent method to a neural network. We start from the output layer and move backward, which is why it is called back propagation. The variables of the network are its weights, so we calculate the partial derivative of the loss function with respect to the weights of the incoming edges of the current layer. That is,

$$\frac{\partial E}{\partial w_{i,j}}$$

However, this is quite hard to compute directly. Therefore we apply the chain rule to the expression above:

$$\frac{\partial E}{\partial w_{i,j}} = \frac{\partial E}{\partial o_j}\,\frac{\partial o_j}{\partial net_j}\,\frac{\partial net_j}{\partial w_{i,j}}$$

, where $w_{i,j}$ is the weight of the edge connecting neuron $i$ to neuron $j$, $o_j$ is the output of neuron $j$, and $net_j$ is the weighted input of neuron $j$: the sum over all incoming edges of the edge weight times the output of the neuron the edge comes from, $net_j = \sum_k w_{k,j}\,o_k$.

From the last term of the equation above, we can find that

$$\frac{\partial net_j}{\partial w_{i,j}} = \frac{\partial}{\partial w_{i,j}} \sum_k w_{k,j}\,o_k = o_i$$

This is because only one term in the sum expression above depends on the weight $w_{i,j}$.

For the second term,

$$\frac{\partial o_j}{\partial net_j} = \varphi'(net_j)$$

where $\varphi(\cdot)$ is the activation function of the neuron; this is why we prefer a differentiable activation function in many cases.

When the neuron is in the output layer, the first term is simply the derivative of the loss with respect to the output; for example, with squared error $E = \frac{1}{2}(t - o_j)^2$:

$$\frac{\partial E}{\partial o_j} = o_j - t$$

But if the neuron is in a hidden layer, the partial derivative of the loss function is not so obvious, and we need to take the total derivative over all neurons $l$ that receive input from neuron $j$:

$$\frac{\partial E}{\partial o_j} = \sum_{l}\left(\frac{\partial E}{\partial o_l}\,\frac{\partial o_l}{\partial net_l}\,w_{j,l}\right)$$

Where $x_l$ variables are from the next layer (the layer close to th output layer), which are available from previous iterations.

Putting everything together, we have:

$$\frac{\partial E}{\partial w_{i,j}} = \delta_j\,o_i$$

, where

$$\delta_j = \frac{\partial E}{\partial o_j}\,\varphi'(net_j) = \begin{cases} (o_j - t)\,\varphi'(net_j) & \text{if } j \text{ is an output neuron} \\ \left(\sum_l w_{j,l}\,\delta_l\right)\varphi'(net_j) & \text{if } j \text{ is a hidden neuron} \end{cases}$$
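The delta rule can be checked numerically on a toy network. The sketch below (a two-neuron chain with sigmoid activation and squared-error loss; all weights and inputs are arbitrary illustrative values) compares the analytic gradient against a finite-difference estimate:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# A minimal chain: input x -> hidden neuron -> output neuron,
# with squared-error loss E = 0.5 * (o2 - t)^2.
x, w1, w2, t = 0.5, 0.8, -0.4, 1.0

# Forward pass.
net1 = w1 * x
o1 = sigmoid(net1)
net2 = w2 * o1
o2 = sigmoid(net2)

# Backward pass using the delta rule; for the sigmoid, phi'(net) = o * (1 - o).
delta2 = (o2 - t) * o2 * (1.0 - o2)          # output neuron
grad_w2 = delta2 * o1
delta1 = (w2 * delta2) * o1 * (1.0 - o1)     # hidden neuron
grad_w1 = delta1 * x

# Numerical check of grad_w1 by central finite differences.
def loss(w1_):
    o1_ = sigmoid(w1_ * x)
    o2_ = sigmoid(w2 * o1_)
    return 0.5 * (o2_ - t) ** 2

eps = 1e-6
numeric = (loss(w1 + eps) - loss(w1 - eps)) / (2 * eps)
print(abs(grad_w1 - numeric) < 1e-6)  # True: analytic and numeric agree
```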



Idea of this section is from here

Regularization is used to improve a model's ability to generalize to unseen data. Empirical learning of classification is underdetermined, because it attempts to infer a function for any $x$ given only data samples. This means models can suffer from over-fitting.

We can apply regularization to the loss function to reduce over-fitting. The idea is similar to Occam's razor, which states that when multiple solutions can describe a model, the simpler ones are more likely to be correct. Regularization tries to simplify the model by shrinking its coefficients. To do this, we introduce a new term into the loss function to represent the penalty.

For example, we have a residual-sum-of-squares loss function

$$RSS = \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2$$

For Ridge Regression regularization, we add a term like this:

$$\lambda \sum_{j=1}^{p} \beta_j^2$$

, which is also called the L2 norm penalty.

In Lasso regularization, the penalty term is given by:

$$\lambda \sum_{j=1}^{p} |\beta_j|$$

A large $\lambda$ indicates a high impact of the penalty term, so if $\lambda \rightarrow \infty$, all the coefficients tend to 0.

In order to find a suitable $\lambda$ value, cross validation can be applied.
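As an illustration of how the penalty terms sit on top of the RSS loss (the data and coefficient values below are made up for the example, not from the project):

```python
def rss(y, X, beta0, beta):
    """Residual sum of squares for a linear model y ~ beta0 + X . beta."""
    total = 0.0
    for yi, xi in zip(y, X):
        pred = beta0 + sum(b * x for b, x in zip(beta, xi))
        total += (yi - pred) ** 2
    return total

def ridge_loss(y, X, beta0, beta, lam):
    # L2 penalty: lambda * sum of squared coefficients.
    return rss(y, X, beta0, beta) + lam * sum(b ** 2 for b in beta)

def lasso_loss(y, X, beta0, beta, lam):
    # L1 penalty: lambda * sum of absolute coefficients.
    return rss(y, X, beta0, beta) + lam * sum(abs(b) for b in beta)

# One sample, one feature: pred = 2.0, residual = -1.0, RSS = 1.0.
print(ridge_loss([1.0], [[1.0]], 0.0, [2.0], lam=1.0))  # 1.0 + 4.0 = 5.0
print(lasso_loss([1.0], [[1.0]], 0.0, [2.0], lam=1.0))  # 1.0 + 2.0 = 3.0
```

Note that the intercept $\beta_0$ is conventionally left out of the penalty, as in the formulas above.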


Optimizer is the TensorFlow module where the optimization algorithms are implemented. Given data samples, it automatically calculates the gradient for each coefficient from the loss and the topology of the forward graph by applying the chain rule, as introduced in the Back propagation section.

Whenever optimization is required, an instance of a subclass of Optimizer is created; each subclass has its own implementation of the minimize() function.

When the optimizer optimizes the model, the following operations are taken in order: compute_gradients() computes the gradients of the loss with respect to the trainable variables, then apply_gradients() applies the resulting updates to them; minimize() simply calls the two in sequence.


A CNN processes images by applying convolution, pooling, fully connected and normalization layers.


Convolution applies a certain number of convolution kernels to an image. We may specify the kernel size, number of kernels, stride, padding, and so on.
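A sketch of the sliding-window arithmetic for a single channel with "valid" padding (no kernel flip, as is conventional in CNN libraries; the image and kernel values are illustrative):

```python
def conv2d_valid(image, kernel, stride=1):
    """Slide the kernel over the image and take the elementwise
    product-sum at each position ('valid' padding: no border)."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(0, len(image) - kh + 1, stride):
        row = []
        for j in range(0, len(image[0]) - kw + 1, stride):
            s = sum(image[i + a][j + b] * kernel[a][b]
                    for a in range(kh) for b in range(kw))
            row.append(s)
        out.append(row)
    return out

# A 2x2 kernel over a 4x4 image gives a 3x3 feature map.
img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
k = [[1, 0],
     [0, 1]]
print(conv2d_valid(img, k))  # [[7, 9, 11], [15, 17, 19], [23, 25, 27]]
```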


A pooling layer is applied to the convolved image by taking an operation over each S-by-S grid cell; for example, max pooling takes the maximum value of the cells.
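A minimal sketch of max pooling with non-overlapping S-by-S windows (here S = 2; the input values are arbitrary):

```python
def max_pool(image, s):
    """Max pooling: take the max over each non-overlapping s-by-s cell."""
    out = []
    for i in range(0, len(image), s):
        row = []
        for j in range(0, len(image[0]), s):
            row.append(max(image[i + a][j + b]
                           for a in range(s) for b in range(s)))
        out.append(row)
    return out

img = [[1, 3, 2, 1],
       [4, 2, 1, 5],
       [6, 1, 2, 2],
       [1, 2, 8, 4]]
print(max_pool(img, 2))  # [[4, 5], [6, 8]]
```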

Spatial Pyramid Pooling

Neural networks require a fixed size at the output layer, which means the input image must be resized by cropping or warping. But such operations may distort or discard some features in the image, so another method, Spatial Pyramid Pooling (SPP), can be applied after the convolution layers.

The SPP layer maintains spatial information by pooling in local spatial bins whose sizes are proportional to the input size. The last pooling layer of the CNN is replaced by an SPP layer, and pooling is applied within each spatial bin.
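The fixed-output property can be sketched as follows. The window and stride formulas (ceiling and floor of size/n) follow the SPP paper; the pyramid levels {4, 2, 1} used here are illustrative and give 21 bins per channel for any input size:

```python
import math

def spp_bins(feature_size, levels=(4, 2, 1)):
    """For each pyramid level n, pool the feature map into n x n spatial
    bins; the window and stride scale with the input size, so the total
    number of pooled values is fixed regardless of input size."""
    plan = []
    for n in levels:
        win = math.ceil(feature_size / n)     # bin (window) size
        stride = math.floor(feature_size / n) # bin stride
        plan.append((n * n, win, stride))
    return plan

# Two different feature map sizes yield the same bin count (16 + 4 + 1 = 21);
# only the window and stride change.
print(sum(bins for bins, _, _ in spp_bins(13)))  # 21
print(sum(bins for bins, _, _ in spp_bins(10)))  # 21
```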

Fully connect

The fully connected layer is used to convert the feature map into a normal neural network layer (logits, or a probability distribution in classification problems).

Convolution implementation of fully connect

The naive implementation of the fully connected layer is a little slow, so we can use a convolutional implementation instead. For example, given a feature map of size $N\times N \times L$, we can apply convolution kernels of size $N\times N$ with $L$ input channels; the number of kernels is then equivalent to the size of the fully connected layer.
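The equivalence can be sketched with plain lists (all values illustrative): one output neuron of the fully connected layer and one N×N×L kernel compute the same dot product over the whole feature map.

```python
# Feature map of size N x N with L channels, flattened row-major.
N, L = 2, 3
feature = [0.5, -1.0, 2.0, 0.25, 1.5, -0.5,
           1.0, 0.0, -2.0, 0.75, 0.5, 1.25]  # N*N*L = 12 values
kernel = [0.1 * i for i in range(N * N * L)]  # one N x N x L conv kernel

# Fully connected view: dot product of flattened feature map and weights.
fc_output = sum(f * w for f, w in zip(feature, kernel))

# Convolution view: the kernel covers the entire N x N spatial extent,
# summing over positions and channels -- the same computation.
conv_output = 0.0
for i in range(N):
    for j in range(N):
        for c in range(L):
            idx = (i * N + j) * L + c
            conv_output += feature[idx] * kernel[idx]

print(fc_output == conv_output)  # True
```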

Siamese Net


Model and estimator

Estimator is a high-level TensorFlow API which helps simplify building machine learning models. It encapsulates four different operations: training, evaluation, prediction, and export for serving (the first three correspond to the modes in tf.estimator.ModeKeys):

To use an estimator, we have to define the following modules: an input function (input_fn) that builds the data pipeline, and a model function (model_fn) that defines the network and its behavior for each mode.

Data streamer

In this project, the dataset is a large image collection, sometimes tens or hundreds of gigabytes in size. We can hardly fit the training dataset into memory, so we have to build a pipeline for data streaming. A pipeline not only streams smaller batches of data but also balances the load between CPU and GPU: the CPU pre-processes and streams the data online, while the GPU runs the convolutions and weight updates.

As mentioned before, the dataset is stored in the host file system, so all we need to do is build a generator which keeps yielding pairs of image file paths. A very simple version can be built as follows:

import os
import random

def file_dir_streamer(image_dir):
    # Index the dataset: map each directory (one person) to its image files.
    file_dic = {}
    for root, dirs, files in os.walk(image_dir):
        if files:
            file_dic[root] = files
    keys = list(file_dic.keys())
    while True:
        # Pick two images at random; the pair is positive (label 1.0)
        # only when both come from the same person's directory.
        key = random.choice(keys)
        value = random.choice(file_dic[key])
        path = os.path.join(key, value)
        key_ = random.choice(keys)
        value_ = random.choice(file_dic[key_])
        path_ = os.path.join(key_, value_)
        label = 1.0 if key_ == key else 0.0
        yield path, path_, label

In the example above, the images are selected completely at random (which is not a proper way to train a Siamese net, because this method yields negative image pairs almost all the time), but it does indicate how the streamer is supposed to work.

It first lists all the possible file paths by walking the directory tree, then yields three things: two image paths and whether they belong to the same person.

However, this is not enough, because we still have to load the actual image data from the hard disk, so we map the dataset to a new one as follows:

def _parse_function(path, path_, label):
    # Load both images from disk and decode them as grayscale.
    image_string = tf.read_file(path)
    image_string_ = tf.read_file(path_)
    image_decoded = tf.image.decode_jpeg(image_string)
    image_decoded_ = tf.image.decode_jpeg(image_string_)
    image_decoded = tf.image.rgb_to_grayscale(image_decoded)
    image_decoded_ = tf.image.rgb_to_grayscale(image_decoded_)

    # Resize to 128x128 and standardize each image to zero mean, unit variance.
    image_resized = tf.image.per_image_standardization(
        tf.image.resize_images(image_decoded, [128, 128]))
    image_resized_ = tf.image.per_image_standardization(
        tf.image.resize_images(image_decoded_, [128, 128]))

    # Flatten both 128*128 images and concatenate them into one feature vector.
    out = tf.concat([tf.reshape(image_resized, [16384]),
                     tf.reshape(image_resized_, [16384])], axis=-1)

    return out, label

def input_func_gen(image_dir):
    dset = tf.data.Dataset.from_generator(
        lambda: file_dir_streamer(image_dir),
        output_types=(tf.string, tf.string, tf.float32))

    dset = dset.map(map_func=_parse_function, num_parallel_calls=4)

    return dset

In the code above, the Dataset instance is first instantiated using the from_generator method, followed by a map. The parse function maps the paths to the actual images: it accepts parameters exactly matching the data yielded from the streamer and returns the image data, which becomes the features argument of the model function.