Regularization Techniques Machine Learning

Salmen Zouari
9 min readJan 21, 2021

Have you ever encountered a situation where your machine learning model models the training data exceptionally well but fails to perform well on the testing data i.e. was not able to predict test data? This situation can be dealt with regularization in Machine learning.

Overfitting happens when a model learns the very specific pattern and noise from the training data to such an extent that it negatively impacts our model’s ability to generalize from our training data to new (“unseen”) data. By noise, we mean the irrelevant information or randomness in a dataset.

Preventing overfitting is very necessary to improve the performance of our machine learning model.

Following pointers will be covered in this article and eventually help us solve this problem,

  • What is Regularization
  • How Does Regularization Work
  • Regularization Techniques

What is Regularization?

In general, regularization means to make things regular or acceptable. This is exactly why we use it for applied machine learning. In the context of machine learning, regularization is the process which regularizes or shrinks the coefficients towards zero. In simple words, regularization discourages learning a more complex or flexible model, to prevent overfitting.

How Does Regularization Work?

The basic idea is to penalize the complex models i.e. adding a complexity term that would give a bigger loss for complex models. To understand it, let’s consider a simple relation for linear regression. Mathematically, it is stated as below:
Y≈ W_0+ W_1 X_1+ W_2 X_(2 )+⋯+W_P X_P

Where Y is the learned relation i.e. the value to be predicted.
X_1,X_(2 ),〖…,X〗_P , are the features deciding the value of Y.
W_1,W_(2 ),〖…,W〗_P , are the weights attached to the features X_1,X_(2 ),〖…,X〗_P respectively.
W_0 represents the bias.

Now, in order to fit a model that accurately predicts the value of Y, we require a loss function and optimized parameters i.e. bias and weights.

The loss function generally used for linear regression is called the residual sum of squares (RSS). According to the above stated linear regression relation, it can be given as:
RSS= ∑_(j=1)^m (Y_i-W_0-∑_(i=1)^n W_i X_ji )²

We can also call RSS as the linear regression objective without regularization.

Now, the model will learn by the means of this loss function. Based on our training data, it will adjust the weights (coefficients). If our dataset is noisy, it will face overfitting problems and estimated coefficients won’t generalize on the unseen data.

This is where regularization comes into action. It regularizes these learned estimates towards zero by penalizing the magnitude of coefficients.

But how it assigns a penalty to the coefficients, let’s explore.

Regularization Techniques

L1 Regularization

A regression model that uses L1 regularization technique is called Lasso Regression.

Mathematical formula for L1 Regularization.

Let’s define a model to see how L1 Regularization works. For simplicity, We define a simple linear regression model Y with one independent variable.

In this model, W represent Weight, b represent Bias.

W=w1,w2…wnX=x1,x2…xn

and the predicted result is Yˆ

Yˆ=w1x1+w2x2+…wnxn+b

Following formula calculates the error without Regularization function

Loss=Error(Y,Yˆ)

Following formula calculates the error With L1 Regularization function

Loss=Error(YYˆ)+λ∑1n|wi|

Note

Here, If the value of lambda is Zero then above Loss function becomes Ordinary Least Square whereas very large value makes the coefficients (weights) zero hence it under-fits.

One thing to note is that |w|

is differentiable when w!=0 as shown below,

d|w|dw={1−1w>0w<0

To understand the Note above,

Let’s substitute the formula in finding new weights using Gradient Descent optimizer.

wnew=wηL1∂w

When we apply the L1 in above formula it becomes,

wnew=wη.(Error(Y,Yˆ)+λd|w|dw)={wη.(Error(Y,Yˆ)+λ)wη.(Error(Y,Yˆ)−λ)w>0w<0

From the above formula,

  • If w is positive, the regularization parameter λ

> 0 will push w to be less positive, by subtracting λ

  • from w.
  • If w is negative, the regularization parameter λ

< 0 will push w to be less negative, by adding λ

  • to w. hence this has the effect of pushing w towards 0.

Simple python implementation

def update_weights_with_l1_regularization(features, targets, weights, lr,lambda):
'''
Features:(200, 3)
Targets: (200, 1)
Weights:(3, 1)
'''
predictions = predict(features, weights)
#Extract our features
x1 = features[:,0]
x2 = features[:,1]
x3 = features[:,2]
# Use matrix cross product (*) to simultaneously
# calculate the derivative for each weight
d_w1 = -x1*(targets - predictions)
d_w2 = -x2*(targets - predictions)
d_w3 = -x3*(targets - predictions)
# Multiply the mean derivative by the learning rate
# and subtract from our weights (remember gradient points in direction of steepest ASCENT)
weights[0][0] = (weights[0][0] - lr * np.mean(d_w1) - lambda) if weights[0][0] > 0 else (weights[0][0] - lr * np.mean(d_w1) + lambda)
weights[1][0] = (weights[1][0] - lr * np.mean(d_w2) - lambda) if weights[1][0] > 0 else (weights[1][0] - lr * np.mean(d_w2) + lambda)
weights[2][0] = (weights[2][0] - lr * np.mean(d_w3) - lambda) if weights[2][0] > 0 else (weights[2][0] - lr * np.mean(d_w3) + lambda)
return weights

Use Case

L1 Regularization (or varient of this concept) is a model of choice when the number of features are high, Since it provides sparse solutions. We can get computational advantage as the features with zero coefficients can simply be ignored.

L2 Regularization

A regression model that uses L2 regularization technique is called Ridge Regression. Main difference between L1 and L2 regularization is, L2 regularization uses “squared magnitude” of coefficient as penalty term to the loss function.

Mathematical formula for L2 Regularization.

Let’s define a model to see how L2 Regularization works. For simplicity, We define a simple linear regression model Y with one independent variable.

In this model, W represent Weight, b represent Bias.

W=w1,w2…wnX=x1,x2…xn

and the predicted result is Yˆ

Yˆ=w1x1+w2x2+…wnxn+b

Following formula calculates the error without Regularization function

Loss=Error(Y,Yˆ)

Following formula calculates the error With L2 Regularization function

Loss=Error(YYˆ)+λ∑1nw2i

Note

Here, if lambda is zero then you can imagine we get back OLS. However, if lambda is very large then it will add too much weight and it leads to under-fitting.

To understand the Note above,

Let’s substitute the formula in finding new weights using Gradient Descent optimizer.

wnew=wηL2∂w

When we apply the L2 in above formula it becomes,

wnew=wη.(Error(Y,Yˆ)+λL2∂w)=wη.(Error(Y,Yˆ)+2λw)

Simple python implementation

def update_weights_with_l2_regularization(features, targets, weights, lr,lambda):
'''
Features:(200, 3)
Targets: (200, 1)
Weights:(3, 1)
'''
predictions = predict(features, weights)
#Extract our features
x1 = features[:,0]
x2 = features[:,1]
x3 = features[:,2]
# Use matrix cross product (*) to simultaneously
# calculate the derivative for each weight
d_w1 = -x1*(targets - predictions)
d_w2 = -x2*(targets - predictions)
d_w3 = -x3*(targets - predictions)
# Multiply the mean derivative by the learning rate
# and subtract from our weights (remember gradient points in direction of steepest ASCENT)
weights[0][0] = weights[0][0] - lr * np.mean(d_w1) - 2 * lambda * weights[0][0]
weights[1][0] = weights[1][0] - lr * np.mean(d_w2) - 2 * lambda * weights[1][0]
weights[2][0] = weights[2][0] - lr * np.mean(d_w3) - 2 * lambda * weights[2][0]
return weights

Use Case

L2 regularization can address the multicollinearity problem by constraining the coefficient norm and keeping all the variables. L2 regression can be used to estimate the predictor importance and penalize predictors that are not important. One issue with co-linearity is that the variance of the parameter estimate is huge. In cases where the number of features are greater than the number of observations, the matrix used in the OLS may not be invertible but Ridge Regression enables this matrix to be inverted.

Dropout

Dropout is a regularization technique for reducing overfitting in neural networks by preventing complex co-adaptations on training data

Dropout is a technique where randomly selected neurons are ignored during training. They are “dropped-out” randomly. This means that their contribution to the activation of downstream neurons is temporally removed on the forward pass and any weight updates are not applied to the neuron on the backward pass.

Simply put, It is the process of ignoring some of the neurons in particular forward or backward pass.

Dropout can be easily implemented by randomly selecting nodes to be dropped-out with a given probability (e.g. .1%) each weight update cycle.

Most importantly Dropout is only used during the training of a model and is not used when evaluating the model.

import numpy as np
A = np.arange(20).reshape((5,4))
print("Given input: ")
print(A)
def dropout(X, drop_probability):
keep_probability = 1 - drop_probability
mask = np.random.uniform(0, 1.0, X.shape) < keep_probability
if keep_probability > 0.0:
scale = (1/keep_probability)
else:
scale = 0.0
return mask * X * scale
print("\n After Dropout: ")
print(dropout(A,0.5))

output from above code

Given input:
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]
[16 17 18 19]]
After Dropout:
[[ 0. 2. 0. 0.]
[ 8. 0. 0. 14.]
[16. 18. 0. 22.]
[24. 0. 0. 0.]
[32. 34. 36. 0.]]

Data Augmentation

Having more data (dataset / samples) is a best way to get better consistent estimators (ML model). In the real world getting a large volume of useful data for training a model is cumbersome and labelling is an extremely tedious task.

Either labelling requires more manual annotation, example — For creating a better image classifier we can use Mturk and involve more man power to generate dataset or doing survey in social media and asking people to participate and generate dataset. Above process can yield good dataset however those are difficult to carry and expensive. Having small dataset will lead to the well know Over fitting problem.

Data Augmentation is one of the interesting regularization technique to resolve the above problem. The concept is very simple, this technique generates new training data from given original dataset. Dataset Augmentation provides a cheap and easy way to increase the amount of your training data.

This technique can be used for both NLP and CV.

In CV we can use the techniques like Jitter, PCA and Flipping. Similarly in NLP we can use the techniques like Synonym Replacement,Random Insertion, Random Deletion and Word Embeddings.

It is worth knowing that Keras’ provided ImageDataGenerator for generating Data Augmentation.

Sample code for random deletion

def random_deletion(words, p):
"""
Randomly delete words from the sentence with probability p
"""
#obviously, if there's only one word, don't delete it
if len(words) == 1:
return words
#randomly delete words with probability p
new_words = []
for word in words:
r = random.uniform(0, 1)
if r > p:
new_words.append(word)
#if you end up deleting all words, just return a random word
if len(new_words) == 0:
rand_int = random.randint(0, len(words)-1)
return [words[rand_int]]
return new_words

Furthermore, when comparing two machine learning algorithms train both with either augmented or non-augmented dataset. Otherwise, no subjective decision can be made on which algorithm performed better

Early Stopping

One of the biggest problem in training neural network is how long to train the model.

Training too little will lead to underfit in train and test sets. Traning too much will have the overfit in training set and poor result in test sets.

Here the challenge is to train the network long enough that it is capable of learning the mapping from inputs to outputs, but not training the model so long that it overfits the training data.

One possible solution to solve this problem is to treat the number of training epochs as a hyperparameter and train the model multiple times with different values, then select the number of epochs that result in the best accuracy on the train or a holdout test dataset, But the problem is it requires multiple models to be trained and discarded.

Clearly, after ‘t’ epochs, the model starts overfitting. This is clear by the increasing gap between the train and the validation error in the above plot.

One alternative technique to prevent overfitting is use validation error to decide when to stop. This approach is called Early Stopping.

While building the model, it is evaluated on the holdout validation dataset after each epoch. If the accuracy of the model on the validation dataset starts to degrade (e.g. loss begins to increase or accuracy begins to decrease), then the training process is stopped. This process is called Early stopping.

Python implementation for Early stopping,

def early_stopping(theta0, (x_train, y_train), (x_valid, y_valid),
n = 1, p = 100):
""" The early stopping meta-algorithm for determining the best amount of time to train.
REF: Algorithm 7.1 in deep learning book.
Parameters:
n: int; Number of steps between evaluations.
p: int; "patience", the number of evaluations to observe worsening validataion set.
theta0: Network; initial network.
x_train: iterable; The training input set.
y_train: iterable; The training output set.
x_valid: iterable; The validation input set.
y_valid: iterable; The validation output set.
Returns:
theta_prime: Network object; The output network.
i_prime: int; The number of iterations for the output network.
v: float; The validation error for the output network.
"""
# Initialize variables
theta = theta0.clone() # The active network
i = 0 # The number of training steps taken
j = 0 # The number of evaluations steps since last update of theta_prime
v = np.inf # The best evaluation error observed thusfar
theta_prime = theta.clone() # The best network found thusfar
i_prime = i # The index of theta_prime
while j < p:
# Update theta by running the training algorithm for n steps
for _ in range(n):
theta.train(x_train, y_train)
# Update Values
i += n
v_new = theta.error(x_valid, y_valid)
# If better validation error, then reset waiting time, save the network, and update the best error value
if v_new < v:
j = 0
theta_prime = theta.clone()
i_prime = i
v = v_new
# Otherwise, update the waiting time
else:
j += 1
return theta_prime, i_prime, v

--

--