# Logistic regression

Logistic regression is a supervised machine learning model used for classification tasks. It works by learning an hyperplane defined by some coefficients (or weights) θ = [θ0,…,θd] such that it can split the data into two subsets according to their labels. These coefficients θ are composed by a weights vector w = [w1,…,wd] plus a bias term b. For simplicity, we will consider a binary classification task but the algorithm can also generalized to deal with multi-labeled data simply by calculating a set of weights for each class.

Logistic regression is a probabilistic machine learning model, that is, its predictions are probability values ranging from 0 to 1. Thus, for a binary classification task we have that for a point x of the data, the probability of its label being y is equal to: $P(y=1|x,w,b)=\frac{1}{1+e^{-(x\times w^T+b)}}$

The above function is also as known as sigmoid function: $\sigma(z)=\frac{1}{1+e^{-z}}$

In our case, we have that z = wx+b represents the dot product between the weights vector times the point x of the data plus a scalar bias term b. By plotting the above formula, we can see that its values range from 0 to 1 excluded.

In order to simplify the notation, we will consider that θ = [b,w1,…,wd] is the coefficients vector, but we can also call it weights vector by considering the term b as an additional weight. Consequently, the input features vector will result to be x = [1,x1,…xd] with a 1 added in correspondence of the bias term of the coefficients vector θ. Hence, applying this new notation it results that: $P(y=1|x,\theta)=\frac{1}{1+e^{-(x\times\theta)}}$

### Maximum Likelihood Estimation

Maximum likelihood estimation principle (MLE) is a method of estimating the parameters of a probability distribution by maximizing a likelihood function. Given a dataset X (n x d) and its labels y (n x 1), what we are trying to learn is a set of parameters θ such that: $L(\theta)=P(y|X,\theta)=\prod_{i=1}^{n}P(y_i|x_i,\theta)$

L(θ) is the likelihood of the model given θ and according to the maximum likelihood estimation principle we have to pick the θ parameters that maximizes the likelihood of the data. In other words, we have to find the best coefficients such that the model would predict a probability value very close to 1 when predicting the primary class and a value very close to 0 for the secondary class.

So, the next step consists in estimating the probability P(y|x,θ). So far we know that when y = 1 we have that P(y|x,θ) = σ(xθ), equation (1), conversely, we will have that P(y|x,θ) = 1-σ(), equation (2), when y = 0 since also in this case we would like to maximize the probability value. Finally, the below formula puts the two equations together: $P(y|X,\theta)=\sigma(x\theta)^y\times(1-\sigma(x\theta))^{1-y}$

From this equation we can clearly notice that when y = 1 we obtain equation (1) and we get equation (2) when y = 0. Next, we can calculate the likelihood over the whole dataset in the following way: $L(\theta)=\prod_{i=1}^{n}P(y_i|X_i,\theta)=\prod_{i=1}^{n}\sigma(x_i\theta)^y_i\times(1-\sigma(x_i\theta))^{1-y_i}$

Next, we add the log to this expression either to avoid underflow when multiplying small numbers and mainly because logs can easily be converted into sums which are easier to be computed and derived respect to the products. Moreover, since log L(θ) is a monotonic function of L(θ), maximizing it is the same as maximizing L(θ) itself. $logL(\theta)=\sum_{i=1}^{n}y_i log\;\sigma(x_i\theta)+(1-y_i)log(1-\sigma(x_i\theta))$

The resulting formula is also known as binary cross-entropy loss and represents the likelihood for the L(θ) function that we want to maximize. Unfortunately, this function doesn’t present any closed form solution but the good news is that L(θ) is a concave function of θ, thus, we can proceed to calculate its maximum using gradient ascent technique. Usually, we use gradient descent when we want to minimize an objective function and gradient ascent when we want to maximize it.

It is well known that a partial derivative is the derivative of a function along a single dimension, whereby the other dimensions are treated as constants. The gradient of a function f(θ) when θ is d-dimensional, is given by a vector of partial derivatives for each dimension: $\nabla f(\theta)=[\frac{\partial f}{\partial \theta_1},...,\frac{\partial f}{\partial \theta_d}]$

The gradient expresses the direction of the fastest increase of the function f and its rate of increase in that direction. In this way we know how much we have to vary each of the θ parameters in order to maximize the value of L(θ), that is, maximize the total likelihood. Thus, by applying this definition to the logistic regression likelihood function we obtain the rule to update θ at each timestep t: $\theta_{t+1}=\theta_{t}+\eta\nabla log L(\theta_{t})$

Where η is the learning rate which is a hyper-parameter that controls to what extent we are going to vary the weights of our model respect to the gradient. Having a low value of η means moving toward the maximum of the function with smaller steps, thus, requiring more iterations to convergence toward the optimal solution and presenting the risk to get stuck into a false local maximum. On the other hand, a high learning rate will converge faster but has the risk of jumping over the optimal solution becoming unstable and to diverge. Hence, a good strategy would be to begin with a high learning rate for the initial iterations in order to speed up the convergence and reducing it over the time when more precision is required to find the optimal solution. This procedure is also known as annealing of the learning rate. High learning rate on the left and small learning rate on the right figure

After calculating the gradient of log L(θ) respect to the parameters θ, we obtain the following equation: $\nabla log L(\theta)=\sum_{i=1}^{n}x_i(y_i-\sigma(x_i\theta))$

The steps needed to derive above equation have been omitted here but can be still checked on this link. Finally, replacing it back in the weights update formula we get that: $\theta_{t+1}=\theta_{t}+\eta\sum_{i=1}^{n}x_i(y_i-\sigma(x_i\theta_{t}))$

Therefore, to adjust the model’s parameters, these iterations are repeated until the change of the likelihood between ti and ti+1 doesn’t fall below a fixed threshold.

Gradient ascent is a simple and common method to optimize the weights of a model but, for the case of the logistic regression, a faster convergence can be obtained by using another more efficient method called Iterative Reweighted Least Squares.

### Training with IRLS

Iterative Reweighted Least Squares is a method that exploits the Newton method to iteratively update the weights of a logistic regression model. In numerical analysis, the Newton method is an iterative method used to find an approximation of the root f(x) = 0. $x_{t+1}=x_t-\frac{f(x_t)}{f'(x_t)}$

Choosing an initial x0 value, successive iterations will converge toward the root of f(x).

To maximize the likelihood L(θ) we have to find θ such that: $\nabla L(\theta)=0$

Here we can notice an analogy with the Newton method since what we want to find is still the root of a function L(θ). Thus, in the Newton formula we can replace xt+1 and xt with θt+1 and θ respectively, f(xt) replaced with ∇L(θt) and f'(xt) with ∇2L(θt) = H which is knows as the Hessian matrix. The Hessian is a square matrix of second-order partial derivatives of a scalar-valued function, therefore, a model having d parameters will result in a (d x 1) gradient vector and (d x d) Hessian matrix.

Finally, substituting the new terms in the Newton method formula we obtain that: $\theta_{t+1}=\theta_t-\frac{\nabla L(\theta_t)}{H}=\theta_t-H^-1\nabla L(\theta_t)$ (3)

Previously, we had already calculated ∇log L(θt), so we can set the gradient of the new likelihood function as: $\nabla L(\theta)=\nabla log L(\theta)=\sum_{i=1}^{n}x_i(y_i-u_i)=X(y-u)$ (4)

Note that for simplicity we have defined ui = σ(xiθt) and in the last part of the above equation, X is the matrix representing the input data (n x d), whereby y (n x 1) and u (n x 1) is a vector.

Using (4), we can compute the Hessian matrix which results to be: $H=\nabla^2 L(\theta)=-\sum_{i=1}^{n}u_i(1-u_i)x_i x_i^T=-XRX^T$ (5)

Where Rii = ui(1 – ui) is a (d x d) square matrix. The last step consists in replacing (4) and (5) in the Newton method formula (3) to get the final formula for the IRLS algorithm: $\theta_{t+1}=\theta_t-\frac{\nabla L(\theta_t)}{H}=\theta_t-H^-1\nabla L(\theta_t)=\theta_t-(XRX^T)^{-1}X(u-y)$

IRLS takes O(n + d3) per iteration, where n is the number of training instances and d is the feature dimension, but it converges in fewer iterations than gradient ascent method.

### IRLS with L2 regularization

One of the drawbacks of the logistic regression is that its learning procedure might cause its weights w to become very large as the training goes on. This problem would lead to overfitting since the predicted probabilities yi=σ(xiθt) would end up being very close to 0 or to 1 most of the times. One way to prevent the weights from growing too much is penalizing them by applying L2 regularization, also known as rigde regularizarion. This result is achieved by adding the L2 norm of the weights ||θ||2 to the likelihood function: $\nabla L(\theta)=L(\theta)+\frac{\lambda}{2}||\theta||^2$ (6)

Where ||θ||2 is the L2 norm of θ. Next we are going to calculate the new gradient and the new Hessian with L2 regularization applied to the new objective function: $\nabla L(\theta)=\sum_{i=1}^{n}x_i(y_i-u_i)+\lambda\times\theta=X(y-u)+\lambda\times\theta$ (7) $H=\nabla^2 L(\theta)=-\sum_{i=1}^{n}u_i(1-u_i)x_i x_i^T+I\lambda=-XRX^T+I\lambda$ (8)

Where I is the identity matrix (d x d).Putting (7) and (8) in (3), we get that the formula to update the weights with IRLS algorithm and L2 regularization is: $\theta_{t+1}=\theta_t-(XRX^T-\lambda)^{-1}(X(u-y)-\lambda\times\theta)$