Multinomial Logistic Regression In a Nutshell

Logistic Regression on the Fashion MNIST Dataset

Logistic regression is one of the most frequently used models in classification problems. It can accurately predict the probability of a person having certain diseases, the probability of a person getting a ticket if he/she is speeding, or the probability of a sports team winning a game. Notice that these examples are binary, meaning that the logistic regression would have two outcomes: a “Yes” or a “No”. We call this a binary logistic regression.

There is another type of logistic regression that can predict more than two outcomes. This is multinomial (multiclass) logistic regression (MLR).

In this tutorial, we will not be using any external package functions to build our model. Instead, we will be building a multinomial logistic regression model from scratch, only using numpy and seemingly complex mathematics. Don’t fret, I will explain the math in the simplest form possible.

To learn about multinomial logistic regression, let's first remind ourselves of the components of a binary logistic regression model.

In binary logistic regression, we have:

- a score, the dot product of the features and the weights,
- the sigmoid activation function, which turns that score into a probability,
- the binary cross-entropy loss function, and
- gradient descent to minimize that loss.

In multinomial logistic regression, we have:

- a score for each class, again the dot product of the features and the weights,
- the softmax activation function, which turns those scores into a probability for every class,
- the cross-entropy loss function, and
- stochastic gradient descent to minimize that loss.

MLR shares its steps with binary logistic regression; the only difference is the function used at each step. In binary logistic regression, the sigmoid function is used because the problem has exactly two outcomes. In MLR, the problem is no longer binary, so we use the softmax function, which distributes a probability across every output node. Because the activation function changes in MLR, the loss function changes as well, since the loss function depends on the activation function.
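To make that difference concrete, here is a minimal numpy sketch of the two activation functions. The helper names and the trick of subtracting the maximum score (to keep the exponentials from overflowing) are my own choices, not from the original post:

```python
import numpy as np

def sigmoid(z):
    # Binary logistic regression: squashes a single score into a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def softmax(scores):
    # Multinomial logistic regression: turns a vector of class scores into
    # probabilities that sum to 1. Subtracting the max keeps exp() stable.
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)

print(sigmoid(0.0))                        # 0.5
print(softmax(np.array([2.0, 1.0, 0.1])))  # three probabilities summing to 1
```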

Lastly, we use stochastic gradient descent rather than regular gradient descent because the dataset is large. Computing the gradient over the entire training set at every step would take too much computation. We will come back to this topic when we implement stochastic gradient descent in our code. Now that we know MLR in words, let's see what it looks like visually.

Compared with neural networks, MLR is much easier to implement because it is simple. MLR requires only a single layer, which means far fewer calculations than a multi-layer neural network. However, since MLR is a less complex model, its accuracy will not be as high as that of neural network models.

Figure 2: ANN (left), RNN (middle), and CNN (right)

Now that we understand multinomial logistic regression, let's apply our knowledge. We'll build the MLR model by following the structure shown in the graph above (Figure 1).

Figure 3.2. Labels
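To follow along, here is one way to get Fashion MNIST into numpy arrays. I'm assuming the keras.datasets helper purely to download the data; the model itself stays numpy-only, as stated earlier:

```python
import numpy as np
from tensorflow.keras.datasets import fashion_mnist  # used only to fetch the data

(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()

# Flatten each 28x28 image into a 784-dimensional feature vector and scale to [0, 1].
X_train = X_train.reshape(len(X_train), -1) / 255.0
X_test = X_test.reshape(len(X_test), -1) / 255.0

print(X_train.shape, y_train.shape)  # (60000, 784) (60000,)
```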

Task:

We use a one-hot encoder to transform the original Y values into one-hot encoded Y values because our predicted values are probabilities. I will explain this in the next step.

Figure 4. Given X- and Y-values and desired X- and Y-values
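A minimal numpy-only way to do the one-hot encoding might look like this; the helper name is my own:

```python
import numpy as np

def one_hot(y, num_classes=10):
    # Turn integer labels (e.g. 4 for "coat") into rows with a single 1 in that position.
    encoded = np.zeros((len(y), num_classes))
    encoded[np.arange(len(y)), y] = 1.0
    return encoded

# Example: label 4 (coat) becomes [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
print(one_hot(np.array([4]))[0])
```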

Task:

Looking at Figure 1, the next step is computing the dot product between the vector of features and the vector of weights. Our initial weight vector will be an array of 0s because we do not have any better starting values. Don't worry: the weights will be updated continuously as the loss function is minimized. The dot product is called the score, and this score is the deciding factor that predicts whether our image is a T-shirt/top, a dress, or a coat.

The model outputs an array of probability values, one for each possible class, and the predicted Y value is the argmax of those probabilities. For example, in an array of 10 probabilities, if the 5th element has the highest probability, the image is labeled a coat, since the 5th element of the Y values corresponds to the coat (Figure 3.1).
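Sketched in numpy, assuming X is a matrix of flattened images and reusing the softmax helper from earlier (the variable names are mine, and I omit a bias term since the post only describes the feature-weight dot product):

```python
import numpy as np

num_features = 784   # one feature per pixel in a 28x28 image
num_classes = 10

# Start from all-zero weights; they will be updated as the loss is minimized.
W = np.zeros((num_features, num_classes))

def predict_proba(X, W):
    # Score: dot product between the features and the weights, one score per class.
    scores = X @ W
    return softmax(scores)  # softmax as defined earlier

def predict(X, W):
    # The predicted label is the class with the highest probability (argmax).
    return np.argmax(predict_proba(X, W), axis=1)
```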

Task:

The way to maximize correctness is to minimize the loss of the cross-entropy function. To do that, we will apply gradient descent; specifically, stochastic gradient descent. Stochastic gradient descent works just like regular gradient descent, except that "stochastic" means random: at each step, the gradient is computed on a randomly selected sample of the training examples. Instead of taking the gradient over the entire training set, we only calculate it for the sampled examples. The purpose of stochastic gradient descent is to reduce the computation per iteration and save time.

Figure 8. Gradient descent and stochastic gradient descent formulas
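One way to write the update from Figure 8 in numpy, for a randomly drawn mini-batch, is sketched below; the learning rate and batch size are illustrative choices on my part, not values from the post:

```python
def sgd_step(X, Y_one_hot, W, learning_rate=0.1, batch_size=128):
    # Randomly sample a batch of training examples rather than using all of them.
    idx = np.random.choice(len(X), size=batch_size, replace=False)
    X_batch, Y_batch = X[idx], Y_one_hot[idx]

    # Forward pass: predicted probabilities for the sampled batch.
    probs = softmax(X_batch @ W)

    # Gradient of the cross-entropy loss with respect to the weights.
    grad = X_batch.T @ (probs - Y_batch) / batch_size

    # Take a small step against the gradient.
    return W - learning_rate * grad
```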

Task:
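A possible training loop under the same assumptions, tracking the cross-entropy loss periodically so we can plot a curve like the one in Figure 9 (the iteration count and logging interval are arbitrary choices of mine):

```python
def cross_entropy_loss(X, Y_one_hot, W, eps=1e-12):
    probs = softmax(X @ W)
    # Average negative log-probability of the correct class; eps guards against log(0).
    return -np.mean(np.sum(Y_one_hot * np.log(probs + eps), axis=1))

Y_train_one_hot = one_hot(y_train)
W = np.zeros((784, 10))
losses = []

for i in range(1000):
    W = sgd_step(X_train, Y_train_one_hot, W)
    if i % 50 == 0:
        losses.append(cross_entropy_loss(X_train, Y_train_one_hot, W))
        print(i, losses[-1])
```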

Figure 9. Losses after iterations

We can clearly see that the value of the loss function decreases substantially at first, and that's because the predicted probabilities start out nowhere close to the target values, so our loss is far from its minimum. As we get closer to the minimum, the error shrinks because the predicted probabilities are getting more and more accurate.

After fitting the model on the training set, let’s see the result for our test set predictions.
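Evaluating on the held-out test set can be as simple as comparing the argmax predictions to the true labels; this is just a sketch using the helper names introduced above:

```python
test_predictions = predict(X_test, W)
accuracy = np.mean(test_predictions == y_test)
print(f"Test accuracy: {accuracy:.2%}")
```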

It looks like our accuracy is about 85% (Figure 11), which is not so bad.

Figure 11. Accuracy Score
