Linear Regression (Machine Learning Technique) for Beginners

 

Linear Regression: A Simple Introduction

Linear regression is a technique that helps us predict a continuous (numeric) value, such as a house price, from a set of input features (like the number of rooms, square footage, etc.). Here’s how it works:

  1. Supervised Learning: Linear regression is a supervised learning method, meaning it uses labeled data (input-output pairs) to make predictions.
  2. Predicting Continuous Values: Linear regression predicts continuous values based on input features. For example, if we want to predict the price of a house, the features could include the house's size, number of rooms, etc.

How the Formula Works

Let's take an example of predicting house prices. Suppose we have the following features for different houses:

Size (sq ft) (x₁) | Number of Rooms (x₂) | Age of House (x₃) | Price ($) (y)
------------------|----------------------|-------------------|--------------
2000              | 3                    | 10                | 300,000
1500              | 2                    | 20                | 250,000
1800              | 4                    | 15                | 275,000
2200              | 4                    | 5                 | 350,000
1600              | 3                    | 15                | 280,000

 

The Formula for Linear Regression

Linear regression uses the following formula to predict the price of a house:

y (House Price) = w₀ + w₁(x₁) + w₂(x₂) + w₃(x₃)

Where:

  • y is the predicted house price (target).
  • x₁, x₂, x₃ are the features (e.g., size, number of rooms, age of house).
  • w₀, w₁, w₂, w₃ are the weights (or coefficients) that the model learns from the data.

Example:

Let’s say the formula for our linear regression model looks like this:

y = 50,000 + 100x₁ + 10,000x₂ - 2,000x₃

Where:

  • x₁ = Size of the house in sq ft (e.g., 2000, 1500, etc.)
  • x₂ = Number of rooms
  • x₃ = Age of the house

Using the Formula

Now, let’s predict the price of a new house with the following features:

  • Size = 2100 sq ft
  • Rooms = 4
  • Age = 8 years

Putting these values into the formula:

y = 50,000 + 100(2100) + 10,000(4) - 2,000(8)

y = 50,000 + 210,000 + 40,000 - 16,000

y = 284,000

So, the predicted price for this new house is $284,000.
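As a quick sanity check, here is the same calculation in Python. This is just a sketch of the worked example: the weight values come from the formula above, and the function name predict_price is made up for this illustration.

```python
# Example weights from the formula above (illustrative values, not learned).
w0, w1, w2, w3 = 50_000, 100, 10_000, -2_000

def predict_price(size_sqft, rooms, age):
    """Apply the linear regression formula y = w0 + w1*x1 + w2*x2 + w3*x3."""
    return w0 + w1 * size_sqft + w2 * rooms + w3 * age

print(predict_price(2100, 4, 8))  # 284000
```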

Learning the Weights (w₀, w₁, w₂, w₃)

Now, you might be wondering: How are the weights (w₀, w₁, w₂, w₃) calculated? Well, that’s the main thing that makes linear regression so powerful and effective. The goal of linear regression is to find the best weights for the features (x₁, x₂, etc.) so that the predictions are as accurate as possible.

The model adjusts these weights to minimize the error between the predicted and actual prices. This process is known as training the model, and it’s what allows the model to improve over time.

The technique used to find these best weights is called Gradient Descent, and it works by adjusting the weights step by step to reduce the error. But don’t worry—we'll dive deeper into how this works in the next section!

Calculating the Weights (w₀, w₁, w₂, …)

  1. Start with the formula: The general linear regression equation is:

y = w₀ + w₁x₁ + w₂x₂ + w₃x₃ + ... + wₖxₖ

Here:

    • y is the predicted output (e.g., house price).
    • x₁, x₂, ... xₖ are the features (e.g., size, number of rooms, age of the house).
    • w₀, w₁, w₂, ... wₖ are the weights (or coefficients) that the model will learn.
  2. Randomly initialize the weights: Before starting, we assign random values to the weights. For example:
    • w₀ = 0.5
    • w₁ = 0.1
    • w₂ = -0.2 (These values are just random guesses to start with.)
  3. Calculate the predicted output: Using the randomly initialized weights, calculate the predicted output (ŷ) for each data point. For example, for the first data point:

                                  ŷ = w₀ + w₁x₁ + w₂x₂ + w₃x₃

where x₁, x₂, x₃ are the features (like house size, number of rooms, and age).

  4. Calculate the error: The error is the difference between the predicted value and the actual value (the true house price). For each data point, the error is:

                                  Error = ŷ - y(actual)

This tells you how far off the model’s prediction is from the actual value.
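Here is a small Python sketch of steps 2 to 4, using the house-price table from earlier. The NumPy usage and the random seed are illustrative choices, not part of the algorithm itself.

```python
import numpy as np

# The five houses from the table above: size (sq ft), rooms, age.
X = np.array([
    [2000, 3, 10],
    [1500, 2, 20],
    [1800, 4, 15],
    [2200, 4,  5],
    [1600, 3, 15],
], dtype=float)
y_actual = np.array([300_000, 250_000, 275_000, 350_000, 280_000], dtype=float)

rng = np.random.default_rng(0)
w0 = rng.random()          # step 2: randomly initialized intercept w0
w = rng.random(3)          # step 2: randomly initialized weights w1, w2, w3

y_hat = w0 + X @ w         # step 3: predicted output ŷ for every house
errors = y_hat - y_actual  # step 4: Error = ŷ - y(actual), one per house
print(errors)
```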

 

Explaining the Cost Function in Linear Regression

After calculating the predicted values (denoted as ŷ) for each data point, we want to know how far off these predictions are from the actual values (denoted as y). This difference, or error, gives us an indication of how well our model is performing. To measure this error across all data points, we use a Cost Function.

The cost function for linear regression is often referred to as the Mean Squared Error (MSE), but we’ll start with a simple version called the Sum of Squared Errors (SSE), which is later averaged.

 

 

Cost Function Formula:

J(w₀, w₁, ..., wₖ) = (1/2m) × Σᵢ₌₁ᵐ (ŷᵢ - yᵢ)²
Where:

  • J(w₀, w₁, ..., wₖ) is the cost (or error) function.
  • ŷᵢ is the predicted value for the i-th data point.
  • yᵢ is the actual value for the i-th data point.
  • m is the total number of data points.
  • w₀, w₁, ..., wₖ are the weights (parameters) we are trying to optimize.
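Translated into code, the cost function is essentially a one-liner. A minimal sketch, reusing the y_hat and y_actual arrays from the previous snippet:

```python
import numpy as np

def cost(y_hat, y_actual):
    """J = (1/2m) * Σ (ŷᵢ - yᵢ)², the cost function defined above."""
    m = len(y_actual)
    return np.sum((y_hat - y_actual) ** 2) / (2 * m)

# Example usage with the arrays from the previous snippet:
# print(cost(y_hat, y_actual))
```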

Why is the error squared?

The reason we square the errors is simple but important:

  1. Ensure Positive Errors:
    If we simply summed the errors, negative and positive errors could cancel each other out. This would make the total error seem smaller than it actually is, making it harder to gauge how well the model is performing.

For example, if you have two data points:
Error 1: -2 (prediction is too high)
Error 2: +2 (prediction is too low)
The sum of errors would be -2 + 2 = 0, which is misleading. The total error should reflect the magnitude of the error, so squaring makes all errors positive, and larger errors contribute more.

  2. Prevent Negative Values:
    Squaring makes sure that both positive and negative errors contribute positively to the total error. This way, we avoid cases where a positive and negative error might cancel each other out.
  3. Simplify Optimization:
    The reason we multiply the errors by themselves (i.e., square them) is that it makes it easier to perform optimization mathematically. Squared terms make the cost function smoother, and we can apply mathematical techniques like Gradient Descent to minimize it efficiently.

 

Why Multiply by (1/2)?

You might be wondering why we multiply the cost function by 1/2. The answer is that it simplifies the math later when we compute the derivative (slope) of the cost function with respect to the weights. This will come in handy during Gradient Descent optimization, which we’ll discuss in the next section. The 1/2 cancels out the factor of 2 that appears when we take the derivative.
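For readers comfortable with calculus, here is the cancellation worked out. This is a short sketch using the cost function defined above, where ∂ŷᵢ/∂wⱼ = xᵢⱼ (with xᵢ₀ = 1 for the intercept term):

```latex
\frac{\partial J}{\partial w_j}
  = \frac{\partial}{\partial w_j}\,\frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)^2
  = \frac{1}{2m}\sum_{i=1}^{m} 2\left(\hat{y}_i - y_i\right)x_{ij}
  = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)x_{ij}
```

The factor of 2 from differentiating the square cancels against the 1/2, leaving a clean expression for the gradient.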

 

In summary, squaring the errors allows us to:

  1. Make all errors positive.
  2. Prevent cancellation of errors.
  3. Make the error function mathematically easier to minimize.

 

Gradient Descent for Optimizing Weights

After calculating the cost function J(w₀, w₁, ..., wₖ), the goal is to minimize this error function to find the best weights. To do this, we use Gradient Descent.

Gradient Descent is an optimization algorithm that minimizes the cost function by updating the weights in the opposite direction of the gradient (slope) of the cost function. The size of each update is determined by the learning rate, α.

Gradient Descent Formula:

The weight update rule is as follows:

                                  wᵢ := wᵢ - α * (∂J(w) / ∂wᵢ)

Where:

  • wᵢ is the weight being updated.
  • α is the learning rate.
  • ∂J(w) / ∂wᵢ is the derivative (slope) of the cost function with respect to the weight wᵢ.

Visualizing Gradient Descent



  • Case 1: w to the right of the minimum (positive slope):
    • If (∂J(w) / ∂wᵢ) > 0, the gradient is positive, meaning the current weight is too high, and we need to decrease it to move towards the minimum. The weight is updated as:

                                                    wᵢ := wᵢ - α * (∂J(w) / ∂wᵢ)

    • Because the slope is positive, the subtraction decreases wᵢ, moving it towards the minimum.
  • Case 2: w to the left of the minimum (negative slope):
    • If (∂J(w) / ∂wᵢ) < 0, the gradient is negative, meaning the current weight is too low, and we need to increase it to move towards the minimum. The same update rule applies:

                                               wᵢ := wᵢ - α * (∂J(w) / ∂wᵢ)

    • Because the slope is negative, subtracting a negative quantity increases wᵢ, which is exactly what moves it towards the minimum. (A runnable sketch of this update loop follows below.)
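To make these updates concrete, here is a minimal Python sketch of gradient descent on the house-price data from earlier. The feature standardization, zero initialization, α = 0.1, and 1,000 iterations are all illustrative choices for this sketch, not values prescribed by the method:

```python
import numpy as np

# House-price data from the table above.
X = np.array([
    [2000, 3, 10],
    [1500, 2, 20],
    [1800, 4, 15],
    [2200, 4,  5],
    [1600, 3, 15],
], dtype=float)
y = np.array([300_000, 250_000, 275_000, 350_000, 280_000], dtype=float)

# Standardize features so a single learning rate works for all of them.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
m = len(y)

w0, w = 0.0, np.zeros(3)   # start from zero weights
alpha = 0.1                # learning rate (hand-picked for this sketch)

for _ in range(1000):
    error = (w0 + X_scaled @ w) - y          # ŷ - y for every data point
    w0 -= alpha * error.mean()               # ∂J/∂w0 = (1/m) Σ error
    w  -= alpha * (X_scaled.T @ error) / m   # ∂J/∂wⱼ = (1/m) Σ error·xⱼ

print(w0, w)  # learned intercept and weights (on the scaled features)
```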



Final Step:

As we continue updating the weights, the slope of the cost function gets smaller and approaches zero. When the gradient is zero, we have reached the minimum of the cost function, and further updates are no longer needed. This is the point where the weight updates stop, and we have found the optimal weights.

 

Note on Learning Rate (α):

The learning rate (α) is a crucial parameter in gradient descent. It controls how much we adjust the weights in each iteration based on the gradient. A carefully chosen learning rate helps the model converge to the optimal solution. However, if the learning rate is not chosen wisely, it can cause issues:

  1. If α is too small (slow learning):
    • The gradient updates will be tiny, and the model may take a very long time to converge. (In more complex, non-convex models, very small steps can also leave the model stuck in flat regions or local minima.)
    • This is often seen when the updates are so small that the weights barely change between iterations, leading to slow progress.
  2. If α is too large (overshooting):
    • A large learning rate can cause the updates to overshoot the optimal values, jumping back and forth over the minimum. Instead of converging to the global minimum, it can diverge or get trapped in a suboptimal solution, oscillating around a point.
    • In extreme cases, the learning rate can even cause the cost to increase rather than decrease, as seen when the model moves past the minima.

To avoid these issues, you need to tune the learning rate. A good practice is to start with a moderate value and adjust it based on the behavior of the cost function: if the cost decreases but very slowly, try increasing the learning rate; if the cost oscillates wildly or increases, reduce the learning rate to avoid overshooting.
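As a rough way to eyeball this, the sketch below runs a few gradient descent steps with several candidate learning rates and prints the resulting cost. It reuses X_scaled, y, and m from the gradient descent sketch above; the candidate values are arbitrary:

```python
import numpy as np

for alpha in (0.001, 0.01, 0.1, 1.0):
    w0, w = 0.0, np.zeros(3)
    for _ in range(100):
        error = (w0 + X_scaled @ w) - y
        w0 -= alpha * error.mean()
        w  -= alpha * (X_scaled.T @ error) / m
    final_error = (w0 + X_scaled @ w) - y
    cost = np.sum(final_error ** 2) / (2 * m)
    print(f"alpha={alpha}: cost after 100 steps = {cost:.3e}")
    # Too small: cost barely moves. Too large: cost oscillates or blows up.
```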



Underfitting vs Overfitting

In machine learning, underfitting and overfitting refer to how well your model generalizes to new data, and they are key concepts when evaluating model performance.

1. Underfitting

  • Definition: Underfitting happens when the model is too simple to capture the underlying patterns of the data. It doesn't learn enough from the training data, resulting in poor performance on both the training set and test set.
  • Cause: This can happen when:
    • The model is too simple (e.g., using a linear model when the data is non-linear).
    • The model has too few features or is not complex enough to capture the relationships in the data.
    • The model is not trained enough (insufficient epochs in training).
  • Visual Example: Imagine trying to fit a straight line to a set of data points that follow a curved pattern. The straight line will fail to capture the curve, leading to high bias.
  • Impact:
    • High bias (the model’s assumptions are too strong and incorrect).
    • Low variance (the model’s predictions don’t change much with different training data).

2. Overfitting

  • Definition: Overfitting happens when the model learns not only the underlying patterns but also the noise and random fluctuations in the training data. This leads to excellent performance on the training data but poor generalization to new data (i.e., test data).
  • Cause: This can happen when:
    • The model is too complex (e.g., using a very deep neural network for a simple problem).
    • The model is trained for too many iterations, capturing too much noise.
    • There are too many features (or irrelevant features) in the model.
  • Visual Example: Imagine trying to fit a very wiggly curve to data that is actually linear. The model will adapt to every small fluctuation in the data, resulting in an overly complex curve that doesn't generalize well to new data.
  • Impact:
    • Low bias (the model fits the training data well).
    • High variance (the model is highly sensitive to small changes in the training data).

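One way to see underfitting and overfitting side by side is to fit polynomials of increasing degree to the same noisy, curved data and compare training and test error. Below is a sketch using scikit-learn (an assumed dependency; the synthetic data and the degrees chosen are arbitrary illustrative values):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic curved data: y = x² plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = X.ravel() ** 2 + rng.normal(0, 1, size=60)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 2, 15):  # too simple, about right, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree}: train MSE={train_mse:.2f}, test MSE={test_mse:.2f}")

# Degree 1 underfits (both errors high); degree 15 typically overfits
# (low training error, noticeably higher test error).
```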
 

Bias and Variance Explained

Bias and Variance are two sources of errors in machine learning models. Understanding the trade-off between them is crucial for building good models.

1. Bias

  • Definition: Bias refers to the error introduced by approximating a real-world problem (which may be complex) by a simplified model.
  • In Simple Terms: Bias is the model’s tendency to consistently make certain types of mistakes because it oversimplifies the problem.
    • High Bias: If a model has high bias, it means the model is too simple and makes strong assumptions. This leads to underfitting and poor performance.
    • Low Bias: If a model has low bias, it means the model is flexible enough to learn the underlying patterns in the data, leading to a better fit.

 

 

2. Variance

  • Definition: Variance refers to the model’s sensitivity to small changes in the training data. If a model has high variance, it means it can adapt too much to the training data, capturing noise along with the signal.
  • In Simple Terms: Variance is how much the model's predictions would change if we used a different training dataset.
    • High Variance: High variance means the model is overfitting and will perform poorly on new, unseen data because it’s too sensitive to the training data.
    • Low Variance: Low variance means the model's predictions are stable and consistent across different training datasets.

The Bias-Variance Trade-Off

  • Ideal Model: The goal is to find a model with low bias (accurately captures the data patterns) and low variance (performs well on new data).
  • Too Simple Model: High bias, low variance (underfitting).
  • Too Complex Model: Low bias, high variance (overfitting).

There’s a natural trade-off between bias and variance:

  • Increasing Model Complexity (e.g., adding more features or using a more complex model) will decrease bias but increase variance.
  • Decreasing Model Complexity (e.g., reducing features or simplifying the model) will decrease variance but increase bias.

Visuals for Underfitting, Overfitting, Bias, and Variance

  • Underfitting: A straight line fitting a curved data pattern. High bias, low variance.
  • Overfitting: A very complex curve that fits the data too closely. Low bias, high variance.



Final Summary:

  • Underfitting: Model is too simple (high bias).
  • Overfitting: Model is too complex (high variance).
  • Bias: Error due to overly simplistic assumptions.
  • Variance: Error due to the model being too sensitive to fluctuations in the training data.

 
