Linear Regression (Machine Learning Technique) for Beginners
Linear Regression: A Simple Introduction
Linear regression is a technique that helps us predict a continuous (numeric) value, like house prices, from a set of input features (like the number of rooms, square footage, etc.). Here’s how it works:
- Supervised Learning: Linear regression is a
supervised learning method, meaning it uses labeled data (input-output
pairs) to make predictions.
- Predicting Continuous Values: Linear regression predicts
continuous values based on input features. For example, if we want to
predict the price of a house, the features could include the house's size,
number of rooms, etc.
How the Formula Works
Let's take
an example of predicting house prices. Suppose we have the following features
for different houses:
| Size (sq ft) (x₁) | Number of Rooms (x₂) | Age of House (x₃) | Price (y) |
|---|---|---|---|
| 2000 | 3 | 10 | 300000 |
| 1500 | 2 | 20 | 250000 |
| 1800 | 4 | 15 | 275000 |
| 2200 | 4 | 5 | 350000 |
| 1600 | 3 | 15 | 280000 |
The Formula for Linear Regression
Linear regression uses the following formula to predict the price of a house:
y (House Price) = w₀ + w₁x₁ + w₂x₂ + w₃x₃
Where:
- y is the predicted house price
(target).
- x₁, x₂, x₃ are the features (e.g., size,
number of rooms, age of house).
- w₀, w₁, w₂, w₃ are the weights (or
coefficients) that the model learns from the data.
Example:
Let’s say the formula for our linear regression model looks like this:
y = 50,000 + 100x₁ + 10,000x₂ - 2,000x₃
Where:
- x₁ = Size of the house in sq ft
(e.g., 2000, 1500, etc.)
- x₂ = Number of rooms
- x₃ = Age of the house
Using the Formula
Now, let’s predict the price of a new house with the following features:
- Size = 2100 sq ft
- Rooms = 4
- Age = 8 years
Putting these values into the formula:
y = 50,000 + 100(2100) + 10,000(4) - 2,000(8)
y = 50,000 + 210,000 + 40,000 - 16,000
y = 284,000
So, the predicted price for this new house is $284,000.
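The arithmetic above can be checked with a short Python sketch. The weights are the example values 50,000, 100, 10,000, and -2,000 from the formula; the function name is just an illustrative choice:

```python
# Example weights from the formula above
w0, w1, w2, w3 = 50_000, 100, 10_000, -2_000

def predict_price(size_sqft, rooms, age):
    """Apply y = w0 + w1*x1 + w2*x2 + w3*x3."""
    return w0 + w1 * size_sqft + w2 * rooms + w3 * age

print(predict_price(2100, 4, 8))  # 284000
```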
Learning the Weights (w₀, w₁, w₂, w₃)
Now, you
might be wondering: How are the weights (w₀, w₁, w₂, w₃) calculated? Well,
that’s the main thing that makes linear regression so powerful and
effective. The goal of linear regression is to find the best weights for the
features (x₁, x₂, etc.) so that the predictions are as accurate as possible.
The model
adjusts these weights to minimize the error between the predicted and actual
prices. This process is known as training the model, and it’s what
allows the model to improve over time.
The
technique used to find these best weights is called Gradient Descent,
and it works by adjusting the weights step by step to reduce the error. But
don’t worry—we'll dive deeper into how this works in the next section!
Calculating the Weights (w₀, w₁, w₂, …)
- Start with the formula: The general linear regression
equation is:
y = w₀ + w₁x₁ + w₂x₂ + w₃x₃ + ... + wₖxₖ
Here:
- y is the predicted output (e.g.,
house price).
- x₁, x₂, ... xₖ are the features (e.g., size,
number of rooms, age of the house).
- w₀, w₁, w₂, ... wₖ are the weights (or
coefficients) that the model will learn.
- Randomly initialize the weights: Before starting, we assign
random values to the weights. For example:
- w₀ = 0.5
- w₁ = 0.1
- w₂ = -0.2 (These values are
just random guesses to start with.)
- Calculate the predicted output: Using the randomly initialized
weights, calculate the predicted output (y) for each data point. For
example, for the first data point:
ŷ = w₀ + w₁x₁ + w₂x₂ + w₃x₃
where x₁, x₂, x₃ are the features (like house size, number of rooms, and age).
- Calculate the error: The error is the difference
between the predicted value and the actual value (the true house price).
For each data point, the error is:
Error = ŷ - y (actual)
This tells
you how far off the model’s prediction is from the actual value.
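As a sketch in Python, the steps above might look like this, using the houses from the table earlier and the arbitrary starting weights from step 2 (w₃ = 0.3 is an extra assumed guess, since the example list stops at w₂):

```python
# Training data from the table above: (size, rooms, age) -> price
X = [(2000, 3, 10), (1500, 2, 20), (1800, 4, 15), (2200, 4, 5), (1600, 3, 15)]
y_actual = [300_000, 250_000, 275_000, 350_000, 280_000]

# Randomly chosen starting weights (w3 = 0.3 is an assumed extra guess)
w0, w1, w2, w3 = 0.5, 0.1, -0.2, 0.3

for (x1, x2, x3), y in zip(X, y_actual):
    y_hat = w0 + w1 * x1 + w2 * x2 + w3 * x3  # predicted output
    error = y_hat - y                          # how far off the prediction is
    print(f"predicted={y_hat:.1f}, actual={y}, error={error:.1f}")
```

With random starting weights the errors are enormous, which is exactly why the weights need to be trained.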
Explaining the Cost Function in Linear Regression
After
calculating the predicted values (denoted as ŷ) for each data point, we
want to know how far off these predictions are from the actual values (denoted
as y). This difference, or error, gives us an indication of how well our
model is performing. To measure this error across all data points, we use a Cost
Function.
The cost
function for linear regression is often referred to as the Mean Squared
Error (MSE), but we’ll start with a simple version called the Sum of
Squared Errors (SSE), which is later averaged.
Cost Function Formula:
J(w₀, w₁, ..., wₖ) = (1/2m) × Σ (ŷᵢ - yᵢ)²
Where:
- J(w₀, w₁, ..., wₖ) is the cost (or error)
function.
- ŷᵢ is the predicted value for the i-th
data point.
- yᵢ is the actual value for the i-th
data point.
- m is the total number of data
points.
- w₀, w₁, ..., wₖ are the weights (parameters) we
are trying to optimize.
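A minimal Python sketch of this cost function, assuming the averaged form with the 1/2 factor and using the house data from the table earlier:

```python
def cost(weights, X, y_actual):
    """J(w) = (1/2m) * sum((y_hat_i - y_i)^2) over all m data points."""
    w0, w1, w2, w3 = weights
    m = len(X)
    total = 0.0
    for (x1, x2, x3), y in zip(X, y_actual):
        y_hat = w0 + w1 * x1 + w2 * x2 + w3 * x3  # model prediction
        total += (y_hat - y) ** 2                  # squared error
    return total / (2 * m)

X = [(2000, 3, 10), (1500, 2, 20), (1800, 4, 15), (2200, 4, 5), (1600, 3, 15)]
y_actual = [300_000, 250_000, 275_000, 350_000, 280_000]
print(cost((0.5, 0.1, -0.2, 0.3), X, y_actual))  # huge, since the weights are random guesses
```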
Why is the error squared?
The reason
we square the errors is simple but important:
- Ensure Positive Errors:
If we simply summed the errors, negative and positive errors could cancel each other out. This would make the total error seem smaller than it actually is, making it harder to gauge how well the model is performing.
For example,
if you have two data points:
Error 1: -2 (prediction is too high)
Error 2: +2 (prediction is too low)
The sum of errors would be -2 + 2 = 0, which is misleading. The total
error should reflect the magnitude of the error, so squaring makes all errors
positive, and larger errors contribute more.
- Prevent Negative Values:
Squaring makes sure that both positive and negative errors contribute positively to the total error. This way, we avoid cases where a positive and negative error might cancel each other out.
- Simplify Optimization:
Squaring the errors makes the cost function smooth and differentiable, which makes it easier to perform optimization mathematically; we can apply techniques like Gradient Descent to minimize it efficiently.
Why Multiply by (1/2)?
You might be wondering why we multiply the
cost function by 1/2. The answer is that it simplifies the math later
when we compute the derivative (slope) of the cost function with respect to the
weights. This will come in handy during Gradient Descent optimization,
which we’ll discuss in the next section. The 1/2 cancels out the factor
of 2 that appears when we take the derivative.
In summary,
squaring the errors allows us to:
- Make all errors positive.
- Prevent cancellation of errors.
- Make the error function
mathematically easier to minimize.
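The cancellation problem from the two-point example above is easy to verify:

```python
errors = [-2, 2]                      # one over-prediction, one under-prediction
print(sum(errors))                    # 0 -> misleading: looks like no error at all
print(sum(e ** 2 for e in errors))    # 8 -> squared errors reveal the true magnitude
```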
Gradient Descent for Optimizing Weights
After calculating the cost function J(w₀, w₁, ..., wₖ), the goal is to minimize this error function to find the best weights. To do this, we use Gradient Descent.
Gradient Descent is an optimization algorithm used to minimize the cost function by updating the weights in the opposite direction of the gradient (slope) of the cost function. The size of the update is determined by the learning rate, α.
Gradient Descent Formula:
The weight update rule is as follows:
wᵢ := wᵢ - α * (∂J(w) / ∂wᵢ)
Where:
- wᵢ is the weight being updated.
- α is the learning rate.
- ∂J(w)/∂wᵢ is the derivative (slope) of the cost function with respect to the weight wᵢ.
Visualizing Gradient Descent
- Case 1: w on the right side of the minimum (positive slope):
- If (∂J(w) / ∂wᵢ) > 0, the gradient is positive, meaning the current weight is too high, and we need to decrease it to move towards the minimum. Therefore, the weight is updated as:
wᵢ := wᵢ - α * (∂J(w) / ∂wᵢ)
- On the right side, the slope is positive, so the weight wᵢ will decrease, as the negative sign in the formula makes the weight move towards the minimum.
- Case 2: w on the left side of the minimum (negative slope):
- If (∂J(w) / ∂wᵢ) < 0, the gradient is negative, meaning the current weight is too low, and we need to increase it to move towards the minimum. Therefore, the weight is updated as:
wᵢ := wᵢ - α * (∂J(w) / ∂wᵢ)
- On the left side, the slope is negative, and since the formula subtracts this negative gradient, the weight will increase, which is what we want in order to move towards the minimum.
Final Step:
As we
continue updating the weights, the slope of the cost function gets smaller and
approaches zero. When the gradient is zero, we have reached the minimum of the
cost function, and further updates are no longer needed. This is the point
where the weight updates stop, and we have found the optimal weights.
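Putting the pieces together, here is a minimal training-loop sketch on the house data from the table earlier. Differentiating the cost J gives ∂J/∂wⱼ = (1/m) Σ (ŷᵢ - yᵢ)·xⱼ (with x = 1 for w₀). The feature scaling step is an added practical assumption, not part of the formulas above; it lets a single learning rate work for all weights:

```python
# Training data from the table above: (size, rooms, age) -> price
X = [(2000, 3, 10), (1500, 2, 20), (1800, 4, 15), (2200, 4, 5), (1600, 3, 15)]
y = [300_000, 250_000, 275_000, 350_000, 280_000]
m = len(X)

# Scale each feature to roughly [0, 1] so one learning rate suits all weights
# (an assumption for numerical stability, not required by the formulas above)
maxes = [max(row[j] for row in X) for j in range(3)]
Xs = [[row[j] / maxes[j] for j in range(3)] for row in X]

w = [0.0, 0.0, 0.0, 0.0]  # w0 (intercept), w1, w2, w3
alpha = 0.1               # learning rate

for _ in range(5000):
    grads = [0.0] * 4
    for xs, yi in zip(Xs, y):
        y_hat = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xs))
        err = y_hat - yi              # prediction error for this house
        grads[0] += err               # contribution to dJ/dw0
        for j in range(3):
            grads[j + 1] += err * xs[j]  # contribution to dJ/dwj
    # Move each weight opposite to its averaged gradient
    w = [wj - alpha * g / m for wj, g in zip(w, grads)]

final_cost = sum(
    (w[0] + sum(wj * xj for wj, xj in zip(w[1:], xs)) - yi) ** 2
    for xs, yi in zip(Xs, y)
) / (2 * m)
print(f"learned weights: {w}")
print(f"final cost: {final_cost:,.0f}")
```

The cost falls steadily from its starting value, which is the behaviour described above: the slope shrinks as the weights approach the minimum.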
Note on Learning Rate (α):
The learning
rate (α) is a crucial parameter in gradient descent. It controls how much we
adjust the weights in each iteration based on the gradient. A carefully chosen
learning rate helps the model converge to the optimal solution. However, if the
learning rate is not chosen wisely, it can cause issues:
- If α is too small (slow learning):
- The gradient updates will be very small, and the model might take a very long time to converge. (For linear regression the cost function is convex, so there are no local minima to get stuck in; the real risk is simply very slow progress.)
- This is often seen when the
updates are so small that the weights do not change significantly between
iterations, leading to slow progress.
- If α is too large
(overshooting):
- A large learning rate can cause
the updates to overshoot the optimal values, jumping back and forth over
the minimum. Instead of converging to the global minimum, it can diverge
or get trapped in a suboptimal solution, oscillating around a point.
- In extreme cases, the learning
rate can even cause the cost to increase rather than decrease, as seen
when the model moves past the minima.
To avoid these issues, you need to tune the learning rate appropriately. A good practice is to start with a moderate learning rate and adjust it based on the behavior of the cost function. If the cost decreases too slowly, consider increasing the learning rate; if the cost oscillates or increases, reduce the learning rate to avoid overshooting.
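A toy one-dimensional example makes these effects visible. Below, gradient descent runs on the simple cost J(w) = w² (whose gradient is 2w); this cost is an illustrative assumption, not the house-price cost from earlier:

```python
def gradient_descent_1d(alpha, w=5.0, steps=20):
    """Minimize J(w) = w^2 using w := w - alpha * dJ/dw, where dJ/dw = 2w."""
    for _ in range(steps):
        w = w - alpha * 2 * w
    return w

print(gradient_descent_1d(0.01))  # too small: w barely moves toward the minimum at 0
print(gradient_descent_1d(0.4))   # well chosen: w converges very close to 0
print(gradient_descent_1d(1.1))   # too large: w overshoots back and forth and diverges
```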
Underfitting vs Overfitting
In machine
learning, underfitting and overfitting refer to how well your
model generalizes to new data, and they are key concepts when evaluating model
performance.
1. Underfitting
- Definition: Underfitting happens when the
model is too simple to capture the underlying patterns of the data. It
doesn't learn enough from the training data, resulting in poor performance
on both the training set and test set.
- Cause: This can happen when:
- The model is too simple (e.g.,
using a linear model when the data is non-linear).
- The model has too few features
or is not complex enough to capture the relationships in the data.
- The model is not trained enough
(insufficient epochs in training).
- Visual Example: Imagine trying to fit a
straight line to a set of data points that follow a curved pattern. The
straight line will fail to capture the curve, leading to high bias.
- Impact:
- High bias (the model’s
assumptions are too strong and incorrect).
- Low variance (the
model’s predictions don’t change much with different training data).
2. Overfitting
- Definition: Overfitting happens when the
model learns not only the underlying patterns but also the noise and
random fluctuations in the training data. This leads to excellent
performance on the training data but poor generalization to new data
(i.e., test data).
- Cause: This can happen when:
- The model is too complex (e.g.,
using a very deep neural network for a simple problem).
- The model is trained for too
many iterations, capturing too much noise.
- There are too many features (or
irrelevant features) in the model.
- Visual Example: Imagine trying to fit a very
wiggly curve to data that is actually linear. The model will adapt to
every small fluctuation in the data, resulting in an overly complex curve
that doesn't generalize well to new data.
- Impact:
- Low bias (the model fits
the training data well).
- High variance (the model
is highly sensitive to small changes in the training data).
Bias and Variance Explained
Bias and Variance are two sources
of errors in machine learning models. Understanding the trade-off between them
is crucial for building good models.
1. Bias
- Definition: Bias refers to the error
introduced by approximating a real-world problem (which may be complex) by
a simplified model.
- In Simple Terms: Bias is the model’s tendency
to consistently make certain types of mistakes because it oversimplifies
the problem.
- High Bias: If a model has high bias, it
means the model is too simple and makes strong assumptions. This leads to
underfitting and poor performance.
- Low Bias: If a model has low bias, it
means the model is flexible enough to learn the underlying patterns in
the data, leading to a better fit.
2. Variance
- Definition: Variance refers to the model’s
sensitivity to small changes in the training data. If a model has high
variance, it means it can adapt too much to the training data, capturing
noise along with the signal.
- In Simple Terms: Variance is how much the
model's predictions would change if we used a different training dataset.
- High Variance: High variance means the model
is overfitting and will perform poorly on new, unseen data because it’s
too sensitive to the training data.
- Low Variance: Low variance means the
model's predictions are stable and consistent across different training
datasets.
The Bias-Variance Trade-Off
- Ideal Model: The goal is to find a model
with low bias (accurately captures the data patterns) and low
variance (performs well on new data).
- Too Simple Model: High bias, low variance
(underfitting).
- Too Complex Model: Low bias, high variance
(overfitting).
There’s a
natural trade-off between bias and variance:
- Increasing Model Complexity (e.g., adding more features or
using a more complex model) will decrease bias but increase variance.
- Decreasing Model Complexity (e.g., reducing features or
simplifying the model) will decrease variance but increase bias.
Visuals for Underfitting, Overfitting, Bias, and Variance
- Underfitting: A straight line fitting a
curved data pattern. High bias, low variance.
- Overfitting: A very complex curve that fits
the data too closely. Low bias, high variance.
Final Summary:
- Underfitting: Model is too simple (high
bias).
- Overfitting: Model is too complex (high
variance).
- Bias: Error due to overly simplistic
assumptions.
- Variance: Error due to the model being
too sensitive to fluctuations in the training data.