Linear regression is an elementary statistical approach used to model the linear relationship between two variables. It forecasts outcomes by fitting a straight line through observed data points. In this article, we will discuss the fundamentals of linear regression, its assumptions, and how it operates. You will also get to know about linear regression in machine learning.
What is Linear Regression?
Linear regression is a form of supervised machine learning algorithm that learns from labelled datasets and maps the data points to the most optimised linear function, which can then be used for prediction on new datasets. A linear relationship is assumed between the input and output: the output changes at a constant rate as the input changes, and a straight line represents this association.
For instance, suppose we want to predict a student’s exam score based on the number of hours they have studied. We may observe that scores increase with the number of hours the student has spent studying. The aim, therefore, is to predict the exam score from the hours invested. In this example, the independent variable is the hours invested by the student, and the dependent variable is the exam score, so the independent variable can be used to predict the dependent variable.
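To make this concrete, here is a minimal sketch in Python. The hours and scores are made-up values, and scikit-learn’s LinearRegression is just one common way to fit such a model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: hours studied (independent) vs. exam score (dependent)
hours = np.array([[1], [2], [3], [4], [5]])
scores = np.array([52, 58, 65, 71, 77])

model = LinearRegression()
model.fit(hours, scores)

# Predict the exam score for a student who studies 6 hours
print(model.predict(np.array([[6]])))
```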
Why is Linear Regression Important?
There are several reasons that make Linear Regression important:
Simplicity and interpretability– It is easy to understand and interpret, which makes it a good starting point for learning about machine learning.
Predictive ability– It can predict future outcomes from historical data, which makes it useful in fields like healthcare, finance, and marketing.
Basis for other models– Several advanced algorithms, such as logistic regression or neural networks, build on the concepts of linear regression.
Efficiency– It is computationally efficient and performs well on problems with a linear relationship.
Widely used– It is one of the most commonly used techniques in both statistics and machine learning for regression tasks.
Analysis– It offers insight into the relationships between variables.
Best Fit Line in Linear Regression
The best-fit line in linear regression is the straight line that displays the association between the independent and dependent variables. The line minimises the difference between the real data points and the values the model predicts.
Goal of the Best-fit Line
The goal of linear regression is to find the straight line that minimises the error between the predicted values and the observed data. This line helps in predicting the dependent variable for new, unseen data.
Generally, y is the dependent variable while x is the independent variable. Several kinds of functions can be used for regression analysis; a linear function is the simplest.
Equation
For simple linear regression, the best-fit line is given by the equation below:
y = mx + b
Here, y is the dependent variable
x is the independent variable
m is the slope of the line
b is the intercept
Fitting the best-fit line therefore means choosing the values of m and b so that the predicted y values are as close as possible to the real data points.
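Plugging hypothetical values into this equation shows how a single prediction is produced (the m and b below are made up for illustration):

```python
# Hypothetical slope and intercept
m, b = 6.3, 45.7

# Predicted exam score for 4 hours of study: y = m*x + b
x = 4
print(m * x + b)  # 70.9
```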
Reducing Error
The least squares method can be used to find the best-fit line. The idea behind this method is to minimise the sum of the squared differences between the actual values and the values estimated by the line. These differences are referred to as residuals.
The formula for the residuals is as follows:
Residual = yᵢ − ŷᵢ
Here, yᵢ is the real observed value
ŷᵢ is the predicted value from the line
The least squares method minimises the sum of the squared residuals:
Sum of squared errors (SSE) = Σ (yᵢ − ŷᵢ)²
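As a minimal sketch of the least squares method, the following Python code computes m and b from the standard closed-form formulas and then the SSE; the data are the same made-up hours/scores values used earlier:

```python
import numpy as np

# Hypothetical observed data: hours studied and exam scores
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([52, 58, 65, 71, 77], dtype=float)

# Closed-form least squares estimates for simple linear regression:
# m = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²,  b = ȳ − m·x̄
x_mean, y_mean = x.mean(), y.mean()
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - m * x_mean

# Residuals and the sum of squared errors that the fit minimises
y_hat = m * x + b
sse = np.sum((y - y_hat) ** 2)
print(f"m = {m:.2f}, b = {b:.2f}, SSE = {sse:.2f}")
```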
Interpretation of the Best-fit Line
Slope (m): The slope of the best-fit line shows how much the dependent variable (y) changes with every unit change in the independent variable (x). For instance, if the slope is 5, then for every 1-unit increase in x, the value of y increases by 5.
Intercept (b): This is the predicted value of y when x = 0. It is the point where the line crosses the y-axis.
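For example, with the hypothetical fit from the least squares sketch above (m = 6.3, b = 45.7), each additional hour of study raises the predicted exam score by about 6.3 points, and a student who studies zero hours is predicted to score about 45.7.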
To make predictions, linear regression defines a hypothesis function that maps the inputs to the predicted outcome.
The Hypothesis Function
The hypothesis in linear regression is the equation used to make predictions about the dependent variable on the basis of the independent variables. It captures the association between the dependent variable and the independent variables. For simple linear regression, the hypothesis is:
h(x)=β₀+β₁x
Multiple linear regression extends this formula:
h(x₁,x₂,…,xₖ)=β₀+β₁x₁+β₂x₂+…+βₖxₖ
Here,
x₁, x₂, …, xₖ are the independent variables
β₀ is the intercept
β₁, β₂, …, βₖ are the coefficients, each determining the impact of a single independent variable on the predicted outcome.
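As a sketch, evaluating this hypothesis is simply a dot product plus the intercept; the coefficient values below are hypothetical:

```python
import numpy as np

# Hypothetical fitted parameters for k = 3 independent variables
beta_0 = 10.0                       # intercept β₀
betas = np.array([2.0, -1.5, 0.5])  # coefficients β₁, β₂, β₃

def hypothesis(x):
    """Evaluate h(x) = β₀ + β₁x₁ + β₂x₂ + ... + βₖxₖ."""
    return beta_0 + np.dot(betas, x)

print(hypothesis(np.array([1.0, 2.0, 3.0])))  # 10 + 2 - 3 + 1.5 = 10.5
```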
Assumptions of Linear Regression
Linearity– The association between x and y is a straight line.
Independence of Errors– The prediction errors should not influence each other.
Constant Variance– Errors should have equal variance across input values. A changing spread, known as heteroscedasticity, is unfavourable for the model.
Normality of Errors– Errors should follow a normal distribution (a quick residual check is sketched after this list).
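An informal way to eyeball the last two assumptions is to inspect the residuals of a fitted model. This sketch reuses the hypothetical data and fit from the least squares example above; the Shapiro-Wilk test is just one common normality check, and it is most meaningful on larger samples:

```python
import numpy as np
from scipy import stats

# Hypothetical data and fitted parameters from the earlier sketch
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([52, 58, 65, 71, 77], dtype=float)
m, b = 6.3, 45.7

# Constant variance: the residual spread should look similar across x
residuals = y - (m * x + b)
print("residuals:", np.round(residuals, 2))

# Normality of errors: Shapiro-Wilk test on the residuals
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p_value:.3f}")
```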
Limitations of Linear Regression
- Linear regression assumes a linear association between the dependent and independent variables. The model may not work well if the association is not linear.
- Linear regression is sensitive to multicollinearity, which exists when independent variables are highly correlated with each other. Multicollinearity can inflate the variance of the coefficient estimates and result in unstable model predictions (a quick correlation check is sketched after this list).
- Linear regression is vulnerable to both overfitting and underfitting. The former occurs when the model memorises the training data and struggles to generalise to unseen data, whereas the latter occurs when the model is too simple to capture the underlying associations in the dataset.
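One simple, informal check for the multicollinearity issue mentioned above is to look at pairwise correlations between the independent variables; the feature matrix below is generated synthetically for illustration:

```python
import numpy as np

# Synthetic features: x2 is nearly a copy of x1, so the two are collinear
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=100)
x3 = rng.normal(size=100)
X = np.column_stack([x1, x2, x3])

# Pairwise correlations; off-diagonal entries near ±1 flag multicollinearity
print(np.round(np.corrcoef(X, rowvar=False), 2))
```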
Summary
Linear regression is an effective method for understanding linear relationships. It is used in fields like economics and finance to predict the behaviour or impact of a variable. Hence, linear regression is widely used in machine learning to predict the impact of one variable on another.