
Simple Linear Regression

Introduction

Definition of Simple Linear Regression

Linear regression is a statistical technique used to model the relationship between two or more variables and to make predictions from it. These variables can be divided into two types: dependent variables (Y) and independent variables (X). Simple linear regression refers to a model with only one independent variable, while multiple linear regression involves more than one independent variable.

The concept of regression was first introduced by Sir Francis Galton (1822-1911), who studied the relationship between the heights of fathers and their sons. Galton discovered that the sons' heights tended to "regress" or return to the average value of the population after several generations. In other words, sons of extremely tall fathers tend to be shorter than their fathers, while sons of very short fathers tend to be taller than their fathers. Therefore, this concept is known as "regression". Today, the concept of regression is used in various types of forecasting.

For example, we might ask, "Can I predict rice yield if I use x kg of urea?" or "Can systolic blood pressure be predicted from a person's age?" Linear regression can answer such questions. This method uses a linear equation to describe the relationship between the independent (X) and dependent (Y) variables. In this context, rice yield and systolic blood pressure are dependent variables, while urea dose and age are independent variables.

The general form of this equation is:

$$\hat Y = {\beta _0} + \;{\beta _1}X$$

Where $\hat Y$ is the predicted value of the dependent variable, X is the observation value of the independent variable, β0 is the intercept (the point where the curve intersects the Y-axis), and β1 is the slope of the linear curve. The values of β0 and β1 are parameters of the regression line that are estimated based on observational data.

Observe the following image:

By understanding these parameters, we can estimate the value of Y based on the value of X. The intercept β0 refers to the value of Y when X = 0, while the slope β1 reflects how much Y changes (Δy) as one unit of X changes (Δx). The values of β0 and β1 are constants along the linear curve.

The data for the independent variable X in linear regression can be either observational data (which is not predetermined by the researcher) or experimental data (which is predetermined by the researcher). Experimental data provides stronger information about the cause-effect relationship between X and Y compared to observational data. In experimental data, the researcher has set the value of X to be studied. Meanwhile, in observational data, the observed values of X can vary depending on field conditions. Generally, experimental data is obtained from laboratory experiments, while observational data is obtained through questionnaires.

However, it is important to note that the equation Y = β0 + β1X is a deterministic model. This means that if we know the value of X, the value of Y can be determined precisely, without considering error factors. Conversely, in practice, the linear regression model is a stochastic model, where the regression line does not always exactly cover all observational data. This means that although the regression model tries to find the pattern of the relationship between X and Y through a certain equation, there are still variations and errors that need to be considered. The regression model is a simplification of a more complex system, and this diversity and error reflect that complexity.

Consider the following images. In the deterministic model (first image), all observational data points lie precisely on the straight line, whereas in the stochastic regression model (second image), not all observational data points are precisely covered by the straight line.

Difference between Correlation and Linear Regression

Correlation analysis and linear regression are two interrelated statistical techniques, both used to analyze the relationship between two or more variables. Correlation measures the strength and direction of the relationship between variables, indicating the extent to which they change together. Meanwhile, linear regression focuses on the cause-effect relationship between variables, providing further insight into how variables interact and influence each other. In other words, correlation measures how strong the relationship between variables is, while linear regression is used to predict the value of the dependent variable based on the independent variable. Through the regression equation, we can forecast the value of the dependent variable (Y) based on the value of the independent variable (X), allowing us to predict one variable based on the value of another known variable.

For illustration, we might want to know whether crop yield is related to the nitrogen content in the soil, or whether systolic blood pressure is related to age. In such cases, we can first plot the relationship between these two variables, then measure their correlation. After that, we can use regression analysis to find the best straight line that goes through the data points so that the deviations of these points are minimized.

Objectives of Regression Analysis

Regression analysis is a method used to understand the relationship between dependent and independent variables in a dataset. The objectives of regression analysis encompass the following aspects:

  1. Describing the Relationship Between Variables. The first objective of regression analysis is to describe the relationship between two or more variables. By observing this relationship, we can understand how these variables interact with each other. For instance, if the dependent variable is an individual's income, and the independent variable is the level of education, regression analysis can show whether an increase in the level of education is associated with an increase in income.
  2. Creating a Model Response to Dependent Variable. After describing the relationship between variables, the next objective is to create a model capable of responding to changes in the dependent variable based on changes in the independent variable. This model can be used to predict the dependent variable if we know the value of the independent variable.
  3. Prediction or Forecasting. Using the model created, we can use regression analysis to predict the value of the dependent variable based on the value of the independent variable. This is very useful in various fields, such as economics, science, and technology.
  4. Statistically Controlling the Effect of Variables While Investigating the Relationship Between Independent and Dependent Variables. Regression analysis can also be used to control the effect of other variables when we are investigating the relationship between the independent and dependent variables. By doing this, we can be more accurate in determining the actual relationship between these two variables.
  5. Determining How Much Variation (Uncertainty) in Y Can Be Explained by a Linear Relationship with X and How Much Variation Remains Unexplained. Another significant objective of regression analysis is to know how much percent of the variation in the dependent variable can be explained by the independent variable. This is crucial to understand how good our model is and how much information we have about the dependent variable based on the independent variable.
  6. Testing and Validating the Regression Model. One of the primary objectives of regression analysis is to test and validate the regression model created. This involves checking the assumptions used in the analysis, conducting hypothesis tests for the regression parameters, and examining the quality of the regression model, for instance, by calculating the R-squared value or performing residual tests.
  7. Understanding the Relative Influence of Independent Variables on the Dependent Variable. Regression analysis also aims to understand the relative influence of independent variables on the dependent variable. For example, if we have multiple independent variables, we may want to know which one has the greatest impact on the dependent variable. This can aid in decision-making, such as in determining resource allocation or in designing interventions.
  8. Aiding in Decision-Making. By understanding the relationship between the dependent and independent variables, regression analysis can aid in decision-making across various fields. For example, in business, regression analysis can be used to understand what factors influence sales and how this can be utilized to increase future sales.
  9. Performing Interpolation and Extrapolation of Data. Regression analysis can also be used to perform interpolation and extrapolation of data. Interpolation involves using a regression model to predict the value of the dependent variable for independent variable values within the range of existing data, while extrapolation involves using the model to predict the value of the dependent variable for independent variable values outside the range of existing data.

Linear Regression Model

Here is a simple case example that we will use in this discussion. This example will focus on the relationship between age and systolic blood pressure. We are interested in knowing whether there is a relationship between age and systolic pressure, and if so, how strong that relationship is.

Case Example:

| Individual | Age (years) X | Systolic Blood Pressure (mm Hg) Y |
|---|---|---|
| A | 34 | 108 |
| B | 43 | 129 |
| C | 49 | 126 |
| D | 58 | 149 |
| E | 64 | 168 |
| F | 73 | 161 |
| G | 78 | 174 |

With this data, we can apply various concepts and methods of regression analysis:

Basic Understanding: We can explain that our aim is to determine whether age can be used to predict systolic pressure, and that we will use linear regression to do this.

Linear Regression Model: We can show how the linear regression model is formulated, with age as the independent variable X and systolic blood pressure as the dependent variable Y.

Parameter Estimation: We can explain how regression coefficients are calculated using the least squares method, and then calculate these coefficients using our data.

Fitting the Regression Line

In linear regression analysis, one very important aspect is fitting the regression line to the observed data set. The goal is to get the line that best represents the pattern of the relationship between the independent variable (X) and the dependent variable (Y). This process is often known as 'fitting the regression line'.

In simple linear regression, the regression line is defined by the mathematical equation:

$$\hat{Y}=b_0+\ b_1X$$

The values of b0 and b1 are what we need to find so that the regression line fits our data best.

In the case example above, the equation $\hat{Y}=b_0+\ b_1X$ can be interpreted as follows:

  • $b_0$ is the average systolic blood pressure when age is 0. However, this interpretation may not make sense in this context because it is impossible for someone to have an age of 0.
  • $b_1$ is the change in systolic blood pressure for every one-unit change in age. For example, if $b_1$ is 0.5, then for every one-year increase in age, we can expect systolic blood pressure to increase by 0.5 mm Hg.

With this regression model, we can predict systolic blood pressure based on age. However, it should be remembered that the regression model only describes the average relationship between variables. There is also random variation that cannot be explained by the model.

Parameter estimation in the regression model

Each regression model contains two main components, namely the predictive relationship between the independent and dependent variables, and the random or error component that cannot be explained by the model. Therefore, the regression model can be written in the form:

$$Data = Model + error$$

$$y = {b_0} + {b_1}x + \varepsilon $$

Data is the dependent variable (Y) that we want to predict, Model is the regression line equation (b0 + b1X), and Error is the deviation of the observation data values from the values estimated by the regression line.

A simple linear regression model includes two main coefficients, namely: slope and intercept.

  1. Slope: In mathematics, slope describes the tilt of a line. In statistics, slope shows the magnitude of the contribution of variable X to variable Y. The slope value can be interpreted as the average addition (or reduction) that occurs in variable Y for each one unit increase in variable X. There is a close mathematical relationship between the slope in linear regression and the correlation coefficient.
  2. Intercept: The intercept is the intersection point between the regression line and the Y-axis when the value of X is 0. Statistically, this means the average value on variable Y when variable X is valued 0. In other words, if X does not contribute, then on average, variable Y will be valued as much as the intercept. Although the intercept is a constant in the regression model, its practical meaning might be less important, especially if there are no values of X in the data that are 0 or close to 0. In this case, the intercept might not need to be interpreted.

Prediction values and residuals

Regression analysis is a technique used to build equations and use these equations to make estimates or predictions. Because it is a prediction, the predicted value does not always match its actual value. The smaller the difference between the predicted value and the actual value, the more accurate the regression equation. The difference between the actual value (y) and the predicted value ($\hat y$) is referred to as a residual or error.

The regression equation can be written as:

$$Data = Model + error$$

$${y_i} = {b_0} + {b_1}{x_i} + {\varepsilon _i}$$

$${y_i} = ({b_0} + {b_1}{x_i}) + {\varepsilon _i}$$

$${y_i} = {\hat y_i} + {\varepsilon _i}$$

$${\varepsilon _i} = {y_i} - {\hat y_i}$$

Where ${\varepsilon _i}$ is the residual, ${\hat y_i}$ is the predicted value, and ${y_i}$ is the actual (observed) value.

For example, the regression equation for the above case is as follows: y = 61.244 + 1.4694x. Using this equation, we can calculate the predicted value and its residual as in the table below.

| Individual | X | Y | Prediction ($\hat Y$) | Residual (e) |
|---|---|---|---|---|
| A | 34 | 108 | 111.20 | -3.20 |
| B | 43 | 129 | 124.43 | 4.57 |
| C | 49 | 126 | 133.24 | -7.24 |
| D | 58 | 149 | 146.47 | 2.53 |
| E | 64 | 168 | 155.29 | 12.71 |
| F | 73 | 161 | 168.51 | -7.51 |
| G | 78 | 174 | 175.86 | -1.86 |
| Total | | | | 0.00 |

This residual value is used to determine the best regression line that most accurately reflects the observation data and is the best regression equation among the various other possible straight lines that can be created.

The sum of error values (residuals) is zero.

$$\sum {y_i} - {\hat y_i} = 0$$

The best-fitting regression line is the one for which the sum of squared differences between the predicted values and the actual values of the dependent variable in the sample data is smallest, that is, the sum of squared residuals is a minimum.

$$\sum {({y_i} - {\hat y_i})^2} = \sum {e^2} = minimum$$

The estimation of the regression line in this way is called the ordinary least squares (OLS) or least squares method, a method of estimating parameters by minimizing the sum of squared errors (SSE).

$$SSE = \sum {e_i}^2 = minimum$$

To solve this equation, we can use the following procedure:

We will minimize the SSE value:

$$SSE = \sum {e_i}^2 = \sum {({y_i} - {\hat y_i})^2} = \sum {({y_i} - {b_0} - {b_1}{x_i})^2}$$

Next, we differentiate this equation to get the estimated values of b0 and b1:

$$\frac{\partial SSE}{\partial b_0} = 0 \quad \text{and} \quad \frac{\partial SSE}{\partial b_1} = 0$$

From which the following equations are obtained:

$${b_1} = \frac{{S{S_{xy}}}}{{S{S_{xx}}}}{\rm{ and }}{b_0} = \bar y - {b_1}\bar x$$

Where:

$S{S_{xx}} = \;\mathop \sum \limits_{i = 1}^n {\left( {{X_i} - \bar X} \right)^2}$

$S{S_{xy}} = \;\mathop \sum \limits_{i = 1}^n \left( {{X_i} - \bar X} \right)\left( {{Y_i} - \bar Y} \right)$

Therefore, we have now obtained the equation for estimating the values of b0 and b1.

$${b_1} = \;\frac{{\mathop \sum \nolimits_{i = 1}^n \left( {{x_i} - \bar x} \right)\left( {{y_i} - \bar y} \right)}}{{\mathop \sum \nolimits_{i = 1}^n {{\left( {{x_i} - \bar x} \right)}^2}}}$$

and

$${b_0} = \bar y - {b_1}\bar x$$

For convenience in calculation, another form of equation for the coefficient b1:

$${b_1} = \;\frac{{n\sum {x_i}{y_i} - \sum {x_i}\sum {y_i}}}{{n\sum x_i^2 - {{\left( {\sum {x_i}} \right)}^2}}}$$

Or

$${b_1} = \frac{\sum x_i y_i - \frac{\sum x_i \sum y_i}{n}}{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}$$

Applied example:

We will use these formulas to calculate the regression coefficients. Here are the steps of the calculation:

Step 1: Create the following table

| Individual | X | Y | X² | Y² | XY |
|---|---|---|---|---|---|
| A | 34 | 108 | 1156 | 11664 | 3672 |
| B | 43 | 129 | 1849 | 16641 | 5547 |
| C | 49 | 126 | 2401 | 15876 | 6174 |
| D | 58 | 149 | 3364 | 22201 | 8642 |
| E | 64 | 168 | 4096 | 28224 | 10752 |
| F | 73 | 161 | 5329 | 25921 | 11753 |
| G | 78 | 174 | 6084 | 30276 | 13572 |
| Total (Σ) | 399 | 1015 | 24279 | 150803 | 60112 |
| Average | 57 | 145 | | | |

From the table, we obtain:

  • ΣX = 399 (total age)
  • ΣY = 1015 (total systolic blood pressure)
  • ΣXY = 60112 (total product of X and Y)
  • ΣX² = 24279 (total of squared X values)

Step 2: Calculate the value of b1 and b0

Once we have these values, we can insert them into the earlier formula:

Using the formula for b1:

$${b_1} = \;\frac{{n\sum {x_i}{y_i} - \sum {x_i}\sum {y_i}}}{{n\sum x_i^2 - {{\left( {\sum {x_i}} \right)}^2}}}$$

we get:

${b_1} = \;\frac{{7\left( {60112} \right) - \left( {399} \right)\left( {1015} \right)}}{{7\left( {24279} \right) - {{\left( {399} \right)}^2}}}$

${b_1} = \;1.4694\;\;\;$

After calculating, we will get the value of b1, which is the gradient of the regression line.

Then, we insert this b1 value into the formula to get b0:

${b_0} = \bar y - {b_1}\bar x$

we get:

${b_0} = 145 - 1.4694\left( {57} \right)$

${b_0} = 61.2441$

After calculating, we will get the value of b0, which is the intercept of the regression line with the Y-axis.

Step 3: Write the Regression Equation

After obtaining b0 and b1, we can write the regression equation:

$y = 61.244 + 1.4694x$

Here is the interpretation of this equation:

  • b0 = 61.244 is the average systolic blood pressure when age is 0 years. In this context, this interpretation is not practical, as the observed ages are far from 0. It is more of a mathematical formality than something that can be used in clinical practice.
  • b1= 1.4694 is the change in systolic blood pressure for every one-year increase in age. For example, for every one-year increase in age, we can expect the systolic blood pressure to increase by 1.4694 mm Hg.

In practice, this equation can be used to predict systolic blood pressure based on age. For example, if someone is 64 years old, the expected systolic blood pressure would be:

$y = 61.244 + 1.4694x$

$y = 61.244 + 1.4694\left( {64} \right)$

$y = 155.2858$

So, the expected systolic blood pressure for a person of 64 years of age is 155.29 mm Hg.

This model is an estimate and in reality, there are many other factors that can influence a person's blood pressure. This simple linear regression model only considers one predictor variable (age), but in practice, more complex analyses may be needed.

For Residual (e):

Residual is the difference between the actual value (Y) and the predicted value ($\hat{Y}$). These residuals will give us an idea of how well our model is able to predict the actual data.

For the fifth individual (E), the residual is e = Y - $\hat{Y}$ = 168 - 155.29 = 12.71.

We can also repeat this process for every individual in our dataset.

Residuals can be used to check model assumptions and understand how well the model can explain the variation in the data. For example, if residuals have a certain pattern or are not normally distributed, it could be a sign that the model is not fitting the data well.

It should be noted that these calculations are the basis of simple linear regression and advanced tests such as residual analysis and model validation can be used to further understand the model's performance and underlying assumptions.
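
For readers who prefer to verify these hand calculations programmatically, here is a minimal Python sketch (the variable names are ours and purely illustrative) that reproduces the least-squares estimates, the predicted values, and the residuals for the case data:

```python
# Least-squares fit for the worked example (age vs. systolic blood pressure).
ages = [34, 43, 49, 58, 64, 73, 78]              # X: age in years
systolic = [108, 129, 126, 149, 168, 161, 174]   # Y: systolic blood pressure (mm Hg)

n = len(ages)
sum_x = sum(ages)
sum_y = sum(systolic)
sum_xy = sum(x * y for x, y in zip(ages, systolic))
sum_x2 = sum(x * x for x in ages)

# b1 = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2), b0 = ybar - b1*xbar
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = sum_y / n - b1 * sum_x / n

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")   # roughly 61.2441 and 1.4694

# Predicted values and residuals for each individual
for x, y in zip(ages, systolic):
    y_hat = b0 + b1 * x
    print(f"X={x:3d}  Y={y:3d}  pred={y_hat:7.2f}  residual={y - y_hat:6.2f}")
```

Running this prints b0 ≈ 61.2441 and b1 ≈ 1.4694, matching the manual computation above.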

Regression Model Evaluation

Explained and Unexplained Sum of Squares

In regression analysis, we are often interested in determining the extent to which variations in the dependent variable can be explained by the independent variables in the model. To this end, we can compare the Sum of Squares Regression (SSR) with the Sum of Squares Error (SSE).

For instance, consider the diagram below. We assume that we have a set of paired data and the y prediction values are obtained using the regression equation. The sample mean of y is $\bar y$.

  • The total deviation (from the average) for a particular point (x, y) is the vertical distance $y - \bar y$, which is the distance between the point (x, y) and the horizontal line passing through the sample mean value.
  • The explained deviation is the vertical distance $\hat y - \bar y$, which is the distance between the predicted value and the line passing through the sample mean value.
  • The unexplained deviation is the vertical distance $y - \hat y$, which is the distance between the point (x, y) and the regression line. As discussed earlier, this distance is known as the residual (also called the error or discrepancy).

Thus, we can write:

$$Total\;Deviation = Explained\;Deviation + Unexplained\;Deviation$$

or in mathematical form:

$$\left( {{y_i} - \bar y} \right) = \left( {{{\hat y}_i} - \bar y} \right) + \left( {{y_i} - {{\hat y}_i}} \right)$$

This is a basic identity in regression. By squaring both sides of the equation and summing them for all data points, we get:

$$Total\;Variation = Explained\;Variation + Unexplained\;Variation$$

${\rm{\Sigma }}{\left( {{y_i} - \bar y} \right)^2} = {\rm{\Sigma }}{\left( {{{\hat y}_i} - \bar y} \right)^2} + {\rm{\Sigma }}{\left( {{y_i} - {{\hat y}_i}} \right)^2}$

or in other words:

$$SST = SSR + SSE\;$$

where:

  • SST = Sum of Squares Total
  • SSR = Sum of Squares Regression
  • SSE = Sum of Square Error

Simply put, the Sum of Squares Total (SST) consists of the Sum of Squares Regression (SSR) and the Sum of Squares Error (SSE). In this context, SSR represents the variation in the data that can be explained by the model, while SSE represents the variation that cannot be explained by the model.

We can illustrate this concept with the following example:

Applied Example

Suppose we have the following data and we have obtained the predicted values ${\rm{\hat Y}}$ for each data point using our regression model.

| Individual | X | Y | $\hat Y$ | Total Square $(y_i - \bar y)^2$ | Regression Square $(\hat y_i - \bar y)^2$ | Error Square $(y_i - \hat y_i)^2$ |
|---|---|---|---|---|---|---|
| A | 34 | 108 | 111.20 | 1369.00 | 1142.18 | 10.26 |
| B | 43 | 129 | 124.43 | 256.00 | 423.19 | 20.90 |
| C | 49 | 126 | 133.24 | 361.00 | 138.18 | 52.49 |
| D | 58 | 149 | 146.47 | 16.00 | 2.16 | 6.40 |
| E | 64 | 168 | 155.29 | 529.00 | 105.80 | 161.65 |
| F | 73 | 161 | 168.51 | 256.00 | 552.74 | 56.41 |
| G | 78 | 174 | 175.86 | 841.00 | 952.18 | 3.45 |
| Average | 57 | 145 | | | | |
| Σ | | | | 3628.00 | 3316.44 | 311.56 |

From the table above, we can derive the following values:

  • Total Sum of Squares (SST) = 3628.00
  • Sum of Squares Regression (SSR) = 3316.44
  • Sum of Squares Error (SSE) = 311.56

From this example, we can see that SST, SSR, and SSE each represent components of total variation in the data. By comparing these values, we can assess how much of the variation in the data is explained by our regression model.
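
The same decomposition can be checked in a few lines of Python. The sketch below (variable names are ours) recomputes SST, SSR, and SSE from the case data using the rounded coefficients, so the results match the table values up to rounding:

```python
ages = [34, 43, 49, 58, 64, 73, 78]
systolic = [108, 129, 126, 149, 168, 161, 174]

b0, b1 = 61.2441, 1.4694                 # coefficients estimated earlier
y_bar = sum(systolic) / len(systolic)
y_hat = [b0 + b1 * x for x in ages]

sst = sum((y - y_bar) ** 2 for y in systolic)                 # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)                  # explained variation
sse = sum((y - yh) ** 2 for y, yh in zip(systolic, y_hat))    # unexplained variation

print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"SSR + SSE = {ssr + sse:.2f}")    # matches SST up to rounding of b0 and b1
```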

Coefficient of Determination

The coefficient of determination (R²) is a metric used to measure how well a regression model explains the variation in data. The R² value reflects the proportion of the total variation in the dependent variable Y that can be explained by the independent variable X in the regression equation model. In other words, R² provides a measure of how well our model fits the data.

The coefficient of determination (R2) is calculated using the following equation:

$${R^2} = \frac{{Explained\;Variation}}{{Total\;Variation}} = \frac{{SSR}}{{SST}}$$

Or in mathematical terms:

$${R^2} = \frac{{{\rm{\Sigma }}{{\left( {{{\hat y}_i} - \bar y} \right)}^2}}}{{{\rm{\Sigma }}{{\left( {{y_i} - \bar y} \right)}^2}}}$$

In this formula, SSR is the sum of variation explained by the model (Sum of Squares Regression), while SST is the total variation in the data (Sum of Squares Total). The value $\hat y$ is the predicted value from our model, and $\bar y$ is the average value of Y in the data sample.

The R² value ranges between 0 and 1. A value close to 1 indicates that our model can explain most of the variation in the data, while a value close to 0 indicates that our model explains little of the variation in the data.

However, it's important to remember that R² is not the only evaluator of our regression model's quality. A high R² value doesn't always mean that we have a good model. Conversely, a low R² value doesn't always mean our model is bad. R² just gives us information about how much of the variation in data is explained by the model. To evaluate the quality of the regression model, we also need to consider other metrics and relevant domain knowledge.

Applied Example

Suppose we already have the SSR and SST values from the previous example. Using the formula above, we can calculate R² as follows:

$${R^2} = \frac{{SSR}}{{SST}} = \frac{{3316.44}}{{3628.00}} = 0.9141$$

This means that our regression model can explain approximately 91.41% of the variation in the data. However, it's important to remember that this is not the only metric to consider when evaluating a regression model.

Standard Error of the Estimate

The Standard Error of the Estimate, or se, is a statistical metric used to measure how much individual data points differ from the regression line determined by the model. In simpler terms, se provides a measure of the model's precision in predicting the value of the dependent variable based on the value of the independent variable. A smaller se value indicates that the model is more accurate in predicting the value of the dependent variable.

Mathematically, se is calculated using the formula:

$${s_e} = \sqrt {\frac{1}{{n - 2}}\mathop \sum \nolimits_1^n {{({y_i} - {{\hat y}_i})}^2}} $$

Explanation:

  • n is the number of observations in the data.
  • ${y_i}$ is the actual value of the dependent variable.
  • $\hat y$ is the predicted value of the dependent variable generated by the regression model.
  • ${s_e}$ is the standard error of the estimate.

However, if you are doing manual calculations, you can use the following formula:

$${s_e} = \sqrt {\frac{{\sum y_i^2 - \frac{{{{\left( {\sum {y_i}} \right)}^2}}}{n} - {b_1}\left[ {\sum {x_i}{y_i} - \frac{{\sum {x_i}\sum {y_i}}}{n}} \right]}}{{n - 2}}} $$

Or

$${s_e} = \sqrt {\frac{{\sum y_i^2 - {b_0}\sum {y_i} - {b_1}\sum {x_i}{y_i}}}{{n - 2}}} $$

Note that the above formula assumes that the number of degrees of freedom (df) for this model is n−2, which is typically true for simple linear regression. The number of degrees of freedom will be different for more complex types of regression, such as multiple regression.

The Standard Error of Estimate can be used to help assess the quality of a regression model. Generally, the smaller the value of se, the better the model is at predicting the value of the dependent variable based on the value of the independent variable.

Applied Example

In this example, we will calculate the Standard Error of Estimate (se) using two methods: the direct method from the definition of se and the manual method using an additional formula.

Direct Method

Using the basic formula of the Standard Error of Estimate:

$${s_e} = \sqrt {\frac{1}{{n - 2}}\mathop \sum \nolimits_1^n {{({y_i} - {{\hat y}_i})}^2}} $$

We know that n (the number of observations) is 7 and the sum of squared errors (SSE) is 311.56 (as calculated in the previous subsection). So,

$${s_e} = \sqrt {\frac{1}{{7 - 2}}\left( {311.56} \right)} = \sqrt {62.3124} = 7.8938$$

This means, on average, the actual data points are about 7.89 units away from the regression line.

Manual Method

Using the manual formula:

$${s_e} = \sqrt {\frac{{\sum y_i^2 - {b_0}\sum {y_i} - {b_1}\sum {x_i}{y_i}}}{{n - 2}}} $$

From the earlier calculations, we know that $\sum y_i^2$ = 150803, $b_0$ = 61.2441, $\sum {y_i}$ = 1015, $b_1$ = 1.4694, and $\sum {x_i}{y_i}$ = 60112.

So, substitute these values into the formula:

$${s_e} = \sqrt {\frac{{150803\; - 61.2441\left( {1015} \right) - \;1.4694\left( {60112} \right)}}{{7 - 2}}} = \sqrt {62.3124} = 7.8938$$

This value is the same as the one calculated using the direct method, so we can ensure that our calculation is correct. This value also indicates that our model is quite accurate, with an average error of 7.89 units.

Keep in mind that this calculation applies to simple linear regression, and the standard error of estimate can differ for more complex types of regression.
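
Both versions of the formula are easy to verify programmatically. The following Python sketch (again with illustrative variable names) computes $s_e$ by the direct method and by the manual method:

```python
import math

ages = [34, 43, 49, 58, 64, 73, 78]
systolic = [108, 129, 126, 149, 168, 161, 174]
b0, b1 = 61.2441, 1.4694
n = len(ages)

# Direct method: se = sqrt(SSE / (n - 2))
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ages, systolic))
se_direct = math.sqrt(sse / (n - 2))

# Manual method: se = sqrt((sum(y^2) - b0*sum(y) - b1*sum(xy)) / (n - 2))
sum_y = sum(systolic)
sum_y2 = sum(y * y for y in systolic)
sum_xy = sum(x * y for x, y in zip(ages, systolic))
se_manual = math.sqrt((sum_y2 - b0 * sum_y - b1 * sum_xy) / (n - 2))

print(f"se (direct) = {se_direct:.4f}, se (manual) = {se_manual:.4f}")  # both about 7.89
```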

Variance Analysis (ANOVA) in Regression Models

Variance Analysis, better known as ANOVA, is a statistical technique used to test the extent to which independent variables in a regression model affect the dependent variable. In the context of regression models, ANOVA is used to test the hypothesis that the regression model we have fitted significantly explains the variation in the data.

Specifically, we use ANOVA to test the null hypothesis that all regression coefficients (except the intercept) are zero, versus the alternative hypothesis that at least one regression coefficient is not zero. If we reject the null hypothesis, this indicates that our regression model significantly explains the variation in the data.

ANOVA in regression models usually involves the following concepts:

  • Total Sum of Squares (SST): This is the total variation in the data. SST is calculated as the sum of the squares of the differences between each observation and the average observation.
  • Regression Sum of Squares (SSR): This is the variation explained by the regression model. SSR is calculated as the sum of the squares of the differences between the model's predicted values and the average observation.
  • Sum of Squared Errors (SSE): This is the variation that is not explained by the model. SSE is calculated as the sum of the squares of the differences between each observation and the model's predicted value.

The ANOVA table will look like this:

| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | F-statistic | P-value |
|---|---|---|---|---|---|
| Regression | SSR | k | MSR = SSR/k | F = MSR/MSE | P-value |
| Error | SSE | n-k-1 | MSE = SSE/(n-k-1) | | |
| Total | SST | n-1 | | | |

Legend:

  • n is the number of observations.
  • k is the number of independent variables.

The F-statistic is used to test the null hypothesis that all regression coefficients are zero, meaning the independent variables have no effect on the dependent variable. If the P-value is less than the predetermined significance level (for example, 0.05), then we reject the null hypothesis and conclude that at least one of the independent variables is significant in predicting the dependent variable.

In practice, the results of ANOVA provide a general overview of the model fit. However, model evaluation involves more than just looking at the P-value: it is also important to assess the assumptions of the model, the relevance of the variables included, and potential multicollinearity or dependencies between independent variables.

Applied Example

In this example, we will apply the concept of Analysis of Variance (ANOVA) in the context of simple linear regression. The aim is to understand to what extent the regression model we have built can explain the variation in the data. We will use the ANOVA table to interpret the results of this analysis.

ANOVA table:

| Variance Source | Degrees of Freedom (DF) | Sum of Squares (SS) | Mean Square (MS) | F-Statistic | P-Value | F-0.05 | F-0.01 |
|---|---|---|---|---|---|---|---|
| Regression (Age) | 1 | 3316.438 | 3316.438 | 53.223 ** | 0.001 | 6.608 | 16.258 |
| Error | 5 | 311.5618 | 62.3124 | - | - | - | - |
| Total | 6 | 3628 | - | - | - | - | - |

The following is an interpretation of the above ANOVA table:

  • Regression (Age): This refers to the variation in systolic blood pressure that can be explained by age. With a sum of squares of 3316.4382, this suggests that age explains most of the variation in systolic blood pressure.
  • Error: This is the variation in systolic blood pressure that is not explained by age. With a sum of squares of 311.5618, this suggests that there is still variation in systolic blood pressure that is not explained by our model.
  • Total: This is the total variation in systolic blood pressure. This value is the sum of the regression sum of squares and the error sum of squares.
  • F-Statistic: The F-Statistic value of 53.223 indicates that our regression model significantly explains the variation in the data, as the F-Statistic is larger than the critical value at the 0.05 and 0.01 significance level.
  • P-Value: The P-value is less than 0.05, indicating that we can reject the null hypothesis that age does not have a significant effect on systolic blood pressure. In other words, age has a significant effect on systolic blood pressure.

Thus, based on the variance analysis, we can say that the regression model we have built significantly explains the variation in systolic blood pressure, and that age is a significant predictor of systolic blood pressure.
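
If SciPy is available (an assumption about tooling, not part of the original text), the F-statistic, its p-value, and the critical values in the table above can be reproduced from the SSR and SSE values computed earlier:

```python
from scipy import stats

# Values from the worked example
ssr, sse = 3316.438, 311.562
k, n = 1, 7                        # one predictor, seven observations

df_reg, df_err = k, n - k - 1
msr, mse = ssr / df_reg, sse / df_err
f_stat = msr / mse

p_value = stats.f.sf(f_stat, df_reg, df_err)       # upper-tail probability
f_crit_05 = stats.f.ppf(0.95, df_reg, df_err)
f_crit_01 = stats.f.ppf(0.99, df_reg, df_err)

print(f"F = {f_stat:.3f}, p = {p_value:.4f}")                   # about 53.22, p about 0.0008
print(f"F(0.05) = {f_crit_05:.3f}, F(0.01) = {f_crit_01:.3f}")  # about 6.61 and 16.26
```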

Testing Regression Model

Testing a Regression Model involves two types of tests: simultaneous testing and partial testing.

Simultaneous Regression Model Testing

Simultaneous testing in linear regression aims to test whether the fitted regression model is significant as a whole. The specific purpose of this test is to determine whether there is a linear relationship between the independent variable X and the dependent variable Y, or, when there are several predictors, between at least one of the independent variables and the dependent variable Y.

This test is conducted using an analysis of variance approach, by dividing the total variation of the dependent variable into two components, namely:

SST = SSR + SSE

The hypotheses applicable for this test are:

  • H0: All regression coefficients are zero (β1 = β2 = ... = βk = 0)
  • H1: At least one regression coefficient is not zero (there is at least one βi ≠ 0)

This test uses the F Test, where the F value is calculated as the ratio of the variance explained by the model (MSR) to the variance unexplained by the model (MSE):

F = MSR / MSE

Applied example:

Referring to the previous ANOVA table, the F-value (F-observed) is 53.223. This value indicates the extent to which the variation in the data is explained by the model compared to the variation not explained. The p-value (0.001) is less than 0.05, which means we can reject the null hypothesis and conclude that at least one independent variable has a linear relationship with the dependent variable.

However, it is important to note that this does not mean all independent variables have a significant relationship with the dependent variable. For this, we need to conduct partial testing or t-testing for each independent variable. This test will be discussed in the next section.

Partial Regression Model Testing

Partial testing is used to test individually whether a regression coefficient b is statistically significant in the model. In other words, it asks whether an independent variable X makes a significant contribution to the dependent variable Y when tested on its own. Partial testing is performed using the t-test.

The steps for testing include:

1. Determine the Hypothesis:

The null hypothesis (H0) and alternative hypothesis (H1) are usually determined as follows:

  • H0: β = 0 (there is no significant effect of the independent variable on the dependent variable)
  • H1: β ≠ 0 (there is a significant effect of the independent variable on the dependent variable)

2. Perform Statistical Testing:

The statistical test used is the t-test, which is calculated as follows:

$$t = \frac{{b - \beta }}{{S{E_b}}}$$

where $S{E_b}$ is the standard error of the regression coefficient, and b is the estimate of the regression coefficient.

Test for b1 (regression coefficient):

$$t = \frac{{{b_1} - {\beta _1}}}{{S{E_{b1}}}}$$

where $S{E_{{b_1}}}$ is the standard error of the regression coefficient and can be calculated using the formula:

$$S{E_{{b_1}}} = \sqrt {\frac{{s_e^2}}{{\sum {{\left( {{x_i} - \bar x} \right)}^2}}}} \;\; = \;\sqrt {\frac{{s_e^2}}{{\sum x_i^2 - \frac{{{{\left( {\sum {x_i}} \right)}^2}}}{n}}}} $$

The Standard Error of the Estimate, $s_e$, is a measure that describes how well our regression model fits the data we have. This value is calculated using the formula explained in the previous section. Essentially, $s_e$ is the square root of the mean squared residual, where a residual is the difference between the observed value and the value predicted by the model.

Test for b0 (intercept):

$$t = \frac{{{b_0} - {\beta _0}}}{{S{E_{b0}}}}$$

$S{E_{b0}}$ is the standard error of the intercept and can be calculated with the formula:

$${\rm SE}_{b_0}=\sqrt{s_e^2}\ \bullet\sqrt{\frac{\sum x_i^2}{n\sum{(x_i-\bar{x})}^2}}$$

or

$${\rm SE}_{b_0}=s_e\ \bullet\sqrt{\frac{1}{n}+\frac{{\bar{x}}^2}{\sum{(x_i-\bar{x})}^2}}$$

or

$${\rm SE}_{b_0}=s_e\ \bullet\sqrt{\frac{1}{n}+\frac{\left(\frac{\sum x_i}{n}\right)^2}{\sum{x_i^2-\frac{\left({\sum x}_i\right)^2}{n}}}}$$

Decision making:

The hypothesis test is done by calculating the t statistic and comparing it with the critical value from the t distribution at the significance level α with degrees of freedom (n-2). If the absolute value of the t statistic is greater than the critical value, we can reject the null hypothesis and conclude that the independent variable has a significant effect on the dependent variable.

or:

H0 is rejected if |t-observed| > t(α/2; n-2) at the significance level α.

Applied example

Test for b1 (regression coefficient):

From the previous calculation we have obtained the values se = 7.8938, $\sum x_i^2$ = 24279, $\sum {x_i}$ = 399, and n = 7. Next we substitute them into the formula:

$$S{E_{{b_1}}} = \sqrt {\frac{{{{\left( {7.8938} \right)}^2}}}{{24279 - \frac{{{{\left( {399} \right)}^2}}}{7}}}} = 0.2014$$

Therefore:

$$t = \frac{{1.4694 - 0}}{{0.2014}} = 7.2954$$

Interpretation:

With a calculated t-value of 7.2954 and the table t-value at α/2 (0.025) with 5 degrees of freedom being 2.571, we can see that the calculated t-value is greater than the table t-value. Therefore, H0 is rejected. This means the independent variable X (age) makes a significant contribution to the dependent variable Y (systolic blood pressure).

Test for b0 (intercept):

$${\rm SE}_{b_0}=s_e\ \bullet\sqrt{\frac{1}{n}+\frac{\left(\frac{\sum x_i}{n}\right)^2}{\sum{x_i^2-\frac{\left({\sum x}_i\right)^2}{n}}}}$$

$${\rm SE}_{b_0}=7.8938\ \bullet\sqrt{\frac{1}{7}+\frac{\left(\frac{399}{7}\right)^2}{24279-\frac{\left(399\right)^2}{7}}}=11.862$$

Therefore:

$$t = \frac{61.2441}{11.862} = 5.16305$$

Interpretation:

With a calculated t-value of 5.16305 and the table t-value at α/2 (0.025) with 5 degrees of freedom being 2.571, we can see that the calculated t-value is greater than the table t-value. Therefore, H0 is rejected. This indicates that the intercept ($b_0$) is significant in our regression model.
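
A compact Python sketch of both partial tests is shown below. It assumes SciPy for the critical t-value, reuses the estimates from the worked example, and uses variable names of our own choosing:

```python
import math
from scipy import stats

ages = [34, 43, 49, 58, 64, 73, 78]
b0, b1, se = 61.2441, 1.4694, 7.8938        # estimates from the worked example
n = len(ages)
x_bar = sum(ages) / n
sxx = sum((x - x_bar) ** 2 for x in ages)   # sum of (xi - xbar)^2 = 1536

se_b1 = math.sqrt(se ** 2 / sxx)
se_b0 = se * math.sqrt(1 / n + x_bar ** 2 / sxx)

t_b1 = b1 / se_b1                            # tests H0: beta1 = 0
t_b0 = b0 / se_b0                            # tests H0: beta0 = 0
t_crit = stats.t.ppf(0.975, n - 2)           # two-tailed critical value at alpha = 0.05

print(f"SE(b1) = {se_b1:.4f}, t = {t_b1:.3f}")   # about 0.2014 and 7.30
print(f"SE(b0) = {se_b0:.3f}, t = {t_b0:.3f}")   # about 11.862 and 5.16
print(f"critical t(0.025, 5) = {t_crit:.3f}")    # about 2.571
```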

Assumptions in Regression Analysis

In conducting a linear regression analysis, there are several classic assumptions that need to be met. These assumptions are tested using residual data, not observation data, except in the case of multicollinearity tests. The following are the assumptions in regression analysis:

1. The Model is Correctly Specified

This assumption refers to the requirement that the regression model should be correctly designed. In this context, "correctly specified" means that the model created by the researcher is in accordance with the theoretical concept of the case being studied. There are no specific statistical tests that can be done for this assumption as it is closely related to the researcher's theoretical understanding of the case under study.

In this case example, the test context focuses on the relationship between age (independent variable) and systolic blood pressure (dependent variable), which is theoretically sound as increased age is often associated with increased blood pressure.

2. Errors are Normally Distributed

This assumption requires that the errors from the regression model have a normal distribution with zero mean and constant variance. This can be tested using various methods, such as normal probability plots, the Shapiro-Wilk test, or the Kolmogorov-Smirnov test.

Here are some results of normality tests:

| Test | Statistic | P-Value |
|---|---|---|
| Shapiro-Wilk's | 0.927 | 0.523 |
| Anderson Darling | 0.302 | 0.578 |
| D'Agostino Pearson | 0.809 | 0.667 |
| Liliefors | 0.173 | p > 0.20 |
| Kolmogorov-Smirnov | 0.173 | p > 0.20 |

Based on the normality test results we provided (Shapiro-Wilk's, Anderson Darling, D'Agostino Pearson, Liliefors, and Kolmogorov-Smirnov), all p-values are greater than 0.05. This means we fail to reject the null hypothesis that the residuals are normally distributed. Therefore, this assumption is met.
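
As an illustration of how such a check might be run in practice, the following sketch applies the Shapiro-Wilk test from SciPy to the residuals of the worked example; the exact statistic may differ slightly from the table, which was produced with other software:

```python
from scipy import stats

ages = [34, 43, 49, 58, 64, 73, 78]
systolic = [108, 129, 126, 149, 168, 161, 174]
b0, b1 = 61.2441, 1.4694

residuals = [y - (b0 + b1 * x) for x, y in zip(ages, systolic)]

# Shapiro-Wilk test: H0 = residuals are normally distributed
stat, p_value = stats.shapiro(residuals)
print(f"W = {stat:.3f}, p = {p_value:.3f}")   # p > 0.05 means no evidence against normality
```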

3. Homoscedasticity of Errors

This assumption requires that the variance of the errors is constant or unchanging along the regression line. Deviation from this assumption is called heteroscedasticity and can be tested with residual plots or using the White test or Breusch-Pagan test.

Here are the homoscedasticity test results:

| DF | χ²-Calculated | P-Value | χ²-0.05 |
|---|---|---|---|
| 1 | 0.329 | 0.566 | 3.841 |

The Breusch-Pagan-Godfrey test gives a p-value of 0.566, which is greater than 0.05. This means we fail to reject the null hypothesis that the variance is the same across all groups (homoscedasticity). Therefore, there is no strong evidence of heteroscedasticity in our model, meaning this assumption is also met.
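
A Breusch-Pagan test can be run, for example, with the statsmodels package (an assumption about tooling, not part of the original text); the exact statistic may differ slightly from the table above depending on the test variant used:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

ages = np.array([34, 43, 49, 58, 64, 73, 78], dtype=float)
systolic = np.array([108, 129, 126, 149, 168, 161, 174], dtype=float)

X = sm.add_constant(ages)              # design matrix with an intercept column
model = sm.OLS(systolic, X).fit()      # refit the simple regression

# Returns the LM statistic, its p-value, and an F-form of the test with its p-value
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, X)
print(f"LM = {lm_stat:.3f}, p = {lm_pvalue:.3f}")   # compare with the table above
```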

4. There is No Autocorrelation in Errors

This assumption requires that there is no correlation between the error of one period and the error of another period. Deviation from this assumption is called autocorrelation and can be tested with residual plots or the Durbin-Watson test.

5. No Multicollinearity

This assumption applies in the context of multiple linear regression, which requires no perfect correlation between two or more independent variables. Multicollinearity is the term used to describe deviation from this assumption and can be tested using the Variance Inflation Factor (VIF) or tolerance.

Here are guidelines for interpreting the Variance Inflation Factor (VIF) value:

  • VIF = 1: no correlation.
  • VIF between 1 and 5: moderate correlation.
  • VIF above 5: strong correlation.
  • VIF above 10: indicates serious multicollinearity.

In the current regression model we have, there is only one independent variable, namely age. Therefore, multicollinearity (which can only occur when there are two or more intercorrelated independent variables) cannot occur. Hence, this assumption is also met.

By meeting these assumptions, the linear regression model will have better reliability and validity in analyzing and predicting data.

Based on the results of the analysis and testing of regression assumptions above, it appears that the linear regression model for this data meets all assumptions and therefore, the results can be considered reliable and valid.

Model Quality Evaluation

Here are some metrics commonly used in model quality evaluation:

  • Coefficient of Determination (R2): This is a metric commonly used to evaluate the quality of a linear regression model. R2 is the proportion of the total variance of the dependent variable explained by the model. R2 values range between 0 and 1, with 1 indicating that the model explains the entire variance in the data, and 0 indicating that the model does not explain any variance.
  • Adjusted R2: Although R2 can be a useful metric, it has a drawback in that its value tends to increase as we add more variables to the model, regardless of whether the variables are truly useful for the model. Therefore, Adjusted R2 is often more useful as it accounts for the number of variables in the model and can penalize overly complex models.
  • Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE): Both these metrics measure the average magnitude of the prediction errors of the model. RMSE gives more weight to large errors (because errors are squared before averaging), while MAE gives equal weight to all errors.
  • Residual Plots: Residual plots can provide useful visual insights about how the model behaves against the data. Ideally, residual plots should look random (indicating that the model generates equal errors at all levels of the dependent variable), and no clear patterns should be identifiable in the plot.
  • Statistical Tests: Many statistical tests are used in the context of regression, including t and F tests for regression coefficients, Durbin-Watson test for autocorrelation, and Breusch-Pagan test for heteroskedasticity.
  • Cross-Validation: Cross-validation is a useful technique for evaluating how well the model can predict new data. In cross-validation, the data is divided into several subsets, and the model is trained using some subsets and tested on the remaining subset.
  • Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC): Both these metrics are used to compare different models and penalize overly complex models. Models with a lower AIC or BIC are usually considered better.

Remember that no single metric can provide a complete picture of the model's quality. Instead, a comprehensive analysis using various metrics and plots must be performed to properly evaluate a linear regression model. Additionally, this evaluation should be adjusted according to the context and purpose of the analysis. For instance, if the purpose is prediction, we might care more about how well the model predicts new data, whereas if the purpose is understanding, we might care more about how well the model explains the existing data.
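
As a small illustration of the error metrics mentioned above, RMSE and MAE for the worked example can be computed as follows (a sketch with illustrative variable names):

```python
import math

ages = [34, 43, 49, 58, 64, 73, 78]
systolic = [108, 129, 126, 149, 168, 161, 174]
b0, b1 = 61.2441, 1.4694

errors = [y - (b0 + b1 * x) for x, y in zip(ages, systolic)]
n = len(errors)

rmse = math.sqrt(sum(e ** 2 for e in errors) / n)   # penalizes large errors more heavily
mae = sum(abs(e) for e in errors) / n               # treats all errors equally

print(f"RMSE = {rmse:.2f} mm Hg, MAE = {mae:.2f} mm Hg")
```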

Confidence Interval

A confidence interval is a range of values generated from a statistical procedure used to estimate a population parameter. The confidence interval provides an upper and lower bound for the estimated parameter, which contains the true value of the parameter with a certain probability. This probability level is called the confidence level and is usually set at 95% or 99%.

In the context of linear regression, we are often interested in finding the confidence interval for regression coefficients, which will provide a range of estimates where we believe the actual population coefficient lies.

The confidence interval for the regression coefficient β can be calculated using the following formula:

$$CI = b \pm t \times SE_b$$

where:

  • CI is the confidence interval,
  • b is the estimated regression coefficient,
  • t is the critical value from the t-distribution at the desired significance level (usually α = 0.05, two-tailed) with n - 2 degrees of freedom, and
  • SEb is the standard error of b.

Note that $t \times SE_b$ is the margin of error for our estimate.

With confidence intervals, we not only get our best estimate for our parameter, but also an understanding of how far this estimate might vary if we were to take a different sample from the population. This gives a more complete picture of the uncertainty surrounding our estimate.
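
Applied to the worked example, a 95% confidence interval for the slope can be sketched as follows, assuming SciPy for the critical t-value and reusing the estimates b1 = 1.4694 and SE(b1) = 0.2014 from earlier:

```python
from scipy import stats

b1, se_b1 = 1.4694, 0.2014       # slope estimate and its standard error
n = 7

t_crit = stats.t.ppf(0.975, n - 2)      # two-tailed 95% critical value, df = 5
margin = t_crit * se_b1                 # margin of error

lower, upper = b1 - margin, b1 + margin
print(f"95% CI for beta1: ({lower:.3f}, {upper:.3f})")   # roughly (0.95, 1.99)
```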

Interpolation and Extrapolation

Interpolation and extrapolation are two techniques used to predict the value of a dependent variable from the value of an independent variable in regression analysis. Both techniques use a regression model that has been estimated from existing data.

Interpolation is the process of using a regression model to predict the value of the dependent variable for the value of the independent variable that lies between the minimum and maximum values in the data used to estimate the model. In other words, interpolation is used to predict values "inside" the range of data we have.

For example, if we have a regression model from monthly sales data over one year, we can use interpolation to predict sales in the 7th month if we only have data for the 6th and 8th months.

Extrapolation, on the other hand, is the process of using a regression model to predict the value of the dependent variable for the value of the independent variable that lies outside the range of values in the data used to estimate the model. That is, extrapolation is used to predict values "outside" the range of data we have.

Using the same example, we could use extrapolation to predict sales in the 13th month if we only have data up to the 12th month.

It should be noted that extrapolation is typically riskier than interpolation, as we are predicting values outside the range of the data we possess and this can result in less accurate predictions. When performing extrapolation, it's always important to ensure that the model we use makes sense for values outside of our data range.
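
As a closing illustration, the short sketch below (a hypothetical helper of our own, not from the text) labels a prediction as interpolation or extrapolation depending on whether the requested age falls inside the observed range of 34 to 78 years:

```python
b0, b1 = 61.2441, 1.4694
x_min, x_max = 34, 78            # observed age range in the worked example

def predict(age):
    """Return the predicted systolic pressure and whether it is an interpolation."""
    y_hat = b0 + b1 * age
    kind = "interpolation" if x_min <= age <= x_max else "extrapolation"
    return y_hat, kind

for age in (50, 90):             # 50 lies inside the data range, 90 outside it
    y_hat, kind = predict(age)
    print(f"age {age}: predicted systolic = {y_hat:.1f} mm Hg ({kind})")
```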