Linear Regression 101

In this guide, we’ll walk you through the essentials of what you need to know about linear regression

July 13, 2022

Linear regression is one of the most important tools for practical business applications of statistics. It’s used to analyze relationships and make predictions in fields ranging from financial planning to machine learning to healthcare, to name just a few. If you’re doing almost anything involving big data or artificial intelligence, you need at least a working knowledge of what linear regression is and how it’s done, or at least someone on your team who does. (If you need help with linear regression applications but don't want to do the work yourself, don't worry: Fiverr lets you connect with data analytics specialists who can handle the technical side for you.)

Here we’ll walk you through the essentials of what you need to know about linear regression.

Before We Start 

We’ll assume that if you’re reading this, you’ve had enough mathematical background to understand some of the essential concepts used in linear regression, like how to graph a straight line, what a statistical average is, and what we mean by a normal distribution pattern (bell curve). However, in case you haven’t taken a math class in a while, we’ll try to keep the math as simple as possible, apart from some brief comments for those interested in digging more into advanced topics, and we'll pause to review key concepts as they come up.

If you’ve taken high school algebra and geometry, you should be able to follow most of the discussion, although you’ll find it easier if you’ve taken an introduction to statistics.

The section on assumptions needed to do linear regression involves more advanced statistics than the other sections. You can skip over that section if you just want to understand the basics of linear regression.

Let’s get started:

What Is Linear Regression?

Let’s start with a definition of linear regression:

Linear regression is a statistical method for analyzing how a selected variable X, usually known as the independent variable, influences one or more other variables such as Y, known as dependent variables. A linear regression equation models how making the independent variable bigger or smaller affects the dependent variable or variables. For example, how does the height of a person’s parents (X) influence their own height (Y)?

Now let’s try to visualize what this means:

  • Linear regression starts with data points that don’t quite form a straight line when graphed onto a coordinate system using dots to represent their values (a way of visualizing data called a scatter plot diagram).

  • After graphing the points onto a scatter plot diagram, linear regression analysis seeks to find the best line to fit the points as closely as possible.

  • The best fit is the line that minimizes the overall vertical distance between the line and the points falling above or below it.

  • The idea is that the line represents the average trend of the data, providing a center around which the data points are scattered (see the sketch after this list).

  • The line is known as the regression line or line of regression.
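
To make this concrete, here’s a minimal sketch in Python, assuming you have numpy installed (the data points are made up for illustration):

    import numpy as np

    # Made-up data points that roughly, but not exactly, follow a straight line
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 2.9, 3.6, 4.4, 5.1, 5.8])

    # A degree-1 polynomial fit finds the least-squares regression line
    slope, intercept = np.polyfit(x, y, 1)
    print(f"regression line: Y = {intercept:.3f} + {slope:.3f}X")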

After a line of regression is established, it can be applied in a couple of ways:

  • It can be used as a guide to estimate where additional data points would fall for selected values of the independent variable within the original data range (called interpolation).

  • It also can be used to extend the line by filling in points beyond the original range (extrapolation). However, the further you extend the line, the less certain you can be that data outside your original range will follow the same pattern, so this must be approached carefully.
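
To illustrate both uses, here’s a continuation of the sketch above, with the same made-up numbers; values inside the original X range are interpolated, and values outside it are extrapolated:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 2.9, 3.6, 4.4, 5.1, 5.8])
    slope, intercept = np.polyfit(x, y, 1)

    def predict(new_x):
        # Predict Y for a chosen X value using the fitted line
        return intercept + slope * new_x

    print(predict(3.5))   # interpolation: 3.5 falls inside the original range
    print(predict(12.0))  # extrapolation: outside the range, treat with caution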

A regression line serves to model how the independent variable or variables influence the dependent variable. This has a couple of general applications:

  • It can be used to extrapolate from a statistical sample to conclusions about a population.

  • It also can predict future behavior consistent with the trend of the regression line.

These general applications have many specific uses in any field of science or business which relies on data, including:

  • Medicine and healthcare (for example, how do data on smoking relate to data on heart disease?)

  • Psychology (how does genetics relate to depression?)

  • Sociology (how does educational level relate to income?)

  • Financial planning (how does marketing performance relate to revenue?)

  • Machine learning (what is the likelihood that someone who types in one search term wants information about a topic with another search term?)

Let’s look at an example:

An Example of Linear Regression

To illustrate linear regression in action, let’s take a famous historic example from 19th-century British scientist Sir Francis Galton, one of the pioneers of regression analysis. Galton studied the relationship between the heights of parents and their offspring in order to determine how parents’ height affected their children’s height.

While many people intuitively would assume that tall parents have tall offspring and short parents have short offspring, Galton discovered that this wasn’t necessarily the case. He found that children’s heights tend toward an average height regardless of their parents’ height.

Galton found that regardless of how tall parents were, their children’s height tended to average about 68 1/4 inches. Deviations from this height fell randomly around the average in a pattern we today would refer to as a normal distribution pattern.

This phrase refers to a bell-shaped curve where the average value sits at the peak in the middle of the curve, and results larger or smaller than this value fall away on either side of the peak in order of descending frequency. It was Galton’s protégé Karl Pearson who popularized the term “normal” for this type of distribution pattern. Galton himself described his results using the phrase “regression toward the mean”, referring to how values tend to drift toward the average value.

The practical result of Galton’s study was that it gave him a mathematical model which could be used to represent average height and to predict how likely it was that a child’s height would diverge from this average by a specific amount, given the height of their parents. As this illustrates, linear regression can be used to model real or hypothetical scenarios and to predict outcomes.

Why Is It Called Linear Regression?

The term “linear regression” combines the ideas of linearity and regression:

  • The “linear” part of this phrase comes from the use of a straight line to represent the statistical mean of a set of numbers.

  • “Regression” comes from a Latin root meaning literally to “go back”. Galton used it to describe how values tend to fall back, or regress, toward the mean, which you can visualize as data points falling back toward the regression line.

So the phrase “linear regression” invokes the image of points falling back from a straight line. This image is a good visual aid to understanding what linear regression means both graphically and mathematically.

What Is the Line of Regression?

The line of regression is the line that represents the best match between the data points being plotted and a linear equation in slope-intercept form. It is also referred to as the regression line.

What Are Residuals?

When doing linear regression analysis, you measure the vertical distance each data point falls from the line of regression. This vertical distance represents how far an actual value differs from the average value represented by the line. It is referred to as a residual. A point that lies above the regression line has a positive residual, one which lies below it has a negative residual, and one which falls exactly on the line has a residual of zero.
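
To make residuals concrete, here’s a minimal Python sketch (with made-up numbers) that computes the residual for every data point:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.0, 4.1, 5.9, 8.2])

    slope, intercept = np.polyfit(x, y, 1)
    predicted = intercept + slope * x

    residuals = y - predicted  # positive above the line, negative below it
    print(residuals)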

What Does Least Squares Mean in Linear Regression?

In order to use linear regression to find the line which best fits the data points, you find a linear equation that minimizes the sum of the squares of your residuals. This method is known as finding the least-squares regression line, or simply the least-squares line. One reason we square the residuals instead of using them directly is that squaring gets rid of negative numbers without resorting to absolute values, which makes the minimization much easier to solve mathematically.
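
As a rough numerical illustration (again with made-up numbers), you can check that the least-squares line beats any other candidate line on the sum of squared residuals:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.0, 4.1, 5.9, 8.2])

    def sum_of_squares(intercept, slope):
        # Sum of squared residuals for a candidate line
        return np.sum((y - (intercept + slope * x)) ** 2)

    slope, intercept = np.polyfit(x, y, 1)         # least-squares fit
    print(sum_of_squares(intercept, slope))        # the minimum possible value
    print(sum_of_squares(intercept + 0.5, slope))  # any other line does worse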

What Is a Linear Regression Model?

A linear regression model is an equation representing the relationship between one or more independent variables and one or more dependent variables. Linear regression models fall into three main categories:

  • Simple linear regression (SLR) models: one independent variable related to one dependent variable

  • Multiple linear regression (MLR) models: multiple independent variables related to one dependent variable

  • Multivariate linear regression (MVLR) models: multiple independent variables related to multiple dependent variables

We’ll focus on SLR, but you should be aware of MLR and MVLR, so we’ll cover them briefly. Let’s flesh out the above distinctions:

Simple Linear Regression Models

A simple linear regression model plots one independent variable against one dependent variable. It is graphed using a straight line drawn through points on a scatter plot diagram. The equation for the line uses a variation on the slope-intercept formula. Instead of the formula Y = mX + b, the simple linear regression formula becomes:

Y = β0 + β1X + ε

In this formula:

  • β0 represents the intercept, the value of Y where the regression line crosses the Y-axis (that is, when X is 0)

  • β1 represents the slope of the line, which multiplies X and is estimated from the data by minimizing the residuals

  • ε represents an error term accounting for the difference between observed and predicted values

The epsilon term is sometimes omitted in simplified versions of this equation, shortening it to:

Y = β0 + β1X

This basically means the same thing as Y = mX + b, except the order of the right side of the equation has been reversed by moving the intercept to the front of the equation and the slope to the back:

Y = b + mX

In linear regression, this is often written as:

Y = a + bX

We’ll talk about how to use this equation for actual calculations later, but for now, this gives you an idea of what’s involved. You’re basically finding an equation for a straight line in modified slope-intercept form.
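
For a minimal sketch of what this looks like in practice, here’s a simple linear regression fit using scipy’s linregress function (the data are made up for illustration):

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.8, 4.2, 5.9, 8.1, 9.8])

    result = stats.linregress(x, y)
    print(f"beta0 (intercept) = {result.intercept:.3f}")
    print(f"beta1 (slope)     = {result.slope:.3f}")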

Multiple Linear Regression Models

A multiple linear regression model plots multiple independent variables against one dependent variable. Graphically, this requires plotting relationships on a coordinate system with three or more dimensions, depending on the number of variables involved. For instance, with two independent variables, you’d be graphing in a 3D coordinate system and trying to find the plane that best fits the data. Mathematically, this involves expanding the simple regression line equation by adding a slope term for each independent variable. To take a simple example with two independent variables X1 and X2:

Y = β0 + β1X1 + β2X2 + ε

The more variables involved, the longer this equation becomes and the more abstract the graphical representation. Matrices typically are used to solve multiple linear regression problems, so they require more advanced math than simple linear regression.

Why would you want to use a multiple linear regression model? It’s useful if you have several factors influencing an outcome and you want to see what their combined effect is, or perhaps how it differs from the result when one factor is removed. For example, how do diet and exercise together affect cardiovascular health, as opposed to just diet or exercise alone?
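
As a minimal sketch of what fitting such a model looks like (with made-up numbers), numpy’s general least-squares solver handles the matrix math for you:

    import numpy as np

    # Made-up data: each row of X holds values for two independent variables
    X = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [3.0, 4.0],
                  [4.0, 3.0],
                  [5.0, 5.0]])
    y = np.array([5.1, 4.9, 11.2, 10.8, 15.0])

    # Prepend a column of ones so the model includes the intercept beta0
    A = np.column_stack([np.ones(len(X)), X])
    coefficients, *_ = np.linalg.lstsq(A, y, rcond=None)
    beta0, beta1, beta2 = coefficients
    print(beta0, beta1, beta2)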

Multivariate Linear Regression Models

A multivariate linear regression model relates multiple independent variables to multiple dependent variables. Briefly, this involves expanding the formula for linear regression so that it relates the average of multiple dependent variables to the average of multiple intercepts, multiple slopes, and multiple error factors. Matrices are used to solve multivariate linear regression equations.

Multivariate linear regression models are useful when you have multiple factors influencing multiple outcomes. For example, you might want to study how a combination of marketing, sales, and customer service performance affects both revenue and brand reputation.
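
Conveniently, the same least-squares solver accepts multiple dependent variables at once. Here’s a minimal sketch (made-up numbers) where each column of Y is a separate outcome:

    import numpy as np

    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
    # Two dependent variables, one per column of Y
    Y = np.array([[5.1, 3.0], [4.9, 3.2], [11.2, 7.1], [10.8, 6.8], [15.0, 9.9]])

    A = np.column_stack([np.ones(len(X)), X])
    B, *_ = np.linalg.lstsq(A, Y, rcond=None)
    # Each column of B holds the intercept and slopes for one dependent variable
    print(B)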

Types of Linear Regression

The main types of linear regression are those discussed in the last section:

  • Simple linear regression: one independent variable, one dependent variable

  • Multiple linear regression: multiple independent variables, one dependent variable

  • Multivariate linear regression: multiple independent variables, multiple dependent variables

There also are other advanced types of linear regression for dealing with scenarios such as:

  • Dependent variables which are vectors (general linear models)

  • Variables with varying variances (heteroscedastic models)

  • Dependent variables which are bounded or discrete (generalized linear models)

  • Hierarchies of regressions (hierarchical linear models, also known as multilevel regression)

  • Errors in observations of independent variables (errors-in-variables models)

These all represent advanced applications of the basic concepts contained in simple linear regression.

Assumptions of Linear Regression

In order to apply linear regression models accurately, several assumptions must hold. Four of the most important are:

  • Independent observations: the value of any data point is not influenced by the value of any other data point

  • Linearity: the relationship between dependent and independent variables approximates a straight line

  • Homoscedasticity: the data are scattered an even average distance from the line of regression

  • Normality: the data fall into a normal distribution pattern

Let’s break down what these mean:

Independent Observations

Like most statistical procedures, linear regression assumes that observations are independent of each other, meaning that the value of one data point is not influencing another and skewing the outcome. For example, in Galton’s height study, if one person accidentally was included twice, or if some of the offspring were identical quintuplets sharing the same parents, their data would not be independent.

Linearity

A fundamental assumption of linear regression analysis is that the relationship between dependent and independent variables approximates a line. When data don’t form a line, linear regression doesn’t apply. Some sets of data may be represented better by the graph of another function such as an exponential or logarithmic curve.

Homoscedasticity

Homoscedasticity is a term that comes from Greek root words which carry the connotation of being scattered the same way, as in scattered evenly away from the line of regression. When this condition holds, all values of X share the same residual variance. The opposite condition is referred to as heteroscedasticity. Heteroscedasticity makes it hard for linear regression models to represent variance accurately.

Normality

Normality assumes that data residuals fall into a normal distribution pattern. When this is not the case, slope and confidence intervals can be difficult to determine accurately.

How to Validate Linear Regression Assumptions

Specific procedures may be used to test linear regression assumptions and address common issues. Here are some simplified solutions:

Validating Independent Observations

You can test whether observations are independent by plotting residuals against the independent variables, or against row numbers, and verifying that there is no pattern of correlation between consecutive errors no matter how the values are sorted. If you’re doing a time series analysis, which builds on linear regression analysis, you can check for independence by plotting residuals against time or by computing a statistic called the Durbin-Watson test. For time series, independence issues may be fixed by methods such as adding lagged variables.
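
As a rough sketch (the residuals below are made up for illustration), the Durbin-Watson statistic can be computed directly from a model’s residuals:

    import numpy as np

    def durbin_watson(residuals):
        # Values near 2 suggest no autocorrelation between consecutive errors;
        # values near 0 or 4 suggest positive or negative autocorrelation
        differences = np.diff(residuals)
        return np.sum(differences ** 2) / np.sum(residuals ** 2)

    # Made-up residuals from a fitted model, listed in row (or time) order
    residuals = np.array([0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1])
    print(durbin_watson(residuals))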

Validating Linearity

The easiest way to check for linearity is to visually inspect a scatter plot of your data and look for a roughly straight-line pattern running through your points. If you don’t see one, you may be able to apply a nonlinear transformation or add another independent variable which is a nonlinear function of one of your other variables. Alternatively, you may be overlooking an independent variable that would produce a linear pattern.
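
As a minimal sketch (with made-up, deliberately curved data), a visual linearity check and a straightening log transformation might look like this:

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(1, 10, 30)
    y = 2.0 * np.exp(0.3 * x)  # made-up data with a curved relationship

    figure, axes = plt.subplots(1, 2)
    axes[0].scatter(x, y)          # curved pattern: a straight line fits poorly
    axes[0].set_title("raw data")
    axes[1].scatter(x, np.log(y))  # the log transform straightens it out
    axes[1].set_title("log-transformed")
    plt.show()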

Validating Homoscedasticity

You can check for homoscedasticity by plotting residuals against predicted values and looking for residuals that grow larger as the predicted value increases. To correct this, you can apply a logarithmic transformation to the dependent variable, redefine the dependent variable as a rate, or use weighted regression, which weights each data point based on its variance.
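
Here’s a minimal sketch of the residuals-versus-predicted plot described above, using made-up data whose noise deliberately grows with X:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    x = np.linspace(1, 10, 100)
    y = 2 + 3 * x + rng.normal(0, 0.5 * x)  # noise scale grows with x

    slope, intercept = np.polyfit(x, y, 1)
    predicted = intercept + slope * x

    plt.scatter(predicted, y - predicted)  # a fan shape signals heteroscedasticity
    plt.xlabel("predicted values")
    plt.ylabel("residuals")
    plt.show()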

Validating Normality

You can check for normality by using a quantile-quantile plot (Q-Q plot), which compares the quantiles of your residuals against those of a theoretical normal distribution, or by using tests designed for this purpose, such as the Shapiro-Wilk test. If your residuals aren’t normal, you first should check whether any outliers are skewing your results. If not, you can apply a nonlinear transformation to your independent or dependent variables.
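
As a minimal sketch using scipy (with made-up residuals), a Q-Q plot and a Shapiro-Wilk test look like this:

    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    residuals = rng.normal(0, 1, 200)  # made-up residuals from a fitted model

    # Q-Q plot: the points should hug the diagonal line if residuals are normal
    stats.probplot(residuals, dist="norm", plot=plt)
    plt.show()

    # Shapiro-Wilk test: a small p-value (e.g. below 0.05) suggests non-normality
    statistic, p_value = stats.shapiro(residuals)
    print(p_value)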

Calculating Linear Regression

You can calculate linear regression by using the regression line formula introduced earlier:

Y = β0 + β1X + ε

This formula can be rewritten in various ways. Some simplified versions of the formula omit the epsilon term. A common simplified version of the linear regression formula is:

Y = a + bX

Here a is the intercept and b is the slope. The intercept can be calculated by using a subformula that uses summation notation to represent the sum of X values and Y values and operations on them:

a = [(Σy)(Σx²) – (Σx)(Σxy)] / [n(Σx²) – (Σx)²]

Here n is the number of values in the sample.

The slope can be calculated using the subformula:

b = [n(Σxy) – (Σx)(Σy)] / [n(Σx²) – (Σx)²]
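
Here’s a minimal sketch implementing these two subformulas directly (made-up numbers), with a cross-check against numpy’s built-in least-squares fit:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
    n = len(x)

    sum_x, sum_y = x.sum(), y.sum()
    sum_xy, sum_xx = (x * y).sum(), (x * x).sum()

    a = (sum_y * sum_xx - sum_x * sum_xy) / (n * sum_xx - sum_x ** 2)  # intercept
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)       # slope
    print(a, b)

    # Cross-check against numpy's built-in fit (returns [slope, intercept])
    print(np.polyfit(x, y, 1))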

An online regression line calculator can assist you with the process of finding your line of regression. You can enter your X and Y values into the calculator and it will apply the formula to generate your line of regression. If you have a large amount of data to enter or you have specialized needs, you can use a program such as Excel to create a linear regression model, or you can hire a specialist familiar with the R programming language, which is designed for statistics applications.

Find Linear Regression Support through Fiverr

Linear regression can be an extremely powerful technique for businesses to model and predict outcomes for everything from financial planning to marketing results. This guide provides you with a basic understanding of how linear regression works, but to gain the most benefit, your best strategy is to consult an expert.

Fiverr’s data analytics resources can put you in touch with experienced linear regression specialists who can handle the technical side for you, enabling you to focus on the business results you want. Try Fiverr today to find the perfect data analytics services for your business needs.