This post is the first in our series on machine learning (ML) algorithms. (See posts 2 and 3 on Logistic Regression and k-Nearest Neighbors.) In these posts, we explain the basic underlying concepts behind various algorithms, their pros and cons, and their most common applications. We are starting this discussion with one of the oldest ML algorithms, linear regression. It is an excellent starting point for learning about ML because many of the basic concepts involved are easy to explain in this context and because it is relatively easy to implement your very own linear regression algorithm.

Linear regression is common enough in statistics and fitting routines that it is easy to forget it is a form of machine learning. Recall that any algorithm that improves with the addition of more data, i.e., as the algorithm gains experience, is a learning algorithm, where “improvement” is defined as becoming more accurate in performing its intended task. (Tom Mitchell posed a formal definition along these lines in his 1997 book *Machine Learning*, published by McGraw Hill.) At its heart, linear regression is about finding the coefficients of a line that best match the given data; the more data, the more likely the line will be accurate for new data. Hence, linear regression can be considered an ML algorithm. It should be noted that it is a supervised ML algorithm, meaning “correct” values for the data must be known for training purposes before a prediction can be made on new data.

You may be familiar with the most basic equation of a line, y = mx + b, where m is the slope of the line, b is the y-intercept, x is the independent variable (the predictor), and y is the dependent variable (the response). For a basic linear regression algorithm this still holds true, but instead of a single m and a single x there are many pairs of values, one pair per variable, so the equation becomes y = m_{1}x_{1} + m_{2}x_{2} + m_{3}x_{3} + … + b. Each x represents one of the variables that affect the prediction (called a feature), while each coefficient (m, in the above example) is a number that, essentially, adjusts the contribution of that variable to the final result. Variables with bigger coefficients are generally more influential than those with smaller coefficients, provided the variables are measured on comparable scales.
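To make the equation concrete, here is a minimal sketch of how a prediction is computed once the coefficients are known. The function name `predict` and all of the numbers are our own illustrative assumptions, not values from a real model.

```python
import numpy as np

def predict(features, coefficients, intercept):
    """Compute y = m_1*x_1 + m_2*x_2 + ... + b for a single sample."""
    return float(np.dot(coefficients, features) + intercept)

# Hypothetical values, chosen only to illustrate the arithmetic.
x = np.array([3.0, 2.0, 1500.0])         # x_1, x_2, x_3 (the features)
m = np.array([10000.0, 5000.0, 100.0])   # m_1, m_2, m_3 (the coefficients)
b = 20000.0                              # the intercept

y = predict(x, m, b)
print(y)  # 3*10000 + 2*5000 + 1500*100 + 20000 = 210000.0
```

Note that the third coefficient is the smallest, yet it contributes the most to the result because its feature takes much larger values. This is why coefficients are only directly comparable when the features share a similar scale.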

As a classic example, say you wanted to figure out how much your house was worth. If you had data about other houses in the area and their prices, you could probably make a pretty good guess. You’d compare the age of your house, the number of bedrooms and bathrooms, the square footage, and other such features to the other houses.

If you organized the information about the houses into a standard ML format, where the data is in a matrix (a table, if you prefer) with each row corresponding to one house and each column corresponding to one feature of the house, then ran a linear regression algorithm on that data, it would come up with coefficients for each of the features (one coefficient per column of the matrix), along with an intercept value. You could then plug the values for your house into that equation, and the number that came out would be the predicted price for your house.
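The workflow just described can be sketched in a few lines with NumPy's least-squares solver. Everything about the data here is made up for illustration: the feature columns, the houses, and the prices, which we generate from a known linear rule so that the fitted coefficients are easy to check. Real housing data would be noisy and would not be recovered exactly.

```python
import numpy as np

# Made-up training data: one row per house, one column per feature.
# Columns: bedrooms, bathrooms, square footage.
X = np.array([
    [2, 1,  900],
    [3, 2, 1500],
    [3, 1, 1200],
    [4, 2, 2000],
    [4, 3, 2400],
    [5, 3, 3000],
], dtype=float)

# Made-up prices generated from a known rule (an assumption for this demo):
# price = 50000 + 10000*bedrooms + 15000*bathrooms + 100*square_feet
prices = 50000.0 + X @ np.array([10000.0, 15000.0, 100.0])

# Prepend a column of ones so the intercept is learned as one more coefficient.
A = np.column_stack([np.ones(len(X)), X])

# Least-squares fit: one coefficient per column, plus the intercept.
solution, _, _, _ = np.linalg.lstsq(A, prices, rcond=None)
intercept, coefficients = solution[0], solution[1:]

# Plug in the values for "your" house to get a predicted price.
my_house = np.array([3.0, 2.0, 1600.0])
predicted_price = intercept + coefficients @ my_house
print(coefficients, intercept, predicted_price)
```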

One point that should be made about the linear regression algorithm is that it can accommodate nonlinearity by preprocessing the data. New features can be created by multiplying the values of existing features together or with themselves. So, back in our house example, you might think that the combination of number of bedrooms and bathrooms is important, in addition to each one being important on its own. You could form a new feature by multiplying those two features together for each house. That new feature would get its own column and coefficient, and everything would proceed as it did in the original example. The only difference is that the model would now account for the new combination feature.
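As a small sketch of this preprocessing step, the snippet below appends a bedrooms-times-bathrooms column to a feature matrix. The three houses are made-up values, and the column order (bedrooms, bathrooms, square footage) is our assumption.

```python
import numpy as np

# Made-up feature matrix. Columns: bedrooms, bathrooms, square footage.
X = np.array([
    [2, 1,  900],
    [3, 2, 1500],
    [4, 3, 2400],
], dtype=float)

# New combination feature: bedrooms * bathrooms, computed per house.
bedrooms_x_bathrooms = X[:, 0] * X[:, 1]

# Append it as an extra column; the fit will assign it its own coefficient.
X_expanded = np.column_stack([X, bedrooms_x_bathrooms])
print(X_expanded)
```

The model is still linear in its coefficients, which is why the same fitting procedure works unchanged on the expanded matrix.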

There are many nuances to the implementation of linear regression, especially when used on big data. Explaining them all here would go beyond the scope of this post, but we should mention the terms feature scaling, regularization, cost function, tunable parameters, normal equation and gradient descent, at a minimum, as elements that should be understood for this topic (and, indeed, for a number of ML algorithms).
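Without going into detail, the following sketch shows how several of these terms fit together on a tiny single-feature problem. It is an illustration under our own assumptions, not a production implementation: the data is made up, the cost function is the common mean-squared-error form, the learning rate and iteration count are tunable parameters we picked by hand, and regularization is left out entirely.

```python
import numpy as np

# Made-up data: square footage and sale price for six houses.
sqft = np.array([900.0, 1200.0, 1500.0, 2000.0, 2400.0, 3000.0])
price = np.array([150000.0, 185000.0, 210000.0, 280000.0, 310000.0, 390000.0])

# Feature scaling: standardize to zero mean and unit variance.
x = (sqft - sqft.mean()) / sqft.std()
A = np.column_stack([np.ones_like(x), x])   # column of ones for the intercept

def cost(w):
    """Cost function: half the mean squared error of the predictions."""
    residuals = A @ w - price
    return float(residuals @ residuals) / (2 * len(price))

# Gradient descent: repeatedly step against the gradient of the cost.
w = np.zeros(2)
learning_rate = 0.1          # tunable parameter
for _ in range(1000):        # tunable parameter
    gradient = A.T @ (A @ w - price) / len(price)
    w -= learning_rate * gradient

# Normal equation: solve (A^T A) w = A^T y directly, with no iteration.
w_exact = np.linalg.solve(A.T @ A, A.T @ price)

print(w, w_exact, cost(w))
```

On a problem this small both approaches land on the same coefficients. Gradient descent is the iterative route, while the normal equation solves for the coefficients in one step.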

The biggest pros to using linear regression are its simplicity, the ease of explaining the result to another human being, and the insight you can gain into your system by looking at the coefficients that were generated (since they tell you about the relative importance of different features). It is also fast and computationally cheap compared to many other algorithms. Plus, because it is so old and common, it is built into virtually any numerical software or programming language you care to use.

There are also several cons to this approach. Only numerical data can be accommodated in these models. Categorical data can be transformed to fit within the linear model framework, but ordinal data is harder to represent faithfully, because simple integer codes assume equal spacing between levels. It is assumed that the value being predicted (for example, housing prices) is a continuous variable, so the algorithm is not well suited to predicting discrete values such as count data, categorical variables, or membership in one of several discrete groups. If nonlinear relationships exist (and cannot be accounted for by combining features), predictions will be poor. Outliers tend to have a large impact on the final outcome. Finally, missing values cannot be accommodated by the algorithm, so they must be filled in or removed beforehand.
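One common way to transform categorical data into numerical form is one-hot encoding, sketched below. The `neighborhoods` feature and its values are hypothetical additions to the house example; the encoding itself is a standard technique, but it is our choice of illustration rather than the only option.

```python
import numpy as np

# Hypothetical categorical feature: the neighborhood of each house.
neighborhoods = ["north", "south", "east", "south"]

# One column per category; a 1.0 marks the category a house belongs to.
categories = sorted(set(neighborhoods))   # ['east', 'north', 'south']
one_hot = np.array([
    [1.0 if value == category else 0.0 for category in categories]
    for value in neighborhoods
])

# When the model also has an intercept, one column is redundant,
# so a common convention is to drop the first category ('east' here).
encoded = one_hot[:, 1:]
print(categories)
print(encoded)
```

Each remaining column then gets its own coefficient, which can be read as the price difference relative to the dropped category.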

In cases with relatively complete data and a phenomenon that is expected to be linear, this algorithm should be considered the starting point of any analysis. Unfortunately, many real-world systems do not meet the assumptions required for the use of linear regression. However, because it is so simple to implement, it is often used as the benchmark for comparing the quality of other ML algorithm predictions.