In Part 1, I identified my goal to estimate when Investor Juan will earn 50,000 monthly pageviews given 20 months of historical data. I also discussed the basic premise of linear regression forecasting and how it's only applicable to data which displays a linear trend. In this post, I will proceed by discussing how to define the trend line and how to use this to make forecasts.
Step 2: Estimate the "best fit line"
In linear regression, we identify the general linear trend of the data by fitting a straight line to the data points--the best fit line.
In high school, I remember doing this with our physics experiment results mechanically--as in taking a straight rule to find a straight line that touches most of the plotted points. The problem with this technique is that different people will get different lines given the same data since the qualifier "best fit" becomes a matter of personal judgment. In statistics, there is a way to find the best fit line objectively and mathematically: the method of ordinary least squares finds a line such that the sum of the squared distances of the data points to the line is minimum. The result is a unique straight line defined by two parameters a and b, such that
y = a + bx
where y is the dependent variable (e.g., pageviews), x is the independent variable (e.g., month), and a is the y-intercept and b the slope of the best fit line. Computing for a and b can be quite troublesome when done by calculator (to which my past statistics students will attest); fortunately, spreadsheet programs like Excel and Google Spreadsheets have features and functions that can easily solve for these parameters.
For Google Spreadsheets, we can use the INTERCEPT and SLOPE functions to get a and b, respectively. For example, using the SLOPE function, we enter the data this way (I find it easier to use reference values for "month" instead of the actual months):
Following the same procedure for the INTERCEPT function, we get the following parameters for our best fit line
a = 929, b = 792 (rounded to the nearest whole number)
resulting in the best fit line or linear regression equation
y = 929 + 792x
What do these numbers mean? a = 929, the y-intercept, is the interpolated number of pageviews when x = 0, which is on August 2010. Meanwhile b = 792, the slope, is just the predicted number of additional page views per month. Now that we know where we started (y-intercept) and by how much we change per month (slope), forecasting future pageviews is easy peasy.
Step 3: Use the linear regression equation to make forecasts
Let's say we want to forecast the monthly page views in April 2013, or 12 months from now, we just enter x = 32 (which is 20 + 12) into the equation and compute for y. We get y = 26,273 pageviews in April 2013.
For the objective that we defined at the beginning of this exercise, we do things a bit differently. Instead of the number of pageviews y, we want to know x when y is 50,000. In equation form, this is
50,000 = 929 + 792x
Solving for x, we get x = 61.96 ~ 62, or 42 months or 3 and a half years from now. That's October 2015; let's see how good our forecast is then. ;)
There you have it, a brief introduction to forecasting with linear regression. I'm sure you can find better uses for it than projecting pageviews of an obscure personal finance blog, hahaha.