Here’s my challenge, created by me, for me. I want to explain where the line of best fit comes from: not just the algorithm for finding it, but conceptually how it is found. My intended audience: students in Algebra II. Where does the derivation come from? Multivariable calculus.
So here we go.
Let’s say we have a set of 5 points: (1, 1), (3, 5), (4, 5), (6, 8), (8, 8)
We want a “line of best fit.” It’s tricky because we don’t know exactly what that means quite yet, but we do know that we want a line that will pass near a lot of the points. We want the line to “model” the points. So the line and the points should be close together. In other words, even without knowing what exactly a “line of best fit” is, we can say pretty certainly that it is not:
Instead, we know it probably looks like one of the following lines:
LINE A: y=1.1x
or
Of course it doesn’t have to be either of those lines… but we can be pretty sure it will look similar to one of them. You should notice the lines are slightly different. The y-intercepts are different and the slopes are different. But both actually lie fairly close to the points. So is Line A or Line B a better model for the data? And an even more important question: might there be another line that is an even better model for the data?
In other words, our key question is now:
How are we going to choose one line, out of all the possible lines we could draw, that seems like it fits the data well? (One line to rule them all…)
Another way to think of this question: is there a way to measure the “closeness” of the data to the line, so we can decide if Line A or Line B is a better fit for the data? And more importantly, is there an even better line (besides Line A or Line B) that fits the data?
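One concrete (if premature) way to put a number on “closeness” is to add up the squared vertical gaps between each point and a candidate line. The little Python sketch below does this for Line A only, since Line B’s equation isn’t written out above; the function name closeness is made up for illustration, and squaring is just one possible choice that hasn’t been justified yet.

```python
# A rough sketch of one candidate measure of "closeness": the sum of the
# squared vertical gaps between each data point and a line y = m*x + b.
# (Squaring is just one choice here; nothing has been settled yet.)

points = [(1, 1), (3, 5), (4, 5), (6, 8), (8, 8)]

def closeness(m, b, pts):
    """Sum of squared vertical distances from the points to y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in pts)

print(closeness(1.1, 0, points))  # score for Line A: y = 1.1x (about 5.86)
```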
(Part II to come…)
UPDATE: Part II here
Sam, I’m excited to see this. Years ago, I taught a College Algebra with Applications course at a small college. The previous teacher (who had retired, which was why I was hired) had a course pack for the course. His explanation of this? Press this button, and then this one…
I studied regression, and wanted to understand the formula. There was one step I never felt was justified, and I am eager to see what you have to say about it. I wrote a paper back then (’95, long before blogging), just for myself. It’s not as clear an exposition as yours, but if you’d like it, I’d be happy to email it to you.
Hi Sue,
I’ll take the paper, but I might email you for it AFTER I blog about it. I don’t know why, but I’m just simply excited by the chance of seeing how “easy” I can make it on my own. (Without any outside resources.)
I was planning on skipping over some of the math, and focusing on the concepts. HOW the equation was found, rather than actually finding the equation. So don’t get your hopes up too much :)
And also, I’m glad you’re back blogging! Boo to being ill, hooray for feeling better (for now, anyway).
SAM
[In the past, these ‘attacks’ have been about a year apart, so I expect to be better for quite a while. (I would dearly love to know what’s going on, of course.)]
Thanks for letting me know you read my post. That makes me feel connected.
Whatever you have to say about regression will intrigue me, I’m sure.
I’m especially intrigued by how you’ll explain regression. It always used to bother me that if you minimize the sum of the squares of the y errors, you get a different line of best fit than if you minimize the sum of the squares of the x errors (but the same r-squared value). So the idea that one is the “best fit” seems paradoxical.
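For anyone who wants to see that with numbers, here is a small sketch (added by the editor, not from the original thread) using the five points from the post and the standard textbook formulas: regressing y on x and regressing x on y produce different lines, while r-squared comes out the same.

```python
# Regress y on x (minimize squared vertical errors) and x on y (minimize
# squared horizontal errors) for the five points in the post.
# The two slopes differ, but r^2 does not.

pts = [(1, 1), (3, 5), (4, 5), (6, 8), (8, 8)]
n = len(pts)
xbar = sum(x for x, _ in pts) / n
ybar = sum(y for _, y in pts) / n

Sxx = sum((x - xbar) ** 2 for x, _ in pts)
Syy = sum((y - ybar) ** 2 for _, y in pts)
Sxy = sum((x - xbar) * (y - ybar) for x, y in pts)

# y on x: the usual least-squares line
m_yx = Sxy / Sxx
b_yx = ybar - m_yx * xbar

# x on y: the line x = m*y + b, rewritten as a line in the xy-plane
# (both regression lines pass through the balance point (xbar, ybar))
m_xy = Syy / Sxy
b_xy = ybar - m_xy * xbar

r_squared = Sxy ** 2 / (Sxx * Syy)

print("y on x:", m_yx, b_yx)   # slope ~1, intercept ~1 for these points
print("x on y:", m_xy, b_xy)   # a noticeably steeper line
print("r^2:", r_squared)       # identical for both regressions
```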
Thanks, Noah. I hadn’t noticed that! I expect that’s based on the distance (error) being measured in the x or y direction, instead of as a perpendicular to the line (which is the usual way of measuring distance between a line and a point).
If one of the coordinates is naturally the input, and the other is naturally the output, maybe it’s ok (that the two lines would be different). The distance we measure is the difference between predicted output and recorded output, for a given input value.
What I didn’t like was squaring the errors, when I thought absolute value should be used. (But of course, calculus won’t do its thing on an absolute value.)
I do a derivation of least squares regression with my Alg II kids as well. I’m curious to see how you finish this discussion, and I’ll add my explanation afterward if it’s at all different. I did want to comment on the absolute value issue. The derivation of the least squares regression line actually does not require calculus, and correspondingly, there is a non-calculus reason for not using absolute value: if you do use absolute value in the calculation, the “line of best fit” is not unique. In other words, you’ll get a family of lines, all of which fit equally well.
Aran, I can’t see that. Can you give me an example (with more than 2 points)?
(Sorry. 2 points makes a line anyway.) When I draw 3 points, it sure looks like only one line would have a smallest absolute value error.
If you have a solution with no points on the line and an equal number of points above and below, then there is a region (between the first point above and the first point below) where you can slide the line up and down without changing the sum of the absolute values of the errors.
Sue, I believe I misspoke about there not being a unique solution; however, gasstationwithoutpumps’ explanation is exactly what I was getting at. Take the points (1,1), (2,3), (3,3), and (4,5). The lines y=x, y=x+1, and y=x+0.5 all give the same total absolute residual.*
That being said, it is still possible to minimize the total absolute residual. In fact, doing this is called (by Wikipedia at least) Least Absolute Deviation. There are many ways of calculating a “line of best fit,” depending on what you mean by “best” (again, see Wikipedia under “Linear Regression”). When I teach this, I always have a discussion about what we mean by “best,” and how the different interpretations lead to different methods of calculation.

On that subject, I push them toward squaring rather than absolute value by an analogy. You can ask the same question about standard deviation: why do we square rather than take absolute values? Standard deviation is meant to be a measure, in some sense, of the average distance from the mean. Consider the values {2, 2, 6, 6}. The mean is 4, so each point is 2 away from the mean, and the average distance from the mean is 2. That is intuitive, it is clear, and it uses absolute value.

There is a problem, however. How far are the values, on average, from 2? Well, two of them are 4 away and two of them are 0 away, so the average distance from 2 is ALSO 2. The same is true for 6, and for any number in between 2 and 6. If we have formulated our idea of standard deviation correctly, then the mean should be the “closest” number to all of the data values; however, if we use absolute value it is not. A quick comparison where you square the distances instead shows that the mean is the unique number with the smallest average squared distance from the data.
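A quick numeric version of that standard-deviation analogy (added here, not part of the original comment): for {2, 2, 6, 6}, every candidate center between 2 and 6 ties on average absolute distance, but only the mean wins once the distances are squared.

```python
# For the data {2, 2, 6, 6}: the average absolute distance from a candidate
# center c is 2 for every c between 2 and 6, but the average *squared*
# distance is smallest only at the mean, c = 4.

data = [2, 2, 6, 6]

def avg_abs_distance(c):
    return sum(abs(x - c) for x in data) / len(data)

def avg_squared_distance(c):
    return sum((x - c) ** 2 for x in data) / len(data)

for c in [2, 3, 4, 5, 6]:
    print(c, avg_abs_distance(c), avg_squared_distance(c))
# avg abs distance: 2.0 for every c; avg squared distance: 8, 5, 4, 5, 8
```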
*If you calculate the Least Absolute Deviation line for these data points, you get y=(4/3)x-1/3, and if you calculate the Least Squares regression line, you get y=(6/5)x.
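Here is a short check (again added by the editor, not Aran’s) of both the equal-residual example and the footnoted lines, on the points (1,1), (2,3), (3,3), (4,5).

```python
# Total absolute residual for the three lines in the comment above, plus
# the two lines named in the footnote.

pts = [(1, 1), (2, 3), (3, 3), (4, 5)]

def total_abs_residual(m, b):
    return sum(abs(y - (m * x + b)) for x, y in pts)

for b in (0, 0.5, 1):
    print("y = x +", b, "->", total_abs_residual(1, b))           # 2.0 in all three cases

print("LAD line y=(4/3)x-1/3 ->", total_abs_residual(4/3, -1/3))  # about 1.33, even smaller
print("Least squares y=(6/5)x ->", total_abs_residual(6/5, 0))    # about 1.6
```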
My favorite derivation and explanation for the best fit line comes not from calculus, but from higher-dimensional linear algebra. I can never remember the exact details without pencil and paper (and I really need to go prep a class right now), but I like it because it explains the sum of squares thing. My vague memory is that you need a dimension for each point, so the more points you are fitting, the larger the matrix. Anyway, the line of best fit corresponds to a projection of a point onto a hyperplane defined by the points you are trying to fit to, so the sum of squares is the Euclidean distance formula, and you’re just finding the closest point on the hyperplane. I don’t think my Algebra-level students really “get” the explanation, but it’s just so elegant that I can’t resist.
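For readers who want to poke at the projection picture, here is a hedged sketch (the editor’s, not the commenter’s write-up) of the standard linear-algebra version: stack the y-values into a vector in R^5, build a 5×2 matrix whose columns are all ones and the x-values, and project that vector onto the plane spanned by the columns. The least-squares routine (numpy’s lstsq below) carries this out, and “sum of squares” is just the squared Euclidean distance from the vector to its projection.

```python
# Least squares as a projection: find the point in the column space of A
# closest (in Euclidean distance) to the vector of observed y-values.

import numpy as np

xs = np.array([1, 3, 4, 6, 8], dtype=float)
ys = np.array([1, 5, 5, 8, 8], dtype=float)

A = np.column_stack([np.ones_like(xs), xs])       # columns: all ones, the x-values
coeffs, *_ = np.linalg.lstsq(A, ys, rcond=None)   # (intercept, slope) of the best-fit line
proj = A @ coeffs                                 # projection of ys onto A's column space

print("intercept, slope:", coeffs)                # roughly (1, 1) for these points
print("distance from ys to the plane:", np.linalg.norm(ys - proj))  # sqrt of the SSE
```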
Sam, this derivation is in Chapter 1 of our Algebra 2 book. It doesn’t need multivariate calculus, just a nice exploration of linear equations and quadratics.
I will explain more later if you like, but if you have our A2 book, it’s the goal of Investigation 1B.
I’m very glad you are teaching this and not “black-boxing” it!
– Bowen
Hi Bowen,
I do have your book — so I hopefully will look at it when I remember (and am at school, with the book) this week.
I am still going to try to (when I get the time) finish this set of posts up with the multivariable calculus-based part to it, because there is something really elegant about it which I love. Plus, I like the challenge it has presented me — how to clearly explain it.
But I had to abandon it for a short while because the work has been piling up…
I feel a little bad because I DO black box it for my Algebra 2 kids. I do explain it in general terms, but we don’t really spend much time uncovering the black box of linear regressions. We do talk a lot about predictions and confidence – but not too much about what the calculator is doing to find the “best fit line.” For that, I do give a little general explanation, but nothing more specific.
Sam
One intermediate way that is less of a black box is to let students know that the line of best fit always goes through the “balance point” of data (mean(x), mean(y)). This feels at least a little intuitive, especially when deviations in one dimension are calculated as (x – mean(x)).
Then, use the point-slope form of a line
y – mean(y) = m * (x – mean(x))
You can then have kids estimate the slope of the best-fit line, then let the calculator black-box the exact value.
(This is our Algebra 1 approach to line of best fit, A1 Lesson 4.15.)
OR you can let the sum of squared errors be a function of m, the slope, and it turns out to be a quadratic in m. Quadratics have vertices, and whammo.
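Here is a small numeric sketch (added by the editor, not from the book) of that quadratic-in-m idea, using the five points from the post and assuming, as above, that the line goes through the balance point (mean(x), mean(y)): the sum of squared errors collapses to a quadratic in the slope m, and its vertex hands you the best-fit slope.

```python
# Write the line in point-slope form through the balance point:
#   y - ybar = m * (x - xbar)
# Then the sum of squared errors is a quadratic in m:
#   SSE(m) = Syy - 2*Sxy*m + Sxx*m^2
# and the vertex of that parabola gives the least-squares slope.

pts = [(1, 1), (3, 5), (4, 5), (6, 8), (8, 8)]
n = len(pts)
xbar = sum(x for x, _ in pts) / n
ybar = sum(y for _, y in pts) / n

Sxx = sum((x - xbar) ** 2 for x, _ in pts)
Syy = sum((y - ybar) ** 2 for _, y in pts)
Sxy = sum((x - xbar) * (y - ybar) for x, y in pts)

a, b, c = Sxx, -2 * Sxy, Syy     # SSE(m) = a*m^2 + b*m + c
m_best = -b / (2 * a)            # vertex of the parabola
b_best = ybar - m_best * xbar    # intercept, from point-slope form

print("best-fit slope:", m_best)       # about 1 for these points
print("best-fit intercept:", b_best)   # about 1 for these points
```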
Hope this helps and I look forward to seeing the multivariable calculus explanation!