
Ch2. Linear Regression

Linear Regression
Linear regression: y = ax+b
Suppose you suspect there is a relationship between two variables, x and y. The simplest relationship (and the one you can usually assume as a starting point) is that of a straight line, or y=ax+b.
The process of determining a and b for a set of x and y data is called linear regression.
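The "best fit" line is the line through the data that minimizes the distance between the actual, observed values of y and the values of y predicted by the equation y=ax+b. In the usual least-squares sense, "distance" means the sum of the squared differences between the observed and predicted values of y.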
You can use the fit function to fit a straight line to data. fit will provide you with coefficients a and b of the best fit line y=ax+b.
1. Define the data vectors x and y. Note that x and y must be column vectors.
2. Call fit, for example xyfit = fit(x,y,'poly1'). The first two inputs to fit are the data vectors x and y. The third input specifies the type of model. Use 'poly1' to fit a straight line.
The output xyfit is a variable with information on the fitted model equation and coefficients. (Note: fit uses the names p1 and p2 for the coefficients a and b, respectively.)
3. Extract the value of the coefficients using dot notation: xyfit.CoefficientName
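The three steps above can be sketched as follows. This is a minimal example, assuming the Curve Fitting Toolbox is installed; the numbers in x and y are made up for illustration and do not come from the course.

```matlab
% Illustrative data only (not from the course): points close to a straight line
x = [1; 2; 3; 4; 5];              % column vector, as fit requires
y = [3.1; 4.9; 7.2; 8.8; 11.1];   % column vector of the same length

% Fit a straight line; 'poly1' selects the model p1*x + p2
xyfit = fit(x, y, 'poly1')        % no semicolon, so the fitted model is displayed

% Extract the coefficients with dot notation
a = xyfit.p1;   % slope (the a in y = ax + b)
b = xyfit.p2;   % intercept (the b in y = ax + b)
```

If x or y are row vectors, transposing them (x') or using x(:) turns them into the column vectors that fit expects.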
Task1 - Fit a Line to Data
Fitting Data from a Pump
Consider a variable-speed pump. A knob on the pump can be turned to change the flow rate, but there are no labels on the knob. You propose to measure the flow rate for several knob settings and then find the equation of a straight line between knob setting and flow rate. The data you collect is summarized in the table below.
In this activity, you will use the fit function to determine the equation of a straight line that best represents the relationship between the knob setting and the flow rate.
The knob setting is the independent variable, and the flow rate is the dependent variable.
Recall that you can perform linear regression with the fit function by using 'poly1' as the third input:
fitResult = fit(x,y,'poly1')
1. Perform linear regression on the variables knobSetting and flowRate using the fit function. Name the output variable flowfit.
Leave off the semicolon to display the output.
2. The slope is stored as p1 in flowfit.
You can extract the coefficients of the model using dot notation: flowfit.CoefficientName
Extract the slope of the linear model to a variable named slope.
3. The intercept is stored as p2 in flowfit.
Extract the intercept of the linear model to a variable named intercept.
4. You now know that the equation of the best fit line is y=2.3x−0.1.
You can plot this line, along with your data, using the plot function with three inputs:
plot(modelName,xData,yData)
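A possible walk-through of this task is sketched below. The measured table is not reproduced in these notes, so the data here is a synthetic stand-in generated directly from the line y=2.3x−0.1; the knob positions 1 to 6 are an assumption, not the real measurements.

```matlab
% Synthetic stand-in data generated from y = 2.3x - 0.1.
% These are NOT the measured values from the course table.
knobSetting = (1:6)';                   % hypothetical knob positions
flowRate    = 2.3*knobSetting - 0.1;    % noise-free flow rates on the stated line

% Step 1: linear regression, displayed because the semicolon is left off
flowfit = fit(knobSetting, flowRate, 'poly1')

% Steps 2 and 3: extract the coefficients with dot notation
slope     = flowfit.p1;   % coefficient of x
intercept = flowfit.p2;   % constant term

% Step 4: plot the fitted line together with the data points
plot(flowfit, knobSetting, flowRate)
xlabel('Knob setting')
ylabel('Flow rate')
```

With real, noisy measurements the points would scatter around the line instead of lying exactly on it.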
Task2 - Fit a Polynomial to Data
The variables age and pressure are column vectors containing the age and pulse pressure of 330 people.
You can fit a quadratic function with the fit function by using 'poly2' as the third input:
fitResult = fit(x,y,'poly2')
Perform linear regression on the variables age and pressure using the fit function. Assign the output to the variable ppfit.
Leave off the semicolon to display the output.
You have three regression parameters for this quadratic model, p1, p2, and p3.
You can extract all three parameters of the model using coeffvalues(fitResult).
1. Extract the p1, p2, and p3 values from the model to a variable named allCoeff.
2. Plot the model ppfit and the original data age and pressure using the plot function.
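The quadratic fit and coefficient extraction might look like the sketch below. The seven data points are hypothetical placeholders; the 330-person data set from the course is not reproduced here.

```matlab
% Hypothetical data for illustration only (not the course's 330-person data set)
age      = [20; 30; 40; 50; 60; 70; 80];
pressure = [42; 40; 41; 45; 52; 61; 73];

% 'poly2' selects the quadratic model p1*x^2 + p2*x + p3
ppfit = fit(age, pressure, 'poly2')   % displayed because the semicolon is left off

% coeffvalues returns all coefficients at once as a row vector [p1 p2 p3]
allCoeff = coeffvalues(ppfit);

% Plot the fitted curve together with the data
plot(ppfit, age, pressure)
xlabel('Age')
ylabel('Pulse pressure')
```

coeffvalues saves typing when a model has several coefficients; dot notation (ppfit.p1 and so on) still works for individual values.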
Evaluating Goodness of Fit
How Well Does the Curve Fit the Data?
There are many ways of evaluating the quality of a fit. The fit function returns information on the quality of the fit it finds.
Confidence Bounds
The fit object displays the 95% confidence bounds for each parameter. Loosely, these bounds give a range that is likely to contain the "true" value of the parameter. More precisely, if data were taken and fitted repeatedly, about 95% of the intervals computed this way would contain the true value. In general, the closer the 95% confidence bounds are to the value of the parameter, the more precisely that parameter is determined and the better the fit is.
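Besides reading the bounds from the displayed fit, you can get them as numbers with the confint function from the Curve Fitting Toolbox. The sketch below uses made-up data and is only meant to show the shape of the result.

```matlab
% Illustrative data only: roughly linear with a little scatter
x = (1:10)';
y = [2.9; 5.2; 6.8; 9.1; 11.2; 12.7; 15.1; 17.0; 18.8; 21.3];

f = fit(x, y, 'poly1');

% confint returns a 2-by-2 matrix for this model:
% row 1 holds the lower bounds, row 2 the upper bounds; columns are p1 and p2
ci = confint(f);            % default level is 95%
ci99 = confint(f, 0.99);    % a higher confidence level gives wider bounds

% Width of each 95% interval; narrower means a more precisely determined parameter
width = ci(2,:) - ci(1,:);
```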
Residuals
Residuals are the differences between the actual data and the fit. You can plot the residuals using the plot function. For a good fit, the residuals should be scattered randomly around the zero line, ideally with an approximately normal distribution. If there are clear outliers or a detectable pattern in the residuals, the fit can be improved.
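To make the idea of a "detectable pattern" concrete, the sketch below fits a straight line to deliberately curved, made-up data and computes the residuals by hand. Evaluating the fit object at x, as in f(x), returns the predicted values.

```matlab
% Deliberately curved data (illustrative): a parabola sampled at integer points
x = (0:9)';
y = x.^2;

% Fit the wrong model on purpose: a straight line
f = fit(x, y, 'poly1');

% Residuals are observed values minus predicted values
res = y - f(x);

% The same residuals, plotted by the toolbox
plot(f, x, y, 'residuals')
```

Here the residuals are positive at both ends and negative in the middle, a U-shaped pattern that suggests the model is missing a curvature term.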
Goodness of Fit Structure
Calling fit with a second output returns a structure of statistics that quantify the goodness of fit.
R^2 is a commonly used measure of goodness of fit. It is a value between 0 and 1, and the closer it is to 1, the better the fit. What is considered a "good" R^2 value varies from discipline to discipline.
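A short sketch of the second output is given below, using made-up data. The structure returned by fit has the fields sse, rsquare, dfe, adjrsquare, and rmse; the manual calculation is included only to show what rsquare measures.

```matlab
% Illustrative data only: roughly linear
x = (1:8)';
y = [1.2; 1.9; 3.2; 3.8; 5.1; 6.3; 6.8; 8.1];

% Ask for two outputs: the fit object and the goodness-of-fit structure
[f, gof] = fit(x, y, 'poly1');

r2 = gof.rsquare;   % R^2 as reported by fit

% The same quantity computed by hand: 1 - SSE/SST
sse = sum((y - f(x)).^2);        % sum of squared residuals
sst = sum((y - mean(y)).^2);     % total sum of squares about the mean
r2Manual = 1 - sse/sst;
```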
Task1 - Evaluate and Improve the Fit
The variables time and height are column vectors containing the data for the height of an ocean buoy as a function of time.
In this activity, you will fit the data with two different polynomials and evaluate the quality of each fit.
1. Use the fit function to fit a quadratic polynomial to the ocean buoy data. Ask for two outputs, and assign them to the variables fQuad and gQuad.
The R^2 value of the quadratic model is a field of the goodness-of-fit structure gQuad. You can access the fields of gQuad using dot notation: gQuad.PropertyName
2. Extract the R^2 value from gQuad and save it in a variable named r2Quad.
You may want to leave off the semicolon to see the result.
The R^2 value of the quadratic model is about 0.72. Ideally, the R^2 value for a "good" model would be above 0.98.
This indicates that a quadratic model is not a good fit for the data, and you can investigate how it's not a good fit by plotting the model and data and visually inspecting the fit.
Visualize the model fQuad with the data by using the plot function with three inputs.
The quadratic model doesn't appear to be an accurate representation of the data. The residuals, or the differences between the actual data and the fit, can help identify patterns or trends that can be used to improve the model.
You can plot the residuals of a model by using the plot function with four inputs.
plot(fitObject,xData,yData,'residuals')
Visualize the residuals of the model fQuad by using the plot function with four inputs.
There is a clear pattern in the residuals that resembles the original data. This is a strong indication that a quadratic polynomial is the wrong type of model for the data.
Fit a cubic polynomial to the ocean buoy data by using the fit function with the fit type 'poly3'. Name the two output variables fCubic and gCubic.
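The whole activity could be sketched as below. The buoy measurements are not included in these notes, so time and height here are a synthetic stand-in (one cycle of a sine wave); the R^2 values this produces will differ from the 0.72 quoted above.

```matlab
% Synthetic stand-in for the buoy data: one cycle of a sine wave.
% These are NOT the course's measurements.
time   = linspace(0, 10, 50)';
height = sin(2*pi*time/10);

% Quadratic fit with goodness-of-fit statistics
[fQuad, gQuad] = fit(time, height, 'poly2');
r2Quad = gQuad.rsquare

% Cubic fit with goodness-of-fit statistics
[fCubic, gCubic] = fit(time, height, 'poly3');
r2Cubic = gCubic.rsquare

% Inspect the residuals of the cubic model for any remaining pattern
plot(fCubic, time, height, 'residuals')
```

Because the quadratic model is a special case of the cubic one, the cubic R^2 can never be lower; whether the improvement is large enough is what the residual plot helps you judge.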