Forcing Regression Through a Given Point Using Any Familiar Computational Routine

by Edward B. Hands

TECHNICAL PAPER NO. 83-1
MARCH 1983

Approved for public release; distribution unlimited.

U.S. ARMY, CORPS OF ENGINEERS
COASTAL ENGINEERING RESEARCH CENTER
Kingman Building
Fort Belvoir, Va. 22060

Reprint or republication of any of this material shall give appropriate credit to the U.S. Army Coastal Engineering Research Center.

Limited free distribution within the United States of single copies of this publication has been made by this Center. Additional copies are available from:

National Technical Information Service
ATTN: Operations Division
5285 Port Royal Road
Springfield, Virginia 22161

The findings in this report are not to be construed as an official Department of the Army position unless so designated by other authorized documents.

UNCLASSIFIED
SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)

REPORT DOCUMENTATION PAGE

4. TITLE (and Subtitle): FORCING REGRESSION THROUGH A GIVEN POINT USING ANY FAMILIAR COMPUTATIONAL ROUTINE
5. TYPE OF REPORT & PERIOD COVERED: Technical Paper
7. AUTHOR(s): Edward B. Hands
9. PERFORMING ORGANIZATION NAME AND ADDRESS: Department of the Army, Coastal Engineering Research Center (CEREN-GE), Kingman Building, Fort Belvoir, VA 22060
10. PROGRAM ELEMENT, PROJECT, TASK AREA & WORK UNIT NUMBERS: D31677
11. CONTROLLING OFFICE NAME AND ADDRESS: Department of the Army, Coastal Engineering Research Center, Kingman Building, Fort Belvoir, VA 22060
12. REPORT DATE: March 1983
13. NUMBER OF PAGES: 20
15. SECURITY CLASS. (of this report): UNCLASSIFIED
16. DISTRIBUTION STATEMENT (of this Report): Approved for public release; distribution unlimited.
19. KEY WORDS: Coastal engineering; Data analysis; Prediction equations; Regression
20. ABSTRACT: This report describes a simple method for obtaining the prediction equation best fit to all data points (in the least squares sense) while forcing an exact fit at any known point. The decision to constrain the solution at a point should be justified on theoretical grounds without appeal to data. Examples are given. When required, any familiar regression program can be forced to select the best line through a given point by simply adjusting and extending the data entry. All necessary changes to the program results (test statistics and estimates of regression parameters) can be accomplished without modifying the computer program.

PREFACE

This report draws attention to the frequent, but often neglected, need to force a regression line through a known point while obtaining the best possible fit to all experimental data points. A simple method is described for solving this problem without modifying customary computational routines. This method can be applied to many problems, but is especially useful when calibrating empirical prediction formulas to fit site-specific coastal conditions or when choosing from among several theoretical prediction models.

The work was carried out under the U.S.
Army Coastal Engineering Research Center's (CERC) Shore Response to Offshore Dredging work unit, Shore Protection and Restoration Program, Coastal Engineering Area of Civil Works Research and Development.

The report was prepared by Edward B. Hands, Geologist, under the general supervision of Dr. C.H. Everts, Chief, Engineering Geology Branch, and Mr. N. Parker, Chief, Engineering Development Division. The author acknowledges the helpful suggestions received from C.B. Allen, C.H. Everts, R.J. Hallermeier, R.D. Hobson, and P. Vitale.

Technical Director of CERC was Dr. Robert W. Whalin, P.E.

Comments on this publication are invited.

Approved for publication in accordance with Public Law 166, 79th Congress, approved 31 July 1945, as supplemented by Public Law 172, 88th Congress, approved 7 November 1963.

TED E. BISHOP
Colonel, Corps of Engineers
Commander and Director

CONTENTS

CONVERSION FACTORS, U.S. CUSTOMARY TO METRIC (SI)
SYMBOLS AND DEFINITIONS
I INTRODUCTION TO REGRESSION
II A PROBLEM WITH THE CUSTOMARY APPROACH
III SOLUTION TO THE PROBLEM
    1. Regression Through the Origin
    2. Regression Through Any Arbitrary Point (a, b)
IV SELECTING BETWEEN MODELS I AND II
LITERATURE CITED

TABLES

1 Adjustment of standard elements produced by programs using extended data
2 Field calibration data
3 Extended data set No. 1
4 Extended data set No. 2

FIGURES

1 Application of Model I produces an intercept ($\alpha$), which may be a useful estimate of a component of longshore flow which is independent of wave conditions and presumably pervades the entire data set
2 Application of Model I identified a threshold value below which waves cause no damage
3 Application of Model II forces a zero-intercept solution
4 Model II estimates an increase in Y per unit increase in X that is nearly twice that predicted using Model I
5 Real test data for example problem 1 and fitted equations
6 Real test data for example problem 2 and fitted equations

CONVERSION FACTORS, U.S. CUSTOMARY TO METRIC (SI) UNITS OF MEASUREMENT

U.S. customary units of measurement used in this report can be converted to metric (SI) units as follows:

Multiply                 by                To obtain
inches                   25.4              millimeters
                         2.54              centimeters
square inches            6.452             square centimeters
cubic inches             16.39             cubic centimeters
feet                     30.48             centimeters
                         0.3048            meters
square feet              0.0929            square meters
cubic feet               0.0283            cubic meters
yards                    0.9144            meters
square yards             0.836             square meters
cubic yards              0.7646            cubic meters
miles                    1.6093            kilometers
square miles             259.0             hectares
knots                    1.852             kilometers per hour
acres                    0.4047            hectares
foot-pounds              1.3558            newton meters
millibars                1.0197 x 10^-3    kilograms per square centimeter
ounces                   28.35             grams
pounds                   453.6             grams
                         0.4536            kilograms
ton, long                1.0160            metric tons
ton, short               0.9072            metric tons
degrees (angle)          0.01745           radians
Fahrenheit degrees       5/9               Celsius degrees or Kelvins (see note 1)

Note 1: To obtain Celsius (C) temperature readings from Fahrenheit (F) readings, use formula: C = (5/9)(F - 32). To obtain Kelvin (K) readings, use formula: K = (5/9)(F - 32) + 273.15.
SYMBOLS AND DEFINITIONS

F    The F-value may be produced by a multiple regression program and is analogous to the t-value in simple regression (one independent variable). The F-value indicates the "significance" of $r^2$ and is useful in selecting the most important independent variables.

$$F = \frac{\sum(\hat{y} - \bar{y})^2\,(n - p - 1)}{\sum(y - \hat{y})^2\,p}$$

$H_b$    height of breaking waves

n    size of the sample

p    total number of independent variables. Caution: several observed carriers may end up combined into a single independent variable; e.g., $X = (gH_b)^{1/2}\sin 2\alpha_b$ has two distinct carriers ($H_b$ and $\alpha_b$) but is one independent variable (see example problem 1). The value of p will be one less than the number of constants to be estimated in Model I, and is equal to the number of constants in Model II.

r    sample correlation coefficient. The r-value produced by regression partially measures the closeness of fit between the linear predictor and data. Its square is called the coefficient of determination.

$$r^2 = \frac{\sum(\hat{y} - \bar{y})^2}{\sum(y - \bar{y})^2} = \frac{\left[\sum(x - \bar{x})(y - \bar{y})\right]^2}{\sum(y - \bar{y})^2\,\sum(x - \bar{x})^2} \qquad \text{(Model I)}$$

$$r^2 = \frac{\left(\sum xy\right)^2}{\left(\sum y^2\right)\left(\sum x^2\right)} \qquad \text{(Model II)}$$

$SS_x$    sum of squares of x; may be produced by the regression program and is useful for computing other values, e.g., $S_{\hat\beta}$.

$$SS_x = \sum(x - \bar{x})^2$$

$S_{\hat\beta}$    standard error of the estimated slope,

$$S_{\hat\beta} = \frac{S_{y \cdot x}}{\sqrt{SS_x}}$$

The larger $S_{\hat\beta}$, the less reliable is the estimate of slope.

$S^2_{y \cdot x}$    unbiased estimator of the variance of the random component $\varepsilon$, e.g.,

$$S^2_{y \cdot x} = \frac{\sum(y - \hat{y})^2}{n - p - 1} \qquad \text{in Model I}$$

The number of independent variables, p, is 1 in simple regression with Model I. The mean square deviation from regression corresponds to the simple variance used to measure the spread of values in a single data set. Its square root, $S_{y \cdot x}$, is sometimes called the standard error of the estimate.

$S_{\hat{y}}$    the value produced by regression to indicate uncertainty of the estimated y; the value $S_{\hat{y}}$ depends on the variances of all the estimated coefficients.
t    the t-value produced in simple regression to test whether the estimated regression coefficient is "significantly" different from zero.

V    longshore current velocity

X    independent variable in regression

x    observed values of X. A string of n values in simple regression; an n by p matrix in multiple regression

Y    dependent variable to be estimated

y    n observed values of Y

$\hat{y}$    estimated value of Y for given values of X

$\alpha$    Y-intercept in a regression model

$\alpha_b$    angle between the crest of the breaking wave and the shoreline

$\hat\beta$    estimated regression coefficients in multiple regression or the slope of the line in simple regression

$$\hat\beta = \frac{\sum(x - \bar{x})(y - \bar{y})}{\sum(x - \bar{x})^2} \qquad \text{(Model I)}$$

$$\hat\beta = \frac{\sum xy}{\sum x^2} \qquad \text{(Model II)}$$

$\varepsilon$    zero-mean random component of Y assumed by both regression models

FORCING REGRESSION THROUGH A GIVEN POINT USING ANY FAMILIAR COMPUTATIONAL ROUTINE

by
Edward B. Hands

I. INTRODUCTION TO REGRESSION

The engineer frequently needs to estimate some response or dependent variable Y (e.g., sand transport rate, change in shoreline position, or structural damage), when given the magnitude of other factors, or independent variables X (e.g., longshore wave energy flux, storm frequency, or elevation of storm surges). A common approach is to assume a linear model,

$$Y = \alpha + \beta X + \varepsilon \qquad \text{(Model I)}$$

then adopt the principle of least squares, and use sample data to estimate the unknown parameters, $\alpha$ and $\beta$.
Both $\beta$ and X can be considered as strings of numbers in the case of multiple regression with several independent variables; $\varepsilon$ indicates that the response is not being thought of as an exact linear function of X. The $\varepsilon$ represents random and unpredictable elements in Y; therefore, $\varepsilon$ does not appear in the prediction equation:

$$\hat{y} = \hat\alpha + \hat\beta x,$$

where $\hat\alpha$ and $\hat\beta$ are estimates of the corresponding components in the conceptual Model I. The assumption that $\varepsilon$ has an expected value of zero indicates that the "average" response is considered linear. If $\varepsilon$ varies widely, Model I, though conceptually correct, may have only limited predictive value. In such a case the estimated mean value of Y would frequently be thrown off by noise in the data. If $\varepsilon$ varies only slightly, good predictions will be possible provided good estimates of $\alpha$ and $\beta$ are available.

Adopting the principle of least squares means one is willing to define the best estimates of $\alpha$ and $\beta$ as those that minimize the sum of the squares of the deviations between the observed and predicted values (i.e., $y$ and $\hat{y}$). Customarily, no constraints are placed on the contenders for the best fit line. Of all possible lines in the XY plane, the prediction equation is chosen because it has the smallest sum of squares of deviations in y from the data points.

The y-intercept, $\alpha$, is the point where the best fit line intersects the Y-axis. The $\alpha$ may be of special interest, e.g., in the regression of current speed against longshore wave energy flux measured in a field test (Fig. 1). An intercept substantially above zero would suggest that during the test a component of the longshore current was driven by mechanisms other than waves (e.g., tides or winds). In this case, the nonzero intercept would not only be meaningful, but would also provide a good estimate of the velocity of any steady, nonwave-generated coastal current during the test.
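As a minimal sketch of the least squares principle for Model I, the following Python fragment computes $\hat\alpha$ and $\hat\beta$ from the closed-form expressions for simple regression and checks that nudging the intercept away from its estimate increases the sum of squared deviations. The function names `fit_model_one` and `residual_ss` and the sample numbers are illustrative assumptions, not data or code from this report.

```python
def fit_model_one(xs, ys):
    """Least squares estimates (alpha_hat, beta_hat) for y = alpha + beta * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)                         # sum of squares of x
    sp_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))   # cross products
    beta_hat = sp_xy / ss_x
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


def residual_ss(xs, ys, alpha, beta):
    """Sum of squared deviations between observed y and the line alpha + beta * x."""
    return sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical observations, made up for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.1]

alpha_hat, beta_hat = fit_model_one(xs, ys)
best = residual_ss(xs, ys, alpha_hat, beta_hat)
nudged = residual_ss(xs, ys, alpha_hat + 0.1, beta_hat)  # any other line fits worse
print(alpha_hat, beta_hat, best < nudged)
```

Any standard regression routine could stand in for `fit_model_one`; the closed form is written out here only so that the fragment is self-contained.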
An additional example of unconstrained regression would be where greater and greater structural damage occurs as the wave forces exceed an undetermined threshold value. Again Model I applies and produces the correct regression coefficient ($\hat\beta$). In the process it produces a meaningless response intercept well below zero (Fig. 2). In contrast with the previous example, the interest here is strictly in the prediction of future damage for given wave forces, not in the value of the intercept itself. The resulting linear relationship applies only to values of the independent variable above the threshold of wave effect.

[Figure 1: plot of flow rate against wave energy flux.]
Figure 1. Application of Model I produces an intercept ($\alpha$), which may be a useful estimate of a component of longshore flow which is independent of wave conditions and presumably pervades the entire data set.

[Figure 2: plot against wave forces, with the negative intercept marked.]
Figure 2. Application of Model I identified a threshold value below which waves cause no damage. A negative intercept is produced, but is of no interest in this particular problem. Although the negative intercept ($\alpha$) is in itself meaningless, Model I is correct because there is no basis for constraining $\alpha$.

II. A PROBLEM WITH THE CUSTOMARY APPROACH

There are many cases where the logic of the application dictates the response at a particular value of X. For example, if the response is some change that is regressed against time, then the response must be 0 when X = 0 (Fig. 3). If there is no elapsed time, there can be no change. If the linear assumption is valid, the appropriate conceptual model is

$$Y = \beta X + \varepsilon \qquad \text{(Model II)}$$

and the customary predictive equation (based on Model I) is inappropriate and may give poor estimates of $\beta$ (see Fig. 4). Yet the vast majority of regression programs (e.g., SPSS, IMSL, IBM's 5110 package, and TI-59) do not allow specification of a zero intercept or any constraint through a known point. Statistical texts usually do not cover this topic either.
However, formulas for the zero-intercept case are given by Brownlee (1965) and Krumbein (1965).

Figure 3. Application of Model II forces a zero-intercept solution.

[Figure 4: plot of Y against X showing both fitted lines; Model II: $\hat\beta = 0.63$, Model I: $\hat\beta = 0.34$.]
Figure 4. Model II estimates an increase in Y per unit increase in X that is nearly twice that predicted using Model I. The physical relationship between X and Y dictates which model should be adopted. If Model II is appropriate the solution can be obtained using a simple artifice described in this report to modify results of standard computer programs intended for Model I.

The value of Y may be known for a single value of X (not necessarily 0). The best prediction should then be sought from among the limited subset of lines through this point. All these lines will have a larger sum of squares, $\sum(y - \hat{y})^2$, than the line that would have been selected by Model I. A simple procedure is described herein for picking from among these restricted candidates the one with the smallest $\sum(y - \hat{y})^2$. Thus, regressing through the origin is but one specific case that can be solved by a general model forcing regression at an arbitrary point.

III. SOLUTION TO THE PROBLEM

This report describes a method for getting the best fit to all data points (in the sense of least squares) while forcing an exact fit at any known point. A simple procedure for forcing regression through the origin was described by Hawkins (1980), who indicated the procedure was not well known. The author of this report knows of no references to the general case of an exact fit to an arbitrary point. However, if a fit can be constrained through the origin, then a simple transform of variables can force the line through any given point. The details of the through-the-origin solution will be explained first.

1. Regression Through the Origin.
For each set of measured dependent and independent variables observed, $(y_i, x_i)$, also enter, or program, a mirror-image set $(-y_i, -x_i)$. Thus, the computer is given an extended data set consisting of 2n data points, only n of which were observed. By definition of this extended data set, the dependent and all the independent variables each individually sum to zero, forcing a zero intercept:

$$\hat\alpha = \bar{y} - \hat\beta\bar{x} \qquad \text{by the principle of least squares}$$

$$\hat\alpha = 0 \qquad \text{because } \sum x = \sum y = 0 \text{ and thus } \bar{x} = \bar{y} = 0 \text{ on the extended data set}$$

Thus a zero-intercept solution is obtained. Is it still the least squares solution for the observed data set? The principle of least squares by definition minimizes the sum of the squares of the deviations of the observed from the predicted values. Because each squared deviation from the observed data set generates an identical squared deviation in the extended data set, the sum of these two positive sequences is minimized over the extended data set only if it is also minimized over both the observed and the mirror-image sets. Thus, the regression coefficient produced in this manner is not only the least squares solution for the artificially extended data set, but for the observed data set as well. By this artifice the proper estimate is obtained for the regression coefficient ($\hat\beta$) with the prediction forced through the origin.

2. Regression Through Any Arbitrary Point (a, b).

If the predicted response (Y) must be b when the independent variables (X) are a, then regress v on u over an extended data set, where $u = x - a$ and $v = y - b$. If (a, b) = (0, 0), then this collapses to the exact situation described above. If (a, b) $\neq$ (0, 0), the direct results, $\hat{v} = \hat\beta u$, should be unraveled to produce the y prediction:

$$\hat{y} - b = \hat\beta(x - a)$$

$$\hat{y} = (b - a\hat\beta) + \hat\beta x$$

NOTE: The proper estimate of the regression coefficient ($\hat\beta$) now forces the prediction through the point (a, b) as desired.
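The two steps above (shift the data so the known point sits at the origin, then append the mirror-image set) can be sketched in Python as follows. Here `standard_regression` is a hypothetical stand-in for whatever unmodified Model I routine is at hand, `fit_through_point` is an illustrative name, the point is taken as (x, y) = (a, b), and the sample numbers are made up.

```python
def standard_regression(xs, ys):
    """Stand-in for any unmodified Model I routine; returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    sp_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sp_xy / ss_x
    return y_bar - slope * x_bar, slope


def fit_through_point(xs, ys, a, b):
    """Best least squares line constrained to pass through (x, y) = (a, b)."""
    # Shift so that the known point becomes the origin.
    u = [x - a for x in xs]
    v = [y - b for y in ys]
    # Extended data set: the n shifted observations plus their mirror images.
    u_ext = u + [-ui for ui in u]
    v_ext = v + [-vi for vi in v]
    ext_intercept, beta_hat = standard_regression(u_ext, v_ext)
    # Unravel v = beta * u into y = (b - a * beta) + beta * x.
    return b - a * beta_hat, beta_hat, ext_intercept


# Hypothetical observations, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.2, 1.9, 3.2, 3.9]

origin_intercept, origin_slope, _ = fit_through_point(xs, ys, 0.0, 0.0)
point_intercept, point_slope, ext_intercept = fit_through_point(xs, ys, 1.0, 1.0)
print(origin_slope, point_intercept, point_slope)
```

The third returned value is the intercept the routine reports on the extended data; it should be zero apart from rounding, which gives a quick check that the mirror-image set was entered correctly.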
By using this procedure the correct regression coefficient is obtained by using any familiar computational routines. The second most frequently reported output from regression programs, the correlation coefficient (r), is also the correct, unbiased estimator for Model II. If additional information is provided by the regression program, then corrections may be necessary before adopting them for the real data set. The estimate of the residual variance will be correct for simple regression (one independent variable) and can be easily adjusted for multiple regression (see Table 1). Any sums of squares, cross products, and F-values produced by the program will be exactly twice the correct values. The standard error of the estimated slope will be too small by a factor of $\sqrt{2}$. Therefore, the t-value, for testing the zero slope hypothesis, will be too large by the same factor. Table 1 indicates the corrections for most of the elements produced by various regression programs. However, employing the described extended data procedures does not require consideration of any part of the output beyond that used in the standard unconstrained approach.

IV. SELECTING BETWEEN MODELS I AND II

If either the true or mean value (whichever interpretation fits the situation) of the dependent variable (Y) is unknown for all values of the independent variable in the range of concern, then the customary model (I) may be appropriate. However, if the postulated physical relationship between X and Y dictates constraint through any point (a, b) and the relationship is linear from the maximum observed x to x = a, then Model II should be used. To proceed with the customary evaluation of Model I would be equivalent to ignoring what is already known about the relationship between X and Y and, instead, relying totally on the limited information available in the sample data.
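Returning briefly to the output adjustments described at the end of Section III, the factors of 2 and $\sqrt{2}$ can be checked numerically for simple regression (p = 1). In this sketch `regression_report` is a hypothetical stand-in for a standard Model I program, the divisor n - p for the Model II residual variance is an assumption taken from the definition of p in the symbols list, and the data are made up.

```python
import math


def regression_report(xs, ys):
    """Stand-in for a standard Model I program (simple regression, p = 1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    sp_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sp_xy / ss_x
    intercept = y_bar - slope * x_bar
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    resid_var = sse / (n - 2)                 # divisor n - p - 1 with p = 1
    se_slope = math.sqrt(resid_var / ss_x)
    return {"slope": slope, "ss_x": ss_x, "resid_var": resid_var,
            "se_slope": se_slope, "t": slope / se_slope}


# Hypothetical observations, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.8, 2.3, 2.9, 4.2, 4.7]
n = len(xs)

# What the unmodified program reports when fed the extended (mirrored) data.
program = regression_report(xs + [-x for x in xs], ys + [-y for y in ys])

# Direct zero-intercept (Model II) values computed on the observed data.
ss_x = sum(x * x for x in xs)
slope = sum(x * y for x, y in zip(xs, ys)) / ss_x
resid_var = sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / (n - 1)  # n - p
se_slope = math.sqrt(resid_var / ss_x)
t_value = slope / se_slope

# Adjustments stated in the text for simple regression.
corrected = {
    "ss_x": program["ss_x"] / 2,                      # program value is doubled
    "resid_var": program["resid_var"],                # already correct when p = 1
    "se_slope": program["se_slope"] * math.sqrt(2),   # program value too small
    "t": program["t"] / math.sqrt(2),                 # program value too large
}
print(corrected, (ss_x, resid_var, se_slope, t_value))
```

For multiple regression the residual variance and the quantities derived from it would need the further adjustment referred to in Table 1, which this sketch does not attempt.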
The objective should be to obtain the best interpretation of the data, which does not override any more firmly established understanding of the situation.

Assuming Model II applies, it may still be useful to evaluate Model I to test in the conventional way (Draper and Smith, 1966) the significance of the estimated nonzero intercept. If this test fails to provide enough evidence to reject the strawman hypothesis ($H_0\colon \alpha = 0$), then this failure may be cited as additional evidence, strictly from the data, substantiating the choice of Model II to estimate $\beta$. The results of this formal test of hypothesis should not, however, be relied on as the criterion for selecting Model II. It should serve only as a source of auxiliary information clarifying the extent to which the sample data will support the model choice. The choice should be made on the basis of functional insight and understanding of the relationship between X and Y.

Table 1. Adjustment of standard elements produced by programs using extended data.
[Table body not recoverable from the scanned page. The rows cover the correlation coefficient, the standard error of $\hat\beta$, sums of squares, the t-test statistic, the F-value, and the estimated residual variance.]
Table notes: Hatted values are estimates of the true unhatted parameters (e.g., $\beta$ is an unknown parameter in a conceptual model and $\hat\beta$ is its estimate based on evaluation of some measurements); p is the number of independent variables that were measured. Additional information on interpreting the standard elements of regression is available in Draper and Smith (1966).

Comparing the correlation coefficients or r-values, produced using the real data and the extended data, is likewise not a valid method for choosing