SEA-SURFACE TEMPERATURE ESTIMATION

An Empirical Study of the Effect of Missing Data on Regression and Autocorrelation Analyses of Time Series of Data

C. J. Van Vliet

Research Report 1256
9 January 1965

U.S. NAVY ELECTRONICS LABORATORY, SAN DIEGO, CALIFORNIA 92152
A BUREAU OF SHIPS LABORATORY

DDC AVAILABILITY NOTICE
Qualified requesters may obtain copies of this report from DDC.

THE PROBLEM

Develop statistical, physical, and computer techniques for interpreting, summarizing, and extrapolating oceanic and meteorologic data for reliable estimation of the sound velocity distribution in the ocean. Specifically, determine the effect of random missing data and the effect of several long periods of missing data on the regression and autocorrelation analyses used in the estimation of sea-surface temperatures.

RESULTS

Analysis of records of sea-surface temperature, taken in the N. Atlantic and N. Pacific and up to 40 years in length, has shown that:

1. For many stations, the time series of sea-surface temperatures have missing temperatures scattered at random throughout the series. For each day there is a certain probability that the temperature will be missing. For such series, proper adjustments can be made in the computations of the regression and autocorrelation coefficients. The random deletion of data yields coefficients whose variances exceed those of a complete time series by the amount predicted by the reduction in sample size.

2. For certain stations, there is an excessive number of longer sequences of missing data. For the time series considered, the increase in the variances of the regression coefficients attributable to this nonrandom missing data is twice the increase attributable to random missing data.
Alternatively, for fractions of missing data greater than 0.2, time series with nonrandom missing data will have regression coefficient variances equal to those the same series with 0.15 more missing data would have, if all the missing data were random.

3. The effect of nonrandom, longer sequences of missing data on autocorrelation coefficients is less pronounced than for regression coefficients. The increase in the variances of autocorrelation coefficients attributable to nonrandom missing data is 1.2 times the increase attributable to random missing data. Alternatively, for fractions of missing data greater than 0.2, time series with nonrandom missing data will have autocorrelation coefficient variances equal to those the same series with 0.05 more missing data would have, if all the missing data were random.

RECOMMENDATIONS

1. Examine the nature of missing data in time series of sea-surface temperatures as to the randomness of occurrence in time. Then apply the appropriate results of this report in estimating the variances of regression coefficients and autocorrelation coefficients.

2. Perform an investigation similar to the present one on the effect of missing data for the regression problem but with several independent variables, namely, time, depth, and geographical location. The dependent variable will be water temperature.

3. Examine the effect of missing data on the short range prediction of sea-surface temperatures.

ADMINISTRATIVE INFORMATION

Work was performed under SR 004 03 01, Task 0586 (NEL L40551, formerly L4-5) by a member of the Computer Center. The report covers work from October 1963 to August 1964 and was approved for publication 5 January 1965.

The author wishes to express appreciation to E. R. Anderson of the NEL Oceanometrics Group for advice on the oceanographic aspects of the problem.

CONTENTS

INTRODUCTION
REGRESSION AND AUTOCORRELATION ANALYSES
MISSING DATA
MODEL FOR MISSING DATA
MONTE CARLO APPROACH
NONRANDOM MISSING DATA
THE AUTOCORRELATION COEFFICIENT
COMMENTS AND CONCLUSIONS
RECOMMENDATIONS

TABLES

1 Sea-surface temperature time series
2 Correlation coefficients between sample B's, 50 percent of data missing
3 Number of longer sequences deleted

ILLUSTRATIONS

1 Geographical location of oceanographic stations
2 Sea-surface temperatures as a function of time for selected years of data
3 Autocorrelation coefficients for residual time series
4 Histograms of the frequency of missing temperature sequences
5 Histograms of differences between regression coefficients of sample time series and complete time series. Scripps Pier
6 Histograms of differences between regression coefficients of sample time series and complete time series. Triple Island
7 Plot of regression coefficients $B_1$ versus $B_2$ for sample time series. Scripps Pier, 0.5 data missing
8 Fractional increase in variance of regression coefficients due to random missing data
9 Fractional increase in variance of regression coefficients due to nonrandom missing data
10 Increase in variance of autocorrelation coefficients due to random and nonrandom missing data

INTRODUCTION

The time-series analysis of sea-surface temperatures is of interest to oceanographers, meteorologists, and biologists. This study discusses methods used in an analysis of daily sea-surface temperatures. It is the first in a proposed series and is primarily concerned with the effect of missing data in certain regression and autocorrelation analyses. Only enough detail of these analyses will be included to ensure a degree of completeness to the present study.

Many time-series measurements have been made at various locations.
In the eastern Pacific Ocean such measurements have been made by Canadian and American oceanographers at coastal, island, and ship locations for time periods up to 45 years. These data have been the subject of numerous papers including, among others, those of Pickard and McLeod¹ and Roden.²,³ This study differs from those cited in that the original daily temperatures are used in the analysis without a preliminary smoothing by monthly averaging.

¹Pickard, G. L. and McLeod, D. C., "Seasonal Variation of Temperature and Salinity of Surface Waters of the British Columbia Coast," Journal of the Fisheries Research Board, Canada, v. 10, p. 125-145, 1953
²Roden, G. I., "Spectral Analysis of a Sea-Surface Temperature and Atmospheric Pressure Record off Southern California," Journal of Marine Research, v. 16, p. 90-95, 1958
³Roden, G. I., "On Nonseasonal Temperature and Salinity Variations Along the West Coast of the United States and Canada," California Cooperative Oceanic Fisheries Investigations. Reports, v. 8, p. 95-119, 1961

The purpose of time-series analysis is to isolate trends, oscillations, and random elements, which are defined as follows. Trend is a gradual increase or decrease in a system over a long period of time; an oscillation is a variation about the trend that occurs with more or less regularity over some time interval; and a random element is an unpredictable variation in the variable. If long term trend does not exist, then the primary need is the statistical fitting of some function to the time series to represent the oscillatory element.

Several sets of daily sea-surface temperatures have been examined. Measurements were made at the two open ocean and four island or coastal locations shown in figure 1. To indicate how individual temperature measurements vary throughout the year, one year of measurements for each location is presented in figure 2. These years of temperatures are taken from records that vary in length from 7 to 40 years.
Pertinent information about the stations yielding these records is included in table 1.

[Figure 1. Geographical location of oceanographic stations.]

[Figure 2. Sea-surface temperatures as a function of time for selected years of data. Temperature (°C) versus month, one panel per station.]

TABLE 1. SEA-SURFACE TEMPERATURE TIME SERIES

Location                                           Time Period          Number of       Number of Daily   Percent
                                                                        Possible Days   Observations      Observations
Weather Ship "PAPA", 50°N 145°W, North Pacific     1/56 - 1/63 (7 yr)    2557            1690             66
Weather Ship "ECHO", 35°N 48°W, North Atlantic     9/49 - 9/56 (7 yr)    2557            1533             60
St. James Island, 52°N 131°W, North Pacific        1/40 - 1/61 (21 yr)   7671            6180             81
Triple Island, 54°N 131°W, North Pacific           1/40 - 1/61 (21 yr)   7671            7244             95
Langara Island, 54°N 133°W, North Pacific          1/41 - 1/61 (20 yr)   7304            6402             88
Scripps Pier, 33°N 117°W, North Pacific            1/21 - 1/61 (40 yr)  14610           14352             98

REGRESSION AND AUTOCORRELATION ANALYSES

A visual observation of the data suggests statistically fitting some theoretical function which oscillates with a period of one year. Further justification is provided by the autocorrelation function

$$ r_k = \mathrm{COV}(T_i, T_{i+k}) / \mathrm{VAR}(T_i), \quad \text{for lags } k = 0, 1, 2, \ldots \tag{1} $$

In this application the variable $T_i$ is the sea-surface temperature on day $i$, $T_{i+k}$ is the temperature $k$ days later, and COV and VAR are the covariance and variance of the variables as indicated. Computation of the autocorrelation function yields peaks whose magnitudes and spacings strongly indicate the existence of an annual oscillation in the time series.
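As an illustration of equation (1), the following is a minimal sketch of a lag-$k$ autocorrelation estimate for a daily series with gaps. It assumes missing days are marked with `None` and uses only pairs of days on which both temperatures are present; the report does not spell out its own adjustment, so this pairwise treatment, the function name `autocorr`, and the synthetic series are all assumptions made for the example.

```python
import math

def autocorr(series, lag):
    """Lag-k autocorrelation r_k = COV(T_i, T_{i+k}) / VAR(T_i).

    Missing days are None. Mean and variance use all present days;
    the covariance uses only pairs in which both days are present
    (an assumed adjustment, not necessarily the report's).
    """
    present = [t for t in series if t is not None]
    mean = sum(present) / len(present)
    var = sum((t - mean) ** 2 for t in present) / len(present)
    pairs = [(a, b) for a, b in zip(series, series[lag:])
             if a is not None and b is not None]
    cov = sum((a - mean) * (b - mean) for a, b in pairs) / len(pairs)
    return cov / var

# Synthetic 4-year daily series with a pure annual cycle (illustrative only).
temps = [12.0 + 4.0 * math.sin(2 * math.pi * d / 365) for d in range(4 * 365)]
temps[100:130] = [None] * 30   # a 30-day gap

r0 = autocorr(temps, 0)        # lag zero
r_half = autocorr(temps, 182)  # about half a year
r_year = autocorr(temps, 365)  # one year
```

For a series dominated by an annual cycle, the estimate is strongly negative near a half-year lag and strongly positive near a one-year lag, which is the pattern of peaks the text describes.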
The simplest model consisting of an oscillatory function with period one year is

$$ T' = B_0 + a \sin[2\pi(D - \theta)/365] + \epsilon \tag{2A} $$

$$ T' = B_0 + B_1 \sin(2\pi D/365) + B_2 \cos(2\pi D/365) + \epsilon \tag{2B} $$

where $D$ is time measured in days from some arbitrary origin and $T'$ is the fitted value of the surface temperature. Fitting the function of equation (2B) to the observed surface temperatures $T$ using the method of least squares yields estimates of the regression coefficients $B_0$, $B_1$, and $B_2$ and an estimate of the variance of $\epsilon$. The amplitude $a$ and phase $\theta$ can be obtained from $B_1$ and $B_2$. The quantity $\epsilon$ is the random, or error, or residual term.

If the residuals $T - T'$ are examined visually or by computation of the autocorrelation function of the residual time series, a fairly strong semiannual oscillation is discovered for some of the stations. This suggests the model

$$ T' = B_0 + B_1 \sin(2\pi D/365) + B_2 \cos(2\pi D/365) + B_3 \sin(4\pi D/365) + B_4 \cos(4\pi D/365) + \epsilon \tag{3} $$

The addition of semiannual oscillatory terms to the regression equation improves the fit obtained with the annual terms. Tests of significance of sums of squares attributable to annual and semiannual oscillations are performed using the appropriate F-ratios.

Computation of the autocorrelation functions of the residual time series after equation (3) has been fitted to the series of sea-surface temperatures yields the plots of figure 3. These residuals are themselves autocorrelated, although no additional oscillatory terms exist. The least squares method is valid if (1) the error between the true regression curve and the observed value is distributed independently of the independent variables with zero mean and constant variance; and (2) ideally, successive errors are distributed independently of one another.
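The least squares fit of equation (3) can be sketched as follows. This is a minimal standard-library illustration, not the report's program: the helper names (`design_row`, `solve`, `fit_harmonics`), the use of `None` for missing days, and the synthetic test series are assumptions. The amplitude and phase are recovered from $B_1$ and $B_2$ using the identity $a \sin[2\pi(D - \theta)/365] = a\cos(2\pi\theta/365)\sin(2\pi D/365) - a\sin(2\pi\theta/365)\cos(2\pi D/365)$.

```python
import math

def design_row(day):
    """Regressors of equation (3): constant, annual and semiannual terms."""
    w = 2 * math.pi * day / 365
    return [1.0, math.sin(w), math.cos(w), math.sin(2 * w), math.cos(2 * w)]

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_harmonics(days, temps):
    """Least-squares B0..B4 from the normal equations; None days are skipped."""
    rows = [design_row(d) for d, t in zip(days, temps) if t is not None]
    obs = [t for t in temps if t is not None]
    p = 5
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, obs)) for i in range(p)]
    return solve(xtx, xty)

# Synthetic noise-free series: mean 15, annual amplitude 3 with a 40-day
# phase, plus a small semiannual cosine term (all values illustrative).
days = list(range(3 * 365))
temps = [15.0 + 3.0 * math.sin(2 * math.pi * (d - 40) / 365)
         + 0.5 * math.cos(4 * math.pi * d / 365) for d in days]

b = fit_harmonics(days, temps)
amplitude = math.hypot(b[1], b[2])                          # a
phase_days = 365 / (2 * math.pi) * math.atan2(-b[2], b[1])  # theta
```

Because missing days are simply dropped from the normal equations, the same routine applies to a series with gaps; only the effective sample size changes.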
Actually, the problem of using the method of least squares when the error terms are autocorrelated has been solved if the $\epsilon$'s follow certain autoregressive processes.⁴ The autocorrelated residuals should affect the distributions of the regression coefficients. As will be seen later, the effect on the variance of the regression coefficients is negligible.

⁴Anderson, R. L., "The Problem of Autocorrelation in Regression Analysis," American Statistical Association. Journal, v. 49, p. 113-129, March 1954

[Figure 3. Autocorrelation coefficients for residual time series. Panels: Weather Ship "PAPA" (7 yrs. of data), St. James Island (21 yrs. of data), Langara Island (20 yrs. of data), Weather Ship "ECHO" (7 yrs. of data), Triple Island (21 yrs. of data), Scripps Pier (40 yrs. of data). Vertical axis: autocorrelation coefficient; horizontal axis: lag, marked at 6, 12, 18, and 24 months.]

If equation (3) is fitted to individual years of data, the analysis provides in $B_0$ an estimate of the yearly average of the surface temperature. The sequence of $B_0$'s can be examined for the existence of trend in the time series. Testing for trend using either the theory of runs or the autocorrelation coefficient with lag unity indicates that no long term trend exists in any of the time series under consideration. Details of the trend analysis will be presented in a subsequent report.

MISSING DATA

The six locations have data missing in amounts varying from 2 percent to 40 percent of the number of possible observations. Intuitively it would seem that, for the types of analyses attempted, a fairly large fraction of randomly distributed missing data can be tolerated. It is the purpose of this report to examine quantitatively the effect of various fractions of missing data.

Although the expression missing data has been used thus far in the discussion, it is worthwhile now to comment on this usage.
Conceivably, in a statistical problem, missing data can result in nothing more drastic than a sample of smaller size than planned. This might well be the case in a regression analysis in which the residuals are independently distributed with equal variances, and the missing data are uniformly or randomly distributed throughout the ranges of the independent variables. On the other hand, missing data in an extreme case can invalidate an experiment.

Table 1 shows that the stations with the largest fractions of missing data are the weather ships, partly because of the exclusion of temperatures if the ships are off station. Data may also be missing at either weather ships or shore locations because of bad weather or equipment failure.

As an indication of the nature of occurrence of the missing data for stations with fairly large fractions of missing data, consider figure 4. Shown are histograms of the frequency of missing temperature sequences for 7 years of PAPA data and 21 years of St. James Island data. Station PAPA was selected from the two open ocean locations since the bathythermograph observations were made by oceanographers and were considered to be more accurate than for station ECHO. St. James Island was chosen from among the island and coastal locations since it had the largest fraction of missing data of these stations.

Except for a few long periods of missing data for each station, as indicated in figure 4, the missing data days are distributed very much as though at random. That is, given the appropriate probability of there being data on a day, the distributions by length of data-present sequences and data-missing sequences are like those expected. More specifically, the computed histograms shown in figure 4 result from randomly generated time series with two controlling conditional probabilities.
The first conditional probability used for figure 4A is 0.76, which is the probability that a temperature will be observed, given that a temperature was observed the previous day. The second conditional probability used is 0.51, which is the probability that a temperature will be observed, given that a temperature was not observed the previous day. The corresponding probabilities for figure 4B are 0.89 and 0.59. These conditional probabilities agree quite well with the physical situation that missing data sequences occur infrequently, but once they occur they persist longer than can be explained by a single probability.

[Figure 4. Histograms of the frequency of missing temperature sequences, observed and computed. (A) Seven years of PAPA data; not shown are sequences of 27 and 33 days. (B) Twenty-one years of St. James Island data; not shown are sequences of 40, 59, and 72 days. Horizontal axis: length of missing data sequences; vertical axis: frequency.]

MODEL FOR MISSING DATA

It is proposed that the effect of missing data be evaluated in the following manner. There exist series of sea-surface temperatures for which there are no missing data (Scripps Pier), or almost none (Triple Island), over periods of several years. Complete series of length up to 12 years can be selected from each of these sources. The few missing temperatures for Triple Island are filled in by adding to interpolated values random normal deviates having the appropriate variance. The complete series remains unchanged thereafter. It is thought necessary to consider two stations, whose time series of sea-surface temperatures have slightly different characteristics with respect to residual variability and significance of semiannual oscillatory terms, in order to avoid decisions which might be too dependent on the characteristics of a single station.

Regression and autocorrelation analyses are performed on the complete series.
Estimates of the variances of the regression coefficients $B$ are available from the matrix inverse to that of the coefficients in the normal equations of the least squares analysis. The estimates of the variances used assume independent, equal variance residuals. These variances are attributable to the residual variability of observations about the true regression curve.

The 40 years of Scripps Pier residuals with very few missing observations provide an estimate of the variance of the near-zero autocorrelation coefficients. The autocorrelation function for Scripps Pier was computed out to a lag of 1800 days, an arbitrary figure slightly over 10 percent of the total sample length. The standard deviation of the autocorrelation coefficients with lags from 400 days (end of the initial decay of the function) to 1800 days is $\sigma_r = 0.0293$. This estimate of $\sigma_r$ is considered to be the best available measure of the random variability of the near-zero autocorrelation coefficients of sea-surface temperature anomalies. It is based on a large sample of the autocorrelation coefficient, and the maximum lag involved is still only a small fraction of the total time series length.

Missing data days are randomly introduced into a complete time series using computer-generated, uniformly distributed random numbers. Any desired fraction of missing data can be introduced by associating with each daily temperature one of the random numbers with range 0 to 1. If the random number has a value greater than the desired fraction, the temperature is retained; if not, the temperature is deleted. Although two probabilities are used to generate each of the computed histograms of figure 4, it is more convenient in the analyses below to use single probabilities yielding the same fractions of missing data.
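The two deletion schemes described so far can be sketched as below. This is an illustrative standard-library version under stated assumptions: missing days are marked `None`, the function names are hypothetical, the first day of the two-probability scheme is treated as following an observed day, and the seed is arbitrary. The helper `stationary_observed_fraction` is not from the report; it is the long-run observed fraction of a two-state chain with the quoted conditional probabilities, included to show which single probability would give the same fraction of missing data.

```python
import random

def delete_random(series, fraction, rng):
    """Single-probability scheme: keep a temperature when its uniform
    random number exceeds the desired missing fraction."""
    return [t if rng.random() > fraction else None for t in series]

def delete_markov(series, p_obs_after_obs, p_obs_after_miss, rng):
    """Two-probability scheme: the chance of an observation depends on
    whether the previous day was observed."""
    out, observed = [], True  # assumption: the day before the series had data
    for t in series:
        p = p_obs_after_obs if observed else p_obs_after_miss
        observed = rng.random() < p
        out.append(t if observed else None)
    return out

def stationary_observed_fraction(p_obs_after_obs, p_obs_after_miss):
    """Long-run fraction of observed days for the two-state chain."""
    return p_obs_after_miss / (1.0 - p_obs_after_obs + p_obs_after_miss)

def missing_fraction(series):
    return sum(t is None for t in series) / len(series)

rng = random.Random(1965)                         # arbitrary seed
complete = [float(d) for d in range(7 * 365)]     # stand-in for 7 years of data

sample = delete_random(complete, 0.5, rng)
papa_like = delete_markov(complete, 0.76, 0.51, rng)   # figure 4A probabilities

papa_observed = stationary_observed_fraction(0.76, 0.51)       # about 0.68
st_james_observed = stationary_observed_fraction(0.89, 0.59)   # about 0.84
```

With the figure 4 probabilities, the long-run observed fractions come out near 0.68 and 0.84, in the neighborhood of the 66 and 81 percent listed for PAPA and St. James Island in table 1.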
The resulting computed histograms decrease more rapidly as a function of length of sequence than do those of figure 4, but the analysis used is fairly insensitive to the shape of the histograms. Since the gross characteristics of the time series are similar for all stations, any deletion of temperatures from complete Scripps Pier and Triple Island time series yields sample time series which are like those with naturally missing larger fractions of data, and which have whatever weaknesses are implied by the missing data. In the remainder of this paper, the name sample time series refers to a complete time series with data deleted by the above described process.

MONTE CARLO APPROACH

For a sample time series, the harmonic and autocorrelation analyses can be performed just as for a complete series, the proper adjustments being made in the computations. The regression and autocorrelation coefficients obtained from a sample time series are different from those obtained from the corresponding complete series. If many sample time series with the same fraction of missing data are independently generated from the same complete time series, and if regression and autocorrelation analyses are performed for each sample time series, then the variabilities in the resulting coefficients will measure the effect of missing data. The generation of many such time series to give estimates and confidence limits for parameters is an example of the technique which has been given the name Monte Carlo.

The major interest in the $B$'s as statistical variables is the variability from sample to sample of their deviations from some true, but unknown, values. For an integral number of years of complete time series, the four estimates of the variances of $B_1$, $B_2$, $B_3$, and $B_4$, as obtained from the inverse matrix, are equal.
Figures 5A, 5B, 6A, and 6B are histograms of the differences between the $B$'s of 120 independently generated sample time series and the corresponding $B$'s of the complete time series for 7 years of Scripps Pier and Triple Island data. The $B$'s are uncorrelated and have equal variances. Because of the effective increase in sample size, the differences have been grouped for the four $B$'s. For figures 5A and 6A the sample time series average 50 percent missing data; for figures 5B and 6B they average 20 percent missing data. The histograms are presented to demonstrate that the differences are symmetrically distributed about zero, and to show the dispersion.

Figure 7 is a plot of $B_1$ vs $B_2$ for sample time series. It is included to demonstrate that the $B$'s are uncorrelated. Table 2 presents the correlation coefficients $r$ for all combinations of $B$'s for the 7-year sample time series with 50 percent of the data missing. The 5 percent critical value of $r$ is 0.179. One of the twelve $r$'s barely exceeds this value. This not unlikely event has prior probability 0.34.

[Figure 5. Histograms of differences between regression coefficients of sample time series and complete time series. Scripps Pier. Panels: fraction of missing data 0.5 and 0.2. Horizontal axis: degrees centigrade; vertical axis: percent.]

[Figure 6. Histograms of differences between regression coefficients of sample time series and complete time series. Triple Island. Panels: fraction of missing data 0.5 and 0.2. Horizontal axis: degrees centigrade; vertical axis: percent.]

[Figure 7. Plot of regression coefficients $B_1$ versus $B_2$ for sample time series. Scripps Pier, 0.5 data missing.]

TABLE 2. CORRELATION COEFFICIENTS BETWEEN SAMPLE B'S, 50 PERCENT OF DATA MISSING

                   Correlation Coefficients
Variables          Scripps Pier     Triple Island
$B_1$, $B_2$        0.139            0.187
$B_1$, $B_3$       -0.151            0.023
$B_1$, $B_4$       [illegible]      -0.043
$B_2$, $B_3$       [illegible]       0.061
$B_2$, $B_4$       -0.148           -0.075
$B_3$, $B_4$       -0.155            0.026

The Monte Carlo technique has been applied to data for the combinations of two stations, Scripps Pier and Triple Island; for three lengths of series, 4, 7, and 12 years; and for fractions $f$ of data missing in the range 0.09 to 0.70. With respect to regression coefficients, figures 8A, 8B, and 8C display the results of these analyses in a normalized form. The quantity plotted is the ratio of the variances of the $B$'s attributable to missing data to the variances of the $B$'s attributable to residual variability. This ratio is the fractional increase in the variance of the $B$'s attributable to missing data. Each point is based on 120 sample time series.

[Figure 8. Fractional increase in variance of regression coefficients due to random missing data. Panels: 4 years, 7 years, 12 years. Horizontal axis: fraction of data missing; vertical axis: fractional increase in variance. Points: Scripps Pier, Triple Island. Continuous curve: increase due to reduced sample size.]

The ratio attributable to reduced sample size is shown by the continuous curve. The variances of $B$'s based on samples from the same population are inversely proportional to sample size. If $N$ is the sample size for the complete time series, $N(1 - f)$ is the size of the sample time series. For the $B$'s

$$ \sigma_s^2 / \sigma_c^2 = 1/(1 - f) $$

where the subscripts $s$ and $c$ refer to sample and complete time series, respectively. The increase in variance attributable to reduced sample size is

$$ (\sigma_s^2 - \sigma_c^2)/\sigma_c^2 = 1/(1 - f) - 1 = f/(1 - f) $$

The variances of the $B$'s for the complete time series are computed as though the residuals are independent. The variances for the sample time series reflect the influence of the autocorrelated residuals. The empirical ratios of figure 8 lie almost on the theoretical curve, which assumes independent residuals.
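The comparison between the Monte Carlo ratio and the curve $f/(1 - f)$ can be reproduced in miniature. The sketch below is an assumption-laden illustration, not the report's computation: it uses a synthetic 2-year series with independent Gaussian residuals rather than station data, fits only the annual terms of equation (2B), and all function names, the seed, and the noise level are hypothetical. It follows the text in using 120 sample time series and in dividing the variance of the coefficient differences by the complete-series variance of the coefficients (taken as $2\sigma^2/N$ for a whole number of years).

```python
import math
import random

def row(day):
    w = 2 * math.pi * day / 365
    return (1.0, math.sin(w), math.cos(w))

def fit(days, temps):
    """Least-squares B0, B1, B2 of equation (2B) via the normal equations."""
    rows = [row(d) for d in days]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
         + [sum(r[i] * t for r, t in zip(rows, temps))] for i in range(3)]
    for c in range(3):          # Gauss-Jordan elimination, augmented matrix
        for r in range(3):
            if r != c:
                k = a[r][c] / a[c][c]
                a[r] = [x - k * y for x, y in zip(a[r], a[c])]
    return [a[i][3] / a[i][i] for i in range(3)]

def theoretical_ratio(fraction):
    """Fractional increase in variance expected from reduced sample size."""
    return fraction / (1.0 - fraction)

def monte_carlo_ratio(fraction, years=2, runs=120, seed=1):
    rng = random.Random(seed)
    days = list(range(years * 365))
    temps = [15 + 4 * math.sin(2 * math.pi * d / 365) + rng.gauss(0, 1)
             for d in days]
    b_full = fit(days, temps)
    resid = [t - sum(b * x for b, x in zip(b_full, row(d)))
             for d, t in zip(days, temps)]
    sigma2 = sum(e * e for e in resid) / (len(days) - 3)
    var_full = 2 * sigma2 / len(days)   # variance of B1, B2, whole years
    squares = []
    for _ in range(runs):
        kept = [(d, t) for d, t in zip(days, temps) if rng.random() > fraction]
        b = fit([d for d, t in kept], [t for d, t in kept])
        squares += [(b[1] - b_full[1]) ** 2, (b[2] - b_full[2]) ** 2]
    return (sum(squares) / len(squares)) / var_full

ratio_half = monte_carlo_ratio(0.5)   # expected to scatter around 1.0
```

With half the data deleted at random, the simulated ratio should scatter around `theoretical_ratio(0.5)`, which is 1.0; the exact value depends on the seed and on the 120-run sample.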
Thus, it is concluded that the combination of autocorrelated residuals and random deletion of data yields regression coefficients whose variances are as expected simply on the basis of sample size.

NONRANDOM MISSING DATA

As indicated in figure 4, there is an excessive number of longer sequences of missing data days. These sequences occur in the poor weather months October to March, inclusive. In addition there are several sequences of 5 to 10 days each, which are in excess of the number of such sequences expected by chance. It has been demonstrated above that randomly distributed missing data affect the variance of the regression coefficients just as though the sample size were smaller. It is necessary to determine if the longer sequences lead to the same result.

To approximate the time series yielding figure 4, sample time series have been generated in which longer sequences of data have been deleted in a random manner during the poor weather months. Then, individual temperatures are deleted at random from the remaining days until certain arbitrary fractions of missing data are obtained. Table 3 contains the number of longer sequences deleted for three series lengths and for three fractions of missing data. Analysis for 4-year series length was not attempted for the smallest missing fraction. The Monte Carlo technique is applied using 120 independently generated sample time series for each station and each combination of series length and fraction deleted.

TABLE 3. NUMBER OF LONGER SEQUENCES DELETED
[Table body not recoverable from the source. Legible column headings: Fraction Missing; Total Period Length (days); Series Length, 4 Years.]

The normalized results are displayed in figure 9. The same theoretical curve has been plotted as in figure 8. The arbitrary dashed curve has twice the ordinate of the solid curve. Because of the much longer sequences deleted, and perhaps because of compromises necessary in constructing table 3, the scatter of points in figure 9 is greater than in figure 8.
The dashed curve has been fitted conservatively. It indicates that the fractional increase in the variance of the $B$'s attributable to nonrandom missing data is twice the increase attributable to random missing data. This suggests caution in estimating the variances of $B$'s, or of residuals, for time series when the missing data occur in sequences longer than those occurring by chance.

[Figure 9. Fractional increase in variance of regression coefficients due to nonrandom missing data. Horizontal axis: fraction of data missing; vertical axis: fractional increase in variance. Points: Scripps Pier, Triple Island. Continuous curve: increase due to reduced sample size.]

The dashed curve can be interpreted another way. For fractions of missing data greater than 0.2, the dashed curve lies about 0.15 unit to the left of the continuous curve. When applicable, this quantity 0.15 should be added to the actual fraction of missing data. Then conclusions about nonrandom missing data can be made as for random missing data, but with the larger fraction of missing data used.

THE AUTOCORRELATION COEFFICIENT

The effect of missing data on autocorrelation coefficients will now be considered. The results are perhaps not as straightforward to evaluate as for regression coefficients, but are more encouraging as far as tolerating nonrandom missing data.

Figure 10 presents the results of Monte Carlo analyses of autocorrelation coefficients similar to those for regression coefficients. The autocorrelation coefficients are for the time series of residuals remaining after the regression analyses have been performed. The same combinations of stations, series length, and fractions deleted are used. The variances of autocorrelation coefficients are averaged for lags from 10 to 100 in steps of 10. Assuming the variances are inversely proportional to series length, the average variances are normalized to an arbitrary series length of one year.
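The normalization to a one-year series length can be checked against the Scripps Pier variance that the report quotes in the following paragraph. As a worked sketch, assuming only that the variance is inversely proportional to series length, and with $n$ (series length in years) and $\sigma_r^2(n)$ introduced here as illustrative notation not used in the original:

```latex
\sigma_r^2(1\ \text{yr}) \;=\; n \,\sigma_r^2(n\ \text{yr})
  \;=\; 40 \times 0.000859 \;\approx\; 0.0344
```

The same scaling applies to the Monte Carlo variances: a variance obtained from a 4-, 7-, or 12-year sample series would be multiplied by 4, 7, or 12 before being plotted.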
In figure 10, results for random missing data are plotted as circles; results for nonrandom missing data are plotted as triangles. The variance of near-zero autocorrelation coefficients based on 40 years of Scripps Pier data is 0.000859. Normalized to the series length of one year, the variance is 0.0344. This quantity is plotted as the dashed line at the top of figure 10. Somewhat arbitrary curves have been fitted to the two sets of points. The effect of missing data on the variance of autocorrelation coefficients results in curves similar to those for regression coefficients. The major difference is that the magnitude of the effect of introducing nonrandom missing data is much less in the case of autocorrelation coefficients. The ratio of ordinates averages about 1.2 instead of the 2.0 for the regression coefficient case. The dashed curve is about 0.05 unit to the left of the continuous curve rather than 0.15 unit. If the analyses of regression and autocorrelation coefficients are of equal importance, then the limitations on nonrandom missing data are determined by the regression coefficient results above.

[Figure 10. Increase in variance of autocorrelation coefficients due to random and nonrandom missing data. Horizontal axis: fraction of data missing. Points: Scripps Pier and Triple Island, each for random (circles) and nonrandom (triangles) missing data.]

A comparison of the variance determined from the 40 years of Scripps Pier near-zero autocorrelation coefficients with the variance of the Monte Carlo analysis indicates that 70 percent of the data may be randomly missing before the two variances are equal.

COMMENTS AND CONCLUSIONS

In the analysis above, certain compromises are made with computer techniques and computing times required: (1) The distribution of missing day sequences based on a simple use of random numbers will never agree exactly with the observed distribution of missing day sequences for a given station.
Nevertheless, the techniques used provide good initial estimates of the effect of missing data. (2) The use of 120 Monte Carlo runs per case is a compromise between computer time required and the apparent rate of convergence to a limit of the parameters estimated.

It is concluded that random missing data in a time series result in regression coefficients whose variances increase over those of a complete time series by the amount predicted by the reduction in sample size. However, the presence of longer sequences of nonrandom missing data may have a pronounced effect in estimating regression coefficients. Specifically, if variances of regression coefficients are estimated in the usual manner, on the average these estimates must be adjusted upwards. Roughly, the increase in variance due to missing data must be doubled. Alternatively, for fractions of missing data greater than 0.2, time series with nonrandom missing data will have regression coefficient variances equal to those the same series with 0.15 more missing data would have, if all the missing data were random.

The effect of nonrandom missing data on autocorrelation coefficients is less pronounced. The increase in their estimated variance need be only 20 percent. Alternatively, for fractions of missing data greater than 0.2, time series with nonrandom missing data will have autocorrelation coefficient variances equal to those the same series with only 0.05 more missing data would have, if all the missing data were random.

RECOMMENDATIONS

Almost all time series of sea-surface temperatures contain missing data. The nature of this missing data as to randomness of occurrence in time should be examined before regression and autocorrelation analyses are performed. The appropriate results of this report should be applied in estimating the variances of regression and autocorrelation coefficients.
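One way the regression-coefficient results might be applied is sketched below. It is a hypothetical helper, not a procedure given in the report: the function names and argument layout are assumptions, and it simply encodes the three rules stated in the conclusions (the $f/(1 - f)$ increase for random missing data, the doubled increase for nonrandom missing data, and the alternative of adding 0.15 to the fraction when it exceeds 0.2).

```python
def random_increase(fraction):
    """Fractional increase in variance from reduced sample size, f / (1 - f)."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction of missing data must be in [0, 1)")
    return fraction / (1.0 - fraction)

def regression_variance(var_complete, fraction, nonrandom=False, use_shift=False):
    """Adjust a complete-series regression coefficient variance for missing data.

    nonrandom=False: increase is f / (1 - f).
    nonrandom=True:  increase is doubled, or, if use_shift is set and
                     f > 0.2, computed as for random data at f + 0.15.
    """
    if not nonrandom:
        increase = random_increase(fraction)
    elif use_shift and fraction > 0.2:
        increase = random_increase(fraction + 0.15)
    else:
        increase = 2.0 * random_increase(fraction)
    return var_complete * (1.0 + increase)

# Illustrative values for a unit complete-series variance and 30 percent missing.
v_random = regression_variance(1.0, 0.3)
v_doubled = regression_variance(1.0, 0.3, nonrandom=True)
v_shifted = regression_variance(1.0, 0.3, nonrandom=True, use_shift=True)
```

For 30 percent missing data the doubling rule and the 0.15 shift give similar adjusted variances (roughly 1.86 and 1.82 times the complete-series value), which is consistent with the report presenting them as two readings of the same dashed curve.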
A similar investigation of the effect of missing data should be performed for the regression problem with several independent variables, namely time, depth, and geographical location. The dependent variable will be water temperature.

The results of this report apply to the long range estimation of sea-surface temperatures. An examination should be made of the effect of missing data on the short range (a few weeks or months) prediction of sea-surface temperatures.
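The variance adjustments stated in the conclusions can be written out as a short calculation. The sketch below is illustrative only. It assumes that "predicted by the reduction in sample size" means the variance is inversely proportional to the number of observations retained, and the function and dictionary names are hypothetical. The factors 2.0 and 1.2 and the shifts 0.15 and 0.05 are the values quoted in this report.

```python
# Ratio of the nonrandom increase in variance to the random increase.
RATIO = {"regression": 2.0, "autocorrelation": 1.2}
# Equivalent extra fraction of random missing data (fractions above 0.2 only).
SHIFT = {"regression": 0.15, "autocorrelation": 0.05}


def random_missing_variance(var_complete, frac_missing):
    """Coefficient variance when a fraction of the data is missing at random.

    Assumption: variance scales as 1 / (number of observations kept).
    """
    if not 0.0 <= frac_missing < 1.0:
        raise ValueError("frac_missing must lie in [0, 1)")
    return var_complete / (1.0 - frac_missing)


def nonrandom_variance_by_ratio(var_complete, frac_missing, kind):
    """Scale the increase due to missing data by the quoted ratio."""
    increase = random_missing_variance(var_complete, frac_missing) - var_complete
    return var_complete + RATIO[kind] * increase


def nonrandom_variance_by_shift(var_complete, frac_missing, kind):
    """Treat the series as if it had more data missing, all at random."""
    if frac_missing <= 0.2:
        raise ValueError("the shift rule is stated only for fractions above 0.2")
    return random_missing_variance(var_complete, frac_missing + SHIFT[kind])


# Scripps Pier value from the text, normalized to a one-year series.
var_one_year = 0.000859 * 40  # about 0.0344
print(nonrandom_variance_by_ratio(var_one_year, 0.3, "autocorrelation"))
print(nonrandom_variance_by_shift(var_one_year, 0.3, "autocorrelation"))
```

The two rules are alternative approximations read from fitted curves, so they need not give identical values for a given fraction of missing data.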
Navy Electronics Lab., San Diego, Calif. Report 1256
SEA-SURFACE TEMPERATURE ESTIMATION, by C. J. Van Vliet. 32 p., 9 Jan 65. UNCLASSIFIED
Many time series of sea-surface temperatures feature random missing data or long periods of missing data. Methods are presented that enable one to correct for the effects of the missing data in computations of regression and autocorrelation coefficients.
1. Oceanographic data - Statistical analysis
2. Statistical analysis - Applications
I. Van Vliet, C. J.
SR 004 03 01, Task 0586 (NEL L40551, formerly L4-5)
This card is UNCLASSIFIED

INITIAL DISTRIBUTION LIST

CHIEF, BUREAU OF SHIPS
  CODE 210L
  CODE 345B
  CODE 240C (2)
  CODE 320
  CODE 360 (3)
  CODE 370
CHIEF, BUREAU OF NAVAL WEAPONS
  DLI-3
  DLI-31
  FASS
  RU-222
  RUDC-2
  RUDC-11
CHIEF, BUREAU OF YARDS AND DOCKS
CHIEF OF NAVAL PERSONNEL
  PERS 118
CHIEF OF NAVAL OPERATIONS
  OP-07T
  OP-716C
  OP-71
  OP-76C
  OP-03EG
  OP-0985
CHIEF OF NAVAL RESEARCH
  CODE 416
  CODE 466
  CODE 468
COMMANDER IN CHIEF US PACIFIC FLEET
COMMANDER IN CHIEF US ATLANTIC FLEET
COMMANDER OPERATIONAL TEST AND EVALUATION FORCE
DEPUTY COMMANDER OPERATIONAL TEST - EVALUATION FORCE, PACIFIC
COMMANDER CRUISER-DESTROYER FORCE
  US ATLANTIC FLEET
  US PACIFIC FLEET
COMMANDER TRAINING COMMAND US PACIFIC FLEET
COMMANDER SUBMARINE DEVELOPMENT GROUP TWO
FLEET AIR WINGS, ATLANTIC FLEET
SCIENTIFIC ADVISORY TEAM
US NAVAL AIR DEVELOPMENT CENTER
  NADC LIBRARY
US NAVAL MISSILE CENTER
  TECH. LIBRARY, CODE NO 3022
PACIFIC MISSILE RANGE /CODE 3250/
US NAVAL ORDNANCE LABORATORY
  LIBRARY
US NAVAL ORDNANCE TEST STATION
  PASADENA ANNEX LIBRARY
  CODE 334
  CHINA LAKE
US NAVAL WEAPONS LABORATORY
  KXL
PUGET SOUND NAVAL SHIPYARD
USN RADIOLOGICAL DEFENSE LABORATORY
DAVID TAYLOR MODEL BASIN
  APPLIED MATHEMATICS LABORATORY
  /LIBRARY/
US NAVY MINE DEFENSE LABORATORY
US NAVAL TRAINING DEVICE CENTER
  CODE 365H, ASW DIVISION
USN UNDERWATER SOUND LABORATORY
  LIBRARY
ATLANTIC FLEET ASW TACTICAL SCHOOL
USN MARINE ENGINEERING LABORATORY
US NAVAL CIVIL ENGINEERING LAB.
  L54
US NAVAL RESEARCH LABORATORY
  CODE 2027
US NAVAL ORDNANCE LABORATORY
  CORONA
USN UNDERWATER SOUND REFERENCE LAB.
BEACH JUMPER UNIT TWO
US FLEET ASW SCHOOL
US FLEET SONAR SCHOOL
USN UNDERWATER ORDNANCE STATION
OFFICE OF NAVAL RESEARCH
  PASADENA
USN WEATHER RESEARCH FACILITY
US NAVAL OCEANOGRAPHIC OFFICE (2)
US NAVAL POSTGRADUATE SCHOOL
  LIBRARY
  DEPT. OF ENVIRONMENTAL SCIENCES
OFFICE OF NAVAL RESEARCH
  LONDON
  BOSTON
  CHICAGO
  SAN FRANCISCO
FLEET NUMERICAL WEATHER FACILITY
US NAVAL ACADEMY
ASSISTANT SECRETARY OF THE NAVY R-D
ONR SCIENTIFIC LIAISON OFFICER
  WOODS HOLE OCEANOGRAPHIC INSTITUTION
INSTITUTE OF NAVAL STUDIES
  LIBRARY
AIR DEVELOPMENT SQUADRON ONE /VX-1/
DEFENSE DOCUMENTATION CENTER (20)
DOD RESEARCH AND ENGINEERING
  TECHNICAL LIBRARY
NATIONAL OCEANOGRAPHIC DATA CENTER (2)
NASA LANGLEY RESEARCH CENTER
COMMITTEE ON UNDERSEA WARFARE
US COAST GUARD
  OCEANOGRAPHY - METEOROLOGY BRANCH
ARCTIC RESEARCH LABORATORY
WOODS HOLE OCEANOGRAPHIC INSTITUTION
US COAST AND GEODETIC SURVEY
  MARINE DATA DIVISION /ATTN-22/
US WEATHER BUREAU
US GEOLOGICAL SURVEY LIBRARY
  DENVER SECTION
US BUREAU OF COMMERCIAL FISHERIES
  LA JOLLA, DR. AHLSTROM
  WASHINGTON 25, D. C.
  POINT LOMA STATION
  WOODS HOLE, MASSACHUSETTS
  HONOLULU-JOHN C MARR
  LA JOLLA, CALIFORNIA
  HONOLULU, HAWAII
  STANFORD, CALIFORNIA
  POINT LOMA STA-J. H. JOHNSON
ABERDEEN PROVING GROUND, MARYLAND
REDSTONE SCIENTIFIC INFORMATION CENTER
BEACH EROSION BOARD
  CORPS OF ENGINEERS, US ARMY
DEPUTY CHIEF OF STAFF, US AIR FORCE
  AFRST-SC
STRATEGIC AIR COMMAND
HQ AIR WEATHER SERVICE
UNIVERSITY OF MIAMI
  THE MARINE LAB. LIBRARY (3)
COLUMBIA UNIVERSITY
  HUDSON LABORATORIES
  LAMONT GEOLOGICAL OBSERVATORY
DARTMOUTH COLLEGE
  THAYER SCHOOL OF ENGINEERING
  RADIOPHYSICS LABORATORY
RUTGERS UNIVERSITY
CORNELL UNIVERSITY
OREGON STATE UNIVERSITY
  DEPARTMENT OF OCEANOGRAPHY
UNIVERSITY OF SOUTHERN CALIFORNIA
  ALLAN HANCOCK FOUNDATION
UNIVERSITY OF WASHINGTON
  DEPARTMENT OF OCEANOGRAPHY
  FISHERIES-OCEANOGRAPHY LIBRARY
NEW YORK UNIVERSITY
  DEPT OF METEOROLOGY - OCEANOGRAPHY
UNIVERSITY OF MICHIGAN
  DR. JOHN C. AYERS
UNIVERSITY OF ALASKA
  GEOPHYSICAL INSTITUTE
UNIVERSITY OF RHODE ISLAND
  NARRAGANSETT MARINE LABORATORY
YALE UNIVERSITY
  BINGHAM OCEANOGRAPHIC LABORATORY
FLORIDA STATE UNIVERSITY
  OCEANOGRAPHIC INSTITUTE
UNIVERSITY OF HAWAII
  HAWAII INSTITUTE OF GEOPHYSICS
  ELECTRICAL ENGINEERING DEPT
A-M COLLEGE OF TEXAS
  DEPARTMENT OF OCEANOGRAPHY
THE UNIVERSITY OF TEXAS
  DEFENSE RESEARCH LABORATORY
HARVARD UNIVERSITY
SCRIPPS INSTITUTION OF OCEANOGRAPHY (4)
  MARINE PHYSICAL LAB
UNIVERSITY OF CALIFORNIA
  ENGINEERING DEPARTMENT
UNIVERSITY OF CALIFORNIA, SAN DIEGO
  SIO
THE JOHNS HOPKINS UNIVERSITY
  APPLIED PHYSICS LABORATORY
INSTITUTE FOR DEFENSE ANALYSIS