MISUSE OF THE NULL HYPOTHESIS IN DATA REPORTING AND INTERPRETATION

DECEMBER 1992

Ontario Environment
ISBN 0-7778-0607-X
Log 92-2726-009
PIBS 2201

This technical publication is available in English only.

Copyright: Queen's Printer for Ontario, 1992. This publication may be reproduced for non-commercial purposes with appropriate attribution.

Report prepared by: Donald E. King, Laboratory Services Branch, Ontario Ministry of the Environment

ABSTRACT

The concept of analytical method detection capability (MDC) has been adopted for several decades as the basis for determining measurement reporting limits. One outcome has been inadequate control of method performance at low levels. As a result, the frequently poor comparability of such data among laboratories has led to the introduction of progressively higher data reporting limits and the consequent loss or degradation of potentially valuable low-level environmental data.

This paper proposes that the statistical process behind this concept of detection has been misunderstood and misapplied. It reviews the application of the 'null' hypothesis, in particular the requirement that this hypothesis be the opposite of what one suspects to be true. The traditional approach applies to the case where an analyte is known to be very probably present (conventional parameters, nutrients, and major ions). But in other cases (ultra-trace contaminants in drinking water), we know that the analyte is present only at levels below our analytical capability to measure. Therefore the statistical process and logic must be inverted.
This paper affirms, on a statistical basis, the need to report low-level data, and disavows the application of reporting limits at levels any higher than 3 times the method repeatability as estimated by the within-run standard deviation (Sw). It supports the reporting of low-level estimates, and the adoption of four generic reference points for data interpretation: W (between Sw/2 and Sw), CD (= 3 Sw), DL (= 6 Sw), and QL (= 12 Sw).

Keywords: detection, data reporting, data interpretation, null hypothesis.

INTRODUCTION

The concepts of analytical detection and quantitation are critical to the way in which data is reported and interpreted. In the 1960s, A.L. Wilson and others promoted the use of statistical protocols to assess the performance of analytical methods. He proposed a 'criterion for detection' (CD) and a 'detection limit' (DL) based on a within-run estimate of standard deviation (Sw). But the multiplicity of terms, definitions, and estimation procedures introduced over the past thirty years has failed to clarify their application in everyday practice.

The present paper will not review the frequently contradictory terms and definitions currently encountered in the literature. (See Currie for a comprehensive examination of the topic.) Instead, it proposes that the widespread misunderstanding and misapplication of the CD and DL concepts in data reporting, particularly data censoring, and the subsequent introduction of even higher criteria to restrict reporting of low-level data, have been a disservice to the goal of quality improvement in environmental analytical data.

To many, these various terms and concepts convey a negative sense of analytical performance and data quality at low levels. Low-level measurements are perceived to be inherently invalid, biased, and non-defensible.
Detection, quantitation and higher reporting limits are used to limit the obligation of less competent laboratories to produce well-controlled data. When they are applied in regulatory situations, this censoring of low-level estimates promotes the perception that effluents are free of contaminants, and restricts the ability of dischargers to determine and demonstrate that their effluent is perhaps much cleaner than required.

'Detection' terminology varies widely with respect to function, intent, the type of data to be used, the statistical concepts involved, and application. The resulting limits vary in magnitude from 1.64 times to many times the within-laboratory analytical standard deviation. Since environmental concerns are focused primarily on those analytes which are only present, if at all, at levels below current quantitation limits, the absence of low-level estimates inhibits our ability to develop appropriate environmental strategies.

This paper demonstrates how the statistical logic must be inverted when the target analyte is strongly suspected to be 'absent' (no analytical response at a level above Sw). It demonstrates the need to re-examine data reporting practices and to improve data quality management objectives for low-level data.

Recently Keith has proposed the adoption of three generic levels for assessing low-level data quality. His terms MDL, RDL (2 MDL), and RQL (4 MDL) derive from the US-EPA definitions and exactly parallel the terms CD, DL, and QL (to be discussed below). He proposes the term 'Level' rather than 'Limit' to stress their use as criteria for data interpretation rather than as criteria for data reporting.

To further reduce the confusion surrounding 'DL' terminology, the following discussion proposes adoption of a generic term 'method detection capability' (MDC) to replace 'method detection limit' (MDL). This separates analytical performance from data reporting and data interpretation.
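The generic reference points named in the abstract (W, CD, DL, QL), which parallel Keith's MDL, RDL, and RQL, are all fixed multiples of a single Sw. The following is a minimal sketch of how they might be tabulated and used to label a result. The function names and the wording of the labels are illustrative assumptions, not the Ministry's remark codes, and the Sw value is hypothetical.

```python
# Generic reference points, using the multiples of Sw proposed in the abstract.
# Function names and label wording are illustrative, not taken from the report.

def reference_points(sw):
    """Return the four generic reference points for a within-run SD (Sw)."""
    return {
        "W": (sw / 2, sw),   # W lies between Sw/2 and Sw
        "CD": 3 * sw,        # criterion for detection
        "DL": 6 * sw,        # detection limit = 2 CD
        "QL": 12 * sw,       # quantitation level = 4 CD
    }


def interpret(result, sw):
    """Label a reported result by the highest reference point it reaches."""
    pts = reference_points(sw)
    if result >= pts["QL"]:
        return "at or above QL"
    if result >= pts["DL"]:
        return "between DL and QL"
    if result >= pts["CD"]:
        return "between CD and DL"
    if result >= pts["W"][0]:
        return "measurable, below CD"
    return "below W"


if __name__ == "__main__":
    sw = 2.0  # hypothetical Sw in ug/L
    for r in (0.5, 2.0, 7.0, 13.0, 30.0):
        print(r, "->", interpret(r, sw))
```

The labels are attached to reported values rather than used to suppress them, which is the sense in which the text treats these points as criteria for interpretation and not for reporting.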
MDC represents any statistical concept of detection based on an estimate of total method repeatability (Sw), without specifying the level of confidence required or implied. For general comparison among methods and laboratories the factor 3 will provide sufficient confidence and promote comparability of estimates. [Statistical confidence depends on the degrees of freedom (i.e., on the number of replicate measurements) used to estimate Sw.] MDC will incorporate the impact of the total method on analytical variability for a typical sample matrix. MDC will not incorporate the impact of particularly complex or atypical matrices. This parallels the US-EPA term MDL but avoids the confusion surrounding the term 'detection limit'.

BACKGROUND

In Wilson's papers, CD (later L_C) represented the level above which a measurement could be considered to be a firm indication of the presence of analyte in a sample. DL (later L_D) was set at 2 CD and represented the minimum quantity needed in the sample to ensure that most measurements would exceed CD. These 'decision points' were based on an estimate of the analytical within-run standard deviation (Sw). They included the impact of baseline/blank corrections.

CD and DL were to be used to describe the capability or the relative suitability of alternative methods. But many analysts and data users, concerned that relative error increases dramatically for measurements near CD and DL, introduced the concept of a 'quantitation level' (QL). Thus if the uncertainty of a measurement is estimated by ± CD, then results above a QL set at 4 CD would have a relative uncertainty of less than 25%. (Results between DL and CD would be considered to be semi-quantitative.)

Finally, since these levels were based on single-analyst estimates of repeatability, and it was known that repeatability (and bias) varied among laboratories, the concept of a 'practical quantitation level' (PQL) was introduced.
It would be based on among-laboratory estimates of standard deviation (S). In a reasonably comparable group of laboratories the ratio S/Sw would not generally exceed about 1.3 to 1.5, based on statistical considerations (the F-test). So there is some justification for setting a PQL at 5 to 6 times CD. In principle then, we had the basis for an internally consistent sequence of meaningful criteria for interpreting data.

During the 1970s, instrument manufacturers, professional analytical associations, and regulatory agencies flooded the literature with a multitude of conflicting definitions and terms to describe instrument 'noise' and method detection capability. As an example, Wilson's CD concept has a similar basis to terms such as 'lower limit of detection' and 'method detection limit', while his DL concept is comparable to terms such as 'limit of quantitation' (LOQ). The term PQL appears to have been developed and interpreted in a number of ways. Thus, it represents: a regulatory upper limit for laboratory MDL values; a surrogate estimate for the individual MDL estimates in a multi-parameter scan; the level above which some percentage of participants in an interlaboratory study can estimate an unknown to within some percentage of the target value; etc.

The lack of data quality, and the general misuse of analytical data during the 1970s to support opposing viewpoints on environmental and health protection issues, required laboratories and data users to be particularly wary when releasing any data which was not absolutely beyond reproach. The concepts of DL, QL, and PQL were adopted as successively higher criteria for restricting the reporting and use of low-level measurements. There is still a lack of consensus among analysts about the validity of such data. And the confusion associated with the concepts of detection has made it difficult to discuss mechanisms for improving the quality of such data.
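The PQL multiplier of 5 to 6 times CD quoted in the Background can be reproduced by simple arithmetic. The line below is one reading of that reasoning, assuming that the PQL is obtained by scaling the quantitation level QL = 4 CD by the among-laboratory ratio S/Sw; the report does not spell out this step.

```latex
\mathrm{QL} = 4\,\mathrm{CD}, \qquad
\mathrm{PQL} \approx \frac{S}{S_w}\,\mathrm{QL}
\;\Longrightarrow\;
1.3 \times 4\,\mathrm{CD} = 5.2\,\mathrm{CD}
\;\le\; \mathrm{PQL} \;\le\;
1.5 \times 4\,\mathrm{CD} = 6\,\mathrm{CD}
```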
In environmental studies, relative change is often less important than absolute trends and differences. At lower levels, absolute performance is described by the equation:

$$S_w = S_o + f \cdot C$$

where Sw is the total repeatability of the procedure, So is the effect of methodology, f is a factor such as 0.05, and C represents analyte concentration. The term So actually incorporates both the inherent variability of the instrument and the methodology (analytical technique). Thus:

$$S_o^2 = S_i^2 + S_m^2 + S_c^2$$

Typically, for the analytical sample preparation procedures required to analyze complex environmental matrices, Sm is the dominant term in the equation. [The term Sc has been appended to cover the impact of applying corrections for interference from sample matrix constituents. It can be quite critical for some tests, for example, the correction for chlorine interference in the estimation of vanadium by ICP/MS systems, which depends on the level of chlorine to be corrected.]

It should be apparent that Sw does not vary greatly within the usual operating range. Thus, given a range of 1 to 200 ug/L: if Sw is 2 at 5 ug/L, then it will be about 3 at 20 ug/L, and about 7 at 100 ug/L. We must recollect that, statistically, a change or difference of 2 Sw between two measurements is significant (risk of <5%), whether it occurs at DL or 10 DL. When we are interested in spatial/temporal changes and can read to the nearest 1, the results 2 and 6 indicate a significant change. A similar level of statistical confidence at 100 ug/L requires a change of about 14 ug/L. So although the low-level data may barely meet criteria for 'quantitative' based on relative error, it is still very meaningful in absolute terms.

The quality, and hence reportability and interpretation, of low-level data is always a concern. Environmental analysts have had ongoing conceptual problems with the adoption of a purely statistical basis to decide the reportability of data.
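The worked figures in this section (Sw of 2 at 5 ug/L, about 3 at 20 ug/L, about 7 at 100 ug/L) follow from the linear repeatability model with f = 0.05. The sketch below reproduces them under one assumption that is not stated in the report: So = 1.75 ug/L, back-calculated from the first of those figures. The function names are illustrative.

```python
# Sketch of the linear repeatability model Sw = So + f*C described above.
# So = 1.75 ug/L is an assumed value, back-calculated from "Sw is 2 at 5 ug/L"
# with f = 0.05; it is not stated in the report.

def repeatability(conc, s0=1.75, f=0.05):
    """Within-run standard deviation Sw at concentration conc (ug/L)."""
    return s0 + f * conc


def significant_change(conc, s0=1.75, f=0.05):
    """Smallest difference the text treats as significant: 2 * Sw."""
    return 2 * repeatability(conc, s0, f)


if __name__ == "__main__":
    for c in (5, 20, 100):
        print(f"C = {c:>3} ug/L   Sw = {repeatability(c):.2f}   "
              f"2*Sw = {significant_change(c):.1f}")
```

With these inputs the required change grows from 4 ug/L at the low end to roughly 14 ug/L at 100 ug/L, which is the sense in which low-level results remain meaningful in absolute terms.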
There are real-life problems in determining the 'zero' point (sources of bias), establishing specific analyte identification criteria, and reducing the impact of the sample matrix (interference). Given that any suggestion of the presence of a hazardous contaminant would have serious public repercussions, many analysts simply refuse to report low-level data.

Inadequate control protocols within the laboratory continue to have serious repercussions on the comparability of data from different laboratories and the acceptability of data for regulatory purposes. Analytical and regulatory personnel resort to between-run and among-laboratory estimates of variability to defend their application of high data reporting limits. The incapability of some analytical facilities to produce reliable data has been used to set limits which allow the remaining facilities to neglect the issues of low-level measurement control.

Finally, the issue of data quality is aggravated because many environmental regulations resort to method specification rather than performance specification. This approach depends on the assumption that all laboratories have equally competent analysts with expertise in the increasingly technological instrumentation required, and that they apply more than the minimum level of quality assurance and quality control practices. This approach also favours the retention of obsolete analytical technologies which are inadequate for reliable identification of trace pollutants, and inhibits (because of the need for 'equivalency') the introduction of better (more specific, selective, precise, and accurate) methodology.

Ultimately, for a specific methodology, the uncertainty of measurement, the difficulty of setting 'zero', and problems in identifying and confirming the presence of a specific analyte will forestall the generation of reliable low-level measurements pending a significant improvement in analytical procedure.
But the commonly accepted practice of rejecting low-level data because of the potential for bias, misidentification, etc., impedes the achievement of data comparability among laboratories.

The primary laboratory data quality issues at low levels are proper control of the baseline and correction for the method blank. Other factors include laboratory contamination, analyte recovery and identification. It is general practice to ignore low-level blank estimates because they are typically below the routine reporting limits. When samples contain reasonable amounts of the target analyte this is no problem. But this practice contributes to measurement bias, affects the quality of all data, and will definitely impair the comparison of interlaboratory findings.

On the other hand, spatial and temporal patterns and other knowledge about the environment being studied can be used to affirm the internal consistency of data from a particular laboratory. Even if data sets produced by two agencies are biased relative to each other, they may still affirm any trends present. Initial estimates can always be confirmed by replicate measurements, multiple sampling, or by pattern analysis (contour or time sequence plots) of many single estimates at various points or times. Project planning should recognize and incorporate a strategy to compensate for the variability and potential bias of measurement and sampling activities. Resampling is rarely a satisfactory approach for replacing missing information or improving data consistency.

DETECTION AND PERFORMANCE CAPABILITY

Detection capability is a performance characteristic which indicates confidence that the analyte response was sufficiently different from the appropriate 'zero' response to conclude that the response is not an effect of measurement noise or variability.
The variability estimate will reflect one or more of the following:
a) the sensitivity of the detector (Si);
b) the variability of the method blank under standard conditions (Sm);
c) the impact of other sample constituents on variability and bias (Sc).

Part of the confusion in the literature arises from lack of consensus as to what type of data should be used in estimating capability and 'decision points'. Alternatives include various combinations of direct injection or total method replication, standards or natural samples, single analyst or inter-analyst, within or between batch, the level of confidence required (95% or 99%), and the nature of the 'null' hypothesis to be adopted.

The most commonly encountered terms among working analysts are the instrument detection limit (IDL), the method detection limit (MDL), and the sample detection limit (SDL). IDL tends to reflect and relate to instrumental readout signal/noise ratios. MDL as described by US-EPA tends to reflect the entire method but only for standard solutions or 'typical' matrix samples. SDL tends to be used when the analytical determination cannot be performed due to severe matrix effects (e.g., foaming, emulsions, severe background effects, in which case no measurement can be performed), or when the result is particularly uncertain because of the uncertainty induced by attempts to correct for determinate sources of error (e.g., chlorine in the vanadium test by ICP/MS). Again, to eliminate the confusion surrounding the terms 'detection limit' and 'detection level', the term detection capability is preferred. Thus these terms should be changed to IDC, MDC and SDC respectively.

The most commonly encountered definitions for detection capability reflect the variability of in-run blank estimates. This approach to the question of low-level method reliability depends on the within-batch standard deviation (Sw), which is an estimate of within-batch single-analyst total method repeatability.
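As a concrete illustration of the within-batch estimate described above, the following sketch computes Sw from one set of replicate results and scales it by the factor 3 proposed in the abstract. The replicate values are invented purely for illustration, and the function names are assumptions; a real estimate would use total-method replicates from a single run by a single analyst.

```python
import statistics

# Hypothetical replicate results (ug/L) for one low-level sample, measured
# within a single run by one analyst. The values are made up for illustration.
replicates = [1.2, 0.8, 1.5, 0.9, 1.1, 1.4, 0.7, 1.0]


def within_run_sd(values):
    """Sample standard deviation (n - 1 in the denominator), i.e. Sw."""
    return statistics.stdev(values)


def detection_capability(values, factor=3.0):
    """MDC expressed as a multiple of Sw; 3 is the factor used in the text."""
    return factor * within_run_sd(values)


if __name__ == "__main__":
    sw = within_run_sd(replicates)
    print(f"n = {len(replicates)}, df = {len(replicates) - 1}")
    print(f"Sw  = {sw:.3f} ug/L")
    print(f"MDC = {detection_capability(replicates):.3f} ug/L")
```

The factor is left as a parameter because, as the text notes, the appropriate multiplier depends on the number of replicates and on the risk one is willing to accept.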
Typically, in general conversation among analysts, MDC is expressed as 3 Sw. This factor of 3 has been established by tradition and is traced by them to generally poorly understood statistical principles. Of course the actual factor would vary based on the number of replicates used to determine Sw and the level of risk one is willing to take that a decision may turn out to be wrong. Although this statistical basis is generally not well understood by many bench analysts, most understand that it provides 99% confidence that the analyte is present.

Detection capability is independent of any bias introduced by correcting for (or ignoring) day-to-day variation in the blank, or within-run drift of the baseline, matrix interference, or inadequate analyte recovery or identification. Therefore, although bench analysts accept and use the generic concept of MDC, they will still express some lack of confidence in the quantitative nature of such low measurements (as discussed previously). Many laboratories and analysts are well aware that blank estimates can change dramatically but are unable or unwilling to develop adequate control strategies to limit the bias effect.

STATISTICAL CONCEPTS FOR DATA INTERPRETATION

We as analysts tend to think of detection as the 'appearance' of something that wasn't there. And this is certainly valid. BUT for waste management program managers and data users, decisions are required about the 'disappearance' of an environmental constituent (e.g., as we move from the centre to the edge of a contaminated site). They know it is there. At what point does it merge with the surrounding background and how far has it spread? Ultimately, some of these contaminants become unmeasurable. Detection limits and data reporting practices were developed at a time when we measured only things which were high enough to measure.

On the other hand, drinking water treatment engineers are required to provide the 'purest' possible product.
They are often required to test for contaminants which previous experience suggests are unlikely to be found at measurable levels. (As measurement analysts we are expected to have tried diligently to measure at levels below those considered to represent a potential health impact.)

Statistical decisions are based on awareness that well-controlled measurements follow the normal distribution (figure 1). It is more likely to observe results close to the average than far away from the average. Given a result R one can predict that the average will be closer rather than farther from the result, and one can predict how different the average might be from the observed result (figure 2). These confidence limits and confidence intervals are fairly straightforward and easily appreciated.

Statistical decisions are another matter: they introduce the concept of the null hypothesis and the logic of 'double negatives'. If the test fails to meet a specific criterion then we can't reject the preliminary assumption without an increased risk of making an incorrect decision. Decisions involve risk: statistical techniques provide a basis for estimating the risk.

In statistical decision making, when we know or suspect something to be true, then we MUST choose the opposite possibility as the 'null' hypothesis. In this way the anticipated positive measured result will normally lead us to reject (correctly) the (unlikely) null hypothesis. There are two undesirable outcomes. Type I decision error: we accept our understanding of the truth when it was actually not true. Type II decision error: we reject the truth because we were misled by our data.

A Type I 'false positive' decision error occurs when we reject a 'null' hypothesis (the 'lie') when it was really true (i.e., we become convinced that what we thought was true has been demonstrated to be true when in reality it was never true to start with). Alpha is the risk of making this decision error.
We reduce this risk of a 'false positive decision' by setting alpha at 0.05 or 0.01. BUT by reducing this risk we delay recognizing that our previous understanding of the 'truth' has been substantiated by the data.

A Type II error occurs when we fail to reject the 'null' hypothesis even though it is false. Beta is the risk, given that our previous understanding of the 'truth' was correct, that our data will fail to substantiate that 'truth' (because we set alpha too low, i.e., the decision criterion too stringent). Beta represents the risk of rejecting the truth.

If the construction of a multi-million dollar waste treatment facility depends on our single measurement being above or below some critical operational value (e.g., an effluent quality limit), we prefer to hold off until we get more data. On the other hand, one could argue that the potential presence of a hazardous contaminant, revealed by an improvement in measurement technology, should not be arbitrarily rejected without additional confirmation.

But analyst concern about the misuse of low-level data has led to the practice of using statistical principles to defend the 'censoring' of low-level data. While there might be some justification for setting reporting limits (RL) at the laboratory's MDC value or even higher when we are only concerned about exceeding some measurable health or water quality guideline, the following discussion will demonstrate that the practice of automatically setting RL at 2 to 5 times MDC is statistically indefensible.

Statistical protocols provide a tool for assessing the risk in drawing a particular conclusion. They cannot be applied if the data is not reported, or if the performance capability expected or observed is unavailable to the data user.

ANALYTICAL DECISION POINTS: CD and DL

Our present use of the terms CD and DL is derived, in part, from A.L. Wilson.
Given an estimate of low-level standard deviation (Sw), based on replicate measurements performed within the same batch/run by a single analyst, Wilson recommended setting CD at 1.64 Sw assuming that:
- Sw is estimated from a large number of replicate measurements;
- a Type I error risk of 5% was acceptable (see discussion later).

Wilson set DL at 2 CD (i.e., 3.29 Sw). The basis for this is discussed below. Wilson also discussed the incorporation of variability due to blank estimates. Since this blank issue complicates the discussion, and since most analytical results already incorporate the blank estimate, the following assumes that Sw already includes the impact of the 'blank' correction.

In North America, the corresponding terms typically encountered are MDL (Method Detection Limit) and LOQ = 2 MDL (Limit of Quantitation). MDL is set at 3 Sw assuming that:
- Sw is based on 8 replicate measurements, i.e., 7 degrees of freedom (df);
- a Type I error risk of <1% is necessary (see discussion later).

Thus, based on Wilson's definitions, MDL is actually a CD, and LOQ is actually a DL. In the following discussion we will assume that CD = 3 Sw, although any other factor could be used.

• CD is the level above which it is unlikely to observe a result unless analyte is actually measurably present in the sample;
• CD is a factor t times the repeatability Sw (t depends on df and the preselected risk alpha);
• a reliable estimate of Sw requires that measurements be made in increments less than Sw;
• a measurement exceeding Sw/4, Sw/2, or Sw suggests presence of analyte (the alpha risk is roughly 40%, 30%, and 15% respectively);
• a measured result of about Sw/2 or higher represents a 'measurable' result indicating possible 'presence' (but it is certainly not a quantitative estimate).

ALTERNATIVE NULL HYPOTHESES

The traditional environmental analytical literature uses the concepts of CD (MDL) and DL (LOQ) to distinguish the two levels at which statistical decision points have been set.
The first (CD) is presumably set to minimize the risk (alpha) of deciding that something has happened when it has not (e.g., to conclude that analyte is present in the sample when it is actually not present). The second (DL) is described in the literature as being selected (based on a preset CD) to minimize the risk (beta) of deciding that something has not happened when it has (e.g., that analyte is not present in the sample because the analyst has reported 'less-than CD' although analyte was actually present). In traditional detection contexts, beta indicates the likelihood of obtaining a result below CD when the sample contains a level of DL.

When a certain fact is 'very likely to be true', statistical theory requires us to adopt the hypothesis that it is 'not true'. This ensures that we will normally reject the null hypothesis, and therefore accept the 'truth'. Traditional detection concepts have been developed on the basis that we know and fully expect the analyte is present at easily measurable levels. Therefore:

a) The analyst's perspective always adopts the following 'null' hypothesis (a lie): the sample does not contain a measurable level of analyte. If this hypothesis were true, levels above CD are statistically unexpected. In fact, based on previous or alternative sources of information, we know that measurements above CD are extremely likely (e.g., calcium in water, phosphate in wastewater).

b) The client's perspective adopts the so-called 'alternative' hypothesis (the truth): the sample does contain the analyte, preferably at a level above DL (2 CD). If this is true then measurements below CD will be avoided. The client wishes to avoid data in this range because it is typically censored. If there is concern that the actual level of analyte will fall below the analyst's reporting limit, the client has the option to expand the sampling design to incorporate replicate samples, or replicate measurements (or both).
Of course, if the data is not going to be reported, all this extra effort will be wasted.

In principle CD must be determined before we can set DL. To estimate CD we estimate the within-run standard deviation Sw, select a risk alpha, and then look up the critical Student's t-factor. Thus, the MDL as defined by the US-EPA can be set at 2.998 times Sw based on 8 replicate measurements and a risk level alpha of <1%. In principle DL and CD can be set anywhere in relation to each other. Thus:

• If DL is set exactly equal to CD then the risk beta of a result below CD is 50%. If the reporting limit is then set at CD, a Type II decision error can occur 50% of the time.
• If DL is set at 2 CD then the risk of a result below CD is <1%. A Type II error will occur less than 1% of the time, so long as results at or above CD are reported. But if the reporting limit is adjusted to DL = 2 CD then beta remains at 50%.

In general practice DL is set at 2 CD. But 3 Sw can be interpreted to be either a CD or a DL. And some of the confusion in terminology arises from the fact that the value 3 Sw can be obtained by preselecting various combinations of alpha and beta risk, taking care to estimate Sw from the corresponding number of replicate measurements. The following optional mechanisms for obtaining a factor of 3 should demonstrate that the common statement 'my detection limit is 3 Sw' carries no specific statistical interpretation. The following are only a few of the possible interpretations of the value 3 Sw:

• a CD with a risk level of alpha <1% (df = 7) (null hypothesis that sample mean is located at 'zero', tail facing upwards);
• a CD with a risk level of alpha of about 0.1% (df = large, e.g., >60) (null hypothesis that sample mean is located at 'zero', tail facing upwards);
• a DL with a risk level of alpha <1% and beta = 50% (7 df) (see note below) (null hypothesis that sample mean is located at 'zero', with the alternate hypothesis that the sample mean is located at CD, tail facing downwards to 'zero');
• a DL with alpha <7% and beta <7% (large degrees of freedom), or any other combination of alpha and beta risk levels at appropriate degrees of freedom which yields a factor equal or close to 3!

Note: Beta varies depending on how data is reported/censored. The proper interpretation of low-level data requires that all data be reported and that the CD value be available to the client. And keep in mind the analyst's warning that all bets are off if the sources of determinate error are not controlled, or the identity of the analyte cannot be confirmed.

The above statements were developed in the context that the 'null' hypothesis is that analyte is 'absent'. But there are two alternatives in environmental sample measurement:

a) The analyte is generally known to be present at measurable levels (e.g., major ions, conventional parameters, some trace metals, etc.).

b) The analyte is generally known or strongly suspected to be absent, or only present at barely measurable levels (e.g., mercury in water, many organic pollutants, ultra-trace elements, etc.).

It is only relatively recently that we have become interested in the quality and interpretation of so-called 'less-than' data for analytes we 'very much suspect' to be absent (i.e., not measurable). [For some analysts the term 'measurable' implies the amount present to be above the quantitation level.]

For statistical correctness: when we suspect that analyte is 'absent', we must hypothesize it to be 'present'. The anticipated 'zero' result will allow us to reject the 'null' hypothesis, and conclude that the analyte is 'absent'.
(Of course this hypothesis requires us to state a level such as CD to represent 'present'. The statement 'absent' is then true in the context that the 'null' hypothesis was 'present at a level of CD'.)

DETECTION HYPOTHESIS

Case a) The analyte is known or strongly suspected to be present:

NOTE: We MUST first assume the 'null' hypothesis = absence (sample content of analyte is less than about Sw/4). [Given that we must be able to measure in units smaller than the value of Sw in order to estimate it reliably, it is certainly feasible to measure at levels below some factor times Sw.]

Therefore, given results in the vicinity of Sw or higher, then:
• given a normal distribution and a one-tailed test (facing upscale),
• we choose a critical level or decision point CD (e.g., 3 Sw), and
• if the result R is at or above CD we will reject the null hypothesis,
• so we conclude that analyte is 'present' (no surprise!);
• the statistical risk that analyte is not present is 'alpha' (when CD = 3 Sw, alpha is set for 0.01 or 1% risk or less; CD could be set at Sw, for a statistical risk of about 15%);
• for analytes 'known' to be present the real risk is negligible.

In theory, when the level CD is set lower, the risk (alpha) of Type I (false positive) decision error increases (i.e., the decision to reject the null hypothesis has a higher risk of being wrong). BUT, given that we already 'know' that the analyte is likely present (that's why we chose the null hypothesis of 'absent'), it is quite likely that results R below CD will be obtained from samples which contain between Sw/4 (not zero) and R + CD.

This fact leads to the 'alternative to null hypothesis', namely that the sample contains CD (knowing that it probably contains more). If we examine the lower tail of a normal distribution centred on CD, we note that:
- a result below CD will be observed 50% of the time (i.e., beta = 0.5 or 50%);
- a result below about Sw/4 will be observed less than (100 alpha)% of the time.
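The two tail statements above for Case a can be checked numerically. The sketch below works in units of Sw, assumes a normal distribution with known Sw (the large-df case, so it does not reproduce the 7-df Student's t figures), and takes the alternative hypothesis from the text: the sample actually contains CD. It then asks how often a single result would fall below each possible reporting limit. The variable and function names are illustrative.

```python
from statistics import NormalDist

SW = 1.0          # work in units of Sw
CD = 3.0 * SW     # criterion for detection
DL = 2.0 * CD     # detection limit


def fraction_below(limit, true_level, sd=SW):
    """Probability that a single result falls below `limit` (normal model)."""
    return NormalDist(mu=true_level, sigma=sd).cdf(limit)


def fraction_above(limit, true_level, sd=SW):
    """Probability that a single result falls above `limit` (normal model)."""
    return 1.0 - fraction_below(limit, true_level, sd)


# Null hypothesis (sample at 'zero'): alpha for two choices of critical level.
alpha_cd_at_sw = fraction_above(SW, 0.0)
alpha_cd_at_3sw = fraction_above(CD, 0.0)

# Alternative hypothesis (sample contains CD): beta under three reporting rules.
beta_all_reported = fraction_below(SW / 4, CD)    # everything above ~Sw/4 reported
beta_censored_at_cd = fraction_below(CD, CD)      # results below CD withheld
beta_censored_at_dl = fraction_below(DL, CD)      # results below DL withheld

if __name__ == "__main__":
    print(f"alpha, CD set at Sw    : {alpha_cd_at_sw:.4f}")
    print(f"alpha, CD set at 3 Sw  : {alpha_cd_at_3sw:.4f}")
    print(f"beta, all data reported: {beta_all_reported:.4f}")
    print(f"beta, RL = CD          : {beta_censored_at_cd:.4f}")
    print(f"beta, RL = DL          : {beta_censored_at_dl:.4f}")
```

Under these assumptions the Type II risk is exactly one half when results below CD are withheld and exceeds 99% when the reporting limit is raised to DL, while it stays below 1% when all results are reported.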
BETA RISK IS THE RISK OF REJECTING THE TRUTH

Failure to report results below CD WILL cause the uninformed data user to conclude that the analyte was not present. Given previous knowledge of probable presence, this decision has a high probability (50% risk) of Type II (false negative) decision error. When we set the reporting limit at an even higher level (as generally practised) we INCREASE this beta risk to virtual certainty.

DETECTION HYPOTHESIS

Case b) The analyte is known or strongly suspected to be absent:

NOTE: We MUST first assume the 'null' hypothesis = presence. [Given that we cannot infer presence (when the null hypothesis is 'absent') until results R exceed CD, we can also appreciate that samples containing less than CD will rarely give a result greater than 2 CD.]

Therefore we can decide that 'absent' in the sample will mean that the sample actually contains not more than DL = 2 CD, and:
• given a normal distribution and a one-tailed test (facing toward zero),
• if the result R is below CD we will reject the null hypothesis,
• so we conclude that analyte is not present at levels above DL.
• the statistical risk that analyte is 'present and >DL' is 'alpha'.
• for analytes 'known' to be absent or certainly ' limits for those parameters above RMDL, it was found that this low-level data was critical to the evaluation of the likely presence/absence of analytes when results were observed in the vicinity of RMDL. And the need to report such data appears to have had a positive impact on the management of method blank and related baseline data.

LOW-LEVEL DATA QUALIFIERS

The Ontario Ministry of the Environment laboratories currently provide data qualifiers (remark codes) for use when reporting low-level data.
These include:

98% of results
• the fraction of results in the extreme tail depends on the reliability of Sw
• if Sw is based on 8 replicates: tail includes <1% of data
• if Sw is based on 100 replicates: tail includes 0.1% of data

DECISIONS AND RISK CONFIDENCE LEVELS

Given a result 'R', the upper and lower limits for sample concentration can be predicted to be:
• not greater than R + 3 Sw (upper tail risk is <1%)
• not less than R - 3 Sw (lower tail risk is <1%)

Figure 3

SAMPLE EXPECTED TO CONTAIN ANALYTE

Null Hypothesis: sample contains 'zero'
• if a result exceeds CD, we have confirmed the 'truth': the risk (alpha) that sample = absent is less than 1%
• about 16% of results are expected to be above Sw
• <1% of results are expected to be above 3 Sw = CD

BUT: the sample is expected to contain > CD
• results below CD are expected to occur
• the risk of results below W is slight

Alternate Hypothesis: sample contains CD
• 50% of results will fall below CD
• beta risk = RISK OF REJECTING TRUTH (high if data censored)
• if results below DL are not reported: beta risk >99% !!
• if results below CD are not reported: beta risk is 50% !
• if all results are reported: beta risk is <1%

Figure 4

SAMPLE DOES NOT CONTAIN ANALYTE

Null Hypothesis: sample contains 'DL'
• the statistical risk of results below CD is <1%
• if result R is below CD, then sample content is < R + CD (upper confidence interval)

BUT: the sample is expected to contain 'zero'
• the actual risk of results > CD is <1% (unless blank error or contamination)

Alternate Hypothesis: sample contains CD
• beta risk = RISK OF REJECTING TRUTH (negligible)
• statistically, 50% of results will fall above CD
• in fact, there is little risk of results above CD
• if all results are reported down to 'zero': alpha risk <0.1%
• if results below CD are not reported: alpha risk <1%
• IF RESULTS BELOW DL ARE NOT REPORTED: alpha risk = 50%
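The risk figures in the panels 'SAMPLE EXPECTED TO CONTAIN ANALYTE' and 'SAMPLE DOES NOT CONTAIN ANALYTE' can be reproduced with a short calculation. This is a minimal sketch, assuming normally distributed results with a known Sw, and treating the risk under censoring as the chance that a result falls below the reporting limit. All function names are hypothetical and chosen for illustration.

```python
from statistics import NormalDist

def reference_points(sw):
    """Generic reference points expressed as multiples of Sw."""
    return {"CD": 3 * sw, "DL": 6 * sw, "QL": 12 * sw}

def concentration_limits(result, sw):
    """Lower and upper limits (R - 3 Sw, R + 3 Sw) for sample content."""
    return (result - 3 * sw, result + 3 * sw)

def prob_below(limit, true_level, sw):
    """Chance that a result falls below `limit` (and so goes unreported
    if `limit` is the reporting limit) when the sample contains `true_level`."""
    return NormalDist(true_level, sw).cdf(limit)

sw = 1.0
pts = reference_points(sw)
cd, dl = pts["CD"], pts["DL"]

# Sample expected to contain analyte (alternate hypothesis: sample contains CD).
beta_censor_dl = prob_below(dl, cd, sw)    # results below DL not reported
beta_censor_cd = prob_below(cd, cd, sw)    # results below CD not reported
beta_all = prob_below(0.0, cd, sw)         # all results reported down to zero

# Sample does not contain analyte (null hypothesis: sample contains DL).
alpha_censor_dl = prob_below(dl, dl, sw)
alpha_censor_cd = prob_below(cd, dl, sw)
alpha_all = prob_below(0.0, dl, sw)

print("beta  (censor at DL, CD, none):", beta_censor_dl, beta_censor_cd, beta_all)
print("alpha (censor at DL, CD, none):", alpha_censor_dl, alpha_censor_cd, alpha_all)
print("limits for R = 2.0, Sw = 0.5:", concentration_limits(2.0, 0.5))
```

Under these assumptions the beta risk is above 99% when results below DL are withheld, 50% when results below CD are withheld, and below 1% when everything is reported, which matches the pattern shown in the panels.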