Testing For Normality

In testing data for normality, the null hypothesis states that the data come from a normal distribution, while the alternative states that the data do not come from a normal distribution.  Two tests for normality, the Chi Square Goodness of Fit test and the Kolmogorov-Smirnov test, are outlined and applied to a data set in this notebook.

First, three Mathematica packages needed to test for normality are loaded.


The data being tested for normality are shown next.


The next statements display a histogram of the data.  First the minimum and maximum data values are found; then, based on these values, counts (BinCounts) of data values in each unit interval from the floor of the minimum value to the ceiling of the maximum value are computed.
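The original Mathematica cells did not survive the conversion of this notebook, so a sketch in Python, using a hypothetical data set in place of the lost one, may help illustrate the binning step:

```python
import math

# Hypothetical sample data; the notebook's original data set did not
# survive the conversion.
data = [0.8, 3.1, 4.7, 5.2, 5.9, 6.4, 6.8, 7.0, 7.3, 7.9,
        8.5, 9.1, 9.6, 10.2, 12.8, 15.6]

lo = math.floor(min(data))   # floor of the minimum value
hi = math.ceil(max(data))    # ceiling of the maximum value

# Count of data values in each unit interval [k, k+1), mirroring BinCounts
counts = [sum(1 for x in data if k <= x < k + 1) for k in range(lo, hi)]
```

With this hypothetical data the counts cover the 16 unit intervals from 0 to 16.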



Next, the mean and standard deviation of the data are computed.  If the data come from a normal distribution, these two numbers are estimators of the mean and standard deviation of that normal distribution.
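Since the original cells are missing, a Python sketch of this step, again with hypothetical data, might look like:

```python
import statistics

# Hypothetical data standing in for the lost notebook data
data = [0.8, 3.1, 4.7, 5.2, 5.9, 6.4, 6.8, 7.0, 7.3, 7.9,
        8.5, 9.1, 9.6, 10.2, 12.8, 15.6]

mu = statistics.mean(data)    # estimator of the normal mean
sd = statistics.stdev(data)   # sample standard deviation (n - 1 denominator)
```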


The Chi Square Goodness of Fit Method

In this method, the observed numbers of cases falling into each of the intervals have been found above.  Now, assuming that the null hypothesis is true, the expected numbers in each of the intervals must be computed.  First the endpoints of the intervals are computed, and then these endpoints are paired to form the intervals for which the expected numbers of cases will be computed under the null hypothesis that the data are normally distributed.


The next function computes the probability, assuming the null hypothesis is true with mean mu and standard deviation sd, of the interval from a to b.
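The lost Mathematica definition can be sketched in Python using the error function, which gives the normal CDF in closed form:

```python
from math import erf, sqrt

def normal_cdf(x, mu, sd):
    # CDF of the Normal(mu, sd) distribution
    return 0.5 * (1 + erf((x - mu) / (sd * sqrt(2))))

def interval_prob(a, b, mu, sd):
    # Probability, under the null hypothesis Normal(mu, sd),
    # of the interval from a to b
    return normal_cdf(b, mu, sd) - normal_cdf(a, mu, sd)
```

For example, `interval_prob(-1, 1, 0, 1)` returns the familiar probability of about 0.68 that a standard normal value falls within one standard deviation of the mean.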


The function just defined is applied to the intervals to find the probabilities, and then the expected number of cases falling into each of the intervals is found.


A formula for computing the chi-square value is given next, and applied to the information compiled above.
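The chi-square formula can be sketched as follows; the observed and expected counts below are hypothetical, since the notebook's actual values were lost in conversion:

```python
# Hypothetical observed and expected counts, for illustration only
observed = [2, 5, 9, 12, 9, 5, 2]
expected = [1.8, 5.4, 9.3, 11.0, 9.3, 5.4, 1.8]

# chi-square = sum over intervals of (observed - expected)^2 / expected
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```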


The p-value corresponding to this chi-square value, in a chi-square distribution with degrees of freedom equal to the number of intervals-1-2=16-1-2=13, is approximately 0, so the null hypothesis must be rejected.  However, there is one problem with the way the test was done.  The statistic chi-square = sum over intervals of (observed - expected)^2 / expected should only be used when the expected value in each of the categories is greater than 2.  This is clearly not the case in the above example.  To remedy this, the last 6 categories are collapsed into one and the new chi-square value is computed.


The next line defines a function that sums the last k entries in a list and replaces those entries by their sum.
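A Python sketch of such a collapse function (the original Mathematica definition was lost in conversion):

```python
def collapse(values, k):
    # Replace the last k entries of the list by their sum
    return values[:-k] + [sum(values[-k:])]
```

For example, `collapse([2, 5, 9, 1, 1, 1], 3)` yields `[2, 5, 9, 3]`, merging the three sparse trailing categories into one.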


The collapse function is applied to the observeds and expecteds to form newobserveds and newexpecteds, and then the chi-square value is calculated for these new lists.  This calculated chi-square has 10-1-2=7 degrees of freedom.  The p-value is then found from a chi-square table.


The p-value corresponding to a calculated chi-square of 46.4 with 7 degrees of freedom is still approximately 0.  Thus the null hypothesis is rejected.
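The claim that the p-value is still approximately 0 can be checked without a table.  The sketch below (Python rather than the notebook's Mathematica, since the original cells were lost) approximates the upper-tail probability of a chi-square distribution by numerical integration of its density:

```python
import math

def chi2_pdf(x, k):
    # Density of the chi-square distribution with k degrees of freedom
    return x ** (k / 2 - 1) * math.exp(-x / 2) / (2 ** (k / 2) * math.gamma(k / 2))

def chi2_pvalue(x, k, upper=400.0, steps=80000):
    # P(X >= x), approximated by trapezoidal integration of the density
    # from x out to a cutoff beyond which the tail mass is negligible
    h = (upper - x) / steps
    total = 0.5 * (chi2_pdf(x, k) + chi2_pdf(upper, k))
    for i in range(1, steps):
        total += chi2_pdf(x + i * h, k)
    return total * h

# p-value for the collapsed table's chi-square of 46.4 with 7 degrees of freedom
p = chi2_pvalue(46.4, 7)
```

The result is on the order of 10^-7, confirming that the null hypothesis is rejected.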

The Kolmogorov-Smirnov Method

In the chi-square goodness of fit method, the number of intervals could be chosen in various ways.  In the above example we chose to consider unit intervals from 0 to 16.  Would the result have been different if intervals of width 2 or width 4 had been used?  The Kolmogorov-Smirnov (KS) method avoids this problem by using only the given data values.  The idea is to compute the cumulative normal probability function corresponding to the given data, and then compute the 'empirical distribution function' based on the data values.  The maximum difference between these two functions is obtained.  If the probability of a difference this large, under the assumption that the null hypothesis of normality holds, is very small, i.e. the p-value is small, the null hypothesis is rejected.  Otherwise, it is accepted.  The probability is computed based on tables of the KS statistic.

First, the cumulative distribution function, assuming the data come from a normal distribution with the mean and standard deviation computed above, is computed and graphed.  To do this, the CDF is evaluated at each of the sorted datapoints.  The next statement sorts these data and, following that, the CDF is computed at each sorted value.
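A Python sketch of this step, again with hypothetical data in place of the lost notebook data:

```python
from math import erf, sqrt
import statistics

# Hypothetical data standing in for the lost notebook data
data = [0.8, 3.1, 4.7, 5.2, 5.9, 6.4, 6.8, 7.0, 7.3, 7.9,
        8.5, 9.1, 9.6, 10.2, 12.8, 15.6]

mu, sd = statistics.mean(data), statistics.stdev(data)
sorted_data = sorted(data)

# Normal CDF evaluated at each of the sorted data values
normal_cdf_values = [0.5 * (1 + erf((x - mu) / (sd * sqrt(2))))
                     for x in sorted_data]
```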


A plot with points whose x-coordinates are the sorted datavalues and whose y-coordinates are the normal CDF of these sorted datavalues is shown next.  



Next, the empirical CDF for the datapoints needs to be computed.  Given one of the sorted datavalues, the empirical CDF is simply the fraction of datapoints less than or equal to this datavalue.  The empirical CDF is computed at each of the sorted datavalues by the next two Mathematica functions.  A list of these datavalues follows the two functions.
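The empirical CDF computation described above can be sketched in Python as follows (the data are hypothetical, since the originals were lost):

```python
# Hypothetical data standing in for the lost notebook data
data = [0.8, 3.1, 4.7, 5.2, 5.9, 6.4, 6.8, 7.0, 7.3, 7.9,
        8.5, 9.1, 9.6, 10.2, 12.8, 15.6]
sorted_data = sorted(data)
n = len(sorted_data)

def empirical_cdf(x):
    # Fraction of datapoints less than or equal to x
    return sum(1 for v in sorted_data if v <= x) / n

ecdf_values = [empirical_cdf(x) for x in sorted_data]
```

Because these values are distinct, the empirical CDF at the i-th sorted value is simply i/n.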


A graph is made with x-coordinates the sorted datavalues and y-coordinates the calculated empirical cdf values just shown.



The two graphs, the normal cdf and the empirical cdf, are shown together.



The KS statistic is based on the maximum of the differences between the normal cdf and empirical cdf over all of the sorted datavalues.  The p-value is the probability, assuming the null hypothesis is true, of a maximum difference at least as large as the one provided by the next statement.  A function that computes the KS statistic can be found on the internet at http://www.io.com/~ritter/JAVASCRP/NORMCHIK.HTM#KolSmir    Go to that site and use the Java applet to compute the p-value of the KS statistic shown next.
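The KS statistic itself can be sketched in Python; since the empirical CDF jumps at each datapoint, the maximum gap is checked both just before and at each jump (the data are hypothetical, standing in for the lost notebook data):

```python
from math import erf, sqrt
import statistics

# Hypothetical data standing in for the lost notebook data
data = [0.8, 3.1, 4.7, 5.2, 5.9, 6.4, 6.8, 7.0, 7.3, 7.9,
        8.5, 9.1, 9.6, 10.2, 12.8, 15.6]
xs = sorted(data)
n = len(xs)
mu, sd = statistics.mean(data), statistics.stdev(data)

def normal_cdf(x):
    # CDF of the fitted Normal(mu, sd) distribution
    return 0.5 * (1 + erf((x - mu) / (sd * sqrt(2))))

# KS statistic: largest gap between the empirical and fitted normal CDFs,
# evaluated just before and at each jump of the empirical CDF
D = max(max(abs((i + 1) / n - normal_cdf(x)), abs(i / n - normal_cdf(x)))
        for i, x in enumerate(xs))
```

The p-value for D would then be read from a KS table or applet, as described above.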


Converted by Mathematica      May 15, 2001