Kolmogorov-Smirnov (KS) Test: Quick Guide to Normality

Imagine you are conducting research on customer wait times at a restaurant. You need to analyze this data statistically, but first, you must determine if it follows a normal distribution to decide whether to use parametric or non-parametric tests.

You’ve already heard of the Shapiro-Wilk Test, but your dataset is large, so you decide to use the Kolmogorov-Smirnov (KS) Test, a powerful method for comparing your dataset against a theoretical distribution.

This blog will explain everything you need to know about the KS Test—its history, theory, hypotheses, and practical applications with real-life examples and implementation in Python, R, and SPSS.

History of the Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov (KS) Test is named after two Russian mathematicians, Andrey Kolmogorov and Nikolai Smirnov, who developed it in the 1930s. Originally, it was designed to compare a sample’s distribution with a reference distribution (one-sample KS test) or compare two different samples (two-sample KS test).

Unlike many other normality tests, the KS Test is non-parametric, meaning it does not assume any specific data distribution. This makes it versatile for various applications, including finance, medicine, and machine learning.
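As a quick illustration of the two-sample variant mentioned above, SciPy's `ks_2samp` compares the ECDFs of two samples directly, with no reference distribution needed. The lunch/dinner wait-time data below are simulated purely for illustration:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical wait times (minutes) for two service periods
lunch = rng.normal(loc=6, scale=1.5, size=200)
dinner = rng.normal(loc=8, scale=1.5, size=200)

# Two-sample KS test: do the two samples come from the same distribution?
stat, p = ks_2samp(lunch, dinner)
print(f"D={stat:.3f}, p={p:.4g}")
```

A very small p-value here indicates the two service periods follow different wait-time distributions.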

Figure: Illustration of the Kolmogorov-Smirnov test.

Theory Behind the KS Test

The KS Test is based on the concept of the empirical cumulative distribution function (ECDF), which represents the proportion of observations less than or equal to a given value. It measures the largest difference (D-statistic) between the observed ECDF and the expected theoretical cumulative distribution function (CDF).

  • If this difference (D) is small, the sample is likely from the theoretical distribution.
  • If this difference is large, the sample significantly deviates from the theoretical distribution.

Mathematically, the test statistic is:

D_n = sup_x | F_n(x) − F(x) |

where:

  • F_n(x) = Empirical cumulative distribution function (ECDF) of the sample
  • F(x) = Theoretical cumulative distribution function (CDF)
  • sup = The supremum (maximum difference)
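To make the formula concrete, the D-statistic can be computed by hand from the ECDF. The sketch below uses the restaurant wait-time data introduced later in this post and fits a normal CDF using the sample mean and standard deviation (an illustrative choice), then cross-checks against SciPy's built-in implementation:

```python
import numpy as np
from scipy.stats import norm, kstest

data = np.array([3, 5, 7, 9, 4, 6, 8, 5, 7, 6, 4, 7, 8, 6, 5, 6, 7, 8, 5, 6])
x = np.sort(data)
n = len(x)

# ECDF evaluated just after (hi) and just before (lo) each sorted point,
# since the largest gap can occur on either side of a step
ecdf_hi = np.arange(1, n + 1) / n
ecdf_lo = np.arange(0, n) / n

# Theoretical normal CDF with parameters fitted to the sample
cdf = norm.cdf(x, loc=x.mean(), scale=x.std())

# D is the largest vertical distance between the ECDF and the CDF
D = max(np.max(ecdf_hi - cdf), np.max(cdf - ecdf_lo))
print(f"manual D = {D:.4f}")

# Cross-check against SciPy on the standardized data
z = (data - data.mean()) / data.std()
print(f"scipy  D = {kstest(z, 'norm').statistic:.4f}")
```

Both values agree, because standardizing the data and testing against the standard normal is equivalent to testing the raw data against a normal with the fitted mean and standard deviation.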

Hypotheses of the KS Test

The KS Test evaluates whether a dataset follows a specific distribution (usually normal).

  • Null Hypothesis (H₀): The sample data follows the specified distribution.
  • Alternative Hypothesis (H₁): The sample data does not follow the specified distribution.

If the p-value > 0.05, we fail to reject the null hypothesis, meaning the data are consistent with the assumed distribution. If p ≤ 0.05, we reject the null hypothesis, meaning the data significantly deviate from the assumed distribution.

Real-Life Example: Testing Wait Times at a Restaurant

Imagine you are a restaurant manager analyzing customer wait times to optimize staffing. You collected the following data (in minutes):

Dataset: [3, 5, 7, 9, 4, 6, 8, 5, 7, 6, 4, 7, 8, 6, 5, 6, 7, 8, 5, 6]

You want to determine if these wait times follow a normal distribution before applying statistical methods.

Kolmogorov-Smirnov Test Implementation

Python Implementation

from scipy.stats import kstest
import numpy as np

wait_times = np.array([3, 5, 7, 9, 4, 6, 8, 5, 7, 6, 4, 7, 8, 6, 5, 6, 7, 8, 5, 6])

# Standardize to mean 0 and std 1 (the KS test is sensitive to distribution parameters)
wait_times_norm = (wait_times - np.mean(wait_times)) / np.std(wait_times)

# Apply the one-sample KS test against the standard normal distribution
stat, p = kstest(wait_times_norm, 'norm')
print(f'Statistic={stat:.3f}, p={p:.3f}')

Result: if the reported p-value is greater than 0.05 (say, 0.12), we fail to reject the null hypothesis, and the data are consistent with a normal distribution.

R Implementation

wait_times <- c(3, 5, 7, 9, 4, 6, 8, 5, 7, 6, 4, 7, 8, 6, 5, 6, 7, 8, 5, 6)

ks.test(wait_times, "pnorm", mean = mean(wait_times), sd = sd(wait_times))

If p > 0.05, the data are consistent with a normal distribution. (Note that estimating the mean and SD from the same sample makes the standard KS p-value conservative; see the Lilliefors test below.)

SPSS Implementation

  1. Enter Data: Input the wait times in SPSS.
  2. Run the KS Test:
    • Click Analyze → Nonparametric Tests → Legacy Dialogs → 1-Sample K-S.
    • Select Wait Times as the test variable.
    • Choose Normal as the test distribution.
    • Click OK.
  3. Interpret Results: If p > 0.05, the data is normally distributed.

Alternative Tests for Normality

If the KS Test gives inconclusive results, consider these alternatives:

  • Shapiro-Wilk Test (for small datasets)
  • Anderson-Darling Test (more sensitive than KS)
  • Lilliefors Test (KS test adaptation for unknown parameters)
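The first two alternatives are both available in SciPy; here is a minimal sketch applying them to the same wait-time data. Note that `anderson` does not return a p-value but a statistic with critical values at fixed significance levels:

```python
import numpy as np
from scipy.stats import shapiro, anderson

wait_times = np.array([3, 5, 7, 9, 4, 6, 8, 5, 7, 6, 4, 7, 8, 6, 5, 6, 7, 8, 5, 6])

# Shapiro-Wilk: well suited to small samples, returns a p-value directly
w_stat, w_p = shapiro(wait_times)
print(f"Shapiro-Wilk: W={w_stat:.3f}, p={w_p:.3f}")

# Anderson-Darling: compare the statistic to tabulated critical values
result = anderson(wait_times, dist='norm')
print(f"Anderson-Darling: A2={result.statistic:.3f}")
for cv, sl in zip(result.critical_values, result.significance_level):
    verdict = "reject" if result.statistic > cv else "fail to reject"
    print(f"  {sl}% level: {verdict} normality")
```

If the two tests disagree with the KS result, the Anderson-Darling verdict is usually the more trustworthy one, since it weights the tails of the distribution more heavily.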

Frequently Asked Questions (FAQs)

1. What is the Kolmogorov-Smirnov Test used for?

It is used to determine whether a dataset follows a specific distribution, such as the normal distribution, or whether two samples come from the same distribution.

2. When should I use the KS Test instead of the Shapiro-Wilk Test?

Use KS for large samples (n > 50) and Shapiro-Wilk for small samples (n < 50).

3. Can the KS Test be used for non-normal distributions?

Yes, it can test any distribution, such as uniform, exponential, or Poisson.

4. What does a high p-value mean in the KS Test?

A high p-value (greater than 0.05) suggests that the data follows the assumed distribution.

5. Is the KS Test sensitive to sample size?

Yes. With large samples, even tiny, practically irrelevant deviations become statistically significant, while small samples may fail to detect real deviations.
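This sensitivity is easy to demonstrate: below, the same mild deviation from normality (a t-distribution with 5 degrees of freedom, which has heavier tails than the normal) is tested at two sample sizes. The data are simulated for illustration:

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(42)

# Same underlying deviation from normality, two sample sizes
small = rng.standard_t(df=5, size=30)
large = rng.standard_t(df=5, size=100_000)

p_small = kstest(small, 'norm').pvalue
p_large = kstest(large, 'norm').pvalue
print(f"n=30:      p={p_small:.3f}")
print(f"n=100000:  p={p_large:.3g}")
```

The large sample decisively rejects normality, while the small sample typically does not, even though both come from the same non-normal distribution.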

6. Can I use the KS Test for non-continuous data?

The classical KS test assumes a continuous distribution. Applied to discrete data or data with many ties, it becomes conservative, so its p-values should be interpreted with caution.

7. What is the D-statistic in the KS Test?

The D-statistic measures the maximum deviation between the observed and theoretical distributions.

8. How do I interpret KS Test results?

If p > 0.05, the data follows the assumed distribution; otherwise, it does not.

9. What are the limitations of the KS Test?

When distribution parameters (such as the mean and standard deviation) are estimated from the sample itself, the standard KS p-values are too conservative; the Lilliefors correction addresses this. The test is also less sensitive to deviations in the tails than the Anderson-Darling test.

10. Is the KS Test suitable for skewed data?

The KS test does not assume symmetry: it can compare data against any fully specified continuous distribution, including skewed ones such as the exponential. For detecting non-normality in skewed data, however, the Shapiro-Wilk or Anderson-Darling tests are usually more powerful.

Conclusion

The Kolmogorov-Smirnov Test is a powerful tool for checking normality, especially for large datasets. By ensuring normality before statistical analysis, you can choose the right test and make more reliable conclusions.
