The Mysterious Case Of The Lost Pages: Goodness-Of-Fit

Luis Zul
6 min read · Feb 23, 2020


This is one in a series of articles to demystify some common techniques and practices in data science. Please support with claps, a follow or a comment down below!

When we observe two different sets of data, it is very tempting to assume that they are independent groups, i.e., that they’re not related to one another. If we make this assumption, we can easily analyze each group individually, in isolation, without evaluating potential relationships between them. Is this a good practice?

The answer is yes and no. It all comes down to our knowledge of the phenomena we’re measuring. Every assumption we make as data scientists needs to be backed by evidence and tests. Today, we put one such assumption to the test using a goodness-of-fit test.

Where in the world did these pages come from?

Let’s imagine we have three pages. We don’t know if the pages are from the same book or from separate books. In reality, two of them come from the same book, a biology book, and the third page comes from a cookbook. However, we don’t know that yet, and we’re going to work it out now.

Our hypothesis: all three pages come from the same book. Emphasis on hypothesis; we may either accept or reject this premise. Now back to the mysterious book pages. What information could we possibly use from the pages to test our hypothesis? A wizard suddenly appears!

Woosh! Woosh! Here are some tips!

The cute wizard gives us two ominous yet convenient clues we can check:

  • Word Length
  • Unique Words

In the first part of this mystery, we’ll visit goodness-of-fit for the lengths of the words on the mystery pages.

But what is goodness-of-fit? As the name suggests, it measures how well a sample or test function matches our expectations of what it should look like, i.e., another sample or a reference function. Today, that’s what we’ll find out through the Kolmogorov-Smirnov test (KS for short).

What is the Kolmogorov-Smirnov Statistic?

KS measures the distance between the Cumulative Distribution Functions (CDFs) of two samples. The CDF tells us how likely it is, if we pick a value at random, to get one at or below a given point. What makes the CDF special is that it accumulates the previous values as it moves along. As a result, it is less sensitive to noisy data. What does this mean?

Let’s look at two toy Probability Density Functions (PDFs), which map a given value to how often it appears in our data:

The differences I introduced between Sample 1 and Sample 2 were:

  • In the Sample 2 PDF, there are some missing points (value 0), meaning there were no words of that length in that sample.
  • Non-zero values in the Sample 1 PDF begin at word length 0, whereas in the Sample 2 PDF they start at word length 2.
  • The Sample 1 PDF dives to 0 at word length 10, whereas the Sample 2 PDF continues.

If we were to compare the two PDFs point by point at each word length, the relative distance at each point would be large, especially around word length 10 in the previous graph, and we would conclude that the two samples don’t share a similar probability distribution. However, in real-life applications, samples have bias, manifested through the two issues below (a toy sketch follows the list):

  • Missing information for some variable values, like the zeroes in the Sample 2 PDF
  • Shifted probability functions, i.e., PDFs that share a similar shape but differ by a slight offset, as seen where the values in the Sample 2 PDF pick up after the Sample 1 PDF
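To make this concrete, here is a minimal sketch with made-up word-length probabilities (toy numbers of my own, not the article’s data) shaped like the two samples described above. A point-by-point comparison of the PDFs looks dramatic wherever one sample has a missing (zero) value or the two are slightly offset:

```python
import numpy as np

# Toy word-length PDFs (hypothetical numbers). Sample 1 starts at length 0 and
# dives to 0 at length 10; Sample 2 starts at length 2, has missing (zero)
# entries, and keeps going past length 10. Each array sums to 1.
pdf_1 = np.array([0.02, 0.08, 0.15, 0.20, 0.18, 0.14, 0.10, 0.07, 0.04, 0.02, 0.00, 0.00])
pdf_2 = np.array([0.00, 0.00, 0.10, 0.18, 0.00, 0.20, 0.16, 0.00, 0.14, 0.10, 0.08, 0.04])

# Point-by-point distances between the PDFs: large at the missing values and at
# the offset, even though the overall shapes are not that different.
print(np.abs(pdf_1 - pdf_2))
```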

Performing any mechanical or statistical comparison of two samples as-is is highly sensitive to bias and noise. We need a strategy that relaxes the comparison while still giving us a degree of confidence. We need the Cumulative Distribution Function (CDF):

The relative difference between the CDFs appears significantly smaller than between their source PDFs. And if we compare the Mean Squared Error (MSE), we find that after normalizing the PDF values (so that their peak matches the CDF’s maximum in magnitude), the CDF MSE of 0.04 beats the normalized PDF MSE of 0.2 (notebook). Great, now let’s obtain the lost pages’ empirical CDFs (EDFs for short):

PDF and EDF for each of the unknown fragments
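As a quick sanity check, we can run the same kind of MSE comparison on the toy samples from the earlier sketch (again, made-up numbers, not the fragments). The cumulative sums wash out the missing values and the offset, so the MSE between the CDFs comes out much smaller than the MSE between the normalized PDFs:

```python
# Continuing the toy pdf_1 / pdf_2 arrays from the previous sketch.
cdf_1 = np.cumsum(pdf_1)
cdf_2 = np.cumsum(pdf_2)

def mse(a, b):
    return np.mean((a - b) ** 2)

# Normalize the PDFs so their peak matches the CDF's maximum (1.0) before comparing.
pdf_mse = mse(pdf_1 / pdf_1.max(), pdf_2 / pdf_2.max())
cdf_mse = mse(cdf_1, cdf_2)
print(pdf_mse, cdf_mse)  # on these toy numbers the CDF MSE is several times smaller
```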

Beautiful. Back to the fragments: what did we do (notebook)? Step by step, with a short sketch after the list:

  1. Converted the words into lengths ✅
  2. Counted how many times a particular length appeared in our samples ✅
  3. Converted this to a PDF by mapping each word length to a probability value based on those counts ✅
  4. Converted this to an empirical CDF, or EDF for short, by mapping each word length to its own PDF value plus the accumulated probability of all shorter lengths ✅
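Here is a minimal sketch of those four steps (hypothetical helper name and toy text, not the article’s notebook):

```python
import numpy as np

def word_length_edf(text, max_len=20):
    """Steps 1-4: words -> lengths -> counts -> PDF -> empirical CDF (EDF)."""
    lengths = [len(w) for w in text.split()]               # 1. words to lengths
    counts = np.bincount(lengths, minlength=max_len + 1)   # 2. count each length
    pdf = counts / counts.sum()                            # 3. normalize into a PDF
    edf = np.cumsum(pdf)                                   # 4. running sum gives the EDF
    return pdf, edf

# Toy fragments standing in for the mystery pages.
pdf_a, edf_a = word_length_edf("the cell membrane regulates what enters and leaves the cell")
pdf_b, edf_b = word_length_edf("whisk the eggs with sugar until the mixture is pale and fluffy")

# An MSE-style comparison of the EDFs, as discussed next.
print(np.mean((edf_a - edf_b) ** 2))
```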

If we used MSE, we would conclude that all fragments (pages) come from the same book. Is this true?

Here comes the wizard with another tidbit!

Not quite. 🐱

We need a more granular comparison. After all, we don’t want to relax our comparison too much. While keeping it resilient to bias, we would ideally use a different statistic than MSE. Luckily, KS has us covered! Present in most Python libraries, like SciPy, KS yields a D statistic and a p-value when comparing two samples:

In other words, D is the largest difference, at any single point, between the EDFs of the two compared samples.
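In SciPy, the two-sample test is scipy.stats.ks_2samp. A rough sketch with toy word-length samples (not the real fragments):

```python
from scipy.stats import ks_2samp

# Made-up word lengths for two fragments.
lengths_fragment_1 = [3, 4, 4, 5, 6, 6, 7, 8, 9, 10]
lengths_fragment_2 = [2, 3, 4, 5, 5, 6, 7, 7, 8, 11]

# ks_2samp returns the D statistic (largest EDF gap) and its p-value.
d_stat, p_value = ks_2samp(lengths_fragment_1, lengths_fragment_2)
print(d_stat, p_value)
```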

The p-value is a little more nuanced (and not within the scope of this article); it is derived from the D statistic above. Fortunately for us, the p-value is calculated by our lovely libraries. Here are the results (notebook):

The p-value is often loosely read as the probability that the pages come from the same book; strictly, it tells us how plausible a difference this large would be if they did. However, what does accepting this result really mean? Simply that word lengths in fragment 1 are distributed in the same manner as in fragments 2 and 3. Is the frequency of short or long words a determining factor for identifying the pages?

The short answer is no. This is a classic example of a feature that isn’t able to capture the complete variability of the problem domain. Will the identity of these pages always be shrouded in doubt and uncertainty?

Find out in the next issue!

Please be kind and leave a clap, comment, or share with your stats-savvy friends! It helps improve this content and reach more people!

