This is the last part of the series, in which we are going to talk about converging on a representative sample. If you missed the previous parts, please take the time to read them before going further:
By now we can measure parameter variability, but two problems remain: we still don't know how much variability there should be, and we don't know how to take relationships between parameters into account.
Converging on a representative sample
Until now we have been working with the whole set of data (with the population): the Titanic passenger list is finite and small enough to be mined efficiently. But most of the time the situation is quite different. In real projects the population is too big (sometimes even infinite) and/or changes constantly, so it cannot be measured (think of all people as potential customers, or of all orders made since the company was founded). To put it simply, in reality either not all the data is available or there is too much of it. For these reasons, most of the time we have to deal with sample data, i.e. data that represents only some part of the population.
Note. Even if the whole population is available, we still need to divide it into at least two datasets: the training one and the test one. So sampling is necessary. However, as long as the sample is representative, there is no drawback to using it instead of the population.
The problem is how to assess the representativeness of a sample when the population is not available. This problem can be solved with a phenomenon well known (at least to statisticians) as convergence. We have already seen this phenomenon in action in part two, when we drew curves that represented the value distributions in five datasets. If you remember, as the number of rows in a sample grew, the shapes of the curves became more and more alike.
To see what I mean, please execute the following queries multiple times; you will find that the first result varies from 26 to 36 thousand, while the second one stays around 31-33 thousand most of the time:
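(The original queries are not reproduced here; the sketch below only illustrates the idea. It assumes a hypothetical dbo.Customers table with an Income column, so adjust the names to your own data.)

-- Standard deviation of Income in a small random sample (about 100 rows).
SELECT STDEV(Income) AS SmallSampleStDev
FROM (
    SELECT TOP (100) Income
    FROM dbo.Customers
    ORDER BY NEWID() -- random order, so every run picks different rows
) AS SmallSample;

-- The same measure for a much larger random sample (about 5,000 rows).
SELECT STDEV(Income) AS LargeSampleStDev
FROM (
    SELECT TOP (5000) Income
    FROM dbo.Customers
    ORDER BY NEWID()
) AS LargeSample;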
This settling down is the key. What it means is that when the sample is small, each new record can greatly change the value distribution and, in consequence, the standard deviation. But as the sample gets bigger, adding new records makes barely any difference to the value distribution. Notice that the final value distribution (the final shape of the curves from part two) no longer matters; only the amount of change does. In other words, we don't have to know all people's salaries, all we need is a large enough sample. And we can check whether the sample is large enough by increasing the sample size and measuring the variability. When the changes in standard deviation between consecutive samples are small enough (or even nonexistent), the sample is representative.
Note. How small this difference should be depends on the required confidence. To be 100 percent certain that all the variability of a variable has been captured, we would need to use the population. But this is impractical or even impossible. Therefore we have to agree with our customers on some level of confidence. For example, 95 percent confidence means that we will be wrong 1 time in 20.
Now it is time to apply our little trick from the previous part and replace the standard deviation of the values as such with the standard deviation of their relative frequencies. This allows us to measure the standard deviation of categorical variables (variables like sex or color don't have a mean or a standard deviation). What's more, this way we achieve one of our goals: duplicating data will not change our measure.
To do this, let me create a temporary table and insert into it the sample size and standard deviation for samples of increasing sizes:
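(A minimal sketch of this step, still against the hypothetical dbo.Customers table: a loop draws a fresh random sample for each size and stores the size together with the standard deviation of Income.)

-- Temporary table holding one row per sample: its size and standard deviation.
CREATE TABLE #Samples (
    SampleSize int NOT NULL,
    [StDev]    float NULL
);

DECLARE @size int = 100;
WHILE @size <= 5000
BEGIN
    INSERT INTO #Samples (SampleSize, [StDev])
    SELECT @size, STDEV(Income)
    FROM (
        SELECT TOP (@size) Income
        FROM dbo.Customers
        ORDER BY NEWID() -- a fresh random sample for every size
    ) AS RandomSample;

    SET @size += 100; -- grow the sample by 100 rows each step
END;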
The collected data looks like this:
Thanks to the new analytical functions available in SQL Server 2012, checking how much (by how many percent) the deviation has changed between samples is easy: all you have to do is use the LAG() function to get the value from the previous row, subtract it from the current row value, divide the result by the current row value, and multiply it by 100:
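(A sketch of such a query, run against the #Samples table from the previous step.)

-- Percentage change of the standard deviation between consecutive samples.
SELECT
    SampleSize,
    [StDev],
    ([StDev] - LAG([StDev]) OVER (ORDER BY SampleSize)) / [StDev] * 100
        AS PercentChange
FROM #Samples
ORDER BY SampleSize;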
As you can see, the difference drops from 85 percent between the second and the third sample to 2 percent between the last two. You can also see that once the sample size exceeds 1,100, the differences never exceed 5 percent.
Unfortunately, this does not mean that for this sample size the confidence will be 95 percent. All we can say based on these results is that the minimum sample size is 1,100 cases; with anything smaller, the income will not be representative. And even this number will probably turn out to be too low, as we will see next.
But before we go any further, let me show you that this technique can also be applied to categorical (discrete) variables:
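(The original query is not shown here, but the check could look like the sketch below, assuming a hypothetical Gender column: instead of the raw values, we compute each category's relative frequency within a random sample and then take the standard deviation of those frequencies. Repeating this for growing sample sizes and comparing the results with LAG(), exactly as before, tells us whether the categorical variable has converged.)

DECLARE @size int = 1000;

-- Standard deviation of the relative frequencies of a categorical variable
-- (hypothetical Gender column) for one random sample of @size rows.
SELECT STDEV(RelativeFrequency) AS FrequencyStDev
FROM (
    SELECT Gender,
           COUNT(*) * 1.0 / SUM(COUNT(*)) OVER () AS RelativeFrequency
    FROM (
        SELECT TOP (@size) Gender
        FROM dbo.Customers
        ORDER BY NEWID()
    ) AS RandomSample
    GROUP BY Gender
) AS Frequencies;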
Lesson learned: there is no easy way to say how much data you need for a particular model. However, you can check whether a sample is representative by measuring, in the way shown, the differences in variability. This test should be done for the most important variables, if not for all of them. At the very least you should check the representativeness of all predicted variables, as well as all input variables strongly correlated with the predicted ones, not to mention those that are important from a business perspective.
Remember, this has nothing to do with the data mining algorithm as such (although some of them require more data than others). But as long as you plan to use the data mining model to solve some real-world (i.e. business) problems, you have to train it using representative data.
What about correlations between variables?
So far we have learned that all we can do is check whether there is enough data. But we have only checked the representativeness of single variables. What if we are unlucky and in our sample some relationships between variables (e.g. people under 18 who have very high salaries) are not properly represented? Well, in this case you can do exactly the same check, only this time you should group rows by multiple columns and check the variability of each group, not of the whole variable. The OVER clause makes this easy.
First, let me check the variability of many variables at once:
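(A sketch of what such a query could look like, still using the hypothetical dbo.Customers table; the Gender, Age and Income columns are assumptions. The OVER clause computes the standard deviation of several variables at once, within the groups defined by the PARTITION BY columns.)

-- Standard deviation of several variables at once, computed within groups.
SELECT DISTINCT
    Gender,
    STDEV(Income) OVER (PARTITION BY Gender) AS IncomeStDev,
    STDEV(Age)    OVER (PARTITION BY Gender) AS AgeStDev
FROM dbo.Customers;

To take another relationship into account, it is enough to add the relevant column both to the SELECT list and to the PARTITION BY clause.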
Based on this query you can easily add additional partitions. However, in my opinion the results we have achieved so far are good enough. Nevertheless, if the data mining results are worse than expected, especially if the problem arises only for some range of values, you really should check the representativeness of the source data.
Hope this helps.