In the first part of this series we saw that the raw amount of data is not what matters most to data mining algorithms. Now we are going to find out what really does. And in the upcoming article we will see how to apply this knowledge to real-world problems.
Model parameters are also called variables, because each of them can take on a variety of values. Those values contain some sort of pattern, which means that they are distributed across the variable’s range in some specific way.
All you have to do to see it is count the occurrences of each value, like this (I am casting the values to integers to reduce the number of unique values):
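The original statement is not included here, but the idea can be sketched in T-SQL like this (the table name dbo.Customers and the Age column are my assumptions, not taken from the original):

```sql
-- Count the occurrences of each value; casting to INT
-- reduces the number of unique values to group by.
SELECT CAST(Age AS INT) AS Age,
       COUNT(*)         AS [Count]
FROM dbo.Customers        -- hypothetical table name
GROUP BY CAST(Age AS INT)
ORDER BY Age;
```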
It would probably be easier to see a pattern if the data was displayed graphically, for example as a histogram. But there is an even better way to show it: histograms tend to be hard to analyze as the number of columns grows (remember, each column represents the count of a distinct value), and grouping data changes its distribution, so a continuous line (a curve) is my favorite way of visualizing this kind of data:
The actual shape of this curve is not important here. It can be any shape, and each one will be equally valid. The important part is that this shape represents the variability of the chosen variable.
Before we go on, we have to make one important adjustment and normalize the counts, so that they stay the same when the number of instances (the number of rows) changes. The simplest solution is to divide the counts of distinct values by the total number of rows and multiply the result by 100 (the multiplication by 1. is done to implicitly cast integers to decimals):
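A normalized version of the counting query could look like this (again, dbo.Customers and Age are hypothetical names standing in for the original ones):

```sql
-- Normalize the counts so the shape of the curve no longer
-- depends on the total number of rows: divide each count by the
-- row count and multiply by 100. The "1." forces an implicit
-- cast from integer to decimal, avoiding integer division.
SELECT CAST(Age AS INT) AS Age,
       COUNT(*) * 1. / (SELECT COUNT(*) FROM dbo.Customers) * 100
           AS [Percent]
FROM dbo.Customers
GROUP BY CAST(Age AS INT)
ORDER BY Age;
```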
As you can see, the normalized shape looks exactly like the previous (non-normalized) one:
Variability as a data quantity measure
This is a huge oversimplification, but data mining algorithms work by analyzing statistical relationships between variables. Hence, the distribution of each parameter’s values is the single most important factor that determines the results. This has an important consequence: if this distribution stays the same, the results of data mining will also be similar. And this is exactly what we saw in the previous article.
In other words, if the variability (and the curve which represents it) stays the same, the source data is large enough for mining. I know this statement needs clarifying, but for now let me confine myself to one obvious (but crucial) observation: the variability changes with the amount of analyzed data (the number of rows).
To see this, let me take random samples of 10, 100, 500 and finally 1000 rows and compare their variability with the variability measured for the whole table (to follow my example, just change @sampleCount accordingly and execute this statement four times):
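A minimal sketch of such a sampling statement, assuming the same hypothetical dbo.Customers table and Age column as before:

```sql
-- Take a random sample of @sampleCount rows and compute the
-- normalized counts for it. ORDER BY NEWID() shuffles the rows;
-- run this four times with @sampleCount = 10, 100, 500, 1000.
DECLARE @sampleCount INT = 10;

WITH SampleRows AS
(
    SELECT TOP (@sampleCount) CAST(Age AS INT) AS Age
    FROM dbo.Customers
    ORDER BY NEWID()
)
SELECT Age,
       COUNT(*) * 1. / @sampleCount * 100 AS [Percent]
FROM SampleRows
GROUP BY Age
ORDER BY Age;
```

ORDER BY NEWID() is an easy way to get a uniform random sample; for very large tables TABLESAMPLE would be cheaper, at the cost of less uniform sampling.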
This comparison is easy to do with the curves we have already discussed. The following chart shows all five curves, plus an additional one that represents the variability of the values measured for the table with duplicated rows:
There are some important observations to notice:
· First, the curves that represent the whole populations (both the original and the one multiplied by 50) are identical; you can only occasionally see two lines, because the second one is plotted using a different scale.
· Second, as the number of rows in a sample increases, the shape of a curve looks more and more like the shape of the curve that represents the whole population.
The first observation is important because it means that this way of measuring corresponds with the results from the previous part of the series. But the second one is even more important, as it will allow me to answer the initial question about how much data we need.
As you can see, the 10-row sample is clearly too small, because it doesn’t represent the whole data correctly: not only are there a lot of missing points, but the overall shape of the curve is quite different. The 100-row sample has fewer missing points, but the shape of the curve is still far from the expected one. The 500-row sample looks better, but there are ranges where the shape is still off (e.g. for ages between 20 and 40). The 1000-row sample, however, looks almost identical to the whole-population curves.
It seems that by measuring variability we can check whether there is enough data to mine.
Checking variability by drawing curves for all variables is not practical. Fortunately, statisticians have already solved this problem for us. They have devised several measures for describing variables, among others the mean and the standard deviation. The first one simply points to some central value; the second one is a sort of average distance between the values and the mean.
Note. Because some values are smaller than the mean, some of these distances would be negative; in fact, the sum of all the distances is always 0. To get rid of the minus sign we can square these distances (multiply each one by itself), then add them together, and finally divide by the number of distances. To make this measure more meaningful, we should also take the square root of the result, which gives us the standard deviation.
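Written as a formula, the steps above give the population standard deviation (x_i are the values, n is their count, and the mean is computed first):

```latex
\mu = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu\right)^2}
```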
We will look at the standard deviation from a slightly different perspective than statisticians do, though. For us it will be the measure of variability. It is simple to calculate, and if calculated in the following way, it has all the necessary attributes:
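One way to compute it, sketched against the same hypothetical dbo.Customers table (STDEVP is the built-in T-SQL aggregate for the population standard deviation, i.e. exactly the square-root-of-mean-squared-distances computation described in the note above):

```sql
-- Population standard deviation of the variable,
-- used here as the measure of its variability.
SELECT STDEVP(CAST(Age AS INT)) AS Variability
FROM dbo.Customers;
```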
First, we should check if the result is the same for multiplied values:
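A sketch of that check, assuming a hypothetical dbo.CustomersMultiplied table that holds every original row repeated 50 times:

```sql
-- Duplicating every row the same number of times does not change
-- the population standard deviation, so this should return the
-- same value as the query against the original table.
SELECT STDEVP(CAST(Age AS INT)) AS Variability
FROM dbo.CustomersMultiplied;
```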
Then, we should also check whether the results get closer to the whole-table value as the number of rows in a sample grows (to do this, execute the following statement several times, each time with an increased @sampleCount):
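Combining the earlier sampling sketch with the standard deviation (again, dbo.Customers and Age are assumed names):

```sql
-- Variability of a random sample; rerun with
-- @sampleCount = 10, 100, 500, 1000 and compare
-- the results with the whole-table value.
DECLARE @sampleCount INT = 10;

SELECT STDEVP(Age) AS Variability
FROM (
    SELECT TOP (@sampleCount) CAST(Age AS INT) AS Age
    FROM dbo.Customers
    ORDER BY NEWID()
) AS SampleRows;
```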
And yes, the results correspond with the curves we saw earlier.
Lesson learned — instead of counting rows we should check the parameters’ variability.
But two problems remain:
1. We don’t know what the variability (the standard deviation) should be.
2. Until now we have been working with only one variable. However, in real life there would be dozens or even hundreds of them.
Both those problems will be tackled in the last part of the series, so stay tuned.