In the first part of this series we saw that the raw amount of data is not what matters most to data mining algorithms. Now we are going to find out what really matters, and in the upcoming article we will see how to apply this knowledge to real-world problems.

## Variability

Model parameters are also called variables, because each of them can take on a variety of values. Those values follow some sort of pattern, which means that they are distributed across the variable's range in some specific way.
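Before counting individual values it can help to get a feel for that range first. The following query is a small side sketch of mine, not part of the original walk-through; it only assumes the same [dbo].[Titanic] table and [Age] column used throughout this article:

```sql
-- Quick look at the Age variable's range and cardinality (side sketch).
-- MIN/MAX/COUNT(DISTINCT ...) ignore NULLs, COUNT(*) does not.
SELECT MIN(CAST([Age] AS Int))            AS MinAge,
       MAX(CAST([Age] AS Int))            AS MaxAge,
       COUNT(DISTINCT CAST([Age] AS Int)) AS DistinctAges,
       COUNT(*)                           AS AllRows
FROM [dbo].[Titanic];
```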

All you have to do to see it is count the occurrences of each value, like this (I am casting the values to integers to reduce the number of unique values):

```sql
SELECT CAST([Age] AS Int) AS Age, COUNT(*) AS Freq
FROM [dbo].[Titanic]
GROUP BY CAST([Age] AS Int)
ORDER BY CAST([Age] AS Int);
GO
```

```
Age   Freq
NULL  263
0     11
1     11
2     12
3     7
4     10
5     5
6     6
7     4
8     6
9     10
10    4
11    5
12    3
13    5
14    10
15    6
16    19
17    20
18    42
19    29
20    24
21    41
22    44
23    27
24    48
25    34
26    31
27    30
28    35
29    30
30    42
31    23
32    28
33    21
34    18
35    23
36    33
37    9
38    15
39    20
40    21
41    11
42    18
43    9
44    10
45    23
46    6
47    14
48    14
49    9
50    15
51    8
52    6
53    4
54    10
55    9
56    4
57    5
58    6
59    3
60    8
61    5
62    5
63    4
64    5
65    3
66    1
67    1
70    3
71    2
74    1
76    1
80    1
```

It would probably be easier to spot a pattern if the data was displayed graphically, for example as a histogram. But there is an even better way to show it: histograms tend to be hard to analyze as the number of columns grows (remember, each column represents the count of one distinct value), and grouping data into bins changes its distribution, so a continuous line (a curve) is my favorite way of visualizing this kind of data:

The actual shape of this curve is not important here; it can be any shape, and each one will be equally valid. The important part is that this shape represents the variability of the chosen variable.
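If you are following along in Management Studio and do not have a charting tool at hand, a rough text-only substitute can be produced directly in T-SQL. This is purely my own aside, not the chart used in the article, and the scaling factor (one star per roughly half a percent of all rows) is an arbitrary choice:

```sql
-- Poor man's histogram: one '*' for every ~0.5% of all rows.
-- Integer division means ages rarer than that show an empty bar.
SELECT CAST([Age] AS Int) AS Age,
       REPLICATE('*', COUNT(*) * 200 / (SELECT COUNT(*) FROM [dbo].[Titanic])) AS Bar
FROM [dbo].[Titanic]
GROUP BY CAST([Age] AS Int)
ORDER BY CAST([Age] AS Int);
```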

Before we go on, we have to make one important adjustment and normalize the occurrence counts, so that they stay the same even when the number of instances (the number of rows) changes. The simplest solution is to divide the count of each distinct value by the total number of rows and multiply the result by 100 (the multiplication by 1. is there to implicitly cast the integers to decimals):

```sql
SELECT CAST([Age] AS Int) AS Age,
       COUNT(*) * 1. / (SELECT COUNT(*) FROM [Titanic]) * 100 AS Freq
FROM [dbo].[Titanic]
GROUP BY CAST([Age] AS Int)
ORDER BY CAST([Age] AS Int);
GO
```

```
Age   Freq
NULL  20.09167303200
0     0.84033613400
1     0.84033613400
2     0.91673032800
3     0.53475935800
4     0.76394194000
5     0.38197097000
6     0.45836516400
7     0.30557677600
8     0.45836516400
9     0.76394194000
10    0.30557677600
11    0.38197097000
12    0.22918258200
13    0.38197097000
14    0.76394194000
15    0.45836516400
16    1.45148968600
17    1.52788388000
18    3.20855614900
19    2.21543162700
20    1.83346065600
21    3.13216195500
22    3.36134453700
23    2.06264323900
24    3.66692131300
25    2.59740259700
26    2.36822001500
27    2.29182582100
28    2.67379679100
29    2.29182582100
30    3.20855614900
31    1.75706646200
32    2.13903743300
33    1.60427807400
34    1.37509549200
35    1.75706646200
36    2.52100840300
37    0.68754774600
38    1.14591291000
39    1.52788388000
40    1.60427807400
41    0.84033613400
42    1.37509549200
43    0.68754774600
44    0.76394194000
45    1.75706646200
46    0.45836516400
47    1.06951871600
48    1.06951871600
49    0.68754774600
50    1.14591291000
51    0.61115355200
52    0.45836516400
53    0.30557677600
54    0.76394194000
55    0.68754774600
56    0.30557677600
57    0.38197097000
58    0.45836516400
59    0.22918258200
60    0.61115355200
61    0.38197097000
62    0.38197097000
63    0.30557677600
64    0.38197097000
65    0.22918258200
66    0.07639419400
67    0.07639419400
70    0.22918258200
71    0.15278838800
74    0.07639419400
76    0.07639419400
80    0.07639419400
```

As you can see, the normalized shape looks exactly like the previous (non-normalized) one:

## Variability as a data quantity measure

This is a huge oversimplification, but data mining algorithms work by analyzing statistical relationships between variables. Hence, the distribution of each parameter's values is the single most important factor that determines the results. This has an important consequence: if this distribution stays the same, the results of data mining will also be similar. And this is exactly what we saw in the previous article.

In other words, if the variability (and the curve which represents it) stays the same, the source data is large enough for mining. I know this statement needs clarifying, but for now let me confine myself to one obvious (but crucial) observation: the variability changes with the amount of analyzed data (the number of rows).

To see this, let me take random samples of 10, 100, 500, and finally 1000 rows and compare their variability with the variability measured for the whole table (to follow my example, just change @sampleCount accordingly and execute this statement four times):

```sql
DECLARE @sampleCount int = 100;
WITH [Sample] AS
(
    SELECT TOP (@sampleCount) [Age]
    FROM [dbo].[Titanic]
    ORDER BY NEWID()
)
SELECT CAST([Age] AS Int) AS Age,
       COUNT(*) * 1. / (SELECT COUNT(*) FROM [Titanic]) * 100 AS Freq
FROM [Sample]
GROUP BY CAST([Age] AS Int);
GO
```

```
Age   Freq
NULL  1.06951871600
1     0.15278838800
9     0.07639419400
12    0.07639419400
14    0.07639419400
16    0.15278838800
17    0.07639419400
18    0.30557677600
19    0.22918258200
20    0.15278838800
21    0.15278838800
22    0.22918258200
23    0.30557677600
24    0.30557677600
25    0.22918258200
26    0.15278838800
27    0.22918258200
28    0.15278838800
29    0.07639419400
30    0.22918258200
31    0.22918258200
32    0.30557677600
33    0.30557677600
34    0.15278838800
35    0.45836516400
36    0.07639419400
38    0.30557677600
40    0.07639419400
41    0.15278838800
42    0.07639419400
43    0.07639419400
44    0.07639419400
45    0.15278838800
47    0.07639419400
48    0.07639419400
50    0.07639419400
51    0.07639419400
54    0.07639419400
57    0.07639419400
58    0.07639419400
59    0.15278838800
62    0.07639419400
```

This comparison is easy to do with the curves we have already discussed. The following chart shows all five of them, plus an additional one that represents the variability measured for the table with duplicated rows (the one from the previous part, with every row repeated 50 times):

There are some important observations to notice:

· First, the curves that represent the whole populations (the original table and the one multiplied by 50) are identical; you can only occasionally see two separate lines, because the second one is plotted on a different scale.

· Second, **as the number of rows in a sample increases, the shape of its curve looks more and more like the shape of the curve that represents the whole population**.

The first observation is important because it means that this way of measuring corresponds with the results from the previous part of the series. But the second one is even more important, as it will allow me to answer the initial question about how much data we need.

As you can see, the 10-row sample is clearly too small, because it doesn't represent the whole data correctly: not only are there a lot of missing points, but the overall shape of the curve is quite different. The 100-row sample has fewer missing points, but the shape of its curve is still far from the expected one. The 500-row sample looks better, but there are ranges where the shape is still off (e.g. for ages between 20 and 40). The 1000-row sample, however, looks almost identical to the whole-population curves.

It seems that by measuring variability we can check whether there is enough data to mine.

## Deviation

Checking variability by drawing curves for all variables is not practical. Fortunately, statisticians have already solved this problem for us. They came up with several measures for describing variables, among them the mean and the deviation. The first one simply points to some central value; the second one is a sort of average distance between the values and the mean.

Note. Because some values are smaller than the mean, their distances from it are negative (they have to be, otherwise all the distances would not sum up to 0). To get rid of the minus sign we can square these distances (multiply each one by itself), add them together, and divide by the number of distances; this gives the variance. To bring the measure back to the original units, we then take its square root and end up with the standard deviation.
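To make the note concrete, here is a minimal sketch of my own that follows the recipe literally for the Age column. It divides by the number of values, which is what the built-in STDEVP function does; the STDEV function used in the rest of this article divides by n-1 instead, so it comes out slightly larger:

```sql
-- Manual standard deviation of Age, exactly as described in the note above:
-- square the distances from the mean, average them, then take the square root.
WITH Ages AS
(
    SELECT CAST([Age] AS float) AS Age
    FROM [dbo].[Titanic]
    WHERE [Age] IS NOT NULL
)
SELECT SQRT(AVG(SQUARE(Age - (SELECT AVG(Age) FROM Ages)))) AS ManualStdDev,
       STDEVP(Age)                                          AS BuiltInStdDevP
FROM Ages;
```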

We will, however, look at the standard deviation from a slightly different perspective than statisticians do: for us it will be a measure of variability. It is simple to compute and, calculated in the following way, it has all the attributes we need:

```sql
WITH cte AS
(
    SELECT CAST([Age] AS Int) AS Age,
           COUNT(*) * 1. / (SELECT COUNT(*) FROM [Titanic]) * 100 AS Freq
    FROM [dbo].[Titanic]
    GROUP BY CAST([Age] AS Int)
)
SELECT STDEV(Freq) AS StandardDeviation
FROM cte;
GO
```

```
StandardDeviation
2,39532195656265
```

First, we should check whether the result is the same for the multiplied table:

```sql
WITH cte AS
(
    SELECT CAST([Age] AS Int) AS Age,
           COUNT(*) * 1. / (SELECT COUNT(*) FROM [Titanic50Times]) * 100 AS Freq
    FROM [dbo].[Titanic50Times]
    GROUP BY CAST([Age] AS Int)
)
SELECT STDEV(Freq) AS StandardDeviation
FROM cte;
GO
```

```
StandardDeviation
2,39532195656265
```
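The [Titanic50Times] table comes from the previous part of the series. If you do not have it, something along these lines could recreate it, assuming all it contains is every Titanic row repeated 50 times (both that assumption and the tally trick below are mine, not the original article's):

```sql
-- Hypothetical recreation of [Titanic50Times]: cross join the original rows
-- with a 50-row tally so that every row appears 50 times.
SELECT t.*
INTO [dbo].[Titanic50Times]
FROM [dbo].[Titanic] AS t
CROSS JOIN (SELECT TOP (50) [object_id] FROM sys.all_objects) AS tally;
```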

Then we should also check whether the results get closer to the population value as the number of rows in a sample grows (to do this, execute the following statement several times, each time with an increased @sampleCount):

```sql
DECLARE @sampleCount int = 10;
WITH [Sample] AS
(
    SELECT TOP (@sampleCount) [Age]
    FROM [dbo].[Titanic]
    ORDER BY NEWID()
),
Frequency AS
(
    SELECT CAST([Age] AS Int) AS Age,
           COUNT(*) * 1. / (SELECT COUNT(*) FROM [Titanic]) * 100 AS Freq
    FROM [Sample]
    GROUP BY CAST([Age] AS Int)
)
SELECT STDEV(Freq) AS StandardDeviation
FROM Frequency;
GO
```

```
/*
Deviation 10   - 0,0540188526206807
Deviation 100  - 0,212706588322339
Deviation 500  - 0,889344911234561
Deviation 1000 - 1,79285475617734
Deviation All  - 2,39532195656265
Deviation *50  - 2,39532195656265
*/
```
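If editing @sampleCount by hand feels tedious, the four runs can also be wrapped in a simple loop. This is just a convenience sketch of mine with the same logic as above; because of ORDER BY NEWID() every execution draws a different random sample, so your numbers will differ slightly from the ones listed here:

```sql
-- Run the deviation measurement for several sample sizes in one batch (side sketch).
DECLARE @sizes TABLE (SampleCount int);
INSERT INTO @sizes VALUES (10), (100), (500), (1000);

DECLARE @sampleCount int = (SELECT MIN(SampleCount) FROM @sizes);
WHILE @sampleCount IS NOT NULL
BEGIN
    ;WITH [Sample] AS
    (
        SELECT TOP (@sampleCount) [Age]
        FROM [dbo].[Titanic]
        ORDER BY NEWID()
    ),
    Frequency AS
    (
        SELECT CAST([Age] AS Int) AS Age,
               COUNT(*) * 1. / (SELECT COUNT(*) FROM [Titanic]) * 100 AS Freq
        FROM [Sample]
        GROUP BY CAST([Age] AS Int)
    )
    SELECT @sampleCount AS SampleCount, STDEV(Freq) AS StandardDeviation
    FROM Frequency;

    -- Move on to the next (larger) sample size; NULL ends the loop.
    SET @sampleCount = (SELECT MIN(SampleCount) FROM @sizes WHERE SampleCount > @sampleCount);
END
```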

And yes, the results correspond with the curves we saw earlier.

Lesson learned — instead of counting rows we should check the parameters’ variability.

But two problems remain:

1. We don’t know what the variability (the standard deviation) should be.

2. Until now we have been working with only one variable. However, in real life there would be dozens or even hundreds of them.

Both those problems will be tackled in the last part of the series, so stay tuned.