How Much Data do You Need? Part 2 — Measuring Variability

In the first part of this series we saw that the raw amount of data is not that important for data mining algorithms. Now we are going to find out what really matters, and in the upcoming article we will see how to apply this knowledge to real-world problems.

Variability

Model parameters are also called variables because each of them can take on a variety of values. Those values follow some sort of pattern, meaning they are distributed across the variable's range in a specific way.

All you have to do to see it is count the occurrences of each value, like this (I am casting the values to integers to reduce the number of unique values):

SELECT CAST([Age] AS Int) AS Age, COUNT(*) AS Freq
FROM [dbo].[Titanic]
GROUP BY CAST([Age] AS Int)
ORDER BY CAST([Age] AS Int);
GO
-----
Age    Freq
NULL    263
0    11
1    11
2    12
3    7
4    10
5    5
6    6
7    4
8    6
9    10
10    4
11    5
12    3
13    5
14    10
15    6
16    19
17    20
18    42
19    29
20    24
21    41
22    44
23    27
24    48
25    34
26    31
27    30
28    35
29    30
30    42
31    23
32    28
33    21
34    18
35    23
36    33
37    9
38    15
39    20
40    21
41    11
42    18
43    9
44    10
45    23
46    6
47    14
48    14
49    9
50    15
51    8
52    6
53    4
54    10
55    9
56    4
57    5
58    6
59    3
60    8
61    5
62    5
63    4
64    5
65    3
66    1
67    1
70    3
71    2
74    1
76    1
80    1


It would probably be easier to see the pattern if the data were displayed graphically, for example as a histogram. But there is an even better way to show it: because histograms tend to be hard to analyze as the number of columns grows (remember, each column represents the count of one distinct value), and because grouping data changes its distribution, a continuous line (a curve) is my favorite way of visualizing this kind of data:

[Image: the distribution of Age values plotted as a curve]

The actual shape of this curve is not important here; it can be any shape, and each one is equally valid. The important part is that this shape represents the variability of the chosen variable.

Before we go on, we have to make one important adjustment and normalize the occurrence counts, so that they stay the same when the number of instances (the number of rows) changes. The simplest way to do this is to divide the count of each distinct value by the total number of rows and multiply the result by 100 (the multiplication by 1. is there to implicitly cast the integers to decimals):

SELECT CAST([Age] AS Int) AS Age, COUNT(*) *1. /(SELECT COUNT(*) FROM [Titanic])*100 AS Freq
FROM [dbo].[Titanic]
GROUP BY CAST([Age] AS Int)
ORDER BY CAST([Age] AS Int);
GO
-----
Age    Freq
NULL    20.09167303200
0    0.84033613400
1    0.84033613400
2    0.91673032800
3    0.53475935800
4    0.76394194000
5    0.38197097000
6    0.45836516400
7    0.30557677600
8    0.45836516400
9    0.76394194000
10    0.30557677600
11    0.38197097000
12    0.22918258200
13    0.38197097000
14    0.76394194000
15    0.45836516400
16    1.45148968600
17    1.52788388000
18    3.20855614900
19    2.21543162700
20    1.83346065600
21    3.13216195500
22    3.36134453700
23    2.06264323900
24    3.66692131300
25    2.59740259700
26    2.36822001500
27    2.29182582100
28    2.67379679100
29    2.29182582100
30    3.20855614900
31    1.75706646200
32    2.13903743300
33    1.60427807400
34    1.37509549200
35    1.75706646200
36    2.52100840300
37    0.68754774600
38    1.14591291000
39    1.52788388000
40    1.60427807400
41    0.84033613400
42    1.37509549200
43    0.68754774600
44    0.76394194000
45    1.75706646200
46    0.45836516400
47    1.06951871600
48    1.06951871600
49    0.68754774600
50    1.14591291000
51    0.61115355200
52    0.45836516400
53    0.30557677600
54    0.76394194000
55    0.68754774600
56    0.30557677600
57    0.38197097000
58    0.45836516400
59    0.22918258200
60    0.61115355200
61    0.38197097000
62    0.38197097000
63    0.30557677600
64    0.38197097000
65    0.22918258200
66    0.07639419400
67    0.07639419400
70    0.22918258200
71    0.15278838800
74    0.07639419400
76    0.07639419400
80    0.07639419400


As you can see, the normalized shape looks exactly like the previous (non-normalized) one:

[Image: the normalized distribution of Age values plotted as a curve]
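
To double-check that the normalization really makes the frequencies independent of the number of rows, you can run the same query against the multiplied table from the previous part (here I assume it is called [Titanic50Times], the name used later in this article). The normalized frequencies come out the same as for the original table:

-- The same normalized frequency query, run against the table in which
-- every Titanic row has been duplicated 50 times.
SELECT CAST([Age] AS Int) AS Age,
       COUNT(*) * 1. / (SELECT COUNT(*) FROM [Titanic50Times]) * 100 AS Freq
FROM [dbo].[Titanic50Times]
GROUP BY CAST([Age] AS Int)
ORDER BY CAST([Age] AS Int);
GO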

Variability as a data quantity measure

This is a huge oversimplification, but data mining algorithms work by analyzing statistical relationships between variables. Hence, the distribution of each parameter's values is the single most important factor that determines the results. This has an important consequence: if the distribution stays the same, the data mining results will also be similar. And this is exactly what we saw in the previous article.

In other words, if the variability (and the curve which represents it) stays the same, the source data is large enough for mining. I know this statement needs clarifying, but for now let me confine myself to one obvious (but crucial) observation: the variability changes with the amount of analyzed data (the number of rows).

To see this, let me get random samples of 10, 100, 500 and finally 1000 rows and compare their variability with the variability measured for the whole table (to follow my example you just have to change @sampleCount accordingly and execute this statement four times):

DECLARE @sampleCount int = 100;
WITH [Sample] AS
    (SELECT TOP (@sampleCount) [Age]
    FROM [dbo].[Titanic]
    ORDER BY NEWID()
    )
SELECT CAST([Age] AS Int) AS Age, COUNT(*)*1. /(SELECT COUNT(*) FROM [Titanic])*100 AS Freq
    FROM [Sample]
    GROUP BY CAST([Age] AS Int)
    ORDER BY CAST([Age] AS Int);
GO
----
Age    Freq
NULL    1.06951871600
1    0.15278838800
9    0.07639419400
12    0.07639419400
14    0.07639419400
16    0.15278838800
17    0.07639419400
18    0.30557677600
19    0.22918258200
20    0.15278838800
21    0.15278838800
22    0.22918258200
23    0.30557677600
24    0.30557677600
25    0.22918258200
26    0.15278838800
27    0.22918258200
28    0.15278838800
29    0.07639419400
30    0.22918258200
31    0.22918258200
32    0.30557677600
33    0.30557677600
34    0.15278838800
35    0.45836516400
36    0.07639419400
38    0.30557677600
40    0.07639419400
41    0.15278838800
42    0.07639419400
43    0.07639419400
44    0.07639419400
45    0.15278838800
47    0.07639419400
48    0.07639419400
50    0.07639419400
51    0.07639419400
54    0.07639419400
57    0.07639419400
58    0.07639419400
59    0.15278838800
62    0.07639419400


This comparison is easy to do with the curves we have already discussed. The following chart shows all five curves, plus an additional one that represents the variability measured for the table with duplicated rows (the table multiplied 50 times):

[Image: variability curves for the 10, 100, 500, and 1000 row samples, the whole table, and the table multiplied 50 times]

There are two important observations to make here:

· First, the curves that represent the whole population (both the original table and the one multiplied by 50) are identical; you can only occasionally tell that there are two lines, because the second one is plotted on a different scale.

· Second, as the number of rows in a sample increases, the shape of its curve looks more and more like the shape of the curve that represents the whole population.

The first observation is important because it means that this way of measuring variability is consistent with the results from the previous part of the series. But the second one is even more important, as it will let me answer the initial question of how much data we need.

As you can see, the 10-row sample is clearly too small, because it doesn't represent the whole data set correctly: not only are there a lot of missing points, but the overall shape of the curve is quite different. The 100-row sample has fewer missing points, but the shape of the curve is still far from the expected one. The 500-row sample looks better, but there are ranges where the shape is still off (e.g. for ages between 20 and 40). The 1000-row sample, however, looks almost identical to the whole-population curves.

It seems that by measuring variability we can check whether there is enough data to mine.

Deviation

Checking variability by drawing curves for every variable is not practical. Fortunately, statisticians have already solved this problem for us. They have defined several measures that describe variables, among others the mean and the deviation. The first one simply points to some central value; the second one is a sort of average distance between the values and the mean.

Note. Some of the values are smaller than the mean, so their distances from it are negative; in fact, the signed distances always sum up to 0, which makes their plain average useless. To get rid of the minus signs we square these distances (multiply each one by itself), add them together, and divide the sum by the number of distances. To make this measure more meaningful we then take its square root and end up with the standard deviation.
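
To make the note concrete, here is a minimal sketch that spells out those steps in T-SQL on the normalized Age frequencies. One detail to be aware of: the note describes dividing by the number of distances (the population formula), while the STDEV function used later in this article divides by n - 1 (the sample formula), so the sketch shows both variants next to the built-in result:

-- A manual calculation of the deviation of the normalized frequencies,
-- following the steps from the note above, compared with the built-in STDEV.
WITH Freqs AS
    (SELECT CAST([Age] AS Int) AS Age,
            COUNT(*) * 1. / (SELECT COUNT(*) FROM [Titanic]) * 100 AS Freq
    FROM [dbo].[Titanic]
    GROUP BY CAST([Age] AS Int)),
Stats AS
    (SELECT AVG(Freq) AS MeanFreq, COUNT(*) AS N FROM Freqs)
SELECT SQRT(SUM(SQUARE(Freq - MeanFreq)) / MAX(N))       AS PopulationStDev,
       SQRT(SUM(SQUARE(Freq - MeanFreq)) / (MAX(N) - 1)) AS SampleStDev,
       STDEV(Freq)                                       AS BuiltInStDev
FROM Freqs CROSS JOIN Stats;
GO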

We will, however, look at the standard deviation from a slightly different perspective than statisticians do. For us it will be a measure of variability. It is simple to calculate, and calculated in the following way it has all the attributes we need:

WITH cte AS
    (SELECT CAST([Age] AS Int) AS Age, COUNT(*) *1. /(SELECT COUNT(*) FROM [Titanic])*100 AS Freq
    FROM [dbo].[Titanic]
    GROUP BY CAST([Age] AS Int))
SELECT STDEV (Freq) AS StandardDeviation 
FROM CTE;
GO
----
StandardDeviation
2.39532195656265


First, we should check whether the result is the same for the multiplied table:

WITH cte AS
    (SELECT CAST([Age] AS Int) AS Age, COUNT(*) *1. /(SELECT COUNT(*) FROM [Titanic50Times])*100 AS Freq
    FROM [dbo].[Titanic50Times]
    GROUP BY CAST([Age] AS Int))
SELECT STDEV (Freq) AS StandardDeviation 
FROM CTE;
GO
----
StandardDeviation
2.39532195656265


Then we should also check whether the results get closer to the full-table value as the number of rows in a sample grows (to do this, execute the following statement several times, each time with an increased @sampleCount):

DECLARE @sampleCount int = 10;
WITH [Sample] AS
    (SELECT TOP (@sampleCount) [Age]
    FROM [dbo].[Titanic]
    ORDER BY NEWID()
    ),
Frequency AS
(SELECT CAST([Age] AS Int) AS Age, COUNT(*)*1. /(SELECT COUNT(*) FROM [Titanic])*100 AS Freq
    FROM [Sample]
    GROUP BY CAST([Age] AS Int))
SELECT STDEV (Freq) AS StandardDeviation 
FROM Frequency;
GO
/*
Deviation 10   - 0.0540188526206807
Deviation 100  - 0.212706588322339
Deviation 500  - 0.889344911234561
Deviation 1000 - 1.79285475617734
Deviation All  - 2.39532195656265
Deviation *50  - 2.39532195656265
*/


And yes, the results correspond with the curves we saw earlier.
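
If you would rather not re-run the script and write the numbers down by hand, the same comparison can be done in a single statement. A sketch along these lines (assuming the same [Titanic] table and @sampleCount parameter as above) returns the sample's deviation next to the full-table one:

-- Compares the deviation of a random sample's normalized frequencies
-- with the deviation calculated for the whole table.
DECLARE @sampleCount int = 1000;
WITH [Sample] AS
    (SELECT TOP (@sampleCount) [Age]
    FROM [dbo].[Titanic]
    ORDER BY NEWID()),
SampleFreq AS
    (SELECT CAST([Age] AS Int) AS Age,
            COUNT(*) * 1. / (SELECT COUNT(*) FROM [Titanic]) * 100 AS Freq
    FROM [Sample]
    GROUP BY CAST([Age] AS Int)),
FullFreq AS
    (SELECT CAST([Age] AS Int) AS Age,
            COUNT(*) * 1. / (SELECT COUNT(*) FROM [Titanic]) * 100 AS Freq
    FROM [dbo].[Titanic]
    GROUP BY CAST([Age] AS Int))
SELECT s.SampleDeviation, f.FullTableDeviation
FROM (SELECT STDEV(Freq) AS SampleDeviation FROM SampleFreq) AS s
CROSS JOIN (SELECT STDEV(Freq) AS FullTableDeviation FROM FullFreq) AS f;
GO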

Lesson learned — instead of counting rows we should check the parameters’ variability.

But two problems remain:

1. We don’t know what the variability (the standard deviation) should be.

2. Until now we have been working with only one variable. However, in real life there will be dozens or even hundreds of them.

Both those problems will be tackled in the last part of the series, so stay tuned.


