How Much Data do You Need? Part 3 —Data Samples and Convergence

This is the last part of the series in which we are going to talk about converging on a representative sample. If you missed the previous parts, before going further take your time and read them:

1. How much data do you need? Part 1 — The Amount of Data is Rather Irrelevant for Data Mining Algorithms.

2. How much data do you need? Part 2 — Measuring Variability.

By now, we can measure parameter variability, but two problems remain: we still don’t know what the variability should be and what to do to take relationships between parameters into account.

Converging on a representative sample

Until now we have been working with the whole set of data (with the population) — the Titanic passengers list is finite and small enough to be efficiently mined. But most of the time, the situation is quite different. In real projects the population is too big (sometimes even infinite) and/or constantly changing, to be measured (think of all people as potential customers or all orders made since the company was founded). To put it simply, in reality there is not all data available or there is too much of it. For these reasons most of the time we have to deal with sample data, the data that represents only some part of the population.

Note. Even if the whole population is available, we still need to divide it into at least two datasets — the training one and the test one. So, sampling is necessary. However, as long as the sample is representative there is no drawback if using it instead of the population.

The problem is how to assess the representatives of a sample when there is no population available? This problem can be solved with well known (at least to the statisticians) phenomenon called convergence. We had already seen this phenomenon in action in part two, when we drew curves that represented the value distributions in five datasets. If you remember, as the number of rows in a sample was growing, the shape of the curves was getting more and more alike.

To see what I mean please execute the following queries multiple times — you will find that the first result varies from 26 to 36 thousand while the second one stays most of the time around 31-33 thousand:

USE AdventureWorksDW2008R2;

WITH Samples AS
        (SELECT TOP (100) [YearlyIncome]
        FROM [dbo].[DimCustomer]
        ORDER BY NEWID()
SELECT STDEV([YearlyIncome]) AS StandardDeviation
FROM Samples;

WITH Samples AS
        (SELECT TOP (1000) [YearlyIncome]
        FROM [dbo].[DimCustomer]
        ORDER BY NEWID()
SELECT STDEV([YearlyIncome]) AS StandardDeviation
FROM Samples;



This settling down is the key. What it means is that when the sample is small, each new record can greatly change the value distribution, and in consequence, the standard deviation. But as the sample gets bigger, adding new records makes barely any difference to the value distribution. Take notice that the final value distribution (the final shape of curves from part two) is not important anymore, only the amount of changes is. In other words, we don’t have to know all people’s salary, all we need is a large enough sample. And now we can check if the sample is large enough by increasing the samples’ size and measuring their variability. When changes between standard deviation are small enough (or even nonexistent) the sample will be representative.

Note. How small this difference should be depends on the required confidence. To be 100 percent certain that all the variability of a variable was captured, we need to use a population. But this is impractical or even impossible. Therefore we have to determine with our customers some level of confidence. For example, 95 percent confidence means that we will be wrong 1 time in 20.

Now, it is time to apply our little trick from the previous part and replace the standard deviation of the values as such with the standard deviation of their relative frequency. This allows us to measure the standard deviation of categorical variables (variables like sex or color don’t have a mean or standard deviation). What’s more, this way we achieve one of our goals that by duplicating data we won’t change our measure.

To do this, let me create a temporary table and insert into it the sample size and standard deviation for samples of increasing sizes:

CREATE TABLE #YearlyIncomeVariability (
    [Sample] int,
    StandardDeviation decimal(10,4));

DECLARE @sampleCount int = 100;
WHILE @sampleCount<20000
    WITH Samples AS
            (SELECT TOP (@sampleCount) [YearlyIncome]
            FROM [dbo].[DimCustomer]
            ORDER BY NEWID()
        Frequency AS 
            (SELECT COUNT(*)*1. / (SELECT COUNT(*) FROM [dbo].[DimCustomer])*100 AS Freq
            FROM Samples
            GROUP BY [YearlyIncome])
    INSERT INTO #YearlyIncomeVariability ([Sample], StandardDeviation)
    SELECT @sampleCount, STDEV(Freq) AS StandardDeviation
    FROM Frequency;
    SET @sampleCount+=500;


The collected data looks like this:

FROM #YearlyIncomeVariability;
Sample    StandardDeviation
100    0.0269
600    0.1801
1100    0.3104
1600    0.4607
2100    0.6039
2600    0.7681
3100    0.9175
3600    1.0501
4100    1.1817
4600    1.3378
5100    1.4741
5600    1.6438
6100    1.7890
6600    1.9283
7100    2.0937
7600    2.2225
8100    2.3643
8600    2.5214
9100    2.6722
9600    2.8184
10100    2.9331
10600    3.1190
11100    3.2543
11600    3.4228
12100    3.5303
12600    3.7036
13100    3.8566
13600    3.9908
14100    4.1051
14600    4.3103
15100    4.4655
15600    4.5687
16100    4.7404
16600    4.9146
17100    5.0207
17600    5.1580
18100    5.3171
18600    5.4261
19100    5.4261
19600    5.4261


Thanks to the new analytical functions available in SQL Server 2012, checking how much (by how many percent) deviation has changed between samples is easy — all you have to do is use LAG() function to get the value from the previous row to be subtracted from the current row value, divide the result of subtraction by the current row value and multiple the result by 100:

SELECT [Sample], StandardDeviation, LAG (StandardDeviation) OVER (ORDER BY  [Sample]) AS PreviousStDev,
    ((StandardDeviation -
    LAG (StandardDeviation) OVER (ORDER BY  [Sample]))/StandardDeviation)*100 AS DiffInPct
FROM #YearlyIncomeVariability;
Sample    StandardDeviation    PreviousStDev    DiffInPct
100    0.0269    NULL    NULL
600    0.1801    0.0269    85.063853414769500
1100    0.3104    0.1801    41.978092783505100
1600    0.4607    0.3104    32.624267419144700
2100    0.6039    0.4607    23.712535187945000
2600    0.7681    0.6039    21.377424814477200
3100    0.9175    0.7681    16.283378746594000
3600    1.0501    0.9175    12.627368822016900
4100    1.1817    1.0501    11.136498265211100
4600    1.3378    1.1817    11.668410823740400
5100    1.4741    1.3378    9.246319788345400
5600    1.6438    1.4741    10.323640345540800
6100    1.7890    1.6438    8.116266070430400
6600    1.9283    1.7890    7.223979671212900
7100    2.0937    1.9283    7.899890146630300
7600    2.2225    2.0937    5.795275590551100
8100    2.3643    2.2225    5.997546842617200
8600    2.5214    2.3643    6.230665503291800
9100    2.6722    2.5214    5.643290172891200
9600    2.8184    2.6722    5.187340334941800
10100    2.9331    2.8184    3.910538338276900
10600    3.1190    2.9331    5.960243667842200
11100    3.2543    3.1190    4.157576130043300
11600    3.4228    3.2543    4.922870164777300
12100    3.5303    3.4228    3.045066991473800
12600    3.7036    3.5303    4.679231018468500
13100    3.8566    3.7036    3.967225016854200
13600    3.9908    3.8566    3.362734288864300
14100    4.1051    3.9908    2.784341428954200
14600    4.3103    4.1051    4.760689511170900
15100    4.4655    4.3103    3.475534654573900
15600    4.5687    4.4655    2.258848250049200
16100    4.7404    4.5687    3.622057210361900
16600    4.9146    4.7404    3.544540756114400
17100    5.0207    4.9146    2.113251140279200
17600    5.1580    5.0207    2.661884451337700
18100    5.3171    5.1580    2.992232608000600
18600    5.4261    5.3171    2.008809273695600
19100    5.4261    5.4261    0.000000000000000
19600    5.4261    5.4261    0.000000000000000


As you can see, the difference drops from 85 percent between the third and the second sample to 2 percent between last ones. You can also see, that after the sample size exceeded 11 hundred, the differences never exceed 5 percent.

Unfortunately, this does not mean that for this sample size the confidence will be 95 percent. All we can say based on those results is the minimum sample size is 1.1 thousand cases, otherwise the income will not be representative. And even this number probably will be too low, as we will see next.

But before we go any further, let me show you that this technique can also be applied to categorical (discrete) variables:

CREATE TABLE #EducationVariability (
    [Sample] int,
    StandardDeviation decimal(10,4));

DECLARE @sampleCount int = 100;
WHILE @sampleCount<20000
    WITH Samples AS
            (SELECT TOP (@sampleCount) [EnglishEducation]
            FROM [dbo].[DimCustomer]
            ORDER BY NEWID()
        Frequency AS 
            (SELECT COUNT(*)*1. / (SELECT COUNT(*) FROM [dbo].[DimCustomer])*100 AS Freq
            FROM Samples
            GROUP BY [EnglishEducation])
    INSERT INTO #EducationVariability ([Sample], StandardDeviation)
    SELECT @sampleCount, STDEV(Freq) AS StandardDeviation
    FROM Frequency;
    SET @sampleCount+=500;

SELECT [Sample], ((StandardDeviation -
    LAG (StandardDeviation) OVER (ORDER BY  [Sample]))/StandardDeviation)*100 AS DiffInPct
FROM #EducationVariability;
Sample    DiffInPct
100    NULL
600    84.316495937831100
1100    44.814814814814800
1600    25.867052023121300
2100    27.592340692685900
2600    17.803388664315800
3100    14.814272107846700
3600    20.213947506868500
4100    4.510186994138900
4600    11.648665976229200
5100    10.846816742877200
5600    10.772852098862200
6100    9.760328530463400
6600    2.107087159937600
7100    9.774553641224400
7600    7.960399458946000
8100    3.809207430169100
8600    6.976205191594500
9100    6.202898550724600
9600    3.293622985283800
10100    6.886201496432900
10600    6.904790831409700
11100    3.094401820928900
11600    3.034742570113000
12100    2.091987854175500
12600    7.145575313083700
13100    1.055982474456100
13600    5.855340541063100
14100    2.330668513156200
14600    3.525339325296800
15100    4.258968544682800
15600    2.786570201220800
16100    3.245741161029200
16600    2.263616441284900
17100    2.562366543409100
17600    3.449489616332200
18100    2.811274419968400
18600    1.927915837906400
19100    0.000000000000000
19600    0.000000000000000


Lesson learned — there is no easy way to say how much data you need for a particular model. However, you can check if the sample is representative by measuring, in the way showed, the differences in the variability. This test should be done for the most important, if nor for all, variables. At least you should check the representativeness of all predicted variables as well as all input variables strongly correlated with the predicted ones not to mention those that are important from business perspective.

Remember, this has nothing to do with the data mining algorithm as such (although some of them require more data than others). But as long as you plan to use the data mining model to solve some real world (i.e. business) problems, you have to train it using representative data.

What about correlations between variables?

So far, we have learned that all we can do is to check if there is enough data. But we only checked single variable representativeness. And what if we are unlucky and in our sample some correlations between variables (i.e. between people under 18 who have very high salary) are not properly represented? Well, in this case you can do exactly the same check, only this time you should group rows by multiple columns and check the group, not the whole variable, variability. The OVER clause can do this easily.

First, let me check the variability of many variables at once:

USE DataInDataMining

    (SELECT [Sex], [Boat], [Age], [Survived],
        COUNT(*) OVER (PARTITION BY [Sex]) *1. / (SELECT COUNT(*) FROM [Titanic])*100 AS SexFreq,
        COUNT(*) OVER (PARTITION BY [Boat]) *1./ (SELECT COUNT(*) FROM [Titanic])*100 AS BoatFreq,
        COUNT(*) OVER (PARTITION BY [Age]) *1./ (SELECT COUNT(*) FROM [Titanic])*100 AS AgeFreq,
        COUNT(*) OVER (PARTITION BY [Survived]) *1./ (SELECT COUNT(*) FROM [Titanic])*100 AS SurvivedFreq
    FROM [Titanic])
SELECT STDEV(SexFreq) AS SexDev,STDEV([BoatFreq]) AS BoatDev, STDEV([AgeFreq]) AS AgeDev, STDEV([SurvivedFreq]) AS SurvivedDev
FROM Freq;
SexDev            BoatDev            AgeDev                SurvivedDev
13,7954145037419    22,8129388069104    7,40418386753172    11,473724048845


Based on this query you can easily add additional partitions. However, in my opinion the result we have achieved so far are good enough. Nevertheless, if the data mining results are worse than expected, especially if the problem arises only for some range of values, you really should check the representativeness of source data.

Hope this help

Share this article ...

Google Plus
Ihren LinkedIn Kontakten zeigen

This entry was posted in Analysis Services, Data mining and tagged , , by Marcin Szeliga. Bookmark the permalink.
Marcin Szeliga

About Marcin Szeliga

Since 2006 invariably awarded Microsoft Most Valuable Professional title in the SQL category. A consultant, lecturer, authorized Microsoft trainer with 15 years’ experience, and a database systems architect. He prepared Microsoft partners for the upgrade to SQL Server 2008 and 2012 versions within the Train to Trainers program. A speaker at numerous conferences, including Microsoft Technology Summit, SQL Saturday, SQL Day, Microsoft Security Summit, Heroes Happen {Here}, as well as at user groups meetings. The author of many books and articles devoted to SQL Server.

Leave a Reply

Connect with:
  • This field its required.
  • This field its required.
    • Message is required