How much data do you need? Part 1 — the amount of data is rather irrelevant for data mining algorithms

Do data mining algorithms need a ton of data (like millions of cases) to find hidden patterns? Absolutely not.

Does this mean that you can mine just a handful of cases? Probably not either.

In this article I am going to explain the first statement. In the upcoming one we will see why the difference between those two statements is so important.

Who had the best chance to survive?

Suppose that our goal is to find out who had the best chance of surviving the Titanic catastrophe, and why. For this task let me use the best-known data mining algorithm: decision trees. I got the Titanic passenger list (you can easily find it on the Internet), which looks like this:

SELECT *
FROM [Titanic];
------
ID  Name                              Sex     Class  Age  Boat  Parents/Children  Siblings/Spouses  Deck  Embarked     Home/Destination        Survived
1   Abbing, Mr. Anthony               male    Lower  42   NULL  0                 0                 NULL  Southampton  NULL                    No
2   Abbott, Master. Eugene Joseph     male    Lower  13   NULL  2                 0                 NULL  Southampton  East Providence, RI     No
3   Abbott, Mr. Rossmore Edward       male    Lower  16   NULL  1                 1                 NULL  Southampton  East Providence, RI     No
4   Abbott, Mrs. Stanton (Rosa Hunt)  female  Lower  35   21    1                 1                 NULL  Southampton  East Providence, RI     Yes
5   Abelseth, Miss. Karen Marie       female  Lower  16   NULL  0                 0                 NULL  Southampton  Norway Los Angeles, CA  Yes
6   Abelseth, Mr. Olaus Jorgensen     male    Lower  25   21    0                 0                 F     Southampton  Perkins County, SD      Yes

 

The list contains 1,309 passengers, so I have about 1.3 thousand cases, not a huge number at all. For simplicity, I am not going to prepare or enhance this data in any way.

The model looks like this: I used Age, Boat (the number of the lifeboat assigned to this person), Class, Deck, Parents/Children, Sex and Siblings/Spouses as input columns, and Survived as the input/output (aka predict) one:

[image: mining model structure]
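By the way, the same model could also be defined in script. Below is a minimal DMX sketch, assuming a hypothetical model name [TitanicSurvival] and column types matching the table above; the model in this article was actually built in the designer, and the training step (INSERT INTO) is not shown:

CREATE MINING MODEL [TitanicSurvival] (
    [ID]               LONG   KEY,
    [Age]              DOUBLE CONTINUOUS,
    [Boat]             LONG   DISCRETE,
    [Class]            TEXT   DISCRETE,
    [Deck]             TEXT   DISCRETE,
    [Parents/Children] DOUBLE CONTINUOUS,
    [Sex]              TEXT   DISCRETE,
    [Siblings/Spouses] DOUBLE CONTINUOUS,
    [Survived]         TEXT   DISCRETE PREDICT  // the input/output (predict) column
)
USING Microsoft_Decision_Trees;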

After processing the model, I checked the dependency network. As expected, the strongest predictor was sex, followed by class, boat and finally deck:

[image: dependency network]
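If you prefer queries to viewers, the splits behind this network can also be inspected through the model content. A small sketch, again assuming the hypothetical [TitanicSurvival] model name; each returned row describes one node of the learned tree:

SELECT NODE_CAPTION, NODE_TYPE, NODE_SUPPORT
FROM [TitanicSurvival].CONTENT;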

Also have a look at the decision tree (all four levels of it). You can read from it that if you had been a man, you really should have had a boat assigned:

[image: decision tree, all four levels]

Let’s see what will happen when we duplicate the same data 50 times:

CREATE TABLE [dbo].[Titanic50Times](
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [Name] [nvarchar](255) NULL,
    [Sex] [nvarchar](255) NULL,
    [Class] [nvarchar](255) NULL,
    [Age] [float] NULL,
    [Boat] [int] NULL,
    [Parents/Children ] [float] NULL,
    [Siblings/Spouses ] [float] NULL,
    [Deck] [nvarchar](255) NULL,
    [Embarked] [nvarchar](255) NULL,
    [Home/Destination] [nvarchar](255) NULL,
    [Survived] [nvarchar](255) NULL,
    CONSTRAINT [PK_Titanic50Times] PRIMARY KEY CLUSTERED ([ID]));
GO

INSERT INTO [Titanic50Times]
SELECT [Name]
      ,[Sex]
      ,[Class]
      ,[Age]
      ,[Boat]
      ,[Parents/Children ]
      ,[Siblings/Spouses ]
      ,[Deck]
      ,[Embarked]
      ,[Home/Destination]
      ,[Survived]
FROM [dbo].[Titanic];
-- GO followed by a count repeats the preceding batch, so the INSERT runs 50 times
GO 50
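A quick sanity check (a hypothetical verification query, not part of the original walkthrough) confirms that we now have 50 copies of the 1,309 passengers:

SELECT COUNT(*) AS RowCnt
FROM [dbo].[Titanic50Times];
-- expected: 65450 rows (1,309 passengers x 50 batches)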

 

At first glance the dependency network looks a little different: this time all 7 attributes show up, but their relative importance is more or less the same:

[image: dependency network for the duplicated data]

The decision tree also seems to be different (most noticeably, it is a much bigger one). But if you look closely, you will find that similar rules were discovered, only this time they are more specific:

[image: decision tree for the duplicated data]

Now it's time to tweak the first model a bit by lowering two algorithm parameters: COMPLEXITY_PENALTY and MINIMUM_SUPPORT. The first one inhibits the growth of the decision tree; the second one determines the minimum number of cases required to generate a split. In essence, we simply adjust the algorithm to the much lower number of cases:

[image: algorithm parameters dialog]
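In DMX the same tweak could be expressed by passing the parameters in the USING clause. Another hedged sketch: the column list repeats the one from the earlier model, and the parameter values below are illustrative rather than the exact ones I used:

// Same columns as [TitanicSurvival]; only the algorithm parameters change
CREATE MINING MODEL [TitanicSurvivalTuned] (
    [ID]               LONG   KEY,
    [Age]              DOUBLE CONTINUOUS,
    [Boat]             LONG   DISCRETE,
    [Class]            TEXT   DISCRETE,
    [Deck]             TEXT   DISCRETE,
    [Parents/Children] DOUBLE CONTINUOUS,
    [Sex]              TEXT   DISCRETE,
    [Siblings/Spouses] DOUBLE CONTINUOUS,
    [Survived]         TEXT   DISCRETE PREDICT
)
USING Microsoft_Decision_Trees (COMPLEXITY_PENALTY = 0.1, MINIMUM_SUPPORT = 5);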

Now both dependency networks look really similar; only deck and the number of parents/children turn out to be irrelevant:

[image: dependency network after tuning]

The decision trees also tell the same story, i.e. if you had been a man, you had better have had a boat assigned and have been no more than 24 years old. If not, being a boy under 8 was your last chance to survive:

[image: decision tree after tuning]
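To see such a rule in action, you can run a singleton prediction query against the model. A sketch assuming the hypothetical [TitanicSurvival] model name from earlier; the passenger values are made up for illustration:

// a 20-year-old man with lifeboat 21 assigned
SELECT Predict([Survived]) AS Outcome,
       PredictProbability([Survived]) AS Probability
FROM [TitanicSurvival]
NATURAL PREDICTION JOIN
(SELECT 'male' AS [Sex], 20.0 AS [Age], 21 AS [Boat]) AS t;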

Lessons learned:

1. The raw amount of data is mostly irrelevant for data mining algorithms. Only the information hidden in this data matters.

2. The algorithms' default configuration is suitable only for reasonably large datasets.

See you next week, when we are going to answer the initial question and find out how much data you really need.





About Marcin Szeliga

Awarded the Microsoft Most Valuable Professional title in the SQL category every year since 2006. A consultant, lecturer, authorized Microsoft trainer with 15 years' experience, and a database systems architect. He prepared Microsoft partners for the upgrade to SQL Server 2008 and 2012 within the Train to Trainers program. A speaker at numerous conferences, including Microsoft Technology Summit, SQL Saturday, SQL Day, Microsoft Security Summit, Heroes Happen {Here}, as well as at user group meetings. The author of many books and articles devoted to SQL Server.
