How much data do you need? Part 1 — the amount of data is rather irrelevant for data mining algorithms

Do data mining algorithms need a ton of data (like millions of cases) to find hidden patterns? Absolutely not.

Does this mean that a handful of cases is enough to mine? Probably not, either.

In this article I am going to explain the first statement. In the upcoming one we will see why the difference between these two statements is so important.

Who had the best chance to survive?

Suppose that our goal is to find out who had the best chance of surviving the Titanic catastrophe, and why. For this task let me use the best-known data mining algorithm: decision trees. I got the Titanic passenger list (you can easily find it on the Internet), which looks like this:

SELECT * FROM [Titanic];
ID   Name                              Sex     Class   Age   Boat   Parents/Children   Siblings/Spouses   Deck   Embarked      Home/Destination         Survived
1    Abbing, Mr. Anthony               male    Lower   42    NULL   0                  0                  NULL   Southampton   NULL                     No
2    Abbott, Master. Eugene Joseph     male    Lower   13    NULL   2                  0                  NULL   Southampton   East Providence, RI      No
3    Abbott, Mr. Rossmore Edward       male    Lower   16    NULL   1                  1                  NULL   Southampton   East Providence, RI      No
4    Abbott, Mrs. Stanton (Rosa Hunt)  female  Lower   35    21     1                  1                  NULL   Southampton   East Providence, RI      Yes
5    Abelseth, Miss. Karen Marie       female  Lower   16    NULL   0                  0                  NULL   Southampton   Norway Los Angeles, CA   Yes
6    Abelseth, Mr. Olaus Jorgensen     male    Lower   25    21     0                  0                  F      Southampton   Perkins County, SD       Yes


This list contains 1,309 passengers, so I have about 1.3 thousand cases, which is not a huge number at all. For simplicity, I am not going to prepare or enhance this data in any way.

The model looks like this: I used Age, Boat (the number of the lifeboat assigned to the passenger), Class, Deck, Parents/Children, Sex, and Siblings/Spouses as input attributes, and Survived as the input/output (aka predict) one:
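For reference, a model like this can also be defined directly in DMX instead of the designer. The sketch below is only an illustration: the model name [TitanicTree] and the exact column declarations are my assumptions, not a dump of the model shown in the screenshots.

```sql
-- Hypothetical DMX sketch of the mining model described above
-- (model name and column declarations are illustrative assumptions).
CREATE MINING MODEL [TitanicTree] (
    [ID]                 LONG   KEY,
    [Age]                DOUBLE CONTINUOUS,
    [Boat]               LONG   DISCRETE,
    [Class]              TEXT   DISCRETE,
    [Deck]               TEXT   DISCRETE,
    [Parents/Children]   DOUBLE CONTINUOUS,
    [Sex]                TEXT   DISCRETE,
    [Siblings/Spouses]   DOUBLE CONTINUOUS,
    [Survived]           TEXT   DISCRETE PREDICT  -- input/output (predict) attribute
)
USING Microsoft_Decision_Trees;
```

After creating the model you would train it with a DMX INSERT INTO statement fed from the source table.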


After processing the model, I checked the dependency network. As expected, the strongest predictor was sex, followed by class, boat, and finally deck:


Also have a look at the decision tree (all four levels of it). You can read from it that if you were a man, you really should have had a boat assigned:


Let’s see what happens when we duplicate the same data 50 times:

CREATE TABLE [dbo].[Titanic50Times](
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [Name] [nvarchar](255) NULL,
    [Sex] [nvarchar](255) NULL,
    [Class] [nvarchar](255) NULL,
    [Age] [float] NULL,
    [Boat] [int] NULL,
    [Parents/Children ] [float] NULL,
    [Siblings/Spouses ] [float] NULL,
    [Deck] [nvarchar](255) NULL,
    [Embarked] [nvarchar](255) NULL,
    [Home/Destination] [nvarchar](255) NULL,
    [Survived] [nvarchar](255) NULL
);
GO

INSERT INTO [Titanic50Times]
      ([Name],[Sex],[Class],[Age],[Boat]
      ,[Parents/Children ]
      ,[Siblings/Spouses ]
      ,[Deck],[Embarked],[Home/Destination],[Survived])
 SELECT [Name],[Sex],[Class],[Age],[Boat]
      ,[Parents/Children ]
      ,[Siblings/Spouses ]
      ,[Deck],[Embarked],[Home/Destination],[Survived]
 FROM [dbo].[Titanic];
 GO 50


At first glance the dependency network looks a little different: this time all 7 attributes show up, but their relative importance is more or less the same:


The decision tree also seems different (most importantly, it is much bigger). But if you look closely, you will find that similar rules were discovered, only this time they are more specific. This makes sense: with 50 copies of every case, each candidate split has 50 times the support, so the default settings let the tree grow much deeper:


Now it’s time to tweak the first model a bit by lowering two algorithm parameters: COMPLEXITY_PENALTY and MINIMUM_SUPPORT. The first one inhibits the growth of the decision tree (the higher the value, the smaller the tree); the second one determines the minimum number of cases required to generate a split. In essence, we simply adjust the algorithm to a much lower number of cases:
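In DMX these parameters are supplied in the USING clause when the model is created. A hedged sketch follows; the model name and the concrete values 0.1 and 5 are illustrative assumptions, not the exact settings from my experiment:

```sql
-- Hypothetical sketch: a decision-tree model with lowered algorithm
-- parameters (the values 0.1 and 5 are illustrative only).
CREATE MINING MODEL [TitanicTreeTuned] (
    [ID]       LONG   KEY,
    [Age]      DOUBLE CONTINUOUS,
    [Boat]     LONG   DISCRETE,
    [Class]    TEXT   DISCRETE,
    [Sex]      TEXT   DISCRETE,
    [Survived] TEXT   DISCRETE PREDICT
)
USING Microsoft_Decision_Trees (COMPLEXITY_PENALTY = 0.1, MINIMUM_SUPPORT = 5);
```

In the designer the same effect is achieved by editing the algorithm parameters of the mining model.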


Now both dependency networks look really similar; only deck and the number of parents/children turn out to be irrelevant:


The decision trees also tell the same story: if you were a man, you had better have had a boat assigned and have been no more than 24 years old. If not, being a boy under 8 was your last chance to survive:


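Once such a model is processed, you can ask it about individual cases with a DMX singleton prediction query. This is only a sketch under the assumption that a model named [TitanicTree] with these columns exists:

```sql
-- Hypothetical sketch: predicting survival for a single made-up case
-- (model and column names are illustrative assumptions).
SELECT
    Predict([Survived])            AS PredictedOutcome,
    PredictProbability([Survived]) AS OutcomeProbability
FROM [TitanicTree]
NATURAL PREDICTION JOIN
    (SELECT 'male'  AS [Sex],
            'Lower' AS [Class],
            NULL    AS [Boat],
            25.0    AS [Age]) AS t;
```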
Lessons learned:

1. The raw amount of data is mostly irrelevant for data mining algorithms. Only the information hidden in this data matters.

2. An algorithm’s default configuration is suitable only for reasonably large datasets.

See you next week, when we are going to answer the initial question and find out how much data you really need.


This entry was posted in Analysis Services and Data mining by Marcin Szeliga.

About Marcin Szeliga

Holder of the Microsoft Most Valuable Professional title in the SQL category every year since 2006. A consultant, lecturer, authorized Microsoft trainer with 15 years’ experience, and a database systems architect. He prepared Microsoft partners for the upgrade to SQL Server 2008 and 2012 versions within the Train to Trainers program. A speaker at numerous conferences, including Microsoft Technology Summit, SQL Saturday, SQL Day, Microsoft Security Summit, Heroes Happen {Here}, as well as at user group meetings. The author of many books and articles devoted to SQL Server.
