Knowledge Discovery from Databases, Decision Trees

lecture 5

IIE Almere

 

Home Page

Lecture 5 – KDD, Decision Trees

Workshop

Homework assignment 5

Course Lecturers

Christian Gibson, Tiberiu Lupascu

E-mail addresses

c.h.s.gibson@hva.nl  t.d.lupascu@hva.nl

 

Lecture 5

 

Download lecture notes, decision tree documents and applets for creating datasets and building decision trees

 

Study the decision tree Tutorial

 

This lecture examined Datamining from an AI perspective. In particular we took a look at how to build a decision tree - an AI tool which learns from past experience. Essentially, a tree is a basic data structure. Previously we handled the tree data structures and traversing the tree in the IIE course modules Data Structures M2.06 (see also Deitel & Deitel, Java How to Program 6th edition, chapter 17) and KDD M4.03. In this lecture we showed how you can build a decision tree by choosing the attributes of a dataset as nodes for the tree.

 

To start of with you need to build up a data set with experience of the subject for which you need to make intelligent predictions. Perhaps you want to make sales forecast for a new car model or maybe you want to predict whether a share value is likely to increase or decrease during the coming year. To demonstrate the possibilities with a decision tree we use the test data set containing 20 basket ball match records from a basket ball team called the Mall rats. The test dataset and the applets which you can use to build the decision trees are provided by the University of Alberta – which explains the choice of basket ball matches as an example. 

 

The decision tree is probably the most frequently used tool for KDD software programs. The effectiveness of the tree in making accurate predictions is based on the choice of the ‘splitting algorithm’ and on the quality of the recorded data. We are assuming that past experience will help us to predict the future. In other words we hold that the way events transpire in the world is not random. The age of the players will probably be one factor which influences their performance. A team of 6-year old children would be unlikely to win a football match when pitted against the number 1 adult team of the national football league. I also wouldn’t bet on a team of ‘seniors’ of 50+ years of age playing against the 1st team of Manchester United or Barcelona.

 

When we build our dataset, we therefore want to record as much information as possible about attributes of the subject which we expect to be relevant in determining the performance. The next trick is to select these attributes to build a decision tree.

 

Take the test data set for the basket ball matches:

Where

When

Fred Starts

Joe offense

Joe defense

Opp C

OutCome

Home

7pm

Yes

Center

Forward

Tall

Won

Home

7pm

Yes

Forward

Center

Short

Won

Away

7pm

Yes

Forward

Forward

Tall

Won

Home

5pm

No

Forward

Center

Tall

Lost

Away

9pm

Yes

Forward

Forward

Short

Lost

Away

7pm

No

Center

Forward

Tall

Won

Home

7pm

No

Forward

Center

Tall

Lost

Home

7pm

Yes

Center

Center

Talls

Won

Away

7pm

Yes

Center

Center

Short

Won

Home

9pm

No

Forward

Center

Short

Lost

Away

7pm

No

Forward

Forward

Short

Lost

Away

5pm

No

Center

Forward

Tall

Won

Home

7pm

No

Center

Center

Tall

Lost

Home

9pm

No

Forward

Forward

Short

Lost

Away

9pm

Yes

Center

Forward

Short

Lost

Home

7pm

Yes

Center

Center

Short

Won

Home

7pm

Yes

Center

Forward

Short

Won

Home

5pm

No

Forward

Center

Short

Lost

Home

7pm

Yes

Center

Forward

Tall

Won

Away

5pm

No

Center

Center

Tall

Lost

The dataset consists of 20 records and 7 attributes per record. The last attribute of each record is the target attribute which we want to predict. Each record represents a ‘case in hand’ or an ‘instance’. In a new match the data in the first six attributes is known. What we don’t know is the result of the match which still has to be played.

In order to build a decision tree which can predict this result we have to choose an attribute as the root of the tree.

In this data set we see that all the matches which started at 7pm resulted in a win for the Mall Rats. Perhaps this is a good indicator for future results? We could choose this attribute as the root node of the tree and then continue, level by level to choose attributes to fit int our tree. Actually, a random selection of attributes is not a good idea. That is, the resulting tree is unlikely to have any real predictive value. How then should we select attributes to function as nodes in our decision tree?

To discover how to do this I recommend that you download the document I have made using material from the Alberta University website.

The document provides a guide to building a decision tree using various techniques for selecting attributes to function as a node in the tree. Most selection procedures are in one way or another connected with the information gain principle.

Information Gain

The rules are simple. A tree consists of:

  • a root node

  • branches

  • terminal nodes (leaves)

Attributes which have been selected from the dataset as a node in the tree may not be repeated in the same branch. If we chose the time of the match as the root node, then this attribute may not be used again in any branch of the tree since all branches originate from the root node. Also attributes should have a discrete value. If a non discrete value (e.g. temperature or price) is recorded in the dataset then we must classify the values into discrete classes in order to be able to use the data in a decision tree.

Entropy - the 2nd law of thermodynamics

During the lecture we briefly mention the concept of entropy - a mathematical concept which basically refers to the ammount of  'disorder' present in a universum.

The second law of thermodynamics holds the everything in the universe is heading towards a maximum state of 'disorder' (chaos). We can get a feeling of what this means by considering a bathtub full of cold water. If I run the hot water tap the hot water will slowly mix with the cold water until finally the average temperature of the water is the same. Initially however there will be a difference. In some parts of the bath, the water will be warmer than in other parts of the bath - that is, the water molecules show a certain amount of order. There will be concentrations of rapidly moving molecules (warm water) and slowly moving molecules (cold water) which are not perfectly mixed. In the end however the individual concentrations will disappear. The ordering of the molecules in separate concentrations will have completely disappeared.

 

The idea of information gain is that we should ideally chose nodes which produce the maximum reduction in entropy. The entropy measure of a leaf node is 0. That is, all the instances in the data set which reach this leaf node have the same properties. The principle of Occam's razor dictates that a smaller tree is better than a larger tree. By achieving a maximum reduction in entropy at each level in the tree, we will reach zero entropy (the leaf node) with a minimum number of levels. The entropy formula is:

entropy equation

and the information gain formula (entropy reduction between two levels) is:

information gain equation

The first term on the right hand side of the expression is the entropy of a node at level n and the second term is the average entropy of the nodes in the same branch at level n+1; the bigger the drop in entropy between the two levels, the greater the information gain.

 

I have included several documents in the files which can be downloaded for this lecture, which explain the relationship between entropy and information gain in more detail. It is not necessary to understand the mathematics in order to apply the gain algorithm into a procedure for building a tree, but for readers who are interested I recommend you try to follow the explanations given. It is less difficult than it looks. Actually with a reasonable command of school mathematics you should be able to follow it. The tutorial link at the start of this summary provides an excellent explanation and is well recommended by a number of sites.

5 minutes break

Workshop

After the lecture students started on the homework assignment. The task was to:

  • build a dataset on a given subject

  • build a decsion tree using this dataset

To carry out this assignment you will probably need to refer to the documents and applets which can be downloaded for this lecture. To run the applets you will have to have the Java runtime engine installed on your computer.

Assignment 5

 

 

Homework

The homework assignment associated with this lecture is an extension of the workshop exercises. In all there are 6 different versions:

 

 

For assistance in completing this assignment you can refer to tutorials and calculation applets provided by:

http://www.cs.ubc.ca/labs/lci/CIspace/Version4/dTree/index.html

 

and

 

http://www.cs.ualberta.ca/~aixplore/learning/DecisionTrees/index.html

 

 

1) Create a data set

Create a data set for predicting whether a team will win or lose a football match.

Ø      Each record (instance) in the data set should consist of 5 attributes (input parameters) and 1 label (output parameter).

Ø      The dataset should consist of 20 records.

 

2) Build a decision tree for predicting the outcome of new unlabelled instances, using 15 instances from your existing data set

Ø      Use the information gain algorithm you have used for node splitting

Ø      Specify the entropy reduction values generated by each node choice.

 

3) Test the decision tree with the remaining 5 instances

 

The alternative homework assignments were the same as shown here with the exception of the first part of step 1. The other versions were:

  • Create a data set for predicting if a person is likely to vote for the Dutch liberal party (VVD)

  • Create a data set for predicting if a person is likely to suffer from AIDS

  • Create a data set for prediction if a share is likely to increase in value by more than 20% in the coming year

  • Create a data set for predicting if a job applicant for a sales function is likely to be a good salesperson

  • Create a data set for predicting if a person is a good credit risk for a new mobile phone

The homework assignment should be submitted at the latest by the last day prior to lecture 6 (hybrid systems).

Return