Decision trees and rule-based classifiers
Decision tree: An example
I Idea: Ask a sequence of questions (as in the ‘20 questions’
game) to infer the class
[Figure: a decision tree for spam filtering. The internal nodes test whether the message includes ‘netflix prize’, ‘viagra’, ‘millions’, or ‘meds’; each test branches on yes/no, and the leaves predict ‘spam’ or ‘not spam’.]
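The ‘sequence of questions’ idea can be sketched as a nested structure. The tree shape and keyword order below are illustrative stand-ins for the figure, not its exact topology:

```python
# Nested tuples: (keyword, subtree_if_present, subtree_if_absent);
# a bare string is a leaf carrying the predicted class.
TREE = ("viagra", "spam",
        ("netflix prize", "not spam",
         ("millions",
          ("meds", "spam", "not spam"),
          "not spam")))

def classify(tree, text):
    """Ask one 'question' per level until a leaf is reached."""
    if isinstance(tree, str):          # leaf: a class label
        return tree
    keyword, if_yes, if_no = tree
    return classify(if_yes if keyword in text else if_no, text)
```

For example, `classify(TREE, "extra strength viagra")` follows the yes-branch of the root and returns `"spam"`.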
Decision tree: A second example
© Tan, Steinbach, Kumar: Introduction to Data Mining, 4/18/2004
Classification techniques:
I Decision tree based methods
I Rule-based methods
I Memory based reasoning
I Neural networks
I Naïve Bayes and Bayesian belief networks
I Support vector machines
Example of a decision tree

Training data (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model (decision tree): Refund is the splitting attribute at the root. Refund = Yes → NO; Refund = No → test MarSt: Married → NO; Single or Divorced → test TaxInc: < 80K → NO, ≥ 80K → YES.
I There can be many different trees that all work equally well!
Decision tree: Structure
I Structure of the tree:
I A single root node with no incoming edges, and zero or more outgoing edges
I Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges
I Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges
I Node contents:
I Each terminal node is assigned a prediction (here, for simplicity: a definite class label).
I Each non-terminal node defines a test, with the outgoing edges representing the various possible results of the test (here, for simplicity: a test only involves a single attribute)
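The structure just described (one root, internal test nodes, terminal prediction nodes) maps directly onto a small data type. The field names here are our own choice for this sketch, not fixed terminology:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    prediction: Optional[str] = None          # set for leaf/terminal nodes
    test_attribute: Optional[str] = None      # set for internal nodes
    children: dict = field(default_factory=dict)  # test outcome -> child Node

    def is_leaf(self):
        # A leaf has no outgoing edges, i.e. no children.
        return not self.children
```

A decision tree is then just a root `Node` whose `children` dict maps each test outcome to a subtree.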
Decision tree: 2D example
[Figure: points in the unit square (axes x and y from 0 to 1) classified by a tree of threshold tests on x and y; the class counts at the leaves and the resulting axis-parallel decision regions are shown.]

I The border line between two neighboring regions of different classes is known as the decision boundary
I The decision boundary is parallel to the axes because each test condition involves a single attribute at a time

Oblique decision trees (e.g. a test x + y < 1 separating Class = + from Class = −):
I The test condition may involve multiple attributes
I More expressive representation
I Finding the optimal test condition is computationally expensive
I Notation: In this figure x and y are two continuous-valued attributes (i.e. y is not the class label in this figure!)
I Decision boundary consists of parts which all are parallel to the axes because each decision depends only on a single attribute
Learning a decision tree from data: General idea
I Simple idea: Recursively divide up the space into pieces which are as pure as possible

[Figure: a 2D dataset (x1 from 0 to 5, x2 from 0 to 3), split step by step: first x1 > 2.8, then x2 > 2.0 on one branch and x2 > 1.2 on the other, and finally x1 > 4.3; the resulting regions are labeled A, B, B, and C.]
Learning a decision tree from data: Hunt’s algorithm
I Notation: Let Dt denote the set of records corresponding to node t. For the root node, Dt is the set of all training data.
I Hunt’s algorithm:
1. If all the records in Dt belong to the same class yt, then t is a leaf node labeled as yt
2. If Dt contains records that belong to more than one class, select an attribute test condition that partitions the records into smaller subsets. Create a child node for each outcome and distribute the records in Dt to the children. Apply the algorithm recursively to each child node.
I Special cases:
I Dt is an empty set ⇒ use majority vote among the parent’s records
I All records in Dt are identical, but the labels are not ⇒ use majority vote
I This general method is also known as TDIDT (Top-Down Induction of Decision Trees).
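Hunt’s algorithm can be sketched roughly as follows. Two simplifications are assumed here: the test-selection step is replaced by ‘take the first remaining attribute’, and the split is multiway over the values actually observed at the node (so the empty-Dt case does not arise in this variant):

```python
from collections import Counter

def hunt(records, labels, attributes):
    """Simplified Hunt's algorithm; records are dicts, labels a parallel list."""
    if len(set(labels)) == 1:                  # case 1: pure node -> leaf
        return labels[0]
    if not attributes or all(r == records[0] for r in records):
        # identical records with mixed labels (or nothing left to test):
        # majority vote
        return Counter(labels).most_common(1)[0][0]
    attribute = attributes[0]                  # stand-in for real test selection
    children = {}
    for value in sorted(set(r[attribute] for r in records)):
        subset = [(r, y) for r, y in zip(records, labels) if r[attribute] == value]
        children[value] = hunt([r for r, _ in subset], [y for _, y in subset],
                               attributes[1:])
    return (attribute, children)
```

The returned tree is a nested `(attribute, {value: subtree})` structure with class labels at the leaves.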
Attribute test conditions
I Binary attributes: yes / no (two children only)
I Nominal (categorical) attributes with L states:
I Multiway split (L children)
I Binary split (2 children, any of the 2^(L−1) − 1 ways of splitting)
I Ordinal attributes with L states:
I Multiway or binary split
I Must respect the ordering (only combine contiguous values)
I Continuous attributes:
I Multiway or binary split
I Defined using breakpoints
Impurity measures
How do we measure whether a subset of the records is ‘pure’ or ‘impure’?
I Denote by p(i|t) the fraction of records belonging to class i of all the records at node t.
I Impurity measures:

Entropy(t) = − Σ_{i=0..K−1} p(i|t) log p(i|t)

Gini(t) = 1 − Σ_{i=0..K−1} p(i|t)²

Classification error(t) = 1 − max_i p(i|t)

where K is the total number of classes.
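The three measures translate directly into code. Log base 2 is assumed here (the formulas above leave the base unspecified; it only scales the entropy):

```python
from math import log2

def entropy(p):
    """Entropy of class fractions p = [p(0|t), ..., p(K-1|t)] at a node.
    By convention 0 * log 0 = 0, hence the pi > 0 filter."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def gini(p):
    """Gini impurity: 1 minus the sum of squared class fractions."""
    return 1 - sum(pi ** 2 for pi in p)

def classification_error(p):
    """1 minus the fraction of the majority class."""
    return 1 - max(p)
```

All three are 0 for a pure node, e.g. `gini([1.0, 0.0])`, and maximal at the uniform distribution.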
Impurity measures: Binary classification
[Figure: the three impurity measures for binary classification plotted against p0 ∈ [0, 1]: Entropy/2, Gini, and misclassification error. All three vanish at p0 = 0 and p0 = 1 and peak at p0 = 0.5.]
I Qualitatively, the three measures agree. However, some differences in the selection of test attributes do occur.
Selecting the best split
I Test all valid splits
I Select the split which maximizes the gain ∆:

∆ = I(parent) − Σ_{j=1..k} (N(vj) / N) I(vj),   (18)

where
I I(·) is the impurity of a given node
I N is the total number of records of the parent
I N(vj) is the number of records of child vj
I Note: Essentially just minimizing a weighted sum of the impurities of the children in the split
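The gain ∆ of equation (18) for a proposed split, sketched with Gini as the impurity I(·); either of the other measures could be substituted:

```python
def gini(p):
    return 1 - sum(pi ** 2 for pi in p)

def gain(parent_counts, children_counts):
    """Delta = I(parent) - sum_j N(v_j)/N * I(v_j), from per-class counts,
    e.g. parent_counts = [5, 5], children_counts = [[4, 1], [1, 4]]."""
    def fractions(counts):
        total = sum(counts)
        return [c / total for c in counts]
    N = sum(parent_counts)                      # total records of the parent
    children = sum(sum(c) / N * gini(fractions(c)) for c in children_counts)
    return gini(fractions(parent_counts)) - children
```

For a 5/5 parent split into (4, 1) and (1, 4) children, I(parent) = 0.5 and the weighted child impurity is 0.32, so ∆ = 0.18.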
I Computational issues:
I Binary attributes: Just one split to test per attribute
I Discrete attributes: Finite number of possible splits to test per attribute
I Continuous attributes: In principle, any value x ∈ R could be chosen as a breakpoint, but we only need to test midpoints between adjacent points. Computationally feasible by first sorting the records (complexity O(N log N)), then stepping through in order while updating the necessary statistics.
Continuous attributes: Computing the Gini index (cont.)

I For efficient computation, for each attribute:
I Sort the attribute on values
I Linearly scan these values, each time updating the count matrix and computing the Gini index
I Choose the split position that has the least Gini index

Cheat           No    No    No    Yes   Yes   Yes   No    No    No    No
Taxable Income  60    70    75    85    90    95    100   120   125   220
Split position  55    65    72    80    87    92    97    110   122   172   230
Yes (≤ | >)     0|3   0|3   0|3   0|3   1|2   2|1   3|0   3|0   3|0   3|0   3|0
No  (≤ | >)     0|7   1|6   2|5   3|4   3|4   3|4   3|4   4|3   5|2   6|1   7|0
Gini            0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

(The minimum Gini index, 0.300, is attained at split position 97.)
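The linear scan described above can be sketched as follows. One deviation from the table: where the table uses round split positions such as 97, this version tests exact midpoints between distinct adjacent values (hence 97.5), which gives the same weighted Gini:

```python
from collections import Counter

def best_gini_split(values, labels):
    """One sorted linear scan over candidate breakpoints (midpoints between
    distinct adjacent values), maintaining per-class counts on each side.
    Returns (best_threshold, best_weighted_gini)."""
    def gini_of(counts, total):
        if total == 0:
            return 0.0
        return 1 - sum((c / total) ** 2 for c in counts.values())

    pairs = sorted(zip(values, labels))
    n = len(pairs)
    left = {c: 0 for c in set(labels)}   # class counts below the breakpoint
    right = dict(Counter(labels))        # class counts above the breakpoint
    best = (None, float("inf"))
    for i in range(n - 1):
        v, y = pairs[i]
        left[y] += 1                     # move one record across the breakpoint
        right[y] -= 1
        if v == pairs[i + 1][0]:
            continue                     # no breakpoint between equal values
        nl, nr = i + 1, n - i - 1
        weighted = nl / n * gini_of(left, nl) + nr / n * gini_of(right, nr)
        if weighted < best[1]:
            best = ((v + pairs[i + 1][0]) / 2, weighted)
    return best
```

On the Taxable Income column of the table this finds the split between 95 and 100 with weighted Gini 0.300.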
Splitting criteria based on INFO

I Entropy at a given node t:

Entropy(t) = − Σ_j p(j|t) log p(j|t)

(NOTE: p(j|t) is the relative frequency of class j at node t.)
I Measures homogeneity of a node:
I Maximum (log nc) when records are equally distributed among all classes, implying least information
I Minimum (0.0) when all records belong to one class, implying most information
I Entropy-based computations are similar to the GINI index computations
Example: Web Robot Detection
I Resulting decision tree:
(more details in the book)
Characteristics of Decision Tree Induction
I Nonparametric approach:
I Can in the limit approximate any decision boundary to arbitrary precision
⇒ Approaches optimal performance (i.e. Bayesian classifier with known distributions) in the infinite sample limit ... but requires regularization to avoid overlearning.
[Figure: a 2D dataset (x1 from 0 to 5, x2 from 0 to 3) whose decision boundary is approximated by successively finer axis-parallel splits.]
Characteristics of Decision Tree Induction (cont.)
I Local, greedy learning to find a reasonable solution in reasonable time (as opposed to finding a globally optimal solution)
I Relatively easy to interpret (by experts or regular users of the system)
I cf. ”Why was I recommended this?” on amazon.com (helps build confidence in the system)
Characteristics of Decision Tree Induction (cont.)
I Data fragmentation problem:
I The nodes far down the tree are based on a very small fraction of the data, perhaps only a few data points ⇒ typically not very reliable information
I Example: Divide any of the existing leaves into purer partitions:
[Figure: the earlier 2D example with splits x1 > 2.8, x2 > 2.0, x2 > 1.2, x1 > 4.3 and regions A, B, B, C; any further split of an existing leaf would rest on only a handful of points.]
Rule-based classifier
I e.g. Mac OS X ‘Mail’ application:
Rule-based classifier: Example
I Example:
r1: (‘webmail’) ∩ (‘give password’) → spam
r2: (‘important’) ∩ (sender = teacher) → not spam
r3: (‘viagra’) ∩ (‘extra strength’) → spam
r4: (‘millions’) ∩ (‘netflix challenge’) → not spam
r5: (‘you have won’) → spam
I Idea:
‘Many local classification models make up one global classifier’
I If we can find reliable rules of the form r1, . . . ,rn we can build a working classifier
Rule-based classifier: Definition
I Rule set R={r1, . . . ,rn}
I Each rule ri is of the form

ri: (Conditioni) → yi,   (19)

where the left-hand side is the rule antecedent (a logical expression that evaluates to true or false depending on the data point) and the right-hand side is the rule consequence, i.e. a prediction (here, for simplicity, one of the classes).
I A rule ri is said to cover a record if its rule antecedent evaluates to true. Conversely, the record is said to trigger the rule.
Properties of a single rule ri
I The coverage of a rule is the fraction of records (in the training data) which it covers
I The accuracy of a rule is the fraction of covered records for which the rule consequence equals the true class
For instance, if the rule
r5: (‘you have won’)→ spam
has a coverage of 0.1 and an accuracy of 0.85, it means that 10% of all emails received included the phrase ‘you have won’, and out of all those emails 85% were truly spam and 15%
were non-spam.
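Coverage and accuracy can be computed directly from their definitions. A rule is represented here as a pair of an antecedent function and a predicted class, a representation chosen for this sketch:

```python
def coverage_and_accuracy(rule, records, labels):
    """rule = (antecedent, predicted_class); antecedent is a boolean function.
    Coverage: fraction of records covered.
    Accuracy: fraction of *covered* records whose label matches the prediction."""
    antecedent, predicted = rule
    covered = [y for r, y in zip(records, labels) if antecedent(r)]
    coverage = len(covered) / len(records)
    accuracy = (sum(1 for y in covered if y == predicted) / len(covered)
                if covered else 0.0)
    return coverage, accuracy
```

For a rule like r5, the antecedent would be a phrase test such as `lambda e: "you have won" in e`.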
Properties of the rule set R as a whole
I The rules in rule set R are mutually exclusive if no two rules can be triggered by any single record.
I The rules in rule set R are exhaustive if at least one rule is triggered by any record.
(Note that we are here considering any possible record, not just the records in the training dataset.)
I Together, the two properties imply that any record can be classified into a unique class. However, not all rule sets R have these properties.
Example
I The rules:
r1: (‘webmail’) ∩ (‘give password’) → spam
r2: (‘important’) ∩ (sender = teacher) → not spam
r3: (‘viagra’) ∩ (‘extra strength’) → spam
r4: (‘millions’) ∩ (‘netflix challenge’) → not spam
r5: (‘you have won’) → spam
are not mutually exclusive because the email “you have won the netflix challenge, please come to the bank tomorrow to collect your millions” triggers both rules r4 and r5.
They are also not exhaustive because the email “hello from an old friend” is not covered by any rule.
2D example
[Figure: a 2D dataset (x1 from 0 to 5, x2 from 0 to 3) of green and red points, with three non-overlapping rectangles r1, r2, r3, each predicting a class (r1: red, r2: green, r3: red).]
I Rule r1: (0.2≤x1 ≤2.2)∩(1.3≤x2 ≤4.2)→ red
has a coverage of 8/40 = 0.2 and an accuracy of 7/8 = 0.875.
I The rules in rule set R={r1,r2,r3} are mutually exclusive because the rectangles defining the rules are non-overlapping, but they are not exhaustive because there are areas of the space not covered by any of the rectangles
Default rule
I A simple way to handle a non-exhaustive rule set is to add a default rule which has an empty antecedent (always true):
rd: ()→yd
I Note that since this rule is always triggered, the resulting rule set is necessarily not mutually exclusive (as long as any other rule is sometimes triggered)
⇒ Some mechanism is needed to handle non-mutually exclusive cases (see next slide)
Ordering or voting
I When the rule set is not mutually exclusive, conflicts need to be handled by
I Rule ordering: Instead of an unordered collection of rules, the order is taken to be important. Hence, a new record is classified by the first rule that it triggers.
I Voting: Each triggered rule votes for its consequence class.
(The votes can also be weighted by the rule’s accuracy.)
I Note that an ordered list of rules (a decision list) behaves quite a bit like a decision tree. In fact, it is easy to show that a decision tree can be written in the form of an ordered list of rules, and vice versa.
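A rule-ordering classifier (decision list) with a default rule can be sketched as a simple closure; the rule representation (antecedent function plus class) is our own choice for this sketch:

```python
def decision_list(rules, default):
    """rules: an *ordered* list of (antecedent, class) pairs.
    The first triggered rule wins; the default rule rd fires when none do."""
    def classify(record):
        for antecedent, label in rules:
            if antecedent(record):
                return label
        return default                 # rd: () -> yd, always triggered last
    return classify
```

Placing the default rule conceptually at the end of the ordering is what resolves the non-mutual-exclusivity it introduces.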
Sequential covering algorithm
I Learning a rule-based classifier can be performed by
identifying good rules and then removing the covered records from the training set:
[Figure: Example of sequential covering: (i) the original data; (ii) Step 1: a first rule R1 is found; (iii) Step 2: the records covered by R1 are removed; (iv) Step 3: a second rule R2 is found on the remaining records.]
Selecting the next rule
I We want to select a rule that has both high coverage and high accuracy. (Of course, typically this is a trade-off which has to be made.)
I Searching for such a rule:
I Exhaustive search of all possible rules infeasible for all but the very smallest of problems
I ‘General-to-specific’ greedy search: Start from an empty antecedent, add conditions one at a time to improve accuracy (but take care to still have decent coverage as well)
I ‘Specific-to-general’ greedy search: Start from a random record, then generalize by removing conditions one-by-one to obtain better coverage while not sacrificing too much in accuracy
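The sequential covering loop can be sketched as follows, under the simplifying assumption that candidate rules come from a fixed pool rather than from the general-to-specific or specific-to-general searches just described:

```python
def sequential_covering(records, labels, candidate_rules, min_accuracy=0.8):
    """Greedy sketch: repeatedly pick the candidate rule with the best
    (accuracy, coverage) on the *remaining* records, then remove the
    records it covers. candidate_rules: list of (antecedent, class) pairs;
    min_accuracy is a hypothetical stopping threshold."""
    remaining = list(zip(records, labels))
    learned = []
    while remaining:
        best, best_score = None, (0.0, 0.0)
        for antecedent, cls in candidate_rules:
            covered = [y for r, y in remaining if antecedent(r)]
            if not covered:
                continue
            accuracy = sum(1 for y in covered if y == cls) / len(covered)
            score = (accuracy, len(covered) / len(remaining))
            if score > best_score:          # accuracy first, then coverage
                best, best_score = (antecedent, cls), score
        if best is None or best_score[0] < min_accuracy:
            break                           # no acceptable rule left
        learned.append(best)
        remaining = [(r, y) for r, y in remaining if not best[0](r)]
    return learned
```

Each iteration corresponds to one ‘step’ in the sequential covering figure: find a rule, strip out what it covers, repeat.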
Characteristics of rule-based classifiers
I Expressiveness similar to that of decision trees
I Rectilinear partitioning (though more complex decision boundaries with unordered rule sets and voting)
I Ordered rule sets can be written as decision trees and vice versa
I Similar also in terms of the underlying principles of learning the models
I Non-parametric, need for regularization
I Can produce descriptive models
I Easy to interpret (when the number of rules is not too large)
Decision trees and rule-based classifiers: Summary
I Can be easy to understand/interpret by experts/users (for small models)
I Classification generally very fast (worst case is linear in depth of decision tree or rule list length)
I Non-parametric method (can approximate the optimal classifier for any underlying distribution) in the large sample limit (but may require a very large number of rules)
I Need to regularize / find a good stopping criterion when learning to avoid overfitting