Data
38 ,
Data: Outline
I
Types of data
I data objects and attributes
I unordered vs ordered data
I
Quality of data
I measurement errors
I missing values, inconsistent values, duplicates
I
Preprocessing
I aggregation, sampling, feature extraction
I discretization and variable transformations
I
Similarity and dissimilarity
I desirable properties
I common measures
I
Summary statistics and visualization
39 ,
Standard data model
A data set can often be viewed as a collection of data objects:
I
A data object (also called a record, data point, data vector, case, sample point, observation, or entity) is described by a set of attributes
I
I
An attribute (also called a variable, characteristic, field, feature, or dimension) describes one aspect of a data object. Example:
Student ID Name Date of Birth Credits Average . . .
2328193 Matti 23.09.1985 123 3.6 . . .
9819234 Tuuli 12.03.1986 98 3.8 . . .
. . . . . . . . . . . . . . . . . .
Here: each row is one data object, each column is an attribute.
40 ,
Attributes: Number of values
I
How many distinct values can an attribute take?
1. Binary attributes: only 2 possible states (e.g. on/off, pass/fail)
2. Discrete attributes with N < ∞ states (e.g. course grade: 0, 1, 2, 3, 4, or 5)
3. Discrete attributes with (countably) infinite states (e.g. counts: 0, 1, 2, 3, . . . )
4. Continuous attributes (uncountably infinite states) (e.g. length or weight ∈ R)
I
Asymmetric attributes
1. Binary sparse attributes (‘on’ infrequent and significant, ‘off’ common and insignificant)
2. Other attributes may also have ‘special’ states that need to be taken into account
41 ,
Attributes: Measurement scales
I
What operations are valid?
1. Distinctness: = and ≠
2. Order: <, ≤, >, ≥
3. Addition and subtraction: + and −
4. Multiplication and division: ∗ and /
I
These define four types of attribute measurement scales:
1. Nominal (categorical) 2. Ordinal
3. Interval 4. Ratio
Each measurement scale allows all the operations with number smaller than or equal to the number of the measurement scale. Example: ‘Interval’ attributes allow all operations except multiplication and division.
42 ,
Attributes: Measurement scales – examples
Attribute Type | Description | Examples | Operations
Nominal | The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠) | zip codes, employee ID numbers, eye color, sex: {male, female} | mode, entropy, contingency correlation, χ² test
Ordinal | The values of an ordinal attribute provide enough information to order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests
Interval | For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, −) | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson’s correlation, t and F tests
Ratio | For ratio variables, both differences and ratios are meaningful. (∗, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current | geometric mean, harmonic mean, percent variation
Attribute Level | Transformation | Comments
Nominal | Any permutation of values | If all employee ID numbers were reassigned, would it make any difference?
Ordinal | An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function. | An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.
Interval | new_value = a ∗ old_value + b, where a and b are constants | Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).
Ratio | new_value = a ∗ old_value | Length can be measured in meters or feet.
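To make the transformation rules concrete, a minimal Python sketch (the function name celsius_to_fahrenheit is ours) showing that an interval-scale transform new_value = a ∗ old_value + b preserves comparisons of differences but not ratios of values:

```python
# Interval scales allow transformations new_value = a * old_value + b.
# Celsius -> Fahrenheit is such a transform: differences remain comparable,
# but ratios of values are not preserved, so "20 C is twice as warm as 10 C"
# is meaningless on an interval scale.

def celsius_to_fahrenheit(c):
    return 9 / 5 * c + 32

a, b = 10.0, 20.0
fa, fb = celsius_to_fahrenheit(a), celsius_to_fahrenheit(b)

# Ratios of differences are preserved (both scales agree on relative gaps):
assert (fb - fa) / (celsius_to_fahrenheit(30.0) - fb) == (b - a) / (30.0 - b)

# Ratios of values are NOT preserved:
print(b / a)    # 2.0 in Celsius
print(fb / fa)  # 1.36 in Fahrenheit
```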
43 ,
General characteristics of data
I
Number of data points vs dimensionality (number of attributes)
1. Most traditional data analysis methods assume (many) more data points than dimensions
2. Many interesting datasets have extremely high dimensions and few data objects
I
Sparsity (relatively few non-zero values)
1. Efficient storage and computation, for some methods 2. May be important for modeling
I
Resolution (spatial, temporal, . . . )
1. How small details can the sensors reliably detect 2. How large datasets can the methods handle
44 ,
Examples – Record data
I
Data ‘matrix’
I
Sparse matrix representation?
© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004
Important characteristics of structured data:
I Dimensionality (curse of dimensionality)
I Sparsity (only presence counts)
I Resolution (patterns depend on the scale)
Record data: a collection of records, each of which consists of a fixed set of attributes. Example:
Tid | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes
Data matrix: If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute.
Such a data set can be represented by an m × n matrix, with m rows, one for each object, and n columns, one for each attribute. Example:
Projection of x Load | Projection of y Load | Distance | Load | Thickness
10.23 | 5.27 | 15.22 | 2.7 | 1.2
12.65 | 6.25 | 16.22 | 2.2 | 1.1
Document data: each document becomes a ‘term’ vector, where
I each term is a component (attribute) of the vector
I the value of each component is the number of times the corresponding term occurs in the document.
 | season | timeout | lost | win | game | score | ball | play | coach | team
Document 1 | 3 | 0 | 5 | 0 | 2 | 6 | 0 | 2 | 0 | 2
Document 2 | 0 | 7 | 0 | 2 | 1 | 0 | 0 | 3 | 0 | 0
Document 3 | 0 | 1 | 0 | 0 | 1 | 2 | 2 | 0 | 3 | 0
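The document-to-term-vector construction can be sketched in Python (vocabulary and term_vector are hypothetical names; real systems also apply tokenization, stemming, and weighting such as tf-idf):

```python
from collections import Counter

# One component per vocabulary term, holding the number of times the
# corresponding term occurs in the document.
vocabulary = ["season", "timeout", "lost", "win", "game", "score",
              "ball", "play", "coach", "team"]

def term_vector(text, vocab=vocabulary):
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]

doc = "coach play ball score game win lost timeout season play game score"
print(term_vector(doc))
```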
Transaction data: a special type of record data, where each record (transaction) involves a set of items. For example, consider a grocery store: the set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
TID | Items
1 | Bread, Coke, Milk
2 | Beer, Bread
3 | Beer, Coke, Diaper, Milk
4 | Beer, Bread, Diaper, Milk
5 | Coke, Diaper, Milk
Graph data examples: a generic graph (numbered nodes with weighted edges) and HTML links, e.g.:
<a href="papers/papers.html#bbbb"> Data Mining </a><li>
<a href="papers/papers.html#aaaa"> Graph Partitioning </a><li>
<a href="papers/papers.html#aaaa"> Parallel Solution of Sparse Linear System of Equations </a><li>
<a href="papers/papers.html#ffff"> N-Body Computation and Dense Linear System Solvers
45 ,
Examples – Graph data
I
Relationships between objects
I World wide web
I Facebook users
I Scientific papers
I
Objects are graphs
Example: chemical data, e.g. the benzene molecule C6H6 (atoms as nodes, bonds as edges)
Example: ordered data, sequences of transactions, where each element of the sequence is a set of items/events
46 ,
Examples – Ordered data
I
Data objects with natural ordering
I Sequential data (e.g. logins, credit card purchases, . . . )
I Sequences (e.g. genome)
I Time-series (e.g. climate science, financial data, . . . )
I Spatial data (e.g. journey planner data, surface/sea temperatures, multi-spectral images)
Ordered data: genomic sequence data, e.g.
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered data: spatio-temporal data, e.g. average monthly temperature of land and ocean (map figure).
47 ,
Data quality
I
Measurement error (‘noise’)
I Continuous data: Gaussian, Student’s t distribution, . . .
I Binary data: Bernoulli, . . .
I
Precision and bias
I Precision: closeness of repeated measurements to each other
I Bias: systematic deviation from the true underlying value
Left: high precision, high bias. Right: low precision, low bias.
48 ,
I
Outliers, i.e. ‘anomalous’ objects
(objects that are very different from the others):
I Noise?
I Data collection error?
I Legitimate, interesting objects?
I
Missing values:
(Some attribute values are missing for some data objects)
I Missing at random? Need to model the process?
I Just eliminating such data objects or attributes?
I Estimating and imputing missing values?
I Ignoring or explicitly taking them into account?
I
Duplicate data, inconsistent data
I
Timeliness, relevance, application-specific prior knowledge
49 ,
Data preprocessing
I
Aggregation
⇒ fewer, less noisy, data points
I Example: monthly analysis of daily (or hourly) data
I Loss of resolution
I
Sampling (with/without replacement, stratified or not)
⇒ fewer data points
I Visualization, applying computationally demanding data analysis procedures
I Loss of precision of data statistics
Types of sampling:
I Simple random sampling: there is an equal probability of selecting any particular item
I Sampling without replacement: as each item is selected, it is removed from the population
I Sampling with replacement: objects are not removed from the population as they are selected for the sample, so the same object can be picked up more than once
I Stratified sampling: split the data into several partitions; then draw random samples from each partition
(Figure: the same point set sampled at 8000 points, 2000 points, and 500 points.)
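A minimal Python sketch of the three sampling schemes, using only the standard library random module (the even/odd split is an artificial example of strata):

```python
import random

random.seed(0)
population = list(range(100))

# Simple random sampling without replacement: each selected item is removed
# from the population, so the sample contains no duplicates.
without = random.sample(population, 10)

# Sampling with replacement: the same object can be picked up more than once.
with_repl = [random.choice(population) for _ in range(10)]

# Stratified sampling: split the data into partitions (here: even/odd),
# then draw a random sample from each partition.
even = [x for x in population if x % 2 == 0]
odd = [x for x in population if x % 2 == 1]
stratified = random.sample(even, 5) + random.sample(odd, 5)
```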
50 ,
The ‘curse of dimensionality’: As the dimensionality grows, the data becomes increasingly sparse in the space (e.g. n dimensions of binary variables have 2^n joint states)
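A quick back-of-the-envelope check of this sparsity: with n binary attributes there are 2^n joint states, but m data objects can occupy at most m of them:

```python
# Fraction of the joint state space that m = 1000 objects can cover,
# as the number of binary attributes n grows.
m = 1000
for n in (5, 10, 20, 30):
    coverage = min(m, 2 ** n) / 2 ** n
    print(n, 2 ** n, f"{coverage:.6f}")
```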
I
Feature subset selection
(selecting only some of the attributes, manually or automatically)
I
Feature extraction
(computing new ‘attributes’ to replace existing ones, e.g.
image color/structure features)
I
Dimensionality reduction
(principal component analysis, other unsupervised methods)
51 ,
I
Discretization of continuous attributes (required for some machine learning algorithms)
I e.g. R is divided into (−∞, x1], (x1, x2], (x2, x3], (x3, ∞)
I Unsupervised or supervised
I Ideally: use application-specific knowledge
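A minimal sketch of unsupervised discretization with fixed cut points (the boundaries here are arbitrary, chosen for illustration); bisect_left maps each value to the index of its half-open interval (x_{i−1}, x_i]:

```python
import bisect

# Cut R into (-inf, x1], (x1, x2], (x2, x3], (x3, inf)
# and map each continuous value to a bin index.
boundaries = [2.0, 4.0, 6.0]  # hypothetical cut points x1, x2, x3

def discretize(value, cuts=boundaries):
    # bisect_left sends a value equal to a cut point into the lower bin,
    # matching the half-open intervals (x_{i-1}, x_i].
    return bisect.bisect_left(cuts, value)

print([discretize(v) for v in [1.5, 2.0, 3.7, 6.1]])
```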
52 ,
I
Binarization
I Some algorithms require binary attributes
I How NOT to represent nominal variables:
Categorical value Integer value x1 x2 x3
Toyota 0 0 0 0
Volkswagen 1 0 0 1
Chevrolet 2 0 1 0
Saab 3 0 1 1
Volvo 4 1 0 0
I How better to represent nominal variables:
Categorical value Integer value x1 x2 x3 x4 x5
Toyota 0 1 0 0 0 0
Volkswagen 1 0 1 0 0 0
Chevrolet 2 0 0 1 0 0
Saab 3 0 0 0 1 0
Volvo 4 0 0 0 0 1
I Useful even when binary variables are not strictly required!
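The better (one-hot) representation can be produced with a few lines of Python (the function name one_hot is ours):

```python
# One-hot ('binarization') encoding of a nominal attribute: one asymmetric
# binary attribute per category, with exactly one of them 'on'.
categories = ["Toyota", "Volkswagen", "Chevrolet", "Saab", "Volvo"]
index = {c: i for i, c in enumerate(categories)}

def one_hot(value):
    vec = [0] * len(categories)
    vec[index[value]] = 1
    return vec

assert one_hot("Chevrolet") == [0, 0, 1, 0, 0]
```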
53 ,
I
Variable transformations
I Simple functions (e.g. log(x) often used for strictly positive variables)
I Ideally: use application-specific knowledge
I Normalization / standardization:
x′ = (x − x̄) / sx ,   (1)
where x̄ is the mean of the attribute values and sx is the standard deviation. This yields x′ with zero mean and a standard deviation of 1.
Useful to ensure that the scale (units) of attribute values does not affect the results.
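Equation (1) in Python, using the sample standard deviation from the statistics module (whether to use the sample or population standard deviation is a convention; the slides do not specify):

```python
from statistics import mean, stdev

# Standardize an attribute: x' = (x - mean) / standard deviation,
# giving zero mean and unit (sample) standard deviation.
def standardize(values):
    m, s = mean(values), stdev(values)
    return [(x - m) / s for x in values]

z = standardize([10.0, 20.0, 30.0, 40.0])
assert abs(mean(z)) < 1e-12
assert abs(stdev(z) - 1.0) < 1e-12
```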
54 ,
Similarity and dissimilarity
I
Many machine learning algorithms use measures of similarity or dissimilarity between data objects
I
Examples:
I Handwritten letters
I Segments of DNA
I Text documents
“Parliament overwhelmingly approved amendments to the Firearms Act on Wednesday. The new law requires physicians to inform authorities of individuals they consider unfit to own guns. It also increases the age for handgun ownership from 18 to 20.”
“Parliament's Committee for Constitutional Law says that persons applying for handgun licences should not be required to join a gun club in the future. Government is proposing that handgun users be part of a gun association for at least two years.”
“The cabinet on Wednesday will be looking at a controversial package of proposed changes in the curriculum in the nation's comprehensive schools.
The most divisive issue is expected to be the question of expanded language teaching.”
ACCTGTCG ATCCTGTG
TCGATTGC
55 ,
I
Similarity s:
I Numerical measure of the degree to which two objects are alike
I Higher for objects that are alike
I Typically between 0 (no similarity) and 1 (completely similar)
I
Dissimilarity d:
I Numerical measure of the degree to which two objects are different
I Higher for objects that are different
I Typically between 0 (no difference) and ∞ (completely different)
I
Transformations
I Converting from one to the other
I
Use similarity or dissimilarity measures?
I Method-specific
56 ,
I
Proximity between attribute values: let p and q be the values of one attribute for two data objects. Standard choices (per attribute type):
Attribute Type | Dissimilarity | Similarity
Nominal | d = 0 if p = q, d = 1 if p ≠ q | s = 1 if p = q, s = 0 if p ≠ q
Ordinal | d = |p − q| / (n − 1), with the values mapped to integers 0 . . . n − 1 | s = 1 − d
Interval or Ratio | d = |p − q| | s = −d, s = 1/(1 + d), . . .
57 ,
Dissimilarities between objects:
I
Minkowski distance for vectors in R^n:
d(x, y) = ( Σ_{k=1}^{n} |x_k − y_k|^r )^{1/r}   (2)
I r = 1: city block (Manhattan); Hamming distance for binary vectors
I r = 2: Euclidean distance
I r = ∞: supremum (max) distance
I
Distance metric: d(x, y) is a metric if it satisfies:
I Positivity: d(x, y) ≥ 0 for all x and y, with d(x, y) = 0 if and only if x = y
I Symmetry: d(x, y) = d(y, x) for all x and y
I Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z) for all x, y, z
I
Most dissimilarity measures satisfy positivity. Depending on the application, it may or may not be important to satisfy symmetry and the triangle inequality.
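A direct Python implementation of the Minkowski distance (2), with the r = ∞ case handled as the maximum coordinate difference:

```python
# Minkowski distance d(x, y) = (sum_k |x_k - y_k|**r) ** (1/r):
# r = 1 gives the city-block distance, r = 2 the Euclidean distance,
# and the limit r -> infinity the supremum (max) distance.
def minkowski(x, y, r):
    if r == float("inf"):
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

x, y = (0.0, 0.0), (3.0, 4.0)
assert minkowski(x, y, 1) == 7.0
assert minkowski(x, y, 2) == 5.0
assert minkowski(x, y, float("inf")) == 4.0
```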
58 ,
Similarities between objects:
I
Typical properties of s(x, y):
I 0 ≤ s(x, y) ≤ 1, with s(x, y) = 1 if and only if x = y
I s(x, y) = s(y, x) for all x and y (symmetry)
I
Examples (binary vectors), where fab is the number of attributes for which x = a and y = b:
I Simple matching coefficient
SMC = (f11 + f00) / (f11 + f00 + f01 + f10)   (3)
I Jaccard coefficient
J = f11 / (f11 + f01 + f10)   (4)
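Formulas (3) and (4) in Python; the example vectors illustrate why Jaccard suits asymmetric binary attributes: the two vectors share no 1s, so J = 0 even though SMC = 0.7 thanks to the many shared 0s:

```python
# SMC and Jaccard for binary vectors; f_ab counts attributes with x=a, y=b.
def smc_and_jaccard(x, y):
    f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    f00 = sum(a == 0 and b == 0 for a, b in zip(x, y))
    f01 = sum(a == 0 and b == 1 for a, b in zip(x, y))
    f10 = sum(a == 1 and b == 0 for a, b in zip(x, y))
    smc = (f11 + f00) / (f11 + f00 + f01 + f10)
    jaccard = f11 / (f11 + f01 + f10) if (f11 + f01 + f10) else 0.0
    return smc, jaccard

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
smc, j = smc_and_jaccard(x, y)
assert smc == 0.7 and j == 0.0
```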
59 ,
I
Examples (arbitrary vectors):
I Cosine similarity (angle between vectors):
cos(x, y) = x^T y / (||x|| ||y||)   (5)
I (Pearson’s) linear correlation coefficient (goes from -1 to 1)
I (Spearman’s) rank correlation coefficient (also -1 to 1)
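Formula (5) as a small Python function:

```python
import math

# Cosine similarity: cos(x, y) = x.y / (||x|| ||y||), the cosine of the
# angle between the two vectors (common for document term vectors).
def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

assert abs(cosine([1, 0], [0, 1])) < 1e-12              # orthogonal -> 0
assert abs(cosine([1, 2, 3], [2, 4, 6]) - 1.0) < 1e-12  # parallel -> 1
```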
I
Issues to consider:
I Proximity measures ideally application-specific
I Use well-known, established measures for the particular type of objects
I Standardization to avoid arbitrary units affecting results
I Try a number of different possibilities and evaluate the results
60 ,
Course dataset 1
I
Handwritten digits:
http://yann.lecun.com/exdb/mnist/index.html
(Figure 1.2 from Hastie, Tibshirani & Friedman, The Elements of Statistical Learning: examples of handwritten digits from U.S. postal envelopes; grayscale images of single digits segmented from ZIP codes.)
I
Each image is 28 × 28 pixels; each pixel is a gray value between 0 and 255.
61 ,
Course dataset 2
I
Collection of texts:
http://people.csail.mit.edu/jrennie/20Newsgroups/
I Messages from 20 different usenet newsgroups
I Preprocessed into bag-of-words representation
62 ,
Course dataset 3
I
Movielens:
http://movielens.umn.edu/
http://www.grouplens.org/node/73
(Figure: a sparse user–movie rating matrix; users Bill, Jack, Linda, John, and Lucy rate movies Fargo, Seven, Aliens, Leon, and Avatar on a 1–5 scale, with most entries missing.)
63 ,