Data
38 ,
Data: Outline
I
Types of data
I data objects and attributes
I unordered vs ordered data
I
Quality of data
I measurement errors
I missing values, inconsistent values, duplicates
I
Preprocessing
I aggregation, sampling, feature extraction
I discretization and variable transformations
I
Similarity and dissimilarity
I desirable properties
I common measures
I
Summary statistics and visualization
39 ,
Standard data model
A data set can often be viewed as a collection of data objects:
I
A data object (also called a record, data point, data vector, case, sample point, observation, or entity) is described by a set of attributes
I
I
An attribute (also called a variable, characteristic, field, feature, or dimension) describes one aspect of a data object. Example:
Student ID Name Date of Birth Credits Average . . .
2328193 Matti 23.09.1985 123 3.6 . . .
9819234 Tuuli 12.03.1986 98 3.8 . . .
. . . . . . . . . . . . . . . . . .
Here: each row is one data object, each column is an attribute.
40 ,
Attributes: Number of values
I
How many distinct values can an attribute take?
1. Binary attributes: only 2 possible states (e.g. on/off, pass/fail)
2. Discrete attributes with N < ∞ states (e.g. course grade: 0, 1, 2, 3, 4, or 5)
3. Discrete attributes with (countably) infinite states (e.g. counts: 0, 1, 2, 3, . . . )
4. Continuous attributes (uncountably infinite states) (e.g. length or weight ∈ R)
I
Asymmetric attributes
1. Binary sparse attributes (‘on’ infrequent and significant, ‘off’ common and insignificant)
2. Other attributes may also have ‘special’ states that need to be taken into account
41 ,
Attributes: Measurement scales
I
What operations are valid?
1. Distinctness: = and ≠
2. Order: <, ≤, >, ≥
3. Addition and subtraction: + and −
4. Multiplication and division: ∗ and /
I
These define four types of attribute measurement scales:
1. Nominal (categorical) 2. Ordinal
3. Interval 4. Ratio
Each measurement scale allows all the operations with number smaller than or equal to the number of the measurement scale. Example: ‘Interval’ attributes allow all operations except multiplication and division.
42 ,
Attributes: Measurement scales – examples
Attribute Type | Description | Examples | Operations
Nominal | The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠) | zip codes, employee ID numbers, eye color, sex: {male, female} | mode, entropy, contingency correlation, χ² test
Ordinal | The values of an ordinal attribute provide enough information to order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests
Interval | For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, −) | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson’s correlation, t and F tests
Ratio | For ratio variables, both differences and ratios are meaningful. (∗, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current | geometric mean, harmonic mean, percent variation
Attribute Level | Transformation | Comments
Nominal | Any permutation of values | If all employee ID numbers were reassigned, would it make any difference?
Ordinal | An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function. | An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.
Interval | new_value = a ∗ old_value + b, where a and b are constants | Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).
Ratio | new_value = a ∗ old_value | Length can be measured in meters or feet.
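To make the transformation rules concrete, a minimal Python sketch (the function name celsius_to_fahrenheit is ours) showing that an interval-scale transform new_value = a ∗ old_value + b preserves comparisons of differences but not ratios of values:

```python
# Interval scales allow transformations new_value = a * old_value + b.
# Celsius -> Fahrenheit is such a transform: differences remain comparable,
# but ratios of values are not preserved, so "20 C is twice as warm as 10 C"
# is meaningless on an interval scale.

def celsius_to_fahrenheit(c):
    return 9 / 5 * c + 32

a, b = 10.0, 20.0
fa, fb = celsius_to_fahrenheit(a), celsius_to_fahrenheit(b)

# Ratios of differences are preserved (both scales agree on relative gaps):
assert (fb - fa) / (celsius_to_fahrenheit(30.0) - fb) == (b - a) / (30.0 - b)

# Ratios of values are NOT preserved:
print(b / a)    # 2.0 in Celsius
print(fb / fa)  # 1.36 in Fahrenheit
```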
43 ,
General characteristics of data
I
Number of data points vs dimensionality (number of attributes)
1. Most traditional data analysis methods assume (many) more data points than dimensions
2. Many interesting datasets have extremely high dimensions and few data objects
I
Sparsity (relatively few non-zero values)
1. Efficient storage and computation, for some methods 2. May be important for modeling
I
Resolution (spatial, temporal, . . . )
1. How small details can the sensors reliably detect 2. How large datasets can the methods handle
44 ,
Examples – Record data
I
Data ‘matrix’
I
Sparse matrix representation?
© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004
Important characteristics of structured data:
I Dimensionality (curse of dimensionality)
I Sparsity (only presence counts)
I Resolution (patterns depend on the scale)
Record data: a collection of records, each of which consists of a fixed set of attributes. Example:
Tid | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes
Data matrix: If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute.
Such a data set can be represented by an m × n matrix, with m rows, one for each object, and n columns, one for each attribute. Example:
Projection of x Load | Projection of y Load | Distance | Load | Thickness
10.23 | 5.27 | 15.22 | 2.7 | 1.2
12.65 | 6.25 | 16.22 | 2.2 | 1.1
Document data: each document becomes a ‘term’ vector, where
I each term is a component (attribute) of the vector
I the value of each component is the number of times the corresponding term occurs in the document.
 | season | timeout | lost | win | game | score | ball | play | coach | team
Document 1 | 3 | 0 | 5 | 0 | 2 | 6 | 0 | 2 | 0 | 2
Document 2 | 0 | 7 | 0 | 2 | 1 | 0 | 0 | 3 | 0 | 0
Document 3 | 0 | 1 | 0 | 0 | 1 | 2 | 2 | 0 | 3 | 0
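The document-to-term-vector construction can be sketched in Python (vocabulary and term_vector are hypothetical names; real systems also apply tokenization, stemming, and weighting such as tf-idf):

```python
from collections import Counter

# One component per vocabulary term, holding the number of times the
# corresponding term occurs in the document.
vocabulary = ["season", "timeout", "lost", "win", "game", "score",
              "ball", "play", "coach", "team"]

def term_vector(text, vocab=vocabulary):
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]

doc = "coach play ball score game win lost timeout season play game score"
print(term_vector(doc))
```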
Transaction data: a special type of record data, where each record (transaction) involves a set of items. For example, consider a grocery store: the set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
TID | Items
1 | Bread, Coke, Milk
2 | Beer, Bread
3 | Beer, Coke, Diaper, Milk
4 | Beer, Bread, Diaper, Milk
5 | Coke, Diaper, Milk
Graph data examples: a generic graph (numbered nodes with weighted edges) and HTML links, e.g.:
<a href="papers/papers.html#bbbb"> Data Mining </a><li>
<a href="papers/papers.html#aaaa"> Graph Partitioning </a><li>
<a href="papers/papers.html#aaaa"> Parallel Solution of Sparse Linear System of Equations </a><li>
<a href="papers/papers.html#ffff"> N-Body Computation and Dense Linear System Solvers
45 ,
Examples – Graph data
I
Relationships between objects
I World wide web
I Facebook users
I Scientific papers
I
Objects are graphs
Example: chemical data, e.g. the benzene molecule C6H6 (atoms as nodes, bonds as edges)
Example: ordered data, sequences of transactions, where each element of the sequence is a set of items/events
46 ,
Examples – Ordered data
I
Data objects with natural ordering
I Sequential data (e.g. logins, credit card purchases, . . . )
I Sequences (e.g. genome)
I Time-series (e.g. climate science, financial data, . . . )
I Spatial data (e.g. journey planner data, surface/sea temperatures, multi-spectral images)
Ordered data: genomic sequence data, e.g.
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered data: spatio-temporal data, e.g. average monthly temperature of land and ocean (map figure).
47 ,
Data quality
I
Measurement error (‘noise’)
I Continuous data: Gaussian, Student’s t distribution, . . .
I Binary data: Bernoulli, . . .
I
Precision and bias
I Precision: closeness of repeated measurements to each other
I Bias: systematic deviation from the true underlying value
Left: high precision, high bias. Right: low precision, low bias.
48 ,
I
Outliers, i.e. ‘anomalous’ objects
(objects that are very different from the others):
I Noise?
I Data collection error?
I Legitimate, interesting objects?
I
Missing values:
(Some attribute values are missing for some data objects)
I Missing at random? Need to model the process?
I Just eliminating such data objects or attributes?
I Estimating and imputing missing values?
I Ignoring or explicitly taking them into account?
I
Duplicate data, inconsistent data
I
Timeliness, relevance, application-specific prior knowledge
49 ,
Data preprocessing
I
Aggregation
⇒ fewer, less noisy, data points
I Example: monthly analysis of daily (or hourly) data
I Loss of resolution
I
Sampling (with/without replacement, stratified or not)
⇒ fewer data points
I Visualization, applying computationally demanding data analysis procedures
I Loss of precision of data statistics
Types of sampling:
I Simple random sampling: there is an equal probability of selecting any particular item
I Sampling without replacement: as each item is selected, it is removed from the population
I Sampling with replacement: objects are not removed from the population as they are selected for the sample, so the same object can be picked up more than once
I Stratified sampling: split the data into several partitions; then draw random samples from each partition
(Figure: the same point set sampled at 8000 points, 2000 points, and 500 points.)
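A minimal Python sketch of the three sampling schemes, using only the standard library random module (the even/odd split is an artificial example of strata):

```python
import random

random.seed(0)
population = list(range(100))

# Simple random sampling without replacement: each selected item is removed
# from the population, so the sample contains no duplicates.
without = random.sample(population, 10)

# Sampling with replacement: the same object can be picked up more than once.
with_repl = [random.choice(population) for _ in range(10)]

# Stratified sampling: split the data into partitions (here: even/odd),
# then draw a random sample from each partition.
even = [x for x in population if x % 2 == 0]
odd = [x for x in population if x % 2 == 1]
stratified = random.sample(even, 5) + random.sample(odd, 5)
```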
50 ,
The ‘curse of dimensionality’: As the dimensionality grows, the data becomes increasingly sparse in the space (e.g. n dimensions of binary variables have 2^n joint states)
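A quick back-of-the-envelope check of this sparsity: with n binary attributes there are 2^n joint states, but m data objects can occupy at most m of them:

```python
# Fraction of the joint state space that m = 1000 objects can cover,
# as the number of binary attributes n grows.
m = 1000
for n in (5, 10, 20, 30):
    coverage = min(m, 2 ** n) / 2 ** n
    print(n, 2 ** n, f"{coverage:.6f}")
```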
I
Feature subset selection
(selecting only some of the attributes, manually or automatically)
I
Feature extraction
(computing new ‘attributes’ to replace existing ones, e.g.
image color/structure features)
I
Dimensionality reduction
(principal component analysis, other unsupervised methods)
51 ,
I
Discretization of continuous attributes (required for some machine learning algorithms)
I e.g. R is divided into (−∞, x1], (x1, x2], (x2, x3], (x3, ∞)
I Unsupervised or supervised
I Ideally: use application-specific knowledge
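A minimal sketch of unsupervised discretization with fixed cut points (the boundaries here are arbitrary, chosen for illustration); bisect_left maps each value to the index of its half-open interval (x_{i−1}, x_i]:

```python
import bisect

# Cut R into (-inf, x1], (x1, x2], (x2, x3], (x3, inf)
# and map each continuous value to a bin index.
boundaries = [2.0, 4.0, 6.0]  # hypothetical cut points x1, x2, x3

def discretize(value, cuts=boundaries):
    # bisect_left sends a value equal to a cut point into the lower bin,
    # matching the half-open intervals (x_{i-1}, x_i].
    return bisect.bisect_left(cuts, value)

print([discretize(v) for v in [1.5, 2.0, 3.7, 6.1]])
```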
52 ,
I
Binarization
I Some algorithms require binary attributes
I How NOT to represent nominal variables:
Categorical value Integer value x1 x2 x3
Toyota 0 0 0 0
Volkswagen 1 0 0 1
Chevrolet 2 0 1 0
Saab 3 0 1 1
Volvo 4 1 0 0
I How better to represent nominal variables:
Categorical value Integer value x1 x2 x3 x4 x5
Toyota 0 1 0 0 0 0
Volkswagen 1 0 1 0 0 0
Chevrolet 2 0 0 1 0 0
Saab 3 0 0 0 1 0
Volvo 4 0 0 0 0 1
I Useful even when binary variables are not strictly required!
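The better (one-hot) representation can be produced with a few lines of Python (the function name one_hot is ours):

```python
# One-hot ('binarization') encoding of a nominal attribute: one asymmetric
# binary attribute per category, with exactly one of them 'on'.
categories = ["Toyota", "Volkswagen", "Chevrolet", "Saab", "Volvo"]
index = {c: i for i, c in enumerate(categories)}

def one_hot(value):
    vec = [0] * len(categories)
    vec[index[value]] = 1
    return vec

assert one_hot("Chevrolet") == [0, 0, 1, 0, 0]
```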
53 ,
I
Variable transformations
I Simple functions (e.g. log(x) often used for strictly positive variables)
I Ideally: use application-specific knowledge
I Normalization / standardization:
x′ = (x − x̄) / sx ,   (1)
where x̄ is the mean of the attribute values and sx is the standard deviation. This yields x′ with zero mean and a standard deviation of 1.
Useful to ensure that the scale (units) of attribute values does not affect the results.
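Equation (1) in Python, using the sample standard deviation from the statistics module (whether to use the sample or population standard deviation is a convention; the slides do not specify):

```python
from statistics import mean, stdev

# Standardize an attribute: x' = (x - mean) / standard deviation,
# giving zero mean and unit (sample) standard deviation.
def standardize(values):
    m, s = mean(values), stdev(values)
    return [(x - m) / s for x in values]

z = standardize([10.0, 20.0, 30.0, 40.0])
assert abs(mean(z)) < 1e-12
assert abs(stdev(z) - 1.0) < 1e-12
```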
54 ,
Similarity and dissimilarity
I
Many machine learning algorithms use measures of similarity or dissimilarity between data objects
I
Examples:
I Handwritten letters
I Segments of DNA
I Text documents
“Parliament overwhelmingly approved amendments to the Firearms Act on Wednesday. The new law requires physicians to inform authorities of individuals they consider unfit to own guns. It also increases the age for handgun ownership from 18 to 20.”
“Parliament's Committee for Constitutional Law says that persons applying for handgun licences should not be required to join a gun club in the future. Government is proposing that handgun users be part of a gun association for at least two years.”
“The cabinet on Wednesday will be looking at a controversial package of proposed changes in the curriculum in the nation's comprehensive schools.
The most divisive issue is expected to be the question of expanded language teaching.”
ACCTGTCG ATCCTGTG
TCGATTGC
55 ,
I
Similarity s:
I Numerical measure of the degree to which two objects are alike
I Higher for objects that are alike
I Typically between 0 (no similarity) and 1 (completely similar)
I
Dissimilarity d:
I Numerical measure of the degree to which two objects are different
I Higher for objects that are different
I Typically between 0 (no difference) and ∞ (completely different)
I
Transformations
I Converting from one to the other
I
Use similarity or dissimilarity measures?
I Method-specific
56 ,
I
Proximity between attribute values: let p and q be the values of one attribute for two data objects. Standard choices (per attribute type):
Attribute Type | Dissimilarity | Similarity
Nominal | d = 0 if p = q, d = 1 if p ≠ q | s = 1 if p = q, s = 0 if p ≠ q
Ordinal | d = |p − q| / (n − 1), with the values mapped to integers 0 . . . n − 1 | s = 1 − d
Interval or Ratio | d = |p − q| | s = −d, s = 1/(1 + d), . . .
57 ,
Dissimilarities between objects:
I
Minkowski distance for vectors in R^n:
d(x, y) = ( Σ_{k=1}^{n} |x_k − y_k|^r )^{1/r}   (2)
I r = 1: city block (Manhattan); Hamming distance for binary vectors
I r = 2: Euclidean distance
I r = ∞: supremum (max) distance
I
Distance metric: d(x, y) is a metric if it satisfies:
I Positivity: d(x, y) ≥ 0 for all x and y, with d(x, y) = 0 if and only if x = y
I Symmetry: d(x, y) = d(y, x) for all x and y
I Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z) for all x, y, z
I
Most dissimilarity measures satisfy positivity. Depending on the application, it may or may not be important to satisfy symmetry and the triangle inequality.
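A direct Python implementation of the Minkowski distance (2), with the r = ∞ case handled as the maximum coordinate difference:

```python
# Minkowski distance d(x, y) = (sum_k |x_k - y_k|**r) ** (1/r):
# r = 1 gives the city-block distance, r = 2 the Euclidean distance,
# and the limit r -> infinity the supremum (max) distance.
def minkowski(x, y, r):
    if r == float("inf"):
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

x, y = (0.0, 0.0), (3.0, 4.0)
assert minkowski(x, y, 1) == 7.0
assert minkowski(x, y, 2) == 5.0
assert minkowski(x, y, float("inf")) == 4.0
```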
58 ,
Similarities between objects:
I
Typical properties of s(x, y):
I 0 ≤ s(x, y) ≤ 1, with s(x, y) = 1 if and only if x = y
I s(x, y) = s(y, x) for all x and y (symmetry)
I
Examples (binary vectors), where fab is the number of attributes for which x = a and y = b:
I Simple matching coefficient
SMC = (f11 + f00) / (f11 + f00 + f01 + f10)   (3)
I Jaccard coefficient
J = f11 / (f11 + f01 + f10)   (4)
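Formulas (3) and (4) in Python; the example vectors illustrate why Jaccard suits asymmetric binary attributes: the two vectors share no 1s, so J = 0 even though SMC = 0.7 thanks to the many shared 0s:

```python
# SMC and Jaccard for binary vectors; f_ab counts attributes with x=a, y=b.
def smc_and_jaccard(x, y):
    f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    f00 = sum(a == 0 and b == 0 for a, b in zip(x, y))
    f01 = sum(a == 0 and b == 1 for a, b in zip(x, y))
    f10 = sum(a == 1 and b == 0 for a, b in zip(x, y))
    smc = (f11 + f00) / (f11 + f00 + f01 + f10)
    jaccard = f11 / (f11 + f01 + f10) if (f11 + f01 + f10) else 0.0
    return smc, jaccard

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
smc, j = smc_and_jaccard(x, y)
assert smc == 0.7 and j == 0.0
```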
59 ,
I
Examples (arbitrary vectors):
I Cosine similarity (angle between vectors):
cos(x, y) = x^T y / (||x|| ||y||)   (5)
I (Pearson’s) linear correlation coefficient (goes from -1 to 1)
I (Spearman’s) rank correlation coefficient (also -1 to 1)
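Formula (5) as a small Python function:

```python
import math

# Cosine similarity: cos(x, y) = x.y / (||x|| ||y||), the cosine of the
# angle between the two vectors (common for document term vectors).
def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

assert abs(cosine([1, 0], [0, 1])) < 1e-12              # orthogonal -> 0
assert abs(cosine([1, 2, 3], [2, 4, 6]) - 1.0) < 1e-12  # parallel -> 1
```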
I
Issues to consider:
I Proximity measures ideally application-specific
I Use well-known, established measures for the particular type of objects
I Standardization to avoid arbitrary units affecting results
I Try a number of different possibilities and evaluate the results
60 ,
Course dataset 1
I
Handwritten digits:
http://yann.lecun.com/exdb/mnist/index.html
(Figure 1.2 from Hastie, Tibshirani & Friedman, The Elements of Statistical Learning: examples of handwritten digits from U.S. postal envelopes; grayscale images of single digits segmented from ZIP codes.)
I
Each image is 28 × 28 pixels; each pixel is a gray value between 0 and 255.
61 ,
Course dataset 2
I
Collection of texts:
http://people.csail.mit.edu/jrennie/20Newsgroups/
I Messages from 20 different usenet newsgroups
I Preprocessed into bag-of-words representation
62 ,
Course dataset 3
I
Movielens:
http://movielens.umn.edu/
http://www.grouplens.org/node/73
(Figure: a sparse user–movie rating matrix; users Bill, Jack, Linda, John, and Lucy rate movies Fargo, Seven, Aliens, Leon, and Avatar on a 1–5 scale, with most entries missing.)
63 ,