The GUHA Method in Data Mining


(1)

TAM002

Data mining with GUHA – Part 1

Does my data contain something interesting?

Esko Turunen¹

¹Tampere University of Technology

(2)

The aim of data mining is to answer the question 'Does my data contain something interesting?'

In this chapter we introduce the following basic concepts:

• knowledge discovery in databases and data mining

• data, typical data mining tasks and their outputs

• GUHA and B-course: two dissimilar approaches to data mining

We also get acquainted with real-life data collected in Indonesia. We use this data to illustrate issues throughout this course.

These push buttons ♣ open short verbal comments that might be useful.

(3)

♣ Knowledge discovery in databases (KDD) was initially defined as the 'non-trivial extraction of implicit, previously unknown, and potentially useful information from data' [1]. A revised version of this definition states that 'KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data' [2].

According to this definition, data mining is a step in the KDD process concerned with applying computational techniques (i.e., data mining algorithms implemented as computer programs) to actually find patterns in the data.

In a sense, data mining is the central step in the KDD process.

The other steps in the KDD process are concerned with preparing data for data mining, as well as evaluating the discovered patterns, the results of data mining.

(4)

♣ I. Data. The input to a data mining algorithm is most commonly a single flat table comprising a number of fields (columns) and records (rows). In general, each row represents an object and columns represent properties of objects.

II Typical data mining tasks

• One task is to predict the value of one field from the other fields. If the predicted field (the class) is continuous, the task is called regression; if the class is discrete, the task is called classification.

• Clustering is concerned with grouping objects into classes of similar objects. A cluster is a collection of objects that are similar to each other and dissimilar to objects in other clusters.

• Association analysis is the discovery of association rules. Association rules specify correlations between frequent item sets.

• Data characterization sums up the general characteristics or features of the target class of data; this class is typically collected by a database query.

(5)

• Outlier detection is concerned with finding data objects that do not fit the general behavior or model of the data; these are called outliers.

• Evolution analysis describes and models regularities or trends whose behavior changes over time.

III Outputs of data mining procedures can be

• Equations, e.g. TotalSpent = 189.5275 × Age + 7146 [$]

• Predictive rules, e.g. IF Income ≤ 100,000 [$] AND Gender = Male THEN Not a Big Spender

• Association rules, e.g.

{Gender = Female, Age ≥ 52} ⇒ {Big Spender = Yes}

• Probabilistic models, e.g. Bayesian networks

• Distance and similarity measures, decision trees

• Many others

(6)

♣ Our aim is to study in detail a particular data mining method called GUHA; its principle was formulated in a paper by Hájek, Havel and Chytil as early as 1966 [3]. GUHA is the acronym for General Unary Hypotheses Automaton. Its computer implementation, called LISpMiner, is developed at the Prague University of Economics by Jan Rauch and Milan Šimunek.

LISpMiner is freely downloadable from http://lispminer.vse.cz/. The GUHA approach is suitable e.g. for association analysis, classification, clustering and outlier detection tasks.

♣ We start by introducing real-life data which will serve as a benchmark test set during the whole course. To show how GUHA differs from a Bayesian approach, we also take a quick look at B-course, see http://b-course.cs.helsinki.fi/obc/.

(7)

The data we use is Tjen-Sien Lim's publicly available data set from the 1987 National Indonesia Contraceptive Prevalence Survey. It contains the responses from interviews of m = 1473 married women who were not pregnant at the time of the interview.

The challenge is to learn to predict a woman's contraceptive method from knowledge about her demographic and socio-economic characteristics. The 10 survey response variables and their types are

Age                          integer 16–49
Education                    4 categories
Husband's education          4 categories
Number of children borne     integer 0–15
Islamic                      binary (yes/no)
Working                      binary (yes/no)
Husband's occupation         4 categories
Standard of living           4 categories
Good media exposure          binary (yes/no)
Contraceptive method used    3 categories (None, Long-term, Short-term)
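To fix ideas, the sketch below shows one way to load this data set into Python. It is illustrative only: the exact file layout is not specified here, so we assume a comma-separated file without a header row (as in the commonly distributed version of this data set) and use the file name Indonesia1.txt introduced later in this course.

```python
# A minimal sketch, assuming a comma-separated file with no header row;
# the column names below are taken from the table above.
import pandas as pd

columns = [
    "Age", "Education", "HusbandEducation", "Children", "Islamic",
    "Working", "HusbandOccupation", "StandardOfLiving",
    "MediaExposure", "ContraceptiveMethod",
]

df = pd.read_csv("Indonesia1.txt", header=None, names=columns)
print(len(df))                                   # should be m = 1473
print(df["ContraceptiveMethod"].value_counts())  # 3 categories
```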

(8)

http://b-course.cs.helsinki.fi/obc/

(9)

http://b-course.cs.helsinki.fi/obc/

(10)–(20)

[Screenshots: analyzing the Indonesia data with B-course.]

Speculating about causalities

Remember that dependencies are not necessarily causalities. However, the theory of inferred causation makes it possible to speculate about the causalities that have caused the dependencies in the model.

There are two different speculations (called the naive model and the not-so-naive model), which are based on different background assumptions.

(21)

How to read a naive causal model?

Naive causal models are easy to read, but they are built on an assumption that is often unrealistic, namely that there are no latent (unmeasured) variables in the domain that cause the dependencies between variables. A simple example of a situation where this assumption is violated comes from Finland, where the cold winter covers lakes and the sea with ice. Because of that, most drowning accidents happen in summertime. The warm summer also makes people eat much more ice-cream than in wintertime. If you measure both the number of drowning accidents and the ice-cream consumption, but do not include the variable indicating the season, there is a clear dependency between ice-cream consumption and drowning. Evidently this dependency is not causal (ice-cream does not cause drowning or the other way round), but due to the excluded variable summer (technically this is called confounding). Naive causal models are built on the assumption that there is no confounding.
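The confounding effect is easy to reproduce numerically. The following sketch uses made-up rates (all numbers are hypothetical; only the qualitative effect matters): a hidden season variable drives both quantities, producing a marginal dependency that disappears once the season is fixed.

```python
# Toy simulation of the ice-cream/drowning confounding example.
import numpy as np

rng = np.random.default_rng(0)
summer = rng.random(10_000) < 0.5                # hidden season variable
ice_cream = rng.poisson(np.where(summer, 8, 2))  # higher consumption in summer
drownings = rng.poisson(np.where(summer, 5, 1))  # more accidents in summer

# Marginally, the two variables are clearly correlated ...
print(np.corrcoef(ice_cream, drownings)[0, 1])   # noticeably positive
# ... but within a fixed season the dependency vanishes:
print(np.corrcoef(ice_cream[summer], drownings[summer])[0, 1])  # close to 0
```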

In naive causal models there may be two kinds of connections between variables: undirected arcs and directed arcs. Directed arcs denote a causal influence from cause to effect; undirected arcs denote a causal influence whose directionality cannot be automatically inferred from the data.

You can also read a naive causal model as representing the set of dependency models sharing the same directed arcs. Unfortunately, this does not give you the freedom to re-orient the undirected arcs any way you want. You are free to re-orient the undirected arcs as long as re-orienting them does not create new V-structures in the graph. A V-structure is a system of three variables A, B, C such that there is a directed arc from A to B and a directed arc from C to B, but no arc (neither directed nor undirected) between A and C.

(22)

How to read a causal graph produced by B-course?

Causal models are not difficult to read once you learn the difference between the kinds of arcs. There are two kinds of lines in arcs, solid and dashed. Solid lines indicate relations that can be determined from the data. Dashed lines are used when we know that there is a dependency but are not sure about its exact nature. The table below lists the different types of arcs that can be found in causal models.

Solid arc from A to B: A has a direct causal influence on B (direct meaning that the causal influence is not mediated by any other variable included in the study).

Dashed arc from A to B: there are two possibilities, but we do not know which holds; either A is a cause of B, or there is a latent cause for both A and B.

Dashed line without arrowheads between A and B: there is a dependency, but we do not know whether A causes B, B causes A, or a latent cause of both produces the dependency (confounding).

(23)

[1] W. Frawley, G. Piatetsky-Shapiro and C. Matheus: Knowledge Discovery in Databases: An Overview. In: Knowledge Discovery in Databases, eds. G. Piatetsky-Shapiro and W. Frawley (1991) 1–27. Cambridge, Mass.: AAAI Press / The MIT Press.

[2] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy: Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press (1996).

[3] I. Havel, M. Chytil and P. Hájek: The GUHA Method of Automatic Hypotheses Determination. Computing, Vol. 1 (1966) 293–308.

(24)

TAM002

Data mining with GUHA – Part 2: GUHA produces hypotheses

Esko Turunen¹

¹Tampere University of Technology

(25)

GUHA does not test hypotheses; instead, GUHA produces them. In this chapter we learn the outlines of GUHA, e.g. that

• GUHA is suitable for exploratory analysis of large data

• GUHA goes through 2×2 contingency tables and picks out the interesting ones

• in logical terms, GUHA is based on first-order monadic logic whose models are finite

• a fundamental reference, the 'GUHA book', can be downloaded free of charge from www.cs.cas.cz/hajek/guhabook/

(26)

♣ 1. GUHA (General Unary Hypotheses Automaton) is a method of automatic generation of hypotheses based on empirical data, thus a method of data mining.

• GUHA is one of the oldest methods of data mining: it was introduced by Hájek, Havel and Chytil in The GUHA method of automatic hypotheses determination, published in Computing 1 (1966) 293–308.

• GUHA is still being developed: dozens of new features have been added to GUHA during the past ten years.

• GUHA is a kind of automated exploratory data analysis. Instead of testing a given hypothesis against the data, a GUHA procedure systematically generates hypotheses supported by the data. With GUHA it is possible to find all interesting dependencies hidden in the data.

(27)

♣ 2. GUHA is suitable for exploratory analysis of large data.

• The processed data form a rectangular matrix, where each row corresponds to an object belonging to the sample and each column corresponds to one investigated variable. A typical data matrix processed by GUHA has hundreds or thousands of rows and tens of columns.

• Cells of the analyzed data matrix can contain arbitrary symbols; however, before a concrete data mining task the data must be categorized so that cells contain only 0s or 1s (or are empty).

• Exploratory analysis means that there is no single specific hypothesis to be tested against our data; rather, our aim is to get oriented in the domain of investigation and to analyze the behavior of the chosen variables, interactions among them, etc. Such an inquiry is not blind but guided by some general (possibly vague) direction of research (some general problem).

(28)

♣ 3. GUHA systematically creates all hypotheses interesting from the point of view of a given general problem and on the basis of the given data.

• This is the main principle: all interesting hypotheses. Clearly, this contains a dilemma: all means as many as possible, only interesting means not too many. To cope with this dilemma, one may use the different GUHA procedures implemented in the LISp-Miner system and, having selected one, fix its numerous parameters in various ways.

• The LISp-Miner system guides the user and makes the selection of parameters relatively easy, though somewhat laborious. LISp-Miner cannot be automatized as much as e.g. B-course is; however, the results of LISp-Miner are much more illustrative and detailed.

(29)

Three remarks:

• GUHA procedures generate polyfactorial hypotheses, i.e. not only hypotheses relating one variable to another, but hypotheses expressing relations among single variables, pairs, triples, quadruples of variables, etc.

• GUHA offers hypotheses. Its exploratory character implies that the hypotheses produced by the computer (typically tens or hundreds of them) are just supported by the data, not verified. You are expected to use this offer as inspiration, and possibly select a few hypotheses for further testing.

• GUHA is not suitable for testing a single hypothesis: routine statistical packages are good for that.

(30)

♣ 4. The 4ft-Miner procedure generates hypotheses (or observational statements) on associations between complex Boolean formulae (attributes). These formulae are constructed from unary predicates (corresponding to the columns of the processed 0/1 data matrix) by the logical connectives ∧, ∨, ¬ (conjunction, disjunction, negation).

Examples of predicates are

[TEMPERATURE: 38C], [PRESSURE: high], [DIAGNOSIS: infection]

Examples of formulae are

[TEMPERATURE: 38C] ∧ [PRESSURE: high] and [DIAGNOSIS: infection]

An example of a hypothesis is

[TEMPERATURE: 38C] ∧ [PRESSURE: high] ≈ [DIAGNOSIS: infection]

More generally, hypotheses are of the form φ ≈ ψ.

Notice that our terminology differs from the original GUHA approach: we want to keep things as simple as possible!

(31)

Given the 0/1 data matrix, each pair of Boolean attributes φ, ψ determines its four-fold frequency table; the association of φ with ψ is defined by choosing an associational or generalized quantifier, i.e. a function assigning to each four-fold table either 1 (associated, or true in the data) or 0 (not associated, or false in the data) and satisfying some natural monotonicity conditions.

♣ The four-fold table has the form:

         ψ         ¬ψ
 φ       a          b        a+b = r
¬φ       c          d        c+d = s
       a+c = k   b+d = l         m

where a+b+c+d = m and

• a is the number of objects satisfying both φ and ψ,

• b is the number of objects satisfying φ but not ψ,

• c is the number of objects not satisfying φ but satisfying ψ,

• d is the number of objects satisfying neither φ nor ψ.
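Computing the four frequencies from a 0/1 data matrix is straightforward; here is a minimal sketch (not LISp-Miner code, just an illustration of the definition):

```python
# Four-fold table of two Boolean attributes, given as 0/1 columns
# of equal length m.
def four_fold(phi, psi):
    """Return (a, b, c, d) for the Boolean columns phi and psi."""
    a = sum(1 for p, q in zip(phi, psi) if p and q)          # phi and psi
    b = sum(1 for p, q in zip(phi, psi) if p and not q)      # phi, not psi
    c = sum(1 for p, q in zip(phi, psi) if not p and q)      # not phi, psi
    d = sum(1 for p, q in zip(phi, psi) if not p and not q)  # neither
    return a, b, c, d

# four_fold([1, 1, 0, 1], [1, 0, 0, 1]) == (2, 1, 0, 1); here m = 4
```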

(32)

♣ 6. There are various types of generalized quantifiers formalizing various kinds of associations:

• Implicational quantifiers formalize the association many φ are ψ; they do not depend on the values c, d.

• Comparative quantifiers formalize the association φ makes ψ more likely than ¬φ does.

• Some quantifiers just express observations on the data; some others serve as tests of statistical hypotheses on unknown probabilities.

• Some quantifiers are symmetric: φ ≈ ψ implies ψ ≈ φ; some admit negation: φ ≈ ψ implies ¬φ ≈ ¬ψ.

The 4ft-Miner procedure contains dozens of generalized quantifiers, and a novel procedure, Ac4ft-Miner, also offers action quantifiers. An advantage of GUHA is that new quantifiers can be defined and their properties analyzed in a well-established logical framework.

(33)

♣ 7. After preparing the original data matrix into 0/1 format, the user can start the 4ft-Miner procedure. The input of such a procedure consists of

(1) the 0/1 data matrix,

(2) parameters determining the syntactic restrictions on the pairs φ, ψ of Boolean attributes to be generated; φ is called the antecedent and ψ the succedent,

(3) the quantifier to be used and some parameters of the quantifier,

(4) some other determinations; for example, the user has to declare the predicates that can occur in the antecedent and the succedent, the use of logical connectives, the minimal and maximal lengths of the antecedent and succedent, the way missing data are processed, etc.
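To make these four inputs concrete, here is a purely hypothetical summary of such a task as a data structure. This is not LISp-Miner's actual file format or API (LISp-Miner is configured through its GUI and stores tasks in its metabase); all field names below are invented for illustration.

```python
# Illustrative only: a hypothetical 4ft-Miner task description.
task = {
    "data_matrix": "Indonesia1",           # (1) the 0/1 data matrix
    "antecedent": {                        # (2) syntactic restrictions
        "predicates": ["Age", "Education", "Children"],
        "min_length": 1,
        "max_length": 3,
    },
    "succedent": {
        "predicates": ["ContraceptiveMethod"],
        "min_length": 1,
        "max_length": 1,
    },
    "quantifier": {                        # (3) quantifier and parameters
        "name": "founded_implication", "p": 0.9, "base": 74,
    },
    "missing_data": "ignore_row",          # (4) other determinations
}
```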

(34)

8. 4ft-Miner produces all associations φ ≈ ψ satisfying the syntactic restrictions and true in the data. The generation is not done blindly but uses various techniques to avoid exhaustive search. The found associations, together with various parameters, are not mechanically printed but saved in a solution file for further processing.

9. The 4ft-Result module for the interpretation of results enables the user to browse the found associations, sort them according to various criteria, select reasonably defined subsets and output concise information of various kinds.

There are, however, other procedures implemented in the LISp-Miner system that mine for other kinds of patterns, even more complex ones than association rules.

(35)

10. The GUHA method has deep logical and statistical foundations. GUHA is being further developed at the Institute of Computer Science of the Academy of Sciences of the Czech Republic (Petr Hájek and his group), at the Prague University of Economics (Jan Rauch and his group) and at Tampere University of Technology (Esko Turunen and his group).

(36)

Example

Assume we are observing children who have an allergic reaction to, say, tomato, apple, orange, cheese or milk. These observations are presented in the following table:

Child Tomato Apple Orange Cheese Milk

Anna 1 1 0 1 1

Aina 1 1 1 0 0

Naima 1 1 1 1 1

Rauha 0 1 1 0 1

Kai 0 1 0 1 1

Kille 1 1 0 0 1

Lempi 0 1 1 1 1

Ville 1 0 0 0 0

Ulle 1 1 0 1 1

Dulle 1 0 1 0 0

Dof 1 0 1 0 1

Kinge 0 1 1 0 1

Laade 0 1 0 1 1

Koff 1 1 0 0 1

Olvi 0 1 1 1 1

(37)

Thus, we have observations such as Child x is allergic to milk and Child y is allergic to cheese. We write, more briefly, Milk(x) and Cheese(y).

• Milk(-), Cheese(-), Tomato(-), Orange(-) and Apple(-) are unary predicates of our observational language, and x, y, z, ... are variables.

• Expressions like Milk(x) or Cheese(y) are atomic (open) formulae. Formulae are combined by the logical connectives ¬ (not), ∧ (and), ∨ (or). E.g. Milk(x) ∧ ¬Cheese(x) would mean Child x is allergic to milk and is not allergic to cheese.

• However, instead of open formulae, we are more interested in closed formulae, e.g. All children are allergic to milk, Most children are not allergic to orange, There is a child allergic to tomato, In most cases if a child is allergic to milk then he/she is allergic to cheese, too.

(38)

Everyone who has passed a basic course in logic knows that statements like All children are allergic to milk and There is a child allergic to tomato are expressible as ∀x Milk(x) and ∃x Tomato(x).

Unfortunately, the classical mathematical quantifiers ∀ and ∃ are rather useless in the real world. Much more valuable are generalized quantifiers like In most cases, e.g. in the statement In most cases if a child is allergic to milk then he/she is allergic to cheese, too.

The GUHA method is a logical formalism of generalized quantifiers for data mining purposes.

(39)

TAM002

Data mining with GUHA – Part 3: GUHA is a logical approach to data mining

Esko Turunen¹

¹Tampere University of Technology

(40)

Theoretical basis of GUHA

In this chapter we learn, e.g.,

• what an observational predicate language is

• what it means that a statement is true in the data

• how to transform a data matrix into a Boolean data matrix

• that Boolean data matrices are models in which statements are true or false

• that GUHA theory rests on a firm foundation that allows us to introduce new quantifiers and study their properties.

This chapter also contains some exercises.

(41)

♣ GUHA is based on a first-order logic formalism. We start by setting

Definition (A simplified GUHA language)

An observational predicate language Ln consists of

unary predicates P1, ..., Pn, and an infinite sequence x1, ..., xm, ... of variables,

logical connectives ¬ (negation) and ∧ (conjunction),

non-standard or generalized binary quantifiers Q1, ..., Qk, usually denoted by ∼, ≈, ≡, etc., with some subscripts and superscripts.

Given an observational predicate language Ln, the atomic formulae are the symbols P(x), where P is a predicate and x is a variable. Atomic formulae are formulae, and if φ, ψ are formulae, then ¬φ, φ ∧ ψ and φ(x) ≈ ψ(x) are formulae.

(42)

∀ and ∃, the classical quantifiers, are included in the original definition of the GUHA language, as well as the truth constants ⊥, ⊤. As in classical logic, the logical connectives ∨ (disjunction), ⇒ (implication) and ⇔ (equivalence) are definable by ¬ and ∧. However, not all of this is implemented in LISpMiner, so we mostly omit it.

Free and bound variables are defined as in classical predicate logic:

• in P(x) and ∀y[P(y) ∧ R(x)], x is a free but y is a bound variable,

• in φ(x) ≈ ψ(x), x is a bound variable.

Formulae containing free variables are open formulae (not of much interest!); closed formulae do not contain free variables. Closed formulae are also called sentences.

Formulae of the form φ(x) ≈ ψ(x), containing only the variable x, are in the scope of the GUHA method.

(43)

Exercises

1. Express in the GUHA language the natural-language statements: (a) Most x that are red are round, too. (b) Almost all x that are red are round, and vice versa.

2. Are (a) Qx(φ(x) ⇔ ψ(x)), where Q is a generalized quantifier, and (b) ∀x(φ(x) ⇔ ψ(x)) well-formed observational formulae?

3. We write φ(x) ≈ ψ(x) for a generalized quantifier ≈ and φ(x) ∧ ψ(x) for the logical connective ∧. Discuss the difference between ≈ and ∧.

(44)

(45)
(46)
(47)
(48)
(49)
(50)
(51)

Models in GUHA logic are data matrices

Given an observational language Ln, consider all m×n matrices composed of 0s and 1s; this is the set M of models M of Ln. Thus, in a model M, rows correspond to variables and columns correspond to predicates. Fix such a model M, for example the 4×5 matrix:

Object P1 P2 P3 P4 P5

A 1 1 0 0 0

B 0 0 0 1 0

C 1 1 0 1 1

D 1 1 0 0 0

Associate to each cell aij, i = 1, ..., m (row), j = 1, ..., n (column), a value True/False such that

v(Pj(xi)) = True if aij = 1, and False if aij = 0.

For example, v(P1(B)) = False, while v(P4(B)) = True.

(52)

We have now defined truth values v(P(x)) ∈ {TRUE, FALSE} of all atomic formulae P(x). Next we extend truth values to all formulae, that is, to the set VAL of all valuation functions v : M × Ln → {TRUE, FALSE}.

Let v(φ) and v(ψ) be defined. Like in classical logic we set:

v(φ)    v(ψ)    v(¬φ)   v(φ∧ψ)
TRUE    TRUE    FALSE   TRUE
TRUE    FALSE   FALSE   FALSE
FALSE   TRUE    TRUE    FALSE
FALSE   FALSE   TRUE    FALSE

Exercises

4. Write the truth tables of the connectives ∨, ⇒ and ⇔.

5. v(∃x φ(x)) = TRUE in a model M iff v(φ(x)) = TRUE for some x (corresponding to a row of M!). How is v(∀x φ(x)) defined?

(53)

Truth values of formulae with non-classical quantifiers

Given a model M, the value v(φ) of any formula φ of the language Ln can be calculated immediately. In some cases we consider several models M, N; to avoid confusion, we then write vM, vN.

To define the truth value of formulae with non-classical quantifiers, recall the four-fold table:

        ψ    ¬ψ
 φ      a     b
¬φ      c     d

where a+b+c+d = m.

♣ Quantifier of simple association φ(x) ∼ ψ(x) [read: coincidence of φ(x) and ψ(x) predominates over difference]. Given a model M, we define

v(φ(x) ∼ ψ(x)) = TRUE iff ad > bc.

(54)

This quantifier is known in most data mining frameworks, not only in GUHA. In LISpMiner it is implemented under the name Simple deviation; the truth definition is more general, ad > e^δ · bc, where the parameter δ ≥ 0 can be chosen by the user.

♣ Quantifiers of founded p-implication, or basic implication, ⇒p,Base, where Base ∈ N, 0 < p ≤ 1, p rational: φ(x) ⇒p,Base ψ(x) [read: φ(x) implies ψ(x) with confidence p and support Base]. Given a model M, we define v(φ(x) ⇒p,Base ψ(x)) = TRUE iff

a/(a+b) ≥ p and a ≥ Base.

Example

The Allergy matrix induces the four-fold frequency table

M         Apple   ¬Apple
Tomato      6       3
¬Tomato     6       0

v(Tomato ∼ Apple) = FALSE, v(Tomato ⇒0.6,5 Apple) = TRUE.
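These two definitions translate directly into code. The sketch below re-checks the example above from the four frequencies a = 6, b = 3, c = 6, d = 0 of the Tomato/Apple pair:

```python
# Evaluating the two quantifiers on a four-fold table (a, b, c, d).
def simple_association(a, b, c, d):
    return a * d > b * c

def founded_implication(a, b, p, base):
    # a / (a + b) >= p, written without division to avoid a + b == 0
    return a >= base and a >= p * (a + b)

a, b, c, d = 6, 3, 6, 0
print(simple_association(a, b, c, d))     # False: 6*0 > 3*6 fails
print(founded_implication(a, b, 0.6, 5))  # True: 6/9 ≈ 0.67 >= 0.6 and 6 >= 5
```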

(55)

Some theoretical observations about GUHA

Definition

Given an observational language Ln, the set M of all models M, the set VAL of all valuations v and the set V = {TRUE, FALSE} of truth values, the system ⟨Ln, M, VAL, V⟩ is an observational semantic system.

We say that a sentence φ ∈ Ln is a tautology (denoted |= φ) if v(φ) = TRUE for all valuations v ∈ VAL (thus, in all models M).

Moreover, φ ∈ Ln is a logical consequence of a finite set A of sentences of Ln if, whenever v(ψ) = TRUE for all ψ ∈ A, then v(φ) = TRUE, too.

In theoretical considerations we are interested in tautologies (true in all models), while in practical GUHA research we consider only the one model M induced by a given data matrix.

(56)

Definition

An observational semantic system ⟨Ln, M, VAL, V⟩ is axiomatizable if the set of all tautologies is recursively enumerable.

We can prove

Theorem

For any natural number n, the semantic system ⟨Ln, M, VAL, V⟩ is axiomatizable.

Remark

Axiomatizable means that there is a finite set of schemas of tautologies called axioms and a finite set of rules of inference such that all tautologies (and only they) can be deduced from the axioms by means of the rules of inference, i.e. all tautologies φ have a proof (denoted ⊢ φ). Thus, axiomatizability means: |= φ iff ⊢ φ.

(57)

For example, if φ and φ ⇒ ψ are axioms (or if they have a proof), then one can infer ψ by means of a rule of inference called Modus Ponens. Thus, ψ has a proof, too.

Here we are not interested in the general axiom system of GUHA. However, we will need rules of inference to understand how LISpMiner decreases the number of outcomes of practical GUHA procedures.

We give the definition of a (sound) rule of inference. It is of the form

φ1, ..., φn
-----------
ψ

such that whenever v(φi) = TRUE for all i = 1, ..., n, then v(ψ) = TRUE, too. Here φ1, ..., φn are the premises and ψ is the conclusion.

Rules of inference are also called deduction rules.

(58)

Exercises

6. Prove or disprove:

(a) |= [φ ∨ (ψ ∧ χ)] ⇔ [(φ ∧ ψ) ∨ (φ ∧ χ)], (b) |= (φ ∧ ψ) ⇔ [¬(¬φ ∨ ¬ψ)] and (c) |= φ ⇒ (ψ ⇒ φ),

where φ, ψ and χ are well-formed formulae.

7. Is

(a) φ ∧ φ a logical consequence of the set {ψ, φ}, (b) φ a logical consequence of the set {ψ ∨ φ, ψ ∧ φ}, (c) φ a logical consequence of the set {¬ψ ∧ φ, ¬φ ∨ ψ} of well-formed formulae?

8. Prove that Modus Ponens is a sound rule of inference.

(59)

Exercises

In exercises 9–11, consider the Allergy matrix.

9. Write down all sentences Pi(x) ⇒0.7,7 Pj(x), i ≠ j, such that v(Pi(x) ⇒0.7,7 Pj(x)) = TRUE.

10. Does v(Apple(x) ∼ Orange(x)) = TRUE hold, where ∼ is simple association?

11. Is v(Tomato(x) ∼ ¬Cheese(x)) = TRUE (∼ is simple association)?

(60)

TAM002

Data mining with GUHA – Part 4: More about the foundations of GUHA

Esko Turunen¹

¹Tampere University of Technology

(61)

Theoretical foundations of GUHA – continuation

We study desirable properties that the non-standard quantifiers in GUHA should have, e.g.

• if the data contain evidence for an interdependence of φ and ψ, i.e. φ ≈ ψ is true in this data, then φ ≈ ψ should remain true in all other data where there is even more evidence. Such quantifiers are called associational or implicational.

• the truth value should really depend on the values in the four-fold table; altering these values should make φ ≈ ψ either true or false.

• We also introduce sound rules of inference to minimize computations in practical data mining tasks. This chapter contains exercises.

(62)

Definition (Associational quantifiers)

Let φ(x), ψ(x) be two fixed formulae of a language Ln such that x is the only free variable in both of them and they have no common predicates. Let M and N be two models. Then we have the following two four-fold tables:

M    ψ    ¬ψ          N    ψ    ¬ψ
φ    a1    b1         φ    a2    b2
¬φ   c1    d1         ¬φ   c2    d2

We define: N is associationally better than M if a2 ≥ a1, b2 ≤ b1, c2 ≤ c1 and d2 ≥ d1. Moreover, a binary quantifier ≈ is associational if, for all formulae φ(x), ψ(x) and all models M, N: if vM(φ(x) ≈ ψ(x)) = TRUE and N is associationally better than M, then vN(φ(x) ≈ ψ(x)) = TRUE.

(63)

Obviously, the quantifier of simple association is associational: this follows from the fact that, under the given circumstances, a2d2 ≥ a1d1 > b1c1 ≥ b2c2.

Quantifiers of basic implication are also associational: if a2 ≥ a1 ≥ Base and b2 ≤ b1, then a2b1 ≥ a1b2, therefore a1a2 + a2b1 ≥ a1a2 + a1b2, i.e. a2(a1+b1) ≥ a1(a2+b2), and finally

a2/(a2+b2) ≥ a1/(a1+b1) ≥ p.

Instead of merely being associational, a non-classical quantifier may satisfy a stronger condition. We define:

(64)

Definition (Implicational quantifiers)

Let φ(x), ψ(x) be two fixed formulae of a language Ln such that x is the only free variable in both of them and they have no common predicates. Let M and N be two models. Then we have the following two four-fold tables:

M    ψ    ¬ψ          N    ψ    ¬ψ
φ    a1    b1         φ    a2    b2
¬φ   c1    d1         ¬φ   c2    d2

We define: N is implicationally better than M if a2 ≥ a1 and b2 ≤ b1. Moreover, a binary quantifier ≈ is implicational if, for all formulae φ(x), ψ(x) and all models M, N: if vM(φ(x) ≈ ψ(x)) = TRUE and N is implicationally better than M, then vN(φ(x) ≈ ψ(x)) = TRUE.

(65)

Basic implication quantifiers are implicational; however, simple association quantifiers are not implicational. To see this, consider the following two frequency tables:

M    ψ      ¬ψ          N    ψ      ¬ψ
φ    a1=1   b1=1        φ    a2=1   b2=1
¬φ   c1=1   d1=2        ¬φ   c2=2   d2=1

Clearly N is implicationally better than M, and a1d1 > b1c1, thus vM(φ(x) ≈ ψ(x)) = TRUE. However, a2d2 < b2c2, thus vN(φ(x) ≈ ψ(x)) = FALSE.

Remark

Let ≈ be an implicational quantifier. Then ≈ is associational.

Proof.

Let ≈ be implicational and vM(φ(x) ≈ ψ(x)) = TRUE. If N is associationally better than M, then N is clearly also implicationally better than M, so vN(φ(x) ≈ ψ(x)) = TRUE. Therefore ≈ is associational, too.

(66)

Theorem

Let φ(x), ψ(x), χ(x) be formulae, and let ≈ be an implicational quantifier. Then

φ ≈ ψ
------------
φ ≈ [ψ ∨ χ]

is a sound rule of inference.

Proof.

Let vM(φ(x) ≈ ψ(x)) = TRUE, with the four-fold tables

M    ψ    ¬ψ          M    ψ∨χ   ¬(ψ∨χ)
φ    a1    b1         φ    a2     b2
¬φ   c1    d1         ¬φ   c2     d2

We reason that

a1 = #{x | vM(ψ(x)) = vM(φ(x)) = TRUE} ≤ #{x | vM(ψ(x) ∨ χ(x)) = vM(φ(x)) = TRUE} = a2

and

b1 = #{x | vM(¬ψ(x)) = vM(φ(x)) = TRUE} ≥ #{x | vM(¬ψ(x) ∧ ¬χ(x)) = vM(φ(x)) = TRUE} = #{x | vM(¬(ψ(x) ∨ χ(x))) = vM(φ(x)) = TRUE} = b2.

Obviously, since ≈ is implicational, vM(φ ≈ [ψ ∨ χ]) = TRUE.

(67)

Theorem (Reduction Theorem 1)

Let φ(x), ψ(x), χ(x) be formulae, and let ≈ be an implicational quantifier. Then

[φ ∧ ¬χ] ≈ ψ
------------
φ ≈ [χ ∨ ψ]

is a sound rule of inference.

Theorem (Reduction Theorem 2)

Let φ(x), ψ(x), χ(x) be formulae, and let ≈ be the simple association quantifier. Then

φ ≈ ψ              φ ≈ ψ
------    and    ---------
ψ ≈ φ             ¬φ ≈ ¬ψ

are sound rules of inference.

We have introduced sound rules of inference mainly to minimize computations in practical data mining tasks. For example, if ≈ is an implicational quantifier and φ ≈ ψ is true in a given model M, then φ ≈ [ψ ∨ χ] is true, too. Due to the solid logical foundations of GUHA, there are several means to reduce the amount of computation. However, we do not study them here. ♣

(68)

We introduce

• Basic equivalence quantifiers ≡p, where 0 < p ≤ 1. In any model M, v(φ(x) ≡p ψ(x)) = TRUE iff a+d ≥ p(a+b+c+d), except for the case a+d = 0, b+c ≠ 0; then v(φ(x) ≡p ψ(x)) = FALSE.

• Basic or Σ-double implication quantifiers ↔p, where 0 < p ≤ 1. In any model M, v(φ(x) ↔p ψ(x)) = TRUE iff a ≥ p(a+b+c), except for the case a+b+c = 0, d ≠ 0; then v(φ(x) ↔p ψ(x)) = FALSE.
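For concreteness, here are the two definitions as code (a sketch over a four-fold table a, b, c, d, with the exceptional cases handled as stated above):

```python
def basic_equivalence(a, b, c, d, p):
    # TRUE iff (a + d) >= p * m, with the stated exceptional case
    if a + d == 0 and b + c != 0:
        return False
    return a + d >= p * (a + b + c + d)

def sigma_double_implication(a, b, c, d, p):
    # TRUE iff a >= p * (a + b + c), with the stated exceptional case
    if a + b + c == 0 and d != 0:
        return False
    return a >= p * (a + b + c)
```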

Exercises

12. Prove Reduction Theorem 1.

13. Prove Reduction Theorem 2.

14. Prove that Reduction Theorem 2 does not hold for basic implication quantifiers.

15. Prove that basic equivalence quantifiers are associational.

16. Prove that Σ-double implication quantifiers are associational.

(69)

It is unquestionable that any valuable quantifier should be (at least) implicational or associational. However, there are other obvious conditions, too. We say that an implicational quantifier ≈ is a-dependent if there are models M and N whose four-fold tables

M    ψ    ¬ψ          N    ψ    ¬ψ
φ    a     b          φ    a′    b
¬φ   c     d          ¬φ   c     d

differ only in the value a (a ≠ a′) and vM(φ(x) ≈ ψ(x)) ≠ vN(φ(x) ≈ ψ(x)). Similarly we define a b-dependent implicational quantifier ≈. If an a-dependent and b-dependent quantifier ≈ satisfies the condition

a = b = 0 implies v(φ(x) ≈ ψ(x)) = FALSE,

then ≈ is called interesting.

Exercise 17

Show that basic implication quantifiers are interesting. What …

(70)

We can also define (a+d)-dependent, (b+c)-dependent, etc. implicational quantifiers. For example, Σ-double implication quantifiers ↔p are a-dependent. Indeed, let 0 < p ≤ 1 and a < a′. Then

a/(a+b+c) < a′/(a′+b+c) iff a(a′+b+c) < a′(a+b+c) iff a < a′ (provided b+c > 0).

Thus, there are models M and N whose frequency tables differ only with respect to the value a and for which vN(φ(x) ↔p ψ(x)) = FALSE while vM(φ(x) ↔p ψ(x)) = TRUE.

In a similar manner we prove that Σ-double implication quantifiers ↔p are (b+c)-dependent, too.

We continue by having a closer look at the quantifiers implemented in LISpMiner and seeing what kinds of tasks can be performed. We use the National Indonesia Contraceptive Prevalence Survey data as a benchmark test set. However, we often also examine which of the above theoretical properties each individual quantifier meets.

(71)

TAM002

Data mining with GUHA – Part 5: Introduction to the LISpMiner software

Esko Turunen¹

¹Tampere University of Technology

(72)

Instructions for downloading LISpMiner

• In this chapter we show how to start working with LISpMiner. We also introduce the first GUHA task, where we use the founded implication quantifier.

• To download LISpMiner, it is highly recommended that you print these slides and follow the instructions step by step. The situation is even better if you have two computers and two terminals available.

• LISpMiner is not a commercial product and you might find some parts somewhat cumbersome. There are frequent updates, so the layout might not look today precisely as it did when these slides were completed – patience!

• LISpMiner is built on Microsoft products, so you need Microsoft Access 2002 or later (.mdb files).

• You might also like to look at http://lispminer.vse.cz/, the homepage of LISpMiner.

(73)

LISpMiner

• is academic software based on the GUHA method,

• is developed and maintained at the Prague University of Economics by Jan Rauch and Milan Šimunek,

• can be downloaded free of charge from http://lispminer.vse.cz/.

We show step by step how to install LISpMiner on your own computer. Here we have LISpMiner version 12.01.00 from February 24, 2010. ♣

Step 1. Create a new folder (e.g. at C:\) and name it e.g. LispMiner2010.

Step 2. Go to the web page http://lispminer.vse.cz/, click Download and download the following file into your folder LispMiner2010. (Later we will download more files during this course – there are also some more instructions there.)

(74)

(75)

Your C:\LispMiner2010 should look like this now.

Add a new folder there and name it e.g. My Data – you will put your data files there!

(76)

Then go to the address http://b-course.cs.helsinki.fi/obc/readymade.html -> (under Contraceptive methods ->) The data set itself (a .txt file) to download the Indonesia data. Put it into My Data.

(77)

Name it Indonesia1.txt.

Delete the empty spaces from the names!

(78)

It should look like this now.

Next we create a Microsoft Access database through the following steps:

(79)

Take a copy of LMEmpty, rename it LM_Indonesia1.mdb and save it to My Data.

(80)

Your My Data folder should now look like this. Next we create one more data matrix there. It will be called Indonesia1.mdb. Do the following steps:

(81)–(94)

[Screenshots: creating the Indonesia1.mdb data matrix step by step.]

Your data file Indonesia1.mdb should now look like this. Close it and go to LispMiner2010

(95)

Open LMAdmin.exe

(96)
(97)

Put Indonesia1.mdb here ...

... and LM_Indonesia1.mdb here.

(98)

You should now have this on your screen!

Close it and go again to LispMiner2010

(99)

Now open LMDataSource.exe

(100)
(101)

Open from Database -> Data Matrices

(102)
(103)

(104)

Open from Database -> Attributes Tree

(105)
(106)
(107)
(108)

(109)
(110)

Here is the distribution of ages

(111)
(112)

In this way we go through the whole list of Attributes.

Depending on how many categories we divide each attribute into, we get the total number of Predicates (= columns composed of 0/1/empty cells in the final data matrix) of our logical language. We should always keep their number as small as possible.

(113)

We are now ready to start the simplest data mining tasks. Open 4ftTask.exe

(114)
(115)
(116)

We are interested in knowing which Attributes imply the Contraceptive Method, so we name the task accordingly.

(117)

Open the Succedent part: click it!

(118)
(119)
(120)

Since there are only 3 Predicates (No use, Short Term and Long Term), it is natural to take 'subset' with min and max length = 1.

(121)
(122)

Next open the Antecedent part: click here. There are many more choices.

(123)

(124)
(125)

Select all but Contraceptive

(126)

Age is typically an interval, say 16–18, 17–19, ..., 46–48; 16–19, 17–20, ..., 45–48; or 16–20, 17–21, ..., 44–48.
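Such families of sliding intervals are easy to enumerate; here is a small sketch of what this categorization produces for a given window length:

```python
# All age intervals of a fixed length over the range 16-49.
def sliding_intervals(lo, hi, length):
    return [(start, start + length - 1)
            for start in range(lo, hi - length + 2)]

print(sliding_intervals(16, 49, 3)[:3])  # [(16, 18), (17, 19), (18, 20)]
```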

(127)
(128)

All the other attributes contain only 2 or 4 Categories (= Predicates), so let them be just 1–1 subsets.

With this division there are 34+15+4+4+4+2+4+2+2 = 71 columns in the data to be mined.

(129)

Choosing the quantifier:

Double-click BASE and choose = 5%.

Double-click FUI (founded implication) and choose p = 0.900. Push 'Generate' and see what happens…
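Numerically, these two settings mean the following for the Indonesia data (a small sanity check; the exact rounding used by LISpMiner may differ):

```python
# With m = 1473 rows, BASE = 5% and p = 0.9:
m = 1473
min_support = 0.05 * m   # ≈ 73.65, i.e. a >= 74 rows
print(min_support)
# A hypothesis phi => psi is reported iff a >= 74 and a / (a + b) >= 0.9.
```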

(130)
(131)
(132)

Let us see the shortest hypotheses in detail – click here!

(133)

There are 95 + 2 = 97 women who have no children …

… 95 of them use no contraception
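Indeed, this hypothesis satisfies the chosen quantifier comfortably: its confidence is 95/97 ≈ 0.98 ≥ 0.9, and its support of 95 rows is well above the 5% base (≈ 74 of the 1473 rows).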
