Data mining with GUHA – Part 3 GUHA is a logic approach to data mining

Esko Turunen¹

1Tampere University of Technology

Theoretical basis of GUHA In this chapter we learn e.g.

•what is an observational predicate language

•what it means that some statement istruein a data

•how to transfer a data matrix to aBooleandata matrix

•that Boolean data matrices aremodelswhere statement is trueorfalse

•that GUHA theory rests on a firm foundation that allows to introduce new quantifiers and study their properties. This chapter also contains some exercises.

♣GUHA is based on a first order logic formalism. We start be setting

Definition (A simplified GUHA language)

Aobservational predicate languageL_nconsists of

unary predicatesP₁,· · ·,Pn, and an infinite sequence x₁,· · · ,x_m, . . . ofvariables,

logical connectives¬(negation) and∧(conjunction), non–standardor generalized binary quantifiers Q₁,· · ·,Q_k, however, usually denoted by∼,≈,≡, etc with some

subscripts and superscripts.

Given an observational predicate languageL_ntheatomic formulaeare the symbols P(x), whereP is a predicate andx is a variable. Atomic formulae areformulaeand ifφ,ψare

formulae, then¬φ,φ∧ψandφ(x)≈ψ(x)are formulae.

∀,∃, theclassical quantifiers, are included in the original definition of the GUHA language as well astruth constants⊥,

>. Like in classical logic, the logical connectives∨(disjunction),

⇒(implication) and⇔(equivalence) are definable by¬and∧.

However, not all this staff is implemented into LISpMiner, so we mostly omit it.

Freeandbound variablesare defined as in classical predicate logic:

•inP(x)and∀y[P(y)∧R(x)]x is a free buty is a bound variable,

•inφ(x)≈ψ(x)x is a bound variable.

Formulae containing free variables areopen formulae(not much of interest!),closed formulaedo not contain free variables. Closed formulae are also calledsentences.

Formulae containing only the variablex and of a form φ(x)≈ψ(x)are in the scope of GUHA method.

Exercises

1. Write by GUHA language the natural language expression (a)Mostx that are red are round, too. (b)Almost allx that are red are round and vice versa.

2. (a)IsQx(φ(x)⇔ψ(x)), whereQis a generalized quantifier, and(b)∀x(φ(x)⇔ψ(x))awell–formed observational formula?

3. We writeφ(x)≈ψ(x)for ageneralized quantifier≈and φ(x)∧ψ(x)for thelogical connective∧. Discuss the difference between≈and∧.

♣

Models in GUHA logic are data matrices

Given an observational languageL_n, consider allm×n– matrices composed of0.sand1.s; it is the setMofmodelsM ofL_n. Thus, in a model M, rows correspond to variables and columns correspond to predicates. Fix such a modelM. For example, anM4×5–matrix:

Object P₁ P₂ P₃ P₄ P5 (column) avalueTrue/Falsesuch that

v(P_j(x_i)) =

True if(a_ij) =1

False if(a_ij) =0

For example,v(P (B)) =False, whilev(P (B)) =True.

We have now defined truth valuesv(P(x))∈ {TRUE,FALSE}of all atomic formulaeP(x). Next we extend truth values to all formulae, that is, the setVALof allvaluation functions v :M ×Ln7→ {TRUE, FALSE}.

Letv(φ)andv(ψ)be defined. Like in classical logic we set:

v(φ) v(ψ) v(¬φ) v(φ∧ψ)

4. Write the truth tables of the connectives∨,⇒and⇔.

5.v(∃xφ(x)) =TRUEin a modelM iffv(φ(x)) =TRUEfor some x (corresponding to a row ofM!). How isv(∀xφ(x))defined?

Truth value of formulae with non–classical quantifiers Given a modelM, the valuev(φ)of any formulaeφof the languageLn can be calculated immediately. In some cases we consider several modelsM,N. Thus, to avoid confusion, we writev_M,v_N.

To define the truth value of formulae with non–classical quantifiers, recall the four–fold table:

ψ ¬ψ

φ a b

¬φ c d

wherea+b+c+d =m.

♣Quantifier of simple associationφ(x)∼ψ(x)[read:

coincidence ofφ(x)andψ(x)predominates over difference].

Given a modelM, we define

v(φ(x)∼ψ(x)) = iffad >bc.

This quantifier is known in most data mining frameworks, not only in GUHA. In LISpMiner it is implemented under a name Simple deviation: the truth definition is more general

ad >e^δbc, where the parameterδ≥0 can be chosen by the user.

♣Quantifiers of founded p–implicationsorbasic implications

⇒_p,_Base, whereBase∈N, 0<p ≤1,prational: φ(x)⇒_p,nψ(x) [read:φ(x)impliesψ(x)with confidencepand supportBase].

Given a modelM, we definev(φ(x)⇒_p,_Baseψ(x)) =TRUEiff

a+b ≥p anda≥Base. Example

The Allergy matrix induces a four-fold frequency table

M Apple ¬Apple

Tomato 6 3

¬Tomato 6 0

v(Tomato∼Apple) =FALSE,v(Tomato⇒_0.6,5Apple) =TRUE.

Some theoretical observations about GUHA Definition

Given an observational language Ln, the setMof all models M, the set VAL of all valuations v and the set

V ={TRUE,FALSE}of truth values, a systemhL_n,M,VAL,Vi is anobservational semantic system.

We say that a sentenceφ∈Lnis atautology(note|=φ) if v(φ) =TRUEfor all valuationsv ∈VAL(thus, in all modelsM).

Moreover,φ∈L_nis alogical consequenceof a finite setAof sentences ofLnif, wheneverv(ψ) =TRUEfor allψ∈A, then v(φ) =TRUE, too.

In theoretical considerations we are interested in tautologies (true in all models) while in practical GUHA–research we consider only one modelM induced by a given data matrix.

Definition

An observational semantic systemhL_n,M,VAL,Viis axiomatizableif the set of all tautologies is recursively denumerable.

We can prove

Theorem

For any natural number n, the semantic systemhL_n,M,VAL,Vi is axiomatizable.

Remark

Axiomatizable means that there is a finite set of schemas of tautologies calledaxiomsand a finite set ofrules of inference such that all tautologies (and only them) can bereducedfrom axioms by means of rules of inference, i.e. that all tautologiesφ have a proof (noted by`φ). Thus, axiomatizability means: |=φ iff`φ.

For example, ifφandφ⇒ψare axioms (or, if they have a proof), then one can inferψby means of a rule of inference calledModus Ponens. Thus,ψhas a proof, too.

Here we are not interested in the general axiom system of GUHA. However, we will need rules of inference to understand how LISpMiner decreases the amount of outcomes of practical GUHA procedures.

We give the definition of a (sound)rule of inference. It is of form φ₁,· · ·, φ_n

such that, wheneverv(φ_i) =TRUEfor alli =1,· · · ,n, then v(ψ) =TRUE, too. φ₁,· · · , φnarepremises,ψis theconclusion.

Rules of inference are calleddeduction rules, too.

Exercises

6. Prove or disprove

(a)|= [φ∨(ψ∧χ]⇔[(φ∧ψ)∨(φ∧χ)], (b)|= (φ∧ψ)⇔[¬(¬φ∨ ¬ψ)]and (c)|=φ⇒(ψ⇒φ),

whereφ, ψandχare well–formed formulae.

7. Is

(a)φ∧φa logical consequences of a set{ψ, φ}, (b)φa logical consequences of a set{ψ∨φ, ψ∧φ}, (c)φa logical consequences of a set{¬ψ∧φ,¬φ∨ψ}of well–formed formulae?

8. Prove that Modus Ponens is a sound rule of inference.

Exercises

In exercises 9 – 11, consider the Allergy matrix.

9. Write down all sentencesP_i(x)⇒_0.7,7P_j(x)such that v(P_i(x)⇒_0.7,7P_j(x)) =TRUE, (i6=j).

10. Doesv(Apple(x)∼Orange(x)) =TRUEhold true, where∼ is simple association?

11. Isv(Tomato(x)∼ ¬Cheese(x)) =TRUEtrue (∼is simple association)?

TAM002

Data mining with GUHA – Part 4

In document The GUHA Method in Data Mining (sivua 39-60)