Data mining with GUHA – Part 2 GUHA produces hypothesis

Esko Turunen¹

1Tampere University of Technology

GUHA does not test hypothesis, instead GUHA produces them In this chapter we learn the outlines of GUHA e.g. that

•GUHA is suitable for exploratory analysis of large data

•GUHA goes through 2×2 contingency tables and picks up the interesting ones

•in logic terms, GUHA is based on first order monoidal logic whose models are finite

•a fundamental reference, the ‘GUHA book’, is free to be downloaded from www.cs.cas.cz/hajek/guhabook/

♣1. GUHA (General Unary Hypotheses Automaton) is a method of automatic generation of hypotheses based on empirical data, thus a method of data mining.

•GUHA is one of the oldest methods of data mining – GUHA was introduced by Hájek, Havel and Chytil inThe GUHA method of automatic hypotheses determinationpublished in Computing 1 (1966) 293–308.

•GUHA still develops: there are dozens of new features added to GUHA during the past ten years.

•GUHA is a kind of automated exploratory data analysis.

Instead of testing given hypothesissupported by a data,GUHA procedure generates systematically hypothesessupported by the data. By GUHA it is possible to find all interesting

dependences hidden in a data.

♣2. GUHA is suitable for exploratory analysis of large data.

•The processed data form a rectangle matrix, where rows corresponds to objects belonging to the sample and each column corresponds to one investigated variable. A typical data matrix processed by GUHA has hundreds or thousands of rows and tens of columns.

•Cells of the analyzed data matrix can contain whatever symbols, however, before a concrete data mining task the data must be categorized to contain only0s or1s (or are empty).

•Exploratory analysismeans that there is no single specific hypothesis that should be tested by our data; rather, our aim is to get orientation in the domain of investigation, analyze the behavior of chosen variables, interactions among them etc.

Such inquiry is not blind but directed by some general (possibly vague) direction of research (some general problem).

♣3. GUHA systematically creates all hypotheses interesting from the point of view of a given general problem and on the base of given data.

•This is the main principle: all interesting hypotheses. Clearly, this contains a dilemma:allmeans most possible,only

interestingmeansnot too many. To cope with this dilemma, one may use different GUHA procedures, implemented in LISp–Miner system, and having selected one, by fixing in various ways its numerous parameters.

•The LISp–Miner system leads the user and makes the

selection of parameters relatively easy but somewhat laborious.

LISp–Miner cannot be as automatized as much as e.g.

B–course is automatized, however, the results of LISp–Miner are much more illustrative and detailed.

Three remarks:

•GUHA procedures polyfactorial hypotheses i.e. not only hypotheses relating one variable with another one, but expressing relations among single variables, pairs, triples, quadruples of variables etc.

•GUHA offers hypotheses. Exploratory character implies that the hypotheses produced by the computer (numerous in number: typically tens or hundreds of hypotheses) are just supported by the data, not verified. You are assumed to use this offer as inspiration, and possibly select some few hypotheses for further testing.

•GUHA is not suitable for testing a single hypothesis: routine packages are good for this.

♣4. 4ft–miner procedure generateshypotheses(or

observational statements) on association between complex Booleanformulae(attributes). These formulae are constructed fromunary predicates(corresponding to the columns of the processed0/1–data matrix) by logical connectives∧,∨,¬ (conjunction, disjunction, negation).

Examples of predicates are

TEMPERATURE :≥38^◦C,PRESURE: high,DIAGNOSIS: infection

Examples of formulae are

[TEMPERATURE :≥38^◦C]∧[PRESURE: high]and[DIAGNOSIS: infection]

An example of a hypothesis is

[TEMPERATURE :≥38^◦C]∧[PRESURE: high]≈[DIAGNOSIS: infection]

More generally, hypotheses are of formφ≈ψ.

Notice that our terminology differs from the original GUHA approach: we want to keep things as simple as possible!

Given the 0/1–data matrix, each pair of Boolean attributesφ,ψ determines itsfour-fold frequency table; the association of with is defined by choosing anassociational or generalized

quantifieri.e. a function assigning to each four-fold table either 1 (associated ortruein the data) or 0 (not associated or falsein the data) and satisfying some natural monotonicity conditions.

♣The four–fold table has the form:

ψ ¬ψ

φ a b a+b=r

¬φ c d c+d =s

a+c =k b+d =l m wherea+b+c+d =mand

•ais the number of objects satisfying bothφandψ,

•bis the number of objects satisfyingφbut notψ,

•c is the number of objects not satisfyingφbut satisfyingψ,

♣6. There are various types of generalized quantifiers formalizing various kinds of associations:

•Implicational quantifiersformalize the associationmanyφare ψ, they do not depend on the valuesc,d.

•Comparative quantifiersformalize the associationφmakesψ more likely than¬ψ.

•Some quantifiers just express observations on the data, some others serve as tests of statistical hypotheses on unknown probabilities.

•Some quantifiers onessymmetric:φ≈ψimpliesψ≈φ, some admit negation:φ≈ψimplies¬φ≈ ¬ψ.

4ft-Miner procedure contains dozen of generalized quantifiers, a novel procedure Ac4ft–Miner offers alsoaction quantifiers. An advantage of GUHA is that new quantifiers can be defined and their properties can be analyzed in a well establish logic framework.

♣7. After preparing the original data matrix to a 0/1-format, the user can start the 4ft–Miner procedure. The input of such a procedure consists of

(1) the 0/1–data matrix

(2) parameters determining symbolic restriction to the pairsψ, φ of Boolean attributes to be generated;ψis calledantecedent andφis calledsuccedent

(3) the quantifier to be used and some parameters of the quantifier

(4) some other determinations, for example the user has to declare predicates that can occur in the antecedent and the succedent, the use of logic connectives, minimal and maximal length of antecedent and succedent, way to process of missing data etc.

8. 4ft–Miner produces all associationsφ≈ψsatisfying the syntactic restrictions and true in the data.The generation is not done blindly but uses various techniques serving to avoid exhaustive search. The found associations together with various parameters are not mechanically printed but saved in a solution file for further processing.

9. 4ft–Result module for interpretation of results enables the user to browse the associations’ format, sort them according to various criteria, select reasonably defined subsets and output concise information of various kinds.

There are however other procedures implemented in the LISp–Miner system that do mine for other kinds of patterns, even for more complex one than associational rules.

10. The GUHA method has deep logical and statistical

foundations. GUHA is being further developed at the institute of Computer Science of the Academy of Sciences of the Czech Republic (Petr Hájek and his group), at the Prague University of Economics (Jan Rauch and his group) and at Tampere

University of Technology (Esko Turunen and his group).

Example

Assume we are observing children who have an allergic reaction to, say, tomato, apple, orange, cheese or milk. These observations are presented in the following table:

Child Tomato Apple Orange Cheese Milk

Anna 1 1 0 1 1

Thus, we have observations asChildx is allergic to milkand Childy is allergic to cheese, We write shorter Milk(x) and Cheese(y).

•Milk(-), Cheese(-), Tomato(-), Orange(-) and Apple(-) are unary predicatesof our observational language andx,y,z,· · · arevariables.

•Expressions like Milk(x) or Cheese(y) areatomic (open) formulae. Combine formulae by logical connectives¬(not),∧ (and)∨(or). E.g. Milk(x)∧¬Cheese(x) would mean

Childx is allergic to milk and is not allergic to cheese.

•However, in stead of open formulae, we are more interested in universal closed formulae, e.g. All children are allergic to milk,Mostchildren are not allergic to orange, There is a child allergic to tomato,In most casesif a child is allergic to milk then she/he is allergic to cheese, too.

Everyone who passed a basic course in logic would know that statements likeAll children are allergic to milkandThere is a child allergic to tomatoare expressible by∀xMilk(x)and

∃xTomato(x).

Unfortunately classical mathematical quantifiers∀and∃are rather useless in the real world. Much more valuable are generalizedquantifiers likeIn most cases, e.g. in the statement In most casesif a child is allergic to milk then she/he is allergic to cheese, too.

GUHA method is a logic formalism of generalized quantifiers for data mining purposes.

TAM002

Data mining with GUHA – Part 3

In document The GUHA Method in Data Mining (sivua 24-39)