Welcome to the Expert Systems course!
In this course, you will learn
• general knowledge of expert systems and their possibilities
• how to implement a system for your own problem
• how to compare different systems and select the best one
• several metaskills, like new learning methods, group work, problem solving, information retrieval, ...
Course details
5 ECTS master-level course
Classes: Wed 14-16, D106/B, 14.9.-30.11.2005. Teacher: Wilhelmiina Hämäläinen
(Meeting times: Wed 13.00-14.00, whamalai@cs.joensuu.fi)
Prerequisites: basic knowledge of logic and probability calculus
1
Performance
• No exam, no demo sessions, but we will still learn!
• Prepare for 12 hours of work per week, as in normal courses!
1. Participating in classes (you can skip at most 2 of the 12 classes)
2. Presentations about main expert systems: A student pair gives a 30 min presentation about some expert system, with some lecture material for the other students. They should also design an exercise task about the topic and check the solutions. Each exercise task should take approximately 2 hours of work.
3. Exercises: In the beginning of the course, exercises about the introduction lectures and all student presentations. Weekly exercises take approximately 6 hours of work in the beginning of the course.
4. Learning diary: Each student writes a personal learning diary and returns a piece of it weekly. The lecturer will give feedback the next week. The learning diaries are evaluated at the end of the course.
5. Project works: The project works are made in groups of three students (different groups than in the presentations). The projects are performed in several phases: specification (topic and data), design (how to implement), implementation and testing.
2
Evaluation
Project works 40%
Presentations 20%
Learning diaries 20%
Exercises 20%
Presentations will be evaluated by the students and the teacher (according to how much you learnt).
For learning diaries, we will fix the detailed criteria together!
3
Learning diary
In the beginning:
• What are your expectations?
• What do you know beforehand?
• What is your motivation?
Every week:
• What have you learnt, and what have you not?
• Set yourself learning goals: what should you learn or find out?
Also check whether you have reached your previous learning goals!
• All experiences, feelings and changes in motivation are also important!
• Explain the new or difficult things in your own words! Try to relate them to your previous knowledge, and invent new applications
4
In the end:
• Combine all pieces of knowledge into one and draw the final picture (in words or by drawing a concept map)
• Evaluate your whole learning process during the course: Did you reach your goals? Are you satisfied or not?
In addition:
• You can write or draw anything you wish; the style is totally free
• Return your piece of the learning diary weekly on paper, so that your teacher can give you individual feedback and answer your questions
• Save old learning diaries for the final evaluation
5
Preliminary schedule
14.9. Introduction to Expert systems, course overview and work allocation
21.9. Basic types of ESs, rule-based, functional and probabilistic systems
28.9. ESs, student presentations
5.10. ESs, student presentations
12.10. ESs, student presentations
19.10. ESs, last student presentations. Comparing and evaluating systems (lecturer). Beginning project works.
26.10. Work specifications (topic + data), short presentations.
2.11. Work plans (method + how to implement), short presentations.
9.11. Detailed plans with class diagrams and main algorithms.
16.11. Implementations, short presentations?
23.11. About testing, designing test plans. Other evaluation (restrictions, time and space requirements etc.)
30.11. Supersession (4 h), final presentations
6
What are expert systems?
Several definitions:
A computer system which emulates the decision-making ability of a human expert
A computer system capable of giving advice in a particular knowledge domain
A model of the expertise of the best practitioners of the field
7
• an application of Artificial Intelligence (simulates intelligent behaviour)
• roots in Decision Analysis (utility of decisions)
• also related to Machine Learning (learning models from data) and Data Mining (discovering new patterns in data)
• performs at the level of an expert (to assist an expert, or even instead of an expert)
• designed for a narrow problem domain
• can be implemented as an intelligent agent = a piece of software which monitors the system and offers help when it considers help is needed
8
What can they do?
Several types of tasks
• Diagnosis: finding faults in a system or diseases of living organisms, e.g. printer fault detection systems, automatic medical advisors on the net, ...
• Monitoring: continuous interpretation of signals, e.g. breathing-support machines in hospitals, control systems in industry
• Planning: producing a sequence of actions for achieving a particular goal (e.g. designing an ideal curriculum for a school, interactive guides for repairing a car, navigating an exploration robot on Mars)
• Instruction (intelligent tutoring systems): especially for practising physical skills, like flight simulators, but nowadays at all levels, especially in distance learning
• Prediction: forecasting future events (e.g. advising on pricing according to market fluctuations)
9
Examples
A rule-based system for automobile trouble-shooting: The system consists of simple if ... then rules, with uncertainty measures. For example:
If the result of trying the starter is "the car cranks normally" and a gas smell is not present while trying the starter, Then the gas tank is empty, with 90% confidence
An on-line demo can be found at:
http://www.expertise2go.com/webesie/car/
The system is described at:
http://www.expertise2go.com/webesie/
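The rule above could be represented, for instance, as a function returning a conclusion together with a confidence factor. A minimal Python sketch; all names here are illustrative, not taken from the actual system:

```python
# A minimal sketch of one diagnosis rule with a confidence factor,
# loosely modelled on the car troubleshooting rule above.
# The fact names and the rule function are invented for illustration.

facts = {
    "starter_result": "car cranks normally",
    "gas_smell": False,
}

def empty_tank_rule(facts):
    """If the car cranks normally and no gas smell is present,
    conclude 'gas tank is empty' with 90% confidence."""
    if (facts.get("starter_result") == "car cranks normally"
            and facts.get("gas_smell") is False):
        return ("gas tank is empty", 0.9)
    return None

print(empty_tank_rule(facts))  # ('gas tank is empty', 0.9)
```

A real shell would keep many such rules in a knowledge base and let an inference engine match them against the facts, instead of hard-coding one function.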
10
Bayesian automobile trouble shooting
The idea is to diagnose the most probable causes of automobile start-up problems and to evaluate the probability that the engine works when we fix a certain component. For example, if we observe that the gauge reads empty (whether or not there is fuel), we can calculate the probability that the car starts after we have filled the tank.
This example is from Breese, J.S., Heckerman, D. and Rommelse, K.: Troubleshooting Under Uncertainty. Technical report MSR-TR-94-07, 1994. http://www.research.microsoft.com/research/pubs/view.aspx?msr_tr_id=MSR-TR-94-07
11
A decision tree for medical diagnosis
An HIV-positive patient has come to the emergency department for care. What is the urgency of the visit? The nurse should classify patients as 0=emergent, 1=urgent, 2=non-urgent.
This example is from Lewis, R.J.: An Introduction to Classification and Regression Tree (CART) Analysis. Annual Meeting of the Society for Academic Emergency Medicine in San Francisco, California, 2000. http://www.saem.org/download/lewis1.pdf.
12
Why are ESs useful?
• help when expertise is scarce, expensive or unavailable; training new human experts is time-consuming and expensive
• save time, because they are fast and the same system can serve several users in parallel
• do not forget or make human errors
• are consistent: similar situations are always handled in the same way
• can combine the knowledge of multiple human experts
However, ESs
• are totally dependent on their knowledge base
• usually manage only one way of reasoning
• don't have common sense!
• cannot adapt to new situations creatively like people can
• don't recognize when no answer exists or when the problem is outside their area of expertise!
13
History
• medical software tools began to emerge during the 1970's, e.g.
MYCIN (Stanford University 1976) for aiding physicians in diagnosing and treating patients with infectious blood diseases
• the first systems were simple rule-based systems defined by human experts
• revolution in the late 80's: Bayesian networks
• in the 1990's several new applications and methods
• nowadays AI tools: general make-it-yourself expert systems
• evolution: rule-based systems ⇒ uncertain rules ⇒ Bayesian networks, neural networks, decision trees, ...
• a tendency from manually constructed ad hoc models towards machine learning: systems which can learn by themselves
14
System construction
Terminology
A model M = (S, θ) = a model structure S and assigned parameter values θ, i.e. an instance of a given model structure.
The model structure S is a structure which determines the parameters required for constructing a model. Typically the structure consists of variables and their relations (e.g. conditional dependencies).
The model parameters θ are assigned numerical values, like the probabilities in a Bayesian network or the linear coefficients (α, β, γ) in linear regression.
Modelling paradigm = the general modelling principles used.
A modelling paradigm consists of basic definitions, assumptions and techniques for constructing and using a certain kind of models.
15
Hierarchy of concepts
[Figure: a hierarchy of concepts. At the top are modelling paradigms (Bayesian networks, decision trees, linear regression); below them model families (e.g. different graph structures over variables A, B, C); below those model classes (a fixed graph structure, e.g. A with children B and C); and at the bottom individual models, e.g. a network over A, B, C with concrete probability tables P(A), P(B|A), P(B|¬A), P(C|A), P(C|¬A).]
Modelling paradigms like Bayesian networks, decision trees and linear regression describe the general modelling principles used.
A model family is a set of model structures in the given modelling paradigm. E.g. in the Bayesian network modelling paradigm, a model family consists of different graph structures with variable nodes and conditional dependencies (edges) between the variables.
A model class is a set of models with a fixed structure but different parameters θi. In the case of Bayesian networks, the parameters are the prior probabilities associated with root nodes and the conditional probabilities associated with non-root nodes.
A model is an item in a model class with a fixed structure and fixed parameters.
16
Construction process (from the implementor's point of view)
1. Understanding the domain: interview human experts and read the literature. What are the system's functionalities? Special utilities or risks?
2. Data management: what to collect and how? Preprocessing?
3. Define the model structure (see below). Especially the causal relations are important!
4. Define the model parameters, usually from data.
5. Verification and evaluation: Test the system! How sensitive is it to small changes in the data? (= sensitivity analysis)
17
Four approaches
[Figure: four approaches to ES construction. 1) The expert defines both the model structure and the model parameters. 2) A learning machine learns both the structure and the parameters from data. 3) The expert defines the structure and a learning machine learns the parameters from data. 4) Data mining finds patterns in the data, the expert interprets them to define the structure, and the parameters are learnt from data.]
18
1. Traditional approach: the expert defines both the model structure and the parameters
• a simple and fast method
• the expert knows the general causalities quite well
• numeric measures (parameters) are hard to estimate
2. Machine Learning approach: the whole model is defined by ML techniques
• simple model structures can be learnt fast, but learning a graph structure is in general an NP-hard problem! (intractable for large data sets)
• several model structures can be equally plausible; recognizing causal relationships is very hard!
• accurate numeric parameters, if the data set is good (large and representative)
19
3. Compromise approach: The expert defines the model structure, and the parameters are learnt from data
• combines the expert's knowledge of the general causalities with computational means for defining the parameters
• the data and the expert's knowledge can be contradictory!
4. Iterative approach (see: Descriptive and predictive modelling by WH): 1) Data mining reveals patterns in the data. An expert defines the model structure according to these patterns.
2) The model parameters are learnt from data.
• the data and the model structure are now consistent
• can be applied to new and poorly known domains
Task: What other advantages and disadvantages can you invent for each of the approaches?
20
Remarks
• Likelihood ≠ importance (utility): e.g. the most probable medical diagnosis is "There is nothing wrong with you, you just have a cold", but the 3rd most probable (not reported) diagnosis is lung cancer.
• Don't test with the same data set you have used for model construction; any model will then work perfectly! ⇒ use distinct training and test sets
• In every modelling paradigm there is some bias: each suits some problems better than others. Become conscious of the bias and select the best one!
21
Summary
The main tasks of expert systems can be reduced to a few main types:
1. Classification or regression: Given attributes X1, ..., Xk, what is Y?
(a) Classification: Y is a class variable (discrete valued)
(b) Regression: Y is a numeric function value Y = f(X1, ..., Xk)
2. Direction of inference:
• Prediction: Y is an effect of the Xi's (in the future)
• Diagnosis: Y is a cause of the Xi's (in the past)
3. Action selection: given a situation and goal(s), what is the optimal action or sequence of actions?
22
Task: Invent examples of expert systems which should learn a sequence of optimal actions!
23
Modelling paradigms
Three main techniques:
1. Logical rules
2. Numeric functions
3. Probabilistic reasoning
24
Rule-based systems
ES = inference engine + knowledge base
Knowledge base
• = rules + most of the facts
• Rules are of the form antecedent ⇒ consequent or if condition then action
• The condition part is expressed as a logical formula with the operators AND, OR, NOT
• Usually contains both factual and heuristic knowledge
• Heuristics = rules of good judgement, rules of good guessing, rules of thumb
An example rule from MYCIN:
if the stain of the organism is gram negative AND the morphology of the organism is rod AND the aerobicity of the organism is anaerobic THEN there is strongly suggestive evidence (0.8) that the class of the organism is Enterobacteriaceae
25
Inference engine
• Searches through the knowledge base and tries to match facts to the antecedents of rules.
• If a rule matches, its consequent can be executed.
• Several if-then rules are chained:
1. Forward chaining: starts from the given data and tries to find rules whose premises can be satisfied. When a rule is satisfied, its conclusions are added to the knowledge base, and new rules are searched for until all new information has been added.
2. Backward chaining (goal-oriented system): tries to prove a goal or rule conclusion by confirming the truth of all its premises.
A premise can be the conclusion of another rule, which has to be satisfied in turn, or data supplied by the user from observations.
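The forward chaining loop described above can be sketched in a few lines of Python; the rules and facts here are invented for illustration:

```python
# Forward chaining: a rule fires when all its premises are already
# known facts, and its conclusion is added to the knowledge base.
# The loop repeats until no rule can add anything new.

rules = [
    ({"battery ok", "fuel ok"}, "engine can start"),
    ({"engine can start", "key turned"}, "engine runs"),
]

def forward_chain(facts, rules):
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)   # rule fires: add its conclusion
                changed = True
    return facts

derived = forward_chain({"battery ok", "fuel ok", "key turned"}, rules)
print("engine runs" in derived)  # True
```

Backward chaining would instead start from the goal "engine runs" and recursively try to prove the premises of any rule that concludes it.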
26
Advantages:
• easy to construct
• experts can easily interpret the answers
• can explain the reasoning process (gives confidence, easier to debug)
Restrictions:
• cannot represent uncertainty
- real-world phenomena are often complex and incompletely known
- the knowledge can be imprecise or probabilistic
- or expressed vaguely (most, several, few, often, unlikely, ...)
• how to rank alternative solutions?
• how to evaluate the confidence of conclusions?
27
⇒ Solution: uncertain rules, which express uncertainty by
• Fuzzy logic
The truth value of A, T(A), is not 0 or 1 but T(A) ∈ [0, 1]
T(A ∧ B) = min(T(A), T(B))
T(A ∨ B) = max(T(A), T(B))
T(¬A) = 1 − T(A)
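These three operators are easy to try out directly; a minimal sketch with invented truth values:

```python
# Fuzzy truth values in [0, 1], combined with the min/max/complement
# operators given above. The truth values 0.7 and 0.4 are invented.

def f_and(a, b): return min(a, b)
def f_or(a, b):  return max(a, b)
def f_not(a):    return 1 - a

t_cold, t_windy = 0.7, 0.4
print(f_and(t_cold, t_windy))        # 0.4
print(f_or(t_cold, t_windy))         # 0.7
print(round(f_not(t_cold), 1))       # 0.3
```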
• Dempster-Shafer theory
Instead of a single value, we define an interval [Bel(A), Pl(A)], where Bel(A) tells how necessary A is, and Pl(A) = 1 − Bel(¬A) tells how possible A is. (Notice that in general Bel(A) + Bel(¬A) ≠ 1.)
Beliefs are combined by Dempster's rule.
A flexible way to represent ignorance (lack of information), but sometimes the results seem counter-intuitive (if you are used to probability theory)
• Probability theory (see below)
• other measures
28
Truth-maintenance systems
• mathematically sound methods
• not in fashion nowadays, but some uncertain extensions have been introduced
• Machine learning algorithms for model construction do not exist? (yet!)
• the most natural implementation language is Prolog, which is not in fashion...
29
Main types:
1. Justication-based TMS
• The set of logical rules is represented as a graph.
• The nodes correspond to premises (always true), assumptions (which can be cancelled) or contradictions.
• The justifications are logical rules of the form p1 ∧ p2 ∧ ... ∧ pn → p
• We can calculate the set of nodes which are believed, ask for its justifications, or cancel inconsistent assumptions
2. Assumption-based TMS
• a similar graph structure with assumptions and justifications (rules)
• now we calculate, for each node, all the minimal assumption sets under which it is true
• a more effective approach when the assumptions change often between queries
• extensions for uncertain reasoning (with probabilities, Dempster-Shafer beliefs, or possibilistic logic)
30
Decision trees
• the decision rules are expressed in tree form
• the whole model can easily be learnt from data
• can deal with both numeric and categorical data
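As a toy illustration of decision rules in tree form, here is a hand-written tree in the spirit of the triage example earlier; the attributes and thresholds are invented, not taken from the cited CART analysis:

```python
# A toy hand-coded decision tree classifying a patient visit as
# 0 = emergent, 1 = urgent, 2 = non-urgent.
# The attributes (blood pressure, age) and cut-offs are illustrative only.

def triage(systolic_bp, age):
    if systolic_bp < 90:
        return 0          # emergent
    elif age > 65:
        return 1          # urgent
    else:
        return 2          # non-urgent

print(triage(80, 40))   # 0
print(triage(120, 70))  # 1
print(triage(120, 30))  # 2
```

A learnt tree has exactly this shape; CART-style algorithms choose the split variables and thresholds automatically from data.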
Regression
• For predicting the numeric value of a dependent variable Y given the independent variables X1, ..., Xn.
• The goal is to learn a function f such that Y = f(X1, ..., Xn).
• The simplest model is linear regression: Y = α + β1X1 + ... + βnXn
• More complex functions can be learnt by neural networks
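For the simplest case of one independent variable, the least-squares parameters α and β can be computed in closed form; a sketch using only the Python standard library, with invented data:

```python
# Fitting the simplest linear regression Y = a + b*X by least squares:
# b = cov(X, Y) / var(X), a = mean(Y) - b * mean(X).

def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # data lies exactly on y = 1 + 2x
print(a, b)  # 1.0 2.0
```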
31
Probabilistic reasoning
The basic laws of probability calculus
• 0 ≤ P(A) ≤ 1
• If A can take k values a1, ..., ak, then Σi=1..k P(A = ai) = 1 (and for a binary-valued A, P(¬A) = 1 − P(A))
• The probability of A and B is P(A, B) = P(A)P(B), if A is independent of B
• The probability of A or B is P(A ∨ B) = P(A) + P(B) − P(A, B)
• The conditional probability of B given A is P(B|A) = P(A, B)/P(A)
• If we know P(A) and P(B|A), we can update the probability of A given B by Bayes' rule: P(A|B) = P(A)P(B|A)/P(B)
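A quick numeric check of Bayes' rule, with invented numbers (a rare fault and a fairly common symptom):

```python
# Bayes' rule: P(A|B) = P(A) * P(B|A) / P(B).
# The numbers below are illustrative only.

p_a = 0.01            # prior probability of a fault A
p_b_given_a = 0.9     # probability of symptom B given the fault
p_b = 0.05            # overall probability of the symptom

p_a_given_b = p_a * p_b_given_a / p_b
print(round(p_a_given_b, 2))  # 0.18
```

Note how observing the symptom raises the probability of the fault from 1% to 18%, but not anywhere near 90%: the prior matters.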
32
Example
You don't remember if your friend has a cat, but you are allergic to cats. In addition, you have hay fever in the summer time. In the autumn, the probability of hay fever is only 30%. You know that a cat alone causes you allergic symptoms with 60% probability, and hay fever alone with 70% probability. If there are both a cat and hay fever, the probability is 90%. What is the probability that you get allergic symptoms when you visit your friend?
Let's denote A = "Your friend has a cat", B = "You have hay fever", C = "You get allergic symptoms".
P(A) = 0.5 (you don't know it)
P(B) = 0.30
P(C|A, B) = 0.90
P(C|A, ¬B) = 0.60
P(C|¬A, B) = 0.70
P(C|¬A, ¬B) = 0.10 (this is rare, but not impossible; other factors can cause allergic symptoms, too!)
33
Now the probability of allergic symptoms is
P(C) = P(A)P(B)P(C|A, B) + P(A)P(¬B)P(C|A, ¬B) + P(¬A)P(B)P(C|¬A, B) + P(¬A)P(¬B)P(C|¬A, ¬B) =
You cannot see a cat; what is the probability of a cat, if you get allergic symptoms?
P(A|C) = P(A, C)/P(C) = (P(A, B, C) + P(A, ¬B, C))/P(C)
= (P(A)P(B)P(C|A, B) + P(A)P(¬B)P(C|A, ¬B))/P(C) =
Task: Calculate the probabilities!
Notice! If we knew the total probability distribution P(A, B, C) (for all value combinations), we could solve everything directly.
However, in practice we can usually estimate only the conditional probabilities.
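One way to check your answers to the task above is to evaluate the sums numerically; this uses only the probabilities given in the example:

```python
# Numerically evaluating P(C) and P(A|C) for the cat / hay fever example.

p_a, p_b = 0.5, 0.3
p_c = {  # P(C | A, B) for each value combination
    (True, True): 0.90, (True, False): 0.60,
    (False, True): 0.70, (False, False): 0.10,
}

def pa(a):  # P(A = a)
    return p_a if a else 1 - p_a

def pb(b):  # P(B = b)
    return p_b if b else 1 - p_b

# Total probability of allergic symptoms, P(C)
p_c_total = sum(pa(a) * pb(b) * p_c[(a, b)]
                for a in (True, False) for b in (True, False))

# P(A | C) by Bayes' rule: sum over both values of B
p_ac = sum(pa(True) * pb(b) * p_c[(True, b)] for b in (True, False))
p_a_given_c = p_ac / p_c_total

print(round(p_c_total, 3))    # 0.485
print(round(p_a_given_c, 3))  # 0.711
```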
34
Bayesian networks
• A nice visual representation of probabilistic relations
• Vertices correspond to variables and edges to dependencies between variables (if two variables are independent, there is no edge!)
• Each root node A has prior probabilities P(A) (for all values ai)
• Each non-root node B has a table of conditional probabilities P(B|X), where X is the set of antecedent nodes (for all value combinations of X)
• We can predict the probability of future events simply by chaining probabilities
• The probability of a cause A can be updated by Bayes' rule, when the events X have been observed
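Chaining probabilities can be illustrated with a minimal two-node network A → B; the numbers are invented:

```python
# A tiny Bayesian network A -> B: predict P(B) by summing over
# both values of the root node A (the law of total probability).

p_a = 0.2                              # prior P(A = true) of root node A
p_b_given = {True: 0.9, False: 0.1}    # conditional table P(B | A)

p_b = sum((p_a if a else 1 - p_a) * p_b_given[a] for a in (True, False))
print(round(p_b, 2))  # 0.26
```

In larger networks this summation is done over all parent configurations of each node, which is exactly what general-purpose Bayesian network inference engines automate.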
35
Neural networks
• The idea is to learn a function which maps inputs to outputs (general regression).
• In principle, can express any function, but the function can be hard to learn!
• The system consists of nodes which calculate the weighted sum of their inputs (from antecedent nodes), apply some function to it, and output the result to the successor nodes
Main types:
• Feed-forward networks: the data flows from inputs to outputs
- single-layer (= perceptron): simple, but restricted
- multi-layer: very expressive, but hard to learn and interpret
• Recurrent networks: for modelling time series data (now we also take into account the value of the node in the previous time step)
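A single node as described above, with a step activation, already gives a perceptron; a sketch computing the logical AND of two binary inputs (the weights are chosen by hand, not learnt):

```python
# One neural network node: weighted sum of inputs plus a bias,
# followed by an activation function (here a hard step).

def node(inputs, weights, bias):
    s = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if s > 0 else 0   # step activation

# A perceptron computing logical AND of two binary inputs:
# it outputs 1 only if both inputs are 1 (sum 2.0 - 1.5 > 0).
w, b = [1.0, 1.0], -1.5
print(node([1, 1], w, b))  # 1
print(node([1, 0], w, b))  # 0
```

Training algorithms (perceptron rule, backpropagation) adjust the weights and bias from data instead of setting them by hand.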
36
Instance-based methods
• do not build any explicit model
• Assumption: a data point behaves in a similar way to the neighbouring points
• How to define the neighbours?
k-nearest neighbours
• for numeric data
• A simple approach: if d1, ..., dk are the k nearest neighbours of d according to the Xi variables, then Y is the weighted average of Y1, ..., Yk
• Problem: how to define the neighbours? There are several possible distance functions (and the Euclidean distance is not always the best!)
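A minimal k-nearest-neighbours sketch; for simplicity it uses a plain unweighted average and one-dimensional Euclidean distance, and the data is invented:

```python
# k-nearest neighbours for regression: predict Y at a query point as
# the average Y of the k closest data points (unweighted, 1-D distance).

def knn_predict(data, x_query, k):
    # data: list of (x, y) pairs
    nearest = sorted(data, key=lambda p: abs(p[0] - x_query))[:k]
    return sum(y for _, y in nearest) / k

data = [(1, 10), (2, 12), (3, 14), (10, 50)]
print(knn_predict(data, 2.5, 2))  # 13.0  (neighbours at x=2 and x=3)
```

A distance-weighted variant would weight each neighbour's Y by, e.g., the inverse of its distance, as the slide's "weighted average" suggests.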
Several variations for non-numeric data, e.g.
• Case-based systems: All solutions are stored in a database.
Similar cases are searched for in the database, and their solutions are applied to new situations, or new solutions are generated from the old ones.
• Social or collaborative filtering (usually recommender systems, not necessarily expert systems in the classical sense)
37
Genetic algorithms
• simulates the evolution process
• we keep a set of solutions (a population), which evolves through selection, crossover and mutation operations
• can be slow, but achieves good results for complex problems which are hard to model otherwise
• e.g. learning the set of optimal action rules for a robot
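The selection/crossover/mutation loop can be sketched on a toy problem, maximising the number of 1-bits in a bit string; all parameters here are invented:

```python
# A toy genetic algorithm for the "one-max" problem: evolve a
# population of bit strings towards the all-ones string.
import random

random.seed(0)
LENGTH, POP, GENS = 12, 20, 40

def fitness(bits):
    return sum(bits)               # number of 1-bits

def crossover(a, b):
    cut = random.randrange(1, LENGTH)
    return a[:cut] + b[cut:]       # one-point crossover

def mutate(bits):
    bits = bits[:]
    bits[random.randrange(LENGTH)] ^= 1   # flip one random bit
    return bits

pop = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for _ in range(GENS):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:POP // 2]       # selection: keep the fittest half
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(POP - len(parents))]
    pop = parents + children

best = max(pop, key=fitness)
print(fitness(best))  # at or near the optimum of 12
```

Real applications replace the bit-string fitness with a domain-specific score, e.g. how well a robot's rule set performs in simulation.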
38
Which paradigm to select?
• There is no single super system for all problems!
• The most appropriate system depends on the problem
• Systems come into fashion and go out of fashion
Abstraction level?
Does the system work with
• high-level data (structured, e.g. a relational database)
• low-level data (unstructured sensor data, e.g. sounds, images)
• or both?
Traditionally only high-level data ⇒ the expert can define the model and understands it!
Task: Invent applications of expert systems which use only low-level data!
39
[Figure: data flow in ES construction. Physical sensors (sensor 1, sensor 2, ..., sensor k) produce unstructured data, which pattern recognition turns into structured information; text documents pass through information retrieval and data through knowledge discovery; together with the expert's domain knowledge and other databases, these feed into the model.]
40
Basic datatypes
Usually a system is restricted to certain data types!
[Figure: taxonomy of data types. DATA divides into numeric (quantitative) data, which is either continuous or discrete, and categorical (qualitative) data, which is either nominal or ordinal.]
The data can be classified according to the type of the attribute domain:
• numerical (quantitative) data: has meaning as numbers: we can measure the order, distance and equality of two numerical variables.
- discrete data: can take only countably many values (notice: in computer systems, all data is represented with some precision and is thus discrete) (e.g. age ∈ {1, ..., 120})
- continuous data: can in principle take any value in the given interval (e.g. weight ≥ 0)
• categorical (qualitative) data: has no meaning as numbers; generally we can measure only the equality of two categorical variables (e.g. 1=female, 2=male)
- nominal data: the values are only names for categories and have no meaning beyond categorizing
- ordinal data: the values have a natural ordering, but the differences have no meaning (e.g. 1=poor, 2=fair, 3=good, 4=excellent)
41
Temporal or static data?
In temporal data we can find dynamic patterns (how the system evolves in time)
Amount of data:
If we have too little data (relative to the number of variables), the parameters cannot be estimated accurately!
Corrupted data:
• The data may contain erroneous or missing values
• The measurements can cause variation (noise)
• Outliers: non-typical data points with extreme values
⇒ Solutions:
• enough data
• robust models (=not sensitive to small variations in data)
• use of expert's knowledge
Task: Find out what overfitting means and how it can be prevented!
42
Desirable characteristics
Typically, we wish that the model
• can be learnt from data
• can integrate expert's knowledge
• reasons in a consistent way and detects inconsistencies (e.g.
conflicting rules, circular rule chains, illegal values)
• can be updated in the light of new evidence
• can handle noisy data (missing, spurious or erroneous values)
• meets the time requirements with the given data (works in real time, if needed)
43