Welcome to the Expert Systems course!
In this course, you will learn
• general knowledge of expert systems and their possibilities
• how to implement a system for your own problem
• how to compare different systems and select the best one
• several metaskills, like new learning methods, group work, problem solving, information retrieval, ...
Course details
5 ECTS master-level course
Classes: Wed 14-16, D106/B, 14.9.-30.11.2005. Teacher: Wilhelmiina Hämäläinen
(Meeting times: Wed 13.00-14.00, whamalai@cs.joensuu.fi)
Prerequisites: basic knowledge of logic and probability calculus
1
Performance
• No exam, no demo sessions, but we will still learn!
• Prepare for 12 hours of work per week, as in normal courses!
1. Participating in classes (you can skip at most 2 of the 12 classes)
2. Presentations about main expert systems: A student pair gives a 30 min presentation about some expert system, with some lecture material for the other students. They should also design an exercise task about the topic and check the solutions. Each exercise task should take approximately 2 hours of work.
3. Exercises: In the beginning of the course, exercises about the introduction lectures and all student presentations. Weekly exercises take approximately 6 hours of work in the beginning of the course.
4. Learning diary: Each student writes a personal learning diary and returns a piece of it weekly. The lecturer will give feedback the next week. The learning diaries are evaluated at the end of the course.
5. Project works: The project works are made in groups of three students (different groups than in the presentations). The projects are performed in several phases: specification (topic and data), design (how to implement), implementation and testing.
2
Evaluation
Project works 40%
Presentations 20%
Learning diaries 20%
Exercises 20%
Presentations will be evaluated by the students and the teacher (according to how much you learnt).
For learning diaries, we will fix the detailed criteria together!
3
Learning diary
In the beginning:
• What are your expectations?
• What do you know beforehand?
• What is your motivation?
Every week:
• What have you learnt, and what have you not?
• Set yourself learning goals: what should you learn or find out?
Also check whether you have reached your previous learning goals!
• All experiences, feelings and changes in motivation are also important!
• Explain the new or difficult things in your own words! Try to relate them to your previous knowledge, and invent new applications
4
In the end:
• Combine all pieces of knowledge into one and draw the final picture (in words or by drawing a concept map)
• Evaluate your whole learning process during the course: Did you reach your goals? Are you satisfied or not?
In addition:
• You can write or draw anything you wish; the style is totally free
• Return your piece of the learning diary weekly on paper, so that your teacher can give you individual feedback and answer your questions
• Save old learning diaries for the final evaluation
5
Preliminary schedule
14.9. Introduction to Expert systems, course overview and work allocation
21.9. Basic types of ESs, rule-based, functional and probabilistic systems
28.9. ESs, student presentations
5.10. ESs, student presentations
12.10. ESs, student presentations
19.10. ESs, last student presentations. Comparing and evaluating systems (lecturer). Beginning project works.
26.10. Work specifications (topic + data), short presentations.
2.11. Work plans (method + how to implement), short presentations.
9.11. Detailed plans with class diagrams and main algorithms.
16.11. Implementations, short presentations?
23.11. About testing, designing test plans. Other evaluation (restrictions, time and space requirements etc.)
30.11. Supersession (4 h), final presentations
6
What are expert systems?
Several definitions:
A computer system which emulates the decision-making ability of a human expert
A computer system capable of giving advice in a particular knowledge domain
A model of the expertise of the best practitioners of the field
7
• an application of Artificial Intelligence (simulates intelligent behaviour)
• roots in Decision Analysis (utility of decisions)
• also related to Machine Learning (learning models from data) and Data Mining (discovering new patterns in data)
• performs at the level of an expert (to assist an expert, or even instead of an expert)
• designed for a narrow problem domain
• can be implemented as an intelligent agent = a piece of software which monitors the system and offers help when it considers help is needed
8
What can they do?
Several types of tasks
• Diagnosis: finding faults in a system or diseases of living organisms, e.g. printer fault detection systems, automatic medical advisors on the net, ...
• Monitoring: continuous interpretation of signals, e.g. breathing-support machines in hospitals, control systems in industry
• Planning: producing a sequence of actions for achieving a particular goal (e.g. designing an ideal curriculum for a school, interactive guides for repairing a car, navigating an exploration robot on Mars)
• Instruction (intelligent tutoring systems): especially for practising physical skills, like flight simulators, but nowadays at all levels, especially in distance learning
• Prediction: forecasting future events (e.g. advising on pricing according to market fluctuations)
9
Examples
A rule-based system for automobile trouble-shooting: The system consists of simple if ... then rules, with uncertainty measures. For example:
If the result of trying the starter is "the car cranks normally" and a gas smell is not present while trying the starter, Then the gas tank is empty, with 90% confidence
An on-line demo can be found at:
http://www.expertise2go.com/webesie/car/
The system is described at:
http://www.expertise2go.com/webesie/
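The rule above could be represented, for instance, as a function returning a conclusion together with a confidence factor. A minimal Python sketch; all names here are illustrative, not taken from the actual system:

```python
# A minimal sketch of one diagnosis rule with a confidence factor,
# loosely modelled on the car troubleshooting rule above.
# The fact names and the rule function are invented for illustration.

facts = {
    "starter_result": "car cranks normally",
    "gas_smell": False,
}

def empty_tank_rule(facts):
    """If the car cranks normally and no gas smell is present,
    conclude 'gas tank is empty' with 90% confidence."""
    if (facts.get("starter_result") == "car cranks normally"
            and facts.get("gas_smell") is False):
        return ("gas tank is empty", 0.9)
    return None

print(empty_tank_rule(facts))  # ('gas tank is empty', 0.9)
```

A real shell would keep many such rules in a knowledge base and let an inference engine match them against the facts, instead of hard-coding one function.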
10
Bayesian automobile trouble shooting
The idea is to diagnose the most probable causes of automobile start-up problems and to evaluate the probability that the engine works when we fix a certain component. For example, if we observe that the gauge reads empty (whether or not there is fuel), we can calculate the probability that the car starts after we have filled the tank.
This example is from Breese, J.S., Heckerman, D. and Rommelse, K.: Troubleshooting Under Uncertainty. Technical report MSR-TR-94-07, 1994. http://www.research.microsoft.com/research/pubs/view.aspx?msr_tr_id=MSR-TR-94-07
11
A decision tree for medical diagnosis
An HIV-positive patient has come to the emergency department for care. What is the urgency of the visit? The nurse should classify patients as 0=emergent, 1=urgent, 2=non-urgent.
This example is from Lewis, R.J.: An Introduction to Classification and Regression Tree (CART) Analysis. Annual Meeting of the Society for Academic Emergency Medicine in San Francisco, California, 2000. http://www.saem.org/download/lewis1.pdf.
12
Why are ESs useful?
• help when expertise is scarce, expensive or unavailable; training new human experts is time-consuming and expensive
• save time, because they are fast and the same system can serve several users in parallel
• do not forget or make human errors
• are consistent: similar situations are always handled in the same way
• can combine the knowledge of multiple human experts
However, ESs
• are totally dependent on their knowledge base
• usually manage only one way of reasoning
• don't have common sense!
• cannot adapt to new situations creatively like people can
• don't recognize when no answer exists or when the problem is outside their area of expertise!
13
History
• medical software tools began to emerge during the 1970's, e.g.
MYCIN (Stanford University 1976) for aiding physicians in diagnosing and treating patients with infectious blood diseases
• the first systems were simple rule-based systems defined by human experts
• revolution in the late 80's: Bayesian networks
• in the 1990's several new applications and methods
• nowadays AI tools: general make-it-yourself expert systems
• evolution: rule-based systems ⇒ uncertain rules ⇒ Bayesian networks, neural networks, decision trees, ...
• a tendency from manually constructed ad hoc models towards machine learning: systems which can learn by themselves
14
System construction
Terminology
A model M = (S, θ) = a model structure S and assigned parameter values θ, i.e. an instance of a given model structure.
The model structure S is a structure which determines the parameters required for constructing a model. Typically the structure consists of variables and their relations (e.g. conditional dependencies).
The model parameters θ are assigned numerical values, like the probabilities in a Bayesian network or the linear coefficients (α, β, γ) in linear regression.
Modelling paradigm = the general modelling principles used.
A modelling paradigm consists of basic definitions, assumptions and techniques for constructing and using a certain kind of models.
15
Hierarchy of concepts
[Figure: a hierarchy of concepts. At the top are modelling paradigms (Bayesian networks, decision trees, linear regression); below them model families (e.g. different graph structures over variables A, B, C); below those model classes (a fixed graph structure, e.g. A with children B and C); and at the bottom individual models, e.g. a network over A, B, C with concrete probability tables P(A), P(B|A), P(B|¬A), P(C|A), P(C|¬A).]
Modelling paradigms like Bayesian networks, decision trees and linear regression describe the general modelling principles used.
A model family is a set of model structures in the given modelling paradigm. E.g. in the Bayesian network modelling paradigm, a model family consists of different graph structures with variable nodes and conditional dependencies (edges) between the variables.
A model class is a set of models with a fixed structure but different parameters θi. In the case of Bayesian networks, the parameters are the prior probabilities associated with root nodes and the conditional probabilities associated with non-root nodes.
A model is an item in a model class with a fixed structure and fixed parameters.
16
Construction process (from the implementor's point of view)
1. Understanding the domain: interview human experts and read the literature. What are the system's functionalities? Special utilities or risks?
2. Data management: what to collect and how? Preprocessing?
3. Define the model structure (see below). Especially the causal relations are important!
4. Define the model parameters, usually from data.
5. Verification and evaluation: Test the system! How sensitive is it to small changes in the data? (= sensitivity analysis)
17
Four approaches
[Figure: four approaches to ES construction. 1) The expert defines both the model structure and the model parameters. 2) A learning machine learns both the structure and the parameters from data. 3) The expert defines the structure and a learning machine learns the parameters from data. 4) Data mining finds patterns in the data, the expert interprets them to define the structure, and the parameters are learnt from data.]
18
1. Traditional approach: the expert defines both the model structure and the parameters
• a simple and fast method
• the expert knows the general causalities quite well
• numeric measures (parameters) are hard to estimate
2. Machine Learning approach: the whole model is defined by ML techniques
• simple model structures can be learnt fast, but learning a graph structure is in general an NP-hard problem! (intractable for large data sets)
• several model structures can be equally plausible; recognizing causal relationships is very hard!
• accurate numeric parameters, if the data set is good (large and representative)
19
3. Compromise approach: The expert defines the model structure, and the parameters are learnt from data
• combines the expert's knowledge of the general causalities with computational means for defining the parameters
• the data and the expert's knowledge can be contradictory!
4. Iterative approach (see: Descriptive and predictive modelling by WH): 1) Data mining reveals patterns in the data. An expert defines the model structure according to these patterns.
2) The model parameters are learnt from data.
• the data and the model structure are now consistent
• can be applied to new and poorly known domains
Task: What other advantages and disadvantages can you invent for each of the approaches?
20
Remarks
• Likelihood ≠ importance (utility): e.g. the most probable medical diagnosis is "There is nothing wrong with you, you just have a cold", but the 3rd most probable (not reported) diagnosis is lung cancer.
• Don't test with the same data set you have used for model construction; any model will then work perfectly! ⇒ use distinct training and test sets
• In every modelling paradigm there is some bias: each suits some problems better than others. Become conscious of the bias and select the best one!
21
Summary
The main tasks of expert systems can be reduced to a few main types:
1. Classification or regression: Given attributes X1, ..., Xk, what is Y?
(a) Classification: Y is a class variable (discrete valued)
(b) Regression: Y is a numeric function value Y = f(X1, ..., Xk)
2. Direction of inference:
• Prediction: Y is an effect of the Xi's (in the future)
• Diagnosis: Y is a cause of the Xi's (in the past)
3. Action selection: given a situation and goal(s), what is the optimal action or sequence of actions?
22
Task: Invent examples of expert systems which should learn a sequence of optimal actions!
23
Modelling paradigms
Three main techniques:
1. Logical rules
2. Numeric functions
3. Probabilistic reasoning
24
Rule-based systems
ES = inference engine + knowledge base
Knowledge base
• = rules + most of the facts
• Rules are of the form antecedent ⇒ consequent or if condition then action
• The condition part is expressed as a logical formula with the operators AND, OR, NOT
• Usually contains both factual and heuristic knowledge
• Heuristics = rules of good judgement, rules of good guessing, rules of thumb
An example rule from MYCIN:
if the stain of the organism is gram negative AND the morphology of the organism is rod AND the aerobicity of the organism is anaerobic THEN there is strongly suggestive evidence (0.8) that the class of the organism is Enterobacteriaceae
25
Inference engine
• Searches through the knowledge base and tries to match facts to the antecedents of rules.
• If a rule matches, its consequent can be executed.
• Several if-then rules are chained:
1. Forward chaining: starts from the given data and tries to find rules whose premises can be satisfied. When a rule is satisfied, its conclusions are added to the knowledge base, and new rules are searched for until all new information has been added.
2. Backward chaining (goal-oriented system): tries to prove a goal or rule conclusion by confirming the truth of all its premises.
A premise can be the conclusion of another rule, which has to be satisfied in turn, or data supplied by the user from observations.
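The forward chaining loop described above can be sketched in a few lines of Python; the rules and facts here are invented for illustration:

```python
# Forward chaining: a rule fires when all its premises are already
# known facts, and its conclusion is added to the knowledge base.
# The loop repeats until no rule can add anything new.

rules = [
    ({"battery ok", "fuel ok"}, "engine can start"),
    ({"engine can start", "key turned"}, "engine runs"),
]

def forward_chain(facts, rules):
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)   # rule fires: add its conclusion
                changed = True
    return facts

derived = forward_chain({"battery ok", "fuel ok", "key turned"}, rules)
print("engine runs" in derived)  # True
```

Backward chaining would instead start from the goal "engine runs" and recursively try to prove the premises of any rule that concludes it.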
26
Advantages:
• easy to construct
• experts can easily interpret the answers
• can explain the reasoning process (gives confidence, easier to debug)
Restrictions:
• cannot represent uncertainty
- real-world phenomena are often complex and incompletely known
- the knowledge can be imprecise or probabilistic
- or expressed vaguely (most, several, few, often, unlikely, ...)
• how to rank alternative solutions?
• how to evaluate the confidence of conclusions?
27
⇒ Solution: uncertain rules, which express uncertainty by
• Fuzzy logic
The truth value of A, T(A), is not 0 or 1 but T(A) ∈ [0, 1]
T(A ∧ B) = min(T(A), T(B))
T(A ∨ B) = max(T(A), T(B))
T(¬A) = 1 − T(A)
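These three operators are easy to try out directly; a minimal sketch with invented truth values:

```python
# Fuzzy truth values in [0, 1], combined with the min/max/complement
# operators given above. The truth values 0.7 and 0.4 are invented.

def f_and(a, b): return min(a, b)
def f_or(a, b):  return max(a, b)
def f_not(a):    return 1 - a

t_cold, t_windy = 0.7, 0.4
print(f_and(t_cold, t_windy))        # 0.4
print(f_or(t_cold, t_windy))         # 0.7
print(round(f_not(t_cold), 1))       # 0.3
```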
• Dempster-Shafer theory
Instead of a single value, we define an interval [Bel(A), Pl(A)], where Bel(A) tells how necessary A is, and Pl(A) = 1 − Bel(¬A) tells how possible A is. (Notice that in general Bel(A) + Bel(¬A) ≠ 1.)
Beliefs are combined by Dempster's rule.
A flexible way to represent ignorance (lack of information), but sometimes the results seem counter-intuitive (if you are used to probability theory)
• Probability theory (see below)
• other measures
28
Truth-maintenance systems
• mathematically sound methods
• not in fashion nowadays, but some uncertain extensions have been introduced
• Machine learning algorithms for model construction do not exist? (yet!)
• the most natural implementation language is Prolog, which is not in fashion...
29
Main types:
1. Justication-based TMS
• The set of logical rules is represented as a graph.
• The nodes correspond to premises (always true), assumptions (which can be cancelled) or contradictions.
• The justifications are logical rules of the form p1 ∧ p2 ∧ ... ∧ pn → p
• We can calculate the set of nodes which are believed, ask for its justifications, or cancel inconsistent assumptions
2. Assumption-based TMS
• a similar graph structure with assumptions and justifications (rules)
• now we calculate, for each node, all the minimal assumption sets under which it is true
• a more effective approach when the assumptions change often between queries
• extensions for uncertain reasoning (with probabilities, Dempster-Shafer beliefs, or possibilistic logic)
30
Decision trees
• the decision rules are expressed in tree form
• the whole model can easily be learnt from data
• can deal with both numeric and categorical data
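As a toy illustration of decision rules in tree form, here is a hand-written tree in the spirit of the triage example earlier; the attributes and thresholds are invented, not taken from the cited CART analysis:

```python
# A toy hand-coded decision tree classifying a patient visit as
# 0 = emergent, 1 = urgent, 2 = non-urgent.
# The attributes (blood pressure, age) and cut-offs are illustrative only.

def triage(systolic_bp, age):
    if systolic_bp < 90:
        return 0          # emergent
    elif age > 65:
        return 1          # urgent
    else:
        return 2          # non-urgent

print(triage(80, 40))   # 0
print(triage(120, 70))  # 1
print(triage(120, 30))  # 2
```

A learnt tree has exactly this shape; CART-style algorithms choose the split variables and thresholds automatically from data.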
Regression
• For predicting the numeric value of a dependent variable Y given the independent variables X1, ..., Xn.
• The goal is to learn a function f such that Y = f(X1, ..., Xn).
• The simplest model is linear regression: Y = α + β1X1 + ... + βnXn
• More complex functions can be learnt by neural networks
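For the simplest case of one independent variable, the least-squares parameters α and β can be computed in closed form; a sketch using only the Python standard library, with invented data:

```python
# Fitting the simplest linear regression Y = a + b*X by least squares:
# b = cov(X, Y) / var(X), a = mean(Y) - b * mean(X).

def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # data lies exactly on y = 1 + 2x
print(a, b)  # 1.0 2.0
```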
31
Probabilistic reasoning
The basic laws of probability calculus
• 0 ≤ P(A) ≤ 1
• If A can take k values a1, ..., ak, then Σi=1..k P(A = ai) = 1 (and for a binary-valued A, P(¬A) = 1 − P(A))
• The probability of A and B is P(A, B) = P(A)P(B), if A is independent of B
• The probability of A or B is P(A ∨ B) = P(A) + P(B) − P(A, B)
• The conditional probability of B given A is P(B|A) = P(A, B)/P(A)
• If we know P(A) and P(B|A), we can update the probability of A given B by Bayes' rule: P(A|B) = P(A)P(B|A)/P(B)
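A quick numeric check of Bayes' rule, with invented numbers (a rare fault and a fairly common symptom):

```python
# Bayes' rule: P(A|B) = P(A) * P(B|A) / P(B).
# The numbers below are illustrative only.

p_a = 0.01            # prior probability of a fault A
p_b_given_a = 0.9     # probability of symptom B given the fault
p_b = 0.05            # overall probability of the symptom

p_a_given_b = p_a * p_b_given_a / p_b
print(round(p_a_given_b, 2))  # 0.18
```

Note how observing the symptom raises the probability of the fault from 1% to 18%, but not anywhere near 90%: the prior matters.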
32
Example
You don't remember if your friend has a cat, but you are allergic to cats. In addition, you have hay fever in the summer time. In the autumn, the probability of hay fever is only 30%. You know that a cat alone causes you allergic symptoms with 60% probability, and hay fever alone with 70% probability. If there are both a cat and hay fever, the probability is 90%. What is the probability that you get allergic symptoms when you visit your friend?
Let's denote A = "Your friend has a cat", B = "You have hay fever", C = "You get allergic symptoms".
P(A) = 0.5 (you don't know it)
P(B) = 0.30
P(C|A, B) = 0.90
P(C|A, ¬B) = 0.60
P(C|¬A, B) = 0.70
P(C|¬A, ¬B) = 0.10 (this is rare, but not impossible; other factors can cause allergic symptoms, too!)
33
Now the probability of allergic symptoms is
P(C) = P(A)P(B)P(C|A, B) + P(A)P(¬B)P(C|A, ¬B) + P(¬A)P(B)P(C|¬A, B) + P(¬A)P(¬B)P(C|¬A, ¬B) =
You cannot see a cat; what is the probability of a cat, if you get allergic symptoms?
P(A|C) = P(A, C)/P(C) = (P(A, B, C) + P(A, ¬B, C))/P(C)
= (P(A)P(B)P(C|A, B) + P(A)P(¬B)P(C|A, ¬B))/P(C) =
Task: Calculate the probabilities!
Notice! If we knew the total probability distribution P(A, B, C) (for all value combinations), we could solve everything directly.
However, in practice we can usually estimate only the conditional probabilities.
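One way to check your answers to the task above is to evaluate the sums numerically; this uses only the probabilities given in the example:

```python
# Numerically evaluating P(C) and P(A|C) for the cat / hay fever example.

p_a, p_b = 0.5, 0.3
p_c = {  # P(C | A, B) for each value combination
    (True, True): 0.90, (True, False): 0.60,
    (False, True): 0.70, (False, False): 0.10,
}

def pa(a):  # P(A = a)
    return p_a if a else 1 - p_a

def pb(b):  # P(B = b)
    return p_b if b else 1 - p_b

# Total probability of allergic symptoms, P(C)
p_c_total = sum(pa(a) * pb(b) * p_c[(a, b)]
                for a in (True, False) for b in (True, False))

# P(A | C) by Bayes' rule: sum over both values of B
p_ac = sum(pa(True) * pb(b) * p_c[(True, b)] for b in (True, False))
p_a_given_c = p_ac / p_c_total

print(round(p_c_total, 3))    # 0.485
print(round(p_a_given_c, 3))  # 0.711
```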
34
Bayesian networks
• A nice visual representation of probabilistic relations
• Vertices correspond to variables and edges to dependencies between variables (if two variables are independent, there is no edge!)
• Each root node A has prior probabilities P(A) (for all values ai)
• Each non-root node B has a table of conditional probabilities P(B|X), where X is the set of antecedent nodes (for all value combinations of X)
• We can predict the probability of future events simply by chaining probabilities
• The probability of a cause A can be updated by Bayes' rule, when the events X have been observed
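Chaining probabilities can be illustrated with a minimal two-node network A → B; the numbers are invented:

```python
# A tiny Bayesian network A -> B: predict P(B) by summing over
# both values of the root node A (the law of total probability).

p_a = 0.2                              # prior P(A = true) of root node A
p_b_given = {True: 0.9, False: 0.1}    # conditional table P(B | A)

p_b = sum((p_a if a else 1 - p_a) * p_b_given[a] for a in (True, False))
print(round(p_b, 2))  # 0.26
```

In larger networks this summation is done over all parent configurations of each node, which is exactly what general-purpose Bayesian network inference engines automate.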
35
Neural networks
• The idea is to learn a function which maps inputs to outputs (general regression).
• In principle, can express any function, but the function can be hard to learn!
• The system consists of nodes which calculate the weighted sum of their inputs (from antecedent nodes), apply some function to it, and output the result to the successor nodes
Main types:
• Feed-forward networks: the data flows from inputs to outputs
- single-layer (= perceptron): simple, but restricted
- multi-layer: very expressive, but hard to learn and interpret
• Recurrent networks: for modelling time series data (now we also take into account the value of the node in the previous time step)
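A single node as described above, with a step activation, already gives a perceptron; a sketch computing the logical AND of two binary inputs (the weights are chosen by hand, not learnt):

```python
# One neural network node: weighted sum of inputs plus a bias,
# followed by an activation function (here a hard step).

def node(inputs, weights, bias):
    s = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if s > 0 else 0   # step activation

# A perceptron computing logical AND of two binary inputs:
# it outputs 1 only if both inputs are 1 (sum 2.0 - 1.5 > 0).
w, b = [1.0, 1.0], -1.5
print(node([1, 1], w, b))  # 1
print(node([1, 0], w, b))  # 0
```

Training algorithms (perceptron rule, backpropagation) adjust the weights and bias from data instead of setting them by hand.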
36
Instance-based methods
• do not build any explicit model
• Assumption: a data point behaves in a similar way to the neighbouring points
• How to define the neighbours?
k-nearest neighbours
• for numeric data
• A simple approach: if d1, ..., dk are the k nearest neighbours of d according to the Xi variables, then Y is the weighted average of Y1, ..., Yk
• Problem: how to define the neighbours? There are several possible distance functions (and the Euclidean distance is not always the best!)
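A minimal k-nearest-neighbours sketch; for simplicity it uses a plain unweighted average and one-dimensional Euclidean distance, and the data is invented:

```python
# k-nearest neighbours for regression: predict Y at a query point as
# the average Y of the k closest data points (unweighted, 1-D distance).

def knn_predict(data, x_query, k):
    # data: list of (x, y) pairs
    nearest = sorted(data, key=lambda p: abs(p[0] - x_query))[:k]
    return sum(y for _, y in nearest) / k

data = [(1, 10), (2, 12), (3, 14), (10, 50)]
print(knn_predict(data, 2.5, 2))  # 13.0  (neighbours at x=2 and x=3)
```

A distance-weighted variant would weight each neighbour's Y by, e.g., the inverse of its distance, as the slide's "weighted average" suggests.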
Several variations for non-numeric data, e.g.
• Case-based systems: All solutions are stored in a database.
Similar cases are searched for in the database, and their solutions are applied to new situations, or new solutions are generated from the old ones.
• Social or collaborative filtering (usually recommender systems, not necessarily expert systems in the classical sense)
37
Genetic algorithms
• simulates the evolution process
• we keep a set of solutions (a population), which evolves through selection, crossover and mutation operations
• can be slow, but achieves good results for complex problems which are hard to model otherwise
• e.g. learning the set of optimal action rules for a robot
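The selection/crossover/mutation loop can be sketched on a toy problem, maximising the number of 1-bits in a bit string; all parameters here are invented:

```python
# A toy genetic algorithm for the "one-max" problem: evolve a
# population of bit strings towards the all-ones string.
import random

random.seed(0)
LENGTH, POP, GENS = 12, 20, 40

def fitness(bits):
    return sum(bits)               # number of 1-bits

def crossover(a, b):
    cut = random.randrange(1, LENGTH)
    return a[:cut] + b[cut:]       # one-point crossover

def mutate(bits):
    bits = bits[:]
    bits[random.randrange(LENGTH)] ^= 1   # flip one random bit
    return bits

pop = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for _ in range(GENS):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:POP // 2]       # selection: keep the fittest half
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(POP - len(parents))]
    pop = parents + children

best = max(pop, key=fitness)
print(fitness(best))  # at or near the optimum of 12
```

Real applications replace the bit-string fitness with a domain-specific score, e.g. how well a robot's rule set performs in simulation.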
38
Which paradigm to select?
• There is no single super system for all problems!
• The most appropriate system depends on the problem
• Systems come into fashion and go out of fashion
Abstraction level?
Does the system work with
• high-level data (structured, e.g. a relational database)
• low-level data (unstructured sensor data, e.g. sounds, images)
• or both?
Traditionally only high-level data ⇒ the expert can define the model and understands it!
Task: Invent applications of expert systems which use only low-level data!
39
[Figure: data flow in ES construction. Physical sensors (sensor 1, sensor 2, ..., sensor k) produce unstructured data, which pattern recognition turns into structured information; text documents pass through information retrieval and data through knowledge discovery; together with the expert's domain knowledge and other databases, these feed into the model.]
40
Basic datatypes
Usually a system is restricted to certain data types!
[Figure: taxonomy of data types. DATA divides into numeric (quantitative) data, which is either continuous or discrete, and categorical (qualitative) data, which is either nominal or ordinal.]
The data can be classified according to the type of the attribute domain:
• numerical (quantitative) data: has meaning as numbers: we can measure the order, distance and equality of two numerical variables.
- discrete data: can take only countably many values (notice: in computer systems, all data is represented with some precision and is thus discrete) (e.g. age ∈ {1, ..., 120})
- continuous data: can in principle take any value in the given interval (e.g. weight ≥ 0)
• categorical (qualitative) data: has no meaning as numbers; generally we can measure only the equality of two categorical variables (e.g. 1=female, 2=male)
- nominal data: the values are only names for categories and have no meaning beyond categorizing
- ordinal data: the values have a natural ordering, but the differences have no meaning (e.g. 1=poor, 2=fair, 3=good, 4=excellent)
41
Temporal or static data?
In temporal data we can find dynamic patterns (how the system evolves in time)
Amount of data:
If we have too little data (relative to the number of variables), the parameters cannot be estimated accurately!
Corrupted data:
• The data may contain erroneous or missing values
• The measurements can cause variation (noise)
• Outliers: non-typical data points with extreme values
⇒ Solutions:
• enough data
• robust models (=not sensitive to small variations in data)
• use of expert's knowledge
Task: Find out what overfitting means and how it can be prevented!
42
Desirable characteristics
Typically, we wish that the model
• can be learnt from data
• can integrate expert's knowledge
• reasons in a consistent way and detects inconsistencies (e.g.
conflicting rules, circular rule chains, illegal values)
• can be updated in the light of new evidence
• can handle noisy data (missing, spurious or erroneous values)
• meets the time requirements with the given data (works in real time, if needed)
43