
5.3 Empirical study

As shown in the case studies, genetic software architecture synthesis is able to produce reasonable architecture proposals, although they obviously still need some human polishing. However, fitness values do not have a straightforward correlation with expert evaluations of “good” architectures. Thus, in order to answer the final research question of how far the synthesis can be taken, an empirical study was conducted. In this experiment, the quality of the generated architectures was studied in relation to the quality of architectures produced by students.

This empirical study is further discussed in publication [VIII].

5.3.1 Setup

First, a group of 38 students from an undergraduate software engineering class was asked to produce an architecture design for ehome. Most of the students were third-year Software Systems majors at Tampere University of Technology who had taken a course on software architectures.

The students were given essentially the same information that is used as input for the GA, i.e., the null architecture, the scenarios, and information about the expected frequencies of operations and their expected sensitivity to variation. In addition, the students were given a brief explanation of the purpose and functionality of the system. They were asked to design the architecture for the system using only the same architecture styles and design patterns that were available to the GA. The students were instructed to consider efficiency, modifiability and complexity in their designs, with an emphasis on modifiability. On average, it took the students 90 minutes to produce a design.


The synthesized solutions, in turn, were produced in 38 runs of the synthesizer, resulting in 38 architecture proposals. Each run took approximately one minute, i.e., the synthesizer produced one solution per minute.

The assistant teacher of the course (impartial to the GA research) graded the student designs as test answers on a scale of 1 to 5, 5 being the highest. The solutions were then categorized according to the points they received, and one solution was randomly selected from each of the categories 1, 3 and 5. These architectures were presented as grading examples to four software engineering experts. The experts were researchers and teachers at the Department of Software Systems, Tampere University of Technology. They all had an M.Sc. or a Ph.D. degree in Software Systems or a closely related discipline and several years of expertise in software architectures, gained through research or teaching.

They were given the same information as the students regarding the requirements for ehome.

In the actual experiment, the experts were given 10 pairs of architectures. One solution in each pair was a student solution, selected randomly from the whole set of student solutions, and the other was a synthesized solution, also selected randomly. The solutions were edited so that the experts could not tell which solutions were synthesized, while all information about the architectural design decisions was preserved. The experts were not told how the solutions had been obtained, i.e., that they were a combination of student and synthesized solutions. They were merely asked to help evaluate how good solutions a synthesizer could produce.
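As a rough illustration of this step, the sketch below pairs randomly chosen student and synthesized solutions and randomizes their order within each pair so that the origin of a solution cannot be inferred from its position. The function name and data representation are assumptions made for illustration only; they are not part of the original experiment tooling.

```python
import random

def build_anonymous_pairs(student_solutions, synthesized_solutions,
                          n_pairs=10, seed=None):
    """Randomly pair student and synthesized solutions, hiding their origin.

    Hypothetical helper: the experiment only states that both members of
    each pair were selected at random and edited so that their origin
    could not be recognized by the evaluators.
    """
    rng = random.Random(seed)
    students = rng.sample(student_solutions, n_pairs)
    synthesized = rng.sample(synthesized_solutions, n_pairs)
    pairs = []
    for m, a in zip(students, synthesized):
        pair = [a, m]
        rng.shuffle(pair)  # randomize presentation order within the pair
        pairs.append(tuple(pair))
    return pairs
```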

The experts were then asked to give each solution 1, 3 or 5 points. The setup is discussed in more detail in publication [VIII].

5.3.2 Results

The scores given by the experts (e1-e4) both to the automatically synthesized architectures (a1-a10) and to the architectures produced manually by the students (m1-m10) are given in Table 1. As the experts viewed the solutions pair-wise, the points in Table 1 are also organized in pairs of synthesized and manually produced solutions. The result of each comparison within a solution pair is one of the following (a small sketch formalizing this classification is given after the list):

• the synthesized solution is considered better (ai > mi, denoted later by +),

• the student solution is considered better (mi > ai, denoted later by -), or

• the solutions are considered equal (ai = mi, denoted later by 0).
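Expressed in code, the classification of a single pair could look like the following sketch; it merely formalizes the notation above and is not part of the original synthesizer or evaluation tooling.

```python
def compare_pair(a_score: int, m_score: int) -> str:
    """Classify one solution pair by an expert's scores (1, 3 or 5).

    Returns '+' if the synthesized solution scored higher than the
    student solution, '-' if it scored lower, and '0' for a tie.
    Hypothetical helper corresponding to the notation above.
    """
    if a_score > m_score:
        return "+"
    if a_score < m_score:
        return "-"
    return "0"
```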


Table 1. Points for synthesized solutions and solutions produced by the students

The best synthesized solutions appear to be a3 and a10, each with two 3’s and two 5’s. In solution a3 the message dispatcher was used and there were relatively few patterns, so the design was easy to understand while still being modifiable. Solution a10 was quite the opposite: the message dispatcher was not used, and there were as many as eight instances of the Strategy pattern, whereas a3 had only two. There were also several Template Method and Adapter pattern instances. This solution was highly modifiable, but also quite complex. This demonstrates how very different solutions can be valued highly under the same evaluation criteria when the criteria conflict: it seems impossible to achieve a solution that is simultaneously optimally efficient, modifiable and understandable.

The worst synthesized solution was considered to be a4, with three 1’s and one 3. This solution used the message dispatcher, but the client-server style was also applied eagerly. There were not very many patterns, and the ones that existed were quite poorly applied. Among the human-made solutions, three (m5, m8 and m10) received similar scores.

Table 2 shows the numbers of preferences of the experts, with “+” indicating that the synthesized proposal was considered better than the student proposal, “-” indicating the opposite, and “0” indicating a tie.

Only one of the four experts (e1) prefers the student solutions slightly more often than the synthesized solutions, while two experts (e2 and e4) prefer the synthesized solutions clearly more often than the student solutions. The remaining expert (e3) prefers both types of solutions equally often. In total, the synthesized solution received the better score in 17 pairs, the student solution was preferred in 9 pairs, and there were 14 ties.

This admittedly crude analysis clearly indicates that, in this simple experiment, the synthesized solutions were ranked at least as high as the student-made solutions. Thus, it can be concluded that the synthesized solutions are at this stage competitive with those produced by third-year software engineering students.


Table 2. Numbers of preferences of the experts

        +    -    0
e1      3    4    3
e2      4    1    5
e3      3    3    4
e4      7    1    2
total  17    9   14
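The counts in Table 2 are simple tallies of the pairwise outcomes. The sketch below shows how such a tally could be computed from per-expert score pairs organized as in Table 1; the example data is hypothetical and does not reproduce the actual scores of the experiment.

```python
from collections import Counter

def tally_preferences(scores):
    """Count '+', '-' and '0' outcomes per expert and in total.

    `scores` maps an expert id to a list of (a_score, m_score) pairs,
    one pair of 1/3/5 scores per evaluated solution pair (cf. Table 1).
    """
    def outcome(a, m):
        return "+" if a > m else "-" if a < m else "0"

    counts = {expert: Counter(outcome(a, m) for a, m in pairs)
              for expert, pairs in scores.items()}
    counts["total"] = sum(counts.values(), Counter())
    return counts

# Hypothetical example with two experts and three pairs each;
# not the actual data of Table 1.
example = {
    "e1": [(5, 3), (1, 3), (3, 3)],
    "e2": [(5, 1), (3, 3), (3, 1)],
}
print(tally_preferences(example))
```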

5.3.3 Threats and limitations

There are several threats and limitations to the presented experiment.

Firstly, as the solutions for evaluation were selected randomly from all 38 solutions, it is theoretically possible that the solutions selected for the experiment do not give a true representation of the entire solution group. However, as all experts were able to find solutions they judged worth 5 points as well as solutions worth only 1 point, and the majority of solutions were given 3 points (i.e., the distribution of points roughly followed a normal distribution), it is unlikely that the evaluated solutions were so biased that the outcome of the experiment would be substantially affected.

Secondly, the pairing of solutions could be questioned. The evaluation could have been more diverse if the experts had been given the solutions in different pairs (e.g., for expert e1 the solution a1 could have been paired with m5 instead of m1). One might also ask whether the outcome would have been different with a different pairing. However, as the overall points are better for the synthesized solutions, a different pairing would most probably not change the outcome significantly. Also, the experts were not actually told to evaluate the solutions as pairs; the pairing was simply done to ease the evaluation and analysis processes.

Thirdly, the actual evaluations made by the experts should be considered. Naturally, having more experts would have strengthened the results. However, the evaluations were quite uniform: there were very few cases where three experts considered the synthesized solution better than or equal to the student solution (or the student solution better than or equal to the synthesized one) while the fourth evaluation was completely contradictory. In fact, there were only three cases where such a contradiction occurred (pairs 2, 3 and 4), and the contradicting expert was always the same (e4).

Thus, the consensus between the experts is sufficiently good, and increasing the number of evaluations would not substantially alter the outcome of the experiment in its current form.

Finally, the task setup was limited in the sense that the architecture design was restricted to a given selection of patterns. Giving such a selection to the students may both improve the designs (as the students know that these patterns are potentially applicable) and worsen them (due to overuse of the patterns). Unfortunately, this limitation stems from the current stage of the genetic synthesizer and could not be avoided.