Results

IPC-4 featured a large number of domain versions, instances, and planners, so there is a correspondingly large body of results. It is definitely worthwhile to consider them in detail. You may download the plots and posters tar-archive and the entire ASCII results tar-archive.

Results Evaluation and Awards

By far the best way to understand the results of a complex event such as IPC-4 is to examine the result plots in detail, making sense of them in combination with the descriptions/PDDL encodings of the domains and the techniques used in the respective planners. We strongly recommend doing so, at least to some extent, to everybody who is interested in the results of IPC-4.

Since the awarding of prizes is one of the main sources of excitement in the competition event, we had to decide on some form of evaluation as the basis for the award decisions. We evaluated the data in terms of asymptotic runtime and solution quality performance. The comparisons between the planners were made by hand, i.e. by looking at the runtime and plan quality graphs. While this may seem a little simplistic, we believe it is the most adequate way of evaluating the data, given the goals of the field and the demands of the event. What we are interested in is the (typical) scaling behaviour of the planners in the specific domains. This rules out ``more formal'' but primitive performance measures such as counts of solved instances: a planner may scale worse than another planner, yet be faster on many of the smaller instances simply due to, e.g., pre-processing implementation details. The only adequate formal measure of performance would be to approximate the actual scaling functions underlying the planners' data points, but it is completely infeasible to generate enough data, in an event like the IPC, for such formal approximations.
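
To give an impression of what ``looking at the runtime graphs'' means in practice, here is a minimal Python sketch (not the actual IPC-4 evaluation scripts) that plots per-planner runtimes for one domain version on a log scale, so that differences in scaling behaviour, rather than constant-factor overheads, stand out. The input file name and column layout are assumptions made purely for illustration.

    # Hypothetical input: one CSV per domain version with columns
    # planner,instance,seconds (an empty seconds field = instance not solved).
    import csv
    from collections import defaultdict

    import matplotlib.pyplot as plt

    runtimes = defaultdict(list)  # planner -> list of (instance number, runtime)
    with open("runtimes-some-domain-version.csv") as f:
        for row in csv.DictReader(f):
            if row["seconds"]:
                runtimes[row["planner"]].append((int(row["instance"]), float(row["seconds"])))

    for planner, points in sorted(runtimes.items()):
        points.sort()
        plt.plot([n for n, _ in points], [t for _, t in points], marker="o", label=planner)

    plt.yscale("log")  # log-scale runtimes make differences in scaling visible
    plt.xlabel("instance number (roughly increasing size)")
    plt.ylabel("runtime [s]")
    plt.legend()
    plt.show()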

For plan quality performance, we found that in almost none of our test suites were there any significant observations to be made. The suboptimal planners generally produced plans of very similar quality, fairly close to the optimal plans -- as returned by the optimal planners in those (generally smaller) instances solved by them. The only exceptions to that rule were two domain versions of the Pipesworld, where YAHSP returned unnecessarily long plans. We remark that we compared each planner according to only a single plan quality criterion. That is, for every domain version the competitors told us what criterion (number of actions, makespan, or metric value) their planner was trying to optimize in that domain version, and we evaluated together all (and only) those planners that tried to optimize the same criterion. We figure that it does not make sense to judge, e.g., a planner that tries to minimize the number of actions by the metric value of its plans.
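
As a small illustration of this grouping (with invented planner names and plan quality values, not IPC-4 data): plan quality comparisons are made only within groups of planners that declared the same optimization criterion.

    # Hypothetical data: each planner declares one optimization criterion per
    # domain version; quality values are compared only within the same criterion.
    from collections import defaultdict

    declared = {"planner-A": "number of actions",
                "planner-B": "number of actions",
                "planner-C": "makespan"}
    plan_quality = {"planner-A": 42, "planner-B": 45, "planner-C": 17.5}

    groups = defaultdict(dict)
    for planner, criterion in declared.items():
        groups[criterion][planner] = plan_quality[planner]

    for criterion, members in groups.items():
        ranking = sorted(members.items(), key=lambda item: item[1])
        print(criterion + ":", ranking)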

For runtime performance, the observations to be made are much more interesting and diverse. Indeed, we were stunned to see the performance that some of the planners achieved in domains that we had thought were completely infeasible! When running FF and Mips on the same test suites, we found that they were, most of the time, significantly outperformed by the best IPC-4 participants. The same holds true for the IPC-3 version of LPG, which we included in the tests and the results as a point of reference (many thanks to the LPG team for providing us with these results!).

Within each domain version, we identified a group of planners that scaled best, and roughly similarly; these planners were counted as having a 1st place in that domain version. Similarly, we identified groups of 2nd-place planners. Then, for each planner, we simply summed up the 1st and 2nd places achieved. The tables below show the results (planners that never came in 1st or 2nd place are left out). Since many of the planners (6 of the suboptimal planners and 4 of the optimal planners) only dealt with the purely propositional domain versions (i.e., STRIPS or ADL), we counted the performance in these domains separately.
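
The little sketch below illustrates this tallying scheme on made-up classifications (the planner and domain names are placeholders, not the actual IPC-4 judgements); the real counts follow in the tables below.

    # Hypothetical per-domain judgements: domain version -> (group of 1st-place
    # planners, group of 2nd-place planners), as obtained from the hand-made
    # comparison of the runtime graphs.
    from collections import Counter

    places = {
        "domain-version-1": (["planner-A", "planner-B"], ["planner-C"]),
        "domain-version-2": (["planner-A"], ["planner-B", "planner-D"]),
    }

    firsts, seconds = Counter(), Counter()
    for first_group, second_group in places.values():
        firsts.update(first_group)
        seconds.update(second_group)

    for planner in sorted(set(firsts) | set(seconds)):
        print(f"{planner}: {firsts[planner]} / {seconds[planner]}  (1st / 2nd places)")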

Suboptimal 1st and 2nd Places


Planner      Propositional Domains   Temporal/Metric Domains   Total Count
             (1st / 2nd Places)      (1st / 2nd Places)        (1st / 2nd Places)
SGPlan            4 / 6                  13 / 0                   17 / 6
LPG-TD            1 / 6                   9 / 4                   10 / 10
Downward          6 / 1                    --                      6 / 1
Diagonally        7 / 2                    --                      7 / 2
Macro-FF          3 / 0                    --                      3 / 0
YAHSP             4 / 2                    --                      4 / 2
Crikey            0 / 1                    --                      0 / 1

(-- : the planner entered only the propositional domain versions)

Optimal 1st and 2nd Places


Planner      Propositional Domains   Temporal/Metric Domains   Total Count
             (1st / 2nd Places)      (1st / 2nd Places)        (1st / 2nd Places)
CPT               0 / 2                   3 / 2                    3 / 4
TP-4              0 / 0                   1 / 4                    1 / 4
HSP*a             0 / 0                   1 / 2                    1 / 2
SATPLAN           5 / 2                    --                      5 / 2
Optiplan          0 / 4                    --                      0 / 4
Semsyn            0 / 2                    --                      0 / 2
BFHSP             0 / 3                    --                      0 / 3

(-- : the planner entered only the propositional domain versions)

For the suboptimal planners, based on the above observations, we decided to award separate prizes for performance in the pure STRIPS and ADL domains. For the optimal planners this did not seem appropriate due to, first, the small number of planners competing in the temporal/metric domains, and, second, the smaller overall number of competing systems -- giving 4 prizes to 7 systems seemed too much. Overall, the awards made in the classical part of IPC-4 are the following:

We would like to reiterate that the awarding of prizes is, and has to be, a very sketchy ``summary'' of the results of a complex event such as IPC-4. A few bits of information are simply not sufficient to summarize thousands of data points. Many of the decisions we took in awarding the prizes, i.e. in judging the scaling behaviour, were very close. This holds especially true for most of the runtime graphs concerning the optimal planners, and for some of the runtime graphs concerning the suboptimal propositional planners. What we think is best, and what we encourage everybody to do, is to take a closer look at the result plots themselves.

The above aside, our congratulations to the awardees! And many thanks again to all the competitors, to the people who helped us create the domains, and, last but not least, to the organizing committee who helped us set up the language PDDL2.2. All these people contributed to making IPC-4 such an exciting and fruitful event!