Because every SOTA contributor is important for SAT research,
we are considering awarding a SOTA contributor certificate at
the next competition. However, to adopt the SOTA system for awarding
solvers, we need a better classification of the available SAT
benchmarks, similar to the one used in TPTP. Let us emphasize some
differences between the CASC competition (where SOTA ranking is the
rule) and the SAT2002 competition. These differences essentially stem
from the different levels of maturity of the two competitions:
- In the CASC competition, most instances are part of the TPTP library,
so they are well known and classified. This is not the case for the
SAT competition, since many of the benchmarks we received for the
competition were completely new.
- The ranking proposed above is based on Specialist Problem
Classes, whose granularity is finer than our simple partition into
Industrial, Handmade, and Randomly generated benchmarks (for instance,
the pigeon-hole problem may also occur as an industrial problem; how
should such a benchmark be classified?).
- We used a heuristic to save CPU time, so not all solvers were
run on all instances.
Many other improvements are possible for the next competition. Still,
human intervention seems unavoidable. To make the competition fairer
and less time-consuming for the organizers, we propose appointing a
board of judges, as is done in CASC.
When the competition rules do not give an explicit answer, it is up
to the judges to decide what to do. They could also play a role, for
instance, in the partitioning of the benchmarks.
For future competitions, we propose the following:
- Allow more time for submitters to experiment with their solvers
(prior to submission) on one of the actual competition computers.
- Make clear which version of a solver is used. Sometimes quite
different algorithms or techniques are used under the same name
(e.g. Berkmin 56, a single-engine solver whose results were
published in [GN02], versus Berkmin 62, the two-engine solver
submitted to the competition). This would prevent suspicion when an
instance reported as solved by a particular solver in a previously
published paper is not solved during the competition.
- Make the competition more automatic, both technically (better
scripts) and semantically (stricter rules with no exceptions). This
would allow the results to be put online as soon as they are
available, for instance so that solver submitters can check the
behavior of their solvers during the first stage.
- After the benchmarks are selected, run every solver on every
benchmark, even if this seems a waste of CPU time (however, solver
submitters could be allowed to enter their solvers in only some
categories of benchmarks).
- Run the winners of the previous competition (or its SOTA
contributors) in any case. If a previous winner wins a category, no
award is given for that category. The same principle applies to the
second-stage benchmarks.
- Limit the number of submissions per author group: if one
technique performs well in a category, it is likely that all its
variants perform equally well (see e.g. dlmsat1/2/3 in the
first-stage random SAT category), which biases the second stage. One
solution is to limit the number of variants and to qualify only
the best one for the final stage.
- Provide, well in advance, the scripts/programs used to check the
input/output/instance formats, so that this step is delegated to the
submitters (a fully automatic machinery can then reject submissions
that do not respect the formats); a sketch of such a checker is
given after this list.
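To make the format-checking proposal concrete, here is a minimal
sketch, in Python, of a checker for the DIMACS CNF input format; the
function name, the error messages and the command-line interface are
our own illustrative choices, not those of the actual competition
tools.

    import sys

    def check_dimacs_cnf(path):
        """Validate the DIMACS CNF format: 'c' comment lines, one
        'p cnf <vars> <clauses>' problem line, then clauses given as
        non-zero integer literals, each clause terminated by 0."""
        errors = []
        nvars = nclauses = None
        clauses_seen = 0
        in_clause = False
        with open(path) as f:
            for lineno, line in enumerate(f, start=1):
                tokens = line.split()
                if not tokens or tokens[0] == 'c':
                    continue                      # blank or comment line
                if tokens[0] == 'p':
                    if nvars is not None:
                        errors.append(f"line {lineno}: duplicate problem line")
                    elif (len(tokens) != 4 or tokens[1] != 'cnf'
                          or not tokens[2].isdigit() or not tokens[3].isdigit()):
                        errors.append(f"line {lineno}: malformed problem line")
                    else:
                        nvars, nclauses = int(tokens[2]), int(tokens[3])
                    continue
                if nvars is None:
                    errors.append(f"line {lineno}: clause before the problem line")
                    break
                for tok in tokens:
                    try:
                        lit = int(tok)
                    except ValueError:
                        errors.append(f"line {lineno}: non-integer token {tok!r}")
                        continue
                    if lit == 0:
                        clauses_seen += 1         # 0 ends the current clause
                        in_clause = False
                    elif abs(lit) > nvars:
                        errors.append(f"line {lineno}: variable {abs(lit)} out of range")
                    else:
                        in_clause = True
        if in_clause:
            errors.append("last clause is not terminated by 0")
        if nvars is None:
            errors.append("no problem line found")
        elif clauses_seen != nclauses:
            errors.append(f"expected {nclauses} clauses, found {clauses_seen}")
        return errors

    if __name__ == "__main__":
        problems = check_dimacs_cnf(sys.argv[1])
        for msg in problems:
            print(msg)
        sys.exit(1 if problems else 0)            # non-zero exit: reject

A submission pipeline could simply run such a checker on every
uploaded instance and reject the submission on a non-zero exit code.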
Some other points, more or less prospective, could also be considered:
- Classify new benchmarks according to the performance of the
previous year's SOTA contributors. For instance, "easy" benchmarks
(according to SOTA) could be discarded, freeing CPU time.
- Discuss the notion of series and how to score solvers on
series. For instance, if many benchmarks in a series are easy,
then most solvers will solve the series. This could be resolved
by giving solvers points according to their performance on each
particular series (rather than 1 vs. 0, as in the SAT 2002
competition). For example, the series winner could get 5 points, the
next solver 4, and so on (a scoring sketch is given after this list).
- More prospectively, one could use a single timeout per series
instead of a timeout per instance (perhaps as a new category). A
solver able to solve the first instances of a series quickly would
then have more time for the remaining instances. Beware, however: the
order in which the instances are given to the solver matters, and the
best order is not easy to determine (see the second sketch after this
list).
- Finally, we could use a larger cluster of machines, such as those
used for VLSI/CAD applications and simulations, where very large
empirical evaluations are often performed. For instance, the new
version of bookshelf.exe [CM02] should allow hundreds of identical
computers to be used through a simple web interface (automatic
reports, ...). However, modifications are needed to merge the SAT-Ex
architecture into this cluster interface.
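To illustrate the graded scoring proposed above, here is a minimal
sketch in Python; the layout of the results data, the function name
and the 5-4-3-2-1 point scale are illustrative assumptions, not part
of any adopted rule.

    def score_series(results, points=(5, 4, 3, 2, 1)):
        """Award graded points per series instead of the 1-vs-0 rule used
        in SAT 2002.  `results[series][solver]` is the number of instances
        of the series the solver solved.  Ties are broken arbitrarily
        here; a real rule would have to handle them explicitly."""
        totals = {}
        for solved in results.values():
            ranking = sorted(solved, key=solved.get, reverse=True)
            for rank, solver in enumerate(ranking):
                pts = points[rank] if rank < len(points) else 0
                totals[solver] = totals.get(solver, 0) + pts
        return totals

    results = {"pigeon-hole": {"A": 8, "B": 8, "C": 3},
               "random-3sat": {"A": 40, "B": 55, "C": 55}}
    print(score_series(results))                  # {'A': 8, 'B': 9, 'C': 7}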
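Similarly, the shared per-series timeout could look like the
following sketch; run_solver is a hypothetical stand-in for actually
launching a solver, and the budget value in the usage line is
arbitrary.

    import random

    def run_solver(solver, instance, timeout):
        """Hypothetical stand-in for actually launching `solver` on
        `instance`: returns the CPU seconds used, or None on timeout."""
        used = random.uniform(1, 300)             # fake running time
        return used if used <= timeout else None

    def run_on_series(solver, instances, budget):
        """Run `solver` on `instances` in the given order under one shared
        time budget: time saved on easy instances is carried over to the
        remaining ones.  The result depends on the instance order, which
        is exactly the difficulty pointed out above."""
        solved, remaining = [], budget
        for instance in instances:
            if remaining <= 0:
                break
            used = run_solver(solver, instance, timeout=remaining)
            if used is None:                      # the shared budget is gone
                break
            solved.append(instance)
            remaining -= used
        return solved, remaining

    solved, left = run_on_series("solverA", ["s1.cnf", "s2.cnf", "s3.cnf"], 900)
    print(f"solved {len(solved)} instance(s), {left:.0f}s of budget left")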
The final point is to run the next competition before the SAT
Symposium itself, only publishing the results and giving the awards
at the meeting.