Next: Graphical analysis of SOTA
Up: Other views of the
Previous: Other views of the
Geoff Sutcliffe and Christian Suttner have been running the CASC
competition for many years, and provide in [SS01]
some clues about what is a fair way to evaluate automated theorem
provers. One of the key ideas of their work is the notion of the State Of The Art (SOTA) solver. It is based on a subsumption
relationship between the sets of instances solved by the competing
solvers: ``a solver A is better than a solver B iff solver A solves a
strict superset of the instances solved by solver B''. The underlying
idea is that any solver that is the only one to solve some instance is
meaningful. The subsumption relationship provides a partial order
between the competing solvers, and a virtual solver representing the
advances of the whole community, the SOTA solver. This solver
can solve every problem solved by any of the competing solvers (so it
would be the unique maximal element for the subsumption
relationship).
There is little chance that the SOTA solver is a real solver,
so the notion of SOTA contributors is introduced by the
authors. They correspond to the maximal elements of the subsumption
relationship. The SOTA solver is equivalent to the set of SOTA
contributors running in parallel.
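The maximal-element computation described above can be sketched in a few lines of Python. The solver names and instance sets below are made up for illustration; they are not competition data.

```python
def sota_contributors(solved):
    """Maximal elements of the subsumption relation:
    solver A subsumes solver B iff solved[A] is a strict
    superset of solved[B]."""
    return [a for a in solved
            if not any(solved[b] > solved[a] for b in solved if b != a)]

# Hypothetical data: the set of instances solved by each solver.
solved = {
    "s1": {"i1", "i2", "i3"},  # subsumes s2
    "s2": {"i1", "i2"},        # strict subset of s1's instances
    "s3": {"i4"},              # uniquely solves i4, so it is a contributor
}

contributors = sota_contributors(solved)
print(contributors)  # ['s1', 's3']

# The virtual SOTA solver solves the union of the contributors' instances.
sota = set().union(*(solved[c] for c in contributors))
print(sorted(sota))  # ['i1', 'i2', 'i3', 'i4']
```

Note that the subsumed solver s2 drops out: running s1 and s3 in parallel solves everything any of the three solvers can solve.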
Tables 10, 11 and
12 show the problems solved by only one solver during
the first stage of the competition, for the three categories of
benchmarks (hors-concours solvers, which run outside the ranking, are
discarded). We can now identify the SOTA contributors from each table:
- Table 10 (industrial): the SOTA contributors for industrial
instances are 2clseq, berkmin, limmat, modoc and zchaff. Note that
2clseq worked very well on the rand_net benchmarks, and zchaff on the
pipe instances.
- Table 11 (hand-made): the SOTA contributors for hand-made
benchmarks are berkmin, limmat, rb2cl and zchaff.
- Table 12 (random): the SOTA contributors for randomly generated
benchmarks are 2clseq, dlmsat1, dlmsat3, OKsolver, unitwalk, usat05
and usat10.
Table 10: Uniquely solved industrial benchmarks during the first stage

Solver  | Benchmark (short name) | CPU (s)
2clseq  | bmc2/cnt09.cnf         | 198.00
2clseq  | rand_net/50-60-10      | 96.33
2clseq  | rand_net/60-40-10      | 16.03
2clseq  | rand_net/60-60-10      | 180.45
2clseq  | rand_net/60-60-5       | 172.22
2clseq  | rand_net/70-40-10      | 35.02
2clseq  | rand_net/70-40-5       | 52.15
2clseq  | rand_net/70-60-10      | 188.11
2clseq  | rand_net/70-60-5       | 611.71
2clseq  | satex-c/c6288-s        | 0.73
2clseq  | satex-c/c6288          | 0.77
berkmin | dinphil/dp10u09        | 321.67
berkmin | fpga-r/rcs_w9          | 2 054.50
berkmin | satex-c/cnf-r4-b2-k1.1 | 1 402.24
limmat  | Homer/homer16          | 2 315.05
limmat  | dinphil/dp12s12        | 5.21
modoc   | Homer/homer15          | 843.87
zchaff  | fifo8/fifo8_200        | 1 417.68
zchaff  | w10/w10_70.cnf         | 661.74
zchaff  | satex-/9vliw_bp_mc     | 1 274.94
zchaff  | fvp-u.2.0/5pipe        | 186.58
zchaff  | fvp-u.2.0/5pipe_3_oo   | 533.11
zchaff  | fvp-u.2.0/5pipe_4_ooo  | 1 667.30
zchaff  | fvp-u.2.0/5pipe_5_ooo  | 814.92
zchaff  | fvp-u.2.0/7pipe_bug    | 452.14
Table 11: Uniquely solved hand-made benchmarks during the first stage

Solver  | Benchmark                  | CPU (s)
berkmin | hanoi6_on                  | 295.04
berkmin | rope_1000                  | 2 058.62
berkmin | x1.1_32                    | 57.13
berkmin | x1_32                      | 82.18
limmat  | pyhala-braun-sat-40-4-01   | 1 295.77
rb2cl   | lisa21_1_a                 | 1 955.62
zchaff  | lisa21_2_a                 | 287.74
zchaff  | pyhala-braun-sat-40-4-02   | 971.59
zchaff  | pyhala-braun-unsat-35-4-01 | 1 474.30
zchaff  | pyhala-braun-unsat-35-4-02 | 1 653.80
zchaff  | Urquhart-s3-b5             | 1 507.11
zchaff  | x1.1_36                    | 786.60
zchaff  | x2_36                      | 1 828.81
Table 12: Uniquely solved randomly generated benchmarks during the first stage

Solver   | Benchmark                   | CPU (s)
2clseq   | 5col100_15_1                | 1 148.08
2clseq   | 5col100_15_4                | 1 097.78
2clseq   | 5col100_15_5                | 1 018.13
2clseq   | 5col100_15_6                | 1 353.14
2clseq   | 5col100_15_7                | 748.21
2clseq   | 5col100_15_8                | 1 160.66
2clseq   | 5col100_15_9                | 821.42
dlmsat1  | hgen5-v300-s1895562135      | 391.94
dlmsat1  | hgen5-v300-s528975468       | 512.81
dlmsat1  | hgen2-v300-s1807441418      | 651.35
dlmsat1  | hgen3-v450-s646636548       | 141.16
dlmsat1  | hgen4-v200-s2074278220      | 16.79
dlmsat1  | hgen4-v200-s812807056       | 1 158.20
dlmsat3  | hgen3-v450-s356974048       | 821.42
dlmsat3  | 4col280_9_2                 | 1 766.60
dlmsat3  | 4col280_9_4                 | 6.60
dlmsat3  | 4col280_9_6                 | 1 792.59
dlmsat3  | 4col280_9_7                 | 108.43
OKsolver | glassy-v399-s1993893641     | 1 162.07
OKsolver | glassy-v399-s524425695      | 814.44
OKsolver | glassybp-v399-s1499943388   | 273.31
OKsolver | glassybp-v399-s944837607    | 615.27
OKsolver | glassyb-v399-s1267848873    | 1 001.71
OKsolver | 5cnf_3900_3900_160          | 243.24
OKsolver | 5cnf_3900_3900_170          | 911.32
OKsolver | 5cnf_3900_3900_180          | 1 790.46
OKsolver | 5cnf_3900_3900_190          | 1 778.96
OKsolver | 5cnf_4300_4300_090          | 1 390.55
OKsolver | 5cnf_4300_4300_100          | 1 311.92
unitwalk | hgen2-v650-s2139597266      | 536.09
unitwalk | hgen2-v700-s543738649       | 32.38
unitwalk | hgen2-v700-s548148704       | 0.62
unitwalk | hgen3-v500-s1754870155      | 157.03
usat05   | okgen-c2550-v600-s552691850 | 5.57
usat10   | hgen3-v450-s1400022686      | 0.39
In order to obtain a total order among the solvers, one must first
classify the benchmarks themselves. For this, the notion of SOTA
contributors can be used [SS01]: benchmarks solved by
all SOTA contributors are said to be easy, while those solved by at
least one SOTA contributor (but not all) are called difficult. Note
that benchmarks not solved by any solver are not considered here.
Now, it is easy to rank the solvers according to the number of
difficult instances they can solve. Tables
13 and 14 provide a
summary for the SAT2002 competition, for SAT+UNSAT benchmarks and
satisfiable benchmarks respectively.
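The classification and the resulting ranking can be sketched as follows. The solver and benchmark names are again hypothetical, not competition results; the subsumption test is the strict-superset relation from the definition above.

```python
def sota_contributors(solved):
    """Solvers not strictly subsumed by any other solver."""
    return [a for a in solved
            if not any(solved[b] > solved[a] for b in solved if b != a)]

def classify(solved):
    """Split the solved benchmarks into easy (solved by every SOTA
    contributor) and difficult (solved by some, but not all, of them)."""
    contributors = sota_contributors(solved)
    easy = set.intersection(*(solved[c] for c in contributors))
    solvable = set().union(*(solved[c] for c in contributors))
    return easy, solvable - easy

# Hypothetical data.
solved = {
    "A": {"p1", "p2", "p3"},
    "B": {"p1", "p4"},
    "C": {"p1", "p2"},  # subsumed by A, so not a contributor
}

easy, difficult = classify(solved)
print(sorted(easy))       # ['p1']
print(sorted(difficult))  # ['p2', 'p3', 'p4']

# Total order: rank solvers by the number of difficult instances they solve.
ranking = sorted(solved, key=lambda s: len(solved[s] & difficult), reverse=True)
print(ranking[0])  # 'A' solves two difficult instances, more than B or C
```

Since every solver is either maximal or subsumed by a maximal one, the contributors' union covers all solved benchmarks, so every solved benchmark ends up either easy or difficult.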
Table 13: Number of difficult SAT+UNSAT benchmarks solved by each solver during the first stage

Industrial          | Hand-made           | Randomly generated
solver     #solved  | solver     #solved  | solver     #solved
berkmin    121      | berkmin    68       | OKsolver   548
zchaff     104      | zchaff     68       | marchII    543
2clseq     93       | limmat     47       | marchIIse  543
limmat     87       | simo       43       | 2clseq     369
simo       57       | 2clseq     34       | rb2cl      338
rb2cl      36       | OKsolver   33       | zchaff     305
OKsolver   35       | jquest     16       | simo       298
jquest     31       | marchIIse  13       | berkmin    263
marchIIse  29       | rb2cl      10       | limmat     221
modoc      26       | marchII    9        | modoc      219
blindsat   20       | modoc      7        | jquest     122
marchII    16       | blindsat   0        | blindsat   3
Table 14: Number of difficult satisfiable benchmarks solved by each solver during the first stage

Industrial          | Hand-made           | Randomly generated
solver     #solved  | solver     #solved  | solver     #solved
berkmin    68       | zchaff     35       | OKsolver   548
zchaff     64       | berkmin    33       | marchII    543
limmat     54       | limmat     29       | marchIIse  543
2clseq     46       | simo       27       | 2clseq     369
unitwalk   41       | OKsolver   23       | rb2cl      338
simo       39       | 2clseq     21       | zchaff     305
dlmsat2    37       | unitwalk   11       | simo       298
dlmsat1    36       | dlmsat1    11       | berkmin    263
rb2cl      35       | marchIIse  10       | dlmsat1    255
dlmsat3    35       | dlmsat3    10       | dlmsat2    234
saturn     34       | saturn     9        | dlmsat3    233
usat10     33       | usat05     9        | unitwalk   227
usat05     31       | usat10     9        | limmat     221
OKsolver   30       | dlmsat2    9        | modoc      219
marchIIse  26       | marchII    7        | saturn     208
jquest     24       | rb2cl      6        | usat10     206
blindsat   20       | modoc      5        | usat05     205
modoc      20       | jquest     4        | jquest     122
ga         18       | blindsat   0        | blindsat   3
marchII    13       | ga         0        | ga         3
What can we conclude? First of all, we awarded the SOTA
contributors. Looking at the SOTA ranking per category, berkmin,
zchaff and OKsolver would be awarded. Limmat, our fourth awardee, is
most of the time third behind zchaff and berkmin in the industrial and
hand-made categories. So the result of the SAT2002 competition looks
reasonable. Berkmin certainly deserves special attention from the
community: despite the bug that made it crash on more than 100
benchmarks, it is ranked first in the industrial category under the
SOTA system. The small impact of the crashes is probably due to
the fact that only the second SAT engine of Berkmin crashed; since
that engine is called for the hardest instances only, most instances
on which Berkmin crashed are hard for all solvers anyway.
Furthermore, we now have difficult and unsolved instances for the
next competition. The degree of difficulty provided by the SOTA
system can be used to tune solvers for the next competition: first
try to solve instances of medium difficulty, then try the really hard
ones. All that information will be included in the instances archive.
LE BERRE Daniel
2002-09-16