A workshop held in conjunction with EDBT/ICDT 2013, March 22, 2013, Genoa, Italy

Frequently asked questions

Where can I get further support?
Answer:
We have created a mailing list for the workshop. To subscribe, send an email to wandelt(at)informatik.hu-berlin.de.

Which programming language can I use?
Answer:
It is completely up to the participants to decide which language/compiler/etc. they use. The only constraint is that the program can be run on the EC2 image. If you want a package/library to be added to the image, please let us know.

Are you really going to measure time? Is this fair if, e.g., one algorithm is implemented in Java and another in C?
Answer:
Different programming languages and libraries do exhibit different run time behaviour. However, it is completely up to the participants to decide which language/compiler/etc. they consider advantageous and prefer to use. The only constraint is that the program can be run on the EC2 image.

I have an index-based method. How can I participate?
Answer:
If a submission uses an index, the index has to be computed beforehand by a separate program. Both programs, the index computation program and the query answering program, will be run sequentially. The run time and main memory consumption of the index computation program are limited to reasonable values, which will be defined during the test phase.

How about multi-threading? What is the test hardware specification?
Answer:
You are allowed to use multiple threads. The number of CPUs and more details of the test machine will be announced later.

Is the alphabet size of the data sets fixed? At what size?
Answer:
Yes. For the first data set (Human genome read data) the alphabet size is 5. For the second data set (Geographical names) the alphabet size is smaller than 255.

My algorithm supports a similarity method other than unweighted edit distance. Can I still participate?
Answer:
You may participate (and also submit a paper describing your solution). However, you will not take part in the official competition, i.e. you cannot win the competition.

Is there an upper bound for k?
Answer:
The value of k is limited to at most 6.

I think that substring search is more important than searching a dictionary or similarity join.
Answer:
This is an initial workshop on string similarity search and join. Substring search may be considered for a follow-up workshop, depending on the lessons learned from this competition.

On what kind of machine will the solutions be evaluated: 64-bit or 32-bit?
Answer:
64-bit.

What is the size of the main memory (RAM) that can be used?
Answer:
We have not yet decided on the test hardware; the final specification will be communicated in November/December. You can expect more than 16 GB of RAM.

Will there be two winners of the competition: one winner of track I and one winner of track II?
Answer:
Yes.

Is the upper bound for k (=6) fixed?
Answer:
The upper bound was originally planned to be fixed at 6. As for higher values of k: we are open to suggestions from the participants. If several teams agree, we can and will increase the bound.

How about case-sensitivity?
Answer:
Strings are compared case-sensitively: the strings 'Hello' and 'hello' have edit distance 1.
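
As an illustration of the statement above, here is a minimal sketch of unweighted edit distance using the standard dynamic programming approach. It is not the reference implementation used for the evaluation; the function name edit_distance is our own choice for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Unweighted edit distance (insert, delete, substitute each cost 1)."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # prev[j] = distance between the empty prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1  # characters are compared exactly
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("Hello", "hello"))  # prints 1
```

Because 'H' and 'h' are different characters, one substitution is needed, which gives distance 1.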

Do k-approximate matches include (k-1)-approximate, (k-2)-approximate, etc. matches?
Answer:
Yes.
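
In other words, a string is a k-approximate match if its edit distance to the query is at most k. The following naive sketch makes this concrete; the names edit_distance and k_approximate_matches and the sample strings are assumptions made for this example, and a competitive solution would use a far more efficient method than a linear scan.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Unweighted edit distance via two-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def k_approximate_matches(query: str, dictionary: List[str], k: int) -> List[str]:
    """Return all dictionary strings with edit distance <= k to the query."""
    matches = []
    for s in dictionary:
        # The length difference is a lower bound on the edit distance,
        # so strings that differ too much in length can be skipped.
        if abs(len(s) - len(query)) > k:
            continue
        if edit_distance(query, s) <= k:  # "<=", so smaller distances count too
            matches.append(s)
    return matches


words = ["hello", "Hello", "help", "world"]
print(k_approximate_matches("hello", words, 1))  # ['hello', 'Hello']
print(k_approximate_matches("hello", words, 2))  # ['hello', 'Hello', 'help']
```

Note that the result for k = 2 contains every string already returned for k = 1, including the exact match with distance 0.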

Is there a maximum length of a string in the geonames dataset?
Answer:
Yes, a string has at most 64 characters.

09/20/2012 WBI, Humboldt-Universität zu Berlin, Berlin, Germany