Call for Contributions
Summary
Approximate string search and join are important tasks in many domains, such as data cleansing, information extraction, and the comparison of biological sequences. The problem has been studied intensively in various fields of computer science, including database research. However, quantitatively comparing the many published contributions is very hard. Earlier approaches probably do not scale up to recent problem sizes and often do not exploit modern hardware, especially large main memories and high degrees of parallelization, while more recent methods often target only a special sub-problem or perform well only on data with certain properties (alphabet size, average string length, etc.). Furthermore, published methods usually compare against only a few other approaches, and different comparisons rarely use the same dataset or the same configuration. To get a clearer picture of the state of the art in approximate string matching, we organize a competition on scalable string similarity operations, comprising two tracks: search and join.
This competition will bring together researchers and practitioners from database research, algorithms, and bioinformatics to address the open challenges of highly scalable approximate string matching techniques over very large datasets.
Competition
The challenge is to perform string similarity operations as fast as possible. The distance function is unweighted edit distance, for varying error thresholds k. There will be two tracks:
- Track I: String Similarity Search. Given a set of strings S and a query string Q, the task is to find all k-approximate matches of Q in S. A program must output all matches, each with its string identifier and distance.
- Track II: String Similarity Join. Given a set of strings S, the task is to find all pairs of k-similar strings in S. A program must output all matching pairs, each with both string identifiers and the distance.
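Both tracks reduce to the same primitive: deciding whether two strings are within edit distance k and, if so, computing that distance. As a point of reference, here is a minimal Python sketch of this primitive; it is a plain Wagner-Fischer dynamic program with an early exit, meant only to pin down the semantics, not as a competitive implementation.

    def edit_distance_at_most_k(a, b, k):
        """Return the unweighted edit distance between a and b if it is
        at most k, otherwise None. Plain Wagner-Fischer dynamic program
        with an early exit; a correctness oracle, not a fast kernel."""
        if abs(len(a) - len(b)) > k:
            return None  # the length gap alone forces more than k edits
        prev = list(range(len(b) + 1))  # row 0: "" vs. prefixes of b
        for i, ca in enumerate(a, 1):
            curr = [i]  # column 0: a[:i] vs. ""
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # delete ca
                                curr[j - 1] + 1,            # insert cb
                                prev[j - 1] + (ca != cb)))  # substitute/match
            if min(curr) > k:  # row minima never decrease, so stop early
                return None
            prev = curr
        return prev[-1] if prev[-1] <= k else None

For example, edit_distance_at_most_k("kitten", "sitting", 3) returns 3, while the same call with k = 2 returns None.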
Evaluation metric
For the evaluation of submissions, we consider the complete runtime of an algorithm. This includes the time to load data sets from hard disk and to output the results. We will run each submitted program three times and report average results. Submissions must be correct, i.e., the program must find precisely all k-matches of the query (Track I) or precisely all pairs of k-similar strings (Track II). Please follow the instructions on our webpage to format the output file of your program.
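Correctness can be pinned down by a brute-force oracle. The sketch below reuses edit_distance_at_most_k from above; taking identifiers to be 0-based positions in the input is our assumption here, since the official identifier scheme and output format are defined on the webpage. A submission is correct exactly if it reproduces these result sets.

    def search(strings, query, k):
        """Track I oracle: all (identifier, distance) pairs for strings
        within edit distance k of the query, by brute force."""
        return [(sid, d) for sid, s in enumerate(strings)
                if (d := edit_distance_at_most_k(s, query, k)) is not None]

    def self_join(strings, k):
        """Track II oracle: all (id1, id2, distance) triples with
        id1 < id2; quadratic, so only usable on small test sets."""
        return [(i, j, d)
                for i in range(len(strings))
                for j in range(i + 1, len(strings))
                if (d := edit_distance_at_most_k(strings[i], strings[j], k))
                is not None]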
For Track I, the runtime of submissions for each data set will be evaluated on 5000 queries, where a query consists of a query string and a value for k chosen at random between 0 and 6. The experimentation data sets will each contain 5000 example queries. You can assume that these queries are roughly representative of the 5000 queries used for the final evaluation. The evaluation data sets and the experimentation data sets against which the queries are evaluated will be different, yet will exhibit similar properties. For Track II, the runtime for computing the self-similarity join will be measured. Again, the evaluation data sets and the experimentation data sets will be different, yet will exhibit similar properties.
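For testing against the supplied example queries, a minimal driver could look as follows; the tab-separated one-query-per-line layout is purely an assumption for illustration (the actual file format is described on our webpage), and search is the oracle from the previous sketch.

    def run_queries(strings, query_file):
        """Run all example queries against a data set and print matches.
        Assumes one tab-separated "query<TAB>k" pair per line, which is
        a hypothetical layout; see the webpage for the real format."""
        with open(query_file, encoding="utf-8") as f:
            for line in f:
                q, k = line.rstrip("\n").split("\t")
                for sid, d in search(strings, q, int(k)):
                    print(sid, d)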
If a submission uses an index, computation of the index has to be performed beforehand by a separate program. Both programs, the index computation program and the query answering program, will be run sequentially. The runtime and main memory consumption of the index computation program are limited to reasonable values, which will be defined during the test phase. A program may not use more than a maximum amount of main memory, which will also be defined exactly during the test phase. We will measure runtime and maximum main memory consumption and report both. The primary evaluation criterion will be runtime.
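The two-program structure could, for instance, look like the sketch below: a first program builds a q-gram index and serializes it to disk, and the query-answering program loads it at startup. The file names, the q-gram length, and the choice of a q-gram index at all are illustrative assumptions, not requirements of the competition.

    # build_index.py -- hypothetical stage 1: build and serialize an index.
    import pickle
    from collections import defaultdict

    Q = 3  # q-gram length; an arbitrary choice for this sketch

    def build_qgram_index(strings):
        """Map every q-gram to the identifiers of strings containing it."""
        index = defaultdict(list)
        for sid, s in enumerate(strings):
            for p in range(len(s) - Q + 1):
                index[s[p:p + Q]].append(sid)
        return dict(index)

    if __name__ == "__main__":
        with open("dataset.txt", encoding="utf-8") as f:  # file name assumed
            strings = f.read().splitlines()
        with open("index.bin", "wb") as out:
            pickle.dump(build_qgram_index(strings), out)

The query-answering program would then unpickle index.bin once at startup; keep in mind that both stages are run and measured, and that the build stage is subject to the separate limits above.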
Evaluation environment
All submissions will be evaluated on an out-of-the-box Fedora 17 installation with Linux kernel 3.3, including GCC 4.7 and Java 7. An image file for EC2, which will also be used for evaluation, can be found on our webpage. Please use this image to make sure that your software runs properly. For the evaluation, the image will be run on a local machine. The final hardware of that machine will be described during the test phase (multi-core, more than 16 GB RAM). There are no restrictions on the programming language or architecture used. If you want additional packages to be included on the image, please contact us. The final image file of the test environment will be fixed on December 15, 2012.
Data sets for evaluation
In order to cover different alphabets, we use two different data sets for evaluation in both tracks.
- Human genome read data
  - These data sets (one for experimentation, one for evaluation) contain reads from a human genome. The data is characterized by a small alphabet (5 symbols) and quite uniform string lengths (around 100 bases). The evaluation data set will contain on the order of tens of millions of reads.
  - Track I: All reads have to be searched for a k-approximate match with a given query read Q.
  - Track II: All reads in the data set have to be k-approximately joined.
- Geographical names
  - These data sets contain names of cities from all over the world, after phonetic rewriting has been applied. The alphabet is rather large (around 200 symbols) and strings are comparatively short, yet differ a lot in length (the maximum string length is 64; the average is much shorter). The evaluation data set will contain on the order of several million names.
  - Track I: All names have to be searched for a k-approximate match with a given query name Q.
  - Track II: All names in the data set have to be k-approximately joined.
We supply experimentation data sets for testing and optimization of your code (link for download on our webpage). The size of the experimentation data sets will be about 5% of the competition data sets used for the final evaluation. You can assume that the experimentation data sets are roughly representative of the competition data sets in terms of relative frequencies of symbols, distribution of string lengths, etc.
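Since tuning to the stated data properties is expected to matter, a small script along the following lines can verify alphabet size and length distribution on the experimentation data sets; the one-string-per-line file format is an assumption and should be checked against the actual downloads.

    def data_set_stats(path):
        """Print alphabet size and string-length statistics of a data
        set, assuming one string per line (hypothetical format)."""
        lengths, alphabet = [], set()
        with open(path, encoding="utf-8") as f:
            for line in f:
                s = line.rstrip("\n")
                lengths.append(len(s))
                alphabet.update(s)
        print(f"strings: {len(lengths)}, alphabet size: {len(alphabet)}")
        print(f"length: min {min(lengths)}, max {max(lengths)}, "
              f"mean {sum(lengths) / len(lengths):.1f}")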
Participation
Participation is open and only requires registration. Participation in the competition is not required for participation in the workshop. Teams may participate in both tracks or in only one. We will consider up to three submissions per track per team; thus, a team may provide at most six programs (or different configurations of the same program).
Submission guidelines
Your submission should consist of the following components:
- The program(s), including all additional libraries, data files, etc. Please use the provided EC2 virtual machine to ensure compatibility of your program with our environment. You may submit up to three programs per track (typically differing in some form of configuration).
- Instructions on how to install, start, and use your program(s). Installation must essentially be no more complicated than unzipping everything into one directory.
- A one-page abstract of the general ideas and design of your algorithm/implementation, and a summary of any adaptations you performed to fit the task.
- A paper describing your approach to the problems tackled. Papers may be up to 8 pages long and will appear in the EDBT workshop proceedings. Note that papers will not be reviewed scientifically. Authors are responsible for not infringing copyright when citing prior work.
Please upload your final submission to a webpage (your own website, Dropbox, etc.) and send a link to that location to Sebastian Wandelt. Please do not send a link to the whole image file, but only to your program.
Copyright/Licensing issues
We expect that all submitted programs may be distributed freely in the form in which they are submitted. Please include appropriate copyright/license notices wherever necessary. Although the organizers will not pursue any form of distribution, they also take no responsibility should submissions be made public in any form. Note that we do not require the source code of the software to be made available.
Organizers
Ulf Leser, WBI, Humboldt-Universität zu Berlin, Germany
Sebastian Wandelt, WBI, Humboldt-Universität zu Berlin, Germany
Contact
For questions concerning the submission and the competition, please send an email to wandelt(at)informatik.hu-berlin.de or use our mailing list.