A workshop held in conjunction with EDBT/ICDT 2013, March 22, 2013, Genoa, Italy

Call for Contributions

  • Summary

    Approximate string search/join is an important task in many domains such as data cleansing, information extraction, and the comparison of biological sequences. The problem has been studied intensively in various fields of computer science, including database research. However, quantitatively comparing the multitude of contributions is very hard. Earlier approaches probably do not scale up to current problem sizes and often do not exploit modern hardware, especially large main memories and high degrees of parallelism, while more recent methods often target only a specific sub-problem or perform well only on data with certain properties (alphabet size, average string length, etc.). Furthermore, published methods are usually compared against only a few other approaches, and different comparisons rarely use the same data set or the same configuration. To get a clearer picture of the state of the art in approximate string matching, we organize a competition on scalable string similarity operations, comprising two tracks: search and join.

    This competition will bring together researchers and practitioners from database research, algorithms, and bioinformatics to address the open challenges of highly scalable approximate string matching techniques over very large datasets.

  • Competition

    The challenge is to perform string similarity operations as fast as possible. The distance function is unweighted edit distance, evaluated for different error thresholds k. There will be two tracks:

    1. Track I: String Similarity Search. Given a set of strings S and a query string Q, the task is to find all k-approximate matches of Q in S. A program must output every match together with its string identifier and its edit distance to Q.
    2. Track II: String Similarity Join. Given a set of strings S, the task is to find all pairs of k-similar strings in S. A program must output every matching pair together with both string identifiers and their edit distance (a minimal reference sketch for both tracks follows below).
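    As a minimal, non-competitive reference sketch in C++ (identifiers such as editDistanceCapped are ours, and the printed format is illustrative only, not the official output format), both tracks reduce to a thresholded edit-distance computation plus a loop over candidates:

        // Reference sketch only: thresholded (capped) edit distance plus
        // brute-force search and join loops, useful to cross-check the output
        // of a faster implementation.
        #include <cstdio>
        #include <cstdlib>
        #include <string>
        #include <vector>

        // Unweighted edit distance between a and b, capped at k+1 so that a
        // row whose minimum already exceeds k allows an early exit.
        int editDistanceCapped(const std::string& a, const std::string& b, int k) {
            if (std::abs((int)a.size() - (int)b.size()) > k) return k + 1;
            std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
            for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = (int)j;
            for (std::size_t i = 1; i <= a.size(); ++i) {
                cur[0] = (int)i;
                int rowMin = cur[0];
                for (std::size_t j = 1; j <= b.size(); ++j) {
                    int del = prev[j] + 1;
                    int ins = cur[j - 1] + 1;
                    int sub = prev[j - 1] + (a[i - 1] != b[j - 1] ? 1 : 0);
                    cur[j] = del < ins ? del : ins;
                    if (sub < cur[j]) cur[j] = sub;
                    if (cur[j] < rowMin) rowMin = cur[j];
                }
                // Row minima never decrease, so the final distance must exceed k.
                if (rowMin > k) return k + 1;
                prev.swap(cur);
            }
            return prev[b.size()];
        }

        // Track I: report all k-approximate matches of strings in S with query Q.
        void searchTrack(const std::vector<std::string>& S, const std::string& Q, int k) {
            for (std::size_t id = 0; id < S.size(); ++id) {
                int d = editDistanceCapped(S[id], Q, k);
                if (d <= k) std::printf("%zu %d\n", id, d);
            }
        }

        // Track II: report all pairs of k-similar strings from S (self join).
        void joinTrack(const std::vector<std::string>& S, int k) {
            for (std::size_t i = 0; i < S.size(); ++i)
                for (std::size_t j = i + 1; j < S.size(); ++j) {
                    int d = editDistanceCapped(S[i], S[j], k);
                    if (d <= k) std::printf("%zu %zu %d\n", i, j, d);
                }
        }

    The quadratic join loop and the linear scan per query are of course far too slow for the competition data sizes; they only pin down what a correct answer set looks like.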
  • Evaluation metric

    For the evaluation of submissions, we consider the complete runtime of the algorithms. This includes the time to load the data sets from hard disk and to output the results. We will run each submitted program three times and report average results. Submissions must be correct, i.e., the program must find precisely all k-matches with the query (Track I) or precisely all pairs of k-similar strings (Track II). Please follow the instructions on our webpage to format the output file of your program.

    For Track I, the runtime of submissions on each data set will be evaluated based on 5000 queries, where a query consists of a query string and a value for k chosen at random between 0 and 6. The experimentation data sets will each contain 5000 example queries. You can assume that these queries are roughly representative of the 5000 queries used for the final evaluation. The evaluation data sets and the experimentation data sets against which the queries have to be evaluated will be different, yet will exhibit similar properties. For Track II, the runtime for computing the self-similarity join will be measured; again, the evaluation and experimentation data sets will be different, yet will exhibit similar properties.

    If a submission uses indexes, computation of the index has to be performed beforehand by a separate program. Both programs, the index computation program and the query answering program, will be run sequentially. The run time and main memory consumption of the index computation program are limited to reasonable values, which will be defined during the test phase.
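    A hedged sketch of this two-program layout could look as follows (the file name strings.idx, the one-string-per-line input format, and the length-based ordering are our assumptions, not competition requirements): the indexer persists the data in a form the query program can exploit later.

        // Sketch of a separate index-construction program: it sorts the strings
        // by length and writes them, together with their identifiers, to an
        // index file. Since strings within edit distance k differ in length by
        // at most k, the query program can later restrict verification to a
        // narrow length range instead of rebuilding anything at query time.
        #include <algorithm>
        #include <cstdlib>
        #include <fstream>
        #include <string>
        #include <utility>
        #include <vector>

        static bool byLength(const std::pair<std::string, std::size_t>& a,
                             const std::pair<std::string, std::size_t>& b) {
            return a.first.size() < b.first.size();
        }

        int main(int argc, char** argv) {
            if (argc < 2) return EXIT_FAILURE;
            std::ifstream in(argv[1]);              // one string per line (assumed)
            std::vector<std::pair<std::string, std::size_t> > recs;
            std::string s;
            for (std::size_t id = 0; std::getline(in, s); ++id)
                recs.push_back(std::make_pair(s, id));
            std::sort(recs.begin(), recs.end(), byLength);
            std::ofstream out("strings.idx");       // assumed index file name
            for (std::size_t i = 0; i < recs.size(); ++i)
                out << recs[i].second << '\t' << recs[i].first << '\n';
            return 0;
        }

    The query-answering program would then load strings.idx and verify only strings whose length lies within k of the query length; the sorting above counts toward the (separately limited) index-construction time, the scan at query time toward the measured runtime.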

    A program may not use more main memory than a maximum limit, which will also be fixed precisely during the test phase. We will measure run time and maximum main memory consumption and report both. The primary evaluation criterion will be run time.
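    As a rough local self-check of this memory cap (this is not the official measurement procedure), a program running on Linux can inspect its own peak resident set size via getrusage; the ru_maxrss field is reported in kilobytes:

        // Self-check sketch: print the peak resident set size of the process so
        // you can verify locally that it stays below the announced memory limit.
        #include <cstdio>
        #include <sys/resource.h>

        long peakRssKiB() {
            struct rusage ru;
            getrusage(RUSAGE_SELF, &ru);
            return ru.ru_maxrss;        // on Linux: peak resident set size in kilobytes
        }

        int main() {
            // ... run the similarity search or join here ...
            std::fprintf(stderr, "peak RSS: %ld KiB\n", peakRssKiB());
            return 0;
        }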

  • Evaluation environment

    All submissions will be evaluated on an out-of-the-box Fedora 17 installation with Linux kernel 3.3, including GCC 4.7 and Java 7. An image file for EC2, which will also be used for the evaluation, can be found on our webpage. Please use this image to make sure that your software runs properly. For the evaluation, the image will be run on a local machine. The final hardware of that machine will be described during the test phase (multi-core, more than 16 GB RAM). There are no restrictions on the programming language or architecture used. If you want additional packages to be included in the image, please contact us. The final image file of the test environment will be fixed on December 15, 2012.

  • Data sets for evaluation

    In order to cover different alphabets, we use two different data sets for evaluation in both tracks.

    1. Human genome read data
      • These data sets (one for experimentation, one for evaluation) contain reads from a human genome. The data is characterized by a small alphabet (5 symbols) and fairly uniform string lengths (around 100 bases). The evaluation data set will contain on the order of tens of millions of reads.
      • Track I: All reads have to be searched for a k-approximate match with a given query read Q.
      • Track II: All reads in the data set have to be k-approximately joined.
    2. Geographical names
      • These data sets contain names of cities from all over the world, after applying a phonetic rewriting. The alphabet is rather large (around 200 symbols) and strings are comparatively short, yet differ a lot in length (the maximum string length is 64; the average length is much shorter). The evaluation data set will contain on the order of several million names. These properties mainly affect how well standard filters work; see the sketch after this list.
      • Track I: All names have to be searched for a k-approximate match with a given query name Q.
      • Track II: All names in the data set have to be k-approximately joined.
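    The alphabet sizes and length distributions above mainly determine how well standard filters prune candidate pairs. As a hedged illustration (the competition does not prescribe any particular technique), the classic q-gram count filter discards a pair whenever the two strings share too few q-grams to be within edit distance k:

        // Sketch of the q-gram count filter: if ed(s, t) <= k, then s and t
        // share at least max(|s|, |t|) - q + 1 - k*q q-grams, so pairs below
        // this overlap can be skipped before the edit-distance verification.
        #include <algorithm>
        #include <map>
        #include <string>

        static std::map<std::string, int> qgramCounts(const std::string& s, int q) {
            std::map<std::string, int> counts;
            for (int i = 0; i + q <= (int)s.size(); ++i)
                ++counts[s.substr(i, q)];
            return counts;
        }

        bool passesCountFilter(const std::string& s, const std::string& t, int k, int q) {
            int threshold = (int)std::max(s.size(), t.size()) - q + 1 - k * q;
            if (threshold <= 0) return true;        // bound is vacuous: cannot prune
            std::map<std::string, int> cs = qgramCounts(s, q);
            std::map<std::string, int> ct = qgramCounts(t, q);
            int common = 0;
            for (std::map<std::string, int>::const_iterator it = cs.begin(); it != cs.end(); ++it) {
                std::map<std::string, int>::const_iterator jt = ct.find(it->first);
                if (jt != ct.end()) common += std::min(it->second, jt->second);
            }
            return common >= threshold;             // false: safe to skip the pair
        }

    For the long genome reads this bound is usually informative, whereas for the short geographical names (and the larger k values) it often drops to zero, so the filter prunes nothing; exposing exactly such trade-offs is one reason for evaluating on both data sets.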

    We supply experimentation data sets for testing and optimizing your code (download links are on our webpage). The experimentation data sets will be about 5% of the size of the competition data sets used for the final evaluation. You can assume that the experimentation data sets are roughly representative of the competition data sets in terms of relative symbol frequencies, string length distribution, etc.

  • Participation

    Participation is open and requires only registration. Participation in the competition is not required for participating in the workshop. Teams may participate in both tracks or in only one. We will consider up to three submissions per track per team; thus, a team may provide at most six programs (or different configurations of the same program).

  • Submission guidelines

    Your submission should consist of the following components:

    1. The program(s) (including all additional libraries, data files etc.). Please use the provided EC2 virtual machine to ensure compatibility of your program with our environment. You may submit up to three programs per track (typically differing in some form of configuration).
    2. Instructions on how to install, start, and use your program(s). Installation must essentially be no more complicated than unzipping everything into one directory.
    3. A one-page abstract describing the general ideas and design of your algorithm/implementation and summarizing any adaptations you made to fit the task.
    4. A paper describing your approach to the problems tackled. Papers may be up to 8 pages long and will appear in the EDBT workshop proceedings. Note that papers will not undergo scientific review.

    Authors are responsible for not infringing copyrights when citing prior work. Please upload your final submission to a webpage (your own website, Dropbox, etc.) and send a link to that location to Sebastian Wandelt. Please do not send a link to the whole image file, but only to your program.

  • Copyright/Licensing issues

    We expect that all submitted programs may be distributed freely in the form they are submitted. Please include appropriate copyright/license notices wherever necessary. Although the organizers will not pursue any form of distribution, they also take no responsibility should submissions become public in any form. Note that we do not require the source code of the software to be made available.

  • Organizers

      Ulf Leser, WBI, Humboldt-Universität zu Berlin, Germany
      Sebastian Wandelt, WBI, Humboldt-Universität zu Berlin, Germany
  • Contact

    For questions concerning the submission and the competition, please send an email to wandelt(at)informatik.hu-berlin.de or use our mailing list.


09/20/2012 WBI, Humboldt-Universität zu Berlin, Berlin, Germany