A workshop held in conjunction with EDBT/ICDT 2013, March 22, 2013, Genoa, Italy

Experimentation data sets and Input/Output

We supply experimentation datasets for testing and optimizing your code. You can assume that the experimentation datasets are roughly representative of the competition datasets in terms of relative symbol frequencies, distribution of string lengths, etc.
In order to cover different alphabets, we use two datasets:
1. Human genome read data:
  The evaluation dataset contains on the order of tens of millions of reads from (different) human genomes. The public experimentation dataset contains 750,000 reads, which is roughly 5% of the competition dataset.
  - Link to the public experimentation dataset 1
  - Link to the test queries for public experimentation dataset 1 (Track I)
2. Geographical names:
  The evaluation dataset contains on the order of several million city names from all over the world, after phonetic rewriting. The public experimentation dataset contains 400,000 names, which is roughly 5% of the competition dataset.
  - Link to the public experimentation dataset 2
  - Link to the test queries for public experimentation dataset 2 (Track I)
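
To get a feel for the properties mentioned above (relative symbol frequencies and distribution of string lengths), a dataset can be profiled with a few lines of code. The following is only a minimal sketch: the function names `profile_strings` and `profile_file` are hypothetical, and it assumes the dataset is a text file with one string per line, which may differ from the actual format defined in the input specification.

```python
from collections import Counter


def profile_strings(strings):
    """Return (relative symbol frequencies, length histogram) for an iterable of strings."""
    symbol_counts = Counter()
    length_counts = Counter()
    for s in strings:
        symbol_counts.update(s)        # count every character of the string
        length_counts[len(s)] += 1     # count how many strings have this length
    total = sum(symbol_counts.values())
    rel_freq = {ch: n / total for ch, n in symbol_counts.items()} if total else {}
    return rel_freq, dict(length_counts)


def profile_file(path, encoding="utf-8"):
    """Profile a dataset file. Assumption: one string per line (hypothetical format)."""
    with open(path, encoding=encoding) as f:
        return profile_strings(line.rstrip("\r\n") for line in f)
```

For example, `profile_strings(["ACGT", "AACC"])` reports a relative frequency of 0.375 for both `A` and `C` and a length histogram of `{4: 2}`. Comparing such profiles between the experimentation data and your own test data may help when tuning an index or filter for a particular alphabet.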
General Input
For both datasets, the input format, output format, and further constraints are defined here.

09/20/2012 WBI, Humboldt-Universität zu Berlin, Berlin, Germany