DOWNLOADS
CONTENTS
1.) Requirements
We have the following requirements:
- GCC (other compilers may work as well, we use C++11)
- Make
- OpenMP
- AVX Intrinsics
- libspatialindex (Version: 1.8.5)
2.) Installation
As first step, you need to download and unzip our benchmark suite:
$ wget https://www2.informatik.hu-berlin.de/~sprengsz/mdrq/mdrq-analysis.zip
$ unzip mdrq-analysis.zip
$ cd mdrq-analysis
Please make also sure that libspatialindex is correctly installed and available (installation guide). You need to apply our changes before compiling the source code.
3.) Running Experiments
We run experiments for each contestants separately. Experiments and benchmarks are provided in the subfolders benchmarks/scan (sequential scans), benchmarks/kdtree (kd-tree), benchmarks/rtree (R*-tree) and benchmarks/vafile (VA-File).
When running an experiment for a contestant, e.g., the VA-file, we need to switch to the according directory:
$ cd benchmarks/vafile
As first step, we should compile the code by using the make command:
(benchmarks/vafile) $ make clean all
Then, we can list all available options:
(benchmarks/vafile) $ ./benchmark
> Usage: ./benchmark num_elements num_dimensions distribution(0=normal, 1=subclustered, 2=uniform, 3=gmrqb, 4=power)
For instance, to run experiments on 1 Million uniformly distributed 10-dimensional feature vectors, we may provide the following options:
(benchmarks/vafile) $ ./benchmark 1000000 10 2
To run the mixed workload from the GMRQ benchmark using 24 CPU threads, you may use the following options (the last two options are only available for GMRQB):
(benchmarks/vafile) $ ./benchmark 10000000 19 3 ../../gmrqb-mixed.queries 24
We also provide executables for all experiments done in our paper.
4.) Genomic Multidimensional Range Query Benchmark (GMRQB)
Dataset
The following table describes all attributes and provides the domain and number of distinct values:
Attribute Name | Description | Domain (real numbers) | Distinct Values |
---|---|---|---|
chromosome | The chromosome the variation has been found at. | [22.0,22.0] | 1 |
location | The position the variation has been found at. | [16050100.0,18791500.0] | 20,550 |
quality | Quality score of the read. | [100.0,100.0] | 1 |
depth | Depth of the read. | [301.0,91949.0] | 19,538 |
reference_genome | The used reference genome. | [398393985191641088.0,398393985191641088.0] | 1 |
variation_id | Distinct identifier of the variation. | [1225.0,99992304.0] | 63,883 |
allele_freq | Relative frequency of the allele. | [0.0002,0.9998] | 3,176 |
ref_base | Nucleobase found in the reference genome. | [25855900690415616.0,18445699537663164416.0] | 540 |
alt_base | Nucleobase found in the sample genome. | [66030698359685120.0,18446499982128185344.0] | 423 |
ancestral_allele | The allele of ancestors. | [1476799971876405248.0,8591840045650935808.0] | 3 |
ancestral_count | The count of the allele. | [1.0,5008.0] | 3,177 |
filter | Any filters that the variation failed to pass. | [4774679821152157696.0,4774679821152157696.0] | 1 |
variant_type | The type of the variation. | [310693982822727680.0,17548400192863076352.0] | 3 |
sample_id | Distinct identifier of the sample (human). | [961370.0,99848200.0] | 2,502 |
gender | Gender of the sample. | [1.0,2.0] | 2 |
family_id | Distinct family identifier of the sample. | [11118.0,94755104.0] | 1,867 |
population | Population of the sample. | [390559002771062784.0,17293700523312021504.0] | 26 |
relationship | Relationship to other samples. | [184136999010041856.0,17976799609858555904.0] | 16 |
genotype | The genotype of the sample. | [107920004023844864.0,18388100521530490880.0] | 29 |
Query Templates
The following sections provide the eight query templates of GMRQB written as SQL statements to ease readability.GMRQB Query Template 1
SELECT * FROM variations
WHERE chromosome = ?
AND location BETWEEN ? AND ?;
GMRQB Query Template 2
SELECT * FROM variations
WHERE chromosome = ?
AND location BETWEEN ? AND ?
AND quality BETWEEN ? AND ?
AND depth BETWEEN ? AND ?
AND allele_freq BETWEEN ? AND ?;
GMRQB Query Template 3
SELECT * FROM variations
WHERE chromosome = ?
AND location BETWEEN ? AND ?
AND gender = ?;
GMRQB Query Template 4
SELECT * FROM variations
WHERE chromosome = ?
AND location BETWEEN ? AND ?
AND gender = ?
AND population = '?';
GMRQB Query Template 5
SELECT * FROM variations
WHERE chromosome = ?
AND location BETWEEN ? AND ?
AND gender = ?
AND population = '?'
AND relationship = '?';
GMRQB Query Template 6
SELECT * FROM variations
WHERE chromosome = ?
AND location BETWEEN ? AND ?
AND gender = ?
AND population = '?'
AND relationship = '?'
AND family_id BETWEEN ? AND ?;
GMRQB Query Template 7
SELECT * FROM variations
WHERE chromosome = ?
AND location BETWEEN ? AND ?
AND gender = ?
AND population = '?'
AND relationship = '?'
AND family_id BETWEEN ? AND ?
AND variation_id BETWEEN ? AND ?;
GMRQB Query Template 8
SELECT * FROM variations
WHERE chromosome = ?
AND location BETWEEN ? AND ?
AND quality BETWEEN ? AND ?
AND depth BETWEEN ? AND ?
AND allele_freq BETWEEN ? AND ?
AND ref_base = '?'
AND alt_base = '?'
AND ancestral_allele = '?'
AND variation_id BETWEEN ? AND ?
AND sample_id BETWEEN ? AND ?
AND gender = ?
AND family_id BETWEEN ? AND ?
AND population = '?'
AND relationship = '?'
AND variant_type = '?'
AND genotype = '?'
AND genotype_quality BETWEEN ? AND ?
AND read_depth BETWEEN ? AND ?
AND haplotype_quality BETWEEN ? AND ?;
5.) Vectorization of the R*-tree
In our experiments, we used the R*-tree implementation provided by libspatialindex (version 1.8.5). We vectorized the range query implementation by applying SIMD instrinsics.
The following archive provides all changes made for our experiments (after unpacking, see the file diff-files.txt for a list of all changed files).
6.) Further Experimental Results
The following sections present experimental results that did not fit in the original paper due to space limitations.Memory Consumption
The following figure shows the memory usage of each contestant depending on the dataset. As for all experiments in our paper we used p = 24, i. e., we build 24 instances of a MDIS/to-be-scanned array. For scanning vertically partitioned data, we used m partitions (depending on the dataset). We measure the combined space consumption of all instances/partitions.
For SYNT-UNI, we show the memory consumption when storing 10M 5-dimensional data objects. For SYNT-CLUST, we show the memory consumption when storing 1M 5-dimensional data objects featuring 20 clusters. For POWER, we show the memory consumption when storing 10M 3-dimensional data objects. For GMRQB, we show the memory consumption when storing 10M 19-dimensional data objects. Parallel scans require the least memory across all datasets, because they do not employ any additional data structures like MDIS. Comparing both hierarchical MDIS, the kd-tree shows a more efficient memory usage than the R*-tree. Please note that our VA-file implementation uses 8-bit integers (the smallest available data type) for approximation values to allow an arbitrary number of partitions although only 2 bits are needed (we create 4 partitions per dimension), thus reducing the memory efficiency.

Experiments on further Hardware Platform
In addition to the main evaluation in the paper, we also ran all experiments on a further hardware platform to prove that our findings do not depend on the hardware architecture used in our evaluation. As further hardware platform we used a modern desktop machine that features one Intel i7-5930K CPU (3.5 GHz clock rate, 6 cores, 12 virtual cores) and 32 GB RAM. The CPU supports AVX instructions. The results on both hardware platforms are very similar, which confirms that our experimental results do not depend on the used hardware architecture.
As exemplary results for the experiments on the desktop machine, the following figure shows the throughput when executing the query templates from the GMRQ Benchmark on 10M 19-dimensional data objects from the 1000 Genomes Project. The query templates are instantiated with the same values as in Section 7.7 of the paper. In contrast to the experiment on the other machine, here only 12 software threads are used because the desktop machine features less virtual cores. As on the other hardware platform, the parallel scan with vertical partitioning achieves the highest throughput for queries with a moderate to low selectivity (Mixed Workload and Query Templates 1-5). Only for queries with a very high selectivity much below 1% (Query Templates 6-8), it is outperformed by the kd-tree.
