Multidimensional Range Queries on Modern Hardware

Authors: Stefan Sprenger, Patrick Schäfer and Ulf Leser

Contact: sprengsz (at) informatik (dot) hu-berlin (dot) de

DOWNLOADS

Benchmark Suite (2.1 GB)

CONTENTS

Requirements Installation Running Experiments GMRQ Benchmark Vectorization of the R*-tree Further Experimental Results

1.) Requirements

We have the following requirements:

2.) Installation

As first step, you need to download and unzip our benchmark suite:

$ wget https://www2.informatik.hu-berlin.de/~sprengsz/mdrq/mdrq-analysis.zip
$ unzip mdrq-analysis.zip
$ cd mdrq-analysis

Please make also sure that libspatialindex is correctly installed and available (installation guide). You need to apply our changes before compiling the source code.

3.) Running Experiments

We run experiments for each contestants separately. Experiments and benchmarks are provided in the subfolders benchmarks/scan (sequential scans), benchmarks/kdtree (kd-tree), benchmarks/rtree (R*-tree) and benchmarks/vafile (VA-File).

When running an experiment for a contestant, e.g., the VA-file, we need to switch to the according directory:

$ cd benchmarks/vafile

As first step, we should compile the code by using the make command:

(benchmarks/vafile) $ make clean all

Then, we can list all available options:

(benchmarks/vafile) $ ./benchmark
> Usage: ./benchmark num_elements num_dimensions distribution(0=normal, 1=subclustered, 2=uniform, 3=gmrqb, 4=power)

For instance, to run experiments on 1 Million uniformly distributed 10-dimensional feature vectors, we may provide the following options:

(benchmarks/vafile) $ ./benchmark 1000000 10 2

To run the mixed workload from the GMRQ benchmark using 24 CPU threads, you may use the following options (the last two options are only available for GMRQB):

(benchmarks/vafile) $ ./benchmark 10000000 19 3 ../../gmrqb-mixed.queries 24

We also provide executables for all experiments done in our paper.

4.) Genomic Multidimensional Range Query Benchmark (GMRQB)

Dataset

The following table describes all attributes and provides the domain and number of distinct values:

Attribute NameDescriptionDomain (real numbers)Distinct Values
chromosomeThe chromosome the variation has been found at.[22.0,22.0]1
locationThe position the variation has been found at.[16050100.0,18791500.0]20,550
qualityQuality score of the read.[100.0,100.0]1
depthDepth of the read.[301.0,91949.0]19,538
reference_genomeThe used reference genome.[398393985191641088.0,398393985191641088.0]1
variation_idDistinct identifier of the variation.[1225.0,99992304.0]63,883
allele_freqRelative frequency of the allele.[0.0002,0.9998]3,176
ref_baseNucleobase found in the reference genome.[25855900690415616.0,18445699537663164416.0]540
alt_baseNucleobase found in the sample genome.[66030698359685120.0,18446499982128185344.0]423
ancestral_alleleThe allele of ancestors.[1476799971876405248.0,8591840045650935808.0]3
ancestral_countThe count of the allele.[1.0,5008.0]3,177
filterAny filters that the variation failed to pass.[4774679821152157696.0,4774679821152157696.0]1
variant_typeThe type of the variation.[310693982822727680.0,17548400192863076352.0]3
sample_idDistinct identifier of the sample (human).[961370.0,99848200.0]2,502
genderGender of the sample.[1.0,2.0]2
family_idDistinct family identifier of the sample.[11118.0,94755104.0]1,867
populationPopulation of the sample.[390559002771062784.0,17293700523312021504.0]26
relationshipRelationship to other samples.[184136999010041856.0,17976799609858555904.0]16
genotypeThe genotype of the sample.[107920004023844864.0,18388100521530490880.0]29

Query Templates

The following sections provide the eight query templates of GMRQB written as SQL statements to ease readability.
GMRQB Query Template 1
SELECT * FROM variations
  WHERE chromosome = ?
    AND location BETWEEN ? AND ?;
GMRQB Query Template 2
SELECT * FROM variations
  WHERE chromosome = ?
    AND location BETWEEN ? AND ?
    AND quality BETWEEN ? AND ?
    AND depth BETWEEN ? AND ?
    AND allele_freq BETWEEN ? AND ?;
GMRQB Query Template 3
SELECT * FROM variations
  WHERE chromosome = ?
    AND location BETWEEN ? AND ?
    AND gender = ?;
GMRQB Query Template 4
SELECT * FROM variations
  WHERE chromosome = ?
    AND location BETWEEN ? AND ?
    AND gender = ?
    AND population = '?';
GMRQB Query Template 5
SELECT * FROM variations
  WHERE chromosome = ?
    AND location BETWEEN ? AND ?
    AND gender = ?
    AND population = '?'
    AND relationship = '?';
GMRQB Query Template 6
SELECT * FROM variations
  WHERE chromosome = ?
    AND location BETWEEN ? AND ?
    AND gender = ?
    AND population = '?'
    AND relationship = '?'
    AND family_id BETWEEN ? AND ?;
GMRQB Query Template 7
SELECT * FROM variations
  WHERE chromosome = ?
    AND location BETWEEN ? AND ?
    AND gender = ?
    AND population = '?'
    AND relationship = '?'
    AND family_id BETWEEN ? AND ?
    AND variation_id BETWEEN ? AND ?;
GMRQB Query Template 8
SELECT * FROM variations
  WHERE chromosome = ?
    AND location BETWEEN ? AND ?
    AND quality BETWEEN ? AND ?
    AND depth BETWEEN ? AND ?
    AND allele_freq BETWEEN ? AND ?
    AND ref_base = '?'
    AND alt_base = '?'
    AND ancestral_allele = '?'
    AND variation_id BETWEEN ? AND ?
    AND sample_id BETWEEN ? AND ?
    AND gender = ?
    AND family_id BETWEEN ? AND ?
    AND population = '?'
    AND relationship = '?'
    AND variant_type = '?'
    AND genotype = '?'
    AND genotype_quality BETWEEN ? AND ?
    AND read_depth BETWEEN ? AND ?
    AND haplotype_quality BETWEEN ? AND ?;

5.) Vectorization of the R*-tree

In our experiments, we used the R*-tree implementation provided by libspatialindex (version 1.8.5). We vectorized the range query implementation by applying SIMD instrinsics.

The following archive provides all changes made for our experiments (after unpacking, see the file diff-files.txt for a list of all changed files).

6.) Further Experimental Results

The following sections present experimental results that did not fit in the original paper due to space limitations.

Memory Consumption

The following figure shows the memory usage of each contestant depending on the dataset. As for all experiments in our paper we used p = 24, i. e., we build 24 instances of a MDIS/to-be-scanned array. For scanning vertically partitioned data, we used m partitions (depending on the dataset). We measure the combined space consumption of all instances/partitions.

For SYNT-UNI, we show the memory consumption when storing 10M 5-dimensional data objects. For SYNT-CLUST, we show the memory consumption when storing 1M 5-dimensional data objects featuring 20 clusters. For POWER, we show the memory consumption when storing 10M 3-dimensional data objects. For GMRQB, we show the memory consumption when storing 10M 19-dimensional data objects. Parallel scans require the least memory across all datasets, because they do not employ any additional data structures like MDIS. Comparing both hierarchical MDIS, the kd-tree shows a more efficient memory usage than the R*-tree. Please note that our VA-file implementation uses 8-bit integers (the smallest available data type) for approximation values to allow an arbitrary number of partitions although only 2 bits are needed (we create 4 partitions per dimension), thus reducing the memory efficiency.

Memory usage of contestants when stor- ing the datasets used in our evaluation.
Memory usage of contestants when storing the datasets used in our evaluation.

Experiments on further Hardware Platform

In addition to the main evaluation in the paper, we also ran all experiments on a further hardware platform to prove that our findings do not depend on the hardware architecture used in our evaluation. As further hardware platform we used a modern desktop machine that features one Intel i7-5930K CPU (3.5 GHz clock rate, 6 cores, 12 virtual cores) and 32 GB RAM. The CPU supports AVX instructions. The results on both hardware platforms are very similar, which confirms that our experimental results do not depend on the used hardware architecture.

As exemplary results for the experiments on the desktop machine, the following figure shows the throughput when executing the query templates from the GMRQ Benchmark on 10M 19-dimensional data objects from the 1000 Genomes Project. The query templates are instantiated with the same values as in Section 7.7 of the paper. In contrast to the experiment on the other machine, here only 12 software threads are used because the desktop machine features less virtual cores. As on the other hardware platform, the parallel scan with vertical partitioning achieves the highest throughput for queries with a moderate to low selectivity (Mixed Workload and Query Templates 1-5). Only for queries with a very high selectivity much below 1% (Query Templates 6-8), it is outperformed by the kd-tree.

Throughput of the contestants on the desktop machine.
Throughput of the contestants on the desktop machine when executing the GMRQB with varying selectivities on 10M 19-dimensional data objects from the 1000 Genomes Project dataset using 12 software threads (query templates are ordered by average selectivity, from low (10.76%) to high (0.00001%)).