SFA

SFA - Symbolic Fourier Approximation

This page was built in support of our paper "SFA: A Symbolic Fourier Approximation and Index for Similarity Search in Time Series Data" by Patrick Schäfer and Mikael Högqvist.

This page contains further experiments, which we couldn't show in the paper using varying time series lengths.

Java Implemenation of SFA

Scalable Time Series Data Analytics using SFA, BOSS and BOSS VS

Tightness of Lower Bounds for Dimensionality Reduction Techniques

SFA has a very tight lower bound to DFT. This is shown in this experiment for 256 symbols of the symbolic representations.

Coefficients 8 16
Length 256
Length 1024
Length 2048

Here is the raw data.

Wilcoxon Signed-Rank Test based on Tightness of Lower Bounds

We applied a one-tailed Wilcoxon signed-rank test using the statistics tool R to test if any method has a higher tlb. The null-hypothesis is that the tlb of PAA is higher than that of DFT.

length2048204810241024256256
symbols256256256256256256
coef.168168168
N282828282929
W+379.5385390.5386.5415.5404
p-value3.063e-51.789e-51.029e-51.54e-59.743e-62.486e-6

With a max p-value of 3.063e-5 the null-hypothesis can be rejected on all tested configurations. This means, the alternative hypothesis that the tlb of DFT if higher than the tlb of PAA is statistically significant.

If the time series are z-normalised and N PAA-coefficients are calculated, it is sufficient to store only N-1 and reconstruct the N-th coefficient. Thus, we tested PAA2 with N+1 coefficients while DFT uses only N. The null-Hypothesis is that the tlb of PAA2 is higher than the tlb of DFT.

Length2048204810241024256256
Symbols256256256256256256
Coefficients16(17)8(9)16(17)8(9)16(17)8(9)
N282828282929
W_{+}351309376338404.5407
p-value0.00039130.0020437.57e-060.0010962.75e-05 2.181e-05

With a max p-value of 0.002043 the null-hypothesis can be rejected on all tested configurations. This means, the alternative hypothesis tlb of DFT is higher than tlb of PAA is statistically significant.

Here is the raw data for the Wilcoxon signed-rank test.

Pruning Power Experiments

We compare SFA and iSAX in terms of pruning power, which is a measure for the performance of 1-nearest-neighbour queries. The pruning power P is defined as the fraction of the database, that must be examined in reduced space before the 1-nearest-neighbour to a query is found.

It can be seen as the number of false alarms that have to be inspected until the true nearest neighbour is found (lower is better). To measure P, we first determine the true 1-nearest-neighbour TNN to a query Q, sort the time series in reduced space by increasing distance to Q and count the number of time series that have a smaller distance than TNN.

Dataset Length 256 Length 512 Length 1024 Data Source
Trajectory3_7_2006
256

512

1024
data source
Fluid_dynamics
256

512

1024
data source
PostureCentroidB
256

512

1024
data source
TOR (Italy power demand 97)
256

512

1024
data source
Ann_gun_CentroidA
256

512

1024
data source
Buoy Sensor
256

512

1024
data source
Burst
256

512

1024
data source
chfdb_chf
256

512

1024
data source
Clorine (cl2fullLarge)
256

512

1024
data source
Converted_ERP
256

512

1024
data source
Kung Fu MoCap
256

512

1024
data source
CSTR
256

512

1024
data source
Mallet (DB time)
256

512

1024
data source
Foetal_ecg
256

512

1024
data source
Koski ECG
256

512

1024
data source
forte-6class
256

512

1024
data source
Koski ECG
256

512

1024
data source
Star Light Curve
256

512

1024
data source
mitdbx_mitdbx_108
256

512

1024
data source
Motorcurrent
256

512

1024
data source
Muscle Activation
256

512

1024
data source
Physiodata (npo141)
256

512

1024
data source
Physiodata (npog41)
256

512

1024
data source
Respiration (nprs43)
256

512

1024
data source
Pen
256

512

1024
data source
Power_Data
256

512

1024
data source
Steamgen
256

512

1024
data source
synemp-train
256

512

1024
data source
Tickwise
256

512

1024
data source
Tide
256

512

1024
data source
Winding
256

512

1024
data source

Exact Similarity Search - Performance

For completness, we show further results for lengths 256, 512 and 1024 on this page for the TS-tree and the kd-tree, which show the same trend, that the SFA trie is competitive or superior to other indexing techniques.

time series length page accesses time distance calculations
1024
512
256

Exact Similarity Search - iSAX index vs. SFA trie

In the paper we could only show the number of page accesses and wall clock time. For completness, we show the number of distance calculations on this page. It shows that the SFA trie does not only have less page accesses but also requires less distance calculations for exact similarity search.

time series length page accesses time distance calculations
1024

Here is the raw data for the experiments.

Acknowledgements: