SFA

SFA - Symbolic Fourier Approximation

This page was built in support of our paper "SFA: A Symbolic Fourier Approximation and Index for Similarity Search in Time Series Data" by Patrick Schäfer and Mikael Högqvist.

This page contains further experiments, which we couldn't show in the paper using varying time series lengths.

Java Implemenation of SFA

Scalable Time Series Data Analytics using SFA, BOSS and BOSS VS

Tightness of Lower Bounds for Dimensionality Reduction Techniques

SFA has a very tight lower bound to DFT. This is shown in this experiment for 256 symbols of the symbolic representations.

Coefficients	8	16
Length 256
Length 1024
Length 2048

Here is the raw data.

Wilcoxon Signed-Rank Test based on Tightness of Lower Bounds

We applied a one-tailed Wilcoxon signed-rank test using the statistics tool R to test if any method has a higher tlb. The null-hypothesis is that the tlb of PAA is higher than that of DFT.

length	2048	2048	1024	1024	256	256
symbols	256	256	256	256	256	256
coef.	16	8	16	8	16	8
N	28	28	28	28	29	29
W+	379.5	385	390.5	386.5	415.5	404
p-value	3.063e-5	1.789e-5	1.029e-5	1.54e-5	9.743e-6	2.486e-6

With a max p-value of 3.063e-5 the null-hypothesis can be rejected on all tested configurations. This means, the alternative hypothesis that the tlb of DFT if higher than the tlb of PAA is statistically significant.

If the time series are z-normalised and N PAA-coefficients are calculated, it is sufficient to store only N-1 and reconstruct the N-th coefficient. Thus, we tested PAA2 with N+1 coefficients while DFT uses only N. The null-Hypothesis is that the tlb of PAA2 is higher than the tlb of DFT.

Length	2048	2048	1024	1024	256	256
Symbols	256	256	256	256	256	256
Coefficients	16(17)	8(9)	16(17)	8(9)	16(17)	8(9)
N	28	28	28	28	29	29
W_{+}	351	309	376	338	404.5	407
p-value	0.0003913	0.002043	7.57e-06	0.001096	2.75e-05	2.181e-05

With a max p-value of 0.002043 the null-hypothesis can be rejected on all tested configurations. This means, the alternative hypothesis tlb of DFT is higher than tlb of PAA is statistically significant.

Here is the raw data for the Wilcoxon signed-rank test.

Pruning Power Experiments

We compare SFA and iSAX in terms of pruning power, which is a measure for the performance of 1-nearest-neighbour queries. The pruning power P is defined as the fraction of the database, that must be examined in reduced space before the 1-nearest-neighbour to a query is found.

It can be seen as the number of false alarms that have to be inspected until the true nearest neighbour is found (lower is better). To measure P, we first determine the true 1-nearest-neighbour TNN to a query Q, sort the time series in reduced space by increasing distance to Q and count the number of time series that have a smaller distance than TNN.

Dataset	Length 256	Length 512	Length 1024	Data Source
Trajectory3_7_2006	256	512	1024	data source
Fluid_dynamics	256	512	1024	data source
PostureCentroidB	256	512	1024	data source
TOR (Italy power demand 97)	256	512	1024	data source
Ann_gun_CentroidA	256	512	1024	data source
Buoy Sensor	256	512	1024	data source
Burst	256	512	1024	data source
chfdb_chf	256	512	1024	data source
Clorine (cl2fullLarge)	256	512	1024	data source
Converted_ERP	256	512	1024	data source
Kung Fu MoCap	256	512	1024	data source
CSTR	256	512	1024	data source
Mallet (DB time)	256	512	1024	data source
Foetal_ecg	256	512	1024	data source
Koski ECG	256	512	1024	data source
forte-6class	256	512	1024	data source
Koski ECG	256	512	1024	data source
Star Light Curve	256	512	1024	data source
mitdbx_mitdbx_108	256	512	1024	data source
Motorcurrent	256	512	1024	data source
Muscle Activation	256	512	1024	data source
Physiodata (npo141)	256	512	1024	data source
Physiodata (npog41)	256	512	1024	data source
Respiration (nprs43)	256	512	1024	data source
Pen	256	512	1024	data source
Power_Data	256	512	1024	data source
Steamgen	256	512	1024	data source
synemp-train	256	512	1024	data source
Tickwise	256	512	1024	data source
Tide	256	512	1024	data source
Winding	256	512	1024	data source

Exact Similarity Search - Performance

For completness, we show further results for lengths 256, 512 and 1024 on this page for the TS-tree and the kd-tree, which show the same trend, that the SFA trie is competitive or superior to other indexing techniques.

time series length	page accesses	time	distance calculations
1024
512
256

Exact Similarity Search - iSAX index vs. SFA trie

In the paper we could only show the number of page accesses and wall clock time. For completness, we show the number of distance calculations on this page. It shows that the SFA trie does not only have less page accesses but also requires less distance calculations for exact similarity search.

time series length	page accesses	time	distance calculations
1024

Here is the raw data for the experiments.

Acknowledgements:

Thank you to the providers of the time series datasets.
Thank you to Eamonn J. Keogh for collecting all time series datasets and providing this page on iSAX.