The cosine similarity measure[1] (O(m + n)) noticeably reduces runtime
when used as a filter ahead of more expensive algorithms such as
LCS[2] (O(m * n)).
A private test set of 3500 strings was used to quantify the improvement.
The distribution of string lengths in the test set, as summarized by
Python's pandas library, is:
>>> frames.apply(len).describe()
count 3500.000000
mean 47.454286
std 14.980197
min 10.000000
25% 37.000000
50% 45.000000
75% 61.000000
max 109.000000
dtype: float64
>>>
The tests were performed on a lightly loaded Lenovo X201s with an
Intel Core i7 L 640 @ 2.13GHz CPU and 3.7 GiB of memory. The test was
compiled with GCC 4.9.3:
$ gcc --version
gcc (Gentoo 4.9.3 p1.0, pie-0.6.2) 4.9.3
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Using the test program outlined below, ten runs were taken for each
input set size, with sizes increasing in steps of 500 strings, prior to
filtering with cosine similarity:
The test driver is shown below. Both it and its dependencies were
compiled with 'CFLAGS=-O2 -fopenmp' and linked with 'LDFLAGS=-fopenmp'
and 'LDLIBS=-lm':