- * 1. Coarse reduction of the required comparisons
- * 1a. Each group has a 'key', which is the string that triggered the creation
- * of the group
- * 1b. Input strings are only compared against group keys rather than all known
- * strings, reducing the complexity to the current number of groups rather
- * than all known strings. Note due the pathological case where the number
- * of groups is equal to the number of known strings the algorithm still
- * has O(n^2) computational complexity
- *
- * 2. Elimination of LCS computations that will never breach the configured
- * threshold
- * 2a. This can be measured from the length of the compared strings
- * 2b. This avoids invoking the O(mn) behaviour of LCS
- *
- * 3. Caching of input strings and their associated group
- * 3a. By incurring the cost of a map's string hash function we may eliminate
- * all calls to the LCS function for exact matches, potentially reducing
- * the insertion to a constant-time operation.
- *
- * 4. Whilst the data dependencies of LCS prevent internally parallel
- * implementations, LCS as a function can be applied in parallel. The code
- * uses OpenMP to automatically and concurrently distribute scoring of the
- * input string against group keys across threads.
+ * 1. Caching of input strings and their associated group. By incurring the
+ * cost of a map's string hash function we may eliminate all further search
+ * costs for exact matches, potentially reducing the insertion to a
+ * constant-time operation.
+ *
+ * 2. Coarse reduction of the required comparisons. Each group has a 'key',
+ * which is the string that triggered the creation of the group. Input strings
+ * are only compared against group keys rather than all known strings, reducing
+ * the complexity to the current number of groups rather than all known
+ * strings. Note that in the pathological case where the number of groups is
+ * equal to the number of known strings the algorithm still has O(n^2)
+ * computational complexity.
+ *
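The key-only comparison in point 2 can be sketched as the loop below. It is simplified: the `score()` stand-in and the "first group above threshold" policy are assumptions (strgrp may instead track the best-scoring group), but it shows how an unmatched input becomes the key of a new group.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_GROUPS 128

/* Hypothetical scorer stand-in: 1.0 for equal strings, else 0.0.
 * The real implementation scores via LCS. */
static double score(const char *a, const char *b)
{
    return strcmp(a, b) == 0 ? 1.0 : 0.0;
}

static const char *keys[MAX_GROUPS];
static size_t n_groups;

/* Compare the input against group keys only; a new group's key is the
 * string that triggered its creation. Returns the group index. */
static size_t group_insert(const char *input, double threshold)
{
    size_t i;
    for (i = 0; i < n_groups; i++)
        if (score(input, keys[i]) >= threshold)
            return i;
    keys[n_groups] = input;   /* input becomes the new group's key */
    return n_groups++;
}
```

Each insertion costs one comparison per existing group, which is where the O(n^2) pathological case (every string its own group) comes from.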
+ * 3. Whilst the data dependencies of LCS prevent internally parallel
+ * implementations, LCS and other filters can be applied in parallel. The code
+ * uses OpenMP to automatically distribute scoring of the input string
+ * against group keys across a number of threads.
+ *
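The parallel scoring in point 3 might look like the following. The `score()` function here is a trivial stand-in (shared-prefix length) for the LCS-based scorer; compile with -fopenmp for parallelism, though the pragmas are ignored by compilers without OpenMP support and the code runs correctly serially.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the real LCS-based scorer: shared prefix length */
static double score(const char *a, const char *b)
{
    size_t n = 0;
    while (a[n] && a[n] == b[n])
        n++;
    return (double)n;
}

/* Score the input against every group key concurrently, keep the best.
 * Iterations are independent: each compares the input against one key. */
static size_t best_group(const char *input, const char **keys, size_t n_keys)
{
    double best = -1.0;
    size_t best_i = 0;
    size_t i;
    #pragma omp parallel for
    for (i = 0; i < n_keys; i++) {
        double s = score(input, keys[i]);
        /* serialise the read-compare-update of the shared best result */
        #pragma omp critical
        if (s > best) {
            best = s;
            best_i = i;
        }
    }
    return best_i;
}
```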
+ * 4. Elimination of LCS computations that will never breach the configured
+ * threshold. Two measurements are used as rejecting filters (i.e. a failure to
+ * exceed the threshold prevents further measurements being made):
+ *
+ * 4a. Comparison of string lengths: String lengths must be within given bounds
+ * to satisfy the user-supplied similarity constraint. A negative result avoids
+ * invoking the O(m * n) behaviour of LCS at the cost of O(m + n) in the two
+ * strlen() invocations.
+ *
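The length filter in 4a could be sketched as below. It assumes the similarity score is defined as 2 * LCS(a, b) / (m + n); since LCS(a, b) can never exceed min(m, n), the best achievable score is bounded by the lengths alone. The exact score definition is an assumption for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Cheap O(m + n) rejecting filter: LCS(a, b) <= min(m, n), so a score of
 * 2 * LCS / (m + n) can never exceed 2 * min(m, n) / (m + n). If even
 * that upper bound misses the threshold, skip the O(m * n) LCS entirely.
 */
static bool lengths_can_match(const char *a, const char *b, double threshold)
{
    size_t m = strlen(a), n = strlen(b);
    size_t lo = m < n ? m : n;
    return 2.0 * (double)lo / (double)(m + n) >= threshold;
}
```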
+ * 4b. Comparison of character distribution: Cosine similarity[2] is used to
+ * measure unordered character similarity between the input strings. The
+ * implementation is again O(m + n), and avoids the O(m * n) behaviour of LCS.
+ *
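The character-distribution filter in 4b might be implemented as below. To stay self-contained (no libm sqrt()) this sketch returns the *squared* cosine similarity of the byte-frequency vectors, to be compared against the threshold squared; the real implementation may differ.

```c
#include <assert.h>
#include <string.h>

/* Squared cosine similarity between the byte-frequency vectors of two
 * strings. O(m + n): one pass per string plus a fixed 256-way reduction.
 * Compare the result against threshold * threshold. */
static double char_cosine_sq(const char *a, const char *b)
{
    double va[256] = { 0 }, vb[256] = { 0 };
    double dot = 0, na = 0, nb = 0;
    int i;
    for (; *a; a++) va[(unsigned char)*a]++;
    for (; *b; b++) vb[(unsigned char)*b]++;
    for (i = 0; i < 256; i++) {
        dot += va[i] * vb[i];
        na += va[i] * va[i];
        nb += vb[i] * vb[i];
    }
    if (na == 0 || nb == 0)
        return 0;
    return (dot * dot) / (na * nb);
}
```

Because the measure is unordered, anagrams score 1.0, so this filter can only reject, never accept, a candidate pair.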
+ * Performance will vary not only with the number of input strings but
+ * with their lengths and relative similarities. A large number of long input
+ * strings that are relatively similar will give the worst performance.
+ * However, to provide some context, strgrp can cluster a real-world test set
+ * 3500 strings distributed in length between 20 and 100 characters to 85%
+ * similarity on a 4-core 2010-era amd64 laptop in approximately 750ms.