6 * strgrp - group/cluster similar strings.
8 * strgrp clusters similar strings using the Longest Common Subsequence (LCS)
9 * algorithm[1]. Clustering is governed by a threshold value, which is compared
10 * with the normalised LCS scores calculated from the input string and each
13 * As a coarse and not entirely accurate summary, strgrp takes the following
16 * 1. For all known groups, calculate the normalised LCS value against the
19 * 2. Find the maximum normalised LCS value and associated group
21 * 3. If the calculated maximum normalised LCS value exceeds the configured
22 * threshold add the input string to the group, otherwise create a new group
24 * The clustering operation is expensive; LCS on its own is computationally
25 * O(m * n) on its two input strings and optimally requires O(min(m, n))
26 * memory. In general each input string should be compared against all known
27 * strings, giving O(n^2) behaviour of the clustering algorithm on top of the
28 * O(m * n) LCS similarity measurement.
30 * strgrp tries to battle this complexity on several fronts:
32 * 1. Caching of input strings and their associated group. By incurring the
33 * cost of a map's string hash function we may eliminate all further search
34 * costs for exact matches, potentially reducing the insertion to a
35 * constant-time operation.
37 * 2. Coarse reduction of the required comparisons. Each group has a 'key',
38 * which is the string that triggered the creation of the group. Input strings
39 * are only compared against group keys rather than all known strings, reducing
40 * the complexity to the current number of groups rather than all known
41 * strings. Note due the pathological case where the number of groups is equal
42 * to the number of known strings the algorithm still has O(n^2) computational
45 * 3. Whilst the data dependencies of LCS prevent internally parallel
46 * implementations, LCS and other filters can be applied in parallel. The code
47 * uses OpenMP to automatically distribute scoring of the input string
48 * against group keys across a number of threads.
50 * 4. Elimination of LCS computations that will never breach the configured
51 * threshold. Two measurements are used as rejecting filters (i.e. a failure to
52 * exceed the threshold prevents further measurements being made):
54 * 4a. Comparsion of string lengths: String lengths must be within given bounds
55 * to satisfy the user-supplied similarity constraint. A negative result avoids
56 * invoking the O(m * n) behaviour of LCS at the cost of O(m + n) in the two
57 * strlen() invocations.
59 * 4b. Comparison of character distribution: Cosine similarity[2] is used to
60 * measure unordered character similarity between the input strings. The
61 * implementation is again O(m + n), and avoids the O(m * n) behaviour of LCS.
63 * Performance will vary not only with the number of input strings but
64 * with their lengths and relative similarities. A large number of long input
65 * strings that are relatively similar will give the worst performance.
66 * However to provide some context, strgrp can cluster a real-world test set of
67 * 3500 strings distributed in length between 20 and 100 characters to 85%
68 * similarity on a 4-core 2010-era amd64 laptop in approximately 750ms.
70 * [1] https://en.wikipedia.org/wiki/Longest_common_subsequence_problem
72 * [2] https://en.wikipedia.org/wiki/Cosine_similarity
75 * Author: Andrew Jeffery <andrew@aj.id.au>
81 * struct strgrp_iter *iter;
82 * const struct strgrp_grp *grp;
83 * struct strgrp_grp_iter *grp_iter;
84 * const struct strgrp_item *item;
87 * #define BUF_SIZE 512
88 * buf = malloc(BUF_SIZE);
89 * ctx = strgrp_new(0.85);
90 * while(fgets(buf, BUF_SIZE, f)) {
91 * buf[strcspn(buf, "\r\n")] = '\0';
92 * if (!strgrp_add(ctx, buf, NULL)) {
93 * printf("Failed to classify %s\n", buf);
97 * // Re-implement something similar to strgrp_print() via API as an
99 * iter = strgrp_iter_new(ctx);
100 * while(NULL != (grp = strgrp_iter_next(iter))) {
101 * printf("%s\n", strgrp_grp_key(grp));
102 * grp_iter = strgrp_grp_iter_new(grp);
103 * while(NULL != (item = strgrp_grp_iter_next(grp_iter))) {
104 * printf("\t%s\n", strgrp_item_key(item));
106 * strgrp_grp_iter_free(grp_iter);
109 * strgrp_iter_free(iter);
114 int main(int argc, char *argv[]) {
119 if (strcmp(argv[1], "depends") == 0) {
120 printf("ccan/darray\n");
121 printf("ccan/stringmap\n");
122 printf("ccan/tal\n");
123 printf("ccan/tal/str\n");
127 if (strcmp(argv[1], "testdepends") == 0) {
128 printf("ccan/str\n");
132 if (strcmp(argv[1], "cflags") == 0) {
134 printf("-fopenmp\n");
140 if (strcmp(argv[1], "libs") == 0) {