6 * strgrp - group/cluster similar strings.
8 * strgrp uses the Longest Common Subsequence (LCS) algorithm[1] to cluster
9 * similar strings. It is governed by a threshold which is compared against
10 * the computed normalised LCS length for all known groups.
12 * As a coarse and not entirely accurate summary, strgrp takes the following
15 * 1. For all known strings, calculate the normalised LCS value against the
17 * 2. Find the maximum normalised LCS value and associated group
18 * 3. If the calculated normalised LCS value exceeds the configured threshold,
19 * add the input string to the group, otherwise create a new group
21 * The clustering operation is expensive; LCS on its own is computationally
22 * O(mn) on its two input strings and optimally requires O(min(m,n)) memory. In
23 * general each input string should be compared against all known strings,
24 * giving O(n^2) behaviour of the clustering algorithm on top of the O(mn) LCS
25 * similarity measurement.
27 * strgrp tries to battle this complexity on several fronts:
29 * 1. Coarse reduction of the required comparisons
30 * 1a. Each group has a 'key', which is the string that triggered the creation
32 * 1b. Input strings are only compared against group keys rather than all known
33 * strings, reducing the complexity to the current number of groups rather
34 * than all known strings. Note due the pathological case where the number
35 * of groups is equal to the number of known strings the algorithm still
36 * has O(n^2) computational complexity
38 * 2. Elimination of LCS computations that will never breach the configured
40 * 2a. This can be measured from the length of the compared strings
41 * 2b. This avoids invoking the O(mn) behaviour of LCS
43 * 3. Caching of input strings and their associated group
44 * 3a. By incurring the cost of a map's string hash function we may eliminate
45 * all calls to the LCS function for exact matches, potentially reducing
46 * the insertion to a constant-time operation.
48 * 4. Whilst the data dependencies of LCS prevent internally parallel
49 * implementations, LCS as a function can be applied in parallel. The code
50 * uses OpenMP to automatically and concurrently distribute scoring of the
51 * input string against group keys across threads.
53 * [1] https://en.wikipedia.org/wiki/Longest_common_subsequence_problem
56 * Author: Andrew Jeffery <andrew@aj.id.au>
59 * #define BUF_SIZE 512
60 * FILE *const f = fdopen(0, "r");
61 * char *buf = malloc(BUF_SIZE);
62 * struct strgrp *ctx = strgrp_new(0.85);
63 * while(fgets(buf, BUF_SIZE, f)) {
64 * buf[strcspn(buf, "\r\n")] = '\0';
65 * if (!strgrp_add(ctx, buf, NULL)) {
66 * printf("Failed to classify %s\n", buf);
70 * // Re-implement something similar to strgrp_print() via API as an
72 * struct strgrp_iter *iter = strgrp_iter_new(ctx);
73 * const struct strgrp_grp *grp;
74 * while(NULL != (grp = strgrp_iter_next(iter))) {
75 * printf("%s\n", strgrp_grp_key(grp));
76 * struct strgrp_grp_iter *grp_iter = strgrp_grp_iter_new(grp);
77 * const struct strgrp_item *item;
78 * while(NULL != (item = strgrp_grp_iter_next(grp_iter))) {
79 * printf("\t%s\n", strgrp_item_key(item));
81 * strgrp_grp_iter_free(grp_iter);
83 * strgrp_iter_free(iter);
88 int main(int argc, char *argv[]) {
93 if (strcmp(argv[1], "depends") == 0) {
94 printf("ccan/darray\n");
95 printf("ccan/stringmap\n");
97 printf("ccan/tal/str\n");
101 if (strcmp(argv[1], "testdepends") == 0) {
102 printf("ccan/str\n");