6 * strgrp - group/cluster similar strings.
8 * strgrp uses the Longest Common Subsequence (LCS) algorithm[1] to cluster
9 * similar strings. It is governed by a threshold which is compared against
10 * the computed normalised LCS length for all known groups.
12 * As a coarse and not entirely accurate summary, strgrp takes the following
15 * 1. For all known strings, calculate the normalised LCS value against the
18 * 2. Find the maximum normalised LCS value and associated group
20 * 3. If the calculated maximum normalised LCS value exceeds the configured
21 * threshold add the input string to the group, otherwise create a new group
23 * The clustering operation is expensive; LCS on its own is computationally
24 * O(mn) on its two input strings and optimally requires O(min(m,n)) memory. In
25 * general each input string should be compared against all known strings,
26 * giving O(n^2) behaviour of the clustering algorithm on top of the O(mn) LCS
27 * similarity measurement.
29 * strgrp tries to battle this complexity on several fronts:
31 * 1. Coarse reduction of the required comparisons. Each group has a 'key',
32 * which is the string that triggered the creation of the group. Input strings
33 * are only compared against group keys rather than all known strings, reducing
34 * the complexity to the current number of groups rather than all known
35 * strings. Note due the pathological case where the number of groups is equal
36 * to the number of known strings the algorithm still has O(n^2) computational
39 * 2. Elimination of LCS computations that will never breach the configured
40 * threshold. This property can be measured from the length of the input
41 * strings, and a negative result avoids invoking the O(mn) behaviour of LCS
43 * 3. Caching of input strings and their associated group. By incurring the
44 * cost of a map's string hash function we may eliminate all calls to the LCS
45 * function for exact matches, potentially reducing the insertion to a
46 * constant-time operation.
48 * 4. Whilst the data dependencies of LCS prevent internally parallel
49 * implementations, LCS as a function can be applied in parallel. The code
50 * uses OpenMP to automatically distribute scoring of the input string
51 * against group keys across a number of threads.
53 * [1] https://en.wikipedia.org/wiki/Longest_common_subsequence_problem
56 * Author: Andrew Jeffery <andrew@aj.id.au>
62 * struct strgrp_iter *iter;
63 * const struct strgrp_grp *grp;
64 * struct strgrp_grp_iter *grp_iter;
65 * const struct strgrp_item *item;
68 * #define BUF_SIZE 512
69 * buf = malloc(BUF_SIZE);
70 * ctx = strgrp_new(0.85);
71 * while(fgets(buf, BUF_SIZE, f)) {
72 * buf[strcspn(buf, "\r\n")] = '\0';
73 * if (!strgrp_add(ctx, buf, NULL)) {
74 * printf("Failed to classify %s\n", buf);
78 * // Re-implement something similar to strgrp_print() via API as an
80 * iter = strgrp_iter_new(ctx);
81 * while(NULL != (grp = strgrp_iter_next(iter))) {
82 * printf("%s\n", strgrp_grp_key(grp));
83 * grp_iter = strgrp_grp_iter_new(grp);
84 * while(NULL != (item = strgrp_grp_iter_next(grp_iter))) {
85 * printf("\t%s\n", strgrp_item_key(item));
87 * strgrp_grp_iter_free(grp_iter);
90 * strgrp_iter_free(iter);
95 int main(int argc, char *argv[]) {
100 if (strcmp(argv[1], "depends") == 0) {
101 printf("ccan/darray\n");
102 printf("ccan/stringmap\n");
103 printf("ccan/tal\n");
104 printf("ccan/tal/str\n");
108 if (strcmp(argv[1], "testdepends") == 0) {
109 printf("ccan/str\n");
114 if (strcmp(argv[1], "cflags") == 0) {
115 printf("-fopenmp\n");