I've got a list of strings like
- Foobar
- Foobaron
- Foot
- barstool
- barfoo
- footloose
I want to find the set of shortest possible sub-sequences that are unique to each string in the set; the characters in each sub-sequence do not need to be adjacent, just in order as they appear in the original string. For the example above, that would be (along other possibilities)
Fb
(as unique to Foobar as it gets; collision with Foobaron unavoidable)Fn
(unique to Foobaron, no other...F...n...
)Ft
(Foot)bs
(barstool)bf
(barfoo)e
(footloose)
Is there an efficient way to mine such sequences and minimize the number of colliding strings (when collisions can't be avoided, e.g. when strings are substrings of other strings) from a given array of strings? More precisely, chosing the length N, what is the set of sub-sequences of up to N characters each that identify the original strings with the fewest number of collisions.
3 Answers
Answers 1
I would'nt really call that 'efficient', but you can do better than totally dumb like that:
words = ['Foobar', 'Foobaron', 'Foot', 'barstool', 'barfoo', 'footloose'] N = 2 n = len(words) L = max([len(word) for word in words]) def generate_substrings(word, max_length=None): if max_length is None: max_length = len(word) set_substrings = set() set_substrings.add('') for charac in word: new_substr_list = [] for substr in set_substrings: new_substr = substr + charac if len(new_substr) <= max_length: new_substr_list.append(new_substr) set_substrings.update(new_substr_list) return set_substrings def get_best_substring_for_each(string_list=words, max_length=N): all_substrings = {} best = {} for word in string_list: for substring in generate_substrings(word, max_length=max_length): if substring not in all_substrings: all_substrings[substring] = 0 all_substrings[substring] = all_substrings[substring] + 1 for word in string_list: best_score = len(string_list) + 1 best[word] = '' for substring in generate_substrings(word=word, max_length=max_length): if all_substrings[substring] < best_score: best[word] = substring best_score = all_substrings[substring] return best print(get_best_substring_for_each(words, N))
This program prints the solution:
{'barfoo': 'af', 'Foobar': 'Fr', 'Foobaron': 'n', 'footloose': 'os', 'barstool': 'al', 'Foot': 'Ft'}
This can still be improved easily by a constant factor, for instance by storing the results of generate_substrings
instead of computing it twice.
The complexity is O(n*C(N, L+N))
, where n is the number of words and L the maximum length of a word, and C(n, k)
is the number of combinations with k elements out of n.
I don't think (not sure though) that you can do much better in the worst case, because it seems hard not to enumerate all possible substrings in the worst case (the last one to be evaluated could be the only one with no redundancy...). Maybe in average you can do better...
Answers 2
You could use a modification to the longest common subsequence algorithm. In this case you are seeking the shortest unique subsequence. Shown below is part of a dynamic programming solution which is more efficient than a recursive solution. The modifications to the longest common subsequence algorithm are described in the comments below:
for (int i = 0; i < string1.Length; i++) for (int j = 0; j < string2.Length; j++) if (string1[i-1] != string2[j-1]) // find characters in the strings that are distinct SUS[i][j] = SUS[i-1][j-1] + 1; // SUS: Shortest Unique Substring else SUS[i][j] = min(SUS[i-1][j], SUS[i][j-1]); // find minimum size of distinct strings
You can then put this code in a function and call this function for each string in your set to find the length of the shortest unique subsequence in the set.
Once you have the length of the shortest unique subsequence, you can backtrack to print the subsequence.
Answers 3
You should use modified Trie structure, insert strings to a trie in a way that :
Foo-bar-on -t bar-stool -foo
The rest is straightforward, just choose correct compressed node[0] char
That Radix tree should help
0 comments:
Post a Comment