Wednesday, May 3, 2017

How to find set of shortest subsequences with minimal collisions from set of strings

Leave a Comment

I've got a list of strings like

  • Foobar
  • Foobaron
  • Foot
  • barstool
  • barfoo
  • footloose

I want to find the set of shortest possible sub-sequences that are unique to each string in the set; the characters in each sub-sequence do not need to be adjacent, just in order as they appear in the original string. For the example above, that would be (along other possibilities)

  • Fb (as unique to Foobar as it gets; collision with Foobaron unavoidable)
  • Fn (unique to Foobaron, no other ...F...n...)
  • Ft (Foot)
  • bs (barstool)
  • bf (barfoo)
  • e (footloose)

Is there an efficient way to mine such sequences and minimize the number of colliding strings (when collisions can't be avoided, e.g. when strings are substrings of other strings) from a given array of strings? More precisely, chosing the length N, what is the set of sub-sequences of up to N characters each that identify the original strings with the fewest number of collisions.

3 Answers

Answers 1

I would'nt really call that 'efficient', but you can do better than totally dumb like that:

words = ['Foobar', 'Foobaron', 'Foot', 'barstool', 'barfoo', 'footloose'] N = 2 n = len(words) L = max([len(word) for word in words])  def generate_substrings(word, max_length=None):     if max_length is None:         max_length = len(word)     set_substrings = set()     set_substrings.add('')     for charac in word:         new_substr_list = []         for substr in set_substrings:             new_substr = substr + charac             if len(new_substr) <= max_length:                 new_substr_list.append(new_substr)         set_substrings.update(new_substr_list)     return set_substrings  def get_best_substring_for_each(string_list=words, max_length=N):     all_substrings = {}     best = {}     for word in string_list:         for substring in generate_substrings(word, max_length=max_length):             if substring not in all_substrings:                 all_substrings[substring] = 0             all_substrings[substring] = all_substrings[substring] + 1     for word in string_list:         best_score = len(string_list) + 1         best[word] = ''         for substring in generate_substrings(word=word, max_length=max_length):             if all_substrings[substring] < best_score:                 best[word] = substring                 best_score = all_substrings[substring]     return best  print(get_best_substring_for_each(words, N)) 

This program prints the solution:

{'barfoo': 'af', 'Foobar': 'Fr', 'Foobaron': 'n', 'footloose': 'os', 'barstool': 'al', 'Foot': 'Ft'} 

This can still be improved easily by a constant factor, for instance by storing the results of generate_substringsinstead of computing it twice.

The complexity is O(n*C(N, L+N)), where n is the number of words and L the maximum length of a word, and C(n, k) is the number of combinations with k elements out of n.

I don't think (not sure though) that you can do much better in the worst case, because it seems hard not to enumerate all possible substrings in the worst case (the last one to be evaluated could be the only one with no redundancy...). Maybe in average you can do better...

Answers 2

You could use a modification to the longest common subsequence algorithm. In this case you are seeking the shortest unique subsequence. Shown below is part of a dynamic programming solution which is more efficient than a recursive solution. The modifications to the longest common subsequence algorithm are described in the comments below:

for (int i = 0; i < string1.Length; i++)     for (int j = 0; j < string2.Length; j++)         if (string1[i-1] != string2[j-1])   // find characters in the strings that are distinct             SUS[i][j] = SUS[i-1][j-1] + 1;  // SUS: Shortest Unique Substring         else             SUS[i][j] = min(SUS[i-1][j], SUS[i][j-1]);  // find minimum size of distinct strings 

You can then put this code in a function and call this function for each string in your set to find the length of the shortest unique subsequence in the set.

Once you have the length of the shortest unique subsequence, you can backtrack to print the subsequence.

Answers 3

You should use modified Trie structure, insert strings to a trie in a way that :

Foo-bar-on    -t bar-stool    -foo 

The rest is straightforward, just choose correct compressed node[0] char

That Radix tree should help

If You Enjoyed This, Take 5 Seconds To Share It

0 comments:

Post a Comment