I have a string that holds a very long sentence without whitespaces/spaces.

mystring = "abcdthisisatextwithsampletextforasampleabcd"

I would like to find all repeated substrings that contain at least 4 characters.

So I would like to achieve something like this:

'text' 2 times 'sample' 2 times 'abcd' 2 times

Since abcd, text and sample each occur twice in mystring, they are recognized as properly matched substrings of at least 4 characters. It's important that I am seeking repeated substrings; finding only existing English words is not a requirement.

The answers I found are helpful for finding duplicates in texts with whitespace, but I couldn't find a proper resource that covers the situation when there are no spaces or other whitespace in the string. I would really appreciate it if somebody could show me the most efficient way to do this.

10 Answers

Answers 1

This is in Python 2 because I'm not doing Python 3 at this time. So you'll have to adapt it to Python 3 yourself.

#!python2

# import module
from collections import Counter

# get the indices
def getIndices(length):
    # holds the indices
    specific_range = []; all_sets = []

    # start building the indices
    for i in range(0, length - 2):

        # build a set of indices of a specific range
        for j in range(1, length + 2):
            specific_range.append([j - 1, j + i + 3])

            # append 'specific_range' to 'all_sets', reset 'specific_range'
            if specific_range[j - 1][1] == length:
                all_sets.append(specific_range)
                specific_range = []
                break

    # return all of the calculated indices ranges
    return all_sets

# store search strings
tmplst = []; combos = []; found = []

# string to be searched
mystring = "abcdthisisatextwithsampletextforasampleabcd"
# mystring = "abcdthisisatextwithtextsampletextforasampleabcdtext"

# get length of string
length = len(mystring)

# get all of the indices ranges, 4 and greater
all_sets = getIndices(length)

# get the search string combinations
for sublst in all_sets:
    for subsublst in sublst:
        tmplst.append(mystring[subsublst[0]: subsublst[1]])
    combos.append(tmplst)
    tmplst = []

# search for matching string patterns
for sublst in all_sets:
    for subsublst in sublst:
        for sublstitems in combos:
            if mystring[subsublst[0]: subsublst[1]] in sublstitems:
                found.append(mystring[subsublst[0]: subsublst[1]])

# make a dictionary containing the strings and their counts
d1 = Counter(found)

# filter out counts of 2 or more and print them
for k, v in d1.items():
    if v > 1:
        print k, v

Answers 2

Let's go through this step by step. There are several sub-tasks you should take care of:

Identify all substrings of length 4 or more.
Count the occurrence of these substrings.
Filter all substrings with 2 occurrences or more.

You can actually put all of them into a few statements. For understanding, it is easier to go through them one at a time.

The following examples all use

mystring = "abcdthisisatextwithsampletextforasampleabcd"
min_length = 4

1. Substrings of a given length

You can easily get substrings by slicing - for example, mystring[4:4+6] gives you the substring from position 4 of length 6: 'thisis'. More generically, you want substrings of the form mystring[start:start+length].

So what values do you need for start and length?

start must...
- cover all substrings, so it must include the first character: start in range(0, ...).
- not map to substrings shorter than min_length, so it can stop min_length characters before the end: start in range(..., len(mystring) - min_length + 1).
length must...
- cover the shortest substring of length 4: length in range(min_length, ...).
- not exceed the remaining string after start: length in range(..., len(mystring) - start + 1).

The +1 terms are there because range excludes its upper bound. You can put this all together into a single comprehension, with i as start and j as length:

substrings = [
    mystring[i:i+j]
    for i in range(0, len(mystring) - min_length + 1)
    for j in range(min_length, len(mystring) - i + 1)
]

2. Count substrings

Trivially, you want to keep a count for each substring. Keeping anything for each specific object is what dicts are made for. So you should use substrings as keys and counts as values in a dict. In essence, this corresponds to this:

counts = {}
for substring in substrings:
    try:  # increase count for existing keys, set for new keys
        counts[substring] += 1
    except KeyError:
        counts[substring] = 1

You can simply feed your substrings to collections.Counter, and it produces something like the above.

>>> import collections
>>> counts = collections.Counter(substrings)
>>> print(counts)
Counter({'abcd': 2, 'abcdt': 1, 'abcdth': 1, 'abcdthi': 1, 'abcdthis': 1, ...})

Notice how the duplicate 'abcd' maps to the count of 2.

3. Filtering duplicate substrings

So now you have your substrings and the count for each. You need to remove the non-duplicate substrings - those with a count of 1.

Python offers several constructs for filtering, depending on the output you want. These work also if counts is a regular dict:

>>> list(filter(lambda key: counts[key] > 1, counts))
['abcd', 'text', 'samp', 'sampl', 'sample', 'ampl', 'ample', 'mple']
>>> {key: value for key, value in counts.items() if value > 1}
{'abcd': 2, 'ampl': 2, 'ample': 2, 'mple': 2, 'samp': 2, 'sampl': 2, 'sample': 2, 'text': 2}

Using Python primitives

Python ships with primitives that allow you to do this more efficiently.

Use a generator to build substrings. A generator builds its member on the fly, so you never actually have them all in-memory. For your use case, you can use a generator expression:
```
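substrings = (
    mystring[i:i+j]
    # + 1 because range excludes its upper bound; without it the last
    # start positions (including the trailing 'abcd') are never visited
    for i in range(0, len(mystring) - min_length + 1)
    for j in range(min_length, len(mystring) - i + 1)
)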
```
Use a pre-existing Counter implementation. Python comes with a dict-like container that counts its members: collections.Counter can directly digest your substring generator. Especially in newer versions of Python, this is much more efficient.
```
counts = collections.Counter(substrings) 
```
You can exploit Python's lazy filters to only ever inspect one substring at a time. The filter builtin or another generator expression can produce one result at a time without storing them all in memory.
```
for substring in filter(lambda key: counts[key] > 1, counts):
    print(substring, 'occurs', counts[substring], 'times')
```
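Put together, the three pieces above might be wrapped into one function. This is only a sketch: the function name `find_repeated_substrings` and its signature are made up for illustration, and the start range uses `len(text) - min_length + 1` so that a substring ending at the very end of the string is also visited.

```python
import collections

def find_repeated_substrings(text, min_length=4):
    """Return {substring: count} for every substring of at least
    min_length characters that occurs more than once."""
    # Lazily generate every substring of length >= min_length.
    substrings = (
        text[start:start + length]
        for start in range(len(text) - min_length + 1)
        for length in range(min_length, len(text) - start + 1)
    )
    counts = collections.Counter(substrings)
    # Keep only the substrings that were seen at least twice.
    return {sub: n for sub, n in counts.items() if n > 1}

repeated = find_repeated_substrings("abcdthisisatextwithsampletextforasampleabcd")
for substring, count in repeated.items():
    print(substring, 'occurs', count, 'times')
```

Note that this still builds a Counter entry for every distinct substring, so memory grows roughly quadratically with the length of the string; the generator only avoids holding the full list of substrings as well.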

Answers 3

Script (explanation where needed, in comments):

from collections import Counter

mystring = "abcdthisisatextwithsampletextforasampleabcd"
mystring_len = len(mystring)

possible_matches = []
matches = []

# Stop `start_index` 4 characters before the end of the string, due to the
# minimum char count of 4
for start_index in range(0, mystring_len-3):
    # Start `end_index` at `start_index+1` and range it throughout the rest of
    # the string
    for end_index in range(start_index+1, mystring_len+1):
        current_string = mystring[start_index:end_index]
        if len(current_string) < 4: continue # Skip this iteration, if len < 4
        possible_matches.append(mystring[start_index:end_index])

for possible_match, count in Counter(possible_matches).most_common():
    # Iterate until count is less than or equal to 1 because `Counter`'s
    # `most_common` method lists them in order. Once 1 (or less) is hit, all
    # others are the same or lower.
    if count <= 1: break
    matches.append((possible_match, count))

for match, count in matches:
    print(f'\'{match}\' {count} times')

Output:

'abcd' 2 times
'text' 2 times
'samp' 2 times
'sampl' 2 times
'sample' 2 times
'ampl' 2 times
'ample' 2 times
'mple' 2 times

Answers 4

$ cat test.py

import collections
import sys


S = "abcdthisisatextwithsampletextforasampleabcd"


def find(s, min_length=4):
    """
    Find repeated character sequences in a provided string.

    Arguments:
    s -- the string to be searched
    min_length -- the minimum length of the sequences to be found
    """
    counter = collections.defaultdict(int)
    # A repeated sequence can't be longer than half the length of s
    sequence_length = len(s) // 2
    # populate counter with all possible sequences
    while sequence_length >= min_length:
        # Iterate over the string until the number of remaining characters is
        # fewer than the length of the current sequence.
        for i, x in enumerate(s[:-(sequence_length - 1)]):
            # Window across the string, getting slices
            # of length == sequence_length.
            candidate = s[i:i + sequence_length]
            counter[candidate] += 1
        sequence_length -= 1

    # Report.
    for k, v in counter.items():
        if v > 1:
            print('{} {} times'.format(k, v))

    return


if __name__ == '__main__':
    try:
        s = sys.argv[1]
    except IndexError:
        s = S

    find(s)

$ python test.py
sample 2 times
sampl 2 times
ample 2 times
abcd 2 times
text 2 times
samp 2 times
ampl 2 times
mple 2 times

Answers 5

Here's a Python3 friendly solution:

from collections import Counter

min_str_length = 4
mystring = "abcdthisisatextwithsampletextforasampleabcd"

all_substrings = [mystring[start_index:][:end_index + 1]
                  for start_index in range(len(mystring))
                  for end_index in range(len(mystring[start_index:]))]
counted_substrings = Counter(all_substrings)
not_counted_final_candidates = [item[0]
                                for item in counted_substrings.most_common()
                                if item[1] > 1 and len(item[0]) >= min_str_length]
counted_final_candidates = {item: counted_substrings[item] for item in not_counted_final_candidates}
print(counted_final_candidates)

Bonus: largest string

sub_sub_strings = [substring1
                   for substring1 in not_counted_final_candidates
                   for substring2 in not_counted_final_candidates
                   if substring1 != substring2 and substring1 in substring2]
largest_common_string = list(set(not_counted_final_candidates) - set(sub_sub_strings))

Everything as a function:

from collections import Counter

def get_repeated_strings(input_string, min_str_length=2, calculate_largest_repeated_string=True):

    all_substrings = [input_string[start_index:][:end_index + 1]
                      for start_index in range(len(input_string))
                      for end_index in range(len(input_string[start_index:]))]
    counted_substrings = Counter(all_substrings)
    not_counted_final_candidates = [item[0]
                                    for item in counted_substrings.most_common()
                                    if item[1] > 1 and len(item[0]) >= min_str_length]
    counted_final_candidates = {item: counted_substrings[item] for item in not_counted_final_candidates}

    ### This is just a bit of bonus code for calculating the largest repeating string

    if calculate_largest_repeated_string:
        sub_sub_strings = [substring1 for substring1 in not_counted_final_candidates for substring2 in
                           not_counted_final_candidates if substring1 != substring2 and substring1 in substring2]
        largest_common_strings = list(set(not_counted_final_candidates) - set(sub_sub_strings))

        return counted_final_candidates, largest_common_strings
    else:
        return counted_final_candidates

Example:

mystring = "abcdthisisatextwithsampletextforasampleabcd"
print(get_repeated_strings(mystring, min_str_length=4))

Output:

({'abcd': 2, 'text': 2, 'samp': 2, 'sampl': 2, 'sample': 2, 'ampl': 2, 'ample': 2, 'mple': 2}, ['abcd', 'text', 'sample'])

Answers 6

CODE:

pattern = "abcdthisisatextwithsampletextforasampleabcd"

string_more_4 = []
k = 4
while(k <= len(pattern)):
    for i in range(len(pattern)):
        if pattern[i:k+i] not in string_more_4 and len(pattern[i:k+i]) >= 4:
            string_more_4.append( pattern[i:k+i])
    k+=1

for i in string_more_4:
    if pattern.count(i) >= 2:
        print(i + " -> " +  str(pattern.count(i)) + " times")

OUTPUT:

abcd -> 2 times
text -> 2 times
samp -> 2 times
ampl -> 2 times
mple -> 2 times
sampl -> 2 times
ample -> 2 times
sample -> 2 times

Hope this helps as my code length was short and it is easy to understand. Cheers!

Answers 7

Nobody is using re! Time for an answer [ab]using the regular expression built-in module ;)

import re

Finding all the maximal substrings that are repeated

repeated_ones = set(re.findall(r"(.{4,})(?=.*\1)", mystring))

This matches the longest substrings which have at least a single repetition after (without consuming). So it finds all disjointed substrings that are repeated while only yielding the longest strings.

Finding all substrings that are repeated, including overlaps

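mystring_overlap = "abcdeabcdzzzzbcde"  # In case we want to match both abcd and bcde
repeated_ones = set()
pos = 0

while True:
    match = re.search(r"(.{4,}).*(\1)+", mystring_overlap[pos:])
    if match:
        repeated_ones.add(match.group(1))
        # Resume one character after the start of this match. match.pos would
        # always be 0 here, because the search runs on a slice of the string.
        pos += match.start() + 1
    else:
        break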

This ensures that all --not only disjoint-- substrings which have repetition are returned. It should be much slower, but gets the work done.

If, in addition to the longest repeated strings, you also want their shorter substrings, then:

base_repetitions = list(repeated_ones)

for s in base_repetitions:
    for i in range(4, len(s)):
        repeated_ones.add(s[:i])

That will ensure that for long substrings that have repetition, you have also the smaller substring --e.g. "sample" and "ample" found by the re.search code; but also "samp", "sampl", "ampl" added by the above snippet.

Counting matches

Because (by design) the substrings that we count are non-overlapping, the count method is the way to go:

from __future__ import print_function
for substr in repeated_ones:
    print("'%s': %d times" % (substr, mystring.count(substr)))
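One caveat worth illustrating: str.count only counts non-overlapping occurrences. If overlapping repetitions ever need to be counted, a zero-width lookahead is one way to do it. The helper name `count_overlapping` below is hypothetical and not part of the answer's code.

```python
import re

def count_overlapping(text, substr):
    """Count occurrences of substr in text, including overlapping ones."""
    # The lookahead matches at every start position without consuming
    # characters, so occurrences that overlap are each counted once.
    return len(re.findall("(?=%s)" % re.escape(substr), text))

print("aaaaa".count("aaaa"))               # non-overlapping count
print(count_overlapping("aaaaa", "aaaa"))  # overlapping count
```

For the strings used in this answer the two counts agree, because none of the repeated substrings overlap with themselves.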

Results

Finding maximal substrings:

With the question's original mystring:

{'abcd', 'text', 'sample'}

with the mystring_overlap sample:

{'abcd'}

Finding all substrings:

With the question's original mystring:

{'abcd', 'ample', 'mple', 'sample', 'text'}

... and if we add the code to get all substrings then, of course, we get absolutely all the substrings:

{'abcd', 'ampl', 'ample', 'mple', 'samp', 'sampl', 'sample', 'text'}

with the mystring_overlap sample:

{'abcd', 'bcde'}

Future work

It's possible to filter the results of finding all substrings with the following steps:

take a match "A"
check if this match is a substring of another match, call it "B"
if there is a "B" match, check the counter on that match "B_n"
if "A_n = B_n", then remove A
go to first step

It cannot happen that "A_n < B_n" because A is smaller than B (is a substring) so there must be at least the same number of repetitions.

If "A_n > B_n" it means that there is some extra match of the smaller substring, so it is a distinct substring because it is repeated in a place where B is not repeated.
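The steps above could be sketched roughly as follows. This is an illustration only: the function name `drop_redundant_matches` is made up, and it assumes the matches have already been collected into a dict that maps each substring to its count. A single pass is enough here, since if A is dropped because of B and B is later dropped because of a longer C with the same count, A is contained in C as well.

```python
def drop_redundant_matches(counts):
    """Drop every match A that is contained in a longer match B
    with the same count (A_n == B_n)."""
    kept = {}
    for a, a_n in counts.items():
        redundant = any(
            a != b and a in b and a_n == b_n
            for b, b_n in counts.items()
        )
        if not redundant:
            kept[a] = a_n
    return kept

counts = {'abcd': 2, 'ampl': 2, 'ample': 2, 'mple': 2,
          'samp': 2, 'sampl': 2, 'sample': 2, 'text': 2}
print(drop_redundant_matches(counts))
```

With the counts from the question's string, only 'abcd', 'sample' and 'text' remain; a shorter match with a higher count than every longer match containing it is kept.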

Answers 8

This is my approach to this problem:

def get_repeated_words(string, minimum_len):

    # Storing count of repeated words in this dictionary
    repeated_words = {}

    # Traversing till last but 4th element
    # Actually leaving `minimum_len` elements at end (in this case its 4)
    for i in range(len(string)-minimum_len):

        # Starting with a length of 4(`minimum_len`) and going till end of string
        for j in range(i+minimum_len, len(string)):

            # getting the current word
            word = string[i:j]

            # counting the occurrences of the word
            word_count = string.count(word)

            if word_count > 1:

                # storing in dictionary along with its count if found more than once
                repeated_words[word] = word_count

    return repeated_words

if __name__ == '__main__':
    mystring = "abcdthisisatextwithsampletextforasampleabcd"
    result = get_repeated_words(mystring, 4)

Answers 9

Here is a simple solution using the more_itertools library.

Given

import collections as ct

import more_itertools as mit


s = "abcdthisisatextwithsampletextforasampleabcd"
lbound, ubound = len("abcd"), len(s)

Code

windows = mit.flatten(mit.windowed(s, n=i) for i in range(lbound, ubound))
filtered = {"".join(k): v for k, v in ct.Counter(windows).items() if v > 1}
filtered

Output

{'abcd': 2,
 'text': 2,
 'samp': 2,
 'ampl': 2,
 'mple': 2,
 'sampl': 2,
 'ample': 2,
 'sample': 2}

Details

The procedures are:

build sliding windows of varying sizes lbound <= n < ubound
count all occurrences and filter replicates

more_itertools is a third-party package installed by pip install more_itertools.
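If installing a third-party package is not an option, the same sliding-window idea can be sketched with the standard library alone. The helper `windows` below is a hypothetical stand-in for mit.windowed that yields string slices instead of tuples, so no "".join step is needed.

```python
import collections

def windows(text, n):
    """Yield every length-n slice of text, from left to right."""
    for start in range(len(text) - n + 1):
        yield text[start:start + n]

s = "abcdthisisatextwithsampletextforasampleabcd"
lbound, ubound = len("abcd"), len(s)

# Count the windows of every size lbound <= n < ubound, then keep replicates.
counts = collections.Counter(
    window for n in range(lbound, ubound) for window in windows(s, n)
)
filtered = {k: v for k, v in counts.items() if v > 1}
print(filtered)
```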

Answers 10

This is how I would do it, but I don't know any other way:

string = "abcdthisisatextwithsampletextforasampleabcd"
l = len(string)
occurrences = {}
for i in range(4, l):
  # l - i + 1 start positions, so that the last window reaches the end of
  # the string (otherwise the trailing 'abcd' is never counted)
  for start in range(l - i + 1):
    substring = string[start:start + i]
    occurrences[substring] = occurrences.get(substring, 0) + 1
for key in occurrences.keys():
  if occurrences[key] > 1:
    print("'" + key + "'", str(occurrences[key]), "times")

Output:

'abcd' 2 times
'text' 2 times
'samp' 2 times
'ampl' 2 times
'mple' 2 times
'sampl' 2 times
'ample' 2 times
'sample' 2 times

Efficient, no, but easy to understand, yes.
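One way to cut down the work, sketched here as an assumption rather than as part of the answer above: a substring of length k+1 can only repeat if its first k characters repeat too, so after each round only the start positions of repeated substrings need to be extended. The function name `repeated_substrings` is made up for this illustration.

```python
from collections import defaultdict

def repeated_substrings(text, min_length=4):
    """Count repeated substrings, growing only from repeated prefixes."""
    result = {}
    length = min_length
    # Initially every position that leaves room for min_length characters.
    starts = range(len(text) - length + 1)
    while starts:
        positions = defaultdict(list)
        for start in starts:
            if start + length <= len(text):
                positions[text[start:start + length]].append(start)
        # Only starts of repeated substrings can yield longer repeats.
        starts = []
        for substring, where in positions.items():
            if len(where) > 1:
                result[substring] = len(where)
                starts.extend(where)
        length += 1
    return result

print(repeated_substrings("abcdthisisatextwithsampletextforasampleabcd"))
```

On strings with few repetitions this examines far fewer candidates than enumerating every substring, although the worst case (for example a string made of a single repeated character) remains expensive.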

Coding Question

Saturday, July 7, 2018

Finding repeated character combinations in string