I'm building a backend and trying to solve the following problem.
- The clients submit text to the backend (around 2,000 characters on average).
- The backend endpoint that receives the request has to apply phrase highlighting to the submitted text.
There are around 80,000 phrases to match. A phrase is a simple object:
{'phrase': 'phrase to match', 'link': 'link_url'}
After finding all matches of phrases that exist in the text, the backend returns to the client what was matched - basically a map:
range in text -> phrase
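For example, given the text 'this is a phrase to match' and the phrase object above, the returned map would contain something like:
(10, 25) -> {'phrase': 'phrase to match', 'link': 'link_url'}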
Most of this is done and everything else works smoothly; I'm about to tackle coding the phrase matching part. Since I don't want to reinvent the wheel, I tried googling for a Python library that efficiently finds phrases (from a huge list) in text. However, I couldn't find anything.
I checked out BeautifulSoup and the Natural Language Toolkit, but they don't seem to do what I'm looking for.
Do you guys know of a library that would be helpful for such a task? It seems like a common thing to implement, and I don't want to go custom if there is a well-established library for it.
6 Answers
Answer 1
To get reasonable speed while matching 80k patterns, you definitely need some preprocessing on the patterns; single-shot algorithms like Boyer-Moore won't help much.
You'll probably also need to do the work in compiled code (think C extension) to get reasonable throughput. Regarding how to preprocess the patterns: one option is a state machine like Aho-Corasick or some generic finite state transducer. The next option is something like a suffix array based index, and the last one that comes to my mind is an inverted index.
If your matches are exact and the patterns respect word boundaries, chances are that a well-implemented word or word-ngram keyed inverted index will be fast enough even in pure Python. The index alone is not a complete solution; rather, it gives you a few candidate phrases which you then need to check with normal string matching to confirm a complete match, as in the sketch below.
If you need approximate matching, a character-ngram inverted index is your choice.
Answer 2
Maybe you should try flashtext.
According to the author, it is much faster than regex.
The author even published a paper for this library.
I've personally tried this library for one of my projects; in my opinion, its API is quite friendly and usable.
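For reference, a minimal sketch of how this could look (assuming flashtext's KeywordProcessor API; span_info=True adds the character offsets):

from flashtext import KeywordProcessor

processor = KeywordProcessor()
# Store the link as the "clean name" so extraction returns it directly.
processor.add_keyword('phrase to match', 'link_url')
processor.add_keyword('this is', 'link_url2')

# span_info=True also returns the (start, end) offsets of each match.
print(processor.extract_keywords('this is a phrase to match', span_info=True))
# [('link_url2', 0, 7), ('link_url', 10, 25)]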
Hope it helps.
Answer 3
I faced an almost identical problem with my own chat page system. I wanted to be able to add a link to a number of keywords (with slight variations) that were present in the text. I only had around 200 phrases to check, though.
I decided to try using a standard regular expression for the problem to see how fast it would be. The main bottleneck was in constructing the regular expression. I decided to pre-compile this and found the match time was very fast for shorter texts.
The following approach takes a list of phrases, where each contains 'phrase' and 'link' keys. It first constructs a reverse lookup dictionary:
{'phrase to match': 'link_url', 'another phrase': 'link_url2'}
Next it compiles a regular expression of the following form; this allows for matches which contain different amounts of whitespace between words:
(phrase\s+to\s+match|another\s+phrase)
Then for each piece of text (e.g. ~2,000 characters each), it uses finditer() to get each match. The match object gives you .span(), which holds the start and end location of the matching text, and group(1) gives the matched text. As the text can possibly contain extra whitespace, re_whitespace is first applied to remove it and bring the match back to the form stored in the reverse dictionary. With this, it is possible to automatically look up the required link:
import re

texts = ['this is a phrase to match', 'another phrase this is']
phrases = [{'phrase': 'phrase to match', 'link': 'link_url'},
           {'phrase': 'this is', 'link': 'link_url2'}]

# Reverse lookup: normalised phrase -> link.
reverse = {d['phrase']: d['link'] for d in phrases}
re_whitespace = re.compile(r'\s+')
# One big alternation, allowing flexible whitespace between words.
re_phrases = re.compile('({})'.format('|'.join(d['phrase'].replace(' ', r'\s+') for d in phrases)))

for text in texts:
    matches = [(match.span(), reverse[re_whitespace.sub(' ', match.group(1))])
               for match in re_phrases.finditer(text)]
    print(matches)
Which would display the matches for the two texts as:
[((0, 7), 'link_url2'), ((10, 25), 'link_url')]
[((15, 22), 'link_url2')]
To test how this scales, I tested it by importing a list of English words from nltk and automatically creating 80,000 two-to-six-word phrases along with unique links. I then timed it on two suitably long texts:
import re
import random
import time
from nltk.corpus import words

english = words.words()

def random_phrase(l=2, h=6):
    return ' '.join(random.sample(english, random.randint(l, h)))

texts = ['this is a phrase to match', 'another phrase this is']
# Make texts ~2000 characters
texts = ['{} {}'.format(t, random_phrase(200, 200)) for t in texts]

phrases = [{'phrase': 'phrase to match', 'link': 'link_url'},
           {'phrase': 'this is', 'link': 'link_url2'}]
# Simulate 80k phrases
for x in range(80000):
    phrases.append({'phrase': random_phrase(), 'link': 'link{}'.format(x)})

construct_time = time.time()
reverse = {d['phrase']: d['link'] for d in phrases}
re_whitespace = re.compile(r'\s+')
re_phrases = re.compile('({})'.format('|'.join(
    d['phrase'].replace(' ', r'\s+')
    for d in sorted(phrases, key=lambda x: len(x['phrase'])))))
print('Time to construct:', time.time() - construct_time)
print()

for text in texts:
    start_time = time.time()
    print('{} characters - "{}..."'.format(len(text), text[:60]))
    matches = [(match.span(), reverse[re_whitespace.sub(' ', match.group(1))])
               for match in re_phrases.finditer(text)]
    print(matches)
    print('Time taken:', time.time() - start_time)
    print()
This takes ~17 seconds to construct the regular expression and reverse lookup (which is only needed once). It then takes about 6 seconds per text; for very short texts it takes ~0.06 seconds per text.
Time to construct: 16.812477111816406

2092 characters - "this is a phrase to match totaquine externize intoxatio..."
[((0, 7), 'link_url2'), ((10, 25), 'link_url')]
Time taken: 6.000027656555176

2189 characters - "another phrase this is political procoracoidal playstead as..."
[((15, 22), 'link_url2')]
Time taken: 6.190425715255737
This will at least give you an idea to compare against.
Answer 4
You should try a string search / pattern matching algorithm. The most famous algorithm for your task is Aho-Corasick, and there is a Python library for it (off the top of a Google search).
Most pattern matching / string search algorithms will require you to convert your "bag of words/phrases" into a trie, as shown below.
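For example, with the pyahocorasick package (one such library; this sketch assumes its Automaton API): each phrase is inserted into the trie, which is then converted into an Aho-Corasick automaton.

import ahocorasick  # pip install pyahocorasick

phrases = [{'phrase': 'phrase to match', 'link': 'link_url'},
           {'phrase': 'this is', 'link': 'link_url2'}]

automaton = ahocorasick.Automaton()
for p in phrases:
    automaton.add_word(p['phrase'], (p['phrase'], p['link']))  # trie insertion
automaton.make_automaton()  # builds the failure links (the Aho-Corasick step)

for end, (phrase, link) in automaton.iter('this is a phrase to match'):
    start = end - len(phrase) + 1  # iter() reports the index of the last character
    print((start, end + 1), link)
# (0, 7) link_url2
# (10, 25) link_url

Note that Aho-Corasick matches at the substring level, so word boundaries would need an extra check.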
Answer 5
The pyparsing module - a Python tool for extracting information from text - will help you with writing the phrase matching. It returns all matches, and the index range of each match, for a phrase which you can describe using BNF (Backus-Naur Form), i.e. a grammar. In my experience, it is easy to use, expressive in terms of the kinds of patterns you can define, and impressively fast.
from pyparsing import Word, alphas

greet = Word(alphas) + "," + Word(alphas) + "!"  # <-- grammar defined here
hello = "Hello, World!"
print(hello, "->", greet.parseString(hello))
Use scanString to return the index of each match:
for item in greet.scanString(hello):
    print(item)
# ((['Hello', ',', 'World', '!'], {}), 0, 13)
If you assemble a list of phrases using pyparsing as a dictionary of the form
phrase_list = {phrase_defined_with_pyparsing: phrase_name}
then your grammar can be a giant OR statement with labeled phrases.
import pyparsing as pp

your_grammar = pp.Or([phrase.setResultsName(phrase_name)
                      for phrase, phrase_name in phrase_list.items()])
all_matches = your_grammar.scanString(big_document)
Each match is a tuple that is labeled (via setResultsName) and has an index range.
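Iterating the results of the sketch above might then look like this (assuming pyparsing's ParseResults.getName() to recover the label):

for tokens, start, end in all_matches:
    print(tokens.getName(), (start, end))  # phrase label and its character span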
Answer 6
Assuming that the list of phrases changes over time and keeps growing, I'd recommend using software that already does what you need, e.g. Elasticsearch: it's open source and has a Python client. Running a service like that in the background solves everything you want, and probably more than you could ever imagine. It's also really not that hard to set up.
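As a rough, untested sketch of one way this could look with Elasticsearch's percolator feature and the official Python client (the index name and 8.x-style keyword arguments are assumptions): store each phrase as a match_phrase query, then percolate every submitted text against them.

from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')

# A percolator index: each stored document is itself a query.
es.indices.create(index='phrases', mappings={'properties': {
    'query': {'type': 'percolator'},
    'text': {'type': 'text'},
}})
es.index(index='phrases', id='1', document={
    'query': {'match_phrase': {'text': 'phrase to match'}},
})

# Ask which of the stored phrase queries match the submitted text.
resp = es.search(index='phrases', query={'percolate': {
    'field': 'query',
    'document': {'text': 'this is a phrase to match'},
}})
for hit in resp['hits']['hits']:
    print(hit['_id'], hit['_source']['query'])

The percolate response tells you which phrases matched; recovering exact character ranges would still need the highlighting API or a local string search.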