Monday, August 14, 2017

Finding longest regex match in Java?

Leave a Comment

I have this:

import java.util.regex.*;  String regex = "(?<m1>(hello|universe))|(?<m2>(hello world))"; String s = "hello world";  Pattern pattern = Pattern.compile(regex); Matcher matcher = pattern.matcher(s); while(matcher.find()) {   MatchResult matchResult = m.toMatchResult();   String substring = s.substring(matchResult.start(), matchResult.end());   System.out.println(substring); } 

The above only prints hello whereas I want it to print hello world.

One way to fix this is to re-order the groups in String regex = "(?<m2>(hello world))|(?<m1>(hello|universe))" but I don't have control over the regex I get in my case...

So what is the best way to find the longest match? An obvious way would be to check all possible substrings of s as mentioned here (Efficiently finding all overlapping matches for a regular expression) by length and pick the first but that is O(n^2). Can we do better?

3 Answers

Answers 1

Here is a way of doing it using matcher regions, but with a single loop over the string index:

public static String findLongestMatch(String regex, String s) {     Pattern pattern = Pattern.compile("(" + regex + ")$");     Matcher matcher = pattern.matcher(s);     String longest = null;     int longestLength = -1;     for (int i = s.length(); i > longestLength; i--) {         matcher.region(0, i);         if (matcher.find() && longestLength < matcher.end() - matcher.start()) {             longest = matcher.group();             longestLength = longest.length();         }     }     return longest; } 

I'm forcing the pattern to match until the region's end, and then I move the region's end from the rightmost string index towards the left. For each region's end tried, Java will match the leftmost starting substring that finishes at that region's end, i.e. the longest substring that ends at that place. Finally, it's just a matter of keeping track of the longest match found so far.

As a matter of optimization, and since I start from the longer regions towards the shorter ones, I stop the loop as soon as all regions that would come after are already shorter than the length of longest substring already found.


An advantage of this approach is that it can deal with arbitrary regular expressions and no specific pattern structure is required:

findLongestMatch("(?<m1>(hello|universe))|(?<m2>(hello world))", "hello world") ==> "hello world"  findLongestMatch("hello( universe)?", "hello world") ==> "hello"  findLongestMatch("hello( world)?", "hello world") ==> "hello world"  findLongestMatch("\\w+|\\d+", "12345 abc") ==> "12345" 

Answers 2

If you are dealing with just this specific pattern:

  1. There is one or more named group on the highest level connected by |.
  2. The regex for the group is put in superfluous braces.
  3. Inside those braces is one or more literal connected by |.
  4. Literals never contain |, ( or ).

Then it is possible to write a solution by extracting the literals, sorting them by their length and then returning the first match:

private static final Pattern g = Pattern.compile("\\(\\?\\<[^>]+\\>\\(([^)]+)\\)\\)");  public static final String findLongestMatch(String s, Pattern p) {     Matcher m = g.matcher(p.pattern());     List<String> literals = new ArrayList<>();     while (m.find())         Collections.addAll(literals, m.group(1).split("\\|"));     Collections.sort(literals, new Comparator<String>() {         public int compare(String a, String b) {             return Integer.compare(b.length(), a.length());         }     });     for (Iterator<String> itr = literals.iterator(); itr.hasNext();) {          String literal = itr.next();          if (s.indexOf(literal) >= 0)               return literal;     }     return null; } 

Test:

System.out.println(findLongestMatch(     "hello world",     Pattern.compile("(?<m1>(hello|universe))|(?<m2>(hello world))") )); // output: hello world System.out.println(findLongestMatch(     "hello universe",     Pattern.compile("(?<m1>(hello|universe))|(?<m2>(hello world))") )); // output: universe 

Answers 3

If the structure of the regex is always the same, this should work:

String regex = "(?<m1>(hello|universe))|(?<m2>(hello world))"; String s = "hello world";  //split the regex into the different groups String[] allParts = regex.split("\\|\\(\\?\\<"); for (int i=1; i<allParts.length; i++) {     allParts[i] = "(?<" + allParts[i]; }  //find the longest string int longestSize = -1; String longestString = null; for (int i=0; i<allParts.length; i++) {     Pattern pattern = Pattern.compile(allParts[i]);     Matcher matcher = pattern.matcher(s);     while(matcher.find()) {         MatchResult matchResult = matcher.toMatchResult();         String substring = s.substring(matchResult.start(), matchResult.end());         if (substring.length() > longestSize) {             longestSize = substring.length();             longestString = substring;         }     } } System.out.println("Longest: " + longestString); 
If You Enjoyed This, Take 5 Seconds To Share It

0 comments:

Post a Comment