Saturday, March 11, 2017

Finding longest regex match in Java?

Leave a Comment

I have this:

import java.util.regex.*;  String regex = "(?<m1>(hello|universe))|(?<m2>(hello world))"; String s = "hello world";  Pattern pattern = Pattern.compile(regex); Matcher matcher = pattern.matcher(s); while(matcher.find()) {   MatchResult matchResult = m.toMatchResult();   String substring = s.substring(matchResult.start(), matchResult.end());   System.out.println(substring); } 

The above only prints hello whereas I want it to print hello world.

One way to fix this is to re-order the groups in String regex = "(?<m2>(hello world))|(?<m1>(hello|universe))" but I don't have control over the regex I get in my case...

So what is the best way to find the longest match? An obvious way would be to check all possible substrings of s as mentioned here (Efficiently finding all overlapping matches for a regular expression) by length and pick the first but that is O(n^2). Can we do better?

3 Answers

Answers 1

just add the $ (End of string) before the Or separator |.
Then it check whether the string is ended of not. If ended, it will return the string. Otherwise skip that part of regex.

The below code gives what you want

import java.util.regex.*; public class RegTest{   public static void main(String[] arg){         String regex = "(?<m1>(hello|universe))$|(?<m2>(hello world))";         String s = "hello world";         Pattern pattern = Pattern.compile(regex);         Matcher matcher = pattern.matcher(s);         while(matcher.find()) {             MatchResult matchResult = matcher.toMatchResult();             String substring = s.substring(matchResult.start(), matchResult.end());             System.out.println(substring);         }     } } 

Likewise, the below code will skip hello , hello world and match hello world there
See the usage of $ there

import java.util.regex.*; public class RegTest{   public static void main(String[] arg){         String regex = "(?<m1>(hello|universe))$|(?<m2>(hello world))$|(?<m3>(hello world there))";         String s = "hello world there";         Pattern pattern = Pattern.compile(regex);         Matcher matcher = pattern.matcher(s);         while(matcher.find()) {             MatchResult matchResult = matcher.toMatchResult();             String substring = s.substring(matchResult.start(), matchResult.end());             System.out.println(substring);         }     } } 

Answers 2

If the structure of the regex is always the same, this should work:

String regex = "(?<m1>(hello|universe))|(?<m2>(hello world))"; String s = "hello world";  //split the regex into the different groups String[] allParts = regex.split("\\|\\(\\?\\<"); for (int i=1; i<allParts.length; i++) {     allParts[i] = "(?<" + allParts[i]; }  //find the longest string int longestSize = -1; String longestString = null; for (int i=0; i<allParts.length; i++) {     Pattern pattern = Pattern.compile(allParts[i]);     Matcher matcher = pattern.matcher(s);     while(matcher.find()) {         MatchResult matchResult = matcher.toMatchResult();         String substring = s.substring(matchResult.start(), matchResult.end());         if (substring.length() > longestSize) {             longestSize = substring.length();             longestString = substring;         }     } } System.out.println("Longest: " + longestString); 

Answers 3

If you are dealing with just this specific pattern:

  1. There is one or more named group on the highest level connected by |.
  2. The regex for the group is put in superfluous braces.
  3. Inside those braces is one or more literal connected by |.
  4. Literals never contain |, ( or ).

Then it is possible to write a solution by extracting the literals, sorting them by their length and then returning the first match:

private static final Pattern g = Pattern.compile("\\(\\?\\<[^>]+\\>\\(([^)]+)\\)\\)");  public static final String findLongestMatch(String s, Pattern p) {     Matcher m = g.matcher(p.pattern());     List<String> literals = new ArrayList<>();     while (m.find())         Collections.addAll(literals, m.group(1).split("\\|"));     Collections.sort(literals, new Comparator<String>() {         public int compare(String a, String b) {             return Integer.compare(b.length(), a.length());         }     });     for (Iterator<String> itr = literals.iterator(); itr.hasNext();) {          String literal = itr.next();          if (s.indexOf(literal) >= 0)               return literal;     }     return null; } 

Test:

System.out.println(findLongestMatch(     "hello world",     Pattern.compile("(?<m1>(hello|universe))|(?<m2>(hello world))") )); // output: hello world System.out.println(findLongestMatch(     "hello universe",     Pattern.compile("(?<m1>(hello|universe))|(?<m2>(hello world))") )); // output: universe 
If You Enjoyed This, Take 5 Seconds To Share It

0 comments:

Post a Comment