Monday, June 18, 2018

Heading identification with Regex


I'm working on a program that parses a PDF file and looks for a specific section. Once it finds the section, it collects all of that section's subsections and their content and stores them in a Dictionary<string, string>. I start by reading the entire PDF into a string, and then use this function to locate the "marking" section.

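    private string GetMarkingSection(string text)
    {
        int startIndex = 0;
        int endIndex = text.Length; // default: the section runs to the end of the text
        bool startIndexFound = false;
        Regex rx = new Regex(HEADINGREGEX);
        foreach (Match match in rx.Matches(text))
        {
            if (startIndexFound)
            {
                // The first heading after the "marking" heading closes the section.
                endIndex = match.Index;
                break;
            }
            if (match.ToString().ToLower().Contains("marking"))
            {
                startIndex = match.Index;
                startIndexFound = true;
            }
        }
        // No "marking" heading found: return an empty string, as before.
        if (!startIndexFound) return string.Empty;
        return text.Substring(startIndex, endIndex - startIndex);
    }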

Once the marking section is found, I use this to find subsections.

    private Dictionary<string, string> GetSubsections(string text)
    {
        Dictionary<string, string> subsections = new Dictionary<string, string>();
        // SUBSECTIONREGEX contains a capture group, so Regex.Split returns the
        // captured titles interleaved with the text between them.
        string[] unprocessedSubSecs = Regex.Split(text, SUBSECTIONREGEX);
        string title = "";
        string content = "";
        foreach (string s in unprocessedSubSecs)
        {
            if (s != "") //sometimes it pulls in empty strings
            {
                Match m = Regex.Match(s, SUBSECTIONREGEX);
                if (m.Success)
                {
                    // This piece is a subsection title; remember it for the next piece.
                    title = s;
                }
                else
                {
                    // This piece is body text belonging to the most recent title.
                    content = s;
                    if (!String.IsNullOrWhiteSpace(content) && !String.IsNullOrWhiteSpace(title))
                    {
                        subsections.Add(title, content);
                    }
                }
            }
        }
        return subsections;
    }

Getting these methods to work the way I want isn't the issue; the problem is getting them to work with every document. I'm working on a commercial application, so any API that requires a license isn't going to work for me. These documents are anywhere from 1 to 16 years old, so the formatting varies quite a bit. Here is a link to some sample headings and subheadings from various documents. But to make it easy, here are the regex patterns I'm using:

  • Heading: (?m)^(\d+\.\d+\s[ \w,\-]+)\r?$
  • Subheading: (?m)^(\d\.[\d.]+ ?[ \w]+) ?\r?$
  • Master Key: (?m)^(\d\.?[\d.]*? ?[ \-,:\w]+) ?\r?$

Since some documents format their headings the way other documents format their subheadings, I can't use the same heading regex for every file, and the same goes for my subheading regex.
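To make the overlap concrete, here is a small sketch that runs the Heading and Subheading patterns from the list above against single heading lines. The class name `PatternOverlapDemo` and the sample strings in the note below are made up for illustration; they are not taken from the real documents.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, only meant to show where the two patterns overlap.
public static class PatternOverlapDemo
{
    // Patterns copied verbatim from the list above.
    public const string HeadingPattern = @"(?m)^(\d+\.\d+\s[ \w,\-]+)\r?$";
    public const string SubheadingPattern = @"(?m)^(\d\.[\d.]+ ?[ \w]+) ?\r?$";

    // True when the given line matches the Heading pattern.
    public static bool IsHeading(string line)
    {
        return Regex.IsMatch(line, HeadingPattern);
    }

    // True when the given line matches the Subheading pattern.
    public static bool IsSubheading(string line)
    {
        return Regex.IsMatch(line, SubheadingPattern);
    }
}
```

A two-level line such as "5.1 Marking" satisfies both patterns, while a three-level line such as "5.1.2 Label Marking" satisfies only the Subheading pattern. That is why the pattern alone cannot tell whether a two-level number is a heading or a subheading in a given document.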

My alternative was to write a master key (the third pattern above, also listed in the regex link) that identifies all types of headings, then locate the last numeric component of each heading (5.1.X) and look for 5.1.X+1 to find the end of that section.
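The "find 5.1.X, then look for 5.1.X+1" idea could be sketched roughly as follows. The class `HeadingNumber` and its methods are hypothetical names, and the sketch assumes every heading starts with a dotted number such as "5.1.3".

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for working with dotted heading numbers.
public static class HeadingNumber
{
    // Extracts the leading dotted number of a heading, e.g. "5.1.3 Foo" -> [5, 1, 3].
    // Returns an empty array when the line does not start with a number.
    public static int[] Parse(string heading)
    {
        Match m = Regex.Match(heading, @"^\s*(\d+(?:\.\d+)*)");
        if (!m.Success)
        {
            return new int[0];
        }
        return m.Groups[1].Value.Split('.').Select(part => int.Parse(part)).ToArray();
    }

    // Increments the last component: [5, 1, 3] -> "5.1.4".
    public static string NextSibling(int[] number)
    {
        if (number.Length == 0)
        {
            throw new ArgumentException("Heading number must have at least one component.");
        }
        int[] next = (int[])number.Clone();
        next[next.Length - 1]++;
        return string.Join(".", next.Select(n => n.ToString()));
    }
}
```

The end of a section would then be the position of the heading whose number equals `NextSibling(...)`, which only works when the document's numbering is actually consecutive.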

That's when I ran into another problem: some of these files have no consistent structure at all. Most of them jump from 5.2 straight to 7.1.5, where 5.2 -> 5.3 or 6.0 would be expected.
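One possible direction that tolerates such gaps is to stop predicting the next number and instead end a section at the first later heading that is not nested under it. The sketch below is untested against the real documents. It assumes the heading numbers have already been parsed into integer arrays in document order, and `SectionBounds` and its methods are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: finds where a section ends without assuming consecutive numbering.
public static class SectionBounds
{
    // True when candidate is nested under parent, e.g. 5.2.1 under 5.2.
    public static bool IsDescendant(int[] parent, int[] candidate)
    {
        return candidate.Length > parent.Length
            && parent.SequenceEqual(candidate.Take(parent.Length));
    }

    // Given heading numbers in document order and the index of the section's own
    // heading, returns the index of the first later heading that is not nested
    // under it, or numbers.Count when the section runs to the end of the document.
    public static int FindSectionEnd(IList<int[]> numbers, int start)
    {
        for (int i = start + 1; i < numbers.Count; i++)
        {
            if (!IsDescendant(numbers[start], numbers[i]))
            {
                return i;
            }
        }
        return numbers.Count;
    }
}
```

With headings 5.1, 5.2, 5.2.1, 7.1.5 in that order, the section that starts at 5.2 ends at 7.1.5, even though 5.3 and 6.0 never appear. A top-level heading written as 6.0 would need its trailing zero stripped first, otherwise 6.1 would not count as nested under it.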

I'm trying to wrap my head around a solution for something like this, but I've got nothing so far. I'm open to ideas that don't involve regex as well.

