I'm working on a program that parses a .pdf file and looks for a specific section. Once it finds the section, it finds all of that section's subsections and their content and stores them in a Dictionary<string, string>. I start by reading the entire PDF into a string, and then use this function to locate the "marking" section.
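    private string GetMarkingSection(string text)
    {
        int startIndex = 0;
        // Default to the end of the text so that a marking section
        // with no heading after it still yields a valid range.
        int endIndex = text.Length;
        bool startIndexFound = false;
        Regex rx = new Regex(HEADINGREGEX);

        foreach (Match match in rx.Matches(text))
        {
            // The first heading after the marking heading ends the section.
            if (startIndexFound)
            {
                endIndex = match.Index;
                break;
            }

            if (match.ToString().ToLower().Contains("marking"))
            {
                startIndex = match.Index;
                startIndexFound = true;
            }
        }

        // No marking heading at all: return nothing rather than the whole text.
        if (!startIndexFound)
        {
            return string.Empty;
        }

        return text.Substring(startIndex, endIndex - startIndex);
    }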
Once the marking section is found, I use this to find subsections.
    private Dictionary<string, string> GetSubsections(string text)
    {
        Dictionary<string, string> subsections = new Dictionary<string, string>();

        // Because SUBSECTIONREGEX contains a capture group, Regex.Split
        // keeps the matched titles in the result, interleaved with the content.
        string[] unprocessedSubSecs = Regex.Split(text, SUBSECTIONREGEX);
        string title = "";
        string content = "";

        foreach (string s in unprocessedSubSecs)
        {
            if (s != "") // sometimes it pulls in empty strings
            {
                Match m = Regex.Match(s, SUBSECTIONREGEX);
                if (m.Success)
                {
                    title = s;
                }
                else
                {
                    content = s;
                    if (!String.IsNullOrWhiteSpace(content) && !String.IsNullOrWhiteSpace(title))
                    {
                        subsections.Add(title, content);
                    }
                }
            }
        }

        return subsections;
    }
Getting these methods to work the way I want isn't the issue; the problem is getting them to work with every one of the documents. I'm working on a commercial application, so any API that requires a license isn't going to work for me. These documents are anywhere from 1 to 16 years old, so the formatting varies quite a bit. Here is a link to some sample headings and subheadings from various documents. But to make it easy, here are the regex patterns I'm using:
- Heading:
(?m)^(\d+\.\d+\s[ \w,\-]+)\r?$
- Subheading:
(?m)^(\d\.[\d.]+ ?[ \w]+) ?\r?$
- Master Key:
(?m)^(\d\.?[\d.]*? ?[ \-,:\w]+) ?\r?$
Since some documents use the subheading format for their headings, I am unable to use the same heading regex for each file, and the same goes for my subheading regex.
My alternative was to write a master key pattern (listed in the regex link) that identifies all types of headings, then take the last numeric component of each heading (the X in 5.1.X) and look for 5.1.X+1 to find the end of that section.
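To make the 5.1.X to 5.1.X+1 idea concrete, here is a minimal sketch of the number handling. The class and method names (HeadingNumbers, Parse, NextSibling) are hypothetical, and it assumes every heading starts with a dotted number such as "5.1.3", which is what the master key pattern is meant to match.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class HeadingNumbers
{
    // Assumption: the heading line begins with digits separated by dots.
    private static readonly Regex NumberPrefix = new Regex(@"^\s*(\d+(?:\.\d+)*)");

    // "5.1.3 Marking" -> { 5, 1, 3 }; an empty array if there is no number.
    public static int[] Parse(string heading)
    {
        Match m = NumberPrefix.Match(heading);
        if (!m.Success)
        {
            return new int[0];
        }
        return m.Groups[1].Value
            .Split('.')
            .Select(part => int.Parse(part))
            .ToArray();
    }

    // "5.1.3 Marking" -> "5.1.4"; an empty string if there is no number.
    public static string NextSibling(string heading)
    {
        int[] parts = Parse(heading);
        if (parts.Length == 0)
        {
            return string.Empty;
        }
        parts[parts.Length - 1]++;
        return string.Join(".", parts.Select(p => p.ToString()));
    }
}
```

The section would then end at the first heading whose number equals NextSibling of the starting heading, which only works when the numbering is consecutive.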
That's when I ran into another problem: some of these files have no proper structure at all. Most of them jump from 5.2 straight to 7.1.5 (5.2 followed by 5.3 or 6.0 would be expected).
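One way to cope with gaps like that, sketched here as an idea rather than a tested solution, is to stop looking for a specific next number and instead end the section at the first later heading that is not nested under the starting one. All names below (SectionBounds, ParseNumber, IsDescendant, FindSectionEnd) are hypothetical, the input is assumed to be the heading numbers already extracted in document order, and treating a trailing ".0" as a top-level marker (so "6.0" means section 6) is my assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SectionBounds
{
    // "5.2.1" -> { 5, 2, 1 }. Assumption: "6.0" is normalized to { 6 }.
    public static int[] ParseNumber(string number)
    {
        List<int> parts = number
            .Split('.')
            .Where(p => p.Length > 0)
            .Select(p => int.Parse(p))
            .ToList();
        while (parts.Count > 1 && parts[parts.Count - 1] == 0)
        {
            parts.RemoveAt(parts.Count - 1);
        }
        return parts.ToArray();
    }

    // True when candidate is strictly below parent, e.g. 5.2.1 under 5.2.
    public static bool IsDescendant(int[] parent, int[] candidate)
    {
        if (candidate.Length <= parent.Length)
        {
            return false;
        }
        for (int i = 0; i < parent.Length; i++)
        {
            if (candidate[i] != parent[i])
            {
                return false;
            }
        }
        return true;
    }

    // Index of the first heading after 'start' that is not nested under it,
    // or headingNumbers.Count when the section runs to the end.
    public static int FindSectionEnd(IList<string> headingNumbers, int start)
    {
        int[] parent = ParseNumber(headingNumbers[start]);
        for (int i = start + 1; i < headingNumbers.Count; i++)
        {
            if (!IsDescendant(parent, ParseNumber(headingNumbers[i])))
            {
                return i;
            }
        }
        return headingNumbers.Count;
    }
}
```

With the headings 5.1, 5.2, 5.2.1, 7.1.5, the section starting at 5.2 ends at 7.1.5 even though 5.3 and 6.0 never appear. It would still misfire if a stray numbered line inside the section happened to look like a heading.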
I'm trying to wrap my head around a solution for something like this, but I've got nothing... I am open to ideas not involving regex as well.