Wednesday, May 23, 2018

Parse 'ul' and 'ol' tags


I have to handle deep nesting of ul, ol, and li tags and render the same view that the browser gives. I want to reproduce the following example in a PDF file:

text = "<body>
  <ol>
    <li>One</li>
    <li>Two
      <ol>
        <li>Inner One</li>
        <li>inner Two
          <ul>
            <li>hey
              <ol>
                <li>hiiiiiiiii</li>
                <li>why</li>
                <li>hiiiiiiiii</li>
              </ol>
            </li>
            <li>aniket </li>
          </li>
        </ul>
        <li>sup </li>
        <li>there </li>
      </ol>
      <li>hey </li>
      <li>Three</li>
    </li>
  </ol>
  <ol>
    <li>Introduction</li>
    <ol>
      <li>Introduction</li>
    </ol>
    <li>Description</li>
    <li>Observation</li>
    <li>Results</li>
    <li>Summary</li>
  </ol>
  <ul>
    <li>Introduction</li>
    <li>Description
      <ul>
        <li>Observation
          <ul>
            <li>Results
              <ul>
                <li>Summary</li>
              </ul>
            </li>
          </ul>
        </li>
      </ul>
    </li>
    <li>Overview</li>
  </ul>
</body>"

I have to use Prawn for my task, but Prawn doesn't support HTML tags, so I came up with a solution using Nokogiri: I parse the document and later remove the tags with gsub. I have written the solution below for part of the above content, but the problem is that the ul and ol nesting can vary.
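By "removing the tags with gsub" I mean something along these lines; the regex below is only an illustration of the idea, not necessarily the exact expression I use:

# Naive tag removal: strip anything that looks like an HTML tag.
plain_text = text.gsub(/<\/?[^>]+>/, '')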

require 'nokogiri'

RULES = {
  ol: {
    1 => ->(index) { "#{index + 1}. " },
    2 => ->(index) { "#{}" },   # prefixes for deeper levels are still empty placeholders
    3 => ->(index) { "#{}" },
    4 => ->(index) { "#{}" }
  },
  ul: {
    1 => ->(_) { "\u2022 " },
    2 => ->(_) { "" },
    3 => ->(_) { "" },
    4 => ->(_) { "" },
  }
}

def ol_rule(group, deepness: 1)
  group.search('> li').each_with_index do |item, i|
    prefix = RULES[:ol][deepness].call(i)
    item.prepend_child(prefix)
    descend(item, deepness + 1)
  end
end

def ul_rule(group, deepness: 1)
  group.search('> li').each_with_index do |item, i|
    prefix = RULES[:ul][deepness].call(i)
    item.prepend_child(prefix)
    descend(item, deepness + 1)
  end
end

def descend(item, deepness)
  item.search('> ol').each do |ol|
    ol_rule(ol, deepness: deepness)
  end
  item.search('> ul').each do |ul|
    ul_rule(ul, deepness: deepness)
  end
end

doc = Nokogiri::HTML.fragment(text)

doc.search('ol').each do |group|
  ol_rule(group, deepness: 1)
end

doc.search('ul').each do |group|
  ul_rule(group, deepness: 1)
end

puts doc.inner_text

This prints:

1. One
2. Two
1. Inner One
2. inner Two
• hey
1. hiiiiiiiii
2. why
3. hiiiiiiiii
• aniket
3. sup
4. there
3. hey
4. Three
1. Introduction
1. Introduction
2. Description
3. Observation
4. Results
5. Summary
• Introduction
• Description
• Observation
• Results
• Summary
• Overview

Problem

1) How do I handle spacing/indentation when working with nested ul and ol tags?
2) How do I handle deep nesting when li elements appear inside a ul or an ol?

2 Answers

Answer 1

I've come up with a solution that handles multiple indentation levels with configurable numbering rules per level:

require 'nokogiri'

ROMANS = %w[i ii iii iv v vi vii viii ix]

RULES = {
  ol: {
    1 => ->(index) { "#{index + 1}. " },
    2 => ->(index) { "#{('a'..'z').to_a[index]}. " },
    3 => ->(index) { "#{ROMANS[index]}. " },
    4 => ->(index) { "#{ROMANS[index].upcase}. " }
  },
  ul: {
    1 => ->(_) { "\u2022 " },
    2 => ->(_) { "\u25E6 " },
    3 => ->(_) { "* " },
    4 => ->(_) { "- " },
  }
}

def ol_rule(group, deepness: 1)
  group.search('> li').each_with_index do |item, i|
    prefix = RULES[:ol][deepness].call(i)
    item.prepend_child(prefix)
    descend(item, deepness + 1)
  end
end

def ul_rule(group, deepness: 1)
  group.search('> li').each_with_index do |item, i|
    prefix = RULES[:ul][deepness].call(i)
    item.prepend_child(prefix)
    descend(item, deepness + 1)
  end
end

def descend(item, deepness)
  item.search('> ol').each do |ol|
    ol_rule(ol, deepness: deepness)
  end
  item.search('> ul').each do |ul|
    ul_rule(ul, deepness: deepness)
  end
end

doc = Nokogiri::HTML.fragment(text)

doc.search('ol:root').each do |group|
  ol_rule(group, deepness: 1)
end

doc.search('ul:root').each do |group|
  ul_rule(group, deepness: 1)
end

You can then remove the tags or use doc.inner_text depending on your environment.
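For what it's worth, here is a minimal sketch of handing that flattened text to Prawn; it assumes the snippet above has already run, and the file name and per-line rendering are placeholders of mine rather than anything from the question:

require 'prawn'

# Write each non-empty line of the flattened list text into a PDF.
Prawn::Document.generate('lists.pdf') do |pdf|
  doc.inner_text.each_line do |line|
    next if line.strip.empty?
    pdf.text(line.rstrip)
  end
end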

Two caveats though:

  1. Your entry selector must be chosen carefully. I used your snippet verbatim, without a root element, so I had to use ul:root/ol:root. Maybe "body > ol" works for your situation too, or you could select every ol/ul and then keep only those that have no list ancestor (see the sketch after these caveats).
  2. Using your example verbatim, Nokogiri does not handle the last two list items of the first ol very well ("hey", "Three"): because of the invalid nesting, those elements have already "left" their ol tree during parsing and end up in the root tree.
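A rough sketch of that last idea from the first caveat (select every ol/ul and keep only the ones without a list ancestor) could look like this, assuming the ol_rule/ul_rule methods defined above:

# Keep only lists that are not nested inside another list.
top_level_lists = doc.search('ol, ul').reject do |node|
  node.ancestors.any? { |ancestor| %w[ol ul].include?(ancestor.name) }
end

top_level_lists.each do |group|
  if group.name == 'ol'
    ol_rule(group, deepness: 1)
  else
    ul_rule(group, deepness: 1)
  end
end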

Current Output:

1. One
2. Two
    a. Inner One
    b. inner Two
        ◦ hey
        ◦ hey
    3. hey
    4. hey
hey
Three

1. Introduction
    a. Introduction
2. Description
3. Observation
4. Results
5. Summary

• Introduction
• Description
    ◦ Observation
        * Results
            - Summary
• Overview

Answer 2

Whenever you are in an ol, li, or ul element, recursively check for nested ol, li, and ul elements. If there are none, return what has been discovered so far as a substructure; if there are, call the same function on the new node and add its return value to the current structure.

You perform a different action on each node depending on its type, no matter where it sits, and the function then repackages everything automatically.
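As a rough sketch of that idea, assuming Nokogiri (the walk_list name and the hash shape are mine, not part of any library):

require 'nokogiri'

# Walk one list node and return a nested structure of items and their children.
def walk_list(list)
  list.search('> li').map do |li|
    {
      text: li.xpath('text()').text.strip,
      children: li.search('> ol', '> ul').flat_map { |sub| walk_list(sub) }
    }
  end
end

doc = Nokogiri::HTML.fragment(text)
structure = doc.children
               .select { |node| %w[ol ul].include?(node.name) }
               .map { |list| walk_list(list) }

Rendering such a structure is then another recursive walk, where each level decides its own prefix and indentation.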

