I'm parsing content generated by a wysiwyg into a table of contents widget in React.
So far I'm looping through the headers and adding them into an array.
How can I get them all into one multi-dimensional array or object (what's the best way) so that it looks more like:
h1-1
  h2-1
    h3-1
h1-2
  h2-2
    h3-2
h1-3
  h2-3
    h3-3
and then I can render it with an ordered list in the UI.
const str = "<h1>h1-1</h1><h2>h2-1</h2><h3>h3-1</h3><p>something</p><h1>h1-2</h1><h2>h2-2</h2><h3>h3-2</h3>";

const patternh1 = /<h1>(.*?)<\/h1>/g;
const patternh2 = /<h2>(.*?)<\/h2>/g;
const patternh3 = /<h3>(.*?)<\/h3>/g;

let h1s = [];
let h2s = [];
let h3s = [];
let matchh1, matchh2, matchh3;

while (matchh1 = patternh1.exec(str)) h1s.push(matchh1[1]);
while (matchh2 = patternh2.exec(str)) h2s.push(matchh2[1]);
while (matchh3 = patternh3.exec(str)) h3s.push(matchh3[1]);

console.log(h1s);
console.log(h2s);
console.log(h3s);
5 Answers
Answers 1
I don't know about you, but I hate parsing HTML using regexes. Instead, I think it's a better idea to let the DOM handle this:
const str = `<h1>h1-1</h1>
  <h3>h3-1</h3>
  <h3>h3-2</h3>
  <p>something</p>
  <h1>h1-2</h1>
  <h2>h2-2</h2>
  <h3>h3-2</h3>`;

const wrapper = document.createElement('div');
wrapper.innerHTML = str.trim();

let tree = [];
let leaf = null;

for (const node of wrapper.querySelectorAll("h1, h2, h3, h4, h5, h6")) {
  const nodeLevel = parseInt(node.tagName[1]);
  const newLeaf = {
    level: nodeLevel,
    text: node.textContent,
    children: [],
    parent: leaf
  };

  while (leaf && newLeaf.level <= leaf.level)
    leaf = leaf.parent;

  if (!leaf)
    tree.push(newLeaf);
  else
    leaf.children.push(newLeaf);

  leaf = newLeaf;
}

console.log(tree);
This answer does not require h3 to follow h2; h3 can follow h1 if you so please. If you want to turn this into an ordered list, that can also be done:
const str = `<h1>h1-1</h1>
  <h3>h3-1</h3>
  <h3>h3-2</h3>
  <p>something</p>
  <h1>h1-2</h1>
  <h2>h2-2</h2>
  <h3>h3-2</h3>`;

const wrapper = document.createElement('div');
wrapper.innerHTML = str.trim();

let tree = [];
let leaf = null;

for (const node of wrapper.querySelectorAll("h1, h2, h3, h4, h5, h6")) {
  const nodeLevel = parseInt(node.tagName[1]);
  const newLeaf = {
    level: nodeLevel,
    text: node.textContent,
    children: [],
    parent: leaf
  };

  while (leaf && newLeaf.level <= leaf.level)
    leaf = leaf.parent;

  if (!leaf)
    tree.push(newLeaf);
  else
    leaf.children.push(newLeaf);

  leaf = newLeaf;
}

const ol = document.createElement("ol");

(function makeOl(ol, leaves) {
  for (const leaf of leaves) {
    const li = document.createElement("li");
    li.appendChild(new Text(leaf.text));

    if (leaf.children.length > 0) {
      const subOl = document.createElement("ol");
      makeOl(subOl, leaf.children);
      li.appendChild(subOl);
    }

    ol.appendChild(li);
  }
})(ol, tree);

// add it to the DOM
document.body.appendChild(ol);

// or get it as text
const result = ol.outerHTML;
Since the HTML is parsed by the DOM and not by a regex, this solution will not encounter any errors if the h1 tags have attributes, for example.
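For a React widget it can be handy to keep the outline as plain data without the parent back-references, so it can be serialized or passed around as props. The following is a minimal sketch under that assumption, not part of the answer above: buildOutline is a hypothetical helper that takes a flat list of { level, text } objects (for instance collected from querySelectorAll) and uses a stack in place of the parent pointers. It assumes levels are integers from 1 to 6.

```javascript
// Hypothetical helper: builds a nested outline from a flat list of
// { level, text } headings. Assumes every level is an integer >= 1.
function buildOutline(headings) {
  // Virtual root at level 0, so it is never popped from the stack.
  const root = { level: 0, text: null, children: [] };
  const stack = [root]; // last element is the current parent candidate

  for (const { level, text } of headings) {
    // Pop until the top of the stack ranks strictly higher than this heading.
    while (stack[stack.length - 1].level >= level) stack.pop();

    const node = { level, text, children: [] };
    stack[stack.length - 1].children.push(node);
    stack.push(node);
  }
  return root.children;
}

// Same heading sequence as the snippet above (an h3 directly under an h1).
const outline = buildOutline([
  { level: 1, text: "h1-1" },
  { level: 3, text: "h3-1" },
  { level: 3, text: "h3-2" },
  { level: 1, text: "h1-2" },
  { level: 2, text: "h2-2" },
  { level: 3, text: "h3-2" },
]);

// No circular references, so the result can be stringified.
console.log(JSON.stringify(outline, null, 2));
```

Like the loop above, this tolerates skipped levels: both h3 headings end up as children of the first h1.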
Answers 2
You can simply gather all h* and then iterate over them to construct a tree as such:
Using ES6 (I inferred this is ok from your usage of const and let)
const str = `
  <h1>h1-1</h1>
  <h2>h2-1</h2>
  <h3>h3-1</h3>
  <p>something</p>
  <h1>h1-2</h1>
  <h2>h2-2</h2>
  <h3>h3-2</h3>
`
const patternh = /<h(\d)>(.*?)<\/h(\d)>/g;
let hs = [];
let matchh;

while (matchh = patternh.exec(str))
  hs.push({ lev: matchh[1], text: matchh[2] })

console.log(hs)

// constructs a tree with the format [{ value: ..., children: [{ value: ..., children: [...] }, ...] }, ...]
const add = (res, lev, what) => {
  if (lev === 0) {
    res.push({ value: what, children: [] });
  } else {
    add(res[res.length - 1].children, lev - 1, what);
  }
}

// reduces all hs found into a tree using above method starting with an empty list
const tree = hs.reduce((res, { lev, text }) => {
  add(res, lev-1, text);
  return res;
}, []);

console.log(tree);
But because your HTML headers are not in a tree structure themselves (which I guess is your use case), this only works under certain assumptions, e.g. you cannot have an <h3> unless there's an <h2> above it and an <h1> above that. It will also assume a lower-level header will always belong to the latest header of an immediately higher level.
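To make that assumption concrete: add() fails with a TypeError as soon as a heading is more than one level deeper than the heading before it (or the first heading is not an h1), because there is no parent list to descend into. A small guard along the following lines could check the input before building the tree. findLevelJumps is a hypothetical name introduced here, and the input is assumed to have the same { lev, text } shape as the hs array above.

```javascript
// Hypothetical guard: reports headings that add() above cannot place,
// i.e. a first heading deeper than h1 or a jump of more than one level down.
function findLevelJumps(hs) {
  const problems = [];
  let prev = 0; // treat the start of the document as "level 0"

  for (const { lev, text } of hs) {
    const level = Number(lev); // lev is a string when it comes from a regex match
    if (level > prev + 1) {
      problems.push({ lev: level, text, after: prev });
    }
    prev = level;
  }
  return problems;
}

// An h3 directly after an h1 is reported; the later h2 is fine.
const jumps = findLevelJumps([
  { lev: "1", text: "h1-1" },
  { lev: "3", text: "h3-1" },
  { lev: "2", text: "h2-1" },
]);
console.log(jumps);
```

If the returned list is empty, the reduce-based construction above should run without hitting an undefined parent.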
If you want to further use the tree structure for e.g. rendering a representative ordered-list for a TOC, you can do something like:
// function to render a bunch of <li>s
const renderLIs = children => children.map(child => `<li>${renderOL(child)}</li>`).join('');

// function to render an <ol> from a tree node
const renderOL = tree => tree.children.length > 0 ? `<ol>${tree.value}${renderLIs(tree.children)}</ol>` : tree.value;

// use a root node for the TOC
const toc = renderOL({ value: 'TOC', children: tree });

console.log(toc);
Hope it helps.
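One thing to note about renderOL above is that it places the heading text directly inside the <ol>, ahead of the <li> elements, whereas an <ol> is only meant to contain <li> children. A possible alternative, sketched here under the assumption that each value is plain text rather than markup, keeps the text inside the <li> and nests the sub-list there. renderList and escapeHtml are hypothetical names, and the tree uses the same [{ value, children }] format as above.

```javascript
// Escapes the characters that are significant in HTML text content.
// Assumes values are plain text; already-encoded entities would be double-escaped.
const escapeHtml = (s) =>
  s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");

// Renders [{ value, children }] as nested <ol> markup; each sub-list sits
// inside the <li> of its parent heading. An empty list renders as "".
const renderList = (nodes) =>
  nodes.length === 0
    ? ""
    : `<ol>${nodes
        .map((n) => `<li>${escapeHtml(n.value)}${renderList(n.children)}</li>`)
        .join("")}</ol>`;

const sampleTree = [
  { value: "h1-1", children: [{ value: "h2-1", children: [] }] },
  { value: "h1-2", children: [] },
];

console.log(renderList(sampleTree));
// <ol><li>h1-1<ol><li>h2-1</li></ol></li><li>h1-2</li></ol>
```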
Answers 3
What you want to do is known as (a variant of) a document outline, i.e. creating a nested list from the headings of a document, honoring their hierarchy.
A simple implementation for the browser using the DOM and DOMParser APIs goes as follows (put into an HTML page and coded in ES5 for easy testing):
<!DOCTYPE html>
<html>
<head>
  <title>Document outline</title>
</head>
<body>
  <div id="outline"></div>
  <script>
    // test string wrapped in a document (and body) element
    var str = "<html><body><h1>h1-1</h1><h2>h2-1</h2><h3>h3-1</h3><p>something</p><h1>h1-2</h1><h2>h2-2</h2><h3>h3-2</h3></body></html>";

    // util for traversing a DOM and emit SAX startElement events
    function emitSAXLikeEvents(node, handler) {
      handler.startElement(node)
      for (var i = 0; i < node.children.length; i++)
        emitSAXLikeEvents(node.children.item(i), handler)
      handler.endElement(node)
    }

    var outline = document.getElementById('outline')
    var rank = 0
    var context = outline

    emitSAXLikeEvents(
      (new DOMParser()).parseFromString(str, "text/html").body,
      {
        startElement: function(node) {
          if (/h[1-6]/.test(node.localName)) {
            var newRank = +node.localName.substr(1, 1)

            // set context li node to append
            while (newRank <= rank--)
              context = context.parentNode.parentNode
            rank = newRank

            // create (if 1st li) or
            // get (if 2nd or subsequent li) ol element
            var ol
            if (context.children.length > 0)
              ol = context.children[0]
            else {
              ol = document.createElement('ol')
              context.appendChild(ol)
            }

            // create and append li with text from
            // heading element
            var li = document.createElement('li')
            li.appendChild(
              document.createTextNode(node.innerText))
            ol.appendChild(li)
            context = li
          }
        },
        endElement: function(node) {}
      })
  </script>
</body>
</html>
I first parse your fragment into a Document, then traverse it to generate SAX-like startElement() calls. In the startElement() function, the rank of a heading element is checked against the rank of the most recently created list item (if any). Then a new list item is appended at the correct hierarchy level, and, if needed, an ol element is created as a container for it. Note that the algorithm as it stands won't work when "jumping" from h1 to h3 in the hierarchy, but it can be easily adapted.
If you want to create an outline/table of contents on node.js, the code could be made to run server-side, but that requires a decent HTML parsing lib (a DOMParser polyfill for node.js, so to speak). There are also the https://github.com/h5o/h5o-js and the https://github.com/hoyois/html5outliner packages for creating outlines, though I haven't tested those. These packages supposedly can also deal with corner cases such as heading elements in iframe and quote elements, which you generally don't want in the outline of your document.
The topic of creating an HTML5 outline has a long history; see e.g. http://html5doctor.com/computer-says-no-to-html5-document-outline/. HTML4's practice of using no wrapper elements (sectioning roots, in HTML5 parlance) for sectioning, and placing headings and content at the same hierarchy level, is known as "flat-earth markup". SGML has the RANK feature for dealing with H1, H2, etc. ranked elements, and can be made to infer omitted section elements, and thus automatically create an outline, from HTML4-like "flat-earth markup" in simple cases (e.g. where only section or another single element is allowed as sectioning root).
Answers 4
I'll use a single regex to get the <hx></hx> contents and then group them by x using Array.reduce.
Here is the base, but it's not finished yet:
// The string you need to parse
const str = "\
<h1>h1-1</h1>\
<h2>h2-1</h2>\
<h3>h3-1</h3>\
<p>something</p>\
<h1>h1-2</h1>\
<h2>h2-2</h2>\
<h3>h3-2</h3>";

// The regex that will cut out each <hx>something</hx>
const regex = /<h[0-9]{1}>(.*?)<\/h[0-9]{1}>/g;

// We get the matches now
const matches = str.match(regex);

// We group the hx together as requested
const matchesSorted = Object.values(matches.reduce((tmp, x) => {
  // We get the number behind hx ---> the x
  const hNumber = x[2];

  // If the container does not exist, create it
  if (!tmp[hNumber]) {
    tmp[hNumber] = [];
  }

  // Push the new parsed content into the array
  // 4 is to start after <hx>
  // length - 9 is to get all except <hx></hx>
  tmp[hNumber].push(x.substr(4, x.length - 9));

  return tmp;
}, {}));

console.log(matchesSorted);
As you are parsing HTML content, I want to make you aware of special cases such as the presence of \n or spaces. For example, look at the following non-working snippet:
// The string you need to parse
const str = "\
<h1>h1-1\n\
</h1>\
<h2> h2-1</h2>\
<h3>h3-1</h3>\
<p>something</p>\
<h1>h1-2 </h1>\
<h2>h2-2 \n\
</h2>\
<h3>h3-2</h3>";

// The regex that will cut out each <hx>something</hx>
const regex = /<h[0-9]{1}>(.*?)<\/h[0-9]{1}>/g;

// We get the matches now
const matches = str.match(regex);

// We group the hx together as requested
const matchesSorted = Object.values(matches.reduce((tmp, x) => {
  // We get the number behind hx ---> the x
  const hNumber = x[2];

  // If the container does not exist, create it
  if (!tmp[hNumber]) {
    tmp[hNumber] = [];
  }

  // Push the new parsed content into the array
  // 4 is to start after <hx>
  // length - 9 is to get all except <hx></hx>
  tmp[hNumber].push(x.substr(4, x.length - 9));

  return tmp;
}, {}));

console.log(matchesSorted);
We have to add .replace() and .trim() in order to remove the unwanted \n and spaces.
Use this snippet:
// The string you need to parse
const str = "\
<h1>h1-1\n\
</h1>\
<h2> h2-1</h2>\
<h3>h3-1</h3>\
<p>something</p>\
<h1>h1-2 </h1>\
<h2>h2-2 \n\
</h2>\
<h3>h3-2</h3>";

// Remove all unwanted \n
const preparedStr = str.replace(/(\r\n\t|\n|\r\t)/gm, "");

// The regex that will cut out each <hx>something</hx>
const regex = /<h[0-9]{1}>(.*?)<\/h[0-9]{1}>/g;

// We get the matches now
const matches = preparedStr.match(regex);

// We group the hx together as requested
const matchesSorted = Object.values(matches.reduce((tmp, x) => {
  // We get the number behind hx ---> the x
  const hNumber = x[2];

  // If the container does not exist, create it
  if (!tmp[hNumber]) {
    tmp[hNumber] = [];
  }

  // Push the new parsed content into the array
  // 4 is to start after <hx>
  // length - 9 is to get all except <hx></hx>
  // call trim() to remove unwanted spaces
  tmp[hNumber].push(x.substr(4, x.length - 9).trim());

  return tmp;
}, {}));

console.log(matchesSorted);
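If you stay with a regex, the pre-processing step and the fixed substr offsets can probably be avoided by letting the pattern itself cope with newlines and attributes. The following is only a sketch of that idea: headingPattern and extractHeadings are hypothetical names, and the pattern is an assumption about what the WYSIWYG output looks like. It does not handle headings nested inside other headings, and any inline tags inside a heading are left in the text.

```javascript
// Assumed pattern: level digit limited to 1-6, optional attributes after the
// tag name, [\s\S] so the content may span lines, and \1 so the closing tag
// must have the same level as the opening tag.
const headingPattern = /<h([1-6])\b[^>]*>([\s\S]*?)<\/h\1\s*>/gi;

// Returns the headings in document order as { level, text } objects.
function extractHeadings(html) {
  return Array.from(html.matchAll(headingPattern), (m) => ({
    level: Number(m[1]),
    // Collapse runs of whitespace (including newlines) and trim the ends.
    text: m[2].replace(/\s+/g, " ").trim(),
  }));
}

const sample =
  '<h1 class="title">h1-1\n</h1><h2> h2-1</h2><p>something</p><h2>h2-2 \n</h2>';

console.log(extractHeadings(sample));
// [ { level: 1, text: 'h1-1' }, { level: 2, text: 'h2-1' }, { level: 2, text: 'h2-2' } ]
```

Because the result keeps document order and the level of each heading, it can be grouped by level as above or fed into a tree-building step.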
Answers 5
I wrote this code to work with jQuery. (Please don't downvote; maybe someone will need a jQuery answer later.)
This recursive function creates lis from the string, and if an item has children, it converts them to an ol.
const str = "<div><h1>h1-1</h1><h2>h2-1</h2><h3>h3-1</h3></div><p>something</p><h1>h1-2</h1><h2>h2-2</h2><h3>h3-2</h3>";

function strToList(stri) {
  const tags = $(stri);

  function partToList(el) {
    let output = "<li>";
    if ($(el).children().length) {
      output += "<ol>";
      $(el)
        .children()
        .each(function() {
          output += partToList($(this));
        });
      output += "</ol>";
    } else {
      output += $(el).text();
    }
    return output + "</li>";
  }

  let output = "<ol>";
  tags.each(function(itm) {
    output += partToList($(this));
  });
  return output + "</ol>";
}

$("#output").append(strToList(str));
li { padding: 10px; }
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script> <div id="output"></div>
(This code can be converted to pure JS easily)