I want to open this file and get all elements that start with us-gaap.
ftp://ftp.sec.gov/edgar/data/916789/0001558370-15-001143.txt
To get the elements, I tried this:
    require 'nokogiri'
    str = '<html><body><us-gaap:foo>foo</us-gaap:foo></body></html>'
    doc = Nokogiri::XML(str) # parse the string itself; File.read(str) would treat it as a file name
    doc.xpath('//us-gaap:*')
    Nokogiri::XML::XPath::SyntaxError: Undefined namespace prefix: //us-gaap:*
    from /Users/ironsand/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/searchable.rb:165:in `evaluate'

doc.namespaces returns {}, so I think I have to add the us-gaap namespace.
There are some questions about "adding a namespace with Nokogiri", but they seem to be about creating a new XML document, not about adding a namespace to an existing one.
How can I add a namespace to an existing document?
I know I can remove the namespaces with Nokogiri::XML::Document#remove_namespaces!, but I don't want to use it because it also removes necessary information.
3 Answers
Answers 1
You have asked an XY Problem. You think that the problem is that you need to add a missing namespace; the real problem is that the file you're trying to parse is not valid XML.
    require 'nokogiri'
    doc = Nokogiri.XML( IO.read('0001558370-15-001143.txt') )
    doc.errors.length #=> 5716

For example, the <ACCEPTANCE-DATETIME> 'element' opened on line 3 is never closed, and on line 16 there is a raw ampersand in the text:
STANDARD INDUSTRIAL CLASSIFICATION: ELECTRIC HOUSEWARES & FANS [3634]
which ought to be escaped as an entity.
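To illustrate what "escaped as an entity" means for that line, here is a minimal sketch in plain Ruby. The names BARE_AMP and escape_bare_ampersands are made up for this example, and the pattern only covers simple named and numeric entities, so treat it as an illustration rather than a way to repair the whole file:

```ruby
# Hypothetical helper for illustration: escape ampersands that are not
# already the start of an entity reference such as &amp; or &#38;.
BARE_AMP = /&(?!(?:[A-Za-z][A-Za-z0-9]*|#\d+|#x\h+);)/

def escape_bare_ampersands(text)
  text.gsub(BARE_AMP, '&amp;')
end

line = 'STANDARD INDUSTRIAL CLASSIFICATION: ELECTRIC HOUSEWARES & FANS [3634]'
puts escape_bare_ampersands(line)
# prints: STANDARD INDUSTRIAL CLASSIFICATION: ELECTRIC HOUSEWARES &amp; FANS [3634]
```

Escaping alone would not make this file parse, since unclosed tags such as <ACCEPTANCE-DATETIME> remain.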
However, the document has valid XML fragments within it! In particular, there is one XML document that defines xmlns:us-gaap namespace, from lines 27243-49312. Let's extract just that, using only the knowledge that the root element defines the namespace we want, and the assumptions that no element with the same name is nested within the document, and that the root element does not have an unescaped > character in any attribute. (These assumptions are valid for this file, but may not be valid for every XML file.)
    txt = IO.read('0001558370-15-001143.txt')
    # \s* allows optional whitespace between the prefix and the equals sign
    gaap_finder = %r{(<(\w+) [^>]+xmlns:us-gaap\s*=.+?</\2>)}m
    txt.scan(gaap_finder) do |xml,_|
      doc = Nokogiri.XML( xml )
      gaaps = doc.xpath('//us-gaap:*')
      p gaaps.length #=> 569
    end

The code above handles the case where there may be more than one XML document in the txt file, though in this case there is only one.
Decoded, the gaap_finder regex says this:
- %r{...}m -- this is a regular expression (that allows slashes in it, unescaped) with "multiline mode", where a period will match newline characters
- (...) -- capture everything we find
- < -- start with a literal "less-than" symbol
- (\w+) -- find one or more word characters (the tag name), and save them
- (a single space) -- the word characters must be followed by a space (important to avoid capturing the <xsd:xbrl ...> element in this file)
- [^>]+ -- followed by one or more characters that are NOT a "greater-than" symbol (to ensure that we stay in the same element that we started in)
- xmlns:us-gaap\s*= -- followed by this literal namespace declaration (which may have whitespace separating it from the equals sign)
- .+? -- followed by anything (as little as possible)...
- </\2> -- ...up until you see a closing tag with the same name as what we captured for the name of the starting tag
Because of the way scan works when the regex has capturing groups, each result is a two-element array, where the first element is the entire captured XML and the second element is the name of the tag that we captured (which we "discard" by assigning it to the _ variable).
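To see the two-element results without downloading the filing, here is a small sketch that runs the same pattern over an inline string using only plain Ruby (no Nokogiri). The sample text, element names and example.com URIs are invented for illustration:

```ruby
# A made-up stand-in for the filing: junk text around two embedded XML documents.
sample = <<~TXT
  <SEC-HEADER>not XML & never closed
  <xbrl xmlns="http://example.com/base" xmlns:us-gaap="http://example.com/gaap">
    <us-gaap:Assets>100</us-gaap:Assets>
  </xbrl>
  more junk
  <other xmlns:dei="http://example.com/dei"><dei:Name>x</dei:Name></other>
TXT

gaap_finder = %r{(<(\w+) [^>]+xmlns:us-gaap\s*=.+?</\2>)}m

# With two capture groups, scan returns an array of [whole_match, tag_name] pairs.
results = sample.scan(gaap_finder)

results.each do |xml, tag_name|
  puts "root tag: #{tag_name}"
  puts "ends with closing tag: #{xml.end_with?("</#{tag_name}>")}"
end
# prints:
#   root tag: xbrl
#   ends with closing tag: true
```

Only the first embedded document matches, because the second one never declares xmlns:us-gaap on its root element.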
If you want to be less magic about your capturing, the text file format appears to always wrap each XML document in <XBRL>...</XBRL>. So, you could do this to process every XML file (there are seven, five of which do not happen to have any us-gaap namespaces):
    txt = IO.read('0001558370-15-001143.txt')
    xbrls = %r{(?<=<XBRL>).+?(?=</XBRL>)}m # find text inside <XBRL>…</XBRL>
    txt.scan(xbrls) do |xml|
      doc = Nokogiri.XML( xml )
      if doc.namespaces["xmlns:us-gaap"]
        gaaps = doc.xpath('//us-gaap:*')
        p gaaps.length
      end
    end
    #=> 569
    #=> 0 (for the XML Schema document that defines the namespace)

Answers 2
I couldn't figure out how to update an existing doc with a new namespace, but since Nokogiri will recognize namespaces on the root element, and those namespaces are, syntactically, just attributes, you can update the document with a new namespace declaration, serialize the doc to a string, and re-parse it:
    str = '<html><body><us-gaap:foo>foo</us-gaap:foo></body></html>'
    doc_without_ns = Nokogiri::XML(str)
    doc_without_ns.root['xmlns:us-gaap'] = 'http://your/actual/ns/here'
    doc = Nokogiri::XML(doc_without_ns.to_xml)
    doc.xpath("//us-gaap:*")
    # Returns [#<Nokogiri::XML::Element:0x3ff375583f9c name="foo" namespace=#<Nokogiri::XML::Namespace:0x3ff375583f24 prefix="us-gaap" href="http://your/actual/ns/here"> children=[#<Nokogiri::XML::Text:0x3ff375583768 "foo">]>]

Answers 3
I think you can also refer to the W3Schools page on XML namespaces: http://www.w3schools.com/xml/xml_namespaces.asp