The other day, I had to parse an XML document with more than 40 namespaces. To get the content of a namespaced node using the Ruby Nokogiri library, you need to use an XPath expression and pass in a hash that maps each prefix in the expression to the node’s namespace URI.
To get a nodeset with Nokogiri, you do something like this:
# Create a document from a string of XML
doc = Nokogiri::XML(xml_string)
# Set up namespace hash
namespaces = { 'ns1' => 'http://www.someorg.org/',
'ns2' => 'http://www.someotherorg.org' }
# Get a nodeset from the document.
# This returns all the "warning" nodes in the document.
warning_nodeset = doc.xpath("//ns1:warning", namespaces)
# Get a single node from the document.
# This returns the first message node under the document root.
first_message_node = doc.at_xpath("/ns1:message", namespaces)
# Get the brand attribute from the first "item" node anywhere in the document
brand_name = doc.at_xpath("//ns2:item", namespaces).attribute('brand').value
This is easy when you’re working with a document that has only a handful of namespaces, but 40+ namespaces makes the task difficult.
Fortunately, Nokogiri has two methods to help with this.
If doc is an instance of Nokogiri::XML::Document, calling doc.remove_namespaces! will remove all namespaces from all nodes in the document. You can then fetch node and attribute data without regard to namespaces. The code above could be rewritten as follows:
# Create a document from a string of XML
doc = Nokogiri::XML(xml_string)
# Remove all namespaces from the entire document
doc.remove_namespaces!
# Get a nodeset from the document. You don't need
# to pass in the second namespaces param, and the
# XPath expression no longer uses a namespace prefix.
# This returns all the "warning" nodes in the document.
warning_nodeset = doc.xpath("//warning")
# Get a single node from the document. You don't need
# to pass in the second namespaces param.
# This returns the first message node under the document root.
first_message_node = doc.at_xpath("/message")
# Get the brand attribute from the first "item" node anywhere in the document
brand_name = doc.at_xpath("//item").attribute('brand').value
This is convenient, but it may not be safe. Namespaces exist to avoid collisions. If you remove them, you may wind up processing nodes you don’t intend to process. For example, the document may include “product” nodes from several different organizations. With namespaces, you know which nodes came from each organization. Without namespaces, you have no idea.
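A safer way to parse documents with multiple namespaces is to call Nokogiri::XML::Document#collect_namespaces, which gathers the namespaces declared on all of the document’s nodes into a single hash. Because hash keys are unique, this only works cleanly when each prefix is bound to a single URI throughout the document; if the same prefix is declared with different URIs in different places, only one of those bindings survives. Using this method, we can rewrite our original code sample as follows: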
# Create a document from a string of XML
doc = Nokogiri::XML(xml_string)
# Collect all the namespaces in the document
namespaces = doc.collect_namespaces
# Strip the leading xmlns: from each namespace key, and store in a new hash
ns = {}
namespaces.each_pair do |key, value|
  ns[key.sub(/^xmlns:/, '')] = value
end
# Get a nodeset from the document.
# This returns all the "warning" nodes in the document.
warning_nodeset = doc.xpath("//ns1:warning", ns)
# Get a single node from the document.
# This returns the first message node under the document root.
first_message_node = doc.at_xpath("/ns1:message", ns)
# Get the brand attribute from the first "item" node anywhere in the document
brand_name = doc.at_xpath("//ns2:item", ns).attribute('brand').value
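The key-stripping step can be wrapped in a small helper. The sketch below is one possible shape, not part of Nokogiri: the name xpath_prefixes and its default_prefix option are hypothetical. The hash literal stands in for the output of doc.collect_namespaces, on the assumption that it returns keys such as "xmlns:ns1" for prefixed namespaces and a bare "xmlns" for a default namespace. The helper maps that bare key to a prefix of your choosing so you can use it in XPath expressions.

```ruby
# Hypothetical helper: turn collect_namespaces-style keys into bare prefixes.
# A default namespace (key "xmlns") is stored under default_prefix.
def xpath_prefixes(collected, default_prefix: 'default')
  collected.each_with_object({}) do |(key, uri), ns|
    prefix = key == 'xmlns' ? default_prefix : key.sub(/\Axmlns:/, '')
    ns[prefix] = uri
  end
end

# Stand-in for the hash returned by doc.collect_namespaces (assumed shape).
collected = {
  'xmlns'     => 'http://www.example.org/default',
  'xmlns:ns1' => 'http://www.someorg.org/',
  'xmlns:ns2' => 'http://www.someotherorg.org'
}

ns = xpath_prefixes(collected)
puts ns.inspect
```

You could then pass the resulting hash as the second argument to doc.xpath or doc.at_xpath, exactly as ns is used above.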
For more information, see the Nokogiri documentation here: