Using Nokogiri to Parse XML with Multiple Namespaces

The other day, I had to parse an XML document with more than 40 namespaces. To get the content of a node with Ruby's Nokogiri library, you use an XPath expression and tell Nokogiri which namespace the node belongs to.

To get a node set with Nokogiri, you call xpath with the expression and a hash that maps each prefix in the expression to its namespace URI.

This is easy when you’re working with a document that has only a handful of namespaces, but with 40+ namespaces, listing every prefix and URI by hand gets tedious.

Fortunately, Nokogiri has two methods to help with this.

If doc is an instance of Nokogiri::XML::Document, calling doc.remove_namespaces! removes all namespaces from all nodes in the document. You can then fetch node and attribute data without regard to namespaces, using plain XPath expressions such as //product.

This is convenient, but it may not be safe. Namespaces exist to avoid collisions. If you remove them, you may wind up processing nodes you don’t intend to process. For example, the document may include “product” nodes from several different organizations. With namespaces, you know which nodes came from each organization. Without namespaces, you have no idea.

A safer way to parse documents with multiple namespaces is Nokogiri::XML::Document#collect_namespaces, which collects the namespaces from all of the document’s nodes into a single hash. You can pass that hash straight to xpath and keep using prefixed expressions. One caveat: because the result is a hash, a prefix that is bound to different URIs in different parts of the document keeps only one of them.

For more information, see the Nokogiri documentation here:

http://nokogiri.org/Nokogiri/XML/Document.html

