python parsestring / silently skips entities

  • Written by
    Walter Doekes
  • Published on

The Python xml.dom.minidom parseString function silently skips over unknown entities.

The only entities it does know are &lt;, &gt;, &amp;, &apos; and &quot;, plus of course the numeric character references &#nn; and &#xhh;.

That makes sense, because those five are the only named entities predefined by the XML 1.0 spec.
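
As a quick check of the above, here is a minimal sketch (Python 3, standard library only; the <p> test document is made up for illustration) showing that the five predefined entities and both forms of numeric character reference are resolved without any DOCTYPE:

```python
from xml.dom.minidom import parseString

# Hypothetical test document: the five predefined entities, followed by a
# decimal and a hexadecimal character reference to the same character (U+00E9).
doc = parseString('<p>&lt;&gt;&amp;&apos;&quot; &#233;&#xe9;</p>')

# The parser hands back the decoded characters as text content.
text = doc.documentElement.firstChild.wholeText
print(text)
```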

However, if you’re parsing XHTML documents, it’s not nice that the entity references to special characters silently get dropped.
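
To see the problem for yourself, something like the following should do (Python 3, standard library only). It assumes the document carries a regular XHTML 1.0 Strict DOCTYPE, as XHTML documents normally do; the sample markup is made up for illustration. The external DTD is referenced but not fetched, so the parser has no definition for &agrave;:

```python
from xml.dom.minidom import parseString

# Assumed sample input: an XHTML DOCTYPE pointing at the external DTD,
# plus a body that uses the XHTML-only entity &agrave;.
XHTML = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    '<html><body>Voil&agrave;!</body></html>'
)

doc = parseString(XHTML)  # no exception is raised here
body = doc.getElementsByTagName('body')[0]
text = body.firstChild.wholeText

# The entity reference is gone from the text; no a-grave character appears.
print(repr(text))
```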

Other people have stumbled on the same issue, for example in "parsing xml containing &entities; with minidom" and "Problem with minidom and special chars in HTML".

The Python minidom documentation for the parse function states that “[the] function will change the document handler of the parser and activate namespace support; other parser configuration (like setting an entity resolver) must have been done in advance.”

Ah! Something about entities, but no example or further explanation.

So, how do I tell the parseString function what the defined entities are?

That’s where minidom_xhtml comes in. The parseStringXHTML function as defined therein handles adding all the XHTML entities you need into the DOCTYPE declaration.
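
The general idea of declaring the entities in the DOCTYPE can be sketched with the standard library alone. This is not the minidom_xhtml code: the helper names build_doctype and parse_with_html_entities are hypothetical, it targets Python 3, and it takes the entity list from html.entities instead of the xhtml*.ent files. It also assumes the input has no XML declaration or DOCTYPE of its own, since it simply prepends one.

```python
from html.entities import name2codepoint
from xml.dom.minidom import parseString

# The five entities XML already predefines; leave those alone.
PREDEFINED = {'amp', 'lt', 'gt', 'quot', 'apos'}


def build_doctype(root='html'):
    """Return a DOCTYPE whose internal subset declares the HTML entities."""
    decls = ''.join(
        '<!ENTITY %s "&#%d;">' % (name, codepoint)
        for name, codepoint in sorted(name2codepoint.items())
        if name not in PREDEFINED
    )
    return '<!DOCTYPE %s [%s]>' % (root, decls)


def parse_with_html_entities(markup):
    """Hypothetical helper: prepend the DOCTYPE, then parse with minidom.

    Assumes markup has no XML declaration or DOCTYPE of its own.
    """
    return parseString(build_doctype() + markup)


doc = parse_with_html_entities('<html><body>Voil&agrave;!</body></html>')
body = doc.getElementsByTagName('body')[0]
text = body.firstChild.wholeText
print(text)
```

Because the entities are declared in the internal subset, the parser can expand them itself and no entity resolver or network access is needed.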

Download as a package (includes the xhtml*.ent files): minidom_xhtml-1.tar.gz (or view the code)

Example usage:

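from minidom_xhtml import parseStringXHTML

# Parse a fragment that uses the XHTML entity &agrave;
doc = parseStringXHTML('<html><body>Voil&agrave;!</body></html>')
body = doc.getElementsByTagName('body')[0]
# Python 2 print statement; encode the unicode text as UTF-8 for output
print body.firstChild.wholeText.encode('utf-8')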
