htmlentities() vs htmlspecialchars() for valid XML with PHP
Electronic invoice issuance is compulsory for 100% of issuers in Brazil. It is the greatest electronic invoice infrastructure I have seen so far and runs on top of the SOAP messaging protocol, using XML and the e-signature XMLDsig format.
Here is a small but interesting challenge we faced years ago while integrating with government systems over SOAP: how do you properly escape special characters in XML with PHP?
htmlentities()
We started out using htmlentities() to escape some of the XML content. This worked well for a while, by coincidence. Then we noticed that htmlentities() is not well suited to creating safe strings for XML, because it converts every character that has a named HTML entity into that entity, including many that XML does not define.
Example:
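A minimal sketch of the problem, assuming a hypothetical Portuguese product name (the string and the variable names are illustrative, not taken from the original project):

```php
<?php
// Hypothetical input: a UTF-8 string with accented letters and an ampersand.
$name = 'Pêssego & maçã';

// htmlentities() replaces every character that has a named HTML entity.
$escaped = htmlentities($name, ENT_QUOTES | ENT_HTML401, 'UTF-8');

echo $escaped, PHP_EOL;
// P&ecirc;ssego &amp; ma&ccedil;&atilde;
```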
The &amp; entity is valid in XML, but &ecirc;, &ccedil;, and &atilde; are not!
If you validate an XML document that contains the htmlentities() output, the parser reports:
error on line 3 at column 15: Entity 'ecirc' not defined
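The failure can be reproduced from PHP as well. The sketch below assumes the DOM extension (libxml) is enabled and uses a hypothetical <products> document; the exact wording and position in the message depend on the parser.

```php
<?php
// Hypothetical document containing HTML-only entities produced by htmlentities().
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<products>
  <name>P&ecirc;ssego &amp; ma&ccedil;&atilde;</name>
</products>
XML;

// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadXML($xml);

// Gather one readable line per parser error.
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
}
libxml_clear_errors();

echo implode(PHP_EOL, $messages), PHP_EOL;
```

The parser should complain about the first entity it does not know, which here is ecirc.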
XML recommendation
According to the W3C XML recommendation, this is the set of general entities specified for escaping the left angle bracket, ampersand, and other delimiters in an XML document:
< (replace with &lt;)
> (replace with &gt;)
& (replace with &amp;)
' (replace with &apos;)
" (replace with &quot;)
Other HTML named entities are not predefined in XML, so they are invalid unless the document declares them in a DTD.
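The five replacements can be sketched as a small lookup table. The function name xmlEscape() is hypothetical and only illustrates the mapping; it is not part of PHP.

```php
<?php
// Hypothetical helper: replaces only the five characters that XML predefines entities for.
function xmlEscape(string $text): string
{
    // strtr() with an array never re-scans its own replacements,
    // so the "&" it inserts is not escaped a second time.
    return strtr($text, [
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        "'" => '&apos;',
        '"' => '&quot;',
    ]);
}

echo xmlEscape('Pêssego & maçã <"fresh">'), PHP_EOL;
// Pêssego &amp; maçã &lt;&quot;fresh&quot;&gt;
```

Accented letters pass through untouched, which is fine as long as the document is encoded as declared (UTF-8 here).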
htmlspecialchars()
To solve that, use htmlspecialchars() instead. This function converts only a small set of special characters to HTML entities (see “Performed translations” and its flags in the PHP manual).
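A sketch with a hypothetical string passed through htmlspecialchars(). The flags are an assumption rather than something the original setup is known to have used: ENT_QUOTES also escapes single quotes, and ENT_XML1 makes the single quote come out as &apos; rather than &#039;.

```php
<?php
// Hypothetical input: accented letters, an ampersand, and single quotes.
$name = "Pêssego & maçã 'in natura'";

// Only &, <, >, " and ' are converted; accented letters are left as they are.
$escaped = htmlspecialchars($name, ENT_QUOTES | ENT_XML1, 'UTF-8');

echo $escaped, PHP_EOL;
// Pêssego &amp; maçã &apos;in natura&apos;
```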
If you validate the XML built with htmlspecialchars(), the validator reports:
Valid XML
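The same check can be sketched in PHP, again assuming the DOM extension and a hypothetical <products> document. Note that loadXML() only checks well-formedness, not conformance to a schema.

```php
<?php
// Escape the hypothetical value with the XML-safe function.
$name = htmlspecialchars('Pêssego & maçã', ENT_QUOTES | ENT_XML1, 'UTF-8');

$xml = '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
     . "<products>\n  <name>{$name}</name>\n</products>";

// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$ok = $doc->loadXML($xml);
$errorCount = count(libxml_get_errors());
libxml_clear_errors();

echo ($ok && $errorCount === 0) ? 'Valid XML' : 'Invalid XML', PHP_EOL;
```

Reading the element back through the DOM returns the original text, ampersand included, which shows the escaping round-trips.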
Bottom line
Use htmlspecialchars() if you want to build safe XML. htmlentities() is not a guaranteed way to do that.