C# - XDocument will not parse HTML entities (e.g. ) but XmlDocument will
I am converting our old parsers, which run on XmlDocument, to XDocument. This gives us LINQ querying and added line number info.
My XML contains an element like this:
<?xml version="1.0"?> <fulltext> hello failed textnode  , don't know how parse it. </fulltext>
My problem is that XmlDocument seems to have no problem reading this node with:
    var xmlDocument = new XmlDocument();
    var physicalPath = GetPhysicalPath(uploadFolderFile);
    try
    {
        xmlDocument.Load(physicalPath);
    }
    catch (XmlException xmlException)
    {
        _log.Warn("problems document", xmlException);
    }
The example above parses the document fine, but when I try to do:
    XDocument xmlDocument;
    var physicalPath = GetPhysicalPath(uploadFolderFile);
    var xmlStream = new System.IO.StreamReader(physicalPath);
    try
    {
        xmlDocument = XDocument.Load(xmlStream, LoadOptions.SetLineInfo | LoadOptions.SetBaseUri);
    }
    catch (XmlException xmlException) // name the exception so it can be passed to the logger
    {
        _log.Warn("trying clean document hexadecimal", xmlException);
    }
it fails to read the document because of that character.
The special character seems to be allowed in XML version 1.1, but changing the version declaration doesn't help. I have thought about parsing the document with XmlDocument and then converting it, but that seems counterintuitive. Can anyone help with this problem?
OK... so I sort of found a solution to my problem.
First of all, I try to parse the XML using the following code:
    private XDocument GetXmlDocument(string physicalPath)
    {
        XDocument xmlDocument;
        try
        {
            // Dispose the reader so the file handle is released before the fallback re-opens the file.
            using (var xmlStream = new System.IO.StreamReader(physicalPath))
            {
                xmlDocument = XDocument.Load(xmlStream, LoadOptions.SetLineInfo);
            }
        }
        catch (XmlException)
        {
            //_log.Warn("trying clean document hexadecimal", xmlException);
            xmlDocument = XmlSanitizingStream.TryToCleanXmlBeforeParsing(physicalPath);
        }
        return xmlDocument;
    }
If that fails to load the document, I try to clean it using the technique from this blog post: http://seattlesoftware.wordpress.com/2008/09/11/hexadecimal-value-0-is-an-invalid-character/
It will not remove the character mentioned before, but it will remove any character not allowed by the XML standard.
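To make "not allowed by the XML standard" concrete, here is a minimal sketch of such a filter based on the `Char` production of XML 1.0. It is not the code from the linked blog post; the class and method names (`XmlCharFilter`, `IsLegalXml10Char`, `StripIllegalChars`) are made up for illustration. Note that it only looks at literal characters, so a numeric character reference written out as text passes through untouched, which is why a second step is still needed.

```csharp
using System.Text;

// Hypothetical helper, not part of the original post or the linked blog post.
public static class XmlCharFilter
{
    // XML 1.0 Char production: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    public static bool IsLegalXml10Char(int codePoint) =>
        codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD ||
        (codePoint >= 0x20 && codePoint <= 0xD7FF) ||
        (codePoint >= 0xE000 && codePoint <= 0xFFFD) ||
        (codePoint >= 0x10000 && codePoint <= 0x10FFFF);

    public static string StripIllegalChars(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            // A valid surrogate pair encodes a code point >= 0x10000, which is legal.
            if (char.IsSurrogatePair(text, i))
            {
                sb.Append(text[i]).Append(text[i + 1]);
                i++;
                continue;
            }
            // Everything else is a single UTF-16 unit; lone surrogates are dropped here.
            if (IsLegalXml10Char(text[i]))
            {
                sb.Append(text[i]);
            }
        }
        return sb.ToString();
    }
}
```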
Then, after sanitizing the XML, I add an XmlReader and set its settings to not check characters:
    public static XDocument TryToCleanXmlBeforeParsing(string physicalPath)
    {
        string xml;
        Encoding encoding;
        // Read the whole file through the sanitizing reader and remember its detected encoding.
        using (var reader = new XmlSanitizingStream(File.OpenRead(physicalPath)))
        {
            xml = reader.ReadToEnd();
            encoding = reader.CurrentEncoding;
        }

        // Re-encode the cleaned text in the encoding it was read with (UTF-16 as the fallback).
        byte[] encodedString;
        if (encoding.Equals(Encoding.UTF8))
            encodedString = Encoding.UTF8.GetBytes(xml);
        else if (encoding.Equals(Encoding.UTF32))
            encodedString = Encoding.UTF32.GetBytes(xml);
        else
            encodedString = Encoding.Unicode.GetBytes(xml);

        // CheckCharacters = false makes the reader skip character validation.
        var settings = new XmlReaderSettings { CheckCharacters = false };
        using (var ms = new MemoryStream(encodedString))
        using (var xmlReader = XmlReader.Create(ms, settings))
        {
            return XDocument.Load(xmlReader);
        }
    }
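The reader-settings step can be tried in isolation with a small sketch like the one below. It assumes the offending input is a numeric character reference such as `&#x1F;` (an assumption made for illustration only), and `LenientXmlLoader.Parse` is a made-up name. It also passes `LoadOptions.SetLineInfo`, since line numbers were one of the reasons for moving to XDocument.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper, not part of the original post.
public static class LenientXmlLoader
{
    public static XDocument Parse(string xml)
    {
        // With CheckCharacters = false the reader does not validate characters,
        // including those produced by numeric character references.
        var settings = new XmlReaderSettings { CheckCharacters = false };
        using (var textReader = new StringReader(xml))
        using (var xmlReader = XmlReader.Create(textReader, settings))
        {
            // Line info is kept so elements can still report where they came from.
            return XDocument.Load(xmlReader, LoadOptions.SetLineInfo);
        }
    }
}
```

Calling `LenientXmlLoader.Parse("<fulltext>a&#x1F;b</fulltext>")` should return a document instead of throwing an XmlException, whereas `XDocument.Parse` on the same string is expected to fail with the default settings.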
Since I've cleaned the document by removing illegal characters before adding the reader that skips character checking, I'm pretty sure I will not read a malformed XML document. The worst-case scenario is malformed XML, and that will throw an error anyway.
I use this only for parsing, and it should only be used to read data. It does not make the XML well-formed, and in many cases it will throw exceptions elsewhere in your code. I'm only using it because I cannot change what the customer is sending me and have to read it as is.