C# - XDocument will not parse html entities (e.g. &#xC;) but XmlDocument will


I am converting our old parsers, which run on XmlDocument, to XDocument, for the LINQ querying and the added line number info.

My XML contains an element like this:

<?xml version="1.0"?>
<fulltext>
    hello failed textnode
    &#xc;
    , don't know how parse it.
</fulltext>

My problem is that XmlDocument seems to have no problem reading this node with:

var xmlDocument = new XmlDocument();
var physicalPath = GetPhysicalPath(uploadFolderFile);
try
{
    xmlDocument.Load(physicalPath);
}
catch (XmlException xmlException)
{
    _log.Warn("problems document", xmlException);
}

The example above parses the document fine, but when I try to do:

XDocument xmlDocument;
var physicalPath = GetPhysicalPath(uploadFolderFile);
var xmlStream = new System.IO.StreamReader(physicalPath);
try
{
    xmlDocument = XDocument.Load(xmlStream, LoadOptions.SetLineInfo | LoadOptions.SetBaseUri);
}
catch (XmlException xmlException) // the exception must be named to be logged
{
    _log.Warn("trying clean document hexadecimal", xmlException);
}

It fails to read the document because of the character &#xc;. This special character seems to be allowed in XML version 1.1, but changing the version declaration doesn't help. I have thought about parsing the document with XmlDocument and then converting it, but that seems counterintuitive. Can anyone help with this problem?
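For reference, the "parse with XmlDocument, then convert" route mentioned above could be sketched roughly as follows. The class and method names are hypothetical and not part of the original code; the sketch relies on the observation above that XmlDocument.Load accepts the file. As far as I know, XmlNodeReader does not report line positions, so the line number info that motivated the move to XDocument would be lost this way.

```csharp
using System.Xml;
using System.Xml.Linq;

public static class XmlDocumentBridge
{
    // Hypothetical helper: load the file with XmlDocument first,
    // then copy the resulting tree into an XDocument.
    public static XDocument LoadViaXmlDocument(string physicalPath)
    {
        var xmlDocument = new XmlDocument();
        xmlDocument.Load(physicalPath);

        // XmlNodeReader walks the in-memory DOM, so the file is not parsed again.
        using (var nodeReader = new XmlNodeReader(xmlDocument))
        {
            return XDocument.Load(nodeReader);
        }
    }
}
```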

OK... so I sort of found a solution to the problem.

First of all, I try to parse the XML using the following code:

private XDocument GetXmlDocument(string physicalPath)
{
    XDocument xmlDocument;
    try
    {
        // Dispose the reader once the load attempt is over.
        using (var xmlStream = new System.IO.StreamReader(physicalPath))
        {
            xmlDocument = XDocument.Load(xmlStream, LoadOptions.SetLineInfo);
        }
    }
    catch (XmlException)
    {
        //_log.Warn("trying clean document hexadecimal", xmlException);
        xmlDocument = XmlSanitizingStream.TryToCleanXmlBeforeParsing(physicalPath);
    }

    return xmlDocument;
}

If it fails to load the document, I try to clean it using the technique described in this blog post: http://seattlesoftware.wordpress.com/2008/09/11/hexadecimal-value-0-is-an-invalid-character/

It will not remove the character mentioned before, but it will remove any character that is not allowed by the XML standard.
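To make that distinction concrete, here is a minimal sketch of such a filter. It is not the code from the blog post; the class and method names are made up for illustration, and the ranges are those of the Char production in the XML 1.0 specification. A raw form feed is dropped, while the reference &#xc; survives because it is written with ordinary legal characters.

```csharp
using System.Text;

public static class XmlCharFilter
{
    // Char production of XML 1.0:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    public static bool IsLegalXmlChar(int codePoint)
    {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Hypothetical helper: returns the text without characters outside those ranges.
    public static string StripIllegalChars(string text)
    {
        var result = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            // A valid surrogate pair forms one code point above #xFFFF.
            if (char.IsHighSurrogate(text[i]) && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                if (IsLegalXmlChar(char.ConvertToUtf32(text[i], text[i + 1])))
                {
                    result.Append(text[i]).Append(text[i + 1]);
                }
                i++; // skip the low surrogate
                continue;
            }

            // Lone surrogates fall outside every legal range and are dropped.
            if (IsLegalXmlChar(text[i]))
            {
                result.Append(text[i]);
            }
        }
        return result.ToString();
    }
}
```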

Then, after sanitizing the XML, I add an XmlReader and set its settings to not check characters:

public static XDocument TryToCleanXmlBeforeParsing(string physicalPath)
{
    string xml;
    Encoding encoding;
    using (var reader = new XmlSanitizingStream(File.OpenRead(physicalPath)))
    {
        xml = reader.ReadToEnd();
        encoding = reader.CurrentEncoding;
    }

    byte[] encodedString;
    if (encoding.Equals(Encoding.UTF8)) encodedString = Encoding.UTF8.GetBytes(xml);
    else if (encoding.Equals(Encoding.UTF32)) encodedString = Encoding.UTF32.GetBytes(xml);
    else encodedString = Encoding.Unicode.GetBytes(xml);

    var ms = new MemoryStream(encodedString);
    ms.Flush();
    ms.Position = 0;

    var settings = new XmlReaderSettings { CheckCharacters = false };
    XmlReader xmlReader = XmlReader.Create(ms, settings);
    var xmlDocument = XDocument.Load(xmlReader);
    ms.Close();
    return xmlDocument;
}
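Stripped of the sanitizing step, the part that lets the reference through is the CheckCharacters setting. A minimal sketch, assuming the XML is already available as a string (the class and method names are hypothetical): with CheckCharacters set to false the reader no longer range-checks character references, and passing LoadOptions.SetLineInfo keeps the line information on the nodes.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class LenientXmlLoader
{
    // Hypothetical helper: parses XML text without range-checking
    // character references such as &#xC;, keeping line info on the nodes.
    public static XDocument Parse(string xml)
    {
        var settings = new XmlReaderSettings { CheckCharacters = false };
        using (var stringReader = new StringReader(xml))
        using (var xmlReader = XmlReader.Create(stringReader, settings))
        {
            return XDocument.Load(xmlReader, LoadOptions.SetLineInfo);
        }
    }
}
```

Calling LenientXmlLoader.Parse on the sample document should return an XDocument whose fulltext element contains the form feed character as ordinary text.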

Since I've cleaned the document by removing illegal characters before I add the ignore-characters setting to the reader, I am pretty sure I will not read a malformed XML document. In the worst case the XML is malformed and it will throw an error anyway.

I only use this for parsing, and it should only be used to read data. It will not make the XML well-formed, and in many cases it will throw exceptions elsewhere in the code. I am only using it because I cannot change what the customer is sending and have to read it as it is.

