PHP's html_entity_decode() in Python


Let's say a file a.txt contains HTML-encoded HTML, something like this:

        <!doctype html public "-//w3c//dtd html 4.01 transitional//en">
        <html>
        <head>
            <title>html preview</title>
            <link rel="stylesheet" href="style.css" type="text/css" media="screen">
            <meta http-equiv="content-type" content="text/html; charset=utf-8">
        </head>
        <body>&lt;!doctype html&gt;&lt;html itemscope=&quot;&quot;
        ...
        &lt;/script&gt;&lt;/body&gt;&lt;/html&gt;</body>
        </html>

In PHP I can do:

        <?php
        $content = file_get_contents('a.txt');
        // Take everything between <body> and </body> (6 is strlen('<body>')).
        $start = strpos($content, '<body>') + 6;
        $end = strpos($content, '</body>');
        $html = html_entity_decode(substr($content, $start, $end - $start));
        file_put_contents('b.html', $html);

This works perfectly: the file b.html becomes proper HTML.

My question is: how can I do this in Python, assuming both the file and the encoded content are in UTF-8?

Edit: I experimented a bit with HTMLParser and BeautifulStoneSoup, but they corrupt the UTF-8 encoding. I also experimented with UnicodeDammit, but writing the returned string to the console or to a file raises an exception about characters out of range.

Edit 2: Please answer with code examples that work in a similar manner.

Solution 1

Python's version of html_entity_decode()
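
As a minimal sketch of how the PHP snippet above might translate, assuming Python 3.4 or later (where the standard library provides html.unescape) and a UTF-8 input file: the helper names decode_body and convert_file are hypothetical, chosen only for illustration. html.unescape is the closest standard-library counterpart, though it is not guaranteed to match html_entity_decode() in every edge case.

```python
import html
from pathlib import Path


def decode_body(content: str) -> str:
    """Return the entity-decoded text between the first <body> and </body>."""
    open_tag, close_tag = "<body>", "</body>"
    start = content.find(open_tag)
    if start == -1:
        raise ValueError("no <body> tag found")
    start += len(open_tag)
    # The encoded inner document contains &lt;/body&gt;, not a literal
    # </body>, so the first literal match is the outer closing tag.
    end = content.find(close_tag, start)
    if end == -1:
        raise ValueError("no </body> tag found")
    # Single decoding pass: "&lt;" becomes "<", "&quot;" becomes '"', etc.
    return html.unescape(content[start:end])


def convert_file(src: str, dst: str) -> None:
    """Read src as UTF-8, decode its body, and write the result to dst as UTF-8."""
    content = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(decode_body(content), encoding="utf-8")
```

Calling convert_file("a.txt", "b.html") would mirror the PHP version. Passing encoding="utf-8" explicitly on both the read and the write keeps the text as Unicode throughout, which should avoid the out-of-range character errors that tend to appear when a platform default codec such as ASCII is used instead.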

