PHP's html_entity_decode() in Python
Let's say I have a file a.txt that contains HTML-encoded HTML, something like this:
<!doctype html public "-//w3c//dtd html 4.01 transitional//en"> <html> <head> <title>html preview</title> <link rel="stylesheet" href="style.css" type="text/css" media="screen"> <meta http-equiv="content-type" content="text/html; charset=utf-8"> </head> <body>&lt;!doctype html&gt;&lt;html itemscope="" ... &lt;/script&gt;&lt;/body&gt;&lt;/html&gt;</body> </html>
In PHP I can do:
<?php
$content = file_get_contents('a.txt');
// Take everything between the outer <body> and </body> tags
$start = strpos($content, '<body>') + 6;
$end = strpos($content, '</body>');
$html = html_entity_decode(substr($content, $start, $end - $start));
file_put_contents('b.html', $html);
and it works perfectly. The file b.html becomes proper HTML.
My question is: how can I do this in Python, assuming the file and its encoded content are in UTF-8?
Edit: I experimented a bit with HTMLParser and BeautifulStoneSoup, but they corrupt the UTF-8 encoding. I also experimented with UnicodeDammit, but sending the returned string to the console or to a file raises an exception about characters being out of range.
Edit 2: please answer with code examples that work in a similar manner.
Solution 1
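One way to get the same result in Python 3 is sketched below. It assumes the file is UTF-8 and that the standard-library html.unescape() is an acceptable stand-in for html_entity_decode(); the two may not treat every entity identically in edge cases. The names decode_body and convert are made up for this example. Reading and writing with an explicit encoding="utf-8" is meant to avoid the corrupted characters and out-of-range exceptions described in the question.

```python
import html
from pathlib import Path


def decode_body(content: str) -> str:
    """Return the entity-decoded text between the first <body> and the next </body>.

    Hypothetical helper: mirrors the strpos/substr logic of the PHP snippet.
    """
    start = content.find("<body>")
    if start == -1:
        raise ValueError("no <body> tag found")
    start += len("<body>")
    # Search for the closing tag only after the opening one.
    end = content.find("</body>", start)
    if end == -1:
        raise ValueError("no </body> tag found")
    # html.unescape turns &lt; &gt; &amp; &eacute; &#233; etc. into characters.
    return html.unescape(content[start:end])


def convert(src: str = "a.txt", dst: str = "b.html") -> None:
    """Read src as UTF-8, decode the wrapped body, write dst as UTF-8."""
    # Decoding bytes to str up front keeps non-ASCII characters intact.
    content = Path(src).read_text(encoding="utf-8")
    # Encoding explicitly on write avoids relying on the platform default.
    Path(dst).write_text(decode_body(content), encoding="utf-8")
```

Calling convert() with its defaults reads a.txt and writes b.html, which mirrors the PHP snippet in the question.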