Pretty-printing HTML - None is None is None

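Writing an entire HTML file by hand is tedious, so parts of my own web pages are generated automatically.

When I do that, I would like the generated HTML to be easy to read if possible.

In other words, pretty printing.

Both lxml and BeautifulSoup provide functions or methods that pretty-print HTML.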

First, the BeautifulSoup version.

#encoding:utf-8
from __future__ import print_function
from BeautifulSoup import BeautifulSoup

source_str = u"<h1>ああああ</h1><div><p>hello<br />world!</p></div>"

# BeautifulSoup version
soup = BeautifulSoup(source_str)
pretty_bytes = soup.prettify(encoding='utf-8')

# prettify() returns encoded bytes here, so decode them before printing
print(unicode(pretty_bytes, 'utf-8'))

#<h1>
# ああああ
#</h1>
#<div>
# <p>
#  hello
#  <br />
#  world!
# </p>
#</div>

# Pretty!

Next, the lxml version.
For some reason, it does not pretty-print by default.

#encoding:utf-8
from __future__ import print_function
import lxml.html

source_str = u"<h1>ああああ</h1><div><p>hello<br />world!</p></div>"

et = lxml.html.fromstring(source_str)

# Default settings: no indentation, non-ASCII text escaped
pretty_bytes = lxml.html.tostring(et)
print(unicode(pretty_bytes, 'utf-8'))

#<div><h1>&#227;&#129;&#130;&#227;&#129;&#130;&#227;&#129;&#130;&#227;&#129;&#130;</h1><div><p>hello<br>world!</p></div></div>

# Yikes!?

print()
# Ask for UTF-8 output, indentation, and XML-style serialization
pretty_bytes = lxml.html.tostring(et, encoding="utf-8",
                                  pretty_print=True, method='xml')
print(unicode(pretty_bytes, 'utf-8'))

#<div>
#  <h1>ああああ</h1>
#  <div>
#    <p>hello<br/>world!</p>
#  </div>
#</div>

# Pretty!

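Incidentally, you can also pass method="html", but then

<br/> becomes <br>, among other things, so the output appears to use the older HTML notation.

In practice, it is more accurate to say that method="xml" is meant for XHTML than for XML.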
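
To make the difference between the two serialization methods concrete, here is a minimal sketch. The sample fragment and the variable names are made up for illustration and do not come from the listings in this post. It is written so that it should run on both Python 2 and Python 3.

```python
from __future__ import print_function
import lxml.html

# Hypothetical sample fragment containing a void element (<br>).
fragment = lxml.html.fromstring("<p>hello<br>world!</p>")

# Serialize the same tree with both methods and decode the bytes for printing.
as_html = lxml.html.tostring(fragment, encoding="utf-8", method="html").decode("utf-8")
as_xml = lxml.html.tostring(fragment, encoding="utf-8", method="xml").decode("utf-8")

print(as_html)  # the void element is written HTML-style, as <br>
print(as_xml)   # the void element is written XHTML-style, as <br/>
```

Only the way the empty element is written differs; the tree being serialized is the same in both calls.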

Another thing to watch out for with lxml is the meta tag.

For some reason, the charset meta tag is not output by default.

You need include_meta_content_type=True.

#encoding:utf-8
from __future__ import print_function
import lxml.html

source_str = u'''
<html>
<head>
 <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>
</head>
<body>
<h1>ああああ</h1><div><p>hello<br />world!</p></div>
</body>
</html>
'''

et = lxml.html.fromstring(source_str)

# Without include_meta_content_type
pretty_bytes = lxml.html.tostring(et, encoding="utf-8",
                                  method='xml', pretty_print=True)
print(unicode(pretty_bytes, 'utf-8'))

#<html>
#  <head>
#           <= !? (the meta tag is gone)
#  </head>
#  <body>
#<h1>ああああ</h1><div><p>hello<br/>world!</p></div>
#</body>
#</html>

# With include_meta_content_type=True
pretty_bytes = lxml.html.tostring(et, encoding="utf-8", method='xml',
                                  pretty_print=True, include_meta_content_type=True)

print(unicode(pretty_bytes, 'utf-8'))

#<html>
#  <head>
#    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
#  </head>
#  <body>
#<h1>ああああ</h1><div><p>hello<br/>world!</p></div>
#</body>
#</html>
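
The effect of the flag can also be checked in isolation. The following is only a sketch under a few assumptions: the ASCII-only sample document and the variable names are hypothetical, and it uses method="html", where the stripping is easiest to observe. The exact behaviour with method="xml" may differ between lxml versions.

```python
from __future__ import print_function
import lxml.html

# Hypothetical ASCII-only document with a Content-Type meta tag in <head>.
source = (b'<html><head>'
          b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
          b'</head><body><p>hello</p></body></html>')
doc = lxml.html.fromstring(source)

# Default: lxml.html.tostring drops the Content-Type meta tag.
without_meta = lxml.html.tostring(doc, encoding="utf-8", method="html")

# With the flag set, the meta tag is kept in the output.
with_meta = lxml.html.tostring(doc, encoding="utf-8", method="html",
                               include_meta_content_type=True)

print(b"http-equiv" in without_meta)
print(b"http-equiv" in with_meta)
```

Checking for the attribute name in the serialized bytes keeps the comparison independent of how the rest of the document is indented.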

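BeautifulSoup takes care of the meta tag on its own.

However, for simple processing like this, BeautifulSoup is slower than lxml.

With lxml, you will probably end up defining and using a function like this one.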

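import lxml.html

def to_pretty_string(et, pretty_print=True, include_meta_content_type=True,
        encoding="utf-8", method='xml', *a, **kw):
    # Same arguments as lxml.html.tostring, but with defaults chosen for
    # pretty-printed, UTF-8 encoded, XHTML-style output.
    return lxml.html.tostring(et, pretty_print, include_meta_content_type,
        encoding, method, *a, **kw)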
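
As a usage sketch, the wrapper can be called with its defaults or with a single default overridden. The sample fragment and the variable names below are hypothetical, and the function is repeated only so that the example runs on its own (on Python 2 or Python 3).

```python
from __future__ import print_function
import lxml.html

def to_pretty_string(et, pretty_print=True, include_meta_content_type=True,
                     encoding="utf-8", method='xml', *a, **kw):
    # Same wrapper as above, repeated so this example is self-contained.
    return lxml.html.tostring(et, pretty_print, include_meta_content_type,
                              encoding, method, *a, **kw)

# Hypothetical sample fragment.
et = lxml.html.fromstring("<div><p>hello</p><p>world!</p></div>")

# Defaults: indented, UTF-8 encoded bytes, XHTML-style serialization.
xhtml_bytes = to_pretty_string(et)

# Override a single default by keyword.
html_bytes = to_pretty_string(et, method='html')

print(xhtml_bytes.decode("utf-8"))
print(html_bytes.decode("utf-8"))
```

Passing overrides by keyword avoids depending on the positional order of the parameters.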