HTML indenter written in Python
I am looking for a free (as in freedom) HTML indenter (or re-indenter) written in Python (module or command line). I don't need to filter HTML with a whitelist. I just want to indent (or re-indent) HTML source to make it more readable. For example, say I have the following code:
<ul><li>Item</li><li>Item
</li></ul>
the output could be something like:
<ul>
    <li>Item</li>
    <li>Item</li>
</ul>
Note: I am not looking for an interface to non-Python software (for example Tidy, written in C), but for a 100% Python script.
Thanks a lot.
You can use the toprettyxml function of the built-in xml.dom.minidom module:
>>> from xml.dom import minidom
>>> x = minidom.parseString("<ul><li>Item</li><li>Item\n</li></ul>")
>>> print x.toprettyxml()
<?xml version="1.0" ?>
<ul>
	<li>
		Item
	</li>
	<li>
		Item
	</li>
</ul>
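On Python 3 the same approach needs print() as a function. The following is a minimal sketch, not part of the original answer: the helper name indent_xhtml is made up for illustration, and it assumes the input is well-formed XML, because minidom rejects markup that is not (for example an unclosed <br>).

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def indent_xhtml(markup, indent="  "):
    """Return an indented copy of well-formed markup, or None if it does not parse.

    Hypothetical helper for illustration only.
    """
    try:
        dom = minidom.parseString(markup)
    except ExpatError:
        # minidom is an XML parser, so HTML that is not well-formed fails here
        return None
    # Serializing the root element instead of the document skips the XML declaration
    return dom.documentElement.toprettyxml(indent=indent)


print(indent_xhtml("<ul><li>Item</li><li>Item</li></ul>"))
print(indent_xhtml("<p>line<br></p>"))  # unclosed <br>: prints None
```

Returning None on a parse error is just one possible design; re-raising the exception may be preferable if the caller needs to know why the input was rejected.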
Using BeautifulSoup
There are a dozen ways to use the BeautifulSoup module and its prettify function. Here are some examples to get you started.
From the command line
$ python -m BeautifulSoup < somefile.html > prettyfile.html
Within VIM (manually)
You don't have to write the file back to disk if you don't want to, but I included that step to get the same effect as the command-line example.
$ vi somefile.html
:%!python -m BeautifulSoup
:w prettyfile.html
Within VIM (define key-mapping)
In ~/.vimrc define:
nmap =h :%!python -m BeautifulSoup<CR>
Then, when you open a file in vim and it needs beautification:
$ vi somefile.html
=h
:w prettyfile.html
Once again, saving the beautification is optional.
Python Shell
$ python
>>> from BeautifulSoup import BeautifulSoup as parse_html_string
>>> from os import path
>>> uglyfile = path.abspath('somefile.html')
>>> path.isfile(uglyfile)
True
>>> prettyfile = path.abspath(path.join('.', 'prettyfile.html'))
>>> path.exists(prettyfile)
>>> doc = None
>>> with open(uglyfile, 'r') as infile, open(prettyfile, 'w') as outfile:
...     # Assuming very simple case
...     htmldocstr = infile.read()
...     doc = parse_html_string(htmldocstr)
...     outfile.write(doc.prettify())
...
>>> # That's it; you can manually manipulate the DOM too though
>>> scripts = doc.findAll('script')
>>> meta = doc.findAll('meta')
>>> print doc.prettify()
[imagine beautiful html here]
>>> import jsbeautifier
>>> print jsbeautifier.beautify(scripts[0].string)
[imagine beautiful script here]
>>>
BeautifulSoup has a function called prettify which does this.
See this question
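As a rough illustration of prettify on the HTML from the question, here is a short sketch. It assumes the third-party beautifulsoup4 package (imported as bs4, the successor to the older BeautifulSoup module) is installed, and it picks the standard-library "html.parser" backend so that no C extension is involved.

```python
from bs4 import BeautifulSoup  # assumption: the beautifulsoup4 package is installed

ugly = "<ul><li>Item</li><li>Item\n</li></ul>"

# "html.parser" selects the parser that ships with Python itself
soup = BeautifulSoup(ugly, "html.parser")

# prettify() returns the document as a string with one tag or text run per line
pretty = soup.prettify()
print(pretty)
```

Note that prettify places text on its own line, so the result is meant for reading rather than as a byte-for-byte equivalent of the input.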
There's also the html5print module. Key features from the description page:
- Pretty print HTML as well as embedded CSS and JavaScript within it
- Pretty print pure CSS and JavaScript
- Try to fix fragmented HTML5
- Try to fix HTML with broken unicode encoding
- Try to guess encoding of the document, and in some cases manage to convert 8-bit byte code back into correct UTF-8 format
- Support both Python 2 and 3
Here's my pure Python solution:
from xml.dom.minidom import parseString as string_to_dom

def prettify(string, html=True):
    # Parse the markup; it must be well-formed XML
    dom = string_to_dom(string)
    ugly = dom.toprettyxml(indent=" ")
    # Drop any blank lines that toprettyxml emits
    split = list(filter(lambda x: len(x.strip()), ugly.split('\n')))
    if html:
        # Skip the leading <?xml ...?> declaration
        split = split[1:]
    pretty = '\n'.join(split)
    return pretty

def pretty_print(html):
    print(prettify(html))
When used on your block of html:
html = """<ul><li>Item</li><li>Item</li></ul>"""
pretty_print(html)
I get:
<ul>
 <li>Item</li>
 <li>Item</li>
</ul>
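Because minidom is an XML parser, the function above only works on well-formed input. As a rough alternative, here is a sketch built on the standard-library html.parser module instead; it is a different technique from the answer above, not a refinement of it. The names IndentingParser and indent_html are made up for illustration. It assumes every non-void element is explicitly closed, it drops comments and doctypes, and it puts every text run on its own line.

```python
import html
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose content is raw text and must not be escaped
RAW_TEXT_TAGS = {"script", "style"}


class IndentingParser(HTMLParser):
    """Collect one output line per tag or text run, indented by nesting depth."""

    def __init__(self, indent="  "):
        super().__init__()  # default convert_charrefs=True: text arrives unescaped
        self.indent = indent
        self.depth = 0
        self.raw_text = False
        self.lines = []

    def _emit(self, text):
        self.lines.append(self.indent * self.depth + text)

    def handle_starttag(self, tag, attrs):
        # Reuse the tag exactly as written, attributes included
        self._emit(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.depth += 1
            self.raw_text = tag in RAW_TEXT_TAGS

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: no change in depth
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        self.raw_text = False
        self.depth = max(self.depth - 1, 0)
        self._emit("</" + tag + ">")

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Re-escape ordinary text; leave script and style content untouched
            self._emit(text if self.raw_text else html.escape(text, quote=False))


def indent_html(markup, indent="  "):
    parser = IndentingParser(indent)
    parser.feed(markup)
    parser.close()
    return "\n".join(parser.lines)


print(indent_html("<ul><li>Item</li><li>Item\n</li></ul>"))
print(indent_html("<p>a &amp; b<br>c</p>"))
```

Since whitespace around text is stripped and re-inserted, this is only suitable where that whitespace is not significant (so not inside pre elements, for instance).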