Justified plain text from HTML
I need a plain text representation of an arbitrary HTML file (e.g., a blog post). So far that's not a problem; there are dozens of HTML-to-text converters. However, the text in paragraphs (read: "p elements") should be justified in the plain text view (to a certain number of columns) and, if possible, hyphenated to give a more readable result. Also, the resulting text file must be UTF-8 or UTF-16.
Simple plain text conversion I can do with XSLT; that's close to trivial. But justifying the text is beyond its possibilities (not quite true, because XSLT is Turing complete, but close enough to reality).
FOP and XSL-FO don't work either. They do as requested, but FOP's plain text output is horrible (the developers say that it is not intended for such usage).
I also experimented with HTML -> XSLT -> Roff, but I'm stuck with groff, and its Unicode support is far from optimal. Since there are characters like ellipses ("...") and typographically correct quotation marks, it is quite cumbersome to tell groff, in the XSLT stylesheet, the escape sequences for dozens of Unicode characters.
Another way could be conversion to TeX and output as plain text, but I have never tried this before with (La)TeX.
Perhaps I have missed something really simple. Does anyone have an idea how I could achieve the above? By the way: a solution should preferably work without root rights to install, with PHP, Python, Perl, XSLT or any program found in a half-decent Linux distro.
Try Python. Use BeautifulSoup to parse the HTML. The textwrap module will allow you to format the text.
There are two features missing, though. To justify the text, you'll need to add spaces to each line but that shouldn't be a big issue (see this code example).
For hyphenation, try this project.
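To illustrate the wrap-then-pad idea, here is a minimal sketch rather than a finished converter. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without installing anything, and the names ParagraphCollector, justify_line, justify_paragraph and html_to_justified_text are made up for this example. It does not hyphenate, and it ignores everything outside p elements.

```python
import textwrap
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buf = None  # None while we are outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # tolerate a missing </p> before the next <p>
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def _flush(self):
        if self._buf is not None:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
        self._buf = None

    def close(self):
        super().close()
        self._flush()


def justify_line(line, width):
    """Pad the gaps between words so the line is exactly `width` wide."""
    words = line.split()
    gaps = len(words) - 1
    spaces = width - sum(len(w) for w in words)
    if gaps < 1 or spaces < gaps:
        return line  # single word, or a line that is already too long
    base, extra = divmod(spaces, gaps)
    # The leftmost `extra` gaps get one additional space each.
    padded = [w + " " * (base + (1 if i < extra else 0))
              for i, w in enumerate(words[:-1])]
    return "".join(padded) + words[-1]


def justify_paragraph(text, width=72):
    """Wrap `text` and justify every line except the last one."""
    lines = textwrap.wrap(text, width=width)
    if not lines:
        return ""
    return "\n".join([justify_line(l, width) for l in lines[:-1]] + [lines[-1]])


def html_to_justified_text(html, width=72):
    """Return the <p> contents of `html` as justified plain text."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return "\n\n".join(justify_paragraph(p, width) for p in parser.paragraphs)
```

Writing the result with open(path, "w", encoding="utf-8") should cover the UTF-8 requirement. Always giving the extra spaces to the leftmost gaps is the simplest choice; alternating sides from line to line would probably look more even.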
If you are familiar with Emacs, you can open the HTML file in Emacs-W3M (i.e. M-x w3m-find-file foo.html), save the rendered page as a plain text file, and then call M-x set-justification-full on it.
You can even write a small function to do the job:
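(require 'w3m)  ; provided by the emacs-w3m package

(defun my-html-to-justified-text (html-file text-file)
  "Convert HTML-FILE to justified plain text and save it as TEXT-FILE."
  (find-file html-file)
  (w3m-rendering-buffer)                            ; render the HTML in the current buffer
  (set-justification-full (point-min) (point-max))  ; justify the whole buffer
  (write-file text-file))

;; Example call:
(my-html-to-justified-text "~/tmp/2.html" "~/tmp/2.txt")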
Links or lynx might be worth a try; see the -dump switch. You can then easily solve the encoding part separately using iconv or something similar.