How can I get valid first 300 characters of HTML or Markdown?

2023-01-18 22:01

I'm creating a blog (and the rest of a website) using Python and Flask. Blog posts are written in Markdown and converted to HTML using the creatively named Markdown in Python. Both the Markdown (for future editing) and the HTML (for display) are stored in the database.

I want to be able to automatically get the first 300 characters of text (or 500, or 200; I haven't worked out the number) to use on pages when I don't want to display the full blog post (like on the front page). However, the problem is that any simple way of doing it will potentially leave me with invalid HTML or Markdown:

HTML:

<p><em>Here</em> is <strong>formatted</strong> text.</p>

If I get the first ten characters of this, it will leave me halfway through formatted, and I would somehow need to close the <strong> and <p> tag.

Markdown:

*Here* is **formatted** text.

Likewise, getting the first ten characters will leave me needing to close the ** for bold.

Is there any way I can do this without needing to write an HTML or Markdown parser? Or would I be better off just converting the HTML into plain text?

If you're okay with summaries just being plain text, then Adam's answer is certainly the best -- convert to plain text first, and then truncate.

If you want to maintain formatting, then here's another idea:

Convert from Markdown to HTML.
Run through the HTML with a parser of the sort that will give you a token stream (e.g. Perl's HTML::TokeParser::Simple, but I'm sure there's something comparable for Python -- or you can turn any event-based parser into one of these).
When you get element tokens, copy them to the output, while maintaining a stack of unclosed tags.
When you get text tokens, copy them to the output, while maintaining a count of the amount of text you've outputted.
When you get to a text token that would put you over the limit, copy only enough characters to reach the limit, generate closing tags for any unclosed tags on your stack, and stop processing.

If you were doing this with arbitrary HTML then you would have a lot of weird things to worry about, but since you're coming from Markdown it should actually work pretty well. Any decent Markdown converter should generate well-formed HTML with a fairly small number of tags in it.
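The steps above can be sketched in Python with the standard library's html.parser, which is event-based rather than a true token stream. This is only an illustration: the names truncate_html, _Truncator and VOID_TAGS are made up for this example, and it assumes the input is the well-formed HTML a Markdown converter produces.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are kept off the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class _Truncator(HTMLParser):
    """Copies tags and text to the output until `limit` text characters are seen."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.limit = limit
        self.count = 0      # text characters emitted so far
        self.out = []       # output fragments
        self.stack = []     # currently open (unclosed) tags
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        parts = []
        for name, value in attrs:
            if value is None:
                parts.append(f" {name}")
            else:
                parts.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(parts)}>")
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.done or tag not in self.stack:
            return
        # Close everything up to and including the matching open tag.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        if self.done:
            return
        remaining = self.limit - self.count
        if len(data) >= remaining:
            data = data[:remaining]
            self.done = True
        self.count += len(data)
        self.out.append(escape(data, quote=False))
        if self.done:
            # Limit reached: close whatever is still open and stop copying.
            while self.stack:
                self.out.append(f"</{self.stack.pop()}>")


def truncate_html(html, limit=300):
    parser = _Truncator(limit)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(truncate_html("<p><em>Here</em> is <strong>formatted</strong> text.</p>", 10))
# <p><em>Here</em> is <strong>fo</strong></p>
```

The count covers text only, so the markup itself never eats into the 300-character budget.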

Indeed, the easiest and safest method would be to generate HTML from the Markdown source, convert it to plain text (see html2plaintext), and then trim it down to 300 characters.
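A minimal sketch of that plain-text route, using the standard library's html.parser in place of html2plaintext (which is not shown here). The function name html_to_summary and the trailing "..." are assumptions made for the example, not part of any library.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes and ignores all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_summary(html, limit=300):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so newlines between blocks become single spaces.
    text = " ".join("".join(parser.chunks).split())
    if len(text) <= limit:
        return text
    return text[:limit].rstrip() + "..."


print(html_to_summary("<p><em>Here</em> is <strong>formatted</strong> text.</p>", 10))
# Here is fo...
```

Because the result contains no markup, it has to be HTML-escaped again when it is placed in a page (Flask's Jinja templates do this by default).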

A more efficient method might be to modify the Markdown parser to output only the first 300 characters of all the text nodes, but I really don't think the performance benefits justify the modifications.

I don't know if it's applicable in Python, but this tutorial may help you. Basically, it scans for unclosed tags after the text is trimmed and auto-closes them.
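The tutorial itself is not reproduced here, so the following is only a guess at the idea in Python: cut the HTML string, drop a tag that was sliced in half, then append closing tags for anything still open. The names close_open_tags, TAG_RE and VOID_TAGS are hypothetical, and the regular expression assumes simple, well-formed markup like Markdown output.

```python
import re

# Matches an opening, closing, or self-closing tag and captures its name.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>")

# Elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def close_open_tags(fragment):
    # If the cut landed inside a tag, throw the partial tag away.
    last_lt = fragment.rfind("<")
    if last_lt > fragment.rfind(">"):
        fragment = fragment[:last_lt]

    stack = []
    for closing, name, self_closing in TAG_RE.findall(fragment):
        name = name.lower()
        if self_closing or name in VOID_TAGS:
            continue
        if closing:
            if name in stack:
                # Pop back to (and including) the matching open tag.
                while stack.pop() != name:
                    pass
        else:
            stack.append(name)

    # Close the remaining open tags, innermost first.
    return fragment + "".join(f"</{name}>" for name in reversed(stack))


html = "<p><em>Here</em> is <strong>formatted</strong> text.</p>"
print(close_open_tags(html[:30]))
# <p><em>Here</em> is <strong>fo</strong></p>
```

Note that slicing the raw string counts markup characters as well as text, and a cut can still land in the middle of an entity such as `&amp;`; this sketch does not handle that case.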

Use an evented parser, ignore non text events, capture text events until you reach 300 characters, then stop parsing.

libxml supports event-based parsing of HTML. I'm sure there is an equivalent for Markdown, but I haven't looked.

You should measure though to make sure the performance benefit is worth the added complexity.
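As a rough illustration of the stop-early idea, here is a sketch that uses Python's built-in html.parser rather than libxml. The names first_text, _FirstText and _LimitReached are invented for the example; raising a private exception is just one way to abandon parsing, since HTMLParser has no stop method.

```python
from html.parser import HTMLParser


class _LimitReached(Exception):
    """Raised internally to abandon parsing once enough text is collected."""


class _FirstText(HTMLParser):
    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.limit = limit
        self.text = ""

    # Only text events are handled; tag events fall through to the no-op defaults.
    def handle_data(self, data):
        self.text += data
        if len(self.text) >= self.limit:
            self.text = self.text[: self.limit]
            raise _LimitReached


def first_text(html, limit=300):
    parser = _FirstText(limit)
    try:
        parser.feed(html)
        parser.close()
    except _LimitReached:
        pass  # enough text collected; the rest of the document is never parsed
    return parser.text


print(first_text("<p><em>Here</em> is <strong>formatted</strong> text.</p>", 10))
# Here is fo
```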
