Computing article abstracts
I'm looking for a way to automatically produce an abstract, basically the first few sentances/paragraphs of a blog entry, to display in a list of articles (which are written in markdown). Currently, I'm doing something like this:
def abstract(article, paras=3):
return '\n'.join(article.split('\n')[0:paras])
to just grab the first few lines worth of text, but i'm not totally happy with the results.
What I'm really looking for is to end up with about 1/3 of a screenful of formatted text to display in the list of entries, but using the algorithm above, the amount pulled ends up with wildly varying amounts, as little as a li开发者_运维知识库ne or two, is frequently mixed with more ideal sized abstracts.
Is there a library that's good at this kind of thing? if not, do you have any suggestions to improve the output?
EDIT:
You can do something like this:
from textwrap import wrap
def getAbstract(text, lines=5, screenwidth=100):
width = len(' '.join([
line for block in text.splitlines()
for line in wrap(block, width=screenwidth)
][:lines]))
return text[:width] + '...'
This makes use of the textwrap algorithm to get the ideal text length. It will break the text into screen-sized lines and use them to calculate the length of the desirable number of lines.
For example applying this algorithm on the python wikipedia page entry:
print getAbstract(text, lines=7)
will give you this output:
Python is a general-purpose high-level programming language.2 Its design philosophy emphasizes code readability.[3] Python claims to "[combine] remarkable power with very clear syntax",[4] and its standard library is large and comprehensive. Its use of indentation as block delimiters is unusual among popular programming languages.
Python supports multiple programming paradigms (primarily object oriented, imperative, and functional) and features a fully dynamic type system and automatic memory management, similar to Perl, Ruby, Scheme, and Tcl. Like other dynamic languages, Python is often used as a scripting...
Without further details it's hard to help you. But if your problem was that taking the first few lines was too much for some entries you may need to have a look at textwrap
For example if you only want 100 character abstracts you can do the following:
import textwrap
abstract = textwrap.wrap(text, 100)[0]
That will also replace newlines with spaces which might be desirable depending on your requirements.
I'm not exactly sure of what you want.
However, I would suggest cutting the article after X characters and put "...". Then you have more control over the size of your "abstract" (if that's what bothers you in your current implementation).
精彩评论