Project Gutenberg Python problem?
I am trying to process various texts with regular expressions and Python's NLTK (the book is at http://www.nltk.org/book). I am trying to create a random text generator and I am stuck on one problem. First, here is my algorithm:
1. Enter a sentence as input (this is called the trigger string).
2. Get the longest word in the trigger string.
3. Search the whole Project Gutenberg database for sentences that contain this word, regardless of uppercase or lowercase.
4. Return the longest sentence that contains the word from step 2.
5. Join the sentences from step 1 and step 4 together.
6. Repeat the process, each time taking the longest word of the most recently added sentence.
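The steps above can be sketched roughly as follows. This is only an illustration under some assumptions: the helper names (longest_word, longest_sentence_with, generate), the rounds limit, and the rule of stopping when a sentence would repeat are all made up for the example, and a tiny hand-written list stands in for the real corpus so the code runs without NLTK.

```python
def longest_word(words):
    """Return the longest alphabetic token (the first one on ties), or None."""
    candidates = [w for w in words if w.isalpha()]
    return max(candidates, key=len) if candidates else None


def longest_sentence_with(sentences, word):
    """Return the longest tokenized sentence containing `word`, ignoring case."""
    target = word.lower()
    matches = (s for s in sentences if any(target == w.lower() for w in s))
    return max(matches, key=len, default=None)


def generate(trigger, sentences, rounds=3):
    """Chain sentences: each round looks up the longest word of the last one."""
    chain = [trigger.split()]
    for _ in range(rounds):
        word = longest_word(chain[-1])
        if word is None:
            break
        nxt = longest_sentence_with(sentences, word)
        # Assumption: stop when nothing matches or the text would repeat itself.
        if nxt is None or list(nxt) in chain:
            break
        chain.append(list(nxt))
    return " ".join(" ".join(s) for s in chain)


# Toy stand-in for the corpus. With NLTK and its Gutenberg data installed,
# gutenberg.sents() could be passed as `sentences` instead.
corpus = [
    ["Actus", "Primus", "."],
    ["The", "TRAGEDIE", "of", "Macbeth", "is", "short", "."],
    ["Macbeth", "shall", "sleep", "no", "more", "."],
]
print(generate("the sleep", corpus))
```

Without the repeat check the loop tends to get stuck, because the longest sentence containing a word is often also the longest sentence containing its own longest word.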
So far I have been able to do this for the first two sentences, but I cannot perform a case-insensitive search. The entire sentence database of Project Gutenberg is available via the gutenberg.sents() function, but a case-insensitive regex search is practically impossible because gutenberg.sents() returns the sentences of each book as a list of lists, like this:
Example: all the sentences of Shakespeare's Macbeth are returned by typing
import nltk
from nltk.corpus import gutenberg
gutenberg.sents('shakespeare-macbeth.txt')
into the Python shell, and the output is:
[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'],
['Actus', 'Primus', '.'], .......]
with [The Tragedie of Macbeth by William Shakespeare, 1603] and Actus Primus. being the first two sentences.
How can I find the word I'm looking for regardless of whether it is uppercase or lowercase? I'm desperately in need of help, since I have been tinkering with this for the past two days and it's starting to wear on my nerves. Thanks a lot.
Given a list L of words, and a target word t,
any(t.lower()==w.lower() for w in L)
tells you whether L contains the word t, ignoring case. It's faster, of course, to do
lt = t.lower()
any(lt==w.lower() for w in L)
since Python does not "hoist" the constant computation out of the loop and, unless you hoist it yourself, it will be performed repeatedly.
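As a small runnable sketch of the hoisted version, the check can be wrapped in a function. The name contains_word is made up for this example; the sample sentence is the first one from the Macbeth output shown in the question.

```python
def contains_word(words, target):
    """True if `target` occurs in `words`, comparing case-insensitively."""
    lowered = target.lower()  # computed once, outside the generator expression
    return any(lowered == w.lower() for w in words)


sentence = ['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by',
            'William', 'Shakespeare', '1603', ']']
print(contains_word(sentence, "MACBETH"))
print(contains_word(sentence, "Hamlet"))
```

Because any() stops at the first match, the scan ends early whenever the word appears near the start of the sentence.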
Given a list of lists lol, the longest sub-list including t can be found by
longest = max((L for L in lol if any(lt==w.lower() for w in L)), key=len)
If multiple sub-lists include t and are of the same maximal length, this will give you the first one, as it happens.
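One thing to watch for: max() raises ValueError when no sub-list contains t at all. A possible variant, assuming Python 3.4 or later for the default argument, returns None in that case. The function name and the sample data below are invented for illustration.

```python
def longest_sublist_with(lol, t):
    """Longest sub-list of `lol` containing `t` (case-insensitive), or None."""
    lt = t.lower()
    matches = (L for L in lol if any(lt == w.lower() for w in L))
    # default=None avoids the ValueError that max() raises on an empty iterable
    return max(matches, key=len, default=None)


lol = [
    ["Actus", "Primus", "."],
    ["Macbeth", "doth", "come", "."],
    ["Enter", "MACBETH", "and", "Banquo"],
    ["Exeunt", "."],
]
print(longest_sublist_with(lol, "macbeth"))  # two candidates of length 4
print(longest_sublist_with(lol, "hamlet"))   # no candidate at all
```

Both matching sub-lists here have four tokens, so the call returns the one that appears first in lol.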
How about using the built-in method str.lower()? It returns a copy of the string converted to lowercase.
Then just compare the strings.
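A minimal sketch of that comparison follows. The function names are made up for this example. The second helper uses str.casefold(), available since Python 3.3, which is a more aggressive form of lowercasing that may be worth considering if the texts contain non-ASCII letters.

```python
def same_word(a, b):
    """Compare two words, ignoring case, via str.lower()."""
    return a.lower() == b.lower()


def same_word_folded(a, b):
    """Same idea with str.casefold(), which also handles cases like 'ß' vs 'SS'."""
    return a.casefold() == b.casefold()


print(same_word("Macbeth", "MACBETH"))
print(same_word("Straße", "STRASSE"))
print(same_word_folded("Straße", "STRASSE"))
```

For plain English text such as the Macbeth example, lower() and casefold() give the same result.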