Failed loading english.pickle with nltk.data.load
When trying to load the punkt tokenizer...
import nltk.data
tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')
...a LookupError was raised:
> LookupError:
> *********************************************************************
> Resource 'tokenizers/punkt/english.pickle' not found. Please use the NLTK Downloader to obtain the resource: nltk.download(). Searched in:
> - 'C:\\Users\\Martinos/nltk_data'
> - 'C:\\nltk_data'
> - 'D:\\nltk_data'
> - 'E:\\nltk_data'
> - 'E:\\Python26\\nltk_data'
> - 'E:\\Python26\\lib\\nltk_data'
> - 'C:\\Users\\Martinos\\AppData\\Roaming\\nltk_data'
> **********************************************************************
I had this same problem. Go into a Python shell and type:
>>> import nltk
>>> nltk.download()
Then an installation window appears. Go to the 'Models' tab, select 'punkt' under the 'Identifier' column, and click Download to install the necessary files. Then it should work!
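On a headless machine where the Tkinter window cannot open, a non-interactive equivalent is the short sketch below; the 'tokenizers/punkt' resource path is the standard one, but treat the exact behaviour as an assumption if your NLTK version differs.
import nltk

# Non-interactive download; quiet=True suppresses the progress output.
nltk.download('punkt', quiet=True)

# Verify the resource is now discoverable on nltk's search path.
print(nltk.data.find('tokenizers/punkt'))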
The main reason you see that error is that NLTK could not find the punkt package. Because of the size of the NLTK suite, not all available packages are downloaded by default when you install it. You can download the punkt package like this:
import nltk
nltk.download('punkt')
from nltk import word_tokenize, sent_tokenize
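A quick usage check after the download; the sample text and the token lists shown in the comments are illustrative, not guaranteed output:
from nltk import sent_tokenize, word_tokenize

text = "Hello Mr. Smith. How are you doing today?"
print(sent_tokenize(text))  # expected: ['Hello Mr. Smith.', 'How are you doing today?']
print(word_tokenize(text))  # expected: ['Hello', 'Mr.', 'Smith', '.', 'How', ...]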
This is also recommended in the error message in more recent versions:
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
Searched in:
- '/root/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- '/usr/nltk_data'
- '/usr/lib/nltk_data'
- ''
**********************************************************************
If you do not pass any argument to the download function, it downloads all packages, i.e. chunkers, grammars, misc, sentiment, taggers, corpora, help, models, stemmers, and tokenizers:
nltk.download()
The above function saves packages to a specific directory. You can find that directory location in the comments here: https://github.com/nltk/nltk/blob/67ad86524d42a3a86b1f5983868fd2990b59f1ba/nltk/downloader.py#L1051
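To see where those packages land on your machine, you can inspect nltk.data.path directly; in the minimal sketch below, the /tmp/my_nltk_data directory is just an example:
import nltk

# Directories NLTK searches for data, in priority order.
print(nltk.data.path)

# Download punkt into a custom directory instead of the default.
nltk.download('punkt', download_dir='/tmp/my_nltk_data')

# The custom directory must be on the search path before loading.
nltk.data.path.append('/tmp/my_nltk_data')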
This is what worked for me just now:
# Do this in a separate python interpreter session, since you only have to do it once
import nltk
nltk.download('punkt')
# Do this in your ipython notebook or analysis script
from nltk.tokenize import word_tokenize
sentences = [
    "Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.",
    "Professor Plum has a green plant in his study.",
    "Miss Scarlett watered Professor Plum's green plant while he was away from his office last week."
]

sentences_tokenized = []
for s in sentences:
    sentences_tokenized.append(word_tokenize(s))
sentences_tokenized is a list of lists of tokens:
[['Mr.', 'Green', 'killed', 'Colonel', 'Mustard', 'in', 'the', 'study', 'with', 'the', 'candlestick', '.', 'Mr.', 'Green', 'is', 'not', 'a', 'very', 'nice', 'fellow', '.'],
['Professor', 'Plum', 'has', 'a', 'green', 'plant', 'in', 'his', 'study', '.'],
['Miss', 'Scarlett', 'watered', 'Professor', 'Plum', "'s", 'green', 'plant', 'while', 'he', 'was', 'away', 'from', 'his', 'office', 'last', 'week', '.']]
The sentences were taken from the example IPython notebook accompanying the book "Mining the Social Web, 2nd Edition".
From the bash command line, run:
$ python -c "import nltk; nltk.download('punkt')"
This works for me:
>>> import nltk
>>> nltk.download()
On Windows you will also get the NLTK downloader window.
A simple nltk.download() will not solve this issue. I tried the following and it worked for me:
In the nltk_data folder, create a tokenizers folder and copy your punkt folder into that tokenizers folder. This will work; the folder structure needs to be nltk_data/tokenizers/punkt, and you can verify it with the sketch below.
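A minimal way to confirm the copied files are visible (it raises a LookupError if the layout is still wrong):
import nltk.data

# Prints the resolved path once the manually copied punkt folder is found.
print(nltk.data.find('tokenizers/punkt'))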
NLTK ships with pre-trained tokenizer models. A model is downloaded from predefined web sources and stored in the nltk_data path of the installed NLTK package when you execute one of the following calls:
E.g. 1: tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')
E.g. 2: nltk.download('punkt')
If you call either of the above in your code, make sure you have an internet connection that is not blocked by a firewall.
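If you are behind a corporate proxy, the downloader can be pointed at it first; the set_proxy call comes from NLTK's installation documentation, and the proxy URL and credentials below are placeholders:
import nltk

# Placeholder proxy URL and credentials; replace with your own.
nltk.set_proxy('http://proxy.example.com:3128', ('USERNAME', 'PASSWORD'))
nltk.download('punkt')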
I would like to share an alternative way to resolve the above issue, which may give a deeper understanding of what is going on. Please follow these steps to tokenize English text with NLTK.
Step 1: Download the "english.pickle" model. Go to http://www.nltk.org/nltk_data/ and click "download" at the entry "107. Punkt Tokenizer Models".
Step 2: Extract the downloaded "punkt.zip" file, find the "english.pickle" file inside it, and place it on the C drive.
Step 3: Copy and paste the following code and execute it.
from nltk.data import load
from nltk.tokenize.treebank import TreebankWordTokenizer

sentences = [
    "Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.",
    "Professor Plum has a green plant in his study.",
    "Miss Scarlett watered Professor Plum's green plant while he was away from his office last week."
]

# Load the sentence tokenizer directly from the pickle placed on C:.
tokenizer = load('file:C:/english.pickle')
treebank_word_tokenize = TreebankWordTokenizer().tokenize

wordToken = []
for sent in sentences:
    subSentToken = []
    for subSent in tokenizer.tokenize(sent):
        subSentToken.extend([token for token in treebank_word_tokenize(subSent)])
    wordToken.append(subSentToken)

for token in wordToken:
    print(token)
Let me know if you face any problems.
On Jenkins this can be fixed by adding the following line of code to the Virtualenv Builder under the Build tab:
python -m nltk.downloader punkt
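If the build should keep the data inside the workspace rather than the build user's home directory, a sketch of a Python variant follows; the ./nltk_data location is just an example:
import nltk

# Download into the workspace instead of ~/nltk_data (example location).
nltk.download('punkt', download_dir='./nltk_data', quiet=True)

# Make the workspace copy visible to later tokenizer calls.
nltk.data.path.append('./nltk_data')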
In Spyder, go to your active shell and download NLTK using the two commands below:
import nltk
nltk.download()
Then you should see the NLTK downloader window open. Go to the 'Models' tab in this window, click on 'punkt', and download it.
I had a similar issue when using an assigned folder for multiple downloads, and I had to append the data path manually.
A single download can be achieved as follows (this works):
import os as _os
from nltk.corpus import stopwords
from nltk import download as nltk_download

# get_project_root_path() is a project-specific helper, not part of nltk.
nltk_download('stopwords', download_dir=_os.path.join(get_project_root_path(), 'temp'), raise_on_error=True)

stop_words: list = stopwords.words('english')
This code works, meaning that nltk remembers the download path passed in the download function. On the other hand, if I download a subsequent package, I get a similar error to the one described by the question author.
Multiple downloads raise an error:
import os as _os
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import download as nltk_download
nltk_download(['stopwords', 'punkt'], download_dir=_os.path.join(get_project_root_path(), 'temp'), raise_on_error=True)
print(stopwords.words('english'))
print(word_tokenize("I am trying to find the download path 99."))
Error:
Resource punkt not found. Please use the NLTK Downloader to obtain the resource:
import nltk
nltk.download('punkt')
Now if I append the nltk data path with my download path, it works:
import os as _os
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import download as nltk_download
from nltk.data import path as nltk_path
nltk_path.append( _os.path.join(get_project_root_path(), 'temp'))
nltk_download(['stopwords', 'punkt'], download_dir=_os.path.join(get_project_root_path(), 'temp'), raise_on_error=True)
print(stopwords.words('english'))
print(word_tokenize("I am trying to find the download path 99."))
This works... I am not sure why it works in one case but not the other, but the error message seems to imply that the downloader does not check the download folder the second time. NB: using Windows 8.1 / Python 3.7 / NLTK 3.5.
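An alternative to appending nltk.data.path in code is the NLTK_DATA environment variable, which nltk.data reads when it is first imported; a sketch reusing the author's get_project_root_path() helper from above:
import os

# Must be set before nltk is imported, or nltk.data will not see it.
os.environ['NLTK_DATA'] = os.path.join(get_project_root_path(), 'temp')

from nltk.tokenize import word_tokenize
print(word_tokenize("I am trying to find the download path 99."))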
I came across this problem when I was trying to do POS tagging in NLTK. The way I got it working was by making a new directory named "taggers" alongside the corpora directory, and copying max_pos_tagger into that taggers directory.
I hope it works for you too. Best of luck!
In Python 3.6, I can see the suggestion right in the traceback, which is quite helpful. So pay attention to the error you get; most of the time the answer lies within the problem itself ;).
Then, as suggested by other folks here, you can install packages on the fly either from the Python terminal or with a command like:
python -c "import nltk; nltk.download('wordnet')"
You just need to run that command once and the data will be saved locally in your home directory.
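If you would rather not remember to run it at all, a common pattern (a sketch, not an official NLTK recipe) is to download lazily on first failure:
import nltk

try:
    nltk.data.find('tokenizers/punkt')  # raises LookupError if missing
except LookupError:
    nltk.download('punkt')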
You just need to go to the Python console and type:
import nltk
Press enter, then type:
nltk.download()
An interface will then appear. Just look for the Download button and press it. It will install all the required items, which takes some time. Give it time, then try again; your problem will be solved.
Check that you have all the required NLTK data packages installed.
The punkt tokenizer data is quite large at over 35 MB; this can be a big deal if, like me, you are running nltk in an environment such as AWS Lambda that has limited resources.
If you only need one or a few language tokenizers, you can drastically reduce the size of the data by including only those languages' .pickle files.
If all you need to support is English, your nltk data size can be reduced to 407 KB (for the Python 3 version).
Steps
- Download the nltk punkt data: https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt.zip
- Somewhere in your environment create the folders nltk_data/tokenizers/punkt; if using Python 3, add another folder PY3 so that your new directory structure looks like nltk_data/tokenizers/punkt/PY3. In my case I created these folders at the root of my project.
- Extract the zip and move the .pickle files for the languages you want to support into the punkt folder you just created. Note: Python 3 users should use the pickles from the PY3 folder.
- Now add your nltk_data folder to the search paths, assuming your data is not in one of the pre-defined search paths. You can do this with the environment variable NLTK_DATA='path/to/your/nltk_data', or add a custom path at runtime in Python:

from nltk import data
data.path += ['/path/to/your/nltk_data']

NOTE: If you don't need to load the data at runtime or bundle the data with your code, it would be best to create your nltk_data folders at the built-in locations that nltk looks in.
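As a sanity check for the trimmed bundle, here is a minimal sketch assuming the nltk_data folder sits next to the running script; how sent_tokenize resolves the pickle internally varies by NLTK version:
import os
from nltk import data
from nltk.tokenize import sent_tokenize

# Point nltk at the slimmed-down data directory created above.
data.path += [os.path.join(os.path.dirname(os.path.abspath(__file__)), 'nltk_data')]

print(sent_tokenize("Mr. Green killed Colonel Mustard. He is not a nice fellow."))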
nltk.download() will not solve this issue. I tried the following and it worked for me:
In the '...AppData\Roaming\nltk_data\tokenizers' folder, extract the downloaded punkt.zip at that same location.
Simply add the two lines given below:
import nltk
nltk.download('punkt')
If all of the above strategies don't work (which was the case for me), just run the following code:
import nltk.data
tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')
I must have wasted hours because of this, and this code seems to have solved my problem.
Reference:
https://www.nltk.org/howto/data.html