Starting out NLP - Python + large data set [closed]
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 3 years ago.
I've been wanting to learn Python and do some NLP, so I've finally gotten round to starting. I downloaded the English Wikipedia mirror for a nice chunky dataset to start on, and have been playing around a bit; at this stage I'm just getting some of it into a SQLite db (I haven't worked with databases before, unfortunately).
But I'm guessing SQLite is not the way to go for a full-blown NLP project (/experiment :). What sort of things should I look at? HBase (and Hadoop) seem interesting; I guess I could run them in Java, prototype in Python, and maybe migrate the really slow bits to Java. Alternatively, I could just run MySQL, but the dataset is 12 GB, so I wonder if that will be a problem? I also looked at Lucene, but I'm not sure how I'd get that to work (other than breaking the wiki articles into chunks).
What comes to mind for a really flexible NLP platform? (I don't really know at this stage what I want to do; I just want to learn large-scale language analysis, to be honest.)
Many thanks.
NLTK is where you should start (it's Python-based). I'm not sure why you're already thinking about parallelizing your processing at such an early stage; my advice is to start with a more flexible experimental setup. SQLite should be fine for a few GB; if you need more advanced and standard SQL power, you could consider PostgreSQL.
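A minimal sketch of that kind of setup, pulling rows out of the SQLite db and tokenizing them with NLTK (the database file name and table schema here are assumptions for illustration):

```python
import sqlite3
import nltk

# one-time download of the tokenizer models
nltk.download("punkt")

conn = sqlite3.connect("wiki.db")  # assumed database file
cur = conn.execute("SELECT title, body FROM articles LIMIT 100")  # assumed schema

for title, body in cur:
    tokens = nltk.word_tokenize(body)
    freq = nltk.FreqDist(tokens)
    print(title, freq.most_common(10))
```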
There is a related talk from PyCon 2010, "The Python and the Elephant: Large Scale Natural Language Processing with NLTK and Dumbo".
The link has introductory information, slides and video.
I think SQLite is still a good choice for 12 GB of data. I have a text classification training set of similar size, and both SQLite and plain text are fine as long as you just iterate over it line by line.
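For example, a rough sketch of streaming the corpus one document at a time from either source (file name, database name and table schema are placeholders):

```python
import sqlite3

def iter_plain_text(path):
    """Yield one document per line from a plain-text dump."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")

def iter_sqlite(path):
    """Yield documents from an assumed articles(body) table, one row at a time."""
    conn = sqlite3.connect(path)
    # the cursor streams rows lazily, so the 12 GB never sits in memory at once
    for (body,) in conn.execute("SELECT body FROM articles"):
        yield body

for doc in iter_plain_text("wiki.txt"):  # or iter_sqlite("wiki.db")
    pass  # tokenize / count / train here
```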
It is most likely that you will use a Vector Space Model to represent the text while doing the analysis.
In that case, you should look at platforms that can help you store term vectors with term frequencies. It makes your life so much easier.
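As a toy illustration of the idea, here is a sketch that builds plain term-frequency vectors with the standard library; for real work something like scikit-learn's CountVectorizer would be the usual shortcut:

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# one Counter per document acts as a sparse term-frequency vector
term_vectors = [Counter(doc.split()) for doc in docs]

# line the vectors up against a shared vocabulary to see the dense form
vocabulary = sorted(set().union(*term_vectors))
for vec in term_vectors:
    print([vec.get(term, 0) for term in vocabulary])
```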
Have a look at Apache Lucene, which has a Python library (PyLucene) for accessing Java Lucene. Elasticsearch is also a good alternative: it uses Apache Lucene underneath, has a really good Python package, and also exposes a REST API.
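For instance, a minimal indexing-and-search sketch with the official elasticsearch Python client (the index name and document fields are made up; the keyword arguments follow the 8.x client, while older clients take a single body argument):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumes a local node

# index a Wikipedia article (index name and fields are illustrative)
es.index(index="wiki", id=1, document={
    "title": "Natural language processing",
    "body": "Natural language processing (NLP) is a subfield of ...",
})

# full-text search over the body field
hits = es.search(index="wiki", query={"match": {"body": "language"}})
for hit in hits["hits"]["hits"]:
    print(hit["_source"]["title"], hit["_score"])
```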
PostgreSQL is also really good at storing tokens. Check out this article to learn more.
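One common way to do that is PostgreSQL's built-in full-text types; here is a rough sketch with psycopg2, where the table name and connection details are placeholders:

```python
import psycopg2

conn = psycopg2.connect(dbname="wiki", user="postgres")  # assumed credentials
cur = conn.cursor()

# store the raw text plus a tsvector of its tokens
cur.execute("""
    CREATE TABLE IF NOT EXISTS articles (
        id     serial PRIMARY KEY,
        body   text,
        tokens tsvector
    )
""")
cur.execute(
    "INSERT INTO articles (body, tokens) VALUES (%s, to_tsvector('english', %s))",
    ("The cat sat on the mat.", "The cat sat on the mat."),
)

# token-level search against the tsvector column
cur.execute("SELECT id FROM articles WHERE tokens @@ to_tsquery('english', 'cat')")
print(cur.fetchall())
conn.commit()
```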
I have worked with sizable language data before and I personally prefer Lucene/Elasticsearch for analysis projects.
Cheers.
Summary from the internet:
spaCy is a natural language processing (NLP) library for Python designed for fast performance, and with word embedding models built in, it's perfect for a quick and easy start. Gensim is a topic modelling library for Python that provides access to Word2Vec and other word embedding algorithms for training, and it also lets you load pre-trained word embeddings downloaded from the internet.
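A small sketch of that, assuming Gensim 4.x (where the parameters are vector_size and epochs; older releases call them size and iter), training Word2Vec on a couple of toy tokenized sentences:

```python
from gensim.models import Word2Vec

# a few tokenized sentences; in practice these would be streamed from the corpus
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=20)
print(model.wv.most_similar("cat", topn=3))
```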
NLTK details already given above.
Stanford NLP has recently launched a Python framework supporting 50+ languages. You should definitely check it out. There are many others, but the above four are the most usable in terms of community support and up-to-date features.
I personally prefer spaCy. It is one of the fastest of them all and can have Gensim and other APIs integrated into its pipeline. Moreover, spaCy has models for a lot of languages, some still in alpha, making it a good choice for multilingual apps.
Scaling is a whole different topic (there are a lot of tools you can use), but to stick to scaling within NLP: spaCy gives you so much control over its pipeline that you can disable unwanted components, making it faster, as sketched below.
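As a rough sketch, assuming the small English model en_core_web_sm is installed, you can disable the parser and NER components when you only need tokens and tags:

```python
import spacy

# load the small English model but skip the parser and NER components,
# which are usually the slow parts when you only need tokens and POS tags
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

doc = nlp("Wikipedia is a free online encyclopedia.")
print([(token.text, token.pos_) for token in doc])
```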
Look into it, try it yourself, and explore.