Starting out NLP - Python + large data set [closed]
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 3 years ago.
I've been wanting to learn Python and do some NLP, so I've finally gotten round to starting. I downloaded the English Wikipedia mirror for a nice chunky dataset to start on, and have been playing around a bit; at this stage I'm just getting some of it into a SQLite db (I haven't worked with databases before, unfortunately).
But I'm guessing SQLite is not the way to go for a full-blown NLP project (/experiment :). What sort of things should I look at? HBase (and Hadoop) seem interesting; I guess I could run them in Java, prototype in Python, and maybe migrate the really slow bits to Java. Alternatively, I could just run MySQL, but the dataset is 12 GB, so I wonder if that will be a problem? I also looked at Lucene, but I'm not sure how I'd get that to work (other than breaking the wiki articles into chunks).
What comes to mind for a really flexible NLP platform? (I don't really know at this stage what I want to do; I just want to learn large-scale language analysis, to be honest.)
Many thanks.
NLTK is where you should start (it's Python-based). I'm not sure why you're already thinking about parallelizing your processing at such an early stage; my advice is to start with a more flexible experimental setup. SQLite should be fine for a few GB; if you need more advanced and standard SQL power, you could consider PostgreSQL.
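A minimal sketch of that kind of setup, pulling rows out of the SQLite db and tokenizing them with NLTK (the database file name and table schema here are assumptions for illustration):

```python
import sqlite3
import nltk

# one-time download of the tokenizer models
nltk.download("punkt")

conn = sqlite3.connect("wiki.db")  # assumed database file
cur = conn.execute("SELECT title, body FROM articles LIMIT 100")  # assumed schema

for title, body in cur:
    tokens = nltk.word_tokenize(body)
    freq = nltk.FreqDist(tokens)
    print(title, freq.most_common(10))
```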
There is a related talk from PyCon 2010, "The Python and the Elephant: Large Scale Natural Language Processing with NLTK and Dumbo".
The link has introductory information, slides and video.
I think SQLite is still a good choice for 12 GB of data. I have a text classification training set of similar size, and both SQLite and plain text are fine as long as you just iterate over it line by line.
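For example, a rough sketch of streaming the corpus one document at a time from either source (file name, database name and table schema are placeholders):

```python
import sqlite3

def iter_plain_text(path):
    """Yield one document per line from a plain-text dump."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")

def iter_sqlite(path):
    """Yield documents from an assumed articles(body) table, one row at a time."""
    conn = sqlite3.connect(path)
    # the cursor streams rows lazily, so the 12 GB never sits in memory at once
    for (body,) in conn.execute("SELECT body FROM articles"):
        yield body

for doc in iter_plain_text("wiki.txt"):  # or iter_sqlite("wiki.db")
    pass  # tokenize / count / train here
```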
It is most likely that you will use a Vector Space Model to represent the text while doing the analysis.
In that case, you should look at platforms that can help you store term vectors with term frequencies. It makes your life so much easier.
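As a toy illustration of the idea, here is a sketch that builds plain term-frequency vectors with the standard library; for real work something like scikit-learn's CountVectorizer would be the usual shortcut:

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# one Counter per document acts as a sparse term-frequency vector
term_vectors = [Counter(doc.split()) for doc in docs]

# line the vectors up against a shared vocabulary to see the dense form
vocabulary = sorted(set().union(*term_vectors))
for vec in term_vectors:
    print([vec.get(term, 0) for term in vocabulary])
```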
Have a look at Apache Lucene, which has a Python library (PyLucene) for accessing Java Lucene. Elasticsearch is also a good alternative: it uses Apache Lucene underneath, has a really good Python package, and also exposes a REST API.
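For instance, a minimal indexing-and-search sketch with the official elasticsearch Python client (the index name and document fields are made up; the keyword arguments follow the 8.x client, while older clients take a single body argument):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumes a local node

# index a Wikipedia article (index name and fields are illustrative)
es.index(index="wiki", id=1, document={
    "title": "Natural language processing",
    "body": "Natural language processing (NLP) is a subfield of ...",
})

# full-text search over the body field
hits = es.search(index="wiki", query={"match": {"body": "language"}})
for hit in hits["hits"]["hits"]:
    print(hit["_source"]["title"], hit["_score"])
```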
PostgreSQL is also really good at storing tokens. Check out this article to learn more.
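One common way to do that is PostgreSQL's built-in full-text types; here is a rough sketch with psycopg2, where the table name and connection details are placeholders:

```python
import psycopg2

conn = psycopg2.connect(dbname="wiki", user="postgres")  # assumed credentials
cur = conn.cursor()

# store the raw text plus a tsvector of its tokens
cur.execute("""
    CREATE TABLE IF NOT EXISTS articles (
        id     serial PRIMARY KEY,
        body   text,
        tokens tsvector
    )
""")
cur.execute(
    "INSERT INTO articles (body, tokens) VALUES (%s, to_tsvector('english', %s))",
    ("The cat sat on the mat.", "The cat sat on the mat."),
)

# token-level search against the tsvector column
cur.execute("SELECT id FROM articles WHERE tokens @@ to_tsquery('english', 'cat')")
print(cur.fetchall())
conn.commit()
```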
I have worked with sizable language data before and I personally prefer Lucene/Elasticsearch for analysis projects.
Cheers.
Summary from the internet:
spaCy is a natural language processing (NLP) library for Python designed for fast performance, and with word embedding models built in, it's perfect for a quick and easy start. Gensim is a topic modelling library for Python that provides access to Word2Vec and other word embedding algorithms for training, and it also lets you load pre-trained word embeddings downloaded from the internet.
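A small sketch of that, assuming Gensim 4.x (where the parameters are vector_size and epochs; older releases call them size and iter), training Word2Vec on a couple of toy tokenized sentences:

```python
from gensim.models import Word2Vec

# a few tokenized sentences; in practice these would be streamed from the corpus
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=20)
print(model.wv.most_similar("cat", topn=3))
```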
NLTK details already given above.
Stanford NLP has recently launched a Python framework supporting 50+ languages. You should definitely check it out. There are many others, but the above four are the most usable in terms of community support and up-to-date features.
I personally prefer spaCy. It is one of the fastest of them all and can have Gensim and other APIs integrated into its pipeline. Moreover, spaCy has models for a lot of languages, some still in alpha, making it a good choice for multilingual apps.
Scaling is a whole different topic (there are a lot of tools you can use), but to stick to scaling within NLP: spaCy gives you so much control over its pipeline that you can disable unwanted components, making it faster, as sketched below.
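As a rough sketch, assuming the small English model en_core_web_sm is installed, you can disable the parser and NER components when you only need tokens and tags:

```python
import spacy

# load the small English model but skip the parser and NER components,
# which are usually the slow parts when you only need tokens and POS tags
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

doc = nlp("Wikipedia is a free online encyclopedia.")
print([(token.text, token.pos_) for token in doc])
```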
Look into it, try it yourself, and explore.