Streaming API with languages

Is there any way I can retrieve only English tweets using Twitter's live Streaming API? It seems that using "sample" or "filter" returns around 60-70 percent non-English tweets.

Thanks

Joel


I haven't found a good solution to this, but I've worked around it as follows:

1) Filter by the lang attribute being equal to "en".

2) I found that tweets in several non-English languages still show up among the tweets labelled as English. So I downloaded Spanish, Dutch, and Indonesian word lists and counted the occurrences of non-English words in each tweet. If a tweet contains more than one, I discard it as non-English.

3) I think I need to filter for Portuguese as well; I still need to investigate this.
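The steps above could be sketched roughly as follows. This is only an illustration under assumptions: the function name `looks_english`, the `max_hits` parameter, and the tiny word sets are hypothetical stand-ins, and real word lists would be loaded from files and pruned of words that are also common in English.

```python
import re

# Hypothetical, tiny stand-in word lists; the real lists would be much
# larger and loaded from downloaded files.
NON_ENGLISH_WORDS = {
    "es": {"que", "para", "porque", "gracias", "pero"},
    "nl": {"niet", "een", "het", "maar", "voor"},
    "id": {"yang", "dan", "tidak", "aku", "ini"},
}
ALL_NON_ENGLISH = set().union(*NON_ENGLISH_WORDS.values())


def looks_english(text, lang, max_hits=1):
    """Keep a tweet only if it is labelled "en" (step 1) and contains
    at most max_hits words from the non-English lists (step 2)."""
    if lang != "en":
        return False
    # Split into lowercase runs of letters, ignoring digits and punctuation.
    words = re.findall(r"[^\W\d_]+", text.lower())
    hits = sum(1 for word in words if word in ALL_NON_ENGLISH)
    return hits <= max_hits
```

With these stand-in lists, `looks_english("gracias pero no", "en")` is rejected because two listed words occur, while a tweet with a single borrowed word is still kept.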


Filtering only English-language messages from the Twitter stream is an active research area. You could use an off-the-shelf language identification system to process the stream locally and select only the messages in English. One such system is langid.py. Full disclosure: I am the author of langid.py.

Another system I know of is ldig by Nakatani Shuyo. I haven't had a chance to experiment with it yet, but it is made specifically for language identification of Twitter messages.
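As a minimal sketch of local filtering, the helper below keeps only the texts that a classifier labels as English. The name `keep_english` is hypothetical; the classifier is passed in as a function that returns a `(language, score)` pair, which is the shape that `langid.classify(text)` returns.

```python
def keep_english(texts, classify):
    """Yield only the texts for which classify(text) reports "en".

    classify is any callable returning a (language_code, score) pair,
    for example langid.classify from langid.py.
    """
    for text in texts:
        lang, _score = classify(text)
        if lang == "en":
            yield text


# Usage with langid.py, assuming it is installed (pip install langid):
#
#     import langid
#     english = list(keep_english(tweet_texts, langid.classify))
```

Passing the classifier as an argument keeps the filter independent of any one library, so another identifier such as ldig could be swapped in behind the same interface.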


Twitter will soon be releasing a new (or updated) attribute just for this purpose! See their blog post, "Introducing new metadata for Tweets".

The new lang attribute specifies the language the Tweet was written in, as identified by Twitter's machine language detection algorithms.

At the time of this writing, the lang attribute and the language parameter haven't appeared yet. Check the Calendar of API changes to see when they plan to release them (it currently just says "2013").

Update 3/30/2013:

The lang attribute was added to the Streaming API on March 26, 2013. It had already been made available on the REST API on March 6, 2013.
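Once the lang attribute is present, filtering on the client side could look something like the sketch below. It assumes the stream is consumed as one JSON object per line and that non-tweet messages (such as delete notices) simply lack a `lang` key; the function name `english_tweets` is hypothetical.

```python
import json


def english_tweets(lines):
    """Yield decoded stream messages whose lang attribute is "en"."""
    for line in lines:
        line = line.strip()
        if not line:
            # Skip blank keep-alive lines.
            continue
        message = json.loads(line)
        # Messages without a lang attribute are skipped as well.
        if message.get("lang") == "en":
            yield message
```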


For use in the Twitter Streaming API, language is now a request parameter:

https://dev.twitter.com/docs/streaming-apis/parameters#language

So for English you'd add 'language=en' to your request parameter string.
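For illustration, a request URL carrying that parameter could be built like this. Treat the details as assumptions: the endpoint URL and the `track` keywords are examples, `build_stream_request` is a hypothetical helper, joining several languages with commas is assumed to be accepted, and the OAuth signing that a real request needs is left out.

```python
from urllib.parse import urlencode

# Assumed v1.1 filter endpoint; check the current documentation.
STREAM_URL = "https://stream.twitter.com/1.1/statuses/filter.json"


def build_stream_request(track, languages=("en",)):
    """Return a filter URL whose query string includes language=..."""
    params = {
        "track": ",".join(track),
        "language": ",".join(languages),
    }
    # urlencode percent-encodes the commas inside the values.
    return STREAM_URL + "?" + urlencode(params)


url = build_stream_request(["python", "api"])
```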


Twitter has just released it!! See the API calendar:

https://dev.twitter.com/calendar

March 26, 2013: "lang attribute & language parameter appears on streaming" (tagged: Blog post, Streaming API).

The Twitter API rocks!!
