PHP Twitter Tweets Language

I'm building a site that uses tweets from Twitter's public timeline.

http://twitter.com/statuses/public_timeline.xml

I don't want tweets in Chinese, Russian, etc. I want everything except tweets written in non-Latin scripts.

Here is an example of what I don't want: スポーツブランドPR、マーケティング。2児の母。好きなもの:ユニコーン、着物、駅伝。

I've tried mb_detect_encoding with UTF-8, but that isn't working.


You can simply use the Google Language API:

GET https://www.googleapis.com/language/translate/v2?key=INSERT-YOUR-KEY&target=de&q=Hello%20world

and it will return the language in JSON:

{
    "data": {
        "translations": [
            {
                "translatedText": "Hallo Welt",
                "detectedSourceLanguage": "en"
            }
        ]
    }
}

The example is taken from the official documentation; search for "Here is an another example in which the language of the source text is auto-detected:"
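As a rough sketch of how that response could be used in PHP, the snippet below pulls detectedSourceLanguage out of the JSON and keeps a tweet only when it is "en". The function name detectedLanguage and the $sampleResponse string are illustrative assumptions; the sample simply mirrors the response shown above, and no HTTP request is made here.

```php
<?php
// Extract "detectedSourceLanguage" from a translate-style JSON response.
// Returns null when the JSON is invalid or the field is missing.
function detectedLanguage(string $json): ?string
{
    $decoded = json_decode($json, true);
    if (!is_array($decoded)) {
        return null;
    }
    return $decoded['data']['translations'][0]['detectedSourceLanguage'] ?? null;
}

// Sample body copied from the response above (not a live API call).
$sampleResponse = '{"data":{"translations":[{"translatedText":"Hallo Welt","detectedSourceLanguage":"en"}]}}';

$lang = detectedLanguage($sampleResponse);
$keepTweet = ($lang === 'en'); // keep only tweets detected as English
```

In a real script the JSON string would come from the HTTP response for each tweet, so expect one request per tweet unless you batch them.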


All the tweets share the same encoding; the English posts are in UTF-8 too ;) That is why mb_detect_encoding can't tell them apart.

There are two options. Either find a way in the Twitter API to request English-only posts,

or use a regex and a loop to filter out the posts that contain non-Roman/Latin characters.

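preg_match('/[^\x{00}-\x{FF}]+/u', $post); // returns 1 if $post contains a character above U+00FF (the original \00-\255 were octal escapes, so the range stopped at U+00AD)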
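A minimal sketch of the regex-and-loop idea, assuming the intent is to drop any post that contains a character above U+00FF (outside ASCII and Latin-1). The range is spelled with \x{...} escapes so the bounds are unambiguous. The helper name hasNonLatin1 and the sample posts are made up for illustration.

```php
<?php
// True when $post contains at least one character above U+00FF.
// The U+00FF cutoff is an assumption about what "non-Latin" should mean.
function hasNonLatin1(string $post): bool
{
    return preg_match('/[^\x{00}-\x{FF}]/u', $post) === 1;
}

// Hypothetical sample posts standing in for the parsed timeline.
$posts = [
    'Hello world',
    'スポーツブランドPR、マーケティング。',
    'Café au lait',
];

// Keep only the posts made up entirely of ASCII/Latin-1 characters.
$kept = [];
foreach ($posts as $post) {
    if (!hasNonLatin1($post)) {
        $kept[] = $post;
    }
}
```

Note that typographic quotes and emoji also sit above U+00FF, so a strict cutoff like this drops otherwise English tweets that contain them.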

Hope this helps,

Niko


I don't think there's a way to declare a language filter when querying the public timeline.

However, a language field is returned in a public timeline query for the user that posted the tweet. You could filter on this with a pretty high degree of confidence.
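Filtering on that field might look roughly like the sketch below. It assumes the timeline XML nests each tweet as a status element with a text child and a user child carrying a lang element; those element names, the function name tweetsByUserLang, and the sample XML are assumptions to check against a real response.

```php
<?php
// Return the text of every status whose user language is in $allowedLangs.
// Assumed layout: <statuses><status><text/><user><lang/></user></status></statuses>
function tweetsByUserLang(string $xml, array $allowedLangs): array
{
    $statuses = simplexml_load_string($xml);
    if ($statuses === false) {
        return []; // unparseable XML: nothing to keep
    }
    $kept = [];
    foreach ($statuses->status as $status) {
        $lang = (string) $status->user->lang;
        if (in_array($lang, $allowedLangs, true)) {
            $kept[] = (string) $status->text;
        }
    }
    return $kept;
}

// Hypothetical sample standing in for the downloaded timeline.
$sampleXml = <<<XML
<statuses>
  <status><text>Hello world</text><user><lang>en</lang></user></status>
  <status><text>こんにちは</text><user><lang>ja</lang></user></status>
</statuses>
XML;

$english = tweetsByUserLang($sampleXml, ['en']);
```

The field describes the user's profile setting, not the tweet itself, so a user with an "en" profile can still post in another language.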
