PHP Twitter Tweets Language
I'm building a site that uses tweets from Twitter's public timeline:
http://twitter.com/statuses/public_timeline.xml
I don't want tweets in Chinese, Russian, etc. I want everything except tweets written in non-Latin scripts.
Here is an example of what I don't want: スポーツブランドPR、マーケティング。2児の母。好きなもの:ユニコーン、着物、駅伝。
I've tried mb_detect_encoding with UTF-8, but that isn't working.
You can simply use the Google Language API:
GET https://www.googleapis.com/language/translate/v2?key=INSERT-YOUR-KEY&target=de&q=Hello%20world
and it will return the detected language (the detectedSourceLanguage field) in JSON:
{
  "data": {
    "translations": [
      {
        "translatedText": "Hallo Welt",
        "detectedSourceLanguage": "en"
      }
    ]
  }
}
Example taken from the official documentation, search for "Here is an another example in which the language of the source text is auto-detected:"
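As a rough sketch of how the response could be consumed in PHP: the function below decodes a body shaped like the JSON above and returns the detectedSourceLanguage value. The function name detectedLanguage is hypothetical, and the HTTP request itself is left out; only the JSON shape shown in this answer is assumed.

```php
<?php
// Hypothetical helper: extract the detected language from a response body
// shaped like the JSON shown above. Returns null if the field is missing
// or the body is not a JSON object.
function detectedLanguage(string $json): ?string
{
    $decoded = json_decode($json, true);
    if (!is_array($decoded)) {
        return null;
    }
    return $decoded['data']['translations'][0]['detectedSourceLanguage'] ?? null;
}

// Sample body copied from the response above.
$sample = '{"data":{"translations":[{"translatedText":"Hallo Welt","detectedSourceLanguage":"en"}]}}';
$language = detectedLanguage($sample);
```

A tweet would then be kept only when the returned code is one you accept, for example 'en'.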
The encoding is the same for every tweet; the English posts are in UTF-8 too ;) That is why mb_detect_encoding can't tell them apart.
There are two options: either find a way in the Twitter API to request English-only posts, or loop over the posts and use a regex to filter out those containing non-Roman/Latin characters:
preg_match('/[^\x{00}-\x{FF}]/u', $post); // 1 if $post contains a character above U+00FF (the original \255 is octal, i.e. U+00AD)
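A minimal sketch of the regex-and-loop idea, assuming the posts are already available as an array of UTF-8 strings. The helper name hasNonLatin1 and the sample posts are made up for illustration; the pattern uses hex escapes to cover U+0000 through U+00FF.

```php
<?php
// Hypothetical helper: true when $text contains any character outside
// the range U+0000..U+00FF (ASCII plus the Latin-1 supplement).
function hasNonLatin1(string $text): bool
{
    // preg_match returns 1 on a match, 0 on no match, false on error
    // (for example invalid UTF-8 with the /u modifier).
    return preg_match('/[^\x{00}-\x{FF}]/u', $text) === 1;
}

// Illustrative sample posts, not real timeline data.
$posts = ['Hello world', 'Café olé', 'スポーツブランドPR'];

// Keep only the posts that stay within the Latin-1 range.
$kept = array_values(array_filter($posts, function ($post) {
    return !hasNonLatin1($post);
}));
```

Note that this check is strict: typographic quotes, emoji and other symbols above U+00FF would also cause a tweet to be dropped, so the range may need widening.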
Hope this helps,
Niko
I don't think there's a way to declare a language filter when querying the public timeline.
However, a language field is returned in a public timeline query for the user that posted the tweet. You could filter on this with a pretty high degree of confidence.
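A possible sketch of filtering on that per-user field with SimpleXML. The element names used here (status, text, user, lang) are an assumption about the shape of the timeline XML and are not confirmed in this thread, so adjust them to the actual response. The function name tweetsByUserLanguage is hypothetical.

```php
<?php
// Assumed shape of the timeline XML; element names are illustrative.
$xml = <<<XML
<statuses>
  <status><text>Hello world</text><user><lang>en</lang></user></status>
  <status><text>こんにちは</text><user><lang>ja</lang></user></status>
</statuses>
XML;

// Hypothetical helper: return the text of every status whose user
// language field equals $lang.
function tweetsByUserLanguage(string $xml, string $lang): array
{
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        return []; // unparseable XML: nothing to keep
    }
    $texts = [];
    foreach ($doc->status as $status) {
        if ((string) $status->user->lang === $lang) {
            $texts[] = (string) $status->text;
        }
    }
    return $texts;
}

$english = tweetsByUserLanguage($xml, 'en');
```

Since the field describes the user's setting rather than the tweet itself, a user set to English can still post in another language, so this could be combined with a character check if needed.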