Problem trying to extract words from string in PHP
I'm trying to extract all words from a string into an array, but I am having some problems with spaces (&nbsp;).
This is what I do:
//Clean data to text only
$data = strip_tags($data);
$data = htmlentities($data, ENT_QUOTES, 'UTF-8');
$data = html_entity_decode($data, ENT_QUOTES, 'UTF-8');
$data = htmlspecialchars_decode($data);
$data = mb_strtolower($data, 'UTF-8');
//Clean up text from special chrs I don't want as words
$data = str_replace(',', '', $data);
$data = str_replace('.', '', $data);
$data = str_replace(':', '', $data);
$data = str_replace(';', '', $data);
$data = str_replace('*', '', $data);
$data = str_replace('?', '', $data);
$data = str_replace('!', '', $data);
$data = str_replace('-', ' ', $data);
$data = str_replace("\n", ' ', $data);
$data = str_replace("\r", ' ', $data);
$data = str_replace("\t", ' ', $data);
$data = str_replace("\0", ' ', $data);
$data = str_replace("\x0B", ' ', $data);
$data = str_replace("&nbsp;", ' ', $data);
//Clean up duplicated spaces
do {
    $data = str_replace('  ', ' ', $data);
} while(strpos($data, '  ') !== false);
//Make array
$clean_data = explode(' ', $data);
echo "<pre>";
var_dump($clean_data);
echo "</pre>";
This outputs:
array(58) {
[0]=>
string(5) " "
[1]=>
string(5) " "
[2]=>
string(11) "anläggning"
[3]=>
string(3) "med"
[4]=>
string(3) "den"
[5]=>
string(10) "erfarenhet"
[6]=>
string(3) "som"
}
If I check the source of the output, I see that the first two array values are &nbsp;.
UPDATE:
After some tweaking of the code I managed to get the following output:
array(56) {
[0]=>
string(1) "�" //Notice the change: instead of string length 5 it now says 1, but it is still garbage.
[1]=>
string(1) "�"
[2]=>
string(11) "anläggning"
[3]=>
string(3) "med"
[4]=>
string(3) "den"
[5]=>
string(10) "erfarenhet"
[6]=>
string(3) "som"
[7]=>
string(5) "finns"
[8]=>
string(4) "inom"
Thanks!
ANSWER (for lazy people):
Even though this is a slightly different approach to the problem, and it never really answers why I had the problems above (like leftover &nbsp; and other weird extra spaces), I like it, and it is a lot better than my original code.
Thanks to all who contributed to this!
//Clean data to text only
$data = strip_tags($data);
$data = html_entity_decode($data, ENT_QUOTES, 'UTF-8');
$data = htmlspecialchars_decode($data);
$data = mb_strtolower($data, 'UTF-8');
//Clean up text from special chrs
$data = str_replace(array("-"), ' ', $data);
$clean_data = str_word_count($data, 1, 'äöå');
echo "<pre>";
var_dump($clean_data);
echo "</pre>";
OK, the only thing you have to do is replace &nbsp; with a space, as you already do (only if the string really still contains &nbsp;; check @Andy E's answer to make sure that your data does not contain any HTML entities):
$data = str_replace("&nbsp;", ' ', $data);
Then you can use str_word_count to get the words:
$words = str_word_count($data, 1, 'äöåÄÖÅ');
P.S.: What is the point of calling htmlentities first and then reverting it again with html_entity_decode anyway?
Update: Example:
$str = ' anläggning med den erfahrenhet som åååÅ ÅÅ';
print_r(str_word_count($str, 1, 'äöåÄÖÅ'));
prints
Array
(
[0] => anläggning
[1] => med
[2] => den
[3] => erfahrenhet
[4] => som
[5] => åååÅ
[6] => ÅÅ
)
Reading documentation helps :)
Is it possible you're "double encoding" any existing &nbsp; parts of the string? You call htmlentities on the string before html_entity_decode, so any existing &nbsp; entities would become &amp;nbsp;. You can prevent htmlentities from double encoding by passing false as the fourth parameter.
$data = htmlentities($data, ENT_QUOTES, 'UTF-8', false);
$data = html_entity_decode($data, ENT_QUOTES, 'UTF-8');
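To make the double-encoding effect concrete, here is a minimal sketch. The input string 'foo&nbsp;bar' is made up for illustration and is not the asker's data. It shows how an entity that is already in the text survives the encode/decode round trip by default, which would be consistent with the string(5) values in the question once the ';' has been stripped out.

```php
<?php
// Hypothetical input: text that already contains a literal &nbsp; entity.
$raw = 'foo&nbsp;bar';

// Default behaviour: the "&" of the existing entity is encoded again.
$double = htmlentities($raw, ENT_QUOTES, 'UTF-8');
// $double is now 'foo&amp;nbsp;bar'

// Decoding once only undoes the outer layer, so the entity text survives.
$leftover = html_entity_decode($double, ENT_QUOTES, 'UTF-8');
// $leftover is 'foo&nbsp;bar' again

// With double_encode = false the existing entity is left untouched...
$single = htmlentities($raw, ENT_QUOTES, 'UTF-8', false);

// ...and decoding turns it into a real U+00A0 character (bytes C2 A0).
$decoded = html_entity_decode($single, ENT_QUOTES, 'UTF-8');
```

Note that in the second case the result contains a real non-breaking space character rather than the entity text, so it still has to be treated as a separator when splitting into words.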
Also, bear in mind that you can pass an array of search strings to str_replace:
$data = str_replace(array(',','.',':',';','*','?','!','-'), '', $data);
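As a rough sketch of how the array form could replace the long list of calls in the question: the sample string below is made up, and the split into two calls is my own assumption, made so that '-' and the control characters still become a space (as in the question) instead of being removed.

```php
<?php
// Hypothetical sample text, not taken from the question.
$data = "hej, du!\ten-två";

// Punctuation that should simply disappear.
$data = str_replace([',', '.', ':', ';', '*', '?', '!'], '', $data);

// Characters that should act as word separators become a single space.
$data = str_replace(['-', "\n", "\r", "\t", "\0", "\x0B"], ' ', $data);

// $data is now "hej du en två"
```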
Instead of the 14 str_replace calls followed by:
do {
    $data = str_replace('  ', ' ', $data);
} while(strpos($data, '  ') !== false);
you can do:
$data = preg_replace('/[.*,:;?!]/', '', $data);
$data = preg_replace('/(?:\xC2\xA0|\s{2,}|-)/', ' ', $data);
Here \xC2\xA0 (0xC2A0) is the UTF-8 encoding of the non-breaking space (&nbsp;), and \s matches any whitespace character, which covers the repeated str_replace calls.
print_r( explode(" ", $data));
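Here is a small worked example of the two preg_replace calls, assuming UTF-8 input. The sample string is made up, and the final preg_split with PREG_SPLIT_NO_EMPTY is my own addition rather than part of the answer. It guards against empty array entries when a non-breaking space sits next to an ordinary space.

```php
<?php
// Hypothetical sample: "anläggning", a non-breaking space (bytes C2 A0),
// a comma followed by two spaces, a hyphen and a trailing "!".
$data = "anl\xC3\xA4ggning\xC2\xA0med,  den-erfarenhet!";

// Drop the punctuation that should not end up in any word.
$data = preg_replace('/[.*,:;?!]/', '', $data);

// Turn NBSP, runs of whitespace and hyphens into a single space.
$data = preg_replace('/(?:\xC2\xA0|\s{2,}|-)/', ' ', $data);

// Split on whitespace and skip empty pieces (an assumption, see above).
$words = preg_split('/\s+/', $data, -1, PREG_SPLIT_NO_EMPTY);
```

With this input $words should hold the four words anläggning, med, den and erfarenhet.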
Update
define("WORD_COUNT_MASK", "/\p{L}[\p{L}\p{Mn}\p{Pd}'\x{2019}]*/u");
function str_word_count_utf8($str)
{
    preg_match_all(WORD_COUNT_MASK, $str, $matches);
    print_r($matches);
}
str_word_count_utf8($str);
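The function above only prints the matches. If you want the words back as an array, a variant could look like the sketch below. The name extract_words_utf8 and the type declarations (which need PHP 7 or later) are my own assumptions, while the pattern is the same as WORD_COUNT_MASK. Because of the /u flag the input has to be valid UTF-8.

```php
<?php
// Hypothetical helper: returns the matched words instead of printing them.
function extract_words_utf8(string $text): array
{
    // A word starts with a letter and may continue with letters,
    // combining marks, dashes, apostrophes or U+2019.
    $pattern = '/\p{L}[\p{L}\p{Mn}\p{Pd}\'\x{2019}]*/u';
    preg_match_all($pattern, $text, $matches);

    // $matches[0] holds the full-pattern matches.
    return $matches[0];
}

// The non-breaking space (\xC2\xA0) is not a letter, so it separates words.
$words = extract_words_utf8("anläggning\xC2\xA0med åååÅ");
```

Keep in mind that \p{Pd} allows dashes inside a word, so a string like "den-erfarenhet" comes back as one word with this pattern.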
$data = ' cesadasdsadas <br /> dsadsadas';
$data = preg_replace('/&nbsp;/', ' ', $data);
var_dump($data);
Maybe you should try this: http://php.net/manual/en/function.str-word-count.php
I recently wrote something close to your goal:
$words = array_unique(str_word_count($CONTENT." ".$TITLE, 1));
sort($words);
$words = addslashes (implode(" ", array_values($words)));
Bye.