How to strip out strange characters when consuming a feed?
I am consuming a couple of feeds at the same time and assembling one single feed. When grabbing and 'cleaning up' the description for a particular tag, I find bullet characters that I cannot for the life of me remove from the output.
Doing a simple str_replace to find the • character (just like that, not an li or ASCII value) does nothing at all for me. I'm scratching my head and wondering why this is. This does not seem to be an encoding issue, simply a bullet point being sent over in a non-ASCII-safe format.
Anyone run into this? A character you couldn't identify or remove?
Here is some example text:
Required Qualifications:
•BSME or equivalent four year degree
•Minimum four years in blahblah industry experience
The above is an example of a description I wish to clean up (would love to replace the bullet with a -, but would settle for just removing it).
Ideas?
EDIT -------
Based on feedback, here is some additional detail. The character just comes through as is: •. I doubt it is an encoding issue, as this particular location outputs this data set to either HTML (webpage with the details) or to an XML feed (packaged HTML tags inside the description field).
I consume the multiple XML feeds using xml2array (PHP). I have not had any issues with it before. I am pretty sure it is UTF-8; just the bullet comes through.
To assemble the feeds, I build my own array server side, and once I correlate the proper values from the other feeds, I output the final 'built' xml feed (which I then have an internal app consume).
The reason for consuming multiple sources? Gaps in the data that are not available in 1 format.
MORE EDITING -------
OK, looks like this is an encoding issue, but I have yet to remove the • bullet. I convert it using utf8_encode, however I get odd symbols that don't copy identically, so I get something like â[]¢.
Again, I am doing something like xml2array(URL), which converts the XML at the URL to an array, then simply grabbing data from the built array.
The HTML code for that character is &bull; and the numeric code is &#8226;. Might try searching on those.
btw: maybe a preg_replace() will do the trick
$str2 = preg_replace("/•/", "", $str);
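If the bullets arrive in the feed as HTML entities rather than as a literal character, searching for the entity text may be enough. A minimal sketch, assuming the description is a plain PHP string; `replace_bullet_entities` is a hypothetical helper name, not part of any library:

```php
<?php
// Hypothetical helper: swaps entity-encoded bullets for "- ".
function replace_bullet_entities(string $text): string
{
    // Named, decimal, and hexadecimal entity forms of U+2022 BULLET.
    return str_replace(['&bull;', '&#8226;', '&#x2022;'], '- ', $text);
}

echo replace_bullet_entities("&bull;BSME or equivalent four year degree\n");
```

Pass an empty string instead of '- ' to simply drop the bullets.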
If the feed contains a literal bullet character, check if the encoding of your PHP file matches the encoding of the feed. Otherwise str_replace will miss the char.
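One way to sidestep the file-encoding question is to spell the bullet out as bytes. The sketch below assumes the feed really is UTF-8, where U+2022 is the three bytes E2 80 A2; `BULLET_UTF8` and `replace_utf8_bullets` are hypothetical names chosen for illustration:

```php
<?php
// U+2022 BULLET encoded as UTF-8. Writing the bytes as escapes keeps the
// search string independent of the encoding the .php file is saved in.
const BULLET_UTF8 = "\xE2\x80\xA2";

// Hypothetical helper; assumes $text is UTF-8.
function replace_utf8_bullets(string $text, string $replacement = '- '): string
{
    return str_replace(BULLET_UTF8, $replacement, $text);
}

$description = "Required Qualifications:\n"
    . BULLET_UTF8 . "BSME or equivalent four year degree\n"
    . BULLET_UTF8 . "Minimum four years in blahblah industry experience\n";

echo replace_utf8_bullets($description);
```

If the data is already UTF-8, running it through utf8_encode first re-encodes each of those three bytes as if it were ISO-8859-1, which would explain output like â[]¢. If the feed turns out to be Windows-1252 instead, the bullet is the single byte "\x95" and that would be the string to search for.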
Try preg_replace and search for \x{2022} with the u modifier (PHP's PCRE does not accept the \u2022 notation inside a pattern).
2022 is the Unicode code point for the bullet character.
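A sketch of the code-point approach follows. In PHP the pattern is written as \x{2022} and needs the u modifier so that PCRE treats pattern and subject as UTF-8. `replace_bullets_regex` is a hypothetical helper, and the "\u{2022}" string escape in the usage line assumes PHP 7 or later:

```php
<?php
// Hypothetical helper; assumes $text is meant to be UTF-8.
function replace_bullets_regex(string $text, string $replacement = '- '): string
{
    // Match the bullet code point plus any spaces or tabs that follow it.
    $result = preg_replace('/\x{2022}[ \t]*/u', $replacement, $text);

    // preg_replace returns null on failure (for example, invalid UTF-8
    // input with the u modifier); fall back to the untouched text.
    return $result ?? $text;
}

echo replace_bullets_regex("\u{2022}Minimum four years in blahblah industry experience\n");
```

A null result from preg_replace here is itself a hint that the input is not valid UTF-8, which is worth checking before blaming the pattern.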