Remove common characters that appear in all rows from the same URL

I have a table with two columns, "title" and "url". The rows look like this:

    title                            url
    Galago - Wikipedia               http://en.wikipedia.org/wiki/Galago
    Characteristics - Wikipedia      http://en.wikipedia.org/wiki/Galago
    Classification - Wikipedia       http://en.wikipedia.org/wiki/Galago
    Myst- Gamestop                   http://www.gamestop.com/ds/games/myst/69424
    Plot- Gamestop                   http://www.gamestop.com/ds/games/myst/69424

My question is: how would I remove the common characters that are present in all rows from a certain URL (remove "- Wikipedia" from the first three, and "- Gamestop" from the other two)? This is just a minor example; I have many other rows that follow the same pattern (they have common characters or words that recur in all of the rows from a certain URL). I should add that I store these values from a JavaScript array.


If all of your strings are in the format shown above for the title column, I think the best approach may be to apply a regular expression to the title before inserting it into the database table. The regular expression could capture everything preceding the "-" character and discard the "duplicate" data following it.

Info on regular expressions on strings in PHP can be found here: http://php.net/manual/en/function.preg-match.php
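
As a rough sketch of that idea, the function below strips everything from the last hyphen onward. Python is used only for illustration; the same pattern should carry over to PHP's preg_replace or JavaScript's String.prototype.replace. The name strip_site_suffix is made up for this example, and it assumes the site name itself contains no hyphen.

```python
import re

# Optional whitespace, a hyphen, then any run of non-hyphen characters up to
# the end of the string. Because "[^-]*" cannot cross another hyphen, only the
# part after the LAST hyphen is removed.
SITE_SUFFIX = re.compile(r"\s*-[^-]*$")


def strip_site_suffix(title):
    """Return the title without its trailing "- Site name" part (hypothetical helper)."""
    return SITE_SUFFIX.sub("", title)


titles = ["Galago - Wikipedia", "Myst- Gamestop", "Spider-Man - Wikipedia"]
cleaned = [strip_site_suffix(t) for t in titles]
```

Titles without a hyphen pass through unchanged, but a title whose only hyphen is part of the real name (for example "Spider-Man" with no site suffix) would be cut short, so this only fits data that always carries the suffix.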


I think that most automated solutions to this risk removing data that you want to keep. A word or phrase that occurs on more than one row is not necessarily redundant. A couple of potential, but still unreliable, methods come to mind. These would work only if you are looking for whole words.

  1. Read all the titles into an array, and create a wordlist array by splitting each title into words. You can then determine the frequency of each word, and use that information to remove the unwanted words from the titles. If you have a lot of data, this method could use a lot of memory...

  2. Parse each URL, extract the hostname, split it using a period (.) as the delimiter, and then search for and remove occurrences of those strings from the title. You might choose to create a list of strings to ignore, like www, com, co, uk, net, org, and so on. This method may work if the unwanted words are found in the domain name (as in your examples).
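
Both methods can be sketched roughly as follows, in Python for illustration. The function names are invented for this example, and two assumptions go beyond what is described above: titles are grouped by URL before counting, and a word counts as unwanted only if it occurs in every title of its group. Punctuation is dropped when a title is rebuilt.

```python
import re
from collections import Counter, defaultdict
from urllib.parse import urlparse

# Hostname parts that should never be treated as removable title words.
IGNORED_HOST_PARTS = {"www", "com", "co", "uk", "net", "org"}


def tokenize(title):
    """Split a title into word tokens, discarding punctuation such as '-'."""
    return re.findall(r"\w+", title)


def strip_common_words(rows):
    """Method 1: rows is a list of (title, url) pairs.

    Removes words that appear in every title sharing the same URL.
    URLs with a single title are left alone, since every word would match.
    """
    by_url = defaultdict(list)
    for title, url in rows:
        by_url[url].append(title)

    common = {}
    for url, titles in by_url.items():
        counts = Counter()
        for title in titles:
            counts.update({w.lower() for w in tokenize(title)})
        common[url] = (
            {w for w, n in counts.items() if n == len(titles)}
            if len(titles) > 1 else set()
        )

    return [
        (" ".join(w for w in tokenize(title) if w.lower() not in common[url]), url)
        for title, url in rows
    ]


def strip_host_words(title, url):
    """Method 2: remove title words that also occur in the URL's hostname."""
    host = urlparse(url).hostname or ""
    host_words = {p for p in host.lower().split(".") if p not in IGNORED_HOST_PARTS}
    return " ".join(w for w in tokenize(title) if w.lower() not in host_words)


WIKI = "http://en.wikipedia.org/wiki/Galago"
SHOP = "http://www.gamestop.com/ds/games/myst/69424"
rows = [
    ("Galago - Wikipedia", WIKI),
    ("Characteristics - Wikipedia", WIKI),
    ("Classification - Wikipedia", WIKI),
    ("Myst- Gamestop", SHOP),
    ("Plot- Gamestop", SHOP),
]
cleaned = strip_common_words(rows)
```

As warned above, neither method is reliable: the first removes any word that happens to be shared by all titles of a URL, and the second would also remove a legitimate title word that matches a hostname part such as "en".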


You could normalize the URL info out into another table: turn the url column into url_id, and create a url table that has a url column and a title column. The title there would be something like Wikipedia or Gamestop. Then, in the original table, store just the page title, without the site title.

Maybe that won't work very well with the queries you are trying to do, but in that way you could probably search by url, url title, or title or any combination of those pretty easily.
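
A minimal sketch of that layout, using Python's built-in sqlite3 module with an in-memory database purely for illustration. The table name page and the exact column definitions are assumptions; only url_id, the url table, and its url and title columns come from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE url (
    id    INTEGER PRIMARY KEY,
    url   TEXT NOT NULL UNIQUE,
    title TEXT NOT NULL            -- site title, e.g. 'Wikipedia'
);
CREATE TABLE page (                -- hypothetical name for the original table
    id     INTEGER PRIMARY KEY,
    title  TEXT NOT NULL,          -- page title without the site title
    url_id INTEGER NOT NULL REFERENCES url(id)
);
""")

conn.execute(
    "INSERT INTO url (id, url, title) VALUES (?, ?, ?)",
    (1, "http://en.wikipedia.org/wiki/Galago", "Wikipedia"),
)
conn.executemany(
    "INSERT INTO page (title, url_id) VALUES (?, ?)",
    [("Galago", 1), ("Characteristics", 1), ("Classification", 1)],
)

# Search by site title; the same join also allows filtering by url or page title.
wikipedia_pages = conn.execute(
    """
    SELECT page.title, url.title, url.url
    FROM page
    JOIN url ON url.id = page.url_id
    WHERE url.title = ?
    ORDER BY page.id
    """,
    ("Wikipedia",),
).fetchall()
```

The site title is stored once per URL, so the duplicated "- Wikipedia" text never enters the original table in the first place.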
