remove similar characters that appear in all rows
So I have a table with two columns "title" and "url". The rows go as such:
Title url
Galago - Wikipedia http://en.wikipedia.org/wiki/Galago
Characteristics - Wikipedia http://en.wikipedia.org/wiki/Galago
Classification - Wikipedia http://en.wikipedia.org/wiki/Galago
Myst- Gamestop http://www.gamestop.com/ds/games/myst/69424
Plot- Gamestop http://www.gamestop.com/ds/games/myst/69424
my question is, how would I remove the common characters that are present in all rows from a certain url (remove - Wikipedia from the first three, and - Gamestop from the other 2). This is just a minor example....I have many other rows that have the same pattern (they have开发者_JS百科 common characters, words, that reoccur in all of the rows from a certain url). I wanted to add that I store these values from a javacript array
If all of your strings are in the format shown above for the title column, I think the best approach may be to apply a regular expression to the title before inserting into the database table. This regular expression could capture all data preceding the "-" character and discard the "duplicate" data succeeding the "-".
Info on regular expressions on strings in PHP can be found here: http://php.net/manual/en/function.preg-match.php
I think that most automated solutions to this risk removing data that you want to keep. A word or phrase that occurs on more than one row is not necessarily redundant. A couple of potential, but still unreliable, methods come to mind. These would work only if you are looking for whole words.
Read all the titles into an array, and create a wordlist array by splitting each title into words. You can then determine the frequency of each word, and use that information to remove the unwanted words from the titles. If you have a lot of data, this method could use a lot of memory...
Parse each URL, extract the hostname, split it using a period (.) As the delimiter, and then search for and remove occurrences of those strings from the title. You might choose to create a whitelist of strings to ignore, like www, com, co, uk, net, org, and so on. This method may work if the unwanted words are found in the domain name (as in your examples).
You could normalize out the url info into another table...so like take the url column and make it url_id and create a url table that provides a url column and a title column. Title would be like Wikipedia or Gamestop etc. Then in the original table store the title with just the title not including the url title.
Maybe that won't work very well with the queries you are trying to do, but in that way you could probably search by url, url title, or title or any combination of those pretty easily.
精彩评论