
PHP parsing speed optimization

I would like to add tooltips or generate links according to the elements available in the database. For example, if the HTML page printed is:

to reboot your linux host in single-user mode you can ...

I will use explode(" ", $row[page]), and the idea is then to look up every single word of the page to find out whether it has a related reference. In this example, let's say I've got a table referance with one entry for reboot and one for linux:

reboot: restart a computer
linux: operating system

Now my output will look like:

to <a href="ref/reboot">reboot</a> your <a href="ref/linux">linux</a> host in single-user mode you can ...

  • Instead of having a static list generated when I save the content, if I add more keywords in the future, the text will become more interactive.

My main concern and question is: how can I create an efficient enough process to do this?

  • Should I store all the db entries in an array and compare them?
  • Do an SQL query for each word (seems crazy)?
  • Dump the table to a file and use a very long regex, or a "grep -f pattern data" way of doing it?
  • Or... I'm sure there must be a better way of doing it, I just don't have a clue about it. Or maybe this will be far too resource-unfriendly and I should avoid doing such things altogether.
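The first option above (explode the page and compare each word against an in-memory keyword map) can be sketched like this. The keyword map would come from the referance table; here it is hard-coded for illustration:

```php
<?php
// Minimal sketch of the explode-and-lookup idea: split the page into
// words and replace each word that exists in the keyword map with a link.
function linkify(string $text, array $keywords): string
{
    $words = explode(' ', $text);
    foreach ($words as $i => $word) {
        if (isset($keywords[$word])) {
            $words[$i] = '<a href="ref/' . rawurlencode($word) . '">'
                       . htmlspecialchars($word) . '</a>';
        }
    }
    return implode(' ', $words);
}

// In practice $keywords would be loaded from the db once per request.
$keywords = ['reboot' => true, 'linux' => true];
echo linkify('to reboot your linux host', $keywords);
```

Using an associative array keyed by keyword makes each lookup O(1) via isset(), so the cost is linear in the number of words on the page.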

Cheers!


Depending on the number of keywords in the db, there are two solutions:

1. If there are fewer keywords than words in the text, just pull all the keywords from the db and compare them.
2. If there are more keywords than words in the text, dynamically create a single query that brings back only the necessary words, e.g. SELECT * FROM keywords WHERE keyword='system' OR keyword='linux' etc.
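The second solution (one query for only the words present in the text) can be sketched with PDO placeholders, which is equivalent to the OR chain above but safer. The keywords table and its columns are assumed from the example:

```php
<?php
// Sketch of solution 2: fetch only the keywords that actually occur in
// the text, in a single query with one placeholder per unique word.
function matchedKeywords(PDO $pdo, string $text): array
{
    $words = array_values(array_unique(explode(' ', $text)));
    $placeholders = implode(',', array_fill(0, count($words), '?'));
    $stmt = $pdo->prepare(
        "SELECT keyword, description FROM keywords WHERE keyword IN ($placeholders)"
    );
    $stmt->execute($words);
    // Returns ['reboot' => 'restart a computer', ...]
    return $stmt->fetchAll(PDO::FETCH_KEY_PAIR);
}
```

This way the db does the filtering with its index on keyword, and you get back only the handful of rows you need, regardless of how large the keyword table grows.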

However, if you are really concerned about resources, I would suggest you create a caching system: process each page once, then store both the original text and the result in the db. If the keyword table is updated, you can reprocess all the pages again.


I would add an additional field to each article that contains the 'keyword table version' that was used to process that article.

Each time a user opens an article, compare this version with the version of the keyword list. If it is outdated, process the article and save the result to the articles table. Otherwise, just show the article.

You can control the load by adding a date column for the last processing time and checking it as well. If the item is relatively fresh, you may want to postpone the processing. Again, you may compare the version difference: if it is greater than 5 or 10, for instance, you should update the article. If you add an important keyword, just increase the version of the keywords table by 10 and all your articles will be forced to update.
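The version check above boils down to a small predicate. The field names and the threshold of 5 are assumptions taken from the example numbers:

```php
<?php
// Sketch of the lazy-reprocessing check: the article is reprocessed only
// when its stored keyword-table version lags the current version by at
// least $threshold. Bumping the keyword version by 10 (as suggested for
// important keywords) forces every article past the default threshold.
function needsReprocessing(int $articleVersion, int $currentVersion, int $threshold = 5): bool
{
    return ($currentVersion - $articleVersion) >= $threshold;
}
```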

The main idea is distributing the load to user requests, and caching the results.

If your system is heavily loaded, you may want to use a random number generator so that a given request updates the article only with a 10% chance, for instance.
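That probabilistic throttle is a one-liner; the 10% figure is the example value from above:

```php
<?php
// Sketch of the probabilistic update: roll a number from 1 to 100 and
// only proceed if it falls within the configured percentage.
function shouldUpdateNow(float $chance): bool
{
    return mt_rand(1, 100) <= (int) round($chance * 100);
}
```

On a busy site this spreads reprocessing work thinly across many requests instead of letting the first request after a keyword update pay the full cost.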


You can have an index of keywords stored somewhere statically (database, file, or in an array). When the content is updated, you can rebuild or update the index accordingly. You just have to make sure that it can be looked up very quickly.

Once you have it, you can check whether a given word is in the index very quickly, because the index is optimized for exactly this.

I would store the index as a sorted list in a file and look words up using binary search. This is a simple solution, and I think it should be quick enough if there is not too much data to process. Alternatively, you can send the list of words in the article to the database in one SQL query and have it return the entries that match any word in the list.
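The sorted-list lookup could look like this; the keyword array would be loaded from the index file (e.g. with file() on a one-keyword-per-line file), here it is inlined:

```php
<?php
// Sketch of the binary-search lookup over a sorted keyword list:
// O(log n) per word, no database round-trip at render time.
function keywordExists(array $sorted, string $word): bool
{
    $lo = 0;
    $hi = count($sorted) - 1;
    while ($lo <= $hi) {
        $mid = intdiv($lo + $hi, 2);
        $cmp = strcmp($word, $sorted[$mid]);
        if ($cmp === 0) {
            return true;
        }
        if ($cmp < 0) {
            $hi = $mid - 1;   // search the lower half
        } else {
            $lo = $mid + 1;   // search the upper half
        }
    }
    return false;
}
```

The list must stay sorted with the same collation strcmp() uses (plain byte order), so sort it in PHP when rebuilding the index rather than relying on an external tool's locale.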

Also, after an article is processed, you should cache the result, so that subsequent requests for the same article get the processed version instead of reprocessing it every time.

