Sitemap generation strategy
I have a huge site with more than 5 million URLs.
We already have a PageRank of 7/10. The problem is that with 5 million URLs, and because we add/remove URLs daily (we add ±900 and remove ±300), Google is not fast enough to index all of them. We have a huge and intense Perl module that generates this sitemap, which normally consists of 6 sitemap files. Google is certainly not fast enough to add all the URLs, especially because we normally regenerate all of those sitemaps daily and submit them to Google. My question is: what would be a better approach? Should I really bother sending 5 million URLs to Google daily even though I know Google won't be able to process them? Or should I send only the permalinks that won't change and let the Google crawler find the rest, so that at least I have a concise index at Google (today I have fewer than 200 of the 5,000,000 URLs indexed)?
What is the point of having a lot of indexed pages that are removed right away? Temporary pages are worthless to search engines and their users once they are gone. So I would let the search engine crawlers decide whether a page is worth indexing. Just tell them the URLs that will persist... and implement some list pages (if there aren't any yet) that make your pages easier to crawl.
Note: 6 sitemap files for 5M URLs? AFAIK, a sitemap file may not contain more than 50,000 URLs.
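At that limit, 5 million URLs need at least 100 sitemap files, tied together by a sitemap index. A minimal Perl sketch of generating such an index; the domain, file names, and chunk count are placeholders, not your actual setup:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);

# Sketch: write a sitemap index pointing at N chunked sitemap files
# (sitemap-1.xml.gz .. sitemap-N.xml.gz), each holding up to 50,000 URLs.
my $base   = 'https://www.example.com';   # placeholder domain
my $chunks = 100;                         # 100 x 50,000 = 5,000,000 URLs
my $now    = strftime('%Y-%m-%d', localtime);

open my $out, '>', 'sitemap-index.xml' or die "open: $!";
print {$out} qq{<?xml version="1.0" encoding="UTF-8"?>\n};
print {$out} qq{<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n};
for my $i (1 .. $chunks) {
    print {$out} "  <sitemap>\n";
    print {$out} "    <loc>$base/sitemap-$i.xml.gz</loc>\n";
    print {$out} "    <lastmod>$now</lastmod>\n";
    print {$out} "  </sitemap>\n";
}
print {$out} "</sitemapindex>\n";
close $out;
```

You then submit only the index URL; the crawlers fetch the individual chunks from it.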
When URLs change, make sure you handle them properly with a 301 status (permanent redirect).
Edit (refinement): You should still try to keep your URL patterns stable. You can use 301 redirects, but maintaining a lot of redirect rules is cumbersome.
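If the site happens to be served through plain CGI (an assumption), a pattern-based redirect can replace a long list of one-off rules; a minimal sketch with made-up paths:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch: map an old URL pattern onto its stable replacement with a 301.
# The /old/item/ -> /item/ mapping is illustrative only.
my $request = $ENV{REQUEST_URI} // '/old/item/123';
(my $target = $request) =~ s{^/old/item/}{/item/};

print "Status: 301 Moved Permanently\r\n";
print "Location: https://www.example.com$target\r\n";
print "\r\n";
```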
Why don't you just compare your sitemap to the previous one each time, and only send Google the URLs that have changed?
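That is straightforward if each run dumps its URL list to disk; a minimal sketch of the diff step, assuming plain-text URL lists with the file names below (placeholders):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch: compare yesterday's URL list with today's and emit only the
# additions as a small delta sitemap. File names are placeholders.
open my $old_fh, '<', 'urls-yesterday.txt' or die "open: $!";
my %old;
while (my $url = <$old_fh>) {
    chomp $url;
    $old{$url} = 1;
}
close $old_fh;

open my $new_fh, '<', 'urls-today.txt'   or die "open: $!";
open my $out,    '>', 'sitemap-delta.xml' or die "open: $!";
print {$out} qq{<?xml version="1.0" encoding="UTF-8"?>\n};
print {$out} qq{<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n};
while (my $url = <$new_fh>) {
    chomp $url;
    next if $old{$url};    # unchanged URL, already submitted before
    # note: real code should XML-escape &, <, > in $url
    print {$out} "  <url><loc>$url</loc></url>\n";
}
print {$out} "</urlset>\n";
close $new_fh;
close $out;
```

The full sitemap index can still be regenerated less often, while the small delta file covers the ±900 daily additions.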