Speed up SQL queries by converting data to XML?
I'm taking over a very small niche classifieds site (maybe 5 new entries a day) which runs fine until Google and Bing hit it. Every category, subcategory, and ad is stored in the database (MS SQL 2005) so each page they crawl hits these queries and spikes the CPU.
To combat this my boss wants me to 开发者_运维知识库offload all the data into XML files and query those instead (whaa?). I think this has to be a lot more work than necessary but being an amateur DBA at best I don't have any solid alternatives other than tuning queries, using stored procedures, creating proper indexes, etc...
Is there any basis to the XML argument? What would you do to dodge this spider issue?
Category:
SELECT Category.CatID, Category.CatName, Category.Remove, CategorySub.SubCatID, CategorySub.SubCatName, CategorySub.TitleText, CategorySub.ShowPrice, CategorySub.ShowLocation, CategorySub.Remove
FROM Category INNER JOIN CategorySub ON Category.CatID = CategorySub.CatID
WHERE (((Category.Remove)=0) AND ((CategorySub.SubCatID)=" & intSubCatID & ") AND ((CategorySub.Remove)=0))
Subcategory:
SELECT CategorySub.CatID, CategorySub.SubCatName, CategorySub.SubCatOrder, CategorySub.SubCatID, CategorySub.Remove FROM CategorySub
WHERE (((CategorySub.CatID)=" & intCatID & ") AND ((CategorySub.Remove)=0))
ORDER BY CategorySub.SubCatOrder
Ads:
SELECT Ad.AdID, Ad.ModifiedDate, Ad.CatID, CategorySub.SubCatName, Ad.Title, Ad.ShortDesc, Ad.Price, Ad.Location, Count(Pictures.PictureID) AS CountOfPictureID
FROM (Ad INNER JOIN CategorySub ON Ad.SubCatID = CategorySub.SubCatID) LEFT JOIN Pictures ON Ad.AdID = Pictures.AdID
WHERE (((Ad.SubCatID)=" & intSubCatID & ") AND ((Ad.Activated)<>0) AND ((Ad.Remove)=0))
GROUP BY Ad.AdID, Ad.ModifiedDate, Ad.CatID, CategorySub.SubCatName, Ad.Title, Ad.ShortDesc, Ad.Price, Ad.Location
ORDER BY Ad.ModifiedDate DESC
- You could prevent spiders from hitting your site or certain pages (by using robots.txt).
- Or you could run different queries (no ads) when they index you (by checking the User-Agent header)
Which of those queries are problematic? The only thing that pops out at me just reading them is that your ad query has a lot in the group by clause- that could be eating a lot of tempdb, and it then has to sort the result by something else. Since the only reason for doing aggregation there seems to be to collapse the pictures into one row, and it looks like you only care about whether there is a picture there or not, consider changing that to have:
SELECT Ad.AdID, ..., Ad.Location,
(SELECT COUNT(*) FROM Pictures WHERE Pictures.AdID = Ad.AdID) AS CountOfPictureID
FROM Ad
WHERE ...
ORDER BY ...
or alternatively:
SELECT Ad.AdID, ..., Ad.Location,
CASE WHEN EXISTS (SELECT COUNT(*) FROM Pictures WHERE Pictures.AdID = Ad.AdID)
THEN 1 ELSE 0 END AS PictureExists
FROM Ad
WHERE ...
ORDER BY ...
If you needed to get some data on the first picture, you could use OUTER APPLY, but that's not needed to produce the same results you already get.
Realistically, unless you have a lot of data (you didn't mention an absolute volume), these sort of queries shouldn't be a problem. Another thing to consider is whether old ads are kept in the main Ad table or not (if they are, consider writing a process for archiving them off after they have expired; or create an additional table with just the AdID values of ads that should be shown and INNER JOIN that with Ad). And of course, look at the plans of the main queries that are causing slowdowns... do you have enough logging that you can see which queries are taking up the time? Can you show that it is the SQL queries slowing down the page or is something else taking up the time?
There is some merit to caching data that doesn't change much. For example, do your headings and subheadings change as often as your ads? I would expect there to be quite a difference. Some sort of off-DB caching- XML files, memcached- can be used to remove the queries that are fetching the same old data over and over again, at the expense of needing a way to make sure that the data being reused is acceptably up-to-date (e.g. refresh it at least every X minutes).
Ourselves, we had to change a bunch of our browse pages to take their data from a local document database rather than RDBMS because the RDBMS in question couldn't combine several orthogonal not-very-selective criteria together very well (not while running the rest of the site, anyway). I hesitate to use the buzzword of the moment "NoSQL" but that's what it is (although the problem wasn't really with using a SQL database in theory, just the one we have not planning queries well enough).
精彩评论