Optimal solution for category tree search
I'm creating a kind of auction application, and I have to decide on the most efficient way to handle this problem. I'm using BLToolkit as my OR mapper (it has nice LINQ support) and ASP.NET MVC 2.
Background
I've got multiple Category objects that are created dynamically and saved in my database as representations of this class:
class Category
{
    public int Id { get; set; }
    public int ParentId { get; set; }
    public string Name { get; set; }
}
Now every Category object can have multiple associated InformationClass objects, each representing a single piece of information in that category, for example a price or a colour. Those classes are also created dynamically by the administrator and stored in the database; they are specific to a group of categories. The class that represents them looks like this:
class InformationClass
{
    public int Id { get; set; }
    public InformationDataType InformationDataType { get; set; }
    public string Name { get; set; }
    public string Label { get; set; }
}
Now I've got a third table that represents the join between them, like this:
class CategoryInformation
{
    public int InformationClassId { get; set; }
    public int AuctionCategoryId { get; set; }
}
Problem
Now the problem is that child categories need to inherit all of the InformationClass associations of their ancestor categories. For example, every product will have a price, so I only need to add that InformationClass to my root category. A frequency InformationClass can be added to the base CPU category, and it should then be available in the AMD and Intel categories that derive from the CPU category.
My application very often needs to know which InformationClass objects are related to a specified Category.
So here is my question: what is the most efficient solution to this problem? I've got some ideas, but I can't decide.
- Load all categories from the database into the Application object and take them from there every time; since the categories will not change too often, this will reduce the number of database requests, but it will still require a tree search using Linq-to-Objects (see the sketch after this list).
- Invent (I don't know if it's possible) some fancy LINQ query that can search the tree and get all the InformationClass ids without stressing the database too much.
- Some other nice ideas?
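For the first option, the Linq-to-Objects part could look something like this rough sketch (method and variable names are just illustrative, not existing code; it assumes the categories and CategoryInformation rows are already cached):
// Inside some helper class; requires System.Linq and System.Collections.Generic.
static List<int> GetInheritedInformationClassIds(
    List<Category> allCategories,
    List<CategoryInformation> allCategoryInformation,
    int categoryId)
{
    var byId = allCategories.ToDictionary(c => c.Id);

    // Walk up the tree, collecting the category and all of its ancestors.
    var ancestorIds = new HashSet<int>();
    var current = byId[categoryId];
    while (current != null)
    {
        if (!ancestorIds.Add(current.Id))
            break; // guard against accidental cycles in the data

        // Stop at the root (a ParentId that doesn't match any cached category).
        Category parent;
        current = byId.TryGetValue(current.ParentId, out parent) ? parent : null;
    }

    // Every InformationClass attached to the category or to any ancestor.
    return allCategoryInformation
        .Where(ci => ancestorIds.Contains(ci.AuctionCategoryId))
        .Select(ci => ci.InformationClassId)
        .Distinct()
        .ToList();
}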
I will be grateful for any answers and ideas. Thank you all in advance.
Sounds like a case for an idea I once had which I blogged about:
- Tree structures and DAGs in SQL with efficient querying using transitive closures
The basic idea is this: In addition to the Category
table, you also have a CategoryTC
table which contains the transitive closure of the parent-child relationship. It allows you to quickly and efficiently retrieve a list of all ancestor or descendant categories of a particular category. The blog post explains how you can keep the transitive closure up to date every time a category is created or deleted or a parent-child relationship is changed (it's at most two queries each time).
The post uses SQL to express the idea, but I’m sure you can translate it to LINQ.
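For illustration only, here is a rough C# sketch of that bookkeeping (my own translation, not code from the post; the CategoryTC class shape and the convention that the closure contains the reflexive (c, c) pairs are assumptions, though the query further down relies on those reflexive rows):
// Requires System.Linq and System.Collections.Generic.
class CategoryTC
{
    public int Ancestor { get; set; }
    public int Descendant { get; set; }
}

// The closure rows to add when a new category is created under parentId
// (null parentId meaning a root category).
static IEnumerable<CategoryTC> TcRowsForNewCategory(
    IQueryable<CategoryTC> categoryTC, int newCategoryId, int? parentId)
{
    // The new category is trivially its own ancestor.
    yield return new CategoryTC { Ancestor = newCategoryId, Descendant = newCategoryId };

    if (parentId == null)
        yield break; // a root category has no further ancestors

    // Every ancestor of the parent (including the parent itself, via its own
    // reflexive row) becomes an ancestor of the new category.
    foreach (var ancestorId in categoryTC
        .Where(tc => tc.Descendant == parentId.Value)
        .Select(tc => tc.Ancestor)
        .ToList())
    {
        yield return new CategoryTC { Ancestor = ancestorId, Descendant = newCategoryId };
    }
}
How you actually persist those rows depends on your ORM, and deleting or re-parenting a category needs the corresponding clean-up, which is what the post covers.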
You didn’t specify in your question how the InformationClass
table is linked to the Category
table, so I have to assume that you have a CategoryInformation
table that looks something like this:
class CategoryInformation
{
    public int CategoryId { get; set; }
    public int InformationClassId { get; set; }
}
Then you can get all the InformationClasses associated with a specific category by using something like this:
var categoryId = ...;

// Every CategoryInformation row attached to the category itself or to any of
// its ancestors (found via the transitive closure), projected to its
// InformationClass.
var infoClasses = db.CategoryInformation
    .Where(cinf => db.CategoryTC
        .Where(tc => tc.Descendant == categoryId)
        .Any(tc => tc.Ancestor == cinf.CategoryId))
    .Select(cinf => db.InformationClass
        .FirstOrDefault(ic => ic.Id == cinf.InformationClassId));
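If your LINQ provider has trouble translating the nested FirstOrDefault, a roughly equivalent join-based formulation (still just a sketch against the same assumed classes) is:
// Ancestors of categoryId from the transitive closure, their attached
// CategoryInformation rows, and the InformationClass for each of those.
var infoClasses =
    (from tc in db.CategoryTC
     where tc.Descendant == categoryId
     join cinf in db.CategoryInformation on tc.Ancestor equals cinf.CategoryId
     join ic in db.InformationClass on cinf.InformationClassId equals ic.Id
     select ic)
    .Distinct();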
Does this make sense? Any questions, please ask.
In the past (pre SQL Server 2005 and pre LINQ), when dealing with this sort of structure (or the more general case of a directed acyclic graph, implemented with a junction table so that items can have more than one "parent"), I've either done this by loading the entire graph into memory, or by creating a trigger-updated lookup table in the database that cached the relationship of ancestor to descendant.
There are advantages to either, and which wins out depends on update frequency and on the complexity of the objects beyond the parent-child relationship itself. In general, loading into memory allows for faster individual look-ups, but with a large graph it doesn't natively scale as well due to the amount of memory used in each webserver ("each" here, because webfarm situations are one where having items cached in memory brings extra issues), meaning that you will have to be very careful about how things are kept in sync to counteract that effect.
A third option now available is to do ancestor lookup with a recursive CTE:
CREATE VIEW [dbo].[vwCategoryAncestry]
AS
WITH recurseCategoryParentage (ancestorID, descendantID)
AS
(
    SELECT parentID, id
    FROM Categories
    WHERE parentID IS NOT NULL

    UNION ALL

    SELECT ancestorID, id
    FROM recurseCategoryParentage
    INNER JOIN Categories ON parentID = descendantID
)
SELECT DISTINCT ancestorID, descendantID
FROM recurseCategoryParentage
Assuming that root categories are indicated by having a null parentID.
(We use UNION ALL since we're going to SELECT DISTINCT afterwards anyway, and this way we have a single DISTINCT operation rather than repeating it).
This allows us to take the look-up table approach without the redundancy of that denormalised table. The efficiency trade-off is obviously different and generally poorer than with a table, but not by much (a slight hit on select, a slight gain on insert and delete, a negligible space gain), and the guarantee of correctness is greater.
I've ignored the question of where LINQ fits into this, as the trade-offs are much the same whichever way this is queried. LINQ plays nicer with "tables" that have individual primary keys, so we can change the select clause to
SELECT DISTINCT (cast(ancestorID as bigint) * 0x100000000 + descendantID) as id, ancestorID, descendantID
and define that synthetic id as the primary key in the [Column] attribute. Of course, all columns should be marked as DB-generated.
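As a purely illustrative sketch of that mapping (the attributes shown are LINQ to SQL's System.Data.Linq.Mapping ones; BLToolkit's own mapping attributes differ, and the class and property names are mine):
// Maps the view above; the synthetic id column acts as the primary key and
// every column is DB-generated, since the view is read-only.
[Table(Name = "vwCategoryAncestry")]
class CategoryAncestry
{
    [Column(Name = "id", IsPrimaryKey = true, IsDbGenerated = true)]
    public long Id { get; set; }

    [Column(Name = "ancestorID", IsDbGenerated = true)]
    public int AncestorId { get; set; }

    [Column(Name = "descendantID", IsDbGenerated = true)]
    public int DescendantId { get; set; }
}

// Example usage (db.GetTable<T>() stands in for however your data context
// exposes the view): the ids of all ancestors of a given category.
// var ancestorIds = db.GetTable<CategoryAncestry>()
//     .Where(a => a.DescendantId == categoryId)
//     .Select(a => a.AncestorId);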
Edit. Some more on the trade-offs involved.
Comparing the CTE approach with a look-up table maintained in the database:
Pro CTE:
- The CTE code is simple, the above view is all the extra DB code you need, and the C# needed is identical.
- The DB code is all in one place, rather than there being both a table and a trigger on a different table.
- Inserts and deletes are faster; the CTE doesn't affect them at all, while the trigger does.
- While semantically recursive, it is recursive in a way the query planner understands and can deal with, so it's typically (for any depth) implemented in just two index scans (likely clustered), two light-weight spools, a concatenation and a distinct sort, rather than the many, many scans you might imagine. So while it is certainly heavier than a simple table lookup, it's nowhere near as bad as one might imagine at first. Indeed, even the nature of those two index scans (same table, different rows) makes it less expensive than you might think when reading that.
- It is very very easy to replace this with the table look-up if later experience proves that to be the way to go.
- A lookup table will, by its very nature, denormalise the database. Purity issues aside, the "bad smell" involved means that this will have to be explained and justified to any new dev, as until then it may simply "look wrong" and their instincts will send them on a wild-goose chase trying to remove it.
Pro Lookup-Table:
- While the CTE is faster to select from than one might imagine, the lookup is still faster, especially when used as part of a more complicated query.
- While CTEs (and the WITH keyword used to create them) are part of the SQL 99 standard, they are relatively new and some devs don't know them (though I think this particular CTE is so straightforward to read that it counts as a good learning example anyway, so maybe this is actually pro CTE!)
- While CTEs are part of the SQL 99 standard, they aren't implemented by some SQL databases, including older versions of SQL Server (which are still in live use), which may affect any porting efforts. (They are, though, supported by Oracle and Postgres, among others, so at this point this may not really be an issue.)
- It's reasonably easy to replace this with the CTE version later, if later experience suggests you should.
Comparing (both) the db-heavy options with in-memory caching.
Pro In-Memory:
- Unless your implementation really sucks, it is going to be much faster than DB lookups.
- It makes some secondary optimisations possible on the back of this change.
- It is reasonably difficult to move from the DB approach to in-memory later, should profiling show that in-memory is the way to go.
Pro Querying DB:
- Start-up time can be very slow with in-memory.
- Changes to the data are much, much simpler. Most of the following points are aspects of this. Really, if you go the in-memory route, then the question of how to handle changes invalidating the cached information becomes a whole new ongoing concern for the lifetime of the project, and not a trivial one at all.
- If you use in-memory, you are probably going to have to use this in-memory store even for operations where it is not relevant, which may complicate where it fits with the rest of your data-access code.
- It is not necessary to track changes and cache freshness.
- It is not necessary to ensure that every webserver in a web-farm and/or web-garden solution (a certain level of success will necessitate this) has precisely the same degree of freshness.
- Similarly, the degree of scalability across machines (how close to 100% extra performance you get by doubling the number of webservers and DB slaves) is higher.
- With in-memory, memory use can become very high if either (a) the number of objects is high or (b) the objects are large (many fields, especially strings, collections, and objects which themselves hold a string or a collection). Possibly "we need a bigger webserver" amounts of memory, and that goes for every machine in the farm. That heavy memory use is particularly likely to continue to grow as the project evolves.
- Unless changes cause an immediate refresh of the in-memory store, the in-memory solution will mean that the view used by the people in charge of administrating these categories will differ from what is seen by customers, until they are re-synchronised.
- In-memory resyncing can be very expensive. Unless you're very clever with it, it can cause random (to the user) massive performance spikes. If you are clever with it, it can exacerbate the other issues (especially in terms of keeping different machines at an equivalent level of freshness).
- Unless you're clever with in-memory, those spikes can accumulate, putting the machine into a long-term hang. If you are clever about avoiding this, you may exacerbate other issues.
- It is very difficult to move from in-memory to hitting the db should that prove the way to go.
None of this leans with 100% certainty towards one solution or the other, and I'm certainly not going to give a clear answer, as doing so would be premature optimisation. What you can do a priori is make a reasonable decision about which is likely to be the optimal solution. Whichever you go for, you should profile afterwards, especially if the code turns out to be a bottleneck, and possibly change approach. You should also keep doing so over the lifetime of the product, as both changes to the code (fixes and new features) and changes to the dataset can certainly change which option is optimal (indeed, it can change from one to another and then back again over the course of the lifetime). This is why I included the ease of moving from one approach to another in the above lists of pros and cons.