Searching data that spans many tables - design question
I have a structure of about 10 tables. This structure works great for data entry. However, I also need to perform complex and fast searches on that dataset. There are three approaches to this that I can think of:
Join all those tables in the select. This is pretty slow and likely not a good approach. If it's relevant, the database is Informix. I've looked into creating views, hoping they'd be better optimized, but preliminary testing shows that selects on the view are even slower than the multi-table joins - maybe I'm missing some Informix options. Perhaps there is some way to make Informix pre-join the tables and index the result, but from what I've seen that's not likely. Either way, both the joins and the view are slower than approach #2:
Single synthetic table which is updated periodically. This seems like the right approach, especially since searches don't need to run against real-time data - I can probably get away with refreshing the synthetic table daily (a rough sketch of what I mean is below). Data size would be about 500k-1000k rows.
Memcached and similar in-memory solutions. At the moment there is no such infrastructure in place, and this probably doesn't warrant implementing it; however, this is something I'll look at once the synthetic table becomes too slow. Also, there are lots of search parameters and even the first query has to be fast, so this approach would have to eagerly cache all the data. Of course, I'll probably cache whatever I can even with approaches 1 and 2.
I'd like your thoughts on this. Is there a magic bullet I'm missing? What have you used in similar situations?
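To make approach #2 concrete, here is roughly what I have in mind - all table and column names below are made up for illustration:
-- Hypothetical denormalized search table, refreshed nightly.
CREATE TABLE search_flat (
    booking_id    INTEGER,
    customer_name VARCHAR(100),
    destination   VARCHAR(100),
    depart_date   DATE,
    price         DECIMAL(10,2)
);
CREATE INDEX ix_search_dest_date ON search_flat (destination, depart_date);
-- Nightly refresh: wipe and rebuild from the normalized tables.
DELETE FROM search_flat;
INSERT INTO search_flat
SELECT b.booking_id, c.name, d.name, b.depart_date, b.price
FROM booking b, customer c, destination d
WHERE c.customer_id = b.customer_id
  AND d.destination_id = b.destination_id;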
Option 1.
Depending on the volume of data in your tables, a 10-table join should run in a reasonable amount of time. How slow is too slow for you?
Here are the two biggest things you can do to make sure your queries run smoothly.
First, make sure your logical table design really is logical. Bad table and column design is responsible for a large share of the unnecessary slowdowns in database apps. The fact that data entry works well is a pretty strong indication that your table design is good. Is your design normalized, or at least somewhat normalized?
Second, create the right indexes. The right indexes can make a query run a hundred times faster, depending on the circumstances. To build the right indexes you need to know a little about how indexes work, about the query you are running, about the volume of data, and about the strategy the DBMS chooses when executing the query.
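For example (a hypothetical orders table - the query and index are illustrative only), a composite index whose leading columns match the query's filter can let the engine avoid a full scan:
-- A query filtering on customer and date range...
SELECT order_id, total
FROM orders
WHERE customer_id = 42
  AND order_date >= '2010-01-01';   -- date literal format depends on your DBDATE setting
-- ...is served well by a composite index with those columns leading.
CREATE INDEX ix_orders_cust_date ON orders (customer_id, order_date);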
Option 2.
This may well be your best bet. Learn a little about data marts or data warehouses. That's how database people deal with design issues involving one schema for data entry, a different schema for queries, and a process to keep the two schemas in synch.
There are a number of design issues here, and rather than try to enumerate them, I'm just going to suggest that you bone up on data marts.
In-memory databases take millisecond database access times and turn them into microsecond access times. Is this an automated trading system, 911 dispatch, or air traffic control? If not, you'd be hard-pressed to show a requirement for microsecond access times.
Walter has it right when he asks "how slow is too slow?" Define your requirements clearly: is this an internal or external SLA? Do you actually have requirements, or does this just 'feel' too slow?
Learn to read an execution plan and examine the plan for your slow query. Is there a cardinality estimate that is way off? Does it anticipate 1 row when you know there are 100k rows? Is it doing a full table scan on a table you expect 1 row from?
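In Informix you can capture the plan with SET EXPLAIN; by default it is written to a file called sqexplain.out (the SELECT below is just a placeholder for your own slow query):
SET EXPLAIN ON;
SELECT ... ;   -- run the slow 10-table query here
SET EXPLAIN OFF;
-- Then read sqexplain.out: compare estimated rows against what you know is true,
-- check the join order, and see whether each table is reached by an index or a sequential scan.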
If the query looks as efficient as it can be, trace it... see if you can identify any time sinks you're not expecting. Is it fine when run on its own but performing poorly under load? Honestly, 10 tables without a lot of data to begin with really shouldn't be super slow.
I think Gilbert is overestimating your problem. With no more than 1M records, a full dimensional model seems like overkill. From the tone of your question it sounds like you're really just trying to speed up a query or three - not create the start of an entire BI platform. If that's the case, look back at the explain plan, see if you can identify major amounts of work that could be reduced by pre-calculating some joins (denormalization), build that new materialized view, and try the query; if there's no improvement, drop it and try something else... do not keep building on unsuccessful attempts.
Now I see the travel industry comment
So you have 2 classes of room, 30 doubles and 20 singles, and you have 80 seats on the plane. But the doubles can add an extra bed so you might run out of seats before you run out of rooms.
Rooms Remaining
---------------
5 Single Remain
10 Doubles Remain
Seats Remaining
---------------
8 Plane seats
Since there's one plane and 2 room types, you'll just Cartesian those together.
Package Type Rooms Seats Packages Available
------------ ------ ----- ------------------
Single 5 8 5
Double 10 8 8
Notice that Packages Available is a simple calculation: LOWEST(Rooms, Seats).
In your comment you said
even if rooms are available, the package is officially sold out.
Package Type Rooms Seats Packages Available
------------ ------ ----- ------------------
Single 5 0 0
Double 0 0 0
So here's that case... you've filled the double rooms and 5 of them are triples... so the plane is full and there are 5 extra single rooms. But our LOWEST calculation does the work to show there are no Single Packages available.
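Here's a sketch of that calculation in SQL - the two summary tables and their columns are assumed, and LOWEST is spelled out as a CASE expression since the exact least/greatest built-in varies by database:
-- rooms_remaining(room_type, rooms) and seats_remaining(seats) are assumed to hold
-- the two summaries above; Cartesian them together and take the lower value.
SELECT r.room_type,
       r.rooms,
       s.seats,
       CASE WHEN r.rooms < s.seats THEN r.rooms ELSE s.seats END AS packages_available
FROM rooms_remaining r, seats_remaining s;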
am I close?
You're right on track.
There's no magic bullet for this, because your tables are really spread out. What I've done in the past is do something like your Option 2.
Let's say I have a table of Accounts with an AccountID as the PK. I'd create another table called AccountSearch which would be related in a many-to-one relationship with Accounts. AccountSearch would contain a set of strings and their associated IDs.
If you wanted fuzzier searching, you could also manipulate the strings using NYSIIS or Soundex (yuck), or simply remove whitespace. You could also implement full-text searching, but that is often overkill.
Account
-------
AccountID (PK)
Name
OwnerName
AccountSearch
-------------
SearchString (PK)
AccountID (PK)
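A rough sketch of how AccountSearch might be populated and queried - the UPPER() normalization is just one option, and the names come from the layout above:
-- Load one row per searchable string variant, pointing back at the account.
INSERT INTO AccountSearch (SearchString, AccountID)
SELECT UPPER(Name), AccountID FROM Account
UNION
SELECT UPPER(OwnerName), AccountID FROM Account;
-- Searches hit the narrow table, then join back for the full record.
SELECT a.AccountID, a.Name, a.OwnerName
FROM AccountSearch s, Account a
WHERE s.SearchString = UPPER('some name')
  AND a.AccountID = s.AccountID;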
Option 2 is called a data mart or a data warehouse. For the cost of additional storage, you can have an operational database and a query database.
Since you say there are lots of search parameters, you can create your query tables using a star schema, based on the search parameters.
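As a very rough illustration (every name here is invented), the query side could be a central fact table with one key per major search parameter:
-- Dimensions: one small table per search parameter.
CREATE TABLE dim_destination (destination_id INTEGER, name VARCHAR(100));
CREATE TABLE dim_date (date_id INTEGER, cal_date DATE);
-- Fact table: one row per bookable item, keyed by the dimensions.
CREATE TABLE fact_package (
    package_id     INTEGER,
    destination_id INTEGER,
    date_id        INTEGER,
    price          DECIMAL(10,2),
    packages_left  INTEGER
);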
How often do you need search criteria on all the tables?
One way of working that may improve performance is to make sure the main query adapts to the search criteria, only joining the necessary tables, and retrieving just the primary key values from the main table. This data might be saved into a temporary table, or in a scroll cursor, or fetched back to the client.
Then, when you need to collect the information for display, you use a (prepared) SELECT that fetches exactly the data you need for each of the rows found.
The advantage of this is that (for many queries) you seldom specify conditions on all 10 tables, so you don't need to do the 10-way join while discovering the relevant records. And the single-row operation is all joins on keys so the lookups are indexed with no scanning.
Clearly, you can juggle the criteria; you might select all the data from the primary table, and all the relevant values from one of the secondary tables (which is always needed) but decide not to select any values from the other 8 tables (because they don't always appear in the search criteria), or other variants along these lines.
This assumes you can build dynamic SQL, but that is very seldom an issue.
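A sketch of the two-step shape (the table names are invented; the point is the structure of the queries, not the schema):
-- Step 1: built dynamically, joining only the tables the criteria mention,
-- and returning just the primary keys of the matching main rows.
SELECT m.main_id
FROM main_table m
JOIN detail_a a ON a.main_id = m.main_id
WHERE m.status = 'OPEN'
  AND a.city = 'Berlin';
-- Step 2: a prepared, key-only statement that assembles the display data
-- for each main_id found in step 1 (every join is on an indexed key).
SELECT m.*, a.city, b.phone, c.notes
FROM main_table m
LEFT JOIN detail_a a ON a.main_id = m.main_id
LEFT JOIN detail_b b ON b.main_id = m.main_id
LEFT JOIN detail_c c ON c.main_id = m.main_id
WHERE m.main_id = ?;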
In the past I have used an implementation similar to #2. You could try creating a view which basically consists of the searchable fields from each table, e.g.:
SELECT Name FROM Person
UNION SELECT Name FROM Company
Then feed that view into a full-text indexing product such as Sphinx, which can optimize your searching and provide flexible options for weights, terms, etc., as well as scheduling how often your indexes are updated.
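In practice the indexer needs a document id alongside each searchable string, so the view usually ends up looking more like this (names and the id scheme are assumptions; for Sphinx the ids typically need to be unique across both sources, e.g. by offsetting one range):
-- One searchable row per entity, with an id the full-text engine
-- can hand back to the application.
CREATE VIEW searchable (id, source, name) AS
SELECT PersonID, 'person', Name FROM Person
UNION ALL
SELECT CompanyID, 'company', Name FROM Company;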
Consolidate your 10 tables into one temporary table. See: Should I denormalize Loans, Purchases and Sales tables into one table?