Problem: Writing a MySQL parser to split JOINs and run them as individual queries (denormalizing the query dynamically)
I am trying to figure out a script to take a MySQL query and turn it into individual queries, i.e. denormalizing the query dynamically.
As a test I have built a simple article system that has 4 tables:
- articles
  - article_id
  - article_format_id
  - article_title
  - article_body
  - article_date
- article_categories
  - article_id
  - category_id
- categories
  - category_id
  - category_title
- formats
  - format_id
  - format_title
An article can be in more than one category but only have one format. I feel this is a good example of a real-life situation.
On the category page which lists all of the articles (pulling in the format_title as well) this could be easily achieved with the following query:
SELECT articles.*, formats.format_title
FROM articles
INNER JOIN formats ON articles.article_format_id = formats.format_id
INNER JOIN article_categories ON articles.article_id = article_categories.article_id
WHERE article_categories.category_id = 2
ORDER BY articles.article_date DESC
However the script I am trying to build would receive this query, parse it and run the queries individually.
So in this category page example the script would effectively run this (worked out dynamically):
// Select article_categories
$sql = "SELECT * FROM article_categories WHERE category_id = 2";
$query = mysql_query($sql);
while ($row_article_categories = mysql_fetch_array($query, MYSQL_ASSOC)) {
    // Select articles
    $sql2 = "SELECT * FROM articles WHERE article_id = " . $row_article_categories['article_id'];
    $query2 = mysql_query($sql2);
    while ($row_articles = mysql_fetch_array($query2, MYSQL_ASSOC)) {
        // Select formats
        $sql3 = "SELECT * FROM formats WHERE format_id = " . $row_articles['article_format_id'];
        $query3 = mysql_query($sql3);
        $row_formats = mysql_fetch_array($query3, MYSQL_ASSOC);
        // Merge articles and formats
        $row_articles = array_merge($row_articles, $row_formats);
        // Add to array
        $out[] = $row_articles;
    }
}
// Sort articles by date
foreach ($out as $key => $row) {
    $arr[$key] = $row['article_date'];
}
array_multisort($arr, SORT_DESC, $out);
// Output articles - this would not be part of the script obviously it should just return the $out array
foreach ($out as $row) {
    echo '<p><a href="article.php?id='.$row['article_id'].'">'.$row['article_title'].'</a> <i>('.$row['format_title'].')</i><br />'.$row['article_body'].'<br /><span class="date">'.date("F jS Y", strtotime($row['article_date'])).'</span></p>';
}
The challenges are working out the correct queries in the right order, since column names for SELECT and JOINs can appear in any order in the query (this is what MySQL and other SQL databases handle so well), and reproducing that logic in PHP.
I am currently parsing the query using SQL_Parser, which works well in splitting up the query into a multi-dimensional array, but working out the stuff mentioned above is the headache.
Any help or suggestions would be much appreciated.
From what I gather you're trying to put a layer between a 3rd-party forum application that you can't modify (obfuscated code perhaps?) and MySQL. This layer will intercept queries, re-write them to be executable individually, and generate PHP code to execute them against the database and return the aggregate result. This is a very bad idea.
It seems strange that you imply the impossibility of adding code and simultaneously suggest generating code to be added. Hopefully you're not planning on using something like funcall to inject code. This is a very bad idea.
The calls from others to avoid your initial approach and focus on the database are very sound advice. I'll add my voice to that hopefully growing chorus.
We'll assume some constraints:
- You're running MySQL 5.0 or greater.
- The queries cannot change.
- The database tables cannot be changed.
- You already have appropriate indexes in place for the tables the troublesome queries are referencing.
- You have triple-checked the slow queries hitting your DB (and run EXPLAIN on them) and have attempted to set up indexes that would help them run faster.
- The load the inner joins are placing on your MySQL install is unacceptable.
Three possible solutions:
- You could deal with this problem easily by investing money in your current database: upgrade the hardware it runs on to something with more cores, more RAM (as much as you can afford), and faster disks. If you've got the money, Fusion-io's products come highly recommended for this sort of thing. This is probably the simplest of the three options I'll offer.
- Set up a second master MySQL database and pair it with the first. Make sure you have the ability to force AUTO_INCREMENT ID alternation (one DB uses even IDs, the other odd). This doesn't scale forever, but it does offer you some breathing room for the price of the hardware and rack space. Again, beef up the hardware. You may have already done this, but if not it's worth consideration.
- Use something like dbShards. You still need to throw more hardware at this, but you have the added benefit of being able to scale beyond two machines and you can buy lower cost hardware over time.
To improve database performance you typically look for ways to:
- Reduce the number of database calls
- Make each database call as efficient as possible (via good design)
- Reduce the amount of data to be transferred
...and you are doing the exact opposite? Deliberately?
On what grounds?
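To make the first point concrete, here is a minimal sketch of collapsing many single-row lookups into one call with IN (...). The helper name build_in_query is hypothetical (it does not come from the thread), and it assumes the table and column names are trusted and the IDs are integers.

```php
<?php
// Hypothetical helper (not from the original thread): builds ONE query
// for a whole list of ids instead of one query per id.
// Assumes $table and $column are trusted identifiers and ids are integers.
function build_in_query($table, $column, array $ids) {
    // Cast to int so the values are safe to interpolate, then drop duplicates.
    $ids = array_unique(array_map('intval', $ids));
    if (empty($ids)) {
        return null; // nothing to fetch, so no query is needed
    }
    return "SELECT * FROM $table WHERE $column IN (" . implode(', ', $ids) . ")";
}

// Article ids as they might come back from article_categories.
$article_ids = array(3, 5, 3, 8);
echo build_in_query('articles', 'article_id', $article_ids), "\n";
// prints: SELECT * FROM articles WHERE article_id IN (3, 5, 8)
```

Even in this form the database still does the matching, which is the point: one round trip replaces one per row.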
I'm sorry, you are doing this entirely wrong, and every single problem you encounter down this road will be a consequence of that first decision to implement a database engine outside of the database engine. You will be forced to work around work-arounds all the way to delivery date (if you get there).
Also, we are talking about a forum? I mean, come on! Even on the most "web-scale-awesome-sauce" forums we're talking about less than what, 100 tps on average? You could do that on your laptop!
My advice is to forget about all this and implement things the simplest possible way. Then cache the aggregates (most recent, popular, statistics, whatever) in the application layer. Everything else in a forum is already primary key lookups.
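As a rough illustration of caching aggregates in the application layer, here is a minimal sketch of a time-limited cache. The class name AggregateCache and its methods are assumptions made for this example, not part of any library mentioned in the thread.

```php
<?php
// Hypothetical in-process cache with a time-to-live (illustration only).
class AggregateCache {
    private $entries = array();
    private $ttl;

    public function __construct($ttl_seconds) {
        $this->ttl = $ttl_seconds;
    }

    // Return the cached value for $key, or call $compute and store its result.
    public function get($key, $compute) {
        $now = time();
        if (isset($this->entries[$key]) && $this->entries[$key]['expires'] > $now) {
            return $this->entries[$key]['value'];
        }
        $value = call_user_func($compute);
        $this->entries[$key] = array('value' => $value, 'expires' => $now + $this->ttl);
        return $value;
    }
}

// Usage: the expensive aggregate is computed once, then served from the cache.
$calls = 0;
$cache = new AggregateCache(60);
$latest = function () use (&$calls) {
    $calls++; // stands in for an expensive aggregate query
    return array('Newest article', 'Second newest article');
};
$cache->get('most_recent', $latest);
$cache->get('most_recent', $latest);
echo $calls, "\n"; // prints: 1
```

A plain PHP array only lives for one request, so a real deployment would need a store shared between requests (APCu or Memcached, for example); the lookup-or-compute pattern stays the same.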
I agree it sounds like a bad choice, but I can think of some situations where splitting a query could be useful.
I would try something similar to this, relying heavily on regular expressions for parsing the query. It would work in a very limited number of cases, but its support could be expanded progressively when needed.
<?php
/**
* That's a weird problem, but an interesting challenge!
* @link http://stackoverflow.com/questions/5019467/problem-writing-a-mysql-parser-to-split-joins-and-run-them-as-individual-query
*/
// Taken from the given example:
$sql = "SELECT articles.*, formats.format_title
FROM articles
INNER JOIN formats ON articles.article_format_id = formats.format_id
INNER JOIN article_categories ON articles.article_id = article_categories.article_id
WHERE article_categories.category_id = 2
ORDER BY articles.article_date DESC";
// Parse query
// (Limited to the clauses that are present in the example...)
// Edit: Made WHERE optional
// Lazy quantifiers plus the end anchor keep each group from swallowing
// the optional clauses that follow it.
if(!preg_match('/^\s*'.
'SELECT\s+(?P<select_rows>.+?)'.
'\s+FROM\s+(?P<from>.+?)'.
'(?:\s+WHERE\s+(?P<where>.+?))?'.
'(?:\s+ORDER\s+BY\s+(?P<order_by>.+?)'.
'(?:\s+(?P<desc>DESC))?)?'.
'\s*$/is',$sql,$query)
) {
trigger_error('Error parsing SQL!',E_USER_ERROR);
return false;
}
## Dump matches
#foreach($query as $key => $value) if(!is_int($key)) echo "\"$key\" => \"$value\"<br/>\n";
/* We get the following matches:
"select_rows" => "articles.*, formats.format_title"
"from" => "articles INNER JOIN formats ON articles.article_format_id = formats.format_id INNER JOIN article_categories ON articles.article_id = article_categories.article_id"
"where" => "article_categories.category_id = 2"
"order_by" => "articles.article_date"
"desc" => "DESC"
/**/
// Will only support WHERE conditions separated by AND that are to be
// tested on a single individual table.
if(@$query['where']) // Edit: Made WHERE optional
$where_conditions = preg_split('/\s+AND\s+/is',$query['where']);
// Retrieve individual table information & data
$tables = array();
$from_conditions = array();
$from_tables = preg_split('/\s+INNER\s+JOIN\s+/is',$query['from']);
foreach($from_tables as $from_table) {
// \w+ keeps table and column names free of surrounding whitespace.
if(!preg_match('/^(?P<table_name>\S+)'.
'(?P<on_clause>\s+ON\s+(?P<table_a>\w+)\.(?P<column_a>\w+)\s*'.
'=\s*(?P<table_b>\w+)\.(?P<column_b>\w+))?\s*$/i',$from_table,$matches)
) {
trigger_error("Error parsing SQL! Unexpected format in FROM clause: $from_table", E_USER_ERROR);
return false;
}
## Dump matches
#foreach($matches as $key => $value) if(!is_int($key)) echo "\"$key\" => \"$value\"<br/>\n";
// Remember on_clause for the join later
// We do assume each INNER JOIN's ON clause compares left table to
// right table. Forget about parsing more complex conditions in the
// ON clause...
if(@$matches['on_clause'])
$from_conditions[$matches['table_name']] = array(
'column_a' => $matches['column_a'],
'column_b' => $matches['column_b']
);
// Match applicable WHERE conditions
$where = array();
if(@$query['where']) // Edit: Made WHERE optional
foreach($where_conditions as $where_condition)
if(preg_match("/^$matches[table_name]\.(.*)$/",$where_condition,$matched))
$where[] = $matched[1];
$where_clause = empty($where) ? null : implode(' AND ',$where);
// We simply ignore $query[select_rows] and use '*' everywhere...
// (Stored in $table_sql so the parsed $query array is not overwritten.)
$table_sql = "SELECT * FROM $matches[table_name]".($where_clause? " WHERE $where_clause" : '');
echo "$table_sql<br/>\n";
// Retrieve table's data
// Fetching the entire table data right away avoids multiplying MySQL
// queries exponentially...
$table = array();
if($results = mysql_query($table_sql))
while($row = mysql_fetch_array($results, MYSQL_ASSOC))
$table[] = $row;
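// Sort table if applicable
if(!empty($query['order_by']) && preg_match("/^$matches[table_name]\.(.*)$/",$query['order_by'],$matched)) {
$sort_key = $matched[1];
// @todo Do your bubble sort here!
// array_reverse() returns a new array, so the result must be assigned.
if(@$query['desc']) $table = array_reverse($table);
}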
$tables[$matches['table_name']] = $table;
}
// From here, all data is fetched.
// All that is left to do is the actual join.
/**
* Equijoin/Theta-join.
* Joins relation $R and $S where $a from $R compares to $b from $S.
* @param array $R A relation (set of tuples).
* @param array $S A relation (set of tuples).
* @param string $a Attribute from $R to compare.
* @param string $b Attribute from $S to compare.
* @return array A relation resulting from the equijoin/theta-join.
*/
function equijoin($R,$S,$a,$b) {
$T = array();
if(empty($R) or empty($S)) return $T;
foreach($R as $tupleR) foreach($S as $tupleS)
if($tupleR[$a] == @$tupleS[$b])
$T[] = array_merge($tupleR,$tupleS);
return $T;
}
$jointure = array_shift($tables);
if(!empty($tables)) foreach($tables as $table_name => $table)
$jointure = equijoin($jointure, $table,
$from_conditions[$table_name]['column_a'],
$from_conditions[$table_name]['column_b']);
return $jointure;
?>
Good night, and Good luck!
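The equijoin() function in the script above compares every tuple of $R with every tuple of $S. As a hedged alternative, the sketch below indexes $S by its join column first, so each tuple of $R needs only one lookup. The name hash_equijoin is hypothetical, and it assumes the join values are integers or strings (PHP array keys), which holds for the ID columns in the example.

```php
<?php
// Hypothetical hash-based variant of equijoin() (illustration only).
// Assumes join values are ints or strings, since they are used as array keys.
function hash_equijoin(array $R, array $S, $a, $b) {
    // Build an index of $S keyed by the value of column $b.
    $index = array();
    foreach ($S as $tupleS) {
        if (isset($tupleS[$b])) {
            $index[$tupleS[$b]][] = $tupleS;
        }
    }
    // Probe the index once per tuple of $R; the order of $R is preserved.
    $T = array();
    foreach ($R as $tupleR) {
        if (!isset($tupleR[$a]) || !isset($index[$tupleR[$a]])) {
            continue; // no matching tuple in $S
        }
        foreach ($index[$tupleR[$a]] as $tupleS) {
            $T[] = array_merge($tupleR, $tupleS);
        }
    }
    return $T;
}

$articles = array(
    array('article_id' => 1, 'article_format_id' => 2, 'article_title' => 'A'),
    array('article_id' => 2, 'article_format_id' => 1, 'article_title' => 'B'),
    array('article_id' => 3, 'article_format_id' => 9, 'article_title' => 'C'),
);
$formats = array(
    array('format_id' => 1, 'format_title' => 'News'),
    array('format_id' => 2, 'format_title' => 'Review'),
);
$joined = hash_equijoin($articles, $formats, 'article_format_id', 'format_id');
echo count($joined), "\n"; // prints: 2 (article C has no matching format)
```

Because the outer relation's order is kept, sorting the first table before joining still gives a sorted result, as the original script relies on.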
Instead of rewriting the SQL, I think you should create a denormalized articles table and update it on each article insert/delete/update. It will be MUCH simpler and cheaper.
Create the table and populate it:
create table articles_denormalized
...
insert into articles_denormalized
SELECT articles.*, formats.format_title
FROM articles
INNER JOIN formats ON articles.article_format_id = formats.format_id
INNER JOIN article_categories ON articles.article_id = article_categories.article_id
Now issue the appropriate article insert/update/delete against it and you will have a denormalized table always ready to be queried.
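One way to keep the denormalized table current is to rebuild the rows for a single article whenever that article changes. The sketch below only builds the two statements; the function name build_refresh_statements is hypothetical, and it assumes articles_denormalized has the same columns as the SELECT above (so it includes article_id).

```php
<?php
// Hypothetical helper (illustration only): returns the statements that
// rebuild the denormalized rows for one article.
// Assumes articles_denormalized mirrors the columns of the SELECT below.
function build_refresh_statements($article_id) {
    $id = (int) $article_id; // cast so the value is safe to interpolate
    return array(
        // Remove the stale rows for this article...
        "DELETE FROM articles_denormalized WHERE article_id = $id",
        // ...then re-insert them from the normalized tables.
        "INSERT INTO articles_denormalized " .
        "SELECT articles.*, formats.format_title " .
        "FROM articles " .
        "INNER JOIN formats ON articles.article_format_id = formats.format_id " .
        "INNER JOIN article_categories ON articles.article_id = article_categories.article_id " .
        "WHERE articles.article_id = $id",
    );
}

foreach (build_refresh_statements(7) as $statement) {
    echo $statement, "\n";
}
```

As with the populate query above, the join on article_categories yields one row per category the article belongs to. After a delete, only the first statement is needed.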