MYSQL match text fields
I h开发者_开发技巧ave a mysql database with around 1.5 million company records(name, country and other small text fields) I want to mark the same records with a flag (for example if two companies with the same name are in USA then I have to set a field (match_id) equal to say an integer 10) and likewise for other matches. At the moment its taking a long time (days) I feel I am not utilizing MYsql properly I am posting my code below, Is there a faster way to do this???
<?php
//Create the table if does not already exist
mysql_query("CREATE TABLE IF NOT EXISTS proj (
id INT(11) NOT NULL AUTO_INCREMENT PRIMARY KEY ,
company_id text NOT NULL ,
company_name varchar(40) NOT NULL ,
company_name_text varchar(33) NOT NULL,
company_name_metaphone varchar(19) NOT NULL,
country varchar(20) NOT NULL ,
file_id int(2) NOT NULL ,
thompson_id varchar(11) NOT NULL ,
match_no int(7) NOT NULL ,
INDEX(company_name_text))")
or die ("Couldn't create the table: " . mysql_error());
//********Real script starts********
$countries_searched = array(); //To save record ids already flagged (save time)
$counter = 1; //Flag
//Since the company_names which are same are going to be from the same country so I get all the countries first in the below query and then in the next get all the companies in that country
$sql = "SELECT DISTINCT country FROM proj WHERE country='Canada'";
$result = mysql_query($sql) or die(mysql_error());
while($resultrow = mysql_fetch_assoc($result)) {
$country = $resultrow['country'];
$res = mysql_query("SELECT company_name_metaphone, id, company_name_text
FROM proj
WHERE country='$country'
ORDER BY id") or die (mysql_error());
//Loop through the company records
while ($row = mysql_fetch_array($res, MYSQL_NUM)) {
//If record id is already flagged (matched and saved in the countries searched array) don't waste time doing anything
if ( in_array($row[1], $countries_searched) ) {
continue;
}
if (strlen($row[0]) > 9) {
$row[0] = substr($row[0],0,9);
$query = mysql_query("SELECT id FROM proj
WHERE country='$country'
AND company_name_metaphone LIKE '$row[0]%'
AND id<>'$row[1]'") or die (mysql_error());
while ($id = mysql_fetch_array($query, MYSQL_NUM)) {
if (!in_array($id[0], $countries_searched)) $countries_searched[] = $id[0];
}
if(mysql_num_rows($query) > 0) {
mysql_query("UPDATE proj SET match_no='$counter'
WHERE country='$country'
AND company_name_metaphone LIKE '$row[0]%'")
or die (mysql_error()." ".mysql_errno());
$counter++;
}
}
else if(strlen($row[0]) > 3) {
$query = mysql_query("SELECT id FROM proj WHERE country='$country'
AND company_name_text='$row[2]' AND id<>'$row[1]'")
or die (mysql_error());
while ($id = mysql_fetch_array($query, MYSQL_NUM)) {
if (!in_array($id[0], $countries_searched)) $countries_searched[] = $id[0];
}
if(mysql_num_rows($query) > 0) {
mysql_query("UPDATE proj SET match_no='$counter'
WHERE country='$country'
AND company_name_text='$row[2]'") or die (mysql_error());
$counter++;
}
}
}
}
?>
I would go for pure sql solution, something like :
SELECT
GROUP_CONCAT(id SEPARATOR ' '), "name"
FROM proj
WHERE
LENGTH(company_name_metaphone) < 9 AND
LENGTH(company_name_metaphone) > 3
GROUP BY country, UPPER(company_name_text)
HAVING COUNT(*) > 1
UNION
SELECT
GROUP_CONCAT(id SEPARATOR ' '), "metaphone"
FROM proj
WHERE
LENGTH(company_name_metaphone) > 9
GROUP BY country, LEFT(company_name_metaphone, 9)
HAVING COUNT(*) > 1
then loop through this results to update ids.
I'm not sure what your are trying to do, but what I can see in your code is that you are making a lot of searches in arrays with a lot of data, I think your problem is your PHP code and not SQL statements.
you will need to adjust the group by fields to suit your matching requirements
if your script times out (quite likely due to the large amount of data), set_time_limit(0) otherwise you can also add a limit of 1000 or something to the $sql, and run the script multiple times as the where clause will exclude any matched rows already processed (but will not keep track of $match_no inbetween calls, so your would need to handle that yourself)
// find all companies that have multiple rows grouped by identifying fields
$sql = "select company_name, country, COUNT(*) as num_matches from proj
where match_no = 0
group by company_name, country
having num_matches > 1";
$res = mysql_query($sql);
$match_no = 1;
// loop through all duplicate companies, and set match_id
while ($row = mysql_fetch_assoc($res)) {
$company_name = mysql_escape_string($row['company_name']);
$country = mysql_escape_string($row['country']);
$sql = "update proj set match_no = $match_no where
company_name = '$company_name', country = '$country';
mysql_query($sql);
$match_no++;
}
精彩评论