开发者

MYSQL match text fields

I h开发者_开发技巧ave a mysql database with around 1.5 million company records(name, country and other small text fields) I want to mark the same records with a flag (for example if two companies with the same name are in USA then I have to set a field (match_id) equal to say an integer 10) and likewise for other matches. At the moment its taking a long time (days) I feel I am not utilizing MYsql properly I am posting my code below, Is there a faster way to do this???

<?php

//Create the table if does not already exist
mysql_query("CREATE TABLE IF NOT EXISTS proj ( 
  id INT(11) NOT NULL AUTO_INCREMENT PRIMARY KEY ,
  company_id text NOT NULL ,
  company_name varchar(40) NOT NULL ,
  company_name_text varchar(33) NOT NULL,
  company_name_metaphone varchar(19) NOT NULL,
  country varchar(20) NOT NULL ,
  file_id int(2) NOT NULL ,
  thompson_id varchar(11) NOT NULL ,
  match_no int(7) NOT NULL ,
  INDEX(company_name_text))") 
  or die ("Couldn't create the table: " . mysql_error());


//********Real script starts********
$countries_searched = array(); //To save record ids already flagged (save time)
$counter = 1; //Flag

//Since the company_names which are same are going to be from the same country so I    get all the countries first in the below query and then in the next get all the   companies in that country
$sql = "SELECT DISTINCT country FROM proj WHERE country='Canada'";
$result = mysql_query($sql) or die(mysql_error());

while($resultrow = mysql_fetch_assoc($result)) {
  $country = $resultrow['country'];
  $res = mysql_query("SELECT company_name_metaphone, id, company_name_text 
  FROM proj 
  WHERE country='$country' 
  ORDER BY id") or die (mysql_error());


  //Loop through the company records 
  while ($row = mysql_fetch_array($res, MYSQL_NUM)) {

  //If record id is already flagged (matched and saved in the countries searched      array) don't waste time doing anything    
    if ( in_array($row[1], $countries_searched) ) {
      continue;
    }

    if (strlen($row[0]) > 9) {
      $row[0] = substr($row[0],0,9);
      $query = mysql_query("SELECT id FROM proj 
        WHERE country='$country' 
        AND company_name_metaphone LIKE '$row[0]%' 
        AND id<>'$row[1]'") or die (mysql_error());

      while ($id = mysql_fetch_array($query, MYSQL_NUM)) {
        if (!in_array($id[0], $countries_searched)) $countries_searched[] = $id[0];
      }
      if(mysql_num_rows($query) > 0) {

        mysql_query("UPDATE proj SET match_no='$counter' 
                    WHERE country='$country' 
                    AND company_name_metaphone LIKE '$row[0]%'") 
          or die (mysql_error()." ".mysql_errno());
        $counter++;
      }
    }
    else if(strlen($row[0]) > 3) {
      $query = mysql_query("SELECT id FROM proj WHERE country='$country' 
               AND company_name_text='$row[2]' AND id<>'$row[1]'") 
        or die (mysql_error());
      while ($id = mysql_fetch_array($query, MYSQL_NUM)) {
        if (!in_array($id[0], $countries_searched)) $countries_searched[] = $id[0];
      }
      if(mysql_num_rows($query) > 0) {
        mysql_query("UPDATE proj SET match_no='$counter' 
                    WHERE country='$country' 
                    AND company_name_text='$row[2]'") or die (mysql_error());
        $counter++;
      }
    }   
  }
}
?>


I would go for pure sql solution, something like :

SELECT 
    GROUP_CONCAT(id SEPARATOR ' '), "name"
FROM proj 
WHERE 
    LENGTH(company_name_metaphone) < 9 AND 
    LENGTH(company_name_metaphone) > 3
GROUP BY country, UPPER(company_name_text)
HAVING COUNT(*) > 1
UNION
SELECT 
    GROUP_CONCAT(id SEPARATOR ' '), "metaphone"
FROM proj 
WHERE 
    LENGTH(company_name_metaphone) > 9
GROUP BY country, LEFT(company_name_metaphone, 9)
HAVING COUNT(*) > 1

then loop through this results to update ids.


I'm not sure what your are trying to do, but what I can see in your code is that you are making a lot of searches in arrays with a lot of data, I think your problem is your PHP code and not SQL statements.


you will need to adjust the group by fields to suit your matching requirements

if your script times out (quite likely due to the large amount of data), set_time_limit(0) otherwise you can also add a limit of 1000 or something to the $sql, and run the script multiple times as the where clause will exclude any matched rows already processed (but will not keep track of $match_no inbetween calls, so your would need to handle that yourself)

// find all companies that have multiple rows grouped by identifying fields

$sql = "select company_name, country, COUNT(*) as num_matches from proj 
where match_no = 0
group by company_name, country 
having num_matches > 1";

$res = mysql_query($sql);

$match_no = 1;

// loop through all duplicate companies, and set match_id
while ($row = mysql_fetch_assoc($res)) {

  $company_name = mysql_escape_string($row['company_name']);
  $country = mysql_escape_string($row['country']);

   $sql = "update proj set match_no = $match_no where 
       company_name = '$company_name', country = '$country';

     mysql_query($sql);

     $match_no++;
}
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜