Hard to append a table with many records into another without generating duplicates

2023-01-03 00:29 问答作者：

I may seem to be a bit wordy at first but for the hope it will be easier for all of you to understand what I am doing in the first place. I have an uncommon but enjoyable activity of collecting as many species of wood from around the world as I can (over 2,900 so far). Ok, that is the real world.

Meanwhile I have spent over 8 years compiling over 5.8 meg of text data on all the woods of the world. That got so large that learning some basic PHP and MySQL was most welcome so I could build a new database driven home for all this research. I am still slow at it but getting there. The original premise was to find evidence of as many species of woods in the world I can. The more names identified, the more successful the project.

I have named the project TAXA for ease of conversation (short for Taxonomy). You are most welcome to take a look at what I have so far at www.prowebcanada.com/taxa. It is 95% dynamically driven. So far I am reporting about 6,500 botanical wood names and, as said above, the more I can report, the more successful is the project.

I have a file of all the woods in the second largest wood collection in the world, the Tervuren wood collection in the Netherlands with over 11,300 wood names even after cleaning out all duplicates. That is almost twice the number I am reporting now so porting all the new wood names from Tervuren to the 'species' table where I keep the reported data would be a major desirable advancement in the project.

At one point I was able to add all the Tervuren records to the species table but over 3,000 duplicates also formed. They were not in the Tervuren file in the first place but represent the same wood names common to both files. It is common sense that there would be woods common to both that when merged would create new duplicates.

At one point and with the help of others from another forum, I may very well have finally got the proper SQL statement. When I ran it, though, the system said (semi-amusingly at first) ----- that it had gone away! After looking up on the Net what could have have done this, one reason is that the MySQL timeout lapses and probably because of the large size of files I am running. I am running this on a rented account on Godaddy so I cannot go about trying to adjust any config file.

For safety, I copied the tervuren.sql file as tervuren_target.sql and the species.sql file as species_master.sql tp use as working files just to make sure I protect the original files from destruction or damage. Later I can name the species_master back to just species.sql once I am happy all worked well.

The species file has about 18 columns in it but only 5 columns match the columns in the Tervuren file (name for name and collation also). The rest of the columns are just along for the ride, so to speak. The common key in both is the 'species_name" columns in both. I am not sure it is at all proper to call one a primary key and the other a foreign key since there really is no relational connection to them. One is just more data for the other and can disappear after, never to be referred to the working code in the application.

I have been very surprised and flabbergasted on how hard it can be to append record开发者_运维百科s from one large table into another (with same column names plus others) without generating NEW duplicates in the first place. Watch out thinking that a SELECT DISTINCT statement may do the job because absolutely NO records in the species table must get destroyed in the process and there is no way (well, that I know of) to tell the 'DISTINCT" command this. Yes, the original 'species' table has duplicates in it even before all this but, trust me ---- they have to be removed the long hard way manually record by record or I will lose precious information. It is more important to just make sure no NEW duplicates form through bringing in new names in the tervuren_target.species_name into species.species_name.

I am hoping and thinking that a straight SQL solution should work --- except for that nasty timeout. How do I get past that? Could it mean that I may have to turn to a PHP plus SQL method?? Or ..... would I have to break up the Tervuren files into a few smaller ones and run them independently (hope not....)"

So far, what seems should be easy has proven to be unexpectedly tricky. I appreciate any help you can give but start from the assumption that this may be harder to do right than it may seem on the surface.

By the way --- I am running a quad 64 bit system with Windows 7, so at least I have some fairly hefty power on the client end. I have a direct ethernet cable feeding a cable connection to the Internet.

Once I get an algorithm and code working for this, I also have many other lists to process that could make the 'species' table grow even more. It could be equivalent to (ahem) lighting a rocket under my project (especially compared to do this record by record manually)!

This is my first time in this forum, so I do not know how I can receive any replies. Do I have to to come back here periodically or are replies emailed out also? It would be great if you CC'd copies to me at billmudry at rogers.com :-)

Much thanks for your patience and help,

Bill Mudry Mississauga, Ontario Canada (next to Toronto).

First, make sure you have a backup of your data. One small mistake could wipe everything you've done for the last 8 years. Don't touch a thing until you're sure you have a working backup.

Then I would suggest first creating a new table similar to your target table:

CREATE TABLE table2 LIKE table1

Import the data into that new table. Then run this query to find the duplicates between table1 and table2:

SELECT * FROM table2
JOIN table1
ON table2.species_name = table1.species_name

Then you can use a multi-table delete being careful to only delete from table2:

DELETE FROM table2
USING table2
JOIN table1
ON table2.species_name = table1.species_name

Rerun the first query to check that there are no duplicates remaining. Run this query to check how many new rows you will add:

SELECT COUNT(*) FROM table2

When you are sure everything looks good you can import your data into table1:

INSERT INTO table1
SELECT * FROM table2

Then you can delete table2 as you no longer need it.

Place a UNIQUE constraint on the 5 columns that you will use to detect duplicates. When importing data, use MySQL's INSERT IGNORE statement to discard duplicates.

ALTER TABLE Taxa 
    ADD UNIQUE KEY (Field3,Field7,Field8,etc.);

INSERT IGNORE INTO Taxa...;

Hard to append a table with many records into another without generating duplicates

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？