SQL Server Query Returning Multiple Rows
I am currently working on a SSIS package that is extracting a table from one database to another. The tables in both of the databases use the same column as the primary开发者_开发问答 key. My select statement to extract the data is a simple select statement. When I ran the package I was receiving a error that there where duplicate primary key values.
I reviewed my select statement and verified that my select statement was not returning duplicate rows. So to test this I removed the primary key from the table that I am inserting the data into and reran the SSIS package. After it ran I looked at the table to see what rows where being duplicated. What I found was that rows that where being edited while the extract was running where being duplicated, there was a record of it before the edit, and a record of it after the edit. I could easily tell this because the table has a last modified field that gets updated every time a record is updated.
I added a NOLOCK hint to my select statement, and it stopped returning duplicate rows.
So my question is why? I would have expected that a select statement with a NOLOCK table hint would have a higher chance of returning duplicate rows because it is not using locking, and that a select statement without the NOLOCK hint should use locking to make sure that it does not return duplicate rows.
Here is the select statement that I am using to select the data. I did verify that the joins are not causing it to duplicate rows:
SELECT pe.enc_id,
pe.enc_nbr,
pe.billable_ind,
pe.clinical_ind AS clinical_ind,
pe.budget_ind,
pe.print_stmt_ind,
pe.send_coll_letter_ind,
pe.outsource_exempt_ind,
cb.First_name + ' ' + cb.last_name AS CreatedBy,
pe.create_timestamp AS create_timestamp,
mb.first_name + ' ' + mb.last_name AS ModifiedBy,
pe.modify_timestamp AS modify_timestamp
FROM patient_encounter pe WITH(NOLOCK)
LEFT OUTER JOIN user_mstr cb WITH(NOLOCK) ON
pe.created_by = cb.user_id
LEFT OUTER JOIN user_mstr mb WITH(NOLOCK) ON
pe.modified_by = mb.user_id
NOLOCK hint causes dirty read anomallies, and one such anomaly is a duplicate read. Such reads are frequent if an update changes the position of the row in the index scanned by the query:
- say you have 2 rows in the table, with an ID key, rows with key values 1 and 2
- one request (T1) runs UPDATE table SET key=3 WHERE key=1;
- second request (T2) runs SELECT ... FROM table WITH(NOLOCK);
- T1 locks the row with key value 1
- T2 ignores the lock T1 has and reads the row with key value 1
- T2 continue and reads row with key value 2
- T1 update the row, and the row is moved in the index int he new position for key value 3
- T2 continues to scan and reads the row with key value 3
So the SELECT has read a row twice, once while it had the key value 1 and once while it had a key value 3. This is just a trivial example of what can happen. In reality more complicated queries can run complex plans and use other indexes, all subject to such anomalies.
In short: NOLOCK hint is evil. If you want to avoid contention, use snapshot isolation.
The WITH NOLOCK hint just tells the database server to ignore the locks, and just select the current values from the database - hence it simply selects all the current row value at the time it hits that row.
Note that you will NOT get the updates in the new table of the rows that were being updated.
Without seeing your SQL, I would guess that the way it was constructed, it grabbed the current row, waited for a lock to clear, then selected the new row as well.
Locking the entire table would prevent the changes/duplicates, but you risk locking everyone out of the table while you do your select.
EDIT: FYI ALTERNATIVES: use the READPAST—Rows locked by other processes are skipped and TABLOCK—Lock at the table level which of course will block other processes and might not be desired.
NOTE: UPDLOCK gets converted to XLOCK during the transaction write.
There are two categories for the hints: granularity and isolation-level. Granularity includes PAGLOCK, NOLOCK, ROWLOCK, and TABLOCK. Isolation-level hints include HOLDLOCK, NOLOCK, READCOMMITTED, REPEATABLEREAD, and SERIALIZABLE. A maximum of one from each group may be used.
EDIT2: Just for completeness: READCOMMITTED—Read only data from transactions that have been committed. This is the default behavior.
EDIT3: More info: The NOLOCK WILL read the rows, but you risk reading "dirty" data that will change or having data that will not exist if a ROLLBACK happens to a transaction which may impact accuracy of the selected set.
The other important information is to discover what kinds of locks the transactions are using so you can plan accordingly.
精彩评论