Get duplicates for two columns with LINQ
LINQ drives me crazy. Why does following query not return the duplicates, whereas it works with only one identifier? Where is my error?
' generate some test-data '
Dim source As New DataTable
source.Columns.Add(New DataColumn("RowNumber", GetType(Int32)))
source.Columns.Add(New DataColumn("Value1", GetType(Int32)))
source.Columns.Add(New DataColumn("Value2", GetType(Int32)))
source.Columns.Add(New DataColumn("Text", GetType(String)))
Dim rnd As New Random()
For i As Int32 = 1 To 100
Dim newRow = source.NewRow
Dim value = rnd.Next(1, 20)
newRow("RowNumber") = i
newRow("Value1") = value
newRow("Value2") = (value + 1)
newRow("Text") = String.Format("RowNumber{0}-Text", i)
source.Rows.Add(newRow)
Next
' following query does not work, it always has Count=0 '
' although it works with only one identifier '
Dim dupIdentifiers = From row In source
Group row By grp = New With {.Val1 = row("Value1"), .Val2 = row("Value2")}
Into Group
Where Group.Count > 1
Select idGroup = New With {grp.Val1, grp.Val2, Group.Count}
Edit: Following is the complete solution, thanks to @Jon Skeet's answer :)
Dim dupKeys = From row In source
Group row By grp = New With {Key .Val1 = CInt(row("Value1")), Key .Val2 = CInt(row("Value2"))}
Into Group Where Group.Count > 1
Select RowNumber = CInt(Group.FirstOrDefault.Item("RowNumber"))
Dim dupRows = From row In source
Join dupKey In dupKeys
On row("RowNumber") Equals dupKey
Select row
If dupRows.Any Then
' create a new DataTable from the first duplicate rows '
Dim dest = dupRows.CopyToDataTable
End If
The main problem with grouping was that i must make them key
properties.
The next problem in my above code was to get the duplicate rows from the original table.
Because nearly every row has a duplicate(according to two fields), the result DataTable contained 99 of 100 rows and not only the 19 duplicate values. I needed to select only the first duplicate row and join them with the original table on the P开发者_Python百科K.
Select RowNumber = CInt(Group.FirstOrDefault.Item("RowNumber"))
Although this works in my case, maybe someone can explain me how to select only the duplicates from the original table if i would have had only composite keys.
Edit: I'v answered the last part of the question myself, so here is all i need:
Dim dups = From row In source
Group By grp = New With {Key .Value1 = CInt(row("Value1")), Key .Value2 = CInt(row("Value2"))}
Into Group Where Group.Count > 1
Let Text = Group.First.Item("Text")
Select Group.First
If dups.Any Then
Dim dest = dups.CopyToDataTable
End If
I needed the Let-Keyword in order to keep the other column(s) into the same context and return only the first row of the grouped dups. On this way i can use CopyToDataTable
to create a DataTable from the duplicate rows.
Only a few lines of code overall (i can save the second query to find the rows in the original table) to find duplicates on multiple columns and create a DataTable of them.
The problem is the way anonymous types work in VB - they're mutable by default; only Key
properties are included for hashing and equality. Try this:
Group row By grp = New With {Key .Val1 = row("Value1"), Key .Val2 = row("Value2")}
(In C# this wouldn't be a problem - anonymous types in C# are always immutable in all properties.)
What I use to get duplicate rows across two columns in a EF table to show up as duplicates using Lin-q with C Sharp:
var DuplicatesFoundInTable =
entities.LocationDatas
.Where(c => c.TrailerNumber != null && c.CarrierName != null && (c.TrailerNumber ?? string.Empty) != string.Empty && (c.CarrierName ?? string.Empty) != string.Empty)
.GroupBy(o => new { o.TrailerNumber, o.CarrierName }, l => new { customer.TrailerNumber, customer.CarrierName })
.Where(g => g.Count() > 1)
.Select(y => y.Key)
.ToList();
When I want to see if it is a duplicate on inputs (if the entry already exists in two columns):
//Check to see if any rows are the same values on TrailerNumber and CarrierName for inputs.
bool AlreadyInTableComparer = entities.LocationDatas.Any(l => String.Compare(l.TrailerNumber, customer.TrailerNumber, StringComparison.InvariantCulture) == 0 && String.Compare(l.CarrierName, customer.CarrierName, StringComparison.InvariantCulture) == 0);
bool AlreadyInTable = entities.LocationDatas.Any(t => t.TrailerNumber == customer.TrailerNumber && t.CarrierName == customer.CarrierName);
SQL Server Checking for duplicates (commented out delete duplicates):
WITH CTE
AS
(
SELECT [TrailerNumber], [CarrierName]
,ROW_NUMBER() OVER(Partition BY TrailerNumber Order by TrailerNumber,
CarrierName) AS NumRows, ROW_NUMBER() OVER(Partition BY TrailerNumber,
CarrierName Order by CarrierName) AS NumRows2
FROM [dbo].[LocationData] --Please note, duplicates are shown in this
table.
WHERE TrailerNumber != '' AND CarrierName != ''
)
SELECT [TrailerNumber], [CarrierName], [NumRows2] FROM CTE WHERE NumRows2 > 1
--DELETE FROM CTE WHERE NumRows2 > 1 --Delete Duplicates.
Validate SQL Server to prove correct against CTE filtering:
SELECT TrailerNumber, CarrierName, COUNT(*) AS Duplicates
FROM [dbo].[LocationData]
WHERE TrailerNumber IS NOT NULL OR CarrierName IS NOT NULL
GROUP BY TrailerNumber, CarrierName
HAVING COUNT(*) >1 AND TrailerNumber != '' AND CarrierName != ''
精彩评论