
Excel macro VBA to sum up duplicate values and then remove duplicate records

I am trying to sum up values based on duplicates found across columns A-O, using the macro below. There are around 500k+ records, and the macro hangs badly.

 Sub Formulae(TargetCol1, TargetCol2, ConcatCol, Col1, Col2, StartRow, EndRow, Sheet)

     Sheets(Sheet).Range(TargetCol1 & CStr(StartRow)).Formula = "=SUMIF($" & ConcatCol & "$" & CStr(StartRow) & ":$" & ConcatCol & "$" & CStr(EndRow) & "," & ConcatCol & CStr(StartRow) & ",$" & Col1 & "$" & CStr(StartRow) & ":$" & Col1 & "$" & CStr(EndRow) & ")"

     Sheets(Sheet).Range(TargetCol1 & CStr(StartRow)).Select
     Selection.Copy
     Sheets(Sheet).Range(TargetCol1 & CStr(EndRow)).Select
     Range(Selection, Selection.End(xlUp)).Select
     Application.CutCopyMode = False
     Selection.FillDown

     Call PasteSpecial(TargetCol1, "T", StartRow, EndRow)

     Sheets(Sheet).Range(TargetCol2 & CStr(StartRow)).Formula = "=SUMIF($" & ConcatCol & "$" & CStr(StartRow) & ":$" & ConcatCol & "$" & CStr(EndRow) & "," & ConcatCol & CStr(StartRow) & ",$" & Col2 & "$" & CStr(StartRow) & ":$" & Col2 & "$" & CStr(EndRow) & ")"

     Sheets(Sheet).Range(TargetCol2 & CStr(StartRow)).Select
     Selection.Copy
     Sheets(Sheet).Range(TargetCol2 & CStr(EndRow)).Select
     Range(Selection, Selection.End(xlUp)).Select
     Application.CutCopyMode = False
     Selection.FillDown

     Call PasteSpecial(TargetCol2, "U", StartRow, EndRow)

 End Sub


Sub PasteSpecial(Col1, Col2, StartRow, EndRow)

    Range(Col1 & CStr(StartRow)).Select
    Range(Selection, Selection.End(xlDown)).Select
    Selection.Copy
    Range(Col2 & CStr(StartRow)).Select
    Selection.PasteSpecial Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks _
        :=False, Transpose:=False

End Sub

Let me explain the macro briefly. I have columns A-O and I have to group them; based on the grouping, I have to sum columns P and Q. I have a function that makes a concatenated string out of the A-O columns and stores it in column AA. Based on this column, I use SUMIF to sum all duplicate values:

 =SUMIF($AA$2:$AA$500000,$AA2,$P$2:$P$500000)
 =SUMIF($AA$2:$AA$500000,$AA2,$Q$2:$Q$500000)

Then I copy and paste-special the above cells as values into two new columns, to strip out the formulas (the PasteSpecial sub in the macro code above).

Finally, I call RemoveDuplicates to remove the duplicate rows.

I have used the .RemoveDuplicates method, which seems to work pretty fast even on such a huge dataset. Is there any predefined function in Excel which would also sum the values of the duplicates and then remove the duplicate entries?

 Sub Remove_Duplicates_In_A_Range(StartRow, EndRow, Sheet, StartCol, EndCol, level)

     Sheets(Sheet).Range(StartCol & CStr(StartRow) & ":" & EndCol & CStr(EndRow)).RemoveDuplicates Columns:=20, Header:=xlNo

End Sub

The above logic hangs badly, eating all CPU resources and eventually crashing...

Could someone please optimize the above macro so that it works with 500k+ records? A runtime of 1-2 minutes max is acceptable.

Please help!!!

EDIT: By 500k+ records I mean A1:O500000. I am supposed to check for duplicates in this manner: the combination A1,B1,C1,D1,E1,F1,G1,H1,I1,J1,K1,L1,M1,N1,O1 against A2,B2,C2,D2,E2,F2,G2,H2,I2,J2,K2,L2,M2,N2,O2 and A3,B3,C3,D3,E3,F3,G3,H3,I3,J3,K3,L3,M3,N3,O3 and so on, till A500000,B500000 etc.

In short, I am supposed to check whether the entire A1-O1 set matches the entire A2-O2, or A3-O3, or ... A500000-O500000 set, and so on.

For every match between entire A-O recordsets, I need to sum their respective P and Q columns. Say, for example, the A1-O1 set matches the A2-O2 set; then add P1+P2 and Q1+Q2 and store the results in P1 and Q1, or something similar.

In either case, I need to retain each original recordset, say A1-O1, with P1 and Q1 holding the sum of its own values and those of its duplicates.

I don't suppose we can attach a demo of the Excel sheet here, can we? :(
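Since attaching a sheet isn't possible, here is a toy illustration with made-up values (x and y stand for two distinct A-O combinations):

```
Before:
Row 1:  x x ... x   P1=10  Q1=5
Row 2:  x x ... x   P2=7   Q2=3   (duplicate of row 1)
Row 3:  y y ... y   P3=2   Q3=1   (unique)

After:
Row 1:  x x ... x   P1=17  Q1=8   (10+7 and 5+3)
Row 3:  y y ... y   P3=2   Q3=1
```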

EDIT2:

A sub for replicating the SUMIF formula across all cells:

 Sub PreNettingBenefits(StartRow1, EndRow1, StartRow2, EndRow2, Col_Asset, Col_Liab, Src_Col_Asset, Src_Col_Liab, ConcatCol, Src_ConcatCol, level, Sheet2, Sheet1)

     '=SUMIF(Sheet1!$AA$2:$AA$81336,Sheet2!AA2,Sheet1!$P$2:$P$81336)
     Application.Calculation = xlCalculationAutomatic

     Sheets(Sheet2).Range(Col_Asset & CStr(StartRow2)).Formula = "=SUMIF(" & Sheet1 & "!$" & Src_ConcatCol & "$" & CStr(StartRow1) & ":$" & Src_ConcatCol & "$" & CStr(EndRow1) & "," & Sheet2 & "!" & ConcatCol & CStr(StartRow2) & "," & Sheet1 & "!$" & Src_Col_Asset & "$" & CStr(StartRow1) & ":$" & Src_Col_Asset & "$" & CStr(EndRow1) & ")"
     Sheets(Sheet2).Range(Col_Asset & CStr(StartRow2)).Select
     Selection.Copy
     MsgBox Sheets(Sheet2).Range(Col_Asset & CStr(EndRow2)).Address
     Sheets(Sheet2).Range(Col_Asset & CStr(EndRow2)).Select
     Range(Col_Asset & CStr(StartRow2) & ":" & Col_Asset & CStr(EndRow2)).Select
     Application.CutCopyMode = False
     Selection.FillDown

     Sheets(Sheet2).Range(Col_Liab & CStr(StartRow2)).Formula = "=SUMIF(" & Sheet1 & "!$" & Src_ConcatCol & "$" & CStr(StartRow1) & ":$" & Src_ConcatCol & "$" & CStr(EndRow1) & "," & Sheet2 & "!" & ConcatCol & CStr(StartRow2) & "," & Sheet1 & "!$" & Src_Col_Liab & "$" & CStr(StartRow1) & ":$" & Src_Col_Liab & "$" & CStr(EndRow1) & ")"
     Sheets(Sheet2).Range(Col_Liab & CStr(StartRow2)).Select
     Selection.Copy
     MsgBox Sheets(Sheet2).Range(Col_Liab & CStr(EndRow2)).Address
     Sheets(Sheet2).Range(Col_Liab & CStr(EndRow2)).Select
     Range(Col_Liab & CStr(StartRow2) & ":" & Col_Liab & CStr(EndRow2)).Select
     Application.CutCopyMode = False
     Selection.FillDown

     Application.Calculation = xlCalculationManual

 End Sub

It hangs pretty badly. What is the problem with replicating the formula across 30k-40k rows? Could someone please optimise the code?


Something must be terribly wrong with how you are doing the adding of the duplicates. Since you were scant on details of the data you are working with, I don't know if this is the same, but I populated A1:O33334 (over 500k cells) with a random number between 1 and 10,000.

Using a dictionary object (I am known for my love and over-use of it), I went through all of them and summed only the duplicate values and then slapped the unique list of elements into column A in sheet2.

Reasons why a dictionary might be the thing to use:

  • You can weed out duplicates
  • You can check if a value exists in the dictionary or not
  • You can transpose the unique list easily onto Excel

The dupe checking and addition, and the copying of the unique cells, take only 2 seconds. Here is the code for your reference.

Sub test()

Application.ScreenUpdating = False
Dim vArray As Variant
Dim result As Long
Dim i As Long, j As Long
Dim dict As Object
Set dict = CreateObject("scripting.dictionary")

vArray = Range("A1:O33334").Value

On Error Resume Next
For i = 1 To UBound(vArray, 1)
    For j = 1 To UBound(vArray, 2)
        If dict.exists(vArray(i, j)) = False Then
            dict.Add vArray(i, j), 1
        Else
            result = result + vArray(i, j)
        End If
    Next
Next

Sheet2.Range("a1").Resize(dict.Count).Value = _
Application.Transpose(dict.keys)

Application.ScreenUpdating = True
MsgBox "Total for duplicate cells: " & result & vbLf & _
    "Unique cells copied: " & dict.Count

End Sub


You shouldn't Select cells when executing code; work with Range objects directly.

Btw, if you have a look at your code, some statements are useless:

Sheets(Sheet).Range(TargetCol1 & CStr(StartRow)).Select
Selection.Copy

is never pasted

For performance issues, see the tips in this thread: Benchmarking VBA Code
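For example, the first fill from the question could be written without any selection at all; a sketch only, using the question's ranges as an illustration:

```vba
Sub FillWithoutSelect()
    ' Sketch: writes the SUMIF into T2, fills it down, then converts
    ' the column to values in place -- no Select, no clipboard.
    With Sheets("Sheet1")
        .Range("T2").Formula = "=SUMIF($AA$2:$AA$500000,AA2,$P$2:$P$500000)"
        .Range("T2:T500000").FillDown
        .Range("T2:T500000").Value = .Range("T2:T500000").Value
    End With
End Sub
```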


The essence of the question, as I understand it, is to find the duplicates, add them up, and then delete them. You also mentioned grouping them, but it is not clear how. In any case, I would ditch the macros: row-by-row operations aren't going to work on a dataset that size.

Here are some steps I would take. Modify them to fit your needs:

Use the CONCATENATE function to create a new column to the right of your dataset. For example:

=concatenate(a2,b2,c2,d2,e2)

Create a column called Dups and use the following to populate it:

=if(countif(dataSetNamedRange,aa2)>1,1,0)

In the formula above, AA2 refers to the concatenated column for that row. The result is that you now have all dups flagged. Now use the filter tools in the Data menu to create a sort or a filter to fit your grouping needs. To add up the values, use DSUM. To delete the dups, use an advanced filter. Good luck.
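The flag and the sum can also be combined so that each group's total appears only on its first occurrence; a sketch, assuming the concatenated key sits in column AA and the amounts in P:

```
=IF(COUNTIF($AA$2:AA2,AA2)=1, SUMIF($AA$2:$AA$500000,AA2,$P$2:$P$500000), "")
```

Copied down from row 2, rows showing an empty string are the duplicates, which can then be filtered and deleted in one block.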


I am adding this as a second answer since it's going to get long...

Because I am a stubborn mule, I tried many different things; I think you've reached the limit of what Excel can do. The best routine I could come up with was this, and note I am using 50,000 rows, not your 500,000:

  • 50,000 rows with 100 unique rows, randomly spread: 1m:47s
  • 50,000 rows with 50 unique rows, randomly spread: 57s
  • 50,000 rows with 25 unique rows, randomly spread: 28s
  • 50,000 rows with 10 unique rows, randomly spread: 12s
  • 50,000 rows with 5 unique rows, randomly spread: 6s

As you can see, performance deteriorates as the number of unique rows increases. I have a lot of wacky ideas here, so I thought I'd share my code for the sake of research:

  • I take the entire range of 750k cells and dump it into a variant array (0.2 seconds)
  • I dump the P & Q columns into a similar variant array for use later
  • I make an array of 50,000 strings (one per row) from the variant array (only 1 second or so!)
  • I say goodbye to the massive variant array to free up memory
  • I start my loop through each row, comparing against all 50k rows...
  • If a dupe row is found, it's added to the dupe dictionary so we don't repeat the process on that row later
  • When a dupe is found, its P & Q totals are added to the P & Q of the row in question
  • After checking all 50k rows, we slap the total into column R of the row and move on
  • If the row has been noted as a dupe in the dupe dictionary, we skip it (evil GoTo, beware!)

Sub test()

Application.ScreenUpdating = False
Application.Calculation = xlCalculationManual

Dim rowArray As Variant
Dim totalArray As Variant
Dim i As Long, j As Long
Dim dupeDict As Object
Set dupeDict = CreateObject("scripting.dictionary")
Dim count As Long
Dim rowData() As String

'dump the cells into a single array
rowArray = Range("A1:O50000").Value

'grab totals from P and Q to keep them separate
totalArray = Range("P1:Q50000").Value

'create a string for each row
ReDim rowData(1 To 50000)
For i = 1 To 50000
    For j = 1 To 15
        rowData(i) = rowData(i) & rowArray(i, j)
    Next
Next

'free up that memory
Erase rowArray

'check all rows, total P & Q on a match
On Error Resume Next
For i = 1 To 50000
    'skip row and move to next if we've seen it
    If dupeDict.exists(i) = True Then
        GoTo Dupe
    End If
    count = 0
    For j = 1 To 50000
        If rowData(i) = rowData(j) Then
            dupeDict.Add j, 1 'add that sucker to the dupe dict
            count = count + totalArray(j, 1) + totalArray(j, 2)
        End If
    Next
    'enter the final total in column R (once per row, outside the inner loop)
    Cells(i, 18).Value = count
Dupe:
Next

Application.ScreenUpdating = True
Application.Calculation = xlCalculationAutomatic

End Sub
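For what it's worth, the 50k x 50k comparison could in principle be avoided by keying the dictionary on the row string itself and accumulating the P and Q totals in a single pass; a sketch under the same layout assumptions (data in A1:O50000, totals in P:Q), untested at this scale:

```vba
Sub SumDupesSinglePass()
    ' Sketch only: one pass, dictionary keyed on the concatenated row.
    Dim data As Variant, totals As Variant
    Dim dict As Object, key As String
    Dim i As Long, j As Long, firstRow As Long
    Set dict = CreateObject("scripting.dictionary")

    data = Range("A1:O50000").Value
    totals = Range("P1:Q50000").Value

    For i = 1 To UBound(data, 1)
        key = ""
        For j = 1 To 15
            key = key & data(i, j) & Chr(0) 'separator so "1","23" <> "12","3"
        Next
        If dict.exists(key) Then
            firstRow = dict(key) 'fold this dupe into the first occurrence
            totals(firstRow, 1) = totals(firstRow, 1) + totals(i, 1)
            totals(firstRow, 2) = totals(firstRow, 2) + totals(i, 2)
            totals(i, 1) = Empty 'blank out the dupe's totals
            totals(i, 2) = Empty
        Else
            dict.Add key, i 'remember where we first saw this row
        End If
    Next

    'write the accumulated totals back; dupe rows now have empty P/Q
    Range("P1:Q50000").Value = totals
End Sub
```

RemoveDuplicates (or a filter on the now-empty P cells) can then drop the dupe rows, since each first occurrence already carries the group totals.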
