Easiest way to decide if List contains duplicates?

2023-04-11 18:15 问答作者：

list.distinct.size != list.size

Is there any better way? It would have been nice to have a containsDuplicates method

Assuming "better" means "faster", see the alternative approaches benchmarked in this question, which seems to show some quicker methods (although note that distinct uses a HashSet and is already O(n)). YMMV of course, depending on specific test case, scala version etc. Probably any significant improvement over the "distinct.size" approach would come from an early-out as soon as a duplicate is found, but how much of a speed-up is actually obtained would depend strongly on how common duplicates actually are in your use-case.

If you mean "better" in that you want to write list.containsDuplicates instead of containsDuplicates(list), use an implicit:

implicit def enhanceWithContainsDuplicates[T](s:List[T]) = new {
  def containsDuplicates = (s.distinct.size != s.size)
}

assert(List(1,2,2,3).containsDuplicates)
assert(!List("a","b","c").containsDuplicates)

You can also write:

list.toSet.size != list.size

But the result will be the same because distinct is already implemented with a Set. In both case the time complexity should be O(n): you must traverse the list and Set insertion is O(1).

I think this would stop as soon as a duplicate was found and is probably more efficient than doing distinct.size - since I assume distinct keeps a set as well:

@annotation.tailrec
def containsDups[A](list: List[A], seen: Set[A] = Set[A]()): Boolean = 
  list match {
    case x :: xs => if (seen.contains(x)) true else containsDups(xs, seen + x)
    case _ => false
}

containsDups(List(1,1,2,3))
// Boolean = true

containsDups(List(1,2,3))
// Boolean = false

I realize you asked for easy and I don't now that this version is, but finding a duplicate is also finding if there is an element that has been seen before:

def containsDups[A](list: List[A]): Boolean =  {
  list.iterator.scanLeft(Set[A]())((set, a) => set + a) // incremental sets
    .zip(list.iterator)
    .exists{ case (set, a) => set contains a }
}

@annotation.tailrec 
def containsDuplicates [T] (s: Seq[T]) : Boolean = 
  if (s.size < 2) false else 
    s.tail.contains (s.head) || containsDuplicates (s.tail)

I didn't measure this, and think it is similar to huynhjl's solution, but a bit more simple to understand.

It returns early, if a duplicate is found, so I looked into the source of Seq.contains, whether this returns early - it does.

In SeqLike, 'contains (e)' is defined as 'exists (_ == e)', and exists is defined in TraversableLike:

def exists (p: A => Boolean): Boolean = {
  var result = false
  breakable {
    for (x <- this)
      if (p (x)) { result = true; break }
  }
  result
}

I'm curious how to speed things up with parallel collections on multi cores, but I guess it is a general problem with early-returning, while another thread will keep running, because it doesn't know, that the solution is already found.

Summary: I've written a very efficient function which returns both List.distinct and a List consisting of each element which appeared more than once and the index at which the element duplicate appeared.

Note: This answer is a straight copy of the answer on a related question.

Details: If you need a bit more information about the duplicates themselves, like I did, I have written a more general function which iterates across a List (as ordering was significant) exactly once and returns a Tuple2 consisting of the original List deduped (all duplicates after the first are removed; i.e. the same as invoking distinct) and a second List showing each duplicate and an Int index at which it occurred within the original List.

Here's the function:

def filterDupes[A](items: List[A]): (List[A], List[(A, Int)]) = {
  def recursive(remaining: List[A], index: Int, accumulator: (List[A], List[(A, Int)])): (List[A], List[(A, Int)]) =
    if (remaining.isEmpty)
      accumulator
    else
      recursive(
          remaining.tail
        , index + 1
        , if (accumulator._1.contains(remaining.head))
            (accumulator._1, (remaining.head, index) :: accumulator._2)
          else
            (remaining.head :: accumulator._1, accumulator._2)
      )
  val (distinct, dupes) = recursive(items, 0, (Nil, Nil))
  (distinct.reverse, dupes.reverse)
}

An below is an example which might make it a bit more intuitive. Given this List of String values:

val withDupes =
  List("a.b", "a.c", "b.a", "b.b", "a.c", "c.a", "a.c", "d.b", "a.b")

...and then performing the following:

val (deduped, dupeAndIndexs) =
  filterDupes(withDupes)

...the results are:

deduped: List[String] = List(a.b, a.c, b.a, b.b, c.a, d.b)
dupeAndIndexs: List[(String, Int)] = List((a.c,4), (a.c,6), (a.b,8))

And if you just want the duplicates, you simply map across dupeAndIndexes and invoke distinct:

val dupesOnly =
  dupeAndIndexs.map(_._1).distinct

...or all in a single call:

val dupesOnly =
  filterDupes(withDupes)._2.map(_._1).distinct

...or if a Set is preferred, skip distinct and invoke toSet...

val dupesOnly2 =
  dupeAndIndexs.map(_._1).toSet

...or all in a single call:

val dupesOnly2 =
  filterDupes(withDupes)._2.map(_._1).toSet

This is a straight copy of the filterDupes function out of my open source Scala library, ScalaOlio. It's located at org.scalaolio.collection.immutable.List_._.

If you're trying to check for duplicates in a test then ScalaTest can be helpful.

import org.scalatest.Inspectors._
import org.scalatest.Matchers._
forEvery(list.distinct) { item =>
  withClue(s"value $item, the number of occurences") {
    list.count(_ == item) shouldBe 1
  }
}

// example:
scala> val list = List(1,2,3,4,3,2)
list: List[Int] = List(1, 2, 3, 4, 3, 2)

scala> forEvery(list) { item => withClue(s"value $item, the number of occurences") { list.count(_ == item) shouldBe 1 } }
org.scalatest.exceptions.TestFailedException: forEvery failed, because:
  at index 1, value 2, the number of occurences 2 was not equal to 1 (<console>:19),
  at index 2, value 3, the number of occurences 2 was not equal to 1 (<console>:19)
in List(1, 2, 3, 4)

继续阅读：scala

Easiest way to decide if List contains duplicates?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？