开发者

Splitting string into groups

I'm trying to 'group' a string into segments, I guess this example would explain it more succintly

scala> val str: String = "aaaabbcddeeeeeeffg"
... (do something)
res0: List("aaaa","bb","c","dd","eeeee","ff","g")
开发者_如何学Python

I can thnk of a few ways to do this in an imperative style (with vars and stepping through the string to find groups) but I was wondering if any better functional solution could be attained? I've been looking through the Scala API but there doesn't seem to be something that fits my needs.

Any help would be appreciated


You can split the string recursively with span:

def s(x : String) : List[String] = if(x.size == 0) Nil else {
    val (l,r) = x.span(_ == x(0))
    l :: s(r) 
}

Tail recursive:

@annotation.tailrec def s(x : String, y : List[String] = Nil) : List[String] = {
    if(x.size == 0) y.reverse 
    else {
        val (l,r) = x.span(_ == x(0))
        s(r, l :: y)
    }
}


Seems that all other answers are very concentrated on collection operations. But pure string + regex solution is much simpler:

str split """(?<=(\w))(?!\1)""" toList

In this regex I use positive lookbehind and negative lookahead for the captured char


def group(s: String): List[String] = s match {
  case "" => Nil
  case s  => s.takeWhile(_==s.head) :: group(s.dropWhile(_==s.head))
}

Edit: Tail recursive version:

def group(s: String, result: List[String] = Nil): List[String] = s match {
  case "" => result reverse
  case s  => group(s.dropWhile(_==s.head), s.takeWhile(_==s.head) :: result)
}

can be used just like the other because the second parameter has a default value and thus doesnt have to be supplied.


Make it one-liner:

scala>  val str = "aaaabbcddddeeeeefff"
str: java.lang.String = aaaabbcddddeeeeefff

scala> str.groupBy(identity).map(_._2)
res: scala.collection.immutable.Iterable[String] = List(eeeee, fff, aaaa, bb, c, dddd)

UPDATE:

As @Paul mentioned about the order here is updated version:

scala> str.groupBy(identity).toList.sortBy(_._1).map(_._2)
res: List[String] = List(aaaa, bb, c, dddd, eeeee, fff)


You could use some helper functions like this:

val str = "aaaabbcddddeeeeefff"

def zame(chars:List[Char]) = chars.partition(_==chars.head)

def q(chars:List[Char]):List[List[Char]] = chars match {
    case Nil => Nil
    case rest =>
        val (thesame,others) = zame(rest)
        thesame :: q(others)
}

q(str.toList) map (_.mkString)

This should do the trick, right? No doubt it can be cleaned up into one-liners even further


A functional* solution using fold:

def group(s : String) : Seq[String] = {
  s.tail.foldLeft(Seq(s.head.toString)) { case (carry, elem) =>
    if ( carry.last(0) == elem ) {
      carry.init :+ (carry.last + elem)
    }
    else {
      carry :+ elem.toString
    }
  }
}

There is a lot of cost hidden in all those sequence operations performed on strings (via implicit conversion). I guess the real complexity heavily depends on the kind of Seq strings are converted to.

(*) Afaik all/most operations in the collection library depend in iterators, an imho inherently unfunctional concept. But the code looks functional, at least.


Starting Scala 2.13, List is now provided with the unfold builder which can be combined with String::span:

List.unfold("aaaabbaaacdeeffg") {
  case ""   => None
  case rest => Some(rest.span(_ == rest.head))
}
// List[String] = List("aaaa", "bb", "aaa", "c", "d", "ee", "ff", "g")

or alternatively, coupled with Scala 2.13's Option#unless builder:

List.unfold("aaaabbaaacdeeffg") {
  rest => Option.unless(rest.isEmpty)(rest.span(_ == rest.head))
}
// List[String] = List("aaaa", "bb", "aaa", "c", "d", "ee", "ff", "g")

Details:

  • Unfold (def unfold[A, S](init: S)(f: (S) => Option[(A, S)]): List[A]) is based on an internal state (init) which is initialized in our case with "aaaabbaaacdeeffg".
  • For each iteration, we span (def span(p: (Char) => Boolean): (String, String)) this internal state in order to find the prefix containing the same symbol and produce a (String, String) tuple which contains the prefix and the rest of the string. span is very fortunate in this context as it produces exactly what unfold expects: a tuple containing the next element of the list and the new internal state.
  • The unfolding stops when the internal state is "" in which case we produce None as expected by unfold to exit.


Edit: Have to read more carefully. Below is no functional code.

Sometimes, a little mutable state helps:

def group(s : String) = {
  var tmp = ""
  val b = Seq.newBuilder[String]

  s.foreach { c =>
    if ( tmp != "" && tmp.head != c ) {
      b += tmp
      tmp = ""
    }

    tmp += c
  }
  b += tmp

  b.result
}

Runtime O(n) (if segments have at most constant length) and tmp.+= probably creates the most overhead. Use a string builder instead for strict runtime in O(n).

group("aaaabbcddeeeeeeffg")
> Seq[String] = List(aaaa, bb, c, dd, eeeeee, ff, g)


If you want to use scala API you can use the built in function for that:

str.groupBy(c => c).values

Or if you mind it being sorted and in a list:

str.groupBy(c => c).values.toList.sorted
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜