I have an iterator of strings, where each string is either "H" (header) or "D" (detail). I want to split this iterator into blocks, where each block starts with one header and is followed by zero or more details.
I know how to solve this problem by loading everything into memory. For example, the code below:
Seq("H","D","D","D","H","D","H","H","D","D","H","D").toIterator
  .foldLeft(List[List[String]]())((acc, x) => x match {
    case "H" => List(x) :: acc
    case "D" => (x :: acc.head) :: acc.tail
  })
  .map(_.reverse)
  .reverse
returns 5 blocks - List(List(H, D, D, D), List(H, D), List(H), List(H, D, D), List(H, D)) - which is what I want.
However, instead of List[List[String]] as the result, I want either Iterator[List[String]] or some other structure that lets me evaluate the result lazily. Even if the entire iterator is consumed, the whole input should not be loaded into memory; only the block currently being consumed should be held in memory (e.g. the one returned when I call iterator.next).
How can I modify the code above to achieve the result I want?
EDIT: I need this in Scala 2.11 specifically, as the environment I use sticks to it. Glad to also accept answers for other versions though.
Here is the simplest implementation I could find (It's generic and lazy):
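/** Takes 'it' and groups consecutive elements
  * until the next item that satisfies the 'startGroup' predicate occurs.
  * It returns Iterator[List[T]] and is lazy
  * (keeps in memory only the last group, not the whole 'it').
  */
def groupUsing[T](it: Iterator[T])(startGroup: T => Boolean): Iterator[List[T]] = {
  // Each scan state is the group built so far, in reverse order.
  val sc = it.scanLeft(List.empty[T]) {
    (a, b) => if (startGroup(b)) b :: Nil else b :: a
  }
  // A group is complete when the next state did not grow; the trailing Nil closes the last group.
  // The a.nonEmpty guard skips the initial empty state, so an empty input yields no groups.
  (sc ++ Iterator(Nil)).sliding(2, 1).collect {
    case Seq(a, b) if a.nonEmpty && a.length >= b.length => a.reverse
  }
}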
Use it like this:
val exampleIt = Seq("H1","D1","D2","D3","H2","D4","H3","H4","D5","D6","H5","D7").toIterator
groupUsing(exampleIt)(_.startsWith("H"))
// H1 D1 D2 D3 / H2 D4 / H3 / H4 D5 D6 / H5 D7
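To check the laziness claim, here is a minimal sketch that runs groupUsing on an endless iterator and takes only the first two groups. The object name LazinessSketch and the endless source are hypothetical, added only for illustration; groupUsing is restated so the snippet is self-contained.

```scala
object LazinessSketch {
  // groupUsing restated from the answer; the a.nonEmpty guard makes an
  // empty input yield no groups.
  def groupUsing[T](it: Iterator[T])(startGroup: T => Boolean): Iterator[List[T]] = {
    val sc = it.scanLeft(List.empty[T]) {
      (a, b) => if (startGroup(b)) b :: Nil else b :: a
    }
    (sc ++ Iterator(Nil)).sliding(2, 1).collect {
      case Seq(a, b) if a.nonEmpty && a.length >= b.length => a.reverse
    }
  }

  // Hypothetical endless source: H0 D D H1 D D H2 D D ...
  def endless: Iterator[String] =
    Iterator.from(0).flatMap(i => Iterator(s"H$i", "D", "D"))

  // Terminates because only the elements needed to close two groups are pulled.
  val firstTwo: List[List[String]] =
    groupUsing(endless)(_.startsWith("H")).take(2).toList
  // List(List(H0, D, D), List(H1, D, D))
}
```

If the grouping were eager, this call would never return, so it doubles as a quick sanity test.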
Here is the specification:
X | GIVEN | EXPECTED |
O | | | empty iterator
O | H | H | single header
O | D | D | single item (not header)
O | HD | HD |
O | HH | H,H | only headers
O | HHD | H,HD |
O | HDDDHD | HDDD,HD |
O | DDH | DD,H | leading D's have no header, as you can see.
O | HDDDHDHDD | HDDD,HD,HDD |
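The table above can be turned into executable checks. This is only a sketch: the object SpecSketch, the helper run (which feeds each character as one element and joins the resulting groups with commas) and the table encoding are assumptions for illustration, not the tests from the scalafiddle.

```scala
object SpecSketch {
  // groupUsing restated from the answer; the a.nonEmpty guard makes an
  // empty input yield no groups.
  def groupUsing[T](it: Iterator[T])(startGroup: T => Boolean): Iterator[List[T]] = {
    val sc = it.scanLeft(List.empty[T]) {
      (a, b) => if (startGroup(b)) b :: Nil else b :: a
    }
    (sc ++ Iterator(Nil)).sliding(2, 1).collect {
      case Seq(a, b) if a.nonEmpty && a.length >= b.length => a.reverse
    }
  }

  // Hypothetical helper: "HDDDHD" becomes "HDDD,HD".
  def run(input: String): String =
    groupUsing(input.iterator.map(_.toString))(_ == "H")
      .map(_.mkString)
      .mkString(",")

  // GIVEN -> EXPECTED rows copied from the specification table.
  val table: List[(String, String)] = List(
    "" -> "", "H" -> "H", "D" -> "D", "HD" -> "HD", "HH" -> "H,H",
    "HHD" -> "H,HD", "HDDDHD" -> "HDDD,HD", "DDH" -> "DD,H",
    "HDDDHDHDD" -> "HDDD,HD,HDD"
  )

  // Rows whose actual output differs from the expected one.
  val failures: List[(String, String)] =
    table.filter { case (in, expected) => run(in) != expected }
}
```

SpecSketch.failures should be empty when every row passes. Note that the string encoding cannot distinguish "no groups" from "one empty group", so the empty-iterator row is worth checking separately with isEmpty.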
scalafiddle with tests and additional comments: https://scalafiddle.io/sf/q8xbQ9N/11
(If this answer is helpful, please up-vote it. I spent a little too much time on it :))
SECOND IMPLEMENTATION:
You proposed a version that does not use sliding. Here it is, but it has its own problems, listed below.
def groupUsing2[T >: Null](it: Iterator[T])(startGroup: T => Boolean): Iterator[List[T]] = {
  // (group being built, group that was just closed)
  type TT = (List[T], List[T])
  val empty: TT = (Nil, Nil)
  // We need this ugly `++ Iterator(null)` to close the last group.
  val sc = (it ++ Iterator(null)).scanLeft(empty) {
    (a, b) => if (b == null || startGroup(b)) (b :: Nil, a._1) else (b :: a._1, Nil)
  }
  sc.collect {
    case (_, a) if a.nonEmpty => a.reverse
  }
}
Traits:
- (-) It works only for T >: Null types. We need to append an element that closes the last group at the end (null is perfect for that, but it limits the element type).
- (~) It should create the same amount of garbage as the previous version. We just create the tuples in the first step instead of the second one.
- (+) It does not check the length of a List, which is an O(n) operation (and this is a big gain, to be honest).
- (+) At its core it is Ivan Kurchenko's answer, but without the extra boxing.
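For comparison, the T >: Null restriction can be lifted by wrapping elements in Option and using None as the closing sentinel, at the price of one extra allocation per element. The sketch below is an assumption about what such a variant could look like; the names OptionSentinelSketch and groupUsing3 are hypothetical, and this is not necessarily the code from the other answer.

```scala
object OptionSentinelSketch {
  // Hypothetical variant of groupUsing2: a trailing None closes the last
  // group, so T needs no Null bound (it works for Int, for example).
  def groupUsing3[T](it: Iterator[T])(startGroup: T => Boolean): Iterator[List[T]] = {
    // (group being built, group that was just closed)
    type State = (List[T], List[T])
    val empty: State = (Nil, Nil)
    val wrapped: Iterator[Option[T]] = it.map[Option[T]](Some(_)) ++ Iterator(None)

    wrapped.scanLeft(empty) {
      // Non-starting element: extend the current group.
      case ((current, _), Some(x)) if !startGroup(x) => (x :: current, Nil)
      // Starting element: close the current group and open a new one.
      case ((current, _), Some(x)) => (x :: Nil, current)
      // End of input: close whatever is left.
      case ((current, _), None) => (Nil, current)
    }.collect {
      case (_, closed) if closed.nonEmpty => closed.reverse
    }
  }
}
```

For example, groupUsing3(Iterator(0, 1, 2, 0, 3))(_ == 0).toList gives List(List(0, 1, 2), List(0, 3)). The Some wrapper per element is the kind of boxing that groupUsing2 avoids by using null.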
Here is scalafiddle: https://scalafiddle.io/sf/q8xbQ9N/11