In certain algorithms, one may need to assign unique identifiers to data set elements.
This document shows how DataSetUtils can be used for that purpose.
zipWithIndex assigns consecutive labels to the elements, receiving a data set as input and returning a new data set of (unique id, initial value) 2-tuples.
This process requires two passes, first counting and then labeling the elements, and cannot be pipelined because the counts must be synchronized before labeling can start.
The alternative zipWithUniqueId works in a pipelined fashion and is preferred when a unique labeling is sufficient.
For example, applying zipWithIndex to a data set containing the elements A through H may yield the tuples: (0,G), (1,H), (2,A), (3,B), (4,C), (5,D), (6,E), (7,F)
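The count-then-label process described above can be imitated in plain Java. The following is a minimal sketch, not Flink code and not necessarily how DataSetUtils is implemented: the class `ZipWithIndexSketch`, the helper type `Labeled`, and the split of the input into two partitions (G, H and A to F) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java imitation of count-then-label indexing (hypothetical, not Flink code). */
class ZipWithIndexSketch {

    /** An (id, value) pair standing in for a 2-tuple. */
    static final class Labeled<T> {
        final long id;
        final T value;

        Labeled(long id, T value) {
            this.id = id;
            this.value = value;
        }

        @Override
        public String toString() {
            return "(" + id + "," + value + ")";
        }
    }

    static <T> List<Labeled<T>> zipWithIndex(List<List<T>> partitions) {
        // Pass 1: count every partition and turn the counts into start offsets.
        // No label can be assigned until all counts are known.
        long[] offsets = new long[partitions.size()];
        long running = 0;
        for (int p = 0; p < partitions.size(); p++) {
            offsets[p] = running;
            running += partitions.get(p).size();
        }

        // Pass 2: label each element with its partition offset plus its local position.
        List<Labeled<T>> result = new ArrayList<>();
        for (int p = 0; p < partitions.size(); p++) {
            long next = offsets[p];
            for (T value : partitions.get(p)) {
                result.add(new Labeled<>(next++, value));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed split of the eight elements across two parallel partitions.
        List<List<String>> partitions = List.of(
                List.of("G", "H"),
                List.of("A", "B", "C", "D", "E", "F"));
        System.out.println(zipWithIndex(partitions));
    }
}
```

With this assumed split, the sketch produces the same eight tuples as the example. A different distribution of elements over partitions would change which element receives which index, which is why the documented output is only one possible result.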
In many cases one may not need to assign consecutive labels.
zipWithUniqueId works in a pipelined fashion, which speeds up label assignment. This method receives a data set as input and returns a new data set of (unique id, initial value) 2-tuples; the ids are unique but may contain gaps.
For example, applying zipWithUniqueId to a data set containing the elements A through H may yield the tuples: (0,G), (1,A), (2,H), (3,B), (5,C), (7,D), (9,E), (11,F)
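One labeling scheme that is consistent with this sample output, and that needs no counting pass, combines a per-task counter with the task index. The sketch below is plain Java under stated assumptions rather than Flink's actual implementation: the class `ZipWithUniqueIdSketch`, the two-partition split, and the formula `(counter << shift) + task` are illustrative choices.

```java
import java.util.List;
import java.util.TreeMap;

/** Plain-Java imitation of pipelined unique labeling (hypothetical, not Flink code). */
class ZipWithUniqueIdSketch {

    static <T> TreeMap<Long, T> zipWithUniqueId(List<List<T>> partitions) {
        // Number of bits needed to encode the largest task index (parallelism - 1).
        int shift = 64 - Long.numberOfLeadingZeros(partitions.size() - 1);

        // Keyed by id, so iteration order is ascending id.
        TreeMap<Long, T> result = new TreeMap<>();
        for (int task = 0; task < partitions.size(); task++) {
            long counter = 0;
            for (T value : partitions.get(task)) {
                // High bits: local counter. Low bits: task index.
                // Each task labels independently, so no synchronization is needed.
                long id = (counter << shift) + task;
                counter++;
                if (result.put(id, value) != null) {
                    throw new IllegalStateException("duplicate id " + id);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed split of the eight elements across two parallel tasks.
        List<List<String>> partitions = List.of(
                List.of("G", "H"),
                List.of("A", "B", "C", "D", "E", "F"));
        System.out.println(zipWithUniqueId(partitions));
    }
}
```

Under these assumptions task 0 hands out even ids and task 1 odd ids. Since task 0 holds only two elements, the even ids 4, 6, 8, and 10 are never used, which accounts for the gaps in the sample output.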