Partitioning data with multidplyr
The multidplyr package provides simple techniques to partition data across a set of workers (multicore parallelism) on the same or different nodes.
Create a cluster of workers
Let’s load the multidplyr package:
library(multidplyr)
First of all, you need to create a set of workers:
cl <- new_cluster(4)
cl
4 session cluster [....]
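The number of workers is often chosen from the number of cores available on the machine. A minimal sketch, assuming the base parallel package, that leaves one core free for the main R session (the name n_workers is made up for illustration):
# hypothetical sizing: one worker per core, minus one for the main session
n_workers <- max(1, parallel::detectCores() - 1)
cl <- new_cluster(n_workers)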
Data assignment
There are multiple ways to assign data to the workers.
Assign the same value to each worker
This is done with the cluster_assign() function:
cluster_assign(cl, a = 1:4)
To execute code on each worker and return the result, use the cluster_call() function:
cluster_call(cl, a)
[[1]]
[1] 1 2 3 4
[[2]]
[1] 1 2 3 4
[[3]]
[1] 1 2 3 4
[[4]]
[1] 1 2 3 4
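cluster_assign() accepts several name-value pairs in one call, and the assigned objects can be functions; cluster_call() then evaluates an arbitrary expression on every worker. A minimal sketch (the names double_it and a2 are made up for illustration):
# assign a function and a value in a single call (names are hypothetical)
cluster_assign(cl, double_it = function(x) x * 2, a2 = 10)
cluster_call(cl, double_it(a2))   # each worker returns 20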
cluster_assign(cl, b = runif(4))
cluster_call(cl, b)
[[1]]
[1] 0.93146519 0.75181518 0.33158435 0.02970799
[[2]]
[1] 0.93146519 0.75181518 0.33158435 0.02970799
[[3]]
[1] 0.93146519 0.75181518 0.33158435 0.02970799
[[4]]
[1] 0.93146519 0.75181518 0.33158435 0.02970799
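Note that runif(4) is evaluated once in the main session, which is why every worker receives the same four numbers. To evaluate code on each worker instead, you can send it with cluster_send(), which runs the code remotely without returning a result. A minimal sketch (the name f is made up for illustration):
# runif() is now evaluated on each worker rather than once locally
cluster_send(cl, f <- runif(4))
cluster_call(cl, f)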
Assign different values to each worker
For this, use cluster_assign_each() instead:
cluster_assign_each(cl, c = 1:4)
cluster_call(cl, c)
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
[[4]]
[1] 4
cluster_assign_each(cl, d = runif(4))
cluster_call(cl, d)
[[1]]
[1] 0.8892167
[[2]]
[1] 0.09334862
[[3]]
[1] 0.614763
[[4]]
[1] 0.6986541
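A common use of cluster_assign_each() is to give every worker a different input, for example a different file to read. A minimal sketch, with hypothetical file names:
# one (hypothetical) file per worker
cluster_assign_each(cl, filename = c("d1.csv", "d2.csv", "d3.csv", "d4.csv"))
# each worker reads its own file; cluster_send() runs the code without returning a result
cluster_send(cl, dat <- read.csv(filename))
cluster_call(cl, nrow(dat))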
Partition vectors
cluster_assign_partition() splits up a vector to assign about the same amount of data to each worker:
cluster_assign_partition(cl, e = 1:10)
cluster_call(cl, e)
[[1]]
[1] 1 2 3
[[2]]
[1] 4 5
[[3]]
[1] 6 7
[[4]]
[1] 8 9 10
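In everyday use you rarely assign vectors by hand: the usual workflow is to partition a grouped data frame with partition(), run dplyr verbs on the workers, and bring the result back with collect(). A minimal sketch, assuming dplyr is installed and using the built-in mtcars data:
library(dplyr)
mtcars %>%
  group_by(cyl) %>%                      # groups are spread across the workers
  partition(cl) %>%                      # returns a partitioned data frame
  summarise(mean_mpg = mean(mpg)) %>%    # computed on each worker
  collect()                              # gather the results locally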