Partitioning data with multidplyr
The package multidplyr provides simple techniques to partition data across a set of workers (multicore parallelism) on the same or different nodes.
Create a cluster of workers
Let’s load the multidplyr package:
library(multidplyr)
First of all, you need to create a set of workers:
cl <- new_cluster(4)
cl
4 session cluster [....]
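To check that the cluster really consists of separate R processes, one option is to ask each worker for its process ID. This is a minimal sketch and not part of the original walkthrough: it assumes multidplyr is installed, uses a throwaway two-worker cluster called cl_demo (a name made up for this illustration), and relies on cluster_call(), which is introduced in the next section.

```r
library(multidplyr)

# Throwaway cluster for this illustration (cl_demo is a hypothetical name)
cl_demo <- new_cluster(2)

# Ask every worker for the ID of the R process it runs in
pids <- unlist(cluster_call(cl_demo, Sys.getpid()))

# Number of distinct worker processes
length(unique(pids))

# Is the current (main) session among them?
Sys.getpid() %in% pids
```

Each worker is its own R session, so objects created in the main session are not visible to the workers until they are explicitly assigned.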
Data assignment
There are multiple ways to assign data to the workers.
Assign the same value to each worker
This is done with the cluster_assign() function:
cluster_assign(cl, a = 1:4)
To execute code on each worker and return the results, use the function cluster_call():
cluster_call(cl, a)
[[1]]
[1] 1 2 3 4
[[2]]
[1] 1 2 3 4
[[3]]
[1] 1 2 3 4
[[4]]
[1] 1 2 3 4
cluster_assign(cl, b = runif(4))
cluster_call(cl, b)
[[1]]
[1] 0.93146519 0.75181518 0.33158435 0.02970799
[[2]]
[1] 0.93146519 0.75181518 0.33158435 0.02970799
[[3]]
[1] 0.93146519 0.75181518 0.33158435 0.02970799
[[4]]
[1] 0.93146519 0.75181518 0.33158435 0.02970799
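All four workers hold the same random numbers because the value passed to cluster_assign() is computed once in the main session and then copied to every worker. The sketch below contrasts this with running runif() inside each worker through cluster_call(). It assumes multidplyr is installed and uses a small throwaway cluster, cl_demo, which is a name made up for this illustration.

```r
library(multidplyr)

# Throwaway cluster for this illustration (cl_demo is a hypothetical name)
cl_demo <- new_cluster(2)

# runif(3) is evaluated once, locally, and the result is copied to each worker
cluster_assign(cl_demo, b = runif(3))
same <- cluster_call(cl_demo, b)
identical(same[[1]], same[[2]])  # TRUE

# Here runif(3) is evaluated inside each worker instead,
# so every worker draws its own numbers
own <- cluster_call(cl_demo, runif(3))
own
```

Which form is appropriate depends on whether the workers should share one value or each generate their own.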
Assign different values to each worker
For this, use cluster_assign_each() instead:
cluster_assign_each(cl, c = 1:4)
cluster_call(cl, c)
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
[[4]]
[1] 4
cluster_assign_each(cl, d = runif(4))
cluster_call(cl, d)
[[1]]
[1] 0.8892167
[[2]]
[1] 0.09334862
[[3]]
[1] 0.614763
[[4]]
[1] 0.6986541
Partition vectors
cluster_assign_partition() splits up a vector to assign about the same amount of data to each worker:
cluster_assign_partition(cl, e = 1:10)
cluster_call(cl, e)
[[1]]
[1] 1 2 3
[[2]]
[1] 4 5
[[3]]
[1] 6 7
[[4]]
[1] 8 9 10
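The chunk sizes above (3, 2, 2, 3) can be reproduced in base R. The sketch below uses parallel::splitIndices() to split 10 positions into 4 nearly equal groups. That multidplyr relies on this same function internally is an assumption here, not something stated above; the variable names are made up for this illustration.

```r
library(parallel)

x <- 1:10
n_workers <- 4  # same cluster size as in the example above

# Split the positions 1..10 into n_workers contiguous, nearly equal groups
idx <- splitIndices(length(x), n_workers)

# Take the matching slice of x for each group
chunks <- lapply(idx, function(i) x[i])

lengths(chunks)  # 3 2 2 3
```

Because the groups are contiguous, concatenating the chunks in worker order (unlist(chunks)) gives back the original vector.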