Distributed R for Big Data


Our framework extends R with new language extensions and a runtime to manage distributed execution. It efficiently executes data analyses that are naturally expressed as matrix algorithms. Users write their programs by manipulating array partitions in parallel. Even algorithms which have data dependences can be easily expressed.

The runtime uses multiple techniques to ensure efficient execution. It caches remote data, manages computation and data movement using a scheduler, reduces the load imbalance caused by sparse datasets, and handles data dependences.

Example: PageRank

PageRank represents the relative importance of pages in a Web graph. The code below depicts an implementation of PageRank in our framework. Keywords in bold are new R extensions provided.

   #M: sparse adjacency matrix, p: dense vector
1 : M<- darray(dim=c(N,N),blocks=c(s,N), sparse=T)
2 : p<- darray(dim=c(N,1),blocks=c(s,1), sparse=F)
3 : ...
   #Distributed matrix operations
4 : k<-numsplits(M)
5 : repeat{
6 :  foreach(i, 1:k, function(pgr=splits(p,i),
        m=splits(M,i), x=splits(xold), z=splits(Z,i)) {
7 :     pgr<-(m%*%x)+ z
8 :     update(pgr)
9 :  })
10:  if(norm(p-xold)>1e-9) {break}
11:  xold <- p
12: }
PageRank illustration