Presto

Distributed R for big data



Design

Presto adds language extensions and a runtime to R to manage distributed execution. It efficiently executes data analyses that are naturally expressed as matrix algorithms. Users write programs that manipulate array partitions in parallel. Even algorithms with data dependences can be expressed easily in Presto.
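The partition-oriented style described above can be illustrated without Presto itself. The following is a minimal base R sketch, not Presto code: ordinary matrices stand in for distributed arrays, `lapply()` stands in for a parallel loop over partitions, and the helper names `row_blocks` and `blockwise_matvec` are hypothetical.

```r
# Base R sketch only: lapply() stands in for a parallel loop over partitions,
# and ordinary matrices stand in for distributed arrays.

# Split row indices 1..n into consecutive blocks of at most s rows.
row_blocks <- function(n, s) {
  unname(split(seq_len(n), ceiling(seq_len(n) / s)))
}

# Multiply M by x one row block at a time, then stack the partial results.
# Each block needs only its own rows of M plus the whole of x, so the
# blocks could be processed independently.
blockwise_matvec <- function(M, x, s) {
  blocks <- row_blocks(nrow(M), s)
  pieces <- lapply(blocks, function(rows) M[rows, , drop = FALSE] %*% x)
  do.call(rbind, pieces)
}

M <- matrix(1:12, nrow = 4)          # 4 x 3 example matrix
x <- matrix(c(1, 0, 2), ncol = 1)    # 3 x 1 vector
result <- blockwise_matvec(M, x, s = 3)
```

Because each row block depends only on its own slice of the matrix and the shared vector, the block-wise result matches the full product `M %*% x`.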

The Presto runtime uses multiple techniques to ensure efficient execution. It caches remote data, manages computation and data movement using a scheduler, reduces the load imbalance caused by sparse datasets, and handles data dependences.

Example: PageRank

PageRank measures the relative importance of pages in a Web graph. The code below implements PageRank in Presto. It uses the R extensions that Presto provides, such as darray, splits, foreach, and update.

    # M: sparse adjacency matrix, p: dense vector
    M <- darray(dim=c(N,N), blocks=c(s,N), sparse=TRUE)
    p <- darray(dim=c(N,1), blocks=c(s,1), sparse=FALSE)
    ...
    # Distributed matrix operations
    k <- numsplits(M)
    repeat {
      # One task per row partition: partition i of p, M, and Z, plus xold
      foreach(i, 1:k, function(pgr=splits(p,i), m=splits(M,i),
                               x=splits(xold), z=splits(Z,i)) {
        pgr <- (m %*% x) + z
        update(pgr)  # publish the new values of this partition of p
      })
      # Stop once p changes by less than the tolerance
      if (norm(p - xold) < 1e-9) { break }
      xold <- p
    }

Figure: PageRank illustration
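For readers without access to the Presto runtime, the same iteration can be tried on a single machine. The following is a rough base R sketch, not the Presto implementation: the three-page graph, the damping factor `d`, and the way `M` and `Z` are built from the link matrix `A` are all assumptions made for illustration, since the source elides how they are initialized.

```r
# Base R sketch only; the graph, damping factor, and tolerance are assumptions.
# A[i, j] = 1 when page j links to page i.
A <- matrix(c(0, 0, 1,
              1, 0, 0,
              1, 1, 0), nrow = 3, byrow = TRUE)

N <- nrow(A)
d <- 0.85                                  # assumed damping factor

# Scale each column by its out-degree, then apply the damping factor.
M <- d * sweep(A, 2, colSums(A), "/")
Z <- matrix((1 - d) / N, nrow = N, ncol = 1)   # uniform teleport term
p <- matrix(1 / N, nrow = N, ncol = 1)         # start from a uniform rank

repeat {
  xold <- p
  p <- M %*% xold + Z                      # same update as pgr <- (m %*% x) + z
  # Stop once the rank vector changes by less than the tolerance.
  if (norm(p - xold) < 1e-9) break
}
```

In this sketch the whole vector is updated in one step. In the partitioned version, each row block of `p` is computed from the matching rows of `M` and `Z` together with the full previous vector.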

Publications

Presentations