Distributed R for Big Data
Design
Our framework extends R with new language extensions and a runtime to manage distributed execution. It efficiently executes data analyses that are naturally expressed as matrix algorithms. Users write their programs by manipulating array partitions in parallel. Even algorithms which have data dependences can be easily expressed.
The runtime uses multiple techniques to ensure efficient execution. It caches remote data, manages computation and data movement using a scheduler, reduces the load imbalance caused by sparse datasets, and handles data dependences.
Example: PageRank
PageRank represents the relative importance of pages in a Web graph. The code below depicts an implementation of PageRank in our framework. Keywords in bold are new R extensions provided.
#M: sparse adjacency matrix, p: dense vector
|
|
Publications
- Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices. Shivaram Venkataraman, Erik Bodzsar, Indrajit Roy, Alvin AuYoung, Rob Schreiber. Eurosys 2013, Prague, Czech Republic.
- Using R for Iterative and Incremental Processing. Shivaram Venkataraman, Indrajit Roy, Alvin AuYoung, Rob Schreiber. HotCloud 2012, Boston, USA.
- HotCloud, June 2012, Boston.
- From Data to Knowledge, May 2012, Berkeley.