Distributed R for Big Data

Arrays for big data

Distributed R is a system for large-scale machine learning and graph processing. It enables and accelerates complex, big-data analysis.

Starting from the open-source R language and runtime, it adds reliable distributed processing, efficient computation over sparse datasets, and incremental processing.

We have written a variety of parallel algorithms in this framework, ranging from clustering, to shortest-path and PageRank computations on graphs, to Smith-Waterman sequence alignment.

How does it work?

The framework exposes a single simple primitive: the distributed array, which partitions data across the machines of a cluster. Arrays act as one unifying abstraction that efficiently expresses both machine learning algorithms, which primarily use matrix operations, and graph algorithms, which manipulate the graph's adjacency matrix.
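As a rough illustration of the idea, a program creates a partitioned array and then runs a function in parallel on each partition. The sketch below assumes the API of the open-source distributedR package (`darray`, `splits`, `foreach`, `update`); exact names and signatures may differ across releases.

```r
# Hedged sketch, not a definitive program: assumes the distributedR
# package API (darray, splits, foreach, update, npartitions).
library(distributedR)
distributedR_start()            # launch worker processes on the cluster

# A 9x9 dense matrix, split into 3x3 blocks stored across workers
A <- darray(dim = c(9, 9), blocks = c(3, 3), data = 1)

# Apply a function to every partition in parallel; update() publishes
# the modified partition back into the distributed array
foreach(i, 1:npartitions(A), function(a = splits(A, i)) {
  a <- a * 2
  update(a)
})

distributedR_shutdown()
```

Because the same partitioned-array structure holds either a feature matrix or a graph's adjacency matrix, one primitive serves both classes of algorithms.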

Why use arrays?

Our framework complements big-data analysis systems such as Hadoop MapReduce. Unlike those systems, it efficiently executes complex analyses such as machine learning, graph processing, and advanced statistics. For example, our system is more than 20 times faster than Hadoop MapReduce on clustering, PageRank, and other workloads.

By extending R, we let programmers leverage optimized math libraries and reuse the many freely available R analytics packages. Our framework is also a natural fit for analytic databases such as Vertica, extending SQL with advanced statistical analysis.