
Distributed R for big data
Presto is a distributed system for large-scale machine learning and graph processing. It enables and accelerates complex big-data analysis.
Starting from the open-source R language and system, Presto adds reliable distributed processing, efficient computation over sparse datasets, and incremental processing.
We have written a variety of parallel algorithms in Presto, ranging from clustering to shortest path and PageRank for graphs to Smith-Waterman sequence alignment.
How does Presto work?
Presto exposes a simple primitive, the distributed array, that stores data across a cluster. Arrays act as the single abstraction to efficiently express both machine learning algorithms, which primarily use matrix operations, and graph algorithms, which manipulate the graph’s adjacency matrix.
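To make the idea of a graph algorithm as matrix operations concrete, here is a minimal sketch of PageRank written as repeated matrix-vector multiplication over row partitions of a transition matrix built from the adjacency matrix. It is plain Python and does not use Presto's API. The function names (partition_rows, block_matvec, pagerank), the row-block partitioning scheme, the damping factor, and the dense list-of-lists storage are all assumptions made for illustration. A real system would store the matrix sparsely and run each block on a different machine.

```python
# Illustrative sketch only: plain Python, not Presto's API.
# All names here are hypothetical and chosen for this example.

def partition_rows(matrix, num_parts):
    """Split a matrix (list of rows) into contiguous row blocks."""
    n = len(matrix)
    size = (n + num_parts - 1) // num_parts  # ceiling division
    return [matrix[i:i + size] for i in range(0, n, size)]

def block_matvec(block, vec):
    """Multiply one row block by a vector.
    In a cluster, each worker would do this for the block it holds."""
    return [sum(a * x for a, x in zip(row, vec)) for row in block]

def pagerank(adjacency, damping=0.85, iterations=50, num_parts=2):
    """Power iteration; adjacency[j][i] == 1 means an edge from j to i."""
    n = len(adjacency)
    out_degree = [sum(row) for row in adjacency]
    # transition[i][j] is the probability of moving from node j to node i.
    # Nodes without outgoing edges spread their rank uniformly.
    transition = [
        [adjacency[j][i] / out_degree[j] if out_degree[j] else 1.0 / n
         for j in range(n)]
        for i in range(n)
    ]
    blocks = partition_rows(transition, num_parts)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        # Each block yields its slice of the product; slices are concatenated.
        product = [v for block in blocks for v in block_matvec(block, rank)]
        rank = [(1 - damping) / n + damping * v for v in product]
    return rank

# Small example graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0, 3 -> 2
edges = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
]
ranks = pagerank(edges)
```

In this sketch the whole algorithm is one matrix-vector product per iteration, and the only per-partition work is block_matvec. That is the property that lets a single array abstraction serve both machine learning and graph workloads.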
Why use Presto?
Presto complements big data analysis systems like Hadoop MapReduce. Unlike these systems, Presto efficiently executes complex algorithms such as machine learning, graph processing, and advanced statistical analysis. For example, Presto is more than 20 times faster than Hadoop MapReduce for clustering, PageRank, and other analyses.
By extending R, Presto allows programmers to leverage optimized math libraries and reuse the many freely available R analytics packages. Presto is also a natural fit for analytic databases such as Vertica, where it extends SQL functionality with advanced statistical analysis.