Technical Reports
HPL-2011-198R1
Presto: Complex and Continuous Analytics with Distributed Arrays
Venkataraman, Shivaram; Roy, Indrajit; Schreiber, Robert S.; AuYoung, Alvin
HP Laboratories
Keyword(s): R; Distributed execution engine; Continuous processing
Abstract: Presto is a distributed programming model for continuously analyzing data. Continuous analytics, in which applications constantly refine their predictive models as data arrives, is useful in domains such as user recommendation systems, link analysis, and financial modeling. Unlike batch processing, continuous analytics requires partial re-execution, low-latency turnaround, and transitive propagation of changes to dependent tasks. Current distributed systems such as MapReduce and its enhancements lack general-purpose support for continuous analytics. Presto extends the freely available R software with language primitives for scalability, distributed parallelism, and continuous analytics. Two Presto constructs, darray and onchange, express the parts of an algorithm that should be executed when data items change. Even though the data is dynamic, Presto ensures that algorithms see a consistent snapshot of the data. Our experiments on four applications show that Presto is an order of magnitude faster than Hadoop while providing the added feature of continuous processing.
15 Pages
External Posting Date: February 21, 2012. Approved for External Publication
Internal Posting Date: February 21, 2012