
As data sizes grow, so does the need to give R users tools for efficiently analyzing large datasets. The goals of this workshop are to standardize the API for distributed computing in R, to learn from attendees' experiences using R for large-scale analysis, and to collaborate in open source. We want to encourage R contributors (including students) to implement parallel versions of their favorite algorithms. By standardizing the infrastructure for distributed computing, we can increase the availability of parallel algorithms in R and ensure that R remains an appealing choice even for analysis of very large data.
Today, R has many packages that provide parallelism constructs, but there has been little effort to standardize their APIs. Each package has its own syntax, its own parallelism techniques, and its own set of supported operating systems. The unfortunate consequence is that a contributor who implements an algorithm with one package becomes tied to that package's implementation and maintainer, which hinders both contributions and the portability of code. We hope to address many of these challenges in the workshop and to produce a concrete plan for including distributed computing in future R releases.
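To make the fragmentation concrete, here is a minimal sketch of the same element-wise computation written against two popular backends, the parallel package and foreach with doParallel; the packages and worker counts are chosen purely for illustration:

```r
## The same element-wise computation written against two different
## parallel backends; neither idiom carries over to the other package.

## Idiom 1: a snow-style cluster from the 'parallel' package
library(parallel)
cl <- makeCluster(2)                          # launch two worker processes
res1 <- parLapply(cl, 1:100, function(x) x^2)
stopCluster(cl)

## Idiom 2: foreach/%dopar% with a registered doParallel backend
library(foreach)
library(doParallel)
registerDoParallel(cores = 2)                 # register a parallel backend
res2 <- foreach(x = 1:100) %dopar% x^2
stopImplicitCluster()

identical(res1, res2)                         # same result, incompatible code
```

A standard API would let contributors write such a computation once and run it on whichever backend (multicore, MPI, Hadoop, and so on) is available.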
Date: Jan 26-27, 2015
Organizers: Indrajit Roy (Principal Researcher, HP) and Michael Lawrence (Genentech, R-core member)
Location: HP Labs, 1501 Page Mill Road, Building 3, Palo Alto, CA
Travel Guide: Helpful tips on accommodation and travel
Attendees:
- Brian Lewis (Paradigm4, author of SciDB-R package)
- Dirk Eddelbuettel (Debian, author of Rcpp package)
- Duncan Temple Lang (UC Davis, R-core)
- Elliot Waingold (Microsoft, Azure ML)
- Gagan Bansal (Microsoft, Azure ML)
- George Ostrouchov (Oak Ridge National Laboratory, author of pbdR packages)
- Joseph Rickert (Revolution Analytics, Data Scientist)
- Junji Nakano (The Institute of Statistical Mathematics, Tokyo, author of Rhpc package)
- Louis Bajuk-Yorgan (TIBCO, senior product manager)
- Luke Tierney (The University of Iowa, R-core member)
- Mario Inchiosa (Revolution Analytics, Chief Scientist)
- Martin Morgan (Fred Hutchinson Cancer Research Center, R-core member)
- Michael Kane (Yale University, author of bigmemory package)
- Michael Sannella (TIBCO, architect of TERR)
- Robert Gentleman (Genentech, R-core)
- Ryan Hafen (Purdue University, author of Tessera package)
- Saptarshi Guha (Mozilla Corporation, author of RHIPE package)
- Simon Urbanek (AT&T Research Labs, R-core member)

Day 1 (January 26)

Time | Talk | Presenter |
---|---|---|
9:00 – 9:15 am | Light breakfast | |
9:15 – 9:25 am | Welcome and announcements | |
9:25 – 9:45 am | Context for the workshop | Indrajit Roy, Michael Lawrence |
Session 1: Experience with MPI-like backends | | |
9:45 – 10:10 am | Some notes on parallel computing in R | Luke Tierney |
10:10 – 10:35 am | Rhpc: An R package for high-performance computing | Junji Nakano |
10:35 – 10:50 am | Break | |
10:50 – 11:15 am | Approaches to standardizing parallel evaluation in Bioconductor | Martin Morgan |
11:15 – 11:40 am | pbdR: A Sustainable Path for Scalable Statistical Computing | George Ostrouchov |
11:40 am – 12:00 pm | Recap of session 1 (Q&A): discuss advantages and limitations of an MPI-like interface | |
12:00 – 1:00 pm | Lunch | |
1:00 – 1:45 pm | Short tour of HP Labs | |
Session 2: Beyond embarrassingly parallel computations | | |
1:45 – 2:10 pm | Need for distributed data structures and chunk-based computation | Indrajit Roy |
2:10 – 2:35 pm | iotools and ROctopus - two approaches to using R at scale | Simon Urbanek |
2:35 – 3:00 pm | A Friendly Critique of SparkR | Michael Sannella |
3:00 – 3:25 pm | RHIPE: Experiences from Analyzing Large Data using R, Hadoop and MapReduce | Saptarshi Guha |
3:25 – 3:45 pm | Break | |
3:45 – 4:00 pm | Recap of session 2 (Q&A): discuss requirements for a distributed API in R | |
4:00 – 5:00 pm | Brainstorming on API requirements | |

Day 2 (January 27)

Time | Talk | Presenter |
---|---|---|
9:00 – 9:15 am | Light breakfast | |
Session 3: Embrace disks, thread parallelism, and more | | |
9:15 – 9:40 am | Parallel External Memory Algorithms in RevoPemaR and RevoScaleR | Mario Inchiosa |
9:40 – 10:05 am | Some notes about Rcpp and RcppParallel | Dirk Eddelbuettel |
10:05 – 10:30 am | Divide and Recombine - A Distributed Data Analysis Paradigm | Ryan Hafen |
10:30 – 10:55 am | Bigmemory and the Upcoming Storage Singularity | Michael Kane |
10:55 – 11:00 am | Break | |
11:00 – 11:30 am | Brainstorming on API requirements | |
11:30 am – 12:00 pm | Decide on contributors, timeline, and next steps | |
12:00 – 1:00 pm | Lunch and wrap up! | |