Workshop on Distributed Computing in R

As data sizes increase, so does the need to provide R users with tools for efficiently analyzing large datasets. The goals of this workshop are to standardize the API for exposing distributed computing in R, to learn from attendees' experiences using R for large-scale analysis, and to collaborate in open source. We want to encourage R contributors (including students) to implement parallel versions of their favorite algorithms. By standardizing the infrastructure for distributed computing, we can increase the availability of parallel algorithms in R and ensure that R remains an appealing choice even for analysis of very large datasets.

Today, R has many packages that provide parallelism constructs, but there has been little effort to standardize their APIs. Each package has its own syntax, its own parallelism techniques, and its own set of supported operating systems. This has the unfortunate consequence that a contributor who implements algorithms on top of one package becomes tied to that package's implementation and its maintainer, which hinders contributions and code portability. We hope to address many of these challenges in the workshop and to produce a concrete plan for including distributed computing in future R releases.
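
To make the fragmentation concrete, the short sketch below runs the same embarrassingly parallel task against three commonly used approaches: the parallel package's fork and socket backends, and the foreach/doParallel combination. The package choices and core counts are only illustrative assumptions for this example, not a proposal from the workshop.

    ## Same task, three APIs: square the elements of a list in parallel.
    library(parallel)
    xs <- as.list(1:8)

    ## 1. Fork-based parallelism (parallel::mclapply, Unix-like systems only).
    res1 <- mclapply(xs, function(x) x^2, mc.cores = 2)

    ## 2. Socket clusters from the same package, with a different calling convention.
    cl <- makeCluster(2)
    res2 <- parLapply(cl, xs, function(x) x^2)
    stopCluster(cl)

    ## 3. The foreach/doParallel idiom, with yet another syntax.
    library(foreach)
    library(doParallel)
    registerDoParallel(cores = 2)
    res3 <- foreach(x = xs) %dopar% x^2
    stopImplicitCluster()

Each version computes the same result, but none of the code is portable across backends; this divergence is exactly what a standardized API would remove.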

Date: Jan 26-27, 2015

Organizers: Indrajit Roy (Principal Researcher, HP) and Michael Lawrence (Genentech, R-core member)

Location: HP Labs, 1501 Page Mill Road, Building 3, Palo Alto, CA

Travel Guide: Helpful tips on accommodation and travel

Participants:

  • Brian Lewis (Paradigm4, author of SciDB-R package)
  • Dirk Eddelbuettel (Debian, author of Rcpp package)
  • Duncan Temple Lang (UC Davis, R-core)
  • Elliot Waingold (Microsoft, Azure ML)
  • Gagan Bansal (Microsoft, Azure ML)
  • George Ostrouchov (Oak Ridge National Laboratory, author of pbdR packages)
  • Joseph Rickert (Revolution Analytics, Data Scientist)
  • Junji Nakano (The Institute of Statistical Mathematics, Tokyo, author of Rhpc package)
  • Louis Bajuk-Yorgan (TIBCO, senior product manager)
  • Luke Tierney (The University of Iowa, R-core member)
  • Mario Inchiosa (Revolution Analytics, Chief Scientist)
  • Martin Morgan (Fred Hutchinson Cancer Research Center, R-core member)
  • Michael Kane (Yale University, author of bigmemory package)
  • Michael Sannella (TIBCO, architect of TERR)
  • Robert Gentleman (Genentech, R-core)
  • Ryan Hafen (Purdue University, author of Tessera package)
  • Saptarshi Guha (Mozilla Corporation, author of RHIPE package)
  • Simon Urbanek (AT&T Research Labs, R-core member)

Agenda:

Day 1: Monday, Jan 26, 2015

9:00 - 9:15 am    Light breakfast
9:15 - 9:25 am    Welcome and announcements
9:25 - 9:45 am    Context for the workshop (Indrajit Roy, Michael Lawrence)

Session 1: Experience with MPI-like backends
9:45 - 10:10 am   Some notes on parallel computing in R (Luke Tierney)
10:10 - 10:35 am  Rhpc: An R package for high-performance computing (Junji Nakano)
10:35 - 10:50 am  Break
10:50 - 11:15 am  Approaches to standardizing parallel evaluation in Bioconductor (Martin Morgan)
11:15 - 11:40 am  pbdR: A Sustainable Path for Scalable Statistical Computing (George Ostrouchov)
11:40 am - 12:00 pm  Recap of Session 1 (Q&A): discuss the advantages and limitations of MPI-like interfaces
12:00 - 1:00 pm   Lunch
1:00 - 1:45 pm    Short tour of HP Labs

Session 2: Beyond embarrassingly parallel computations
1:45 - 2:10 pm    Need for distributed data structures and chunk-based computation (Indrajit Roy)
2:10 - 2:35 pm    iotools and ROctopus - two approaches to using R at scale (Simon Urbanek)
2:35 - 3:00 pm    A Friendly Critique of SparkR (Michael Sannella)
3:00 - 3:25 pm    RHIPE: Experiences from Analyzing Large Data using R, Hadoop and MapReduce (Saptarshi Guha)
3:25 - 3:45 pm    Break
3:45 - 4:00 pm    Recap of Session 2 (Q&A): discuss requirements for a distributed API in R
4:00 - 5:00 pm    Brainstorming on API requirements

Day 2: Tuesday, Jan 27, 2015

9:00 - 9:15 am    Light breakfast

Session 3: Embrace disks, thread parallelism, and more
9:15 - 9:40 am    Parallel External Memory Algorithms in RevoPemaR and RevoScaleR (Mario Inchiosa)
9:40 - 10:05 am   Some notes about Rcpp and RcppParallel (Dirk Eddelbuettel)
10:05 - 10:30 am  Divide and Recombine - A Distributed Data Analysis Paradigm (Ryan Hafen)
10:30 - 10:55 am  Bigmemory and the Upcoming Storage Singularity (Michael Kane)
10:55 - 11:00 am  Break
11:00 - 11:30 am  Brainstorming on API requirements
11:30 am - 12:00 pm  Decide contributors, timeline, and next steps
12:00 - 1:00 pm   Lunch and wrap-up!