bigmemory and the Upcoming Storage Singularity

Michael J. Kane

Yale University and Phronesis LLC

What are the goals of this talk?

  1. Give a quick overview of bigmemory
  2. Point out some of the things I think we did right
  3. Contextualize the work that was done

What is bigmemory?

  • A package for creating, storing, accessing, and manipulating dense (and semi-dense) matrices that are larger than available RAM.

  • It's been around since 2008 - I wrote it with Jay Emerson

  • Part of a suite of packages for processing matrices out-of-core (biganalytics, bigtabulate, bigalgebra, synchronicity)

  • Currently maintained by Pete Haverty and me

What does it look like?

> library(bigmemory)
> x = big.matrix(3, 3, type='integer', init=123,
+                backingfile="example.bin",
+                descriptorfile="example.desc",
+                dimnames=list(c('a','b','c'),
+                              c('d', 'e', 'f')))
> x[,]
    d   e   f
a 123 123 123
b 123 123 123
c 123 123 123
> rm(x)
> y = attach.big.matrix("example.desc")
> y[,]
    d   e   f
a 123 123 123
b 123 123 123
c 123 123 123

How does it work?

  • mmap - a POSIX-compliant Unix system call that maps files or devices into memory

  • All data movement (disk to RAM to cache) is handled transparently by the operating system.

  • The binary representation of the matrix is stored directly on disk.

  • The descriptor file holds meta-information (number of rows, number of columns, etc.).

  • Works with any filesystem supporting mmap (including distributed ones).
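
The split between a raw binary file and a descriptor file can be sketched in base R. This is an illustration only, not bigmemory's actual on-disk format or implementation: the file names, the descriptor fields, and the helper read_element() are all assumptions made for the example, and it uses an explicit seek where bigmemory relies on mmap. It shows how the dimensions in the descriptor are enough to locate one element in a column-major file without loading the rest.

```r
# Hypothetical file names; this is NOT bigmemory's actual file format.
bin_path  <- tempfile(fileext = ".bin")
desc_path <- tempfile(fileext = ".desc")

m <- matrix(1:12, nrow = 3, ncol = 4)

# Store the raw column-major values in one file and the metadata in another.
writeBin(as.vector(m), bin_path, size = 4L)
dput(list(nrow = nrow(m), ncol = ncol(m), type = "integer", bytes = 4L),
     file = desc_path)

# Read a single element by computing its byte offset from the descriptor,
# rather than reading the whole matrix into memory.
read_element <- function(bin_path, desc_path, i, j) {
  desc <- dget(desc_path)
  stopifnot(i >= 1, i <= desc$nrow, j >= 1, j <= desc$ncol)
  offset <- ((j - 1) * desc$nrow + (i - 1)) * desc$bytes
  con <- file(bin_path, open = "rb")
  on.exit(close(con))
  seek(con, where = offset, origin = "start")
  readBin(con, what = "integer", n = 1L, size = desc$bytes)
}

value <- read_element(bin_path, desc_path, 2, 3)
```

With mmap the operating system does the equivalent offset arithmetic and paging, so the package can index the file as if it were ordinary memory.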

Who uses it?

Who builds on top of it?


CRAN

Reverse depends:	bigalgebra, biganalytics, bigpca, bigrf, bigtabulate
Reverse imports:	Rdsm
Reverse linking to:	bigalgebra, biganalytics, bigrf, bigtabulate
Reverse suggests:	bio3d, matpow, mlDNA, nat.nblast, NMF, PopGenome, rsgcc
Reverse enhances:	bigmemory.sri

Bioconductor

Reverse depends:	bigmemoryExtras, ChipXpressData, Biobase and BiocGenerics (through bigmemoryExtras)

Why are people using it?

  1. Only import data once

  2. It's generally faster than swapping

  3. It's compatible with BLAS and LAPACK libraries

Why has it worked?

We co-opted R's grammar.

We leveraged system software.

We made use of modern hardware advances.

bigmemory's interesting characteristic

  • Users can create data structures that don't need to distinguish between cache, RAM, and disk (SSD)
  • Data structures are loaded "instantly"
  • Hardware is fast enough to support an interactive experience

Why don't we do this with native R data structures?

What are the benefits of doing this in R?

  • Transparent support for much larger data structures
  • Natural location of the data is disk when not being used
  • Data structures (the binary representation) could be stored persistently and would not need to be explicitly imported


...well what about character types?

  • Character types are still a problem
  • However, there are other character vector representations that would work well

Why can't we (R users) do this now?

  • We can't override the memory allocation of native objects
  • We can't override how objects are copied
  • We can't give R new objects to be marshaled
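
The copying behavior behind the second bullet can be seen in base R. The sketch below is illustrative only and the variable names are made up for it: native vectors are copied when modified, while environments have reference semantics, which is closer to how a handle onto a shared, memory-mapped object behaves.

```r
# Native vectors: modifying y after assignment leaves x untouched,
# because R duplicates the vector on modification.
x <- c(1, 2, 3)
y <- x
y[1] <- 100
x_unchanged <- identical(x, c(1, 2, 3))

# Environments: e1 and e2 name the same underlying object,
# so a change made through e2 is visible through e1.
e1 <- new.env()
e1$values <- c(1, 2, 3)
e2 <- e1
e2$values[1] <- 100
shared_change <- identical(e1$values, c(100, 2, 3))
```

Users cannot tell R to treat a native vector the second way, which is why duplication is one of the sticking points.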

What's being done now?

  • Memory allocation and object duplication
    • Simon has added some hooks for overriding R's memory allocator
    • flexmem (Bryan Lewis and myself)
  • Importing
    • Bryan Lewis's doppelganger trick

Conclusions and Open Questions

bigmemory's (and ff's) users show that there is demand for memory-mapped objects

We've also shown that they can be performant

How can they be better integrated?

More importantly, should they be better integrated?