bigmemory and the Upcoming Storage Singularity

Michael J. Kane

Yale University and Phronesis LLC

What are the goals of this talk?

1. Give a quick overview of bigmemory
2. Point out some of the things I think we did right
3. Contextualize the work that was done

What is bigmemory?

• A package for creating, storing, accessing, and manipulating dense (and semi-dense) matrices that are larger than available RAM.

• It's been around since 2008; I wrote it with Jay Emerson

• Part of a suite of packages for processing matrices out-of-core (biganalytics, bigtabulate, bigalgebra, synchronicity)

• Currently being maintained by myself and Pete Haverty
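
The packages in the suite compose; a minimal sketch (assuming bigmemory and biganalytics are installed) of computing summary statistics on a file-backed matrix:

```r
# Hedged sketch, assuming bigmemory and biganalytics are installed.
library(bigmemory)
library(biganalytics)

# A small file-backed matrix stands in for one too large for RAM.
x <- big.matrix(1000, 3, type = "double", init = 1,
                backingfile = "demo.bin", descriptorfile = "demo.desc")

colmean(x)  # column means computed without copying the matrix into R
```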

What does it look like?

> library(bigmemory)
> x = big.matrix(3, 3, type='integer', init=123,
+                backingfile="example.bin",
+                descriptorfile="example.desc",
+                dimnames=list(c('a','b','c'),
+                              c('d', 'e', 'f')))
> x[,]
    d   e   f
a 123 123 123
b 123 123 123
c 123 123 123
> rm(x)
> y = attach.big.matrix("example.desc")
> y[,]
    d   e   f
a 123 123 123
b 123 123 123
c 123 123 123


How does it work?

• mmap - a POSIX-compliant Unix system call that maps files or devices into memory

• All data movement (disk to RAM to cache) is handled transparently by the operating system.

• The binary representation of the matrix is stored directly on disk.

• The descriptor file holds meta-information (number of rows, number of columns, etc.).

• Works with any filesystem supporting mmap (including distributed ones).
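
Because the backing file is simply the matrix's raw binary representation, it can be inspected directly; a sketch reusing the 3x3 integer example.bin created earlier (assumes 4-byte integers stored column-major):

```r
# Read the backing file produced by the earlier big.matrix() call.
# Assumes example.bin exists from the previous slide.
con <- file("example.bin", "rb")
vals <- readBin(con, what = "integer", n = 9, size = 4)
close(con)

matrix(vals, nrow = 3)  # reproduces x[,]: all entries 123
```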

Who builds on top of it?


CRAN

Reverse depends:    bigalgebra, biganalytics, bigpca, bigrf, bigtabulate
Reverse imports:    Rdsm
Reverse linking to: bigalgebra, biganalytics, bigrf, bigtabulate
Reverse suggests:   bio3d, matpow, mlDNA, nat.nblast, NMF, PopGenome, rsgcc
Reverse enhances:   bigmemory.sri

Bioconductor

Reverse depends: bigmemoryExtras, ChipXpressData, Biobase and BiocGenerics (through bigmemoryExtras)

Why are people using it?

1. Only import data once

2. It's generally faster than swapping

3. It's compatible with BLAS and LAPACK libraries
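
Point 3 is what bigalgebra builds on; a hedged sketch (assuming bigalgebra is installed) of multiplying two big.matrix objects through BLAS:

```r
# Hedged sketch, assuming bigmemory and bigalgebra are installed.
library(bigmemory)
library(bigalgebra)

a <- big.matrix(3, 3, type = "double", init = 2)
b <- big.matrix(3, 3, type = "double", init = 3)

# %*% on big.matrix objects dispatches to BLAS (dgemm), operating
# directly on the mapped data rather than on an R-matrix copy.
c <- a %*% b
c[,]  # a 3x3 matrix with every entry 18
```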

bigmemory's interesting characteristic

• Users can create data structures that don't need to distinguish between cache, RAM, and disk (SSD)
• Data structures are loaded "instantly"
• Hardware is fast enough to support an interactive experience

What are the benefits of doing this in R?

• Transparent support for much larger data structures
• The natural location of the data is disk when it is not in use
• Data structures (the binary representation) could be stored persistently and would not need to be explicitly imported

• Character types are still a problem
• However, there are other character vector representations that would work well

Why can't we (R users) do this now?

• We can't override the memory allocation of native objects
• We can't override how objects are copied
• We can't give R new objects to be marshaled

What's being done now?

• Memory allocation and object duplication
• Simon has added some hooks for overriding R's memory allocator
• flexmem (Bryan Lewis and myself)
• Importing
• Bryan Lewis's doppelganger trick

Conclusions and Open Questions

bigmemory's (and ff's) users show that there is a demand for memory-mapped objects

We've also shown that they can be performant

How can they be better integrated?

More importantly, should they be better integrated?