We present a regular expression (regex) engine that runs on a GPU, using the highly parallel architecture of GPUs to accelerate searches. We believe that previous attempts to use the GPU for this task did not tap its full potential. Regexes present imbalanced compute workloads that are very different from common GPU applications (CFD, computer graphics, and image processing); hence they can teach general lessons on how to use GPUs for more irregular workloads. Our initial results show a 30x improvement in running time relative to single-threaded commercial regex engines.
Regular expressions (RE)

- One of the most commonly used mini-languages in programming.
- Standard syntax.
- Platform independent.
- Commonly used in text-crunching applications:
  - Security applications.
  - Network monitoring.
  - Database queries (SQL, Google's BigTable, ...).
- Common scenario: many strings & REs.
- Pitfall: high run-time variance.
- Throughput (and therefore run time) is critical.
- Latency is not always critical.
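As a concrete illustration of the standard, platform-independent syntax, here is a minimal POSIX regex match in C (the helper name and the patterns are made-up examples, not part of the library described below):

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Returns 1 if `text` contains a match for the POSIX extended regular
   expression `pattern`, 0 if not (or if the pattern fails to compile). */
static int re_matches(const char *pattern, const char *text) {
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;                           /* bad pattern: treat as no match */
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For example, `re_matches("[0-9]+", "port 8080")` finds the digit run and returns 1.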
Video

David Lehavi's presentation at GTC 2012: Nvidia GPU Technology Conference, May 14-17, 2012, San Jose, California.
GPU-based library features (requires Fermi/Kepler Nvidia hardware)

Regex GPU is a library for high-throughput processing of regular expressions on CUDA-enabled GPUs:
- POSIX wildcards
- ranges
- sub-matches
- giga-char strings
- greedy/non-greedy operators
- Boyer-Moore searches
- support for multiple cards
Performance boost:

| | per volume | per watt on Fermi | expected per watt on Kepler |
|---|---|---|---|
| Regex search | 4 | 1 | 3 |
| Regex match | 8 | 2 | 6 |
| Boyer-Moore search | 50 | 10 | 30 |
- Analyzing the boundary between the virtual machine (in the computational sense) and the code it runs:

Universal Turing machine:

    current = states[current, tape[place]]
    place += increment[current]
Silly machine, smart byte code; slow, no branches. Can be parallelized assuming cache and shared memory. SIMD/GPU friendly.
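To make the "silly machine, smart byte code" point concrete, here is a sketch in C of a table-driven automaton run by essentially that two-line dispatch loop, simplified so the head always moves right (increment fixed at +1, i.e. a DFA); the parity machine itself is a made-up example:

```c
#include <assert.h>

/* Example byte code: a 2-state transition table that accepts binary
   strings containing an even number of '1's. The machine is trivial
   (one table lookup per step, no branches); all the smarts live in
   the table -- which is what makes this loop SIMD/GPU friendly. */
static int even_ones(const char *tape, int len) {
    static const int states[2][2] = {
        {0, 1},   /* state 0: '0' stays, '1' flips to state 1 */
        {1, 0},   /* state 1: '0' stays, '1' flips back to state 0 */
    };
    int current = 0;
    for (int place = 0; place < len; place++)
        current = states[current][tape[place] - '0'];
    return current == 0;   /* state 0 = even number of '1's seen */
}
```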
Modern CPU:

Smart machine, silly byte code; fast, many (HW-implemented) branches. Not SIMD/GPU friendly.
- Any hardware has its own sweet spot.
- On GPUs arithmetic is cheap, branches are expensive.
- Use indicator variables instead of branches.

Instead of this (bad: serialized divergent branches):

    if (s1[n] == s2[m])
        a[n,m] += match;
    else
        a[n,m] += mismatch;

use an indicator (good: cheap arithmetic):

    i = (s1[n] == s2[m]);
    a[n,m] += mismatch + i * (match - mismatch);

- Sometimes better on CPUs too (no branch prediction needed), always better on GPUs and SIMD (e.g. SSE).
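A runnable C sketch of that substitution (the match/mismatch values are arbitrary example constants):

```c
#include <assert.h>

enum { MATCH = 2, MISMATCH = -1 };   /* illustrative scoring constants */

/* Branchy version: on a GPU, threads of one warp taking different
   sides of the `if` serialize. */
static int score_branchy(char c1, char c2) {
    if (c1 == c2)
        return MATCH;
    else
        return MISMATCH;
}

/* Branchless version: the comparison yields the indicator 0 or 1,
   which selects between the two constants with pure arithmetic. */
static int score_branchless(char c1, char c2) {
    int i = (c1 == c2);
    return MISMATCH + i * (MATCH - MISMATCH);
}
```

Both functions compute the same value for every input pair; only the branchless one keeps all SIMD lanes on one instruction stream.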
- Exponential build-up, no precedence in evaluation.
- Balancing map-reduce:

    while (has task) {
        map free threads
        while (work_load < max and not finished) {
            perform one iteration
            work_load++
        }
        reduce finished threads
    }
- Far better than plain map-reduce, due to the varying task lengths.
- Ideally, max is computed dynamically.
- Reduce is trivial for regexes.
- Rescheduling is done without quitting the kernel; all the data stays in shared memory.
- Memory access is not coalesced, but it hits shared memory/cache, so it doesn't matter.
- Similar ideas were explored in distributed systems, with different problems/solutions.
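The rescheduling loop above can be sketched as a host-side C simulation (slot count, chunk size, and task lengths here are invented for illustration; in the real engine the same structure runs inside one persistent GPU kernel):

```c
#include <assert.h>

#define SLOTS     4   /* resident "threads" in the pool (illustrative) */
#define MAX_ITERS 8   /* the `max` bound from the loop above (illustrative) */

/* Runs the map/iterate/reduce cycle over `ntasks` tasks whose required
   iteration counts (all > 0) are given in `lengths`; returns how many
   outer rounds it took to finish them all. */
static int run_pool(const int *lengths, int ntasks) {
    int slot_task[SLOTS], slot_left[SLOTS];
    int next = 0, done = 0, rounds = 0;
    for (int s = 0; s < SLOTS; s++)
        slot_task[s] = -1;                       /* -1 marks a free slot */

    while (done < ntasks) {
        rounds++;
        /* map: assign pending tasks to free slots */
        for (int s = 0; s < SLOTS; s++)
            if (slot_task[s] < 0 && next < ntasks) {
                slot_task[s] = next;
                slot_left[s] = lengths[next++];
            }
        /* iterate: each resident task advances at most MAX_ITERS steps */
        for (int s = 0; s < SLOTS; s++)
            if (slot_task[s] >= 0) {
                int step = slot_left[s] < MAX_ITERS ? slot_left[s] : MAX_ITERS;
                slot_left[s] -= step;
            }
        /* reduce: retire finished tasks, freeing their slots */
        for (int s = 0; s < SLOTS; s++)
            if (slot_task[s] >= 0 && slot_left[s] == 0) {
                slot_task[s] = -1;
                done++;
            }
    }
    return rounds;
}
```

Because short tasks retire early and their slots are refilled at once, a few very long tasks no longer stall the whole batch, which is the point of rebalancing inside the kernel rather than mapping whole tasks once.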
News & Publication

30x faster regular expressions on a GPU
Lehavi, David; Schein, Sagi (2012). GTC 2012: Nvidia GPU Technology Conference, May 14-17, 2012, San Jose, California. Conference presentation.