GPU Regex

We present a regular expression (regex) engine that runs on a GPU, exploiting the highly parallel architecture of GPUs to accelerate such searches. We believe that previous attempts to use the GPU for this task did not fully tap its potential. Regexes present imbalanced compute workloads that are very different from common GPU applications (CFD, CG, and image processing); hence, they can teach us general lessons on how to utilize GPUs for more general workloads. Our initial results show a 30x improvement in running time relative to single-threaded commercial regex engines.

  Regular expressions (RE)
  • One of the most widely used special-purpose languages.
    • Standard syntax.
    • Platform independent.
  • Commonly used for text crunching applications:
    • Security applications.
    • Network monitoring.
    • Database queries (SQL, Google's BigTable...).
  • Common scenario: many strings & REs.
    • Pitfall: high run-time variance.
    • Throughput (and therefore run time) is critical.
    • Latency is not always critical.
  Video

David Lehavi's presentation at the GTC 2012: Nvidia GPU Technology Conference, May 14-17, 2012, San Jose, California.

  GPU based library features (requires Fermi/Kepler Nvidia hardware)
Regex GPU is a library for high-throughput processing of regular expressions on CUDA-enabled GPUs.
  • POSIX wildcard
  • ranges
  • sub-matches
  • giga-char strings
  • greedy/non-greedy operators
  • Boyer-Moore searches
  • support for multiple cards
Performance boost      per volume   per watt on Fermi   expected per watt on Kepler
Regex Search           4            1                   3
Regex Match            8            2                   6
Boyer-Moore search     50           10                  30

  • analyzing the boundary between virtual machine (in the computational sense) and code:

    Universal Turing machine:
    
    current = states[current,tape[place]]
    place += increment[current]
    

    A silly machine running smart byte code: slow, with no branches. It can be parallelized given cache and shared memory. SIMD/GPU friendly.

    Modern CPU:

    A smart machine running silly byte code: fast, with many (HW-implemented) branches. Not SIMD/GPU friendly.

    • Any hardware has its own sweet spot.
    • On GPUs arithmetic is cheap, branches are expensive.
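    As a concrete illustration, the branch-free, table-driven "universal machine" loop above can be sketched as a tiny state machine in plain C. The even-parity automaton and all names here are illustrative only, not part of the library:

    ```c
    #include <assert.h>

    /* Table-driven automaton: recognizes binary strings with an even number
       of 1s. States: 0 = even, 1 = odd. The inner loop is the "universal
       machine" step above: one table lookup per symbol, no data-dependent
       branch on the symbol itself. */
    static int run_dfa(const int states[2][2], const char *tape) {
        int current = 0;                            /* start state */
        for (; *tape; ++tape)
            current = states[current][*tape - '0']; /* single lookup */
        return current;
    }

    int main(void) {
        /* states[state][symbol]: a 1 flips the parity, a 0 keeps it */
        const int states[2][2] = { {0, 1}, {1, 0} };
        assert(run_dfa(states, "1101") == 1);  /* three 1s -> odd  */
        assert(run_dfa(states, "1001") == 0);  /* two 1s   -> even */
        return 0;
    }
    ```

    Because every thread executes the same lookup regardless of its input symbol, many such machines can run in lockstep across SIMD lanes or GPU warps.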

  • Using indicator variables instead of branches. Instead of (bad: serialized divergent branches):
    
    if (s1[n] == s2[m])
      a[n][m] += match;
    else
      a[n][m] += mismatch;
    
    Use an indicator (good - cheap arithmetic):
    
    i = (s1[n] == s2[m]);
    a[n][m] += mismatch + i * (match - mismatch);
    
    • Sometimes better on CPU (no branch prediction penalty); always better on GPUs and SIMD (e.g. SSE).
    • Exponential build-up, no precedence in evaluation.
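    A minimal, self-contained C sketch of the indicator trick (the scoring constants are hypothetical placeholders), checking that the branchy and branch-free forms always agree:

    ```c
    #include <assert.h>

    /* Hypothetical scoring values for illustration only. */
    enum { MATCH = 2, MISMATCH = -1 };

    /* Branchy form: divergent on SIMD/GPU. */
    static int score_branchy(char a, char b) {
        if (a == b) return MATCH;
        else        return MISMATCH;
    }

    /* Indicator form: the comparison yields 0 or 1, which selects the
       score arithmetically -- cheap, and identical across SIMD lanes. */
    static int score_indicator(char a, char b) {
        int i = (a == b);
        return MISMATCH + i * (MATCH - MISMATCH);
    }

    int main(void) {
        /* The two forms agree on every input pair. */
        for (char a = 'a'; a <= 'c'; ++a)
            for (char b = 'a'; b <= 'c'; ++b)
                assert(score_branchy(a, b) == score_indicator(a, b));
        return 0;
    }
    ```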

  • Balancing map reduce
    
    while (has task) {
      map free threads
      while (work_load < max and not finished) {
        perform one iteration
        work_load++
      }
      reduce finished threads
    }
    
    • Far better than map reduce due to varying task lengths.
    • Ideally max computed dynamically.
    • Reduce is trivial for regexes.
    • Rescheduling is done without quitting the kernel code, all the data is still in shared memory.
    • Memory accesses are not coalesced, but they go to shared memory/cache, so it doesn't matter.
    • Similar ideas were explored in distributed systems with different problems/solutions.
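  The scheduling loop above can be simulated on the CPU. This sketch (all names hypothetical) shows free "threads" being remapped to pending tasks of varying lengths after at most a fixed number of iterations, without ever leaving the loop:

  ```c
  #include <assert.h>

  /* Hypothetical pool sizes for the simulation. */
  #define NTHREADS  4
  #define MAX_ITERS 3

  /* Runs ntasks tasks (task_len[i] = iterations needed) on NTHREADS
     simulated threads; returns the number of tasks completed. */
  static int run_tasks(const int *task_len, int ntasks) {
      int remaining[NTHREADS];  /* iterations left per thread, -1 = idle */
      int next = 0, done = 0;
      for (int t = 0; t < NTHREADS; ++t) remaining[t] = -1;

      while (done < ntasks) {
          /* map free threads onto pending tasks */
          for (int t = 0; t < NTHREADS; ++t)
              if (remaining[t] < 0 && next < ntasks)
                  remaining[t] = task_len[next++];

          /* each thread performs at most MAX_ITERS iterations */
          for (int t = 0; t < NTHREADS; ++t) {
              int work = 0;
              while (remaining[t] > 0 && work < MAX_ITERS) {
                  remaining[t]--;      /* perform one iteration */
                  work++;
              }
              if (remaining[t] == 0) { /* reduce finished threads */
                  remaining[t] = -1;
                  done++;
              }
          }
      }
      return done;
  }

  int main(void) {
      int lens[] = { 1, 7, 2, 9, 3, 1 };  /* varying task lengths */
      assert(run_tasks(lens, 6) == 6);    /* all tasks complete */
      return 0;
  }
  ```

  Short tasks free their thread after one pass and the thread is immediately remapped, which is the point: no thread idles while long tasks finish, unlike a plain map-reduce over the whole batch.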
  News & Publication
30x faster regular expressions on a GPU
Lehavi, David; Schein, Sagi (2012). GTC 2012: Nvidia GPU Technology Conference, May 14-17, 2012, San Jose, California.
  Conference presentation