Application-level Flow Scheduling for Efficient Collective Data Transfers
Kumar, Vijay S.; Tucek, Joseph; Wylie, Jay J.; Krevat, Elie; Ganger, Gregory R.
Abstract: Collective data transfers among sets of processes over high-bandwidth, low-latency data center networks are an integral part of Big Data computations (e.g., the data shuffle in MapReduce). In this paper, we use a carefully architected microbenchmark that emulates a data shuffle, to gather network traces and perform detailed analysis. The key result of our analysis is that having more than two competing bi-directional flows per node in the transfer reduces throughput by 10%. What this means is that, even at very low cardinality (3- or 4-node shuffle), only 90%of the possible throughput can be achieved when commodity Ethernet-based switches are employed. TCP contention among multiple flows is the reason for the throughput loss experienced by collective data transfers. Though we identify system parameter configurations that minimize such packet losses, we believe application- layer flow management is necessary to circumvent this network-level problem. Towards this end, we designed and implemented a technique, Max2Flows, that generates and orchestrates a schedule of coordinated data exchange stages. Each stage limits the number of competing flows per node to two or fewer, thus avoiding negative network-level effects. Experimental results show that, when incorporated into our microbenchmark, Max2Flows can operate at ~99% of the peak throughput on a 1 Gigabit Ethernet network for small shuffles.
External Posting Date: June 06, 2012 [Fulltext]. Approved for External Publication
Internal Posting Date: June 06, 2012 [Fulltext]