HP Labs Technical Reports
Clustered Instruction-Level Parallel Processors
Faraboschi, Paolo; Desoli, Giuseppe; Fisher, Joseph A.
Keyword(s): VLIW; registers; clustering; compilers; EPIC; scheduling
Abstract: CPUs with a large amount of instruction-level parallelism (ILP) must carry out many register accesses each cycle. Eventually this leads to severe hardware bottlenecks and a loss of cycle time. A solution that has been proposed and implemented a few times is "clustering". Clustered ILP CPUs have several groups of hardware, each consisting of a register bank and one or more functional units. Functional units may only access registers in their associated bank. To access registers in other banks, explicit or implicit intercluster moves must be made while a program is running. CPUs offer ILP in a great variety of ways, and thus there are many different ways clustering may be carried out. However it is done, clustering represents a tradeoff between cycle speed and cycle count: it will take more cycles to execute a program on a clustered CPU than on a single-cluster CPU with the same functional units and total number of registers, but the clustered CPU will have a faster clock. We measure here the cost of clustering on multiple-cluster VLIW architectures, particularly using a new algorithm called Partial Component Clustering. Remarkably, the experiments reported here suggest the same ballpark results as those seen in very different environments. Indeed, the results seem similar across strikingly different architectures, layout algorithms, benchmarks, and degrees of ILP. As a rule of thumb, breaking the CPU into two clusters costs somewhere around 15-20% lost cycles; breaking it into four clusters costs around 25-30%. As feature sizes of microprocessors decrease in relation to communication costs, these numbers are likely to strongly favor the use of clustering, at least for CPUs which execute applications with large amounts of ILP.
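The cycle-speed versus cycle-count tradeoff described in the abstract can be illustrated with a little arithmetic. The sketch below is not taken from the report: it assumes that "lost cycles" means the clustered CPU's cycle count is inflated by the stated fraction, and that run time is simply cycle count divided by clock frequency. The function names and the example clock ratios are hypothetical.

```python
def break_even_clock_speedup(cycle_overhead):
    """Clock-frequency ratio (clustered / single-cluster) at which the
    clustered CPU exactly matches the single-cluster run time.

    cycle_overhead is the assumed fractional increase in cycle count,
    e.g. 0.20 for "20% lost cycles" (an interpretation, not the report's
    definition).
    """
    return 1.0 + cycle_overhead


def relative_run_time(cycle_overhead, clock_speedup):
    """Clustered run time divided by single-cluster run time.

    A value below 1.0 means the clustered CPU finishes sooner.
    """
    return (1.0 + cycle_overhead) / clock_speedup


if __name__ == "__main__":
    # Rule-of-thumb overheads quoted in the abstract, as fractions.
    for clusters, overhead in [(2, 0.15), (2, 0.20), (4, 0.25), (4, 0.30)]:
        needed = break_even_clock_speedup(overhead)
        print(f"{clusters} clusters, {overhead:.0%} overhead: "
              f"clock must be at least {needed:.2f}x faster to break even")
```

Under these assumptions, clustering pays off whenever the clock gain exceeds the cycle overhead: a two-cluster design that loses 20% of its cycles needs a clock more than 1.2 times faster than the single-cluster design.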