Personal tools
You are here: Home Publications Improving Memory Hierarchy Performance Through Combined Loop Interchange and Multi-Level Fusion
Document Actions

Qing Yi and Ken Kennedy (2002)

Improving Memory Hierarchy Performance Through Combined Loop Interchange and Multi-Level Fusion

In: Proceedings of the LACSI Symposium, Los Alamos, NM, Los Alamos Computer Institute.

Because of the increasing gap between the speeds of processors and main memories, it is critical that compilergenerated code makes e ective use of the cache memory hierarchy on modern processors. To do this, compilers must ensure that data elements are reused as often as possible once they have been moved into cache. Although loop nest blocking is the most widely studied method for increasing data reuse, loop interchange and loop fusion can be e ective tools as well. Loop fusion enhances locality by merging loops that access similar sets of data. After fusion, the uses of the data are brought closer together, thereby ensuring more reuse in the cache. Traditionally, loop fusion is applied to a collection of loops at the same level. It is often applied after loop interchange for each loop nest so that the best loop ordering for each nest is rst attained. Although loop interchange can nd the best ordering locally for each nest, it often selects the wrong loop to be placed at the outermost level for fusion, achieving sub-optimal performance globally. Building on traditional unimodular transformations for perfectly nested loops, this paper presents a novel transformation, dependence hoisting, that facilitates the combined interchange and fusion for a set of arbitrarily nested loops. We show that this transformation is not only useful for interchanging loops in complex loop structures, it can also aggressively fuse loops at di erent levels without resorting to loop interchange rst. We present techniques to e ectively combine loop interchange and multi-level fusion optimizations for better locality. By evaluating the compound e ect of both transformations beforehand, we are able to achieve better performance than that was possible by previous techniques, which apply interchange and fusion separately.

Also appeared in the International Journal of High Performance Computing Applications, Vol. 18, No. 2, 237-253 (2004) DOI: 10.1177/1094342004038956
by admin last modified 2007-12-10 21:06
« September 2010 »
Su Mo Tu We Th Fr Sa
1234
567891011
12131415161718
19202122232425
2627282930
 

Powered by Plone

LACSI Collaborators include:

Rice University LANL UH UNM UIUC UNC UTK