Loop nest transformation has been used successfully to tune dense numerical codes for high performance on single- and multi-core shared-memory systems, but has not been widely applied to cluster computing. We have explored the use of such transformations to produce the extremely high degree of memory locality needed to achieve high performance on a cluster with Intel's Cluster OpenMP software. Our experiments show high performance across our dedicated homogeneous 56-core/14-node research cluster with gigabit Ethernet. With proper tuning, performance drops by less than a factor of two, and sometimes by only a few percent, when the network speed is reduced to 100 Mb/s. These results indicate that properly chosen compile-time optimizations can be used for cluster computing, and illustrate the importance of scalable locality, which may be of interest to programmers developing cluster codes manually.