Parallel Rendering

Parallelization of Rendering Algorithms

Over the last decades, higher CPU performance has been achieved almost exclusively by raising the CPU’s clock rate. Today, the resulting power consumption and heat dissipation threaten to end this trend, so CPU designers are looking for alternative ways of providing more computing power. In particular, they are looking towards three concepts: a streaming compute model, vector-like SIMD units, and multi-core architectures.

Rendering algorithms need substantial computational power to reach interactive rates. Today, interactive rendering is possible only with simple shading models and fast rendering algorithms.

Our purpose is to exploit the computational strength offered by today's parallel architectures in rendering algorithms.

Parallel Ray Tracing

Ray Tracing algorithms are well known for their ability to generate high-quality images, but they are also infamous for their long rendering times. Considerable effort has been spent investigating new ways to overcome the high computational demands of Ray Tracing. Improving performance to interactive rates requires combining highly optimized ray tracing implementations with massive amounts of computational power. Thanks to recent advances in both hardware and software, it is now possible to create high-quality images at interactive rates on commodity PC clusters. Clusters of workstations have proven successful in the context of Parallel Ray Tracing, but they still require some effort to manage load balancing and to hide communication costs.

Our Contribution

In [CCDES09,CCDES08a,CCDES08b] we present a load-balancing technique that exploits the temporal coherence among successive computation phases in mesh-like computations to be mapped onto a cluster of processors. Our method partitions the computation into balanced tasks and distributes them to independent processors through the Prediction Binary Tree (PBT). At each new phase, the current PBT is updated by using each task's computing time in the previous phase as its cost estimate for the next phase. The PBT is designed to balance the load across tasks and to reduce dependencies among processors for higher performance. Dependencies are reduced by using rectangular tiles of the mesh with an almost-square shape (i.e., one dimension is at most twice the other). Reducing dependencies makes it possible to reduce inter-processor communication or to exploit local dependencies among tasks (such as data locality). Our strategy has been assessed on a significant problem, Parallel Ray Tracing. Our implementation shows good scalability and improves over coherence-oblivious implementations. We report different measurements showing that task granularity is a key factor in the performance of our decomposition/mapping strategy.

  • [CCDES09] Gennaro Cordasco, Biagio Cosenza, Rosario De Chiara, Ugo Erra, and Vittorio Scarano. "Experiences with Mesh-like computations using Prediction Binary Trees". In Scalable Computing: Practice and Experience, Scientific International Journal for Parallel and Distributed Computing (SCPE), Vol. 10, ISSN: 1895-1767, pages 173-187, June 2009.
  • [CCDES08a] Gennaro Cordasco, Biagio Cosenza, Rosario De Chiara, Ugo Erra, and Vittorio Scarano. "On Estimating the Effectiveness of Temporal and Spatial Coherence in Parallel Ray Tracing". In Proc. of 6th Eurographics Italian Chapter Conference (EG_It 2008). July 2-4, Salerno, Italy.
  • [CCDES08b] Gennaro Cordasco, Biagio Cosenza, Rosario De Chiara, Ugo Erra, and Vittorio Scarano. "Load Balancing in Mesh-like Computations using Prediction Binary Trees". In Proc. of 7th International Symposium on Parallel and Distributed Computing (ISPDC 2008). July 1-5, Krakow, Poland.
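
As a rough illustration of the load-balancing idea described above, the following Python sketch keeps a binary tree of rectangular tiles whose leaves are the tasks, and uses the time measured in the previous phase as the cost estimate for the next one. It is not the PBT update rule from the cited papers: the names (Tile, build, rebalance) and the split/merge heuristic are assumptions made for this example, and the almost-square property is only guaranteed here when the image dimensions are powers of two with an aspect ratio of at most 2.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Tile:
    """A rectangular region of the image; leaves of the tree are the tasks."""
    x: int
    y: int
    w: int
    h: int
    cost: float = 0.0                 # cost estimate (time of the previous phase)
    left: Optional["Tile"] = None
    right: Optional["Tile"] = None

    def is_leaf(self) -> bool:
        return self.left is None

    def split(self) -> None:
        # Halve the longer side so that tiles stay almost square.
        # Assumption: each half inherits half of the parent's cost estimate.
        c = self.cost / 2
        if self.w >= self.h:
            half = self.w // 2
            self.left = Tile(self.x, self.y, half, self.h, c)
            self.right = Tile(self.x + half, self.y, self.w - half, self.h, c)
        else:
            half = self.h // 2
            self.left = Tile(self.x, self.y, self.w, half, c)
            self.right = Tile(self.x, self.y + half, self.w, self.h - half, c)

    def merge(self) -> None:
        # Collapse two leaf children back into a single task.
        self.cost = self.left.cost + self.right.cost
        self.left = self.right = None


def leaves(node: Tile) -> List[Tile]:
    if node.is_leaf():
        return [node]
    return leaves(node.left) + leaves(node.right)


def leaf_pairs(node: Tile) -> List[Tile]:
    """Internal nodes whose two children are both leaves."""
    if node.is_leaf():
        return []
    if node.left.is_leaf() and node.right.is_leaf():
        return [node]
    return leaf_pairs(node.left) + leaf_pairs(node.right)


def build(width: int, height: int, n_tasks: int) -> Tile:
    # Before any measurement exists, use the tile area as the cost estimate.
    root = Tile(0, 0, width, height, cost=float(width * height))
    while len(leaves(root)) < n_tasks:
        max(leaves(root), key=lambda t: t.cost).split()
    return root


def rebalance(root: Tile) -> bool:
    """One update step: split the most expensive task and merge the cheapest
    sibling pair, so the number of tasks stays constant. Returns True if the
    tree changed."""
    heavy = max(leaves(root), key=lambda t: t.cost)
    if max(heavy.w, heavy.h) < 2:
        return False                  # tile too small to split further
    pairs = [p for p in leaf_pairs(root)
             if heavy is not p.left and heavy is not p.right]
    if not pairs:
        return False
    light = min(pairs, key=lambda p: p.left.cost + p.right.cost)
    if heavy.cost <= light.left.cost + light.right.cost:
        return False                  # already balanced enough
    light.merge()
    heavy.split()
    return True
```

In a render loop, one would call build once, render every leaf tile while timing it, write the measured times into the cost fields, and call rebalance before the next frame. A real implementation would also have to map tiles to processors and decide how many update steps to apply per phase, which this sketch leaves out.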


Real-time computer graphics hardware is undergoing a major transition, from supporting a few fixed algorithms to being fully programmable. The performance of graphics processors (GPUs) is increasing at a rapid rate, even faster than that of CPUs, because GPUs can effectively exploit the parallelism available in graphics computations.

These improvements in GPU flexibility and performance are likely to continue in the future, and will allow developers to write increasingly sophisticated and diverse programs that execute on the GPU.


With the increasing programmability of GPUs, these chips are capable of performing more than the specific graphics computations for which they were designed. They are now capable coprocessors, and their high speed makes them useful for a variety of applications. GPGPU stands for general-purpose computation on GPUs. With the addition of programmable stages and higher-precision arithmetic, it is possible to use the stream processing of the rendering pipeline to solve non-graphics problems.

Our Contribution