A PIM (Processor-In-Memory) for Computer Graphics: Data Partitioning and Placement Schemes
The demand for higher-performance graphics continues to grow, driven
by the persistent desire for realism. At the same time, rapid advances
in fabrication technology have made it possible to build several
processor cores on a single die. Hence, it is important to develop
single-chip parallel architectures for such data-intensive
applications. In this paper, we propose an efficient PIM architecture
tailored for computer graphics, which requires a large number of
memory accesses. We then address the two tasks necessary for
maximally exploiting the parallelism provided by the architecture,
namely partitioning and placement of graphics data, which affect load
balance and communication cost, respectively. Under the constraint of
uniform partitioning, we develop approaches for optimal partitioning
and placement that significantly reduce the search space. We also
present heuristics for identifying near-optimal placements, since the
search space for placement remains impractically large despite our
optimization. We then demonstrate the effectiveness of our partitioning
and placement approaches via analysis of example scenes. Simulation
results show considerable search space reductions, and our placement
heuristics perform close to optimal: the average ratio of the
communication overhead of our heuristics to that of the optimal
placement was 1.05. Our uniform partitioning showed an average
load-balance ratio of 1.47 for geometry processing and 1.44 for
rasterization, which is reasonable.
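The abstract does not define the load-balance ratio. As a minimal sketch, assuming it means the load of the busiest partition divided by the mean load across all partitions (so 1.0 is perfect balance), the following counts primitives per cell of a uniform grid and computes that ratio. The function names, the use of 2D points as stand-ins for graphics primitives, and the grid layout are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def uniform_partition_loads(points, width, height, cols, rows):
    """Count points per cell of a uniform cols x rows grid over a
    width x height area. Returns loads in row-major order."""
    counts = Counter()
    for x, y in points:
        # Clamp so points lying exactly on the far edge fall in the last cell.
        cx = min(int(x * cols / width), cols - 1)
        cy = min(int(y * rows / height), rows - 1)
        counts[(cx, cy)] += 1
    # Counter returns 0 for cells that received no points.
    return [counts[(cx, cy)] for cy in range(rows) for cx in range(cols)]


def load_balance_ratio(loads):
    """Assumed definition: maximum load divided by mean load."""
    mean = sum(loads) / len(loads)
    if mean == 0:
        raise ValueError("no load to balance")
    return max(loads) / mean


if __name__ == "__main__":
    pts = [(1, 1), (2, 2), (9, 9), (1, 9), (9, 1), (3, 3)]
    loads = uniform_partition_loads(pts, 10, 10, 2, 2)
    print(loads)                      # [3, 1, 1, 1]
    print(load_balance_ratio(loads))  # 2.0
```

Under this assumed definition, the reported ratios of 1.47 and 1.44 would mean that the most heavily loaded partition carries roughly 1.4 to 1.5 times the average load.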
@article{"International Journal of Information, Control and Computer Sciences:55497", author = "Jae Chul Cha and Sandeep K. Gupta", title = "A PIM (Processor-In-Memory) for Computer Graphics : Data Partitioning and Placement Schemes", keywords = "Data Partitioning and Placement, Graphics, PIM,Search Space Reduction.", volume = "2", number = "2", pages = "433-9", }