Benchmarking Graphics Hardware

Jason Chaw

Navendu Jain

{jchaw, nav}@cs.utexas.edu

 

Introduction

Over the last decade, graphics hardware has become one of the fastest-moving segments of the computer hardware industry, with new generations of chips appearing almost every six months. The newest graphics cards are equipped with a Graphics Processing Unit (GPU), large onboard memories, and memory bandwidths on the order of gigabytes per second. Recently, the GPU has also become programmable. Hence, there is a need to benchmark the performance of emerging graphics systems. Doing so helps us not only to understand them better, but also to design future state-of-the-art systems by identifying the performance bottlenecks of existing ones.

In this project, we studied various issues associated with today's graphics subsystems in terms of performance, design and trade-offs.

The benchmark programs were tested on an Intel-based workstation equipped with a GeForce 2 card running the Linux operating system. The specifications are as follows:

Domain Name            sokoban.cs.utexas.edu
Graphics Card Vendor   NVIDIA Corporation
Driver Version         1.2.2
Visual Parameters      RGBA=<8,8,8,0>, Z=<24>
Window geometry        800 x 600
Screen geometry        1600 x 1200


To exercise the display pipeline equally, we made sure that all the pixels drawn were actually shown on the screen throughout the whole benchmark.

Result Highlights:

Part 1 - Geometry versus Rasterization

The first part of the assignment finds the crossover point between geometry-limited and rasterization-limited processing. In order to exercise the command pipeline equally, we kept the number of triangles drawn constant for each rendering pass.

Modern graphics adapters provide hardware optimizations for certain stages of the graphics pipeline. Part 1 of this assignment lets us explore two specific components:

  1. Geometry Engine

  2. Rasterization Engine

The geometry engine primarily operates on vertices. The work it performs includes coordinate transformations, color assignment and lighting calculations. Geometry processing is a per-vertex operation, and its performance depends largely on the number of vertices. We expect geometry processing performance to decrease in the following cases:

  1. When the number of vertices in a scene increases

  2. When the number of light sources increases

The rasterization engine primarily generates the pixels covered by each geometric primitive passed on by the geometry engine. The tasks it performs include color blending, texture mapping, and depth buffering. Most of its operations are per-pixel. The performance of a rasterizer depends on the size of the primitive to be rasterized: a larger primitive translates to more pixels and more work for the rasterizer. We expect rasterization performance to drop in the following scenarios:

  1. When geometrical primitives increase in size

  2. When geometrical primitives include textures

From the above, we hypothesize that for small triangles the rasterization work per triangle is small, so vertex (geometry) processing dominates performance. For large triangles (along with complex shading), the fragment operations, i.e. rasterization, dominate the rendering pipeline.
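
The hypothesis can be written as a simple bottleneck model: each triangle costs whichever is larger, its geometry time or its rasterization time. The following is a minimal sketch of that model, not the benchmark itself. It assumes right isosceles triangles whose legs equal the edge length, and the two rates are hypothetical values chosen only to illustrate the shape of the curve.

```python
import math


def triangle_area_pixels(edge_length):
    """Area in pixels of a right isosceles triangle (assumed shape)."""
    return 0.5 * edge_length * edge_length


def modeled_triangle_rate(edge_length, geometry_rate, fill_rate):
    """Triangles per second when the slower stage sets the pace."""
    raster_rate = fill_rate / triangle_area_pixels(edge_length)
    return min(geometry_rate, raster_rate)


def modeled_crossover_edge(geometry_rate, fill_rate):
    """Edge length at which both stages need equal time per triangle."""
    return math.sqrt(2.0 * fill_rate / geometry_rate)


# Hypothetical rates, not measurements from this report.
GEOMETRY_RATE = 8.0e6  # triangles per second (assumed)
FILL_RATE = 100.0e6    # pixels per second (assumed)

crossover = modeled_crossover_edge(GEOMETRY_RATE, FILL_RATE)
small = modeled_triangle_rate(2.0, GEOMETRY_RATE, FILL_RATE)   # geometry limited
large = modeled_triangle_rate(10.0, GEOMETRY_RATE, FILL_RATE)  # rasterization limited
```

Under this model the triangle rate stays flat below the crossover edge length and falls off with the square of the edge length above it.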

Setup of Benchmark

To verify our hypothesis, we designed a benchmark. The original gfxbench program was designed only for smooth-shaded benchmarking. We modified the code to support lighting and texture mapping. Texture mapping was accomplished by providing a pair of texture coordinates for each vertex instead of a random RGB color value. We used linear filtering for both magnification and minification. We also used a simple lighting model with 1 light source. In addition, the code was modified to report triangle rates over a series of triangle edge lengths. We also ran our benchmark under different conditions, to determine whether additional overhead in the geometry engine or the rasterization engine affects the crossover point. The conditions we benchmarked were smooth shaded, smooth shaded and lit, textured, and textured and lit; the textured conditions were run with both a 64 x 64 and a 1024 x 1024 pixel texture.

In order to determine the crossover point, our benchmark program uses triangle edge length as the parameter and computes the triangle rate for different triangle sizes, keeping the number of triangles constant for all the runs.
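
One way to read a crossover point off such a sweep is to look for the edge length at which the triangle rate first falls noticeably below its plateau. The sketch below illustrates that idea. The function names, the 5% tolerance, and the sample values are all hypothetical and are not taken from our measurements.

```python
def triangle_rate(num_triangles, elapsed_seconds):
    """Triangles per second for one rendering pass."""
    return num_triangles / elapsed_seconds


def find_crossover(samples, tolerance=0.05):
    """Locate the drop-off in (edge_length, rate) samples sorted by edge length.

    Returns (last edge on the plateau, first edge below it), or None if the
    rate never falls more than `tolerance` below the highest rate seen.
    """
    plateau = max(rate for _, rate in samples)
    threshold = (1.0 - tolerance) * plateau
    for (prev_edge, _), (edge, rate) in zip(samples, samples[1:]):
        if rate < threshold:
            return prev_edge, edge
    return None


# Hypothetical sweep: (edge length in pixels, triangles per second).
SAMPLES = [
    (1, 8.0e6), (2, 8.0e6), (3, 7.9e6), (4, 7.9e6),
    (5, 7.8e6), (6, 5.5e6), (8, 3.1e6),
]

bracket = find_crossover(SAMPLES)
```

For these made-up samples the crossover is bracketed between edge lengths 5 and 6; a finer sweep inside that bracket would narrow it further.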

The source code for our part 1 benchmark is available.

We provide our results in a series of graphs.

 

Figure 1 - Triangle rates for all 4 conditions

 

Figure 2 - Triangle rates for smooth shaded unlit and lit conditions

 

Figure 3 - Triangle rates for textured lit and unlit conditions using a 64 x 64 pixel texture


Figure 4 - Triangle rates for textured lit and unlit conditions using a 1024 x 1024 pixel texture


From the above graphs, we observe the following:

Condition                               Approximate crossover point (edge length in pixels)
Smooth shaded                           4.9
Smooth shaded and lit                   5
Textured (64 x 64) pixel                2
Textured and lit (64 x 64) pixel        2
Textured (1024 x 1024) pixel            1.5
Textured and lit (1024 x 1024) pixel    1.5

As part of our benchmarking process, we also measured the fillrate under different conditions. The results are presented in the following table.

Condition                                            Fillrate (MPix/Sec)
Smooth shaded                                        104.38
Smooth shaded and Lit                                104.22
Textured using 64 x 64 pixel texture                 106.68
Textured and Lit using 64 x 64 pixel texture         106.20
Textured using 1024 x 1024 pixel texture             88.36
Textured and Lit using 1024 x 1024 pixel texture     88.65

From the above table, we detail our findings below,

Part 2A - Rasterization

This part of the assignment requires us to benchmark the performance of the rasterizer in rendering different triangle types.

A triangle is defined by 3 vertices, (x,y,z). We define the base of the triangle as the line segment x:y, and the width of the triangle as the length of x:y. The height of the triangle is the length of the perpendicular from the line x:y to vertex z. There also exist 3 angles, (x,y,z), (y,z,x), and (z,x,y). The sum of these angles is pi radians.


We define the ratio of a triangle by the following expression, ratio = height / width.


In addition, we distinguish 2 types of triangles: tall thin triangles, whose ratio is large, and short fat triangles, whose ratio is small.

Using the results from Part 1, we understand that rasterization dominates for large geometric primitives and geometry processing dominates for small ones. We hypothesize that the order of scanline conversion will affect rasterization performance between tall thin triangles and short fat triangles of similar size. For example, if the graphics card used vertical scanline conversion, then tall thin triangles, which require fewer vertical scanlines, would perform better than short fat triangles.

In our experiments, we expect the following,

Setup of experiment

Our experiment involves benchmarking the fillrate when rasterizing a series of triangle sizes having different shapes. For our purposes, we decided to investigate fillrate performance for triangle areas between 3 and 100 pixels. All the triangles in our benchmark are right-angled triangles for ease of implementation.

For each triangle size, the benchmark calculates a series of height and width values to provide a spread of ratios. As such for each triangle size, we have a spread of different triangle shapes identified by their ratios. For each ratio in the series, the benchmark program renders a triangle mesh containing a constant number of triangles and calculates the corresponding fillrate for the triangle size and shape.
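
For a right-angled triangle, the height and width follow directly from the area and the ratio: since area = height x width / 2 and ratio = height / width, we get width = sqrt(2 x area / ratio) and height = ratio x width. The sketch below illustrates this calculation. The particular list of ratios is a hypothetical spread, not the one used by our benchmark, and a real run would still have to round the dimensions to whole pixels.

```python
import math


def right_triangle_dimensions(area, ratio):
    """Return (height, width) of a right triangle with the given area
    and ratio = height / width."""
    width = math.sqrt(2.0 * area / ratio)
    height = ratio * width
    return height, width


# Hypothetical spread of ratios, from short and fat to tall and thin.
RATIOS = [0.25, 0.5, 1.0, 2.0, 4.0]

AREA = 50.0  # triangle area in pixels
shapes = [right_triangle_dimensions(AREA, r) for r in RATIOS]
```

Every entry in `shapes` covers the same area, so differences in measured fillrate between them can be attributed to shape rather than size.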

We modified the original gfxbench source code for our purpose in this experiment. The source code for our part 2a benchmark is available.

Analysis of results

From the results, we plot 2 graphs, one each for small and large triangle area sizes. For each set of triangle sizes (small or large), we plot fillrate against ratio.



Figure 5 - Fillrate for triangle sizes 50 to 100

Figure 6 - Fillrate for triangle sizes 10 to 15

Figure 7 - Fillrate for triangle sizes 3 to 9


Comparing the 2 graphs, we were able to observe the following,

Part 2D - Vertex Engine

In most computer graphics applications, geometric primitives are represented by their vertices. To provide fast performance, graphics pipelines therefore keep a cache buffer of vertices. Typically, vertex values are stored after the transformation and lighting operations.

To determine the vertex cache size, we designed a test involving a mesh of vertex-sharing triangles. Additionally, triangles are presented to the hardware in strip mode so that the vertices are accessed mostly sequentially, revisiting only recently accessed ones, to improve vertex cache efficiency.

For example, instead of accessing vertices in the order <1,2,3; 4,5,6; 7,8,9>, the strip accesses would be <1,2,3; 3,2,4; 4,3,5; 5,4,6>.
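
The two access orders in this example can be generated programmatically. The following sketch reproduces them and counts how many distinct vertices each one touches. The function names are ours, for illustration only, and do not come from gfxbench.

```python
def independent_triangles(num_triangles):
    """Every triangle gets three fresh vertices: (1,2,3), (4,5,6), ..."""
    return [(3 * k + 1, 3 * k + 2, 3 * k + 3) for k in range(num_triangles)]


def strip_triangles(num_triangles):
    """Strip order from the example: (1,2,3), (3,2,4), (4,3,5), ..."""
    triangles = []
    for k in range(1, num_triangles + 1):
        if k == 1:
            triangles.append((1, 2, 3))
        else:
            # Reuse the two most recent vertices and add one new vertex.
            triangles.append((k + 1, k, k + 2))
    return triangles


def distinct_vertices(triangles):
    """Number of different vertex indices referenced by the triangles."""
    return len({v for tri in triangles for v in tri})


strip = strip_triangles(4)
separate = independent_triangles(4)
```

Four triangles drawn as a strip reference only 6 distinct vertices, against 12 when drawn independently, which is why the strip order gives a vertex cache far more opportunities to hit.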

We modified the gfxbench source code and ran the test with different values for the mesh size and the triangle size. Specifically, for a given triangle edge length (in pixels), we varied the dimension dim of a dim x dim grid from 1 to 25 in increments of 1. This dimension was then passed as a function parameter to measure the triangle rate for a given size.


Figure 8 - Triangle rate versus Number of values cached (Triangle edge length 1 to 6 pixels)


Figure 9 - Triangle rate versus Number of values cached (Triangle edge length 7 to 10 pixels)

Looking at the graphs, we infer that the cache holds 16 values. Initially, there is a performance boost due to hits in the vertex cache. We see a transition point at a cache size of 16, where the triangle rate reaches its peak. Exceeding this size by even one vertex drastically decreases performance because caching becomes less effective overall; performance later stabilizes around that lower value. Each vertex uses a total of 8 bytes to store its x and y coordinates, which are 4 bytes each. An interesting thing to note here is that each cache line is 4 bytes, so x and y are stored in interleaved fashion. It follows that at most 8 vertices, each represented as an (int,int) coordinate pair, would be in the cache at any time.

This experimental setup also validates our results for Part 1. As we varied the triangle edge length, the vertex cache determined overall efficiency only up to 5 pixels, i.e. where geometry dominates the rendering pipeline. From an edge length of 6 pixels onward, rasterization work becomes dominant, and the presence of a vertex cache no longer affects overall performance.

The cache replacement policy is FIFO or similar to FIFO. We inferred this from observing that performance increased when vertices were accessed in sequential order vis-à-vis random order.
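
The sharp drop just past the cache size is what a FIFO cache produces when a working set is revisited in the same order. The following is a minimal software simulation of that effect, not a model of the actual hardware. The access pattern (two sequential passes over the same entries) and the function names are assumptions made for illustration.

```python
from collections import deque


def fifo_misses(index_stream, cache_size):
    """Count misses for a FIFO cache: a hit does not refresh an entry,
    and the oldest inserted entry is evicted first."""
    cache = deque()
    misses = 0
    for index in index_stream:
        if index not in cache:
            misses += 1
            cache.append(index)
            if len(cache) > cache_size:
                cache.popleft()
    return misses


def two_passes(num_entries):
    """Visit entries 0..num_entries-1 in order, then visit them again."""
    return list(range(num_entries)) * 2


CACHE_SIZE = 16
fits = fifo_misses(two_passes(16), CACHE_SIZE)       # second pass all hits
overflows = fifo_misses(two_passes(17), CACHE_SIZE)  # second pass all misses
```

With 16 entries the second pass hits every time (16 misses in total), while with 17 entries each access evicts exactly the entry needed next, so all 34 accesses miss. An LRU cache behaves the same way on this particular pattern, so this sketch shows the cliff but does not by itself distinguish FIFO from LRU.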

Important Notes:

  • Vertex caches work best with indexing
  • Vertex cache size is close to 16
  • Cache replacement policy is close to FIFO

The benefits of a vertex cache are clearly evident, since performance improves up to a vertex cache size of 16. This is the limit: beyond it, we see a dramatic decrease as the mesh size increases. Note that to exploit the benefits of caching, accesses should reuse recently cached vertices.

The source code for our part 2d benchmark is available.

Part 2E - Graphics Subsystem - A Discussion

We only provide a discussion about the design of the benchmark for this section:

Modern graphics hardware allows vertex data to be placed in AGP memory or onboard video memory for increased geometry performance. We proceed to design a benchmark that utilizes NVIDIA's "VertexArrayRange" extension. Our benchmark is designed to examine the performance of using AGP or video memory for vertex data compared to regular malloc'ed memory.

We expect significant performance gains from using video or AGP memory compared to regular malloc'ed memory. Our hypothesis rests largely on the following:

  1. Onboard video memory typically consists of very high performance memory chips. In addition, onboard video memory can be directly accessed by the GPU at very high speeds. This translates to very low latency and high bandwidth.
  2. The AGP bus provides a point-to-point connection to a portion of system memory. In addition, AGP provides pipelined access to memory. Altogether, AGP allows the graphics controller to access system memory at a very high bandwidth. Although system memory exhibits higher latency, we expect memory addressing mechanisms such as pipelined memory access to minimize its effects.
  3. System memory typically consists of mid-range memory chips. These chips may exhibit higher latencies, and the system bus, which is contended for by multiple peripherals, provides limited bandwidth.

Setup for benchmark

Our benchmark attempts to draw a fixed number of fixed-length triangles using a triangle mesh under different conditions. These conditions are similar to those in Part 1.

  1. Smooth-shaded

  2. Smooth-shaded and lit

  3. Textured

  4. Textured and lit

Different conditions affect the size of each vertex. As such, the memory required to model the triangle mesh varies with the condition. Our benchmark uses these different memory sizes to measure differences in memory performance.

The memory requirements for different conditions are illustrated in the following table.

Condition                Vertex Size (bytes)   Vertex Count   Memory Requirement (bytes)
Smooth shaded            24                    160000         3840000
Smooth shaded and Lit    36                    160000         5760000
Textured                 36                    160000         5760000
Textured and Lit         48                    160000         7680000
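
Each memory requirement is simply the vertex size multiplied by the vertex count. The short sketch below works through that arithmetic for the four conditions, assuming the vertex sizes are given in bytes.

```python
VERTEX_COUNT = 160000

# Vertex size per condition, assumed to be in bytes.
VERTEX_SIZES = {
    "Smooth shaded": 24,
    "Smooth shaded and Lit": 36,
    "Textured": 36,
    "Textured and Lit": 48,
}


def memory_requirement(vertex_size, vertex_count):
    """Total size of the vertex data for one condition."""
    return vertex_size * vertex_count


requirements = {
    condition: memory_requirement(size, VERTEX_COUNT)
    for condition, size in VERTEX_SIZES.items()
}
```

The products are 3840000, 5760000, 5760000 and 7680000 bytes respectively, so the largest condition needs twice the memory of the smallest.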

For each condition, we will measure the triangle fillrate. Subsequently we will compare the results for the conditions using different memory systems.

We expect the results of the benchmark to exhibit performance characteristics of utilizing different memory systems.