Benchmarking Graphics Hardware
Jason Chaw
Navendu Jain
{jchaw, nav}@cs.utexas.edu
Introduction
Over the last decade, graphics hardware has advanced rapidly, with new generations of chips appearing almost every six months. The newest graphics cards are equipped with a Graphics Processing Unit (GPU), large onboard memories, and memory bandwidths on the order of gigabytes per second. Recently, the GPU has also become programmable. Hence, there is a need to benchmark the performance of emerging graphics systems. This exercise not only helps us understand them better, but also helps in designing future state-of-the-art systems by identifying the performance bottlenecks of existing ones.
In this project, we studied various issues associated with today's graphics subsystems in terms of performance, design and trade-offs.
The benchmark programs were tested on an Intel-based workstation equipped with a GeForce 2 card running the Linux operating system. The specifications are as follows:
Domain Name | sokoban.cs.utexas.edu
Graphics Card Vendor | NVIDIA Corporation
Driver Version | 1.2.2
Visual Parameters | RGBA=<8,8,8,0>, Z=<24>
Window geometry | 800 x 600
Screen geometry | 1600 x 1200
To exercise the display pipeline equally, we made sure that all the pixels drawn were actually shown on the screen throughout the whole benchmark.
Result Highlights:
- The crossover point between geometry-limited and rasterization-limited rendering is an edge length of about 5 pixels for smooth shaded triangles and 2 pixels for textured triangles
- There is increased rasterization processing under textured conditions, requiring the system to spend significantly more time per triangle as the edge length increases
- The rasterizer performs better with small textures (magnification) than with large textures (minification)
- The rasterizer works in a horizontal fashion
- The vertex cache size is 16 lines and the cache replacement policy is FIFO
Part 1 - Geometry versus Rasterization
The first part of the assignment investigates the crossover point between geometry-limited and rasterization-limited processing. In order to exercise the command pipeline equally, we kept the number of triangles drawn constant for each rendering pass.
Modern computer graphics adapters provide hardware optimization for certain stages of the graphics pipeline. Part 1 of this assignment allows us to explore two specific components:
- Geometry engine
- Rasterization engine
The geometry engine primarily operates on vertices. The work it performs includes coordinate transformations, color assignment and lighting calculations. Geometry processing is a per-vertex operation and its performance largely depends on the number of vertices. We expect geometry processing performance to decrease in the following cases:
- When the number of vertices in a scene increases
- When the number of light sources increases
The rasterization engine primarily deals with generating the pixels of a geometrical primitive returned by the geometry engine. The tasks it performs include color blending, texture mapping, and depth buffering. Most of its operations are per-pixel. The performance of a rasterizer depends on the size of the primitive to be rasterized: a larger primitive translates to more pixels and more work for the rasterizer.
We expect rasterization performance to drop in the following scenarios:
- When geometrical primitives increase in size
- When geometrical primitives include textures
From the above, we expect that for small triangles the rasterization work per triangle is small, so vertex (geometry) processing dominates performance. For large triangles (along with complex shading), the fragment operations, i.e. rasterization, dominate the rendering pipeline.
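One way to make this hypothesis concrete is a simple pipeline cost model: if geometry and rasterization run as consecutive pipeline stages, the slower stage sets the time per triangle. The Python sketch below is illustrative only. The function names, the cost constants, and the approximation of triangle area as half the squared edge length are assumptions made for the example; they are not part of gfxbench and are not measured values.

```python
import math

def triangle_time(edge_px, t_geometry, t_per_pixel):
    """Time per triangle when the slower pipeline stage sets the pace.

    The covered area is approximated as that of a right triangle with two
    legs of length edge_px (an assumption made for this sketch).
    """
    pixels = 0.5 * edge_px * edge_px
    return max(t_geometry, pixels * t_per_pixel)

def crossover_edge(t_geometry, t_per_pixel):
    """Edge length at which rasterization time equals geometry time."""
    return math.sqrt(2.0 * t_geometry / t_per_pixel)

# Hypothetical costs in arbitrary time units, not measured values.
T_GEOMETRY = 12.5
T_PER_PIXEL = 1.0
print(crossover_edge(T_GEOMETRY, T_PER_PIXEL))  # 5.0
```

Below the crossover edge length the model returns the constant geometry cost; above it, the cost grows with the square of the edge length.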
Setup of Benchmark
To verify our hypothesis, we designed a benchmark. The original gfxbench program was designed only for smooth shaded benchmarking. We modified the code to support lighting and texture mapping. Texture mapping was accomplished by providing a pair of texture coordinates to each vertex instead of a random RGB color value. We used linear filtering for both magnification and minification, and a simple lighting model with 1 light source. The code was also modified to report triangle rates over a series of triangle edge lengths. Finally, we ran our benchmark under different conditions to determine whether additional overhead for the geometry engine or rasterization engine affects the crossover point. The conditions we benchmarked were:
- Smooth shaded
- Smooth shaded and lit
- Textured using 64 x 64 texture map
- Textured and lit using 64 x 64 texture map
- Textured using 1024 x 1024 texture map
- Textured and lit using 1024 x 1024 texture map
In order to determine the crossover
point, our benchmark program uses triangle edge length as the parameter
and computes the triangle rate for different triangle sizes, keeping
the number of triangles constant for all the runs.
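The sweep described above can be post-processed to locate the crossover point automatically. The following Python sketch is one possible approach, not the method used in gfxbench: it treats the rate at the smallest edge length as the geometry-limited plateau and reports the first edge length whose rate falls a chosen fraction below it. The function names, the 10% threshold, and the sample numbers are all made up for illustration.

```python
def triangle_rate(num_triangles, elapsed_seconds):
    """Triangles per second for one rendering run."""
    return num_triangles / elapsed_seconds

def find_crossover(edge_lengths, rates, drop_fraction=0.10):
    """Return the first edge length whose rate drops below the plateau.

    The plateau is taken to be the rate at the smallest edge length.
    Returns None if the rate never drops by more than drop_fraction.
    """
    plateau = rates[0]
    for edge, rate in zip(edge_lengths, rates):
        if rate < plateau * (1.0 - drop_fraction):
            return edge
    return None

# Synthetic data for illustration only (not our measurements).
edges = [1, 2, 3, 4, 5, 6, 8]
rates = [10.0, 10.0, 9.9, 9.8, 8.0, 5.5, 3.1]
print(find_crossover(edges, rates))  # 5
```

A relative threshold is used rather than an absolute one so that the same check works for conditions with very different plateau rates.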
The source code for our part 1
benchmark is available.
We provide our results in a series
of graphs.

Figure 1 - Triangle rates for all 4 conditions

Figure 2 - Triangle rates for smooth shaded unlit and lit conditions

Figure 3 - Triangle rates for textured lit and unlit conditions using a 64 x 64 pixel texture

Figure 4 - Triangle rates for textured lit and unlit conditions using a 1024 x 1024 pixel texture
From the above graphs, we observe the following:
- A marked drop in triangle rate. We believe this point to be the crossover point between geometry and rasterization processing. We summarise the crossover points under different conditions in the following table.
Condition | Approximate crossover point (edge length in pixels)
Smooth shaded | 4.9
Smooth shaded and lit | 5
Textured (64 x 64 pixel texture) | 2
Textured and lit (64 x 64 pixel texture) | 2
Textured (1024 x 1024 pixel texture) | 1.5
Textured and lit (1024 x 1024 pixel texture) | 1.5
- It is interesting to note that under textured conditions the crossover point occurs at an edge length of about 2 pixels, as opposed to about 5 pixels for smooth shaded triangles. Furthermore, the corresponding drop in triangle rate is steeper than in the smooth shaded case. This observation supports our hypothesis that there is increased rasterization processing under textured conditions, requiring the system to spend significantly more time per triangle as the edge length increases, thus causing the crossover point to occur earlier.
- Triangle rates for small triangle sizes are lower when lighting is enabled. Past the crossover point, the triangle rates for lit and unlit conditions become similar. This supports our hypothesis that lighting is a per-vertex operation done largely by the geometry engine. Geometry calculations simply take longer because of the additional overhead of including a light source.
- Performance degrades markedly faster for textured triangles than for smooth shaded ones, both lit and unlit. Texture mapping lightens the geometry processing workload since vertices do not have colors assigned, so for small triangles we observe a marked improvement in triangle rates. Since texturing is a per-pixel operation, rasterization performance drops significantly for large triangles, where there are more pixels to fill.
- For small triangles, the triangle rate is higher for textured than for smooth shaded triangles, and likewise for textured lit compared with smooth shaded lit.
As part of our
benchmarking process, we also measured the fillrate under different
conditions. The results are presented in the following table.
Condition | Fillrate (MPix/Sec)
Smooth shaded | 104.38
Smooth shaded and lit | 104.22
Textured using 64 x 64 pixel texture | 106.68
Textured and lit using 64 x 64 pixel texture | 106.20
Textured using 1024 x 1024 pixel texture | 88.36
Textured and lit using 1024 x 1024 pixel texture | 88.65
From the above table, we detail our findings below:
- Fillrates do not differ significantly whether lighting is enabled or disabled, for both smooth shaded and textured conditions. This further supports our hypothesis that lighting is a geometry process.
- The rasterizer performs better with small textures than with large textures. Our window size was set at 800 x 800 pixels, so for small textures (64 x 64) the rasterization process magnifies the texture map, mapping one texel to multiple pixels. Conversely, for large textures (1024 x 1024) the rasterization process minifies the texture map, which requires examining more texels to produce each pixel. From our results, we believe minification is significantly slower than magnification, which translates to lower fillrates when a large texture map is used.
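The magnification versus minification distinction comes down to how many texels fall on each screen pixel. The Python sketch below classifies the two texture sizes under the assumption that the texture is stretched across the full window extent; the function names and that stretching assumption are illustrative and not taken from the benchmark source.

```python
def texels_per_pixel(texture_size, screen_extent):
    """Texels covered by one pixel along one axis."""
    return texture_size / screen_extent

def filter_mode(texture_size, screen_extent):
    """Classify which texture filter applies along one axis."""
    ratio = texels_per_pixel(texture_size, screen_extent)
    if ratio < 1.0:
        return "magnification"   # one texel is spread over several pixels
    if ratio > 1.0:
        return "minification"    # several texels contribute to one pixel
    return "one-to-one"

# Assumes the texture spans an 800-pixel window extent.
for size in (64, 1024):
    print(size, filter_mode(size, 800))
```

Under this assumption the 64 x 64 texture is magnified (0.08 texels per pixel) and the 1024 x 1024 texture is minified (1.28 texels per pixel).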
Part 2A - Rasterization
This part of the assignment requires us to benchmark the performance of the rasterizer in rendering different triangle types.
A triangle is defined by 3 vertices, (x,y,z). We define the base of the triangle as the line segment x:y, and the width of the triangle as the length of x:y. The height of the triangle is the length of the normal from x:y to vertex z.
There are also 3 angles, (x,y,z), (y,z,x), and (z,x,y). The sum of these angles is pi radians.

We define the ratio of a triangle by the following expression: ratio = height / width.
In addition, we distinguish 2 types of triangles: tall thin triangles, whose ratio is large, and long thick triangles, whose ratio is small.
Using results from Part 1, we understand that rasterization processing dominates for large geometrical primitives and geometry processing dominates for small geometrical primitives. We hypothesize that the order of scanline conversion will affect the rasterization performance between tall thin triangles and long thick triangles of similar size. For example, if the graphics card used vertical scanline conversion, then tall thin triangles, which require fewer vertical scanlines, would perform better than long thick triangles.
In our experiments, we expect the following:
- Rasterization performance for different triangle shapes of similar size depends on the order of scanline conversion. For example, rasterization performance may be better for long thick triangles than for tall thin triangles when horizontal scanline conversion is used, because a long thick triangle spans fewer horizontal scanlines than vertical scanlines.
- Rasterization performance is the same for long thick triangles and tall thin triangles at small triangle sizes. This is because geometry processing dominates for small triangles, so the scanline conversion method has little effect.
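The scanline argument can be illustrated by counting how many spans each traversal order needs for triangles of equal area. The Python sketch below is a rough model under stated assumptions: a right-angled triangle, one span per pixel row (horizontal order) or per pixel column (vertical order), and a hypothetical helper name `span_stats`.

```python
import math

def span_stats(width, height):
    """Span counts and average span lengths for a right-angled triangle."""
    area = 0.5 * width * height
    rows = math.ceil(height)   # spans needed by horizontal scanline order
    cols = math.ceil(width)    # spans needed by vertical scanline order
    return {
        "rows": rows,
        "avg_row_len": area / rows,
        "cols": cols,
        "avg_col_len": area / cols,
    }

# Two triangles with the same area (50 pixels) but different shapes.
print(span_stats(20, 5))   # long thick: 5 rows of ~10 px, 20 columns of ~2.5 px
print(span_stats(5, 20))   # tall thin: 20 rows of ~2.5 px, 5 columns of ~10 px
```

If each span carries a fixed setup cost, the order that produces fewer, longer spans for a given shape should rasterize it faster.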
Setup of experiment
Our experiment involves benchmarking the fillrate when rasterizing a series of triangle sizes having different shapes. For our purpose, we decided to investigate the fillrate performance for triangle areas between 3 and 100 pixels. All the triangles in our benchmark are right-angled triangles for ease of implementation.
For each triangle size, the
benchmark calculates a series of height and width values to provide a
spread of ratios. As such for each triangle size, we have a spread of
different triangle shapes identified by their ratios. For each ratio in
the series, the benchmark program renders a triangle mesh containing a
constant number of triangles and calculates the corresponding fillrate
for the triangle size and shape.
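For a right-angled triangle, the width and height follow directly from the target area and ratio: area = width x height / 2 and ratio = height / width give width = sqrt(2 x area / ratio). The Python sketch below shows this calculation together with a fillrate formula; the function names are hypothetical and the example numbers are not measurements.

```python
import math

def dims_for(area_px, ratio):
    """Width and height of a right-angled triangle with the given area and ratio."""
    width = math.sqrt(2.0 * area_px / ratio)
    height = ratio * width
    return width, height

def fillrate_mpix(num_triangles, area_px, elapsed_seconds):
    """Fillrate in megapixels per second for one rendering run."""
    return num_triangles * area_px / elapsed_seconds / 1e6

# Same area, different shapes.
for ratio in (0.25, 1.0, 4.0):
    print(ratio, dims_for(50, ratio))
```

For an area of 50 pixels, a ratio of 4 yields a 5 x 20 (tall thin) triangle and a ratio of 0.25 yields a 20 x 5 (long thick) one.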
We modified the original gfxbench
source code for our purpose in this experiment. The source code for our
part 2a benchmark is available.
Analysis of results
From the results, we plot graphs of fillrate against ratio, grouped by triangle area size from small to large.

Figure 5 - Fillrate for triangle sizes 50 to 100

Figure 6 - Fillrate for triangle sizes 10 to 15

Figure 7 - Fillrate for triangle sizes 3 to 9
Comparing the graphs, we observe the following:
- The fillrate changes more slowly with ratio in the small triangle graphs than in the large triangle graph. This is consistent with our results in Part 1, where vertex processing time dominates for small triangles and fragment processing time dominates for large triangles.
- The display of large triangles is dominated by rasterization performance, as observed in Part 1. In Part 2, we observe that large, tall, thin triangles with large ratios have a better fillrate than large, short, thick triangles. This supports our hypothesis that the order of scanline conversion affects rasterization performance for large triangles having similar areas but different shapes.
Part 2D - Vertex Engine
In most computer graphics applications, geometric primitives are represented by their vertices. To provide fast performance, a graphics pipeline therefore benefits from a cache for storing vertices. Typically, these vertex values are stored after the transformation and lighting operations.
To determine the vertex cache size, we designed a test involving a mesh of vertex-sharing triangles. Additionally, triangles are presented to the hardware in strip mode so that the vertices are accessed mostly sequentially, revisiting only recently accessed ones to improve vertex cache efficiency.
For example, instead of accessing vertices in the order <1,2,3; 4,5,6; 7,8,9>, the strip accesses would be <1,2,3; 3,2,4; 4,3,5; 5,4,6>.
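The two access orders in the example can be generated programmatically. The Python sketch below reproduces the sequences quoted above and compares how many vertex references each one makes per distinct vertex; the function names are hypothetical and the code is not taken from gfxbench.

```python
def independent_triangles(num_triangles):
    """Triangles that share no vertices: (1,2,3), (4,5,6), ..."""
    return [(3 * i + 1, 3 * i + 2, 3 * i + 3) for i in range(num_triangles)]

def strip_triangles(num_vertices):
    """Strip-order triangles: each new vertex reuses the previous two."""
    tris = [(1, 2, 3)] if num_vertices >= 3 else []
    for n in range(4, num_vertices + 1):
        tris.append((n - 1, n - 2, n))
    return tris

def distinct_vertices(tris):
    """Number of different vertices referenced by a triangle list."""
    return len({v for tri in tris for v in tri})

print(independent_triangles(3))   # [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
print(strip_triangles(6))         # [(1, 2, 3), (3, 2, 4), (4, 3, 5), (5, 4, 6)]
```

Four strip triangles reference only 6 distinct vertices, whereas four independent triangles would reference 12, so most strip references can be served from a vertex cache.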
We modified the gfxbench source code and ran the test with different values for the mesh size and the triangle size. Specifically, for a given triangle edge length (in pixels), we varied the dimension of a dim x dim grid, with dim ranging from 1 to 25 in increments of 1. This was then passed as a function parameter to measure the triangle rate for a given size.

Figure 8 - Triangle rate versus number of values cached (triangle edge length 1 to 6 pixels)

Figure 9 - Triangle rate versus number of values cached (triangle edge length 7 to 10 pixels)
Looking at the graphs, we infer that the vertex cache size is 16 lines. Initially, there is a performance boost due to hits in the vertex cache. We can see a transition point when the number of cached values reaches 16, where the triangle rate peaks. Exceeding this size even by one vertex drastically decreases performance because caching becomes less effective overall, and the triangle rate later stabilizes around that value. Each vertex uses a total of 8 bytes for storing the x and y coordinates, which are 4 bytes each. An interesting thing to note here is that each cache line is 4 bytes, so x and y are stored in an interleaved fashion. This means that at most 8 vertices, each represented as an (int,int) coordinate pair, would be in the cache at any time.
This experimental setup also validates our results for Part 1. When we varied the triangle edge length in pixels, the vertex cache determined the overall efficiency only up to 5 pixels, i.e. while geometry dominates the rendering pipeline. Beyond an edge length of 6 pixels, rasterization work becomes dominant and thus the presence of a buffer for caching vertices no longer affects overall performance.
The cache replacement policy is, or is similar to, FIFO. We inferred this from observing that performance increased when vertices were accessed in sequential order vis-à-vis a random order.
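The sharp drop just past the cache size is what a FIFO policy would predict. The Python sketch below simulates a FIFO vertex cache over an index stream and counts misses; it is a software model for illustration, with hypothetical function names and an assumed capacity of 16 entries, and says nothing about how the hardware is actually implemented.

```python
from collections import deque

def fifo_misses(indices, cache_size):
    """Count cache misses for a vertex index stream under FIFO replacement."""
    order = deque()      # insertion order, oldest entry on the left
    resident = set()     # indices currently held in the cache
    misses = 0
    for idx in indices:
        if idx in resident:
            continue     # hit: FIFO does not reorder entries on access
        misses += 1
        if len(order) == cache_size:
            resident.discard(order.popleft())  # evict the oldest entry
        order.append(idx)
        resident.add(idx)
    return misses

def two_pass_misses(num_vertices, cache_size=16):
    """Misses when the same run of vertices is visited twice in sequence."""
    stream = list(range(num_vertices)) * 2
    return fifo_misses(stream, cache_size)

print(two_pass_misses(16))  # 16: the second pass hits every time
print(two_pass_misses(17))  # 34: one vertex too many and every access misses
```

In this model, revisiting 16 vertices costs no additional misses, while revisiting 17 evicts each vertex just before it is needed again.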
Important Notes:
- Vertex caches work best with indexing
- Vertex cache size is close to 16
- Cache replacement policy is close to FIFO
The benefits of a vertex cache are clearly evident, since performance improves up to the vertex cache size of 16. This is the limit: beyond it, we see a dramatic decrease as the mesh size increases. Note that to exploit the benefits of caching, accesses should be ordered to reuse recently cached vertices.
The source code for our part 2d
benchmark is available.
Part 2E - Graphics Subsystem - A Discussion
We only provide a discussion about the design of the benchmark for this section:
Modern graphics hardware allows vertex data to be placed in AGP memory or onboard video memory for increased geometry performance. We designed a benchmark that uses NVIDIA's "VertexArrayRange" extension. It is intended to compare the performance of using AGP or video memory for vertex data against regular malloc'ed memory.
We expect significant performance gains from using video or AGP memory compared to regular malloc'ed memory. Our hypothesis rests largely on the following:
- Onboard video memory typically consists of very high performance memory chips. In addition, onboard video memory can be accessed directly by the GPU at very high speeds. This translates to very low latency and high bandwidth.
- The AGP bus provides a point-to-point connection to a portion of system memory. In addition, AGP provides pipelined access to memory. Altogether, AGP allows the graphics controller to access system memory at a very high bandwidth. Although system memory exhibits a higher latency, we expect memory addressing mechanisms such as pipelined memory access to minimize its effects.
- System memory typically consists of mid-range memory chips. These chips may exhibit higher latencies, and the system bus, which is contended for by multiple peripherals, provides limited bandwidth.
Setup for benchmark
Our benchmark attempts to draw a fixed number of triangles of fixed edge length using a triangle mesh under different conditions. These conditions are similar to those in Part 1.
1. Smooth-shaded
2. Smooth-shaded and lit
3. Textured
4. Textured and lit
Different conditions affect the size of each vertex. As such, the memory required to model the triangle mesh varies with the condition. Our benchmark uses different memory sizes to measure differences in memory performance.
The memory requirements for different conditions are listed in the following table.
Condition | Vertex Size | Vertex Count | Memory Requirement
Smooth shaded | 24 | 160000 | 3840000
Smooth shaded and lit | 36 | 160000 | 5760000
Textured | 36 | 160000 | 5760000
Textured and lit | 48 | 160000 | 7680000
- The memory requirements above assume a triangle mesh of 400 x 400 triangles.
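The memory requirement in each row is simply the vertex size multiplied by the vertex count. The short Python check below recomputes the column; it assumes the sizes are in bytes, which the table does not state explicitly, and the names used are illustrative.

```python
# Vertex sizes per condition, taken from the table (assumed to be in bytes).
VERTEX_SIZES = {
    "Smooth shaded": 24,
    "Smooth shaded and lit": 36,
    "Textured": 36,
    "Textured and lit": 48,
}
VERTEX_COUNT = 400 * 400  # 160000 vertices

def memory_requirement(vertex_size, vertex_count=VERTEX_COUNT):
    """Total memory needed to hold all vertices of the mesh."""
    return vertex_size * vertex_count

for condition, size in VERTEX_SIZES.items():
    print(condition, memory_requirement(size))
```

The largest case, textured and lit, needs 48 x 160000 = 7680000 units of memory, twice the smooth shaded case.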
For each condition, we will measure the triangle fillrate. Subsequently, we will compare the results for the conditions using different memory systems.
We expect the benchmark results to expose the performance characteristics of the different memory systems.