Wednesday, August 28, 2013

Evolution of the Graphics Pipeline: a History in Code (part 2)

Or, it was about time :)

We ended part 1 on quite a cliffhanger, just after nVidia's NV20 chip introduced vertex programmability in the graphics pipeline; for a bit of historical context, this happened around March 2001, while ATI's then-current offering, the R100, had no support for it.

To better understand what happened next, we'll go a bit in detail about how the introduction of vertex programmability changed the existing fixed function pipeline.

First steps towards a programmable pipeline


As seen in part 1, vertex stage programmability allowed full control over how vertices were processed, which also meant taking care of T&L in the vertex program. So, going back to our pipeline diagram, enabling vertex programs resulted in replacing the hard-wired "Transform and Lighting" stage with a custom "Vertex program" stage:


Even though vertex programs were effectively a superset of the fixed T&L functionality, the NV20 still hosted the tried and true fixed pipeline hardware next to the new programmable vertex units, instead of going the route of emulating T&L with vertex programs.

While this might seem like a waste of silicon area, keep in mind that the hardwired logic of the fixed pipeline had been hand-tuned for performance over the previous three or so years, and programmable units were not as fast as fixed function ones when performing the very same task. It's a well-known optimization tradeoff: you give away raw performance to gain customizability, and considering that all existing applications at the time were fixed function based, hardware vendors made sure to keep them working great, while at the same time paving the road for next-gen applications.

Sidenote: while keeping the fixed functionality hardware made sense for the desktop world of that time, a much different balance was adopted by nVidia with the NV2A, a custom GPU made for the Xbox game console, which dropped the fixed function hardware and used the spare silicon space to host more vertex units. Of course, such a decision is only affordable when you don't have legacy applications to start with :)

Now let's look again at the pipeline diagram. The fragment stage, which comes next in the pipeline, is much more complex than the vertex stage, and also much more performance critical: when you draw a triangle on screen, the vertex logic is invoked a grand total of three times, but the fragment logic is invoked for each and every pixel covered by the triangle. A stall in the fragment pipeline meant bad news for the framerate, much more than one in the vertex pipeline. This is why programmability was introduced in the vertex stage first: vendors wanted to test their architecture on the less performance-critical area and iron out the wrinkles before moving on to the fragment stage.

Despite the precautions and the experience gained with the vertex units, the first GPUs with support for fragment programmability were commercial failures for both main vendors of the time (ATI's R200 in October 2001, nVidia's NV30 in 2003), while APIs entered a dark era of vendor-specific extensions and false standards.

History context (or: fragment programming in OpenGL, first rough drafts)


Limited forms of fragment programmability had been present in graphics adapters and available to OpenGL applications since 1998 with the nVidia NV4, commercially branded as Riva TNT (short for TwiN Texel), which had the ability to fetch texels from two textures at once and feed them to hardware combiners, which implemented a small set of operations on texels.

The interested reader can consult the documentation of the GL_NV_texture_env_combine4 extension.
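To give a rough idea of that programming model, everything was configured through texture environment state. Here's a minimal sketch of two textures being modulated together, using the base EXT_texture_env_combine tokens that GL_NV_texture_env_combine4 builds upon (the texture objects and unit assignment are made up for illustration):

// Unit 0: output the first texture unchanged
glActiveTextureARB(GL_TEXTURE0_ARB);
glBindTexture(GL_TEXTURE_2D, baseTexture);
glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_REPLACE);

// Unit 1: combine the previous result with a second texture
glActiveTextureARB(GL_TEXTURE1_ARB);
glBindTexture(GL_TEXTURE_2D, detailTexture);
glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_COMBINE_EXT);
glTexEnvi(GL_TEXTURE_ENV, GL_COMBINE_RGB_EXT,  GL_MODULATE);
glTexEnvi(GL_TEXTURE_ENV, GL_SOURCE0_RGB_EXT,  GL_PREVIOUS_EXT);
glTexEnvi(GL_TEXTURE_ENV, GL_OPERAND0_RGB_EXT, GL_SRC_COLOR);
glTexEnvi(GL_TEXTURE_ENV, GL_SOURCE1_RGB_EXT,  GL_TEXTURE);
glTexEnvi(GL_TEXTURE_ENV, GL_OPERAND1_RGB_EXT, GL_SRC_COLOR);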

Texture shaders and register combiners


The NV20 expanded on the hardware combiners concept quite a bit. Input for the fragment stage came from 4 texture units, collectively called "texture shaders", and was then processed by 8 configurable stages, collectively called "register combiners". When enabled, they effectively replaced the "Texture Environment", "Colour Sum" and "Fog" stages in our pipeline diagram.

In the end you had an extensive set of pre-baked texture and color operations that you could mix and match via the OpenGL API. They still didn't implement conditional logic though, and accessing them through OpenGL was, um, difficult. A typical code snippet configuring texture shaders and register combiners looked like this:

// Texture shader stage 0: bump normal map, dot product
glTexEnvi(GL_TEXTURE_SHADER_NV, GL_SHADER_OPERATION_NV,
          GL_DOT_PRODUCT_NV);
glTexEnvi(GL_TEXTURE_SHADER_NV, GL_PREVIOUS_TEXTURE_INPUT_NV,
          GL_TEXTURE0_ARB);

// Register combiner stage 0: input texture0, output passthrough
glCombinerInputNV(GL_COMBINER0_NV, GL_RGB, GL_VARIABLE_A_NV,
                  GL_TEXTURE0_ARB, GL_UNSIGNED_IDENTITY_NV, GL_RGB);
glCombinerOutputNV(GL_COMBINER0_NV, GL_RGB, GL_SPARE0_NV,
                   GL_DISCARD_NV, GL_DISCARD_NV, GL_NONE, GL_NONE,
                   GL_FALSE, GL_FALSE, GL_FALSE);

The relevant OpenGL extensions were GL_NV_texture_shader and GL_NV_register_combiners. In Direct3D parlance, supporting those extensions would roughly translate to supporting "Pixel Shader 1.0/1.3".

First programmable fragments


ATI's R200 GPU (commercially branded as Radeon 8500), released in October 2001, marked the arrival of the first truly programmable fragment stage. You had texture fetches, opcodes to manipulate colors and conditional logic. This is where the API mayhem started.

ATI released its proprietary OpenGL extension to access those features at the same time the Radeon 8500 hit the market. If using nVidia's register combiners and texture shader extensions was difficult, using ATI's GL_ATI_fragment_shader extension was painful.

Each instruction of the shader had to be specified between Begin/End blocks, in the form of one or more calls to OpenGL C API entry points. This was so awkward that in the end many fragment programs were written with the intended instruction as a comment and the barely-readable GL command below it, e.g.:

// reg2 = |light to vertex|^2
glColorFragmentOp2ATI(GL_DOT3_ATI, GL_REG_2_ATI, GL_NONE, GL_NONE,
                      GL_REG_2_ATI, GL_NONE, GL_NONE,
                      GL_REG_2_ATI, GL_NONE, GL_NONE);

In case you were wondering: GL_DOT3_ATI is the constant selecting a dot product operation, GL_REG_2_ATI indicates that we want to work on register 2 - which had to be populated by a previous call - and the GL_NONE occurrences are placeholders for arguments not needed by the DOT3 operation.
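For context, those calls had to live between a Begin/End pair, with the shader object created and bound beforehand; here's a minimal sketch of the surrounding boilerplate (the register and texture unit choices are just for illustration):

GLuint shader = glGenFragmentShadersATI(1);
glBindFragmentShaderATI(shader);

glBeginFragmentShaderATI();

// Sample the texture on unit 0 into register 0, using unit 0's texcoords
glSampleMapATI(GL_REG_0_ATI, GL_TEXTURE0_ARB, GL_SWIZZLE_STR_ATI);

// ... glColorFragmentOp*ATI / glAlphaFragmentOp*ATI calls like the one above ...

glEndFragmentShaderATI();

// At draw time
glEnable(GL_FRAGMENT_SHADER_ATI);
glBindFragmentShaderATI(shader);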

While not different in concept from the API used for texture shaders and register combiners, in that case the single-function-catches-all C style was used to choose among a limited set of states, while in this case it was used to map a fairly generic assembly language.

How the API did(n't) adapt


So, life wasn't exactly great if you were an OpenGL programmer in late 2001 and you wanted to take advantage of fragment processing:

  • there was no ARB standard to define what a cross-vendor fragment program was even supposed to be like;
  • meanwhile, you couldn't "just wait" and hope for things to get better soon, so most developers rolled their own tools to assist in writing fragment programs for the R200, either as C preprocessor macros or offline text-to-C compilers;
  • programs written for the R200 could not run on the NV20 by any means, because each chip had features the other lacked: this effectively forced you to write your effects once for the R200 and then a watered-down version for the NV20.

About this last point, one might have been tempted to just forget about the NV2x architecture; after all, everybody would buy an R200 based card within months, right? Well, there was a problem with this: the NV25 (commercial name GeForce 4 Ti), an incremental evolution of the NV20 still without true fragment programmability, was meanwhile topping the performance charts; the R200, on the other hand, had the great new features everybody was excited about, but existing games weren't using them yet and ran better on the previous generation chip.

On a sidenote, Direct3D didn't deal much better with such vendor fragmentation: Microsoft had previously taken NV20 register combiners and texture shaders and made the Pixel Shader 1.0/1.3 "standards" out of them. Then, when the R200 was introduced, Microsoft updated the format to 1.4, which was in turn a copy-paste of ATI's R200 specific hardware features.

The Direct3D Pixel Shader format had however a clear advantage over the OpenGL state of things: as the R200 hardware features were a superset of the NV2x's, Microsoft took the time to implement an abstraction layer that allowed running a PS 1.0/1.3 program on both NV2x and R200 chips (running a PS 1.4 program on the NV2x, instead, was impossible due to the lack of the required hardware features).

nVidia did try to ease life for OpenGL programmers by releasing NVParse, a tool that allowed using Direct3D Pixel Shader 1.0/1.1 programs in an OpenGL application. It had however some restrictions in its PS 1.0 and 1.1 support that hampered its adoption, and it was never updated to support PS 1.2.

On another sidenote, a saner text-format based shader language for the R200 chip only came with the GL_ATI_text_fragment_shader extension, which was introduced no less than six months later and not actually implemented in Radeon drivers until early 2003.

It was mainly seen as a "too little, too late" addition, and never saw widespread adoption.

There's an extension born every minute


To add to the already existing mayhem, both nVidia and ATI took care to expose the hardware features present in each new iteration of their chips to OpenGL via vendor-specific extensions, whose number kept increasing over time; in fact, most of the GL_NV and GL_ATI extensions in the OpenGL registry date back to this timeframe. Meanwhile, the ARB consortium, responsible for dictating OpenGL standards, was trying to extend the fixed function pipeline, adding parameters and configurability via yet more extensions, seemingly postponing the idea of a programmable pipeline altogether, when that was clearly the direction graphics hardware was taking.

On a historical sidenote, the GL_ARB_vertex_program extension, which we covered in the first article of this series, was actually released together with GL_ARB_fragment_program, and vertex programming APIs were similarly vendor-fragmented until then; yet vertex programmability had much less momentum, among both vendors and developers, so the fragmentation never grew into a comparable problem.

As an indicator, consider that until then each vendor had a grand total of one vendor-specific extension covering vertex programmability (nVidia had GL_NV_vertex_program, whose representation of shaders was text based, and ATI had GL_EXT_vertex_shader, which used C function calls to describe shaders). Compare that to the literally tens of vendor-specific extensions released for fragment programmability, and you get the idea.

Light at the end of the tunnel


It should be clear that the situation of the OpenGL programmable pipeline was messy to say the least at this point. It's roughly September 2002 when finally, though undoubtedly too late, the ARB consortium finalized and released the GL_ARB_fragment_program extension. It was what the OpenGL world had been in dire need of during the previous year: a cross-vendor, text-based format to program the fragment stage.

It should be noted that, while GL_ARB_fragment_program was intended to end the API hell OpenGL had fallen into during the previous year, it could be implemented in hardware neither by the NV20 series (no surprise) nor by the R200 series. This was more or less expected: the ARB understood that they had missed that generation's train, and that it made more sense to provide a uniform programming model for the next one.

The technology know-how gathered by ATI with the less-than-successful R200 allowed them to develop the first GPU that could support the new extension even before it was finalized: the R300 (commercially branded as Radeon 9700 Pro) was in fact released by ATI in August 2002. ATI was ahead of the competition this time, while nVidia's much awaited NV30 chip only reached the shelves five months later (January 2003, under the commercial name GeForce FX), and for various reasons it was no match for the R300 performance-wise. History repeated itself here: just as the R200 had been for ATI, nVidia's first GPU with true fragment programmability was a market failure as well.

Fragment programs in OpenGL


What GL_ARB_fragment_program brought to the table was a quite expressive assembly language, where vectors and their related operations (trig functions, interpolation, etc...) were first-class citizens.

From an API point of view, it was pretty similar to GL_ARB_vertex_program: programs had to be bound to be used, they had access to OpenGL state, and the host application could feed parameters to them.
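Here's a minimal sketch of what the host-side setup looked like (error checking omitted; programText is assumed to hold the NUL-terminated "!!ARBfp1.0 ... END" assembly source):

GLuint prog;
glGenProgramsARB(1, &prog);
glBindProgramARB(GL_FRAGMENT_PROGRAM_ARB, prog);

// Hand the assembly text over to the driver, which assembles it on the spot
glProgramStringARB(GL_FRAGMENT_PROGRAM_ARB, GL_PROGRAM_FORMAT_ASCII_ARB,
                   (GLsizei)strlen(programText), programText);

// Feed a parameter that the program can read as program.env[0]
glProgramEnvParameter4fARB(GL_FRAGMENT_PROGRAM_ARB, 0, 1.0f, 0.5f, 0.25f, 1.0f);

// Enable the programmable fragment stage before drawing
glEnable(GL_FRAGMENT_PROGRAM_ARB);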

From a pipeline point of view, fragment programs, when enabled, effectively replace the "Texture Environment", "Colour Sum", "Fog" and "Alpha test" stages:


This is a selection of fragment attributes accessible by fragment programs:

Fragment Attribute Binding  Components  Underlying State
--------------------------  ----------  ----------------------------
fragment.color              (r,g,b,a)   primary color
fragment.color.secondary    (r,g,b,a)   secondary color
fragment.texcoord           (s,t,r,q)   texture coordinate, unit 0
fragment.texcoord[n]        (s,t,r,q)   texture coordinate, unit n
fragment.fogcoord           (f,0,0,1)   fog distance/coordinate
fragment.position           (x,y,z,1/w) window position

And here's the full instruction set of version 1.0 of the assembly language:

Instruction    Inputs  Output   Description
-----------    ------  ------   --------------------------------
ABS            v       v        absolute value
ADD            v,v     v        add
CMP            v,v,v   v        compare
COS            s       ssss     cosine with reduction to [-PI,PI]
DP3            v,v     ssss     3-component dot product
DP4            v,v     ssss     4-component dot product
DPH            v,v     ssss     homogeneous dot product
DST            v,v     v        distance vector
EX2            s       ssss     exponential base 2
FLR            v       v        floor
FRC            v       v        fraction
KIL            v       none     kill fragment
LG2            s       ssss     logarithm base 2
LIT            v       v        compute light coefficients
LRP            v,v,v   v        linear interpolation
MAD            v,v,v   v        multiply and add
MAX            v,v     v        maximum
MIN            v,v     v        minimum
MOV            v       v        move
MUL            v,v     v        multiply
POW            s,s     ssss     exponentiate
RCP            s       ssss     reciprocal
RSQ            s       ssss     reciprocal square root
SCS            s       ss--     sine/cosine without reduction
SGE            v,v     v        set on greater than or equal
SIN            s       ssss     sine with reduction to [-PI,PI]
SLT            v,v     v        set on less than
SUB            v,v     v        subtract
SWZ            v       v        extended swizzle
TEX            v,u,t   v        texture sample
TXB            v,u,t   v        texture sample with bias
TXP            v,u,t   v        texture sample with projection
XPD            v,v     v        cross product

It's easy to see why fragment programming was such a breakthrough in real-time computer graphics. It made it possible to describe arbitrarily complex graphics algorithms and run them in hardware for each fragment rendered by the GPU. From fragment programs on, OpenGL had a standard that was no longer a "really very customizable" graphics pipeline: it was a truly programmable one.

For reasons already discussed in part 1, fragment programs didn't include branching logic, so you had to make do with conditional instructions such as SLT and CMP.

Programming the torus' fragments


As a good first step in the realm of fragment programs, let's reimplement the fixed function texture lookup trick that we used in Part 1. The code is very simple:

!!ARBfp1.0

# Just sample the currently bound texture at the fragment's
# interpolated texcoord
TEX result.color, fragment.texcoord, texture[0], 1D;

END

What's happening here? If you recall how we implemented the cel shading effect in Part 1, the idea was to compute the light intensity for a given vertex, and use it as the coordinate into a lookup texture, which mapped intensities to discretized values. So, after the hardware interpolators have done their job, control is passed to our fragment program, which receives the interpolated texture coordinate. We just have to perform the actual sampling and we're done. That's exactly what was going on behind the scenes with fixed functionality.
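For reference, the lookup texture itself is just a tiny 1D texture built on the host; here's a sketch of what it could look like (the texel values and size are illustrative, not the exact ones used in Part 1, and rampTexture is a texture object name previously obtained from glGenTextures):

// Three-tone ramp packed into a power-of-two 1D texture: dark, mid, bright
GLubyte ramp[] = { 51, 51, 51,   128, 128, 128,   255, 255, 255,   255, 255, 255 };

glBindTexture(GL_TEXTURE_1D, rampTexture);
glTexImage1D(GL_TEXTURE_1D, 0, GL_RGB, 4, 0, GL_RGB, GL_UNSIGNED_BYTE, ramp);

// Nearest filtering keeps the bands sharp instead of blending them
glTexParameteri(GL_TEXTURE_1D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_1D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);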

One fragment at a time...


So, now that we have full control over the fragment stage, we can get rid of the lookup texture altogether, and do the threshold checks for each displayed fragment of the model.

This will not improve the rendering in any noticeable way, as we're just moving to the GPU the same discretization code that we previously ran on the CPU in part 1; this time, however, we obtain the final color with a real-time computation instead of a fetch from a precalculated lookup table, and this will also serve as a first step for further improvements. Let's step into our fragment program:

!!ARBfp1.0

PARAM bright = { 1.0, 1.0, 1.0, 1.0 };
PARAM mid    = { 0.5, 0.5, 0.5, 1.0 };
PARAM dark   = { 0.2, 0.2, 0.2, 1.0 };
PARAM c   = { 0.46875, 0.125, 0, 0 };

TEMP R0;
TEMP R1;

Nothing new here: ARB-flavour fragment program version 1.0, three constant parameters to hold the final colors, one to hold the threshold values, and a couple of temporaries. Here's where the real program starts:

# Thresholds check, xy: 11 dark, 10 mid, else bright
SLT R1, fragment.texcoord.xxxx, c;

The SLT instruction stands for "Set on Less Than": for each component, it checks whether the second argument is less than the third, and sets the corresponding component of the first argument to 1.0 (true) or 0.0 (false).

Thanks to swizzling, we can broadcast the light intensity all over the second argument, and therefore test both thresholds with a single instruction. As we have only two thresholds, we only care about the x and y components; had we four thresholds, we could have used the same trick to get up to four checks performed by a single instruction.

# R0 = R1.x > 0.0 ? mid : bright
CMP R0, -R1.x, mid, bright;

# R0 = R1.y > 0.0 ? dark : R0
CMP R0, -R1.y, dark, R0;

MOV result.color, R0;
END

It's now time to set the desired color according to our threshold checks. But... how do you choose between the three possible outcomes if you don't have a jump instruction? Well, in this particular case you can flatten the comparisons: that is, always execute both of them, and keep the first one's result if the second one doesn't pass. Note that we have to negate our arguments, because the CMP instruction actually checks whether its second argument is less than zero.

Full fragment control


Until now we've computed the light intensity at each vertex, and interpolated those values to get the fragments in between. The drawback of this approach is that a lot of information is lost in the process: consider the case where two adjacent vertices receive the same light intensity, but have two different normals. Interpolating the intensity values yields a constant color all over; what we would like instead is to interpolate the normals, to get a proper shading gradient between those vertices:


With a fully programmable fragment stage, we can move the light calculation per-fragment, offloading the vertex program and moving that logic into the fragment program. This was effectively impossible with fixed functionality.

First, the revised vertex program:

!!ARBvp1.0

# Current model/view/projection matrix
PARAM mvp[4]    = { state.matrix.mvp };

# Transform the vertex position
DP4 result.position.x, mvp[0], vertex.position;
DP4 result.position.y, mvp[1], vertex.position;
DP4 result.position.z, mvp[2], vertex.position;
DP4 result.position.w, mvp[3], vertex.position;

# Make vertex position and normal available to the interpolator
# through texture coordinates
MOV result.texcoord[0], vertex.position;
MOV result.texcoord[1], vertex.normal;

END

The new vertex program is only there to pass the current vertex position and normal on to the fragment program. The hardware interpolator will generate per-fragment versions of these values, allowing us to produce a much smoother rendering of light transitions.

!!ARBfp1.0

PARAM lightPosition = program.env[0];

PARAM bright = { 1.0, 1.0, 1.0, 1.0 };
PARAM mid    = { 0.5, 0.5, 0.5, 1.0 };
PARAM dark   = { 0.2, 0.2, 0.2, 1.0 };
PARAM c   = { 0.46875, 0.125, 0, 0 };

TEMP R0;
TEMP R1;
TEMP R2;

Accordingly, we have to move the actual light intensity calculation into the fragment program, which starts with a very familiar prologue, with the added information of the light position in the scene.

# Vector from the fragment position to the light
TEMP lightVector;
SUB lightVector, lightPosition, fragment.texcoord[0];

# Normalize it: w = 1/|lightVector|, xyz = lightVector scaled by w
TEMP normLightVector;
DP3 normLightVector.w, lightVector, lightVector;
RSQ normLightVector.w, normLightVector.w;
MUL normLightVector.xyz, normLightVector.w, lightVector;

# Interpolated normal, renormalized the same way
TEMP iNormal;
MOV iNormal, fragment.texcoord[1];

TEMP nNormal;
DP3 nNormal.w, iNormal, iNormal;
RSQ nNormal.w, nNormal.w;
MUL nNormal.xyz, nNormal.w, iNormal;

# Store the final intensity in R2.x
DP3 R2.x, normLightVector, nNormal;

This is where the actual intensity computation is performed. The two normalization steps (you can spot them instantly at this point, can't you?) are required because interpolated vectors are not normalized, even if the starting and ending vectors are, so you have to normalize them yourself if you want the final light computation to make sense.

# Thresholds check, xy: 11 dark, 10 mid, else bright
SLT R1, R2.xxxx, c;

# R0 = R1.x > 0.0 ? mid : bright
CMP R0, -R1.x, mid, bright;

# R0 = R1.y > 0.0 ? dark : R0
CMP R0, -R1.y, dark, R0;

MOV result.color, R0;
END

This closing part should already be familiar to you :)
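On the host side, the only new requirement with respect to the previous program is feeding the light position into program.env[0]; a one-line sketch (lightPos is a placeholder for the light's position, expressed in the same space as the positions we pass through texcoord[0]):

glProgramEnvParameter4fARB(GL_FRAGMENT_PROGRAM_ARB, 0,
                           lightPos[0], lightPos[1], lightPos[2], 1.0f);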

The rendering improvement using per-fragment light computation is noticeable even on our very simple object, especially if we zoom on the transition line between highlight and midtone:


Wrapping it up


My initial plan was to end the series with part 2, but no history of fragment programming would be complete without the obligatory report of the API mayhem that happened in the early 2000s, and that took more room than I expected to cover fully.

In the next part we'll walk through the rise of compilers that first allowed developers to produce assembly programs from higher level languages, and how ultimately a compiler for such a language became part of the OpenGL standard.

Webography


http://programmers.stackexchange.com/questions/60544/why-do-game-developers-prefer-windows
http://www.arcsynthesis.org/gltut/History%20GeForceFX.html
http://www.techspot.com/article/657-history-of-the-gpu-part-3/