Sunday, September 29, 2013

The shade of things to come

Update (15/10/2013): Microsoft has just announced via an official blog post that the Xbox One will not support Mantle as a programming API.

At GPU14 AMD announced the new Mantle low-level API for their GCN architecture. Anandtech has a pretty good analysis of the presentation, along with some insights of what Mantle could mean for the GPU market.

The announcement left me shocked, but not because I was not expecting such a move; far from it. Actually, I had foreseen this some months ago, and I was pleased to see that I got pretty much every detail right... except for one. A very crucial one.

The landscape above Mantle

We still know very little about the Mantle API.  AMD claims it will allow developers to implement performance-critical sections closer to the metal on their GCN chips, without the performance penalty of higher-level APIs like OpenGL and Direct3D.

AMD claims that this is a growing need among graphics developers, and I totally copy that; with the "G" in GPU progressively shifting from meaning "Graphics" to meaning "Generic", the abstraction gap between OpenGL and the underlying architecture has gradually grown.

Also, AMD is in an interesting market position nowadays, as both the Microsoft Xbox One and the Sony Playstation 4 will feature a GPU based on their GCN architecture.  The Nintendo Wii U hosts an AMD GPU as well, but it's not based on GCN.

The Mantle API is thus intended to squeeze every last drop of graphics performance from the Xbox One, the Playstation 4 and PCs equipped with an AMD GPU, and to make porting between these platforms easier.

Could the Mantle API be implemented for other GPUs as well?  Probably not while maintaining its intended prime focus on efficiency: from what can be inferred from the AMD presentation, the Mantle API is strictly tied to the GCN architecture.

History repeating?

The history of graphics programming has a notable precedent of a platform-specific API: Glide.  It was a low-level API that exposed the hardware of the 3dfx Voodoo chipset, and for a long time it was also the only way to access its functionalities.  Thanks to 3dfx' hardware popularity, Glide was widespread, with Glide-only game releases not being uncommon.

While the Mantle API is as platform-specific as Glide was, the present context is much different: OpenGL and Direct3D are now mature, and the ultimate purpose of the new project is likely to complement existing APIs, offering a fast path for critical sections, rather than replacing them.

The other side of today's gaming

It's clear that AMD is trying to leverage its position as a GPU supplier across the PC and console markets.  However, there's another GPU vendor that is going to be in a similar position in the near future: earlier this year, at a SIGGRAPH event that surprisingly got very little press, NVIDIA demonstrated its own Kepler GPU architecture running on Tegra 5, their next-generation low-power SoC aimed at mobile devices.

So, both vendors have one foot in the PC market, while keeping the other in the console market for AMD and in the mobile one for NVIDIA.  This is far from being a balanced situation, because the console market has shrunk massively during the last few years, and the momentum is on mobile gaming right now.  Asymco has a detailed summary of the current situation, along with some speculation about the future I don't fully agree with.

If NVIDIA manages to be a first choice for Steam Boxes, as it's clearly trying to be, and puts the Tegra 5 in a relevant base of mobile devices, the Kepler architecture would become the lingua franca between PC and mobile gaming, much as AMD hopes GCN will between PC and console gaming.

At that point introducing a platform specific API, both to offer developers the extra edge on their hardware and to lock out competitors, starts to make a lot of sense.

In the end, that was the detail I got wrong: which vendor that API would come from first :)

Sunday, September 1, 2013

Book review: OpenGL Development Cookbook

I was asked to review the latest book from Packt Publishing: OpenGL Development Cookbook.  This post is a slightly expanded version of the review I've already published on two major book review websites.

I had great expectations when I first opened this book.  In fact, I feel there is a big void right in the middle of published books about OpenGL.

At one side of the void there are either technical references or introductory texts, which explain to the reader how to properly use the library but don't show practical applications: at the end of those books people know how to perform a texture lookup from a vertex shader, not how to render realistic terrain from a height map.

At the other side there are collections of articles about very advanced rendering techniques, intended for people already well versed in graphics programming and hardly of any use for the everyday developer (think about the ShaderX or the GPU Pro series).

The premise of this book is to be the gap filler, which tells you about all the cool things you can do with OpenGL (in addition to rendering teapots) in a wide range of topics, while remaining practical enough for the average OpenGL developer.

While it's a good shot in that direction, it doesn't live up to this ambitious premise.

Let's start with what's good: recipes cover a vast range of applications, including mirrors, object picking, particle systems, GPU physics simulations, ray tracing, volume rendering and more.

The OpenGL version of choice is 3.3 core profile, so all the recipes are implemented using modern OpenGL while still being compatible with most GPU hardware out there.  Every recipe comes with a self-contained, working code example that you can run and tweak.  All examples share coding style and conventions, which is great added value.

The toolchain of choice is Visual Studio for Windows, but the examples also build unmodified on Linux installations.  Despite Mac OS X only supporting up to OpenGL 3.2, examples not requiring 3.3 features will build there as well with minor modifications (just be sure to use the included GLUT.framework rather than freeglut, as the latter relies on XQuartz, which isn't able to request an OpenGL 3 context).

Then, there's something that just doesn't work well.  First, the formatting of code excerpts is terrible: long lines wrap twice or thrice with no leading spaces, so without highlighting it's nigh impossible to read the code right at first glance:

glm::mat4 T = glm::translate( glm::mat4(1.0f),
glm::vec3(0.0f,0.0f, dist));
glm::mat4 Rx=glm::rotate(T,rX,glm::vec3(1.0f, 0.0f,
glm::mat4 MV=glm::rotate(Rx,rY,

Given that a good 30% of this book is code, this is really something that should be addressed in a second edition.

A somewhat deeper problem is about how recipes are presented.  Most of them dive directly into a step-by-step sequence of code snippets, taking little time to explain the required background and the overall idea behind the implementation.  On a related note, the book states that knowledge of OpenGL is just a "definite plus" to get through it, but after the very first recipe spends a total of three lines explaining what Vertex Array Objects are before jumping into code that uses them, it becomes clear that being proficient with OpenGL is a requirement to appreciate the book.

The quality of the recipes varies a lot through the book: the best written and most interesting ones are from chapters 6, 7 and 8, which comes as no surprise, as the author's research interests include the topics they cover.  I would have exchanged many of the earlier recipes, some of which are variations on the same theme, for techniques that both fit the recipe format and are relevant for any up-to-date rendering engine (depth of field, fur, caustics, etc...).  On a related note, I think that perhaps the single biggest flaw of the book is that it's written by a single author: to offer 50 great recipes, a cookbook needs several authors, each a master of her own trade and each offering the best of her knowledge.

In the end: if you're already well versed in OpenGL, have an interest in the specific topics best covered by the author, and you're going to read each recipe at the computer, where the code can be read comfortably, OpenGL Development Cookbook has something to offer.  While not the gap filler I was initially looking for, the learning opportunity from having a code example for each recipe is remarkable.

Wednesday, August 28, 2013

Evolution of the Graphics Pipeline: a History in Code (part 2)

Or, it was about time :)

We ended part 1 on quite a cliffhanger, just after the NV20 nVidia chip introduced vertex programmability in the graphics pipeline; to give a bit of historical context, this happened around March 2001, while ATI's then offering, the R100, had no support for it.

To better understand what happened next, we'll go a bit in detail about how the introduction of vertex programmability changed the existing fixed function pipeline.

First steps towards a programmable pipeline

As seen in part 1, vertex stage programmability allowed full control on how the vertices were processed, and this also meant taking care of T&L in the vertex program. So, going back to our pipeline diagram, enabling vertex programs resulted in replacing the hard-wired "Transform and Lighting" stage with a custom "Vertex program" stage:

Even though vertex programs effectively were a superset of the T&L fixed functionality, the NV20 still hosted the tried-and-true fixed pipeline hardware next to the new programmable vertex units, instead of going the route of emulating T&L with vertex programs.

While this might seem like a waste of silicon area, keep in mind that the hardwired logic of the fixed pipeline had been hand-tuned for performance during the last 3 or so years, and programmable units were not as fast as fixed function ones when performing the very same task. It's a known optimization tradeoff: you give away raw performance to get customizability, and considering that all existing applications at that time were fixed function based, hardware vendors made sure to keep them working great, while at the same time paving the road for next-gen applications.

Sidenote: while keeping the fixed-functionality hardware made sense for the desktop world of that time, a much different balance was adopted by nVidia with the NV2A, a custom GPU made for the Xbox game console, which did not host the fixed-function hardware and used the spare silicon space to host more vertex units. Of course, such a decision is only affordable when you don't have legacy applications to start with :)

Now let's look again at the pipeline diagram. The fragment stage, which comes next in the pipeline, is much more complex than the vertex stage, and also much more performance-critical: when you draw a triangle on screen, the vertex logic is invoked a grand total of three times, but the fragment logic is invoked for each and every pixel in the triangle. A stall in the fragment pipeline meant bad news for the framerate, much more than one in the vertex pipeline. This is why programmability was introduced in the vertex stage first: vendors wanted to test their architecture in the less performance-critical area, and iron out wrinkles before moving to the fragment stage.

Despite the precautions and the experience gained with the vertex units, the first GPUs with support for fragment programmability were commercial failures for both main vendors of the time (ATI's R200 in October 2001, nVidia's NV30 in 2003), while APIs entered a dark era of vendor-specific extensions and false standards.

History context (or: fragment programming in OpenGL, first rough drafts)

Limited forms of fragment programmability had been present in graphics adapters and available to OpenGL applications since 1998 with nVidia's NV4, commercially branded as Riva TNT (short for TwiN Texel), which had the ability to fetch texels from two textures at once and feed them to hardware combiners, which implemented a small set of operations on texels.

The interested reader can consult the documentation of the GL_NV_texture_env_combine4 extension.

Texture shaders and register combiners

The NV20 expanded on the hardware combiners concept quite a bit. Input for the fragment stage came from 4 texture units, collectively called "texture shaders", and went to be processed by 8 configurable stages, collectively called "register combiners". When enabled, they effectively replaced the "Texture Environment", "Colour Sum" and "Fog" stages in our pipeline diagram.

In the end you had an extensive set of pre-baked texture and color operations that you could mix and match via the OpenGL API. They still didn't implement conditional logic though, and accessing them through OpenGL was, um, difficult. A typical code snippet configuring texture shaders and register combiners looked like this:

// Texture shader stage 0: bump normal map, dot product
glTexEnvi(GL_TEXTURE_SHADER_NV, GL_SHADER_OPERATION_NV, GL_DOT_PRODUCT_NV);

// Register combiner stage 0: input texture0, output passthrough
// (reconstructed sketch; exact constants are illustrative)
glCombinerInputNV(GL_COMBINER0_NV, GL_RGB, GL_VARIABLE_A_NV,
                  GL_TEXTURE0_ARB, GL_UNSIGNED_IDENTITY_NV, GL_RGB);
glCombinerOutputNV(GL_COMBINER0_NV, GL_RGB,
                   GL_SPARE0_NV, GL_DISCARD_NV, GL_DISCARD_NV,
                   GL_NONE, GL_NONE,
                   GL_FALSE, GL_FALSE, GL_FALSE);

The relevant OpenGL extensions were GL_NV_texture_shader and GL_NV_register_combiners. In Direct3D parlance, supporting those extensions would roughly translate to supporting "Pixel Shader 1.0/1.3".

First programmable fragments

ATI's introduction of the R200 GPU (commercially branded as Radeon 8500) in October 2001 marked the arrival of the first truly programmable fragment stage. You had texture fetches, opcodes to manipulate colors, and conditional logic. This is where the API mayhem started.

ATI released its proprietary OpenGL extension to access those features at the same time the Radeon 8500 was released. If using nVidia's register combiners and texture shader extensions was difficult, using ATI's GL_ATI_fragment_shader extension was painful.

You had to input each instruction of the shader in between Begin/End blocks, in the form of one or more function calls to a single OpenGL C API function. This was so awkward that in the end lots of fragment programs were written with the intended instruction as a comment and the barely-readable GL command below, e.g.:

// reg2 = |light to vertex|^2
glColorFragmentOp2ATI(GL_DOT3_ATI,
                      GL_REG_2_ATI, GL_NONE, GL_NONE,
                      GL_REG_2_ATI, GL_NONE, GL_NONE,
                      GL_REG_2_ATI, GL_NONE, GL_NONE);

In case you were wondering: GL_DOT3_ATI was the constant defining a dot product operation, GL_REG_2_ATI indicated that we wanted to work on the second register - which had to be populated by a previous call - and the GL_NONE occurrences are placeholders for arguments not needed by the DOT3 operation.

While not different in concept from the API used for texture shaders and register combiners, in that case the single-function-catches-all C style allowed choosing among a limited set of states, while in this case it was used to map a fairly generic assembly language.

How the API did(n't) adapt

So, life wasn't exactly great if you were an OpenGL programmer in late 2001 and you wanted to take advantage of fragment processing:

  • there was no ARB standard to define what a cross-vendor fragment program was even supposed to be like;
  • meanwhile, you couldn't "just wait" and hope for things to get better soon, so most developers rolled out their own tools to assist in writing fragment programs for the R200, either as C preprocessor macros or offline text-to-C compilers;
  • programs written for the R200 could not run on the NV20 by any means, because each chip had features the other didn't: this effectively forced you to write your effects once for the R200 and a second, watered-down time for the NV20.

About this last point, one might have thought to just forget about the NV2x architecture; after all, everybody would have bought an R200-based card within months, right? Well, there was a problem with this: the NV25 (commercial name GeForce 4 Ti), an incremental evolution of the NV20 still with no true fragment programmability, was meanwhile topping performance charts; the R200, instead, had the great new features everybody was excited about, but existing games at the time weren't using them yet, and performed better on the previous generation chip.

On a sidenote, Direct3D didn't deal much better with such vendor fragmentation: Microsoft had previously taken NV20 register combiners and texture shaders and made the Pixel Shader 1.0/1.3 "standards" out of them. Then, when the R200 was introduced, Microsoft updated the format to 1.4, which was in turn a copy-paste of ATI's R200-specific hardware features.

The Direct3D Pixel Shader format had, however, a clear advantage over the OpenGL state of things: as the R200 hardware features were a superset of the NV2x's, Microsoft took the time to implement an abstraction layer that allowed running a PS 1.0/1.3 program on both NV2x and R200 chips (running a PS 1.4 program on the NV2x, instead, was impossible due to the lack of required hardware features).

nVidia did try to ease life for OpenGL programmers by releasing NVParse, a tool that allowed using a Direct3D Pixel Shader 1.0/1.1 program in an OpenGL application. It had however some restrictions in its PS 1.0 and 1.1 support that hampered its adoption, and it was never updated to support PS 1.2.

On another sidenote, a saner text-format-based shader language for the R200 chip only came with the GL_ATI_text_fragment_shader extension, which was introduced no less than six months later and actually implemented in Radeon drivers no earlier than the beginning of 2003.

It was mainly seen as a "too little, too late" addition, and never saw widespread adoption.

There's an extension born every minute

To add to the already existing mayhem, both nVidia and ATI took care to expose the hardware features present in new iterations of their chips to OpenGL via vendor-specific extensions, whose number kept increasing over time; in fact, most of the GL_NV and GL_ATI extensions in the OpenGL registry date back to this timeframe. Meanwhile the ARB consortium, responsible for dictating OpenGL standards, was trying to extend the fixed-function pipeline, adding parameters and configurability via yet more extensions, and seemingly postponing the idea of a programmable pipeline altogether, when that was clearly the direction graphics hardware was taking.

On a historical sidenote, the GL_ARB_vertex_program extension, which we covered in the first article of this series, was actually released together with GL_ARB_fragment_program, and vertex programming APIs were similarly vendor-fragmented until then; yet vertex programmability had much less momentum, among both vendors and developers, for this to actually constitute a problem.

As an indicator, consider that until then each vendor had a grand total of one extension covering vertex programmability (nVidia had GL_NV_vertex_program, whose representation of shaders was text based, and ATI had GL_EXT_vertex_shader, which used C functions to describe shaders). Compare that to the literally tens of vendor-specific extensions released for fragment programmability, and you get the idea.

Light at the end of the tunnel

It should be clear that the situation of the OpenGL programmable pipeline was messy to say the least at this point. We're at roughly September 2002 when finally, but undoubtedly too late to the table, the ARB consortium finalized and released the GL_ARB_fragment_program extension. It was what the OpenGL world had been in dire need of during the previous year: a cross-vendor, text-based format to program fragment stages.

It should be noted that, while GL_ARB_fragment_program was intended to end the API hell OpenGL had fallen into during the previous year, it could be implemented in hardware neither by the NV20 series (no surprise) nor by the R200 series. This was more or less expected: the ARB understood that it had missed that generation's train, and that it made more sense to provide a uniform programming model for the next one.

The technology know-how gathered by ATI with the less-than-successful R200 allowed them to develop the first GPU that could support the new extension even before it was finalized: the R300 (commercially branded as Radeon 9700 Pro) was in fact released by ATI in August 2002. ATI was ahead of the competition this time, while nVidia's much-awaited NV30 chip only reached shelves five months later (January 2003, with the commercial name of GeForce FX), and for various reasons it was no match for the R300 performance-wise. History repeating here: as with the R200 for ATI, nVidia's first GPU with true fragment programmability had been a market failure as well.

Fragment programs in OpenGL

What GL_ARB_fragment_program brought to the table was a quite expressive assembly language, where vectors and their related operations (trig functions, interpolation, etc...) were first-class citizens.

From an API point of view, it was pretty similar to GL_ARB_vertex_program: programs had to be bound to be used, they had access to OpenGL state, and the main program could feed parameters to them.

From a pipeline point of view, fragment programs, when enabled, effectively replaced the "Texture Environment", "Colour Sum", "Fog" and "Alpha test" stages:

This is a selection of fragment attributes accessible by fragment programs:

Fragment Attribute Binding  Components  Underlying State
--------------------------  ----------  ----------------------------
fragment.color              (r,g,b,a)   primary color
fragment.color.secondary    (r,g,b,a)   secondary color
fragment.texcoord           (s,t,r,q)   texture coordinate, unit 0
fragment.texcoord[n]        (s,t,r,q)   texture coordinate, unit n
fragment.fogcoord           (f,0,0,1)   fog distance/coordinate
fragment.position           (x,y,z,1/w) window position

And here's the full instruction set of version 1.0 of the assembly language:

Instruction    Inputs  Output   Description
-----------    ------  ------   --------------------------------
ABS            v       v        absolute value
ADD            v,v     v        add
CMP            v,v,v   v        compare
COS            s       ssss     cosine with reduction to [-PI,PI]
DP3            v,v     ssss     3-component dot product
DP4            v,v     ssss     4-component dot product
DPH            v,v     ssss     homogeneous dot product
DST            v,v     v        distance vector
EX2            s       ssss     exponential base 2
FLR            v       v        floor
FRC            v       v        fraction
KIL            v       v        kill fragment
LG2            s       ssss     logarithm base 2
LIT            v       v        compute light coefficients
LRP            v,v,v   v        linear interpolation
MAD            v,v,v   v        multiply and add
MAX            v,v     v        maximum
MIN            v,v     v        minimum
MOV            v       v        move
MUL            v,v     v        multiply
POW            s,s     ssss     exponentiate
RCP            s       ssss     reciprocal
RSQ            s       ssss     reciprocal square root
SCS            s       ss--     sine/cosine without reduction
SGE            v,v     v        set on greater than or equal
SIN            s       ssss     sine with reduction to [-PI,PI]
SLT            v,v     v        set on less than
SUB            v,v     v        subtract
SWZ            v       v        extended swizzle
TEX            v,u,t   v        texture sample
TXB            v,u,t   v        texture sample with bias
TXP            v,u,t   v        texture sample with projection
XPD            v,v     v        cross product

It's easy to see why fragment programming was such a breakthrough in real-time computer graphics. It made it possible to describe arbitrarily complex graphics algorithms and run them in hardware for each fragment rendered by the GPU. From fragment programs on, OpenGL had a standard that was no longer a "really very customizable" graphics pipeline: it was a truly programmable one.

For reasons already discussed in part 1, fragment programs didn't include branching logic, so you had to make do with conditional instructions like SLT and CMP.

Programming the torus' fragments

As a good first step into the realm of fragment programs, let's reimplement the fixed-function texture lookup trick that we used in Part 1. The code is very simple:


!!ARBfp1.0

# Just sample the currently bound texture at the fragment's
# interpolated texcoord
TEX result.color, fragment.texcoord, texture[0], 1D;

END


What's happening here? If you recall how we implemented the cel-shading effect in Part 1, the idea was to compute the light intensity for a given vertex, and use it as a coordinate into a lookup texture, which mapped intensities to discretized values. So, after the hardware interpolators do their job, control is passed to our fragment program, which receives the interpolated texture coordinate. We just have to perform the actual sampling and we're done. That's what was going on behind the scenes with fixed functionality.

One fragment at a time...

So, now that we have full control on the fragment stage, we can get rid of the lookup texture altogether, and do the threshold checks for each displayed fragment of the model.

This will not improve the graphics outcome in any noticeable way, as we're just moving to the GPU the same discretization code that we previously ran on the CPU in part 1; but this time we'll obtain the final color with a real-time computation instead of a fetch from a precalculated lookup table, and this will also serve as a first step for further improvements. Let's step into our fragment program:


!!ARBfp1.0

PARAM bright = { 1.0, 1.0, 1.0, 1.0 };
PARAM mid    = { 0.5, 0.5, 0.5, 1.0 };
PARAM dark   = { 0.2, 0.2, 0.2, 1.0 };
PARAM c      = { 0.46875, 0.125, 0, 0 };

TEMP R0, R1;


Nothing new here: an ARB-flavour fragment program, version 1.0; three constant parameters to hold the final colors, one for the threshold values, and temporaries. Here's where the real program starts:

# Thresholds check, xy: 11 dark, 10 mid, else bright
SLT R1, fragment.texcoord.xxxx, c;

The SLT instruction stands for "Set on Less Than": for each component, it compares the second argument with the third, and sets the corresponding component of the first argument to 1.0 (true) if the former is less than the latter, or to 0.0 (false) otherwise.

Thanks to swizzling, we can broadcast the light intensity all over the second argument, and therefore test both thresholds with a single instruction. As we have two thresholds only, we only care about the x and y components; had we four thresholds, we could have used the same trick to get up to four checks performed with a single instruction.

# R0 = R1.x > 0.0 ? mid : bright
CMP R0, -R1.x, mid, bright;

# R0 = R1.y > 0.0 ? dark : R0
CMP R0, -R1.y, dark, R0;

MOV result.color, R0;

It's now time to set the desired color according to our threshold checks. But... how do you choose between three possible outcomes if you don't have a jump instruction? Well, in this particular case you can flatten the comparisons: that is, always execute both of them, keeping the first one's result if the second one doesn't pass. Note that we have to negate our arguments, because the CMP instruction actually checks whether its first source argument is less than zero.

Full fragment control

Until now we've computed the light intensity for each vertex, and interpolated it to get the values in between vertices. The drawback of this approach is that a lot of information is lost in the process: consider the case where two adjacent vertices receive the same light intensity, but have two different normals. Interpolating the values yields a constant color all over; what we would like instead is to interpolate the normals, to get a shade across those vertices:

With a fully programmable fragment stage, we can move the light calculation per-fragment, offloading the vertex program and moving that logic into the fragment program. This was effectively impossible with fixed functionality.

First, the revised vertex program:


!!ARBvp1.0

# Current model/view/projection matrix
PARAM mvp[4]    = { state.matrix.mvp };

# Transform the vertex position
DP4 result.position.x, mvp[0], vertex.position;
DP4 result.position.y, mvp[1], vertex.position;
DP4 result.position.z, mvp[2], vertex.position;
DP4 result.position.w, mvp[3], vertex.position;

# Make vertex position and normal available to the interpolator
# through texture coordinates
MOV result.texcoord[0], vertex.position;
MOV result.texcoord[1], vertex.normal;

END


The new vertex program is only there to pass the current vertex position and normal to the fragment program. The hardware interpolator will generate per-fragment versions of these attributes, allowing us to produce a much smoother rendering of light transitions.


!!ARBfp1.0

PARAM lightPosition = program.env[0];

PARAM bright = { 1.0, 1.0, 1.0, 1.0 };
PARAM mid    = { 0.5, 0.5, 0.5, 1.0 };
PARAM dark   = { 0.2, 0.2, 0.2, 1.0 };
PARAM c      = { 0.46875, 0.125, 0, 0 };

TEMP R0, R1, R2;


Accordingly, we'll have to move the actual light intensity calculation into the fragment program, which starts with a very familiar prologue, with the added information of the light position in the scene.

TEMP lightVector;
SUB lightVector, lightPosition, fragment.texcoord[0];

TEMP normLightVector;
DP3 normLightVector.w, lightVector, lightVector;
RSQ normLightVector.w, normLightVector.w;
MUL normLightVector.xyz, normLightVector.w, lightVector;

TEMP iNormal;
MOV iNormal, fragment.texcoord[1];

TEMP nNormal;
DP3 nNormal.w, iNormal, iNormal;
RSQ nNormal.w, nNormal.w;
MUL nNormal.xyz, nNormal.w, iNormal;

# Store the final intensity in R2.x
DP3 R2.x, normLightVector, nNormal;

This is where the actual intensity computation is performed. The two normalization steps (you can spot them instantly at this point, can't you?) are required because interpolated vectors are not normalized, even if the starting and ending vectors are, so you have to normalize them yourself if you want the final light computation to make sense.

# Thresholds check, xy: 11 dark, 10 mid, else bright
SLT R1, R2.xxxx, c;

# R0 = R1.x > 0.0 ? mid : bright
CMP R0, -R1.x, mid, bright;

# R0 = R1.y > 0.0 ? dark : R0
CMP R0, -R1.y, dark, R0;

MOV result.color, R0;

This closing part should already be familiar to you :)

The rendering improvement using per-fragment light computation is noticeable even on our very simple object, especially if we zoom on the transition line between highlight and midtone:

Wrapping it up

My initial plan was to end the series with part 2, but no history of fragment programming would be complete without the obligatory report of the API mayhem that happened in the early 2000's, and that took some more room than I thought to fully cover.

In the next part we'll walk through the rise of compilers that first allowed developers to produce assembly programs from higher level languages, and how ultimately a compiler for such a language became part of the OpenGL standard.


Saturday, June 22, 2013

How I got 1TB of online storage on Copy

Disclaimer: this post describes two vulnerabilities I've stumbled upon in Copy's referral system, while genuinely trying to debug an issue with my referral code.  I already reported them to Barracuda Networks' support, but will not go into details in this post, as one of them still looks exploitable.

Update (22nd June): Barracuda Networks has successfully patched both vulnerabilities in the referral system.

At the time of this writing, the bonus space I've earned through various methods on the awesome Copy service from Barracuda Networks exceeds 1TB.

If you've been living under a rock during the last few months: Copy is a file synchronization service similar to Dropbox, which adds two very savoury ingredients to the tried-and-true recipe:

  • cost of shared storage is split across all users accessing the shared files (Barracuda Networks calls this "Fair Storage");
  • no limit on the extra storage that can be obtained via user referrals.

All free Copy accounts start with 15GB, and you earn 5GB per referral.  This means that when referred by a friend you receive 20GB of Copy storage from the get-go.

I can never have enough online storage, so I was naturally interested in adding storage to my Copy account by getting referrals.  I learned something unexpected in the process.

First mandatory step: AdWords campaign

Back in the Dropbox days, I maxed out my account storage via referrals using a targeted Google AdWords campaign, spending less than €6 over one week for a pretty good 5% conversion rate.  In more detail:

  • 558 people clicked on the ads;
  • 80 people out of 558 signed up for Dropbox;
  • 32 people out of 80 installed the application, giving me 500MB of extra storage.

It felt natural to try this with Copy as well, but I struggled with getting the same conversion rate:

  • 1741 people clicked on the ads;
  • 13 people out of 1741 signed up for Copy;
  • 9 people out of 13 installed the application, giving me 5GB of extra storage.
Which translates to a 0.5% conversion rate.  At this point I stopped the campaign.  It just wasn't clear to me what I was doing wrong.
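For the record, a quick sanity check of those figures, counting conversion as installs over clicks as I did above:

```python
# Conversion rate computed as installs / clicks, using the figures
# from the two campaigns above.
campaigns = {
    "Dropbox": {"clicks": 558, "signups": 80, "installs": 32},
    "Copy": {"clicks": 1741, "signups": 13, "installs": 9},
}

for name, stats in campaigns.items():
    rate = stats["installs"] / stats["clicks"]
    print(f"{name}: {rate:.1%} conversion")
# → Dropbox: 5.7% conversion
# → Copy: 0.5% conversion
```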

The case of the missing referrals

A couple days later a friend of mine signed up for Copy using my referral, and we noticed something weird in the e-mail Copy sent him:

My friend was sure he had used my referral to sign up - I was sure my name was not Fabio either - so that 0.5% conversion rate began to make more sense: there was probably a bug in Copy's referral system.

Now, I was really curious about how this could happen.  Did Fabio R. somehow get my very same referral code?  I set out to understand how the Copy sign-up process worked client side, hoping to shed light on this and file a report to Barracuda Networks' support team.

The infamous "success" GET request

Later that day I opened Chrome's developer console and tried to make sense of what happened during the registration process.  I opened an account through my referral code, installed the desktop application on another system, and found a couple of peculiar requests:

The registration process started with a POST request carrying the registration data (including the referral code), followed by a GET.  At first glance the GET request looked like it was just meant to load the usual "Congratulations" page, but upon closer inspection I noticed it still carried the referral code in its cookies (don't bother to verify this, it's not happening anymore), which made no sense to me: the server had already had its chance to store the referral code, so why would it be sent again?

Using the awesome requests library by Kenneth Reitz, I quickly set up a script that replicated this particular GET.  I didn't have a clear plan in mind at this point; I just ran the script a couple of times to look at the responses, hoping to spot anything wrong.
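For the curious, a minimal sketch of what such a replay might look like with requests; the endpoint URL and the cookie name below are placeholders, not Copy's real ones, since the actual values were recorded from the developer console:

```python
import requests

# Placeholder URL: the real endpoint was recorded from Chrome's
# developer console and isn't reproduced here.
SUCCESS_URL = "https://www.copy.com/signup/success"

# Build the GET exactly as the browser sent it at the end of
# registration, with the same headers and cookies.
req = requests.Request(
    "GET",
    SUCCESS_URL,
    cookies={"referral": "MY_REFERRAL_CODE"},  # recorded cookie (placeholder name)
    headers={"User-Agent": "Mozilla/5.0"},     # same UA the browser sent
)
prepared = req.prepare()

# Sending the prepared request replays the final step of the flow:
# session = requests.Session()
# response = session.send(prepared)
```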

A couple minutes later I noticed this in my e-mail inbox:

Excuse me?  The Copy AdWords campaign was paused, and ten people signing up through my referrals and installing the application in around 10 minutes seemed unlikely at best.

So it looked like repeating the GET request with the same headers and cookies recorded during registration triggered the referral system, provided the desktop application was installed on the referee's system.

At this point I knew I had stumbled onto something: I opened a ticket with Barracuda Networks' support, describing the initial problem with the referral going to Fabio R., and went to bed.  That is, after leaving the script running in an endless loop, just to see what would happen.  I woke up the next day with around 0.9TB of Copy storage.

Barracuda Networks was pretty quick to fix this: I opened my ticket on a Friday night and the script wouldn't work anymore by the next Monday...

Fiddling with UUIDs

...except that I noticed something weird in the new GET request the browser was now emitting at the end of the registration process.  The request cookies contained a UUID that hadn't been there days before.

Could it be a fingerprint identifying the computer the request was coming from?  I didn't investigate much, but just out of curiosity I ran the script again, crafting a new UUID for each GET request, and got referral bonuses from most (but, interestingly, not all) of the requests.
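A sketch of that variant, assuming the fingerprint travels as one more cookie; both cookie names here are guesses for illustration:

```python
import uuid

def fresh_cookies(referral_code):
    """Cookies for one replayed GET; a fresh UUID mimics a new machine."""
    return {
        "referral": referral_code,        # placeholder cookie name
        "machine_id": str(uuid.uuid4()),  # guessed cookie name, new value each time
    }

# Every call produces a distinct fingerprint:
a = fresh_cookies("MY_REFERRAL_CODE")
b = fresh_cookies("MY_REFERRAL_CODE")
assert a["machine_id"] != b["machine_id"]
```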

Patching it, this time for real

I was looking forward to investigating this further, but Barracuda Networks was quicker to fix it than I was to report it.  Meanwhile, the Copy desktop client received a push update from version 1.28.657.0 to 1.30.0347; I can only assume it's related to the vulnerabilities reported in this post.

Parting words

Copy really is an awesome service.  Its Fair Storage sharing rule is what sets it above most of its competitors for me (think family photos) and I look forward to using it a lot in the future.

If you enjoyed reading this, and don't have a Copy account yet, please sign up through my referral; the genuine extra space I'll earn this way will have me covered in case Barracuda Networks decides to do something about that questionable 0.9TB earned in one night :)