Tuesday, May 3, 2011

Using PIX to help figure out graphics glitches in Star Ninja

As Star Ninja rapidly approaches completion, I've been working on a lot of game polish tasks. I really want this game to make a great first impression, and to that end I've been streamlining the UI and making transitions between screens look nice.

Recently, there was a problem with the screen transition logic that renders a cross fade between the level selection and the gameplay screen over the course of a second or so. For a while, I didn't think too much about what was essentially a 1-frame screen flicker, but once I really noticed it I knew it had to be fixed.

Single frame render glitches are always hard to deal with unless you have the right tools and process. The first goal is to identify what is really going on. The second is to reproduce it reliably and quickly; without that, a lot of time can be wasted. Once you have both, you can iterate with the debugger and tools to take a close look at what is usually a problem with a lot of moving parts. That's where PIX comes in.

To give an idea of what I was looking at, here are the level picker menu, the glitched screen, and a frame not long after the transition was done. (Note: the art and level are not final; this is a game in development, after all!)
The menu screen cross fades with the game, but at the last frame of the crossfade it was doing this:


Here's the frame right after it:

Pretty glitchy there in the middle, but since it only lasts one frame it's almost impossible to see it as more than a flicker. PIX can record an XNA application's stream of graphics device calls, giving you the ability to analyze every last call made to DirectX. This is an enormously useful tool for diagnosing problems like this one.

Because this happens for only a single frame, I chose to record the stream rather than fumble with breakpoints, which are always a hassle when dealing with UI and timing related problems. To do this, the PIX experiment needs to be set up as such:
Note that you have to create and configure each trigger yourself; there isn't a magic "setup stream recording" button. No big deal once you know what to do, though. I find I need to check "Disable D3DX analysis" on the Target Program tab when using PIX with XNA apps; it doesn't work without that for me. It might be my system configuration, or maybe an XNA compatibility issue (who knows).

So, once that's set up, click Start Experiment to run the game. Press the key you assigned to start the stream recording, capture the problem, and press it again to stop. Exit the application, wait a moment, and PIX will pop up a new window like this:

From here, you can "scrub" the video to any frame and drill down into it to inspect the sequence of DirectX events. After a bit of digging around in the data, I found that the cause of the problem could be seen by selecting the "Render" tab, finding the frame of the stream that showed the problem, and then selecting the Depth channel in the "Channel(s):" combo box. This is what I saw:

Clearly, the menu was writing to the depth buffer and the game screen wasn't able to draw correctly because of this. Keep in mind this was happening during the transition, where the code is actually drawing both screens to make the fade effect work. This is a new situation for the game because, before the transitions were added, all screens were rendered without concern for how they might interact with other screens.

I didn't want to render both screens into render targets and then draw them back as full screen quads because that would be too slow. The code already uses a render target for the gameplay screen as it fades in; this problem happened on the very last frame, when the gameplay screen was rendering directly to the back buffer like it does when it's not part of a transition. The gameplay screen was being affected by the depth buffer results left over from the previous frame, but only for that one frame.

To save a bit of time here and there, the game doesn't do a full screen clear at the beginning of each frame unless it is known to be necessary. Some people have told me that screen clears are so fast that I shouldn't bother, that it's too early to optimize, and that I should just clear whenever it's convenient. The fact is that the PIX logging shows the Clear operation takes enough time that it's worth it to me to avoid it when possible. During transitions, the phone is already using a lot of GPU because of the render target usage, so this is a good time to avoid wasteful operations. Optimizing early is sometimes a bad thing, but if I know at the outset that one option is faster than another and not too much trouble, I'll always go for the faster one. It tends to make the entire application more robust in the long run, with fewer architectural performance problems that I later wish I had gotten right in the first place.

To fix this, I simply added this line of code between the two screens to reset the depth buffer during the transition, causing the screen draw order to determine visibility:

GraphicsDevice.Clear(ClearOptions.DepthBuffer, Color.Black, 1, 0);
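To illustrate why resetting the depth buffer between the two screens helps, here is a minimal sketch in plain C# that does not use XNA at all. It models a single pixel with a less-or-equal depth test, which is my understanding of how XNA's default depth state behaves. The TinyDevice and TransitionDemo names and the depth values 0.2 and 0.5 are made up for the example; they are not taken from the game's code.

```csharp
using System;

// Toy one-pixel "device": a color plus a depth value, with a
// less-or-equal depth test and depth writes enabled.
class TinyDevice
{
    public string Color = "empty";
    public float Depth = 1.0f; // 1.0 is the farthest possible depth

    // Resets only the depth value, leaving the color untouched.
    public void ClearDepth(float depth) { Depth = depth; }

    // Writes color and depth only if the fragment passes the depth test.
    public void Draw(string color, float depth)
    {
        if (depth <= Depth)
        {
            Color = color;
            Depth = depth;
        }
    }
}

static class TransitionDemo
{
    // Draws the menu and then the game into the same pixel,
    // optionally resetting depth in between.
    public static string DrawTransitionFrame(bool clearDepthBetweenScreens)
    {
        var device = new TinyDevice();
        device.Draw("menu", 0.2f);       // hypothetical: menu writes a near depth
        if (clearDepthBetweenScreens)
            device.ClearDepth(1.0f);     // same idea as clearing ClearOptions.DepthBuffer
        device.Draw("game", 0.5f);       // hypothetical: game geometry is farther away
        return device.Color;
    }

    static void Main()
    {
        Console.WriteLine(DrawTransitionFrame(false)); // prints "menu"
        Console.WriteLine(DrawTransitionFrame(true));  // prints "game"
    }
}
```

Without the clear, the game's fragment loses the depth test against the value the menu left behind, so the menu stays visible. With the clear, whichever screen is drawn last wins, which is the "draw order determines visibility" behavior described above.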


In the end, it was a simple bug that was easy to fix. Without PIX, I would have been left recording video, analyzing it frame by frame, and guessing at what was wrong. Fortunately, I was able to use PIX to quickly identify and solve the problem.