Performance Profiling with a Focus

By Kirk J Krauss

Sometimes the goal of a software development project is simply to get the job done. To put some text on a display, take in some input, open a file — as long as these things don’t take an inordinate amount of time, how much does the code’s performance matter? The answer may depend on how much you’re planning for scalability, new features, and maintenance. What first seems to be an insignificant performance consideration may later become significant indeed.

During the mainframe era, even such an everyday task as opening a file raised performance concerns. As systems evolved, software developers recognized that first the right storage device must be accessed using the right handlers, access privileges must be checked, and a buffer created, in addition to the seek across media that must be performed. A common saying that arose among developers was that on a clear day, you could open forever. Fast forwarding to the present, if you’ll debug a dozen-plus calls deep into what happens within the processing of a File › Open dialog request on any GUI-based system, you might gain a new appreciation of the complexity involved just bringing a file from disk into memory on a modern computer.

In other words, when you develop code that gets invoked frequently, unless it’s as simple as “Hello, world” it very well may benefit from your understanding some details about its performance. This kind of understanding goes deeper than a hunt for some bottleneck to be optimized out. You might want to know which code is consuming the least time, because that’s where you can think about fitting in improvements such as data validations or logging, or even entirely new features, without significantly slowing things down. You also might want to recognize circumstances where one approach makes sense over another, and other circumstances where the opposite applies.

Finding which of your code is invoked frequently enough to matter, and how much time is spent in certain functions and lines of code, is where a performance profiler becomes helpful. But a profiler provides its best help only when you consider how to best aim it. Your aim should usually be on code you can control, that’s not merely waiting on something you can’t control. This probably means that you can entirely bypass the code behind the File › Open dialog, unless you simply want to learn about it. My primary suggestion for profiling your code is to always keep in mind the time that could actually make a difference for the user experience.

Selective Instrumentation

A performance profiler is a type of runtime analysis tool. Typically, you can choose what to profile, at the level of modules to be instrumented by the tool, in advance. That way, you don’t have to instrument “the world” in order to see what’s going on just in your own code.

Some benefits of singling out your modules to instrument include these:

The profiler I use, Quantify for Windows, has a command line option that allows for instrumenting specific modules. The list can be as short as one module name. For example, to selectively instrument the Windows multithreaded GDI sample executable, this command is enough to do useful profiling:

quantify -cachedir=d:\cache -selective_instrumentation -selected_module_list=mtgdi.exe

The output is an mtgdi.exe.QSel file that can simply be renamed (e.g. to mtgdi.instrumented1.exe) and executed, in order to run the instrumented version with Quantify. If Quantify is already running, the new instrumented mtgdi.exe run will connect to it; otherwise Quantify will come up together with the instrumented program. Functions in dependent modules, such as the MFC library, won’t show up in the profiler’s output, unless they are recognizable by their entry points.

With Quantify, by default, timings for instrumented code are based on instruction clock cycles, while calls to uninstrumented code get timed. For all the examples shown below, the calls out of mtgdi.exe are intercepted so that a timestamp can be recorded on the way in and out of those calls. Other profilers may perform sampling or other types of timing analysis on the instrumented code. Anyway, most of them will time all of the calls going out to uninstrumented modules, when selective instrumentation (by whatever name it may be called) is in the picture.

Some sampling profilers are touted as performing no instrumentation. They nevertheless need a way to collect data for display. To get it, they may rely on a build or runtime option applicable to the module(s) being profiled. An option like that is needed not only to provide access to the stack frames for each thread executing code in the module, but also to ensure that a stack frame is cleanly associated with each sample. On most platforms, calling into routines, and returning from them, doesn’t take place via atomic operations. A profiler that does nothing to preclude sampling in the midst of a transfer of control is likely to generate misleading or confusing results. In other words, for most well-designed profilers of any stripe, there’s a way to designate modules for profiling, and to select just the ones that matter to you.

Incidentally, the reason both the Quantify option and the topic being discussed here have the same name is that I’m the person who named them both. Some other runtime analysis tools offer this functionality with a different name, such as partial or exclusive instrumentation, or they may simply ask that you fill out a list of the modules to be instrumented. Whatever the feature may be called, it’s typically available in profilers whose developers have been thoughtful about usability.

Call Graph Navigation

The top-level output of most modern profilers is a call graph showing functions that have executed during a run. The profiler may narrow the set of functions immediately displayed, by default, to those that have taken the biggest share of execution time. On the one hand, a narrowed result set is just that, and there can be many interesting things to learn from navigating beyond it. On the other hand, the complete call graph that’s available after a sophisticated module has been profiled can seem busy and cumbersome to navigate.

Navigation can become particularly daunting when a bunch of threads have run, and when each thread function has gotten a place of its own in the call graph. Yet some programs, such as mtgdi.exe, create large numbers of threads by design.

Choosing a sub-graph on which to focus isn’t as hard as it may first seem. To help you find a thread of interest, a profiler can pop up tooltips as you mouse over the thread functions shown in the call graph, so you can quickly see the proportions of time spent on each thread. If you select “Focus on subtree” for a thread (in Quantify, this is via a context menu option available when you select the thread function), you can see the time spent on any of the functions called by that thread individually. Undoing this allows you to see time spent for a function across all threads collectively. Most profilers allow you to focus on a sub-graph this way, so you can consider just the time spent within the sequence of calls shown in it.

In mtgdi.exe, creating ten threads that draw rectangles, ten threads that draw lines, and ten threads that draw balls is as simple as selecting items on a drop-down menu. This results in a situation where a lot of threads call a few functions. Indeed, the rendering of each item, whether it’s a rectangle, a ball, or a line, is handled within just one routine, named CGDIThread::InitInstance(), in mtgdi.exe. This routine is always called in the context of a CRectThread, a CLineThread, or a CBallThread, and so it invokes the appropriate SingleStep() subroutine to update a location in which to draw the appropriate shape for each thread in turn, and to then draw it. After you profile a run where you’ve generated ten threads of each variety on offer, you can expand the call graph to show every routine that ran, then hide the portions that involve application startup and shutdown, to clearly show how this object-oriented arrangement scales across all the threads.

Time Elimination

A question to ask yourself, as you’re looking at a call graph weighted according to time spent per routine, is what the weighting means to you. If it doesn’t seem to mean much, you can probably fix that with just a few adjustments to the profiler’s settings. With mtgdi.exe, there’s a kind of time sharing scheme going on among the threads that can make settings adjustments like these worthwhile, toward a better understanding of where the real execution time goes.

That time sharing scheme involves a shared device context. Just one thread at a time gets to update the screen, as mtgdi.exe is running, while each of the other threads has to wait for a chance to make updates of its own. In other words, the more threads you start, the more time each thread spends waiting on all the others. Adding up all this waiting across thirty threads that draw rectangles, lines, and balls results in the lion’s share of the time being attributed to little more than just this waiting.

A profiler typically will provide a way to reattribute the time spent in functions that wait, so that actual execution times can be shown and factored into the call graph, not including wait times. In Quantify, a filter manager dialog allows you to simply delete the time spent in functions that you can specify either by name...

... or across the board.

With just the time spent waiting on *EnterCriticalSection(), and thus the shared device context, deleted, the call graph shows that the ::SingleStep() methods of the CGDIThread objects take most of the live execution time as mtgdi.exe is running.

However, though ten threads that draw lines can be started and stopped all together while this program is running, some of those threads get a chance to draw more lines than others. This situation isn’t very extreme in mtgdi.exe, but in some cases thread starvation can become a severe performance problem in its own right, as discussed here.

Looking line by line at where the time is spent, you may want to keep a lookout for the use of locks, such as the critical section whose time has been deleted for call graph purposes. The deleted time still shows up in an annotated source view available in Quantify. The code for drawing a ball is very simple, and it lives in the critical section together with only a few other lines that manage the device context that’s shared among the threads.

Those few other lines take only a small percentage of the time.

For useful profiling, not every function that waits needs to have its time removed from the graph. Some functions don’t wait very much. The factors that make a difference for functions that wait can make interesting areas of focus. For example, the wait for a thread kill signal would be a problem, in case it was consuming significant time throughout the thread’s life. For mtgdi.exe, though, this isn’t a problem.

In fact, the developers of mtgdi.exe paid careful attention to the performance aspects of killing threads, so much that they developed two separate functions just for the task and allowed the user to choose either of them. There’s a good reason for paying attention to thread management performance. With mtgdi.exe, for example, the user can decide to very quickly create large numbers of threads drawing balls, lines, or rectangles. Once all those threads are making the system busy, stopping them all depends on getting some attention from the system toward making them stop.

Fitting the Situation

Once threads have been terminated, a profiler can show where they’ve spent their time. For mtgdi.exe, the threads that do the drawing spend the bulk of their time within the calls that render the drawing, rather than on calculations involving what to draw and where to draw it. Based on this outcome, it’s clear that performance could hardly be affected by a few more calculations like these, such as getting the balls, rectangles and lines to bounce off the walls at unexpected angles, or having them slow down a bit before they bounce.

So the profiler can help bring new functionality ideas to mind, just by showing where time is available. It also can help narrow down decisions about which functionality fits which conditions. For instance, not every program gives the user a choice of whether to kill threads fast or slow. Beyond the realm of programs intended to be instructive for software developers, choices about thread management are best left up to the program itself.

It’s clear from the comments in the mtgdi.exe source code that its developers did their performance-checking homework. According to them, there’s a sort of limit at around 50 threads when just asking all of them to self-terminate becomes slow. They seem to have gotten this right. Profiling a run where only 30 of these threads are started and stopped using the “slow” option, then started again and stopped again using the “fast” option, shows that the ::OnThreadKillThreadsSlow() routine indeed consumes less time than its counterpart, ::OnKillThreads().

Why should code that’s slow for large numbers of threads become relatively fast for smallish numbers of threads? The ::OnThreadKillThreadsSlow() routine walks the thread list, setting the m_hEventKill event for each CGDIThread in turn and waiting for each of them to drop out of ::InitInstance(), one at a time.

The ::OnKillThreads() routine, by contrast, sets the m_hEventKill event for every CGDIThread and then waits for an indication that all of them have been terminated. Profiling shows that the overall wait time involving this indication is relatively long, at 30 threads, paradoxically making the ::OnThreadKillThreadsSlow() routine the speedier way to go, in that case.

True to the developers’ findings, though, the situation is reversed once more threads are involved.

Profiling a run where 200 threads are started and stopped slow, and started again and stopped fast, shows that the ::OnKillThreads() routine is indeed the speedier route, by a long shot, when that many threads are involved.

The call graph can reveal the difference made by choosing one thread termination strategy over another in the context of entire runs. For the run where 30 threads have been started and right away killed using both routines, killing them via ::OnThreadKillThreadsSlow() consumes less than half a percent of the total run time.

But for the run where the same thing has been done with 200 threads, the time spent in ::OnThreadKillThreadsSlow() is enough to make its way to the default call graph display, even though that default graph also includes the startup and shutdown time for the program as a whole, not to mention the time the threads spent running before the menu was displayed and a kill threads choice made.

In the real world, when you want to automate the choice of which routine to run in a given situation, you won’t always have a useful comment from other developers as to which routine to call in which circumstances. A profiler can help you find out not only which piece of code is a good fit in a given situation, but also how much difference a choice of code can make as conditions change.

In Summation

A performance profiler is more than a bottleneck detector. It’s a useful tool for getting to know your code from the perspective of where time is spent, where time is available to fit things in without impacting the user experience, and in which circumstances you might consider one routine or technique over another. When you focus your profiler on the code that makes the biggest difference, you maximize the value of your own time spent gaining insight.

The code snippets above are from the Microsoft® Foundation Classes (MFC) Samples and are the property of Microsoft Corporation. The profiler screenshots above are from the Quantify GUI, which is the property of Unicom® Global. All other content is copyright © 2017, 2018

Develop for Performance