Wednesday, October 5, 2011

Demonstrating the effect of CPU pipelining on performance

When writing high-performance applications, it helps to understand some of the fundamentals of CPU architecture, such as CPU pipelining. Often the most useful way to understand pipelining is to see its effect in a simple example.

The full source code for this example is below, but the most important part is the assembly code and results.

The test uses the Windows high-resolution performance counter (QueryPerformanceCounter) to measure how long it takes to execute a function written in Intel x64 assembly.

The simple loop below sets the rax register to a large number (in this case 2.33 billion, since I am running on a 2.33 GHz processor) and repeatedly decrements it until it reaches zero. My console application then prints the number of milliseconds this took. On my machine it took roughly 1000 ms, which works out to about one loop iteration per clock cycle, even though each iteration executes two instructions: a decrement followed by a conditional jump.
.CODE             

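runAssemblyCode PROC
  mov rax, 2330 * 1000 * 1000   ; loop counter: 2.33 billion iterations
 start:
  dec rax                       ; decrement the counter; sets ZF when it reaches zero
  jnz start                     ; loop again while rax is non-zero
  ret
runAssemblyCode ENDP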
END

To complicate things, here I perform more operations in the same loop, each independent of the others. Instead of one decrement, I now do five, each on a different register. The total time taken is only about 2000 milliseconds: three times as many instructions per iteration (six rather than two) for only twice the time.

.CODE             

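runAssemblyCode PROC
  mov rax, 2330 * 1000 * 1000   ; loop counter: 2.33 billion iterations
 start:
  dec rcx                       ; four extra decrements on registers that
  dec rdx                       ; do not depend on each other or on rax
  dec r9                        ; (their initial values are irrelevant)
  dec r10
  dec rax                       ; counter goes last so jnz sees its zero flag
  jnz start                     ; loop again while rax is non-zero
  ret
runAssemblyCode ENDP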
END

As for the finer details of Intel CPU pipelining, I could not explain how to calculate the expected execution time for these examples. They do demonstrate quite well, though, that the CPU can execute more than one instruction per clock cycle when the instructions are independent, so a carefully crafted sequence of instructions can run faster than the clock speed of your CPU would have you believe.
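
As a rough sanity check on the timings above, the measured times can be converted into an average number of instructions per clock cycle. This is only a sketch: the function and parameter names are my own, chosen for illustration, and it assumes the core really runs at its nominal 2.33 GHz for the whole test (no frequency scaling).

```cpp
#include <cassert>

// Average number of instructions completed per clock cycle for a timed loop.
// Assumption: the CPU runs at a constant clockHz for the whole measurement.
// All names here are illustrative and not part of the original test program.
double instructionsPerCycle(double iterations,
                            double instructionsPerIteration,
                            double elapsedMs,
                            double clockHz)
{
    assert(elapsedMs > 0.0 && clockHz > 0.0);

    // Clock cycles that elapsed during the measurement.
    const double cycles = clockHz * (elapsedMs / 1000.0);

    // Total instructions executed, divided by the cycles available.
    return iterations * instructionsPerIteration / cycles;
}
```

Plugging in the figures above, instructionsPerCycle(2.33e9, 2, 1000, 2.33e9) gives about 2 for the first loop, and instructionsPerCycle(2.33e9, 6, 2000, 2.33e9) gives about 3 for the second.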

To run this application yourself, simply create a Visual Studio 2010 (or newer) C++ project targeting the x64 platform and create two files - one .ASM file with the assembly above, and one .CPP file with the C++ code below. You will also need to enable the MASM build customisation by right-clicking on the project and clicking Build Customisations.

#include <Windows.h>
#include <iostream>

using namespace std;

// Implemented in the .ASM file; extern "C" prevents C++ name mangling.
extern "C" void runAssemblyCode();

class Timer
{
public:
    Timer()
    {
        QueryPerformanceFrequency(&_ticksPerSecond);
        Reset();
    }

    void Reset()
    {
        QueryPerformanceCounter(&_startedAt);
    }

    long long GetElapsedMilliseconds()
    {
        LARGE_INTEGER now;
        QueryPerformanceCounter(&now);
        return (now.QuadPart - _startedAt.QuadPart) * 1000 / _ticksPerSecond.QuadPart;
    }

private:
    LARGE_INTEGER _startedAt;
    LARGE_INTEGER _ticksPerSecond;
};

int wmain(int argc, wchar_t* argv[])
{
    Timer timer;

    runAssemblyCode();

    auto elapsed = timer.GetElapsedMilliseconds();
    cout << elapsed << endl;

    return 0;
}
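
If you would rather not depend on the Windows API for the timing, the Timer class above could be swapped for something like the sketch below. It assumes a C++11 standard library that provides <chrono>; the class name SteadyTimer is my own, and the resolution of std::chrono::steady_clock depends on the implementation, so treat the results as approximate.

```cpp
#include <chrono>

// Portable stand-in for the Timer class, built on std::chrono::steady_clock.
// SteadyTimer is a hypothetical name, not part of the original program.
class SteadyTimer
{
public:
    SteadyTimer()
    {
        Reset();
    }

    // Restart the measurement from the current moment.
    void Reset()
    {
        _startedAt = std::chrono::steady_clock::now();
    }

    // Whole milliseconds elapsed since construction or the last Reset().
    long long GetElapsedMilliseconds() const
    {
        const auto elapsed = std::chrono::steady_clock::now() - _startedAt;
        return std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count();
    }

private:
    std::chrono::steady_clock::time_point _startedAt;
};
```

It exposes the same Reset and GetElapsedMilliseconds members, so wmain only needs the type of the timer variable changed.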
