When writing high-performance applications, it helps to understand some fundamentals of CPU architecture, such as
CPU pipelining. The effect of pipelining is often easiest to see in a simple example.
The full source code for this example is below, but the most important part is the assembly code and results.
The test uses the Windows high performance timer to see how long it takes to execute a function written in Intel x64 assembly.
The simple loop below sets the rax register to a large number (in this case 2.33 billion, since I am running on a 2.33 GHz processor) and repeatedly decrements it until it hits zero. My console application then prints the number of milliseconds this took. On my machine, it took roughly 1000 ms, which is about one clock cycle per iteration, even though each iteration executes two instructions: a decrement followed by a conditional jump.
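.CODE
runAssemblyCode PROC
    mov rax, 2330 * 1000 * 1000    ; loop counter: 2.33 billion iterations
start:
    dec rax                        ; decrement the counter; sets ZF when it reaches zero
    jnz start                      ; keep looping until rax is zero
    ret
runAssemblyCode ENDP
END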
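To take this further, the next version performs more operations in the same loop, all independent of one another. Instead of one decrement, it now does five, each on a different register. That is six instructions per iteration, three times as many as before, yet the total time taken is only about 2000 milliseconds, twice as long.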
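.CODE
runAssemblyCode PROC
    mov rax, 2330 * 1000 * 1000    ; loop counter: 2.33 billion iterations
start:
    ; rcx, rdx, r9 and r10 are volatile (caller-saved) in the Windows x64
    ; calling convention, so they can be modified freely. Their starting
    ; values do not matter here; only rax controls the loop.
    dec rcx
    dec rdx
    dec r9
    dec r10
    dec rax                        ; decrement the counter; sets ZF when it reaches zero
    jnz start                      ; keep looping until rax is zero
    ret
runAssemblyCode ENDP
END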
As to the finer details of Intel CPU pipelining, I could not explain how to calculate the expected execution time for these examples. They do demonstrate quite well, though, that a carefully crafted sequence of independent instructions can complete more than one instruction per clock cycle, which is faster than the clock speed of your CPU would have you believe.
To run this application yourself, create a Visual Studio 2010 (or newer) C++ project targeting x64 and add two files: one .ASM file with the assembly above, and one .CPP file with the C++ code below. You will also need to enable the MASM build customisation by right-clicking the project and choosing Build Customizations.
#include <Windows.h>
#include <iostream>

using namespace std;

// Implemented in the .ASM file; extern "C" prevents C++ name mangling.
extern "C" void runAssemblyCode();

// Simple stopwatch built on the Windows high performance counter.
class Timer
{
public:
    Timer()
    {
        QueryPerformanceFrequency(&_ticksPerSecond);
        Reset();
    }

    // Restart timing from now.
    void Reset()
    {
        QueryPerformanceCounter(&_startedAt);
    }

    // Whole milliseconds elapsed since construction or the last Reset().
    long long GetElapsedMilliseconds() const
    {
        LARGE_INTEGER now;
        QueryPerformanceCounter(&now);
        return (now.QuadPart - _startedAt.QuadPart) * 1000 / _ticksPerSecond.QuadPart;
    }

private:
    LARGE_INTEGER _startedAt;
    LARGE_INTEGER _ticksPerSecond;
};

int wmain(int argc, wchar_t* argv[])
{
    Timer timer;
    runAssemblyCode();
    auto elapsed = timer.GetElapsedMilliseconds();
    cout << elapsed << endl;
    return 0;
}
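If you would rather not depend on the Windows API for timing, the Timer class could be approximated with the standard library. The following is only a sketch: `PortableTimer` and `bestOfMilliseconds` are hypothetical names, it assumes a C++11 compiler that ships `<chrono>` (so not Visual Studio 2010 itself), and the resolution of `std::chrono::steady_clock` depends on the implementation. Taking the fastest of several runs is one common way to reduce noise from other processes, not something the original program does.

```cpp
#include <chrono>

// Hypothetical portable counterpart to the Timer class, built on steady_clock.
class PortableTimer
{
public:
    PortableTimer() : _startedAt(std::chrono::steady_clock::now()) {}

    // Restart timing from now.
    void Reset() { _startedAt = std::chrono::steady_clock::now(); }

    // Whole milliseconds elapsed since construction or the last Reset().
    long long GetElapsedMilliseconds() const
    {
        auto elapsed = std::chrono::steady_clock::now() - _startedAt;
        return std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count();
    }

private:
    std::chrono::steady_clock::time_point _startedAt;
};

// Calls work() `runs` times and returns the shortest elapsed time in
// milliseconds, or -1 if runs is not positive.
template <typename Work>
long long bestOfMilliseconds(Work work, int runs)
{
    long long best = -1;
    for (int i = 0; i < runs; ++i)
    {
        PortableTimer timer;
        work();
        long long elapsed = timer.GetElapsedMilliseconds();
        if (best < 0 || elapsed < best)
        {
            best = elapsed;
        }
    }
    return best;
}
```

In the program above, this could be used as `bestOfMilliseconds(runAssemblyCode, 5)` in place of the single timed call.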