Fedora 11 ships with gcc version 4.4. Unfortunately, at the time of writing, the current CUDA release did not support gcc version 4.4. To get around this you'll need to install a 4.3.x or lower version of gcc and modify your CUDA installation to use the older gcc. To get an older gcc, you can yum install compat-gcc-34 and then apply the changes below. Note that only the gcc34 directory and the links it contains are required if you are not planning to compile the examples from the CUDA SDK.
$ mkdir ~/gcc34
$ ln -s /usr/bin/gcc34 ~/gcc34/gcc
$ ln -s /usr/bin/g++34 ~/gcc34/g++
$ su -l
# cp -av /usr/local/cuda/sdk/common/common.mk /usr/local/cuda/sdk/common/common.mk.original
# vi /usr/local/cuda/sdk/common/common.mk
...
CXX  := g++34
CC   := gcc34
LINK := g++34 -fPIC
...
NVCCFLAGS := --compiler-bindir ~/gcc34

The code below is based on the first example from the Thrust website, with additions to show the input and sorted output on standard error. Notice that there are a host_vector and a device_vector; these are std::vector-like containers that use main memory and VRAM respectively. The thrust::sort() call will transfer control from the CPU to the GPU, and the sort will be processed on the graphics card. Once the sort is complete, execution resumes on the CPU at the line after the thrust::sort() call. As the second-to-last line of main() shows, you can directly access an element of the device vector from code running on the CPU, but because that involves accessing VRAM from the CPU it is a slow operation. It is faster to copy the whole device vector back into main memory (a host vector) before iterating over its elements.
You can clearly see the host and device (RAM and VRAM) vectors used in the code to move the input and output data around. You might be wondering where the kernel functions mentioned in the introduction to the series are. The closest you get to one in this example is the invocation of thrust::sort, which provides the same functionality as std::sort. While the outcome is the same, thrust::sort compiles its code to run on the GPU; in particular, a version of thrust::less is used for element comparison.
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <iterator>

int main(void)
{
    // generate random data on the host
    thrust::host_vector<int> h_vec(20);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    std::cerr << "input..." << std::endl;
    std::copy( h_vec.begin(), h_vec.end(),
               std::ostream_iterator<int>(std::cerr, "\n") );
    std::cerr << "" << std::endl;

    // transfer to device and sort
    thrust::device_vector<int> d_vec = h_vec;
    thrust::sort(d_vec.begin(), d_vec.end());

    // show result
    thrust::host_vector<int> h_result = d_vec;
    std::cerr << "output..." << std::endl;
    std::copy( h_result.begin(), h_result.end(),
               std::ostream_iterator<int>(std::cerr, "\n") );
    std::cerr << "" << std::endl;

    std::cerr << "third item in sorted data:" << d_vec[2] << std::endl;
    return 0;
}

So in the program above, the thrust::sort line will execute on the GPU, accessing the device_vector d_vec and sorting its contents.
The commands below will compile and run the above example, assuming CUDA is already installed on the machine. If your Linux distribution does not use gcc 4.4 then you can leave out the --compiler-bindir argument to nvcc. CUDA programs have the extension .cu instead of .cpp. The gcc compiler is not invoked directly to compile a source file that uses CUDA; instead the nvcc executable is used, which itself uses gcc behind the scenes.
$ nvcc --compiler-bindir ~/gcc34 website-example.cu -o website-example
...
$ ./website-example
input...
1804289383
846930886
1681692777
1714636915
1957747793
424238335
719885386
1649760492
596516649
1189641421
1025202362
1350490027
783368690
1102520059
2044897763
1967513926
1365180540
1540383426
304089172
1303455736

output...
304089172
424238335
596516649
719885386
783368690
846930886
1025202362
1102520059
1189641421
1303455736
1350490027
1365180540
1540383426
1649760492
1681692777
1714636915
1804289383
1957747793
1967513926
2044897763

third item in sorted data:596516649

The program shown below is the first benchmark, comparing an NVIDIA GTS 250 card with an Intel Q6600 for sorting a vector of numbers. Both integer and floating point numbers are tested to see what performance impact there is from using wider data types with a floating-point less-than operation.
The main() function simply calls the bench() template function with a specific numeric type and the size of the vector to use. Note that the only things the program does differently in order to use the GPU for the sort are to copy the vector into a device_vector, use thrust::sort() instead of std::sort(), and copy the device_vector back to the main memory of the machine. The Benchmark class starts a timer whenever an object is created and stops the timer, printing the interval, whenever an object is destroyed. It follows the Resource Acquisition Is Initialization (RAII) design pattern, where the object's scope determines when the benchmark is started and stopped.
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <algorithm>
#include <iostream>
#include "bench.hh"

template < class T >
void bench( const std::string& n, const int SZ )
{
    std::cerr << "-------------" << std::endl;
    std::cerr << "bench() " << n << " SZ:" << SZ << std::endl;
    Benchmark dbm("bench function total....");

    thrust::host_vector<T> h_vec( SZ );
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer to device and sort
    {
        Benchmark dbm("GPU process and copy-to-and-from-device");
        thrust::device_vector<T> d_vec;
        {
            Benchmark dbm("GPU process and copy-to-device");
            d_vec = h_vec;
            {
                Benchmark dbm("GPU process");
                thrust::sort(d_vec.begin(), d_vec.end());
            }
        }
        thrust::host_vector<T> t = d_vec;
        T xx = t[3];
    }

    // sort on host, CPU only.
    {
        Benchmark dbm("CPU only process");
        std::sort( h_vec.begin(), h_vec.end() );
    }
}

int main(void)
{
    bench<int>   ( "int",       1 * 1000 );
    bench<int>   ( "int",      10 * 1000 );
    bench<int>   ( "int",     100 * 1000 );
    bench<int>   ( "int",    1000 * 1000 );

    bench<double>( "double",    1 * 1000 );
    bench<double>( "double",   10 * 1000 );
    bench<double>( "double",  100 * 1000 );
    bench<double>( "double", 1000 * 1000 );
    return 0;
}

The results of the above program for integer vectors are shown below. Both axes are on logarithmic scales, the X-axis showing vectors from one thousand to one million elements, the Y-axis recording runtime. The blue line represents the time for the GPU to sort the vector in VRAM. The red line includes both the GPU processing time (the blue line) and the time taken to transfer the vector between main memory and VRAM and back again. The purple line is the time taken by the CPU to sort the vector (std::sort).
By the time your vector contains 10,000 elements, using the GPU is faster overall, but only slightly. Notice that the GPU time doesn't change a great deal between 1,000 and 1,000,000 elements. This seems to indicate that the parallelization offered by thrust::sort has not hit the limits of the GPU hardware at a million elements. Performance-wise, once your vector has a million elements, using the GPU is about 10 times faster! Not a bad return for changing a few lines of C++ code.
Sorting a vector of floating point numbers is shown below. This gives very similar results to sorting integers: a crossover at 10,000 elements, where using the GPU becomes more attractive for sorting, and at a million elements the GPU is again about 10 times faster.
I created a small program to test the performance of copying memory between main memory (memcpy), from main memory to VRAM, from VRAM back to main memory, and between two VRAM buffers. The latter three use cudaMemcpy() to perform the copy. The benchmark performs 100 copies using blocks of memory ranging from 100,000 to 100,000,000 bytes. The timings for copies from VRAM to VRAM are left off the graph because they were all below 4 milliseconds. The times for main memory to main memory (blue), main memory to VRAM (red), and VRAM to main memory (purple) are shown below. Note that both axes use a logarithmic scale. Although copying to VRAM was slower than copying to main memory, it was not by a huge factor. Copying back from VRAM to main memory was slower, perhaps because most games only send data to the graphics card.
Tune in next time when we'll take a look at sorting vectors of strings instead of numeric types.