Fedora 11 ships with gcc version 4.4. Unfortunately, at the time of writing, the current CUDA release did not support gcc version 4.4. To get around this you'll need to install a 4.3.x or lower version of gcc and modify your CUDA installation to use the older gcc. To get an older gcc, you can yum install compat-gcc-34 and then apply the changes below. Note that only the gcc34 directory and the links it contains are required if you are not planning to compile the examples from the CUDA SDK.
$ mkdir ~/gcc34
$ ln -s /usr/bin/gcc34 ~/gcc34/gcc
$ ln -s /usr/bin/g++34 ~/gcc34/g++
$ su -l
# cp -av /usr/local/cuda/sdk/common/common.mk /usr/local/cuda/sdk/common/common.mk.original
# vi /usr/local/cuda/sdk/common/common.mk
...
CXX  := g++34
CC   := gcc34
LINK := g++34 -fPIC
...
NVCCFLAGS := --compiler-bindir ~/gcc34

The code below is based on the first example from the Thrust website, with additions to show the input and sorted output on standard error. Notice that there are a host_vector and a device_vector; these are std::vector-like containers that use main memory and VRAM respectively. The thrust::sort() call will transfer control from the CPU to the GPU, and the sort will be processed on the graphics card. Once the sort is complete, execution resumes on the CPU at the line after the thrust::sort() call. As the second-to-last line of main() shows, you can directly access an element of the device vector from code running on the CPU, but because that involves accessing VRAM from the CPU it is a slow operation. It is faster to copy the whole device vector back into main memory (a host vector) before iterating over its elements.
You can clearly see the host and device (RAM and VRAM) vectors used in the code to move the input and output data around. You might be wondering where the kernel functions mentioned in the introduction to the series are. The closest you get to one in this example is the invocation of thrust::sort, which provides the same functionality as std::sort. While the outcome is the same, thrust::sort compiles its code to run on the GPU; in particular, a version of thrust::less is used for element comparison.
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <iterator>

int main(void)
{
    // generate random data on the host
    thrust::host_vector<int> h_vec(20);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    std::cerr << "input..." << std::endl;
    std::copy( h_vec.begin(), h_vec.end(),
               std::ostream_iterator<int>(std::cerr, "\n") );
    std::cerr << "" << std::endl;

    // transfer to device and sort
    thrust::device_vector<int> d_vec = h_vec;
    thrust::sort(d_vec.begin(), d_vec.end());

    // show result
    thrust::host_vector<int> h_result = d_vec;
    std::cerr << "output..." << std::endl;
    std::copy( h_result.begin(), h_result.end(),
               std::ostream_iterator<int>(std::cerr, "\n") );
    std::cerr << "" << std::endl;

    std::cerr << "third item in sorted data:" << d_vec[2] << std::endl;
    return 0;
}

So in the program above, the thrust::sort line will execute on the GPU, accessing the device_vector d_vec and sorting its contents.
The commands below will compile and run the above example, assuming CUDA is already installed on the machine. If your Linux distribution does not use gcc 4.4 then you can leave out the --compiler-bindir argument to nvcc. CUDA programs have the extension .cu instead of .cpp. The gcc compiler is not invoked directly to compile a source file that uses CUDA; instead the nvcc executable is used, which itself uses gcc behind the scenes.
$ nvcc --compiler-bindir ~/gcc34 website-example.cu -o website-example
...
$ ./website-example
input...
1804289383
846930886
1681692777
1714636915
1957747793
424238335
719885386
1649760492
596516649
1189641421
1025202362
1350490027
783368690
1102520059
2044897763
1967513926
1365180540
1540383426
304089172
1303455736

output...
304089172
424238335
596516649
719885386
783368690
846930886
1025202362
1102520059
1189641421
1303455736
1350490027
1365180540
1540383426
1649760492
1681692777
1714636915
1804289383
1957747793
1967513926
2044897763

third item in sorted data:596516649

The program shown below is the first benchmark, comparing an NVIDIA GTS 250 card with an Intel Q6600 for sorting a vector of numbers. Both integer and floating point numbers are tested to see what performance impact there is from using wider data types with a floating-point less-than operation.
The main() function simply calls the bench() template function with a specific numeric type and the size of the vector to use. Note that the only things the program does differently in order to use the GPU for the sort are to copy the vector into a device_vector, use thrust::sort() instead of std::sort(), and copy the device_vector back to the main memory of the machine. The Benchmark class starts a timer whenever an object is created and stops the timer, printing the interval, whenever an object is destroyed. It follows the Resource Acquisition Is Initialization (RAII) design pattern, where the object's scope determines when the benchmark is started and stopped.
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <algorithm>
#include <iostream>
#include "bench.hh"

template < class T >
void bench( const std::string& n, const int SZ )
{
    std::cerr << "-------------" << std::endl;
    std::cerr << "bench() " << n << " SZ:" << SZ << std::endl;
    Benchmark dbm("bench function total....");

    thrust::host_vector<T> h_vec( SZ );
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer to device and sort
    {
        Benchmark dbm("GPU process and copy-to-and-from-device");
        thrust::device_vector<T> d_vec;
        {
            Benchmark dbm("GPU process and copy-to-device");
            d_vec = h_vec;
            {
                Benchmark dbm("GPU process");
                thrust::sort(d_vec.begin(), d_vec.end());
            }
        }
        thrust::host_vector<T> t = d_vec;
        T xx = t[3];
    }

    // sort on host, CPU only.
    {
        Benchmark dbm("CPU only process");
        std::sort( h_vec.begin(), h_vec.end() );
    }
}

int main(void)
{
    bench<int>   ( "int",       1 * 1000 );
    bench<int>   ( "int",      10 * 1000 );
    bench<int>   ( "int",     100 * 1000 );
    bench<int>   ( "int",    1000 * 1000 );

    bench<double>( "double",    1 * 1000 );
    bench<double>( "double",   10 * 1000 );
    bench<double>( "double",  100 * 1000 );
    bench<double>( "double", 1000 * 1000 );
    return 0;
}

The results of the above program for integer vectors are shown below. Both axes are on logarithmic scales, the X-axis showing vectors from one thousand to one million elements, the Y-axis recording runtime. The blue line represents the time for the GPU to sort the vector in VRAM. The red line includes both the GPU processing time (the blue line) and the time taken to transfer the vector between main memory and VRAM and back again. The purple line is the time taken by the CPU to sort the vector (std::sort).
By the time your vector contains 10,000 elements, using the GPU is faster overall, but only slightly. Notice that the GPU time doesn't change a great deal between 1,000 and 1,000,000 elements. This seems to indicate that the parallelization offered by thrust::sort has not hit the limits of the GPU hardware at a million elements. Performance-wise, once your vector has a million elements, using the GPU is about 10 times faster! Not a bad return for changing a few lines of C++ code.
Sorting a vector of floating point numbers is shown below. This gives very similar results to sorting integers: a crossover at 10,000 elements, where using the GPU becomes more attractive for sorting, and at a million elements the GPU is again about 10 times faster.
I created a small program to test the performance of copying memory between main memory (memcpy), from main memory to VRAM, from VRAM back to main memory, and between two VRAM buffers. The latter three use cudaMemcpy() to perform the copy. The benchmark performs 100 copies using blocks of memory ranging from 100,000 to 100,000,000 bytes. The timings for copies from VRAM to VRAM are left off the graph because they were all below 4 milliseconds. The times for main memory to main memory (blue), main memory to VRAM (red), and VRAM to main memory (purple) are shown below. Note that both axes use a logarithmic scale. Although copying to VRAM was slower than copying to main memory, it was not by a huge factor. Copying back from VRAM to main memory was slower, perhaps because most games only send data to the graphics card.
Tune in next time when we'll take a look at sorting vectors of strings instead of numeric types.