
Using GPU Computing with C++ AMP

So far, in discussing parallel programming techniques, we have considered only processor cores. We have gained some skills in parallelizing programs across multiple processors, synchronizing access to shared resources, and using fast synchronization primitives that avoid locks.

However, there is another way to parallelize programs: graphics processing units (GPUs), which have far more cores than even high-performance CPUs. GPU cores are excellent for implementing parallel data processing algorithms, and their sheer number more than pays for the inconvenience of running programs on them. In this article we will get acquainted with one of the ways to run programs on a GPU, using a set of C++ language extensions called C++ AMP.

The C++ AMP extensions are based on the C++ language, which is why this article will demonstrate examples in C++. However, with a moderate amount of .NET interop you can use C++ AMP algorithms in your .NET programs. We will talk about this at the end of the article.

Introduction to C++ AMP

In fact, a GPU is a processor like any other, but with a special instruction set, a large number of cores, and its own memory access protocol. However, there are big differences between modern GPUs and conventional processors, and understanding them is key to creating programs that use the processing power of the GPU effectively.

    Modern GPUs have a very small instruction set. This implies some limitations: no ability to call functions, a limited set of supported data types, no library functions, and so on. Some operations, such as conditional branches, can cost significantly more than on conventional processors. Obviously, moving large amounts of code from the CPU to the GPU under such conditions requires considerable effort.

    The number of cores in the average GPU is significantly higher than in the average conventional processor. However, some tasks are too small or cannot be broken down into large enough parts to benefit from the GPU.

    Synchronization support between GPU cores performing the same task is very limited, and between GPU cores performing different tasks it is absent altogether. This forces the GPU to be synchronized with a conventional processor.

The question immediately arises: what tasks are suitable for solving on a GPU? Keep in mind that not every algorithm is suitable for execution on a GPU. For example, GPUs don't have access to I/O devices, so you won't be able to improve the performance of a program that scrapes RSS feeds from the Internet by using a GPU. However, many computational algorithms can be transferred to the GPU and can be massively parallelized. Below are a few examples of such algorithms (this list is by no means complete):

    increasing and decreasing image sharpness, and other transformations;

    fast Fourier transform;

    matrix transposition and multiplication;

    number sorting;

    direct hash inversion.

An excellent source for additional examples is the Microsoft Native Concurrency blog, which provides code snippets and explanations for various algorithms implemented in C++ AMP.

C++ AMP is a framework included with Visual Studio 2012 that gives C++ developers an easy way to perform computations on the GPU, requiring only a DirectX 11 driver. Microsoft has released C++ AMP as an open specification that can be implemented by any compiler vendor.

The C++ AMP framework lets you run code on graphics accelerators, which it treats as computing devices. Using the DirectX 11 driver, C++ AMP dynamically discovers all available accelerators. C++ AMP also includes a software accelerator, WARP, which emulates a GPU on a conventional processor using multiple cores and SIMD instructions; it serves as a fallback on systems without a GPU, or with a GPU but no DirectX 11 driver.
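
For example, here is a minimal sketch of enumerating the accelerators the runtime can see (assuming Visual Studio 2012 with the C++ AMP headers; the function name is illustrative, not from the article):

#include <amp.h>
#include <iostream>
#include <vector>
using namespace concurrency;

// List every accelerator the C++ AMP runtime detects,
// including the emulated WARP and reference devices.
void ListAccelerators() {
    std::vector<accelerator> accs = accelerator::get_all();
    for (const accelerator& acc : accs) {
        std::wcout << acc.description
                   << L", dedicated memory: " << acc.dedicated_memory << L" KB"
                   << L", emulated: " << (acc.is_emulated ? L"yes" : L"no")
                   << std::endl;
    }
}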

Now let's start exploring an algorithm that can easily be parallelized for execution on a GPU. The implementation below takes two vectors of equal length and, element by element, adds the exponential of an element of the second vector to the corresponding element of the first. It's hard to imagine anything more straightforward:

void VectorAddExpPointwise(float* first, float* second, float* result, int length) {
  for (int i = 0; i < length; ++i) {
    result[i] = first[i] + exp(second[i]);
  }
}

To parallelize this algorithm on a regular processor, you need to split the iteration range into several subranges and run one thread of execution for each of them. We've spent a lot of time in previous articles on exactly this way of parallelizing our first prime number search example - we've seen how it can be done by creating threads manually, passing jobs to a thread pool, and using Parallel.For and PLINQ to automatically parallelize. Remember also that when parallelizing similar algorithms on a conventional processor, we took special care not to split the problem into too small tasks.
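
For reference, here is a minimal sketch of that CPU-side approach in C++ (not from the article; the thread count and chunking choices are illustrative):

#include <algorithm>
#include <cmath>
#include <thread>
#include <vector>

// Split the iteration range into one contiguous subrange per hardware thread.
void VectorAddExpPointwiseParallelCpu(float* first, float* second, float* result, int length) {
    unsigned numThreads = std::thread::hardware_concurrency();
    if (numThreads == 0) numThreads = 2;                       // fallback if the value is unknown
    int chunk = (length + (int)numThreads - 1) / (int)numThreads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        int begin = (int)t * chunk;
        int end = std::min(length, begin + chunk);
        if (begin >= end) break;
        workers.emplace_back([=] {                             // each thread handles its own subrange
            for (int i = begin; i < end; ++i)
                result[i] = first[i] + std::exp(second[i]);
        });
    }
    for (auto& w : workers) w.join();
}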

For the GPU, these warnings are unnecessary. GPUs have many cores that execute threads very quickly, and the cost of a context switch is significantly lower than on conventional processors. Below is a snippet using the parallel_for_each function from the C++ AMP framework:

#include <amp.h>
#include <amp_math.h>

using namespace concurrency;

void VectorAddExpPointwise(float* first, float* second, float* result, int length) {
  array_view<const float, 1> avFirst(length, first);
  array_view<const float, 1> avSecond(length, second);
  array_view<float, 1> avResult(length, result);
  avResult.discard_data();
  parallel_for_each(avResult.extent, [=](index<1> i) restrict(amp) {
    avResult[i] = avFirst[i] + fast_math::exp(avSecond[i]);
  });
  avResult.synchronize();
}

Now let's examine each part of the code separately. Let's immediately note that the general form of the main loop has been preserved, but the originally used for loop has been replaced by a call to the parallel_for_each function. In fact, the principle of converting a loop into a function or method call is not new to us - such a technique has previously been demonstrated using the Parallel.For() and Parallel.ForEach() methods from the TPL library.

Next, the input data (the parameters first, second and result) are wrapped in array_view instances. The array_view class is used to wrap data passed to the GPU (accelerator). Its template parameters specify the data type and its dimensionality. In order to execute instructions on a GPU that access data originally processed on a conventional CPU, someone or something must take care of copying the data to the GPU, because most modern graphics cards are separate devices with their own memory. array_view instances solve this problem: they provide data copying on demand and only when it is really needed.

When the GPU completes the task, the data is copied back. By instantiating array_view with a const element type, we ensure that first and second are copied into GPU memory but not copied back. Likewise, by calling discard_data() we avoid copying result from CPU memory to accelerator memory; it will only be copied in the opposite direction.

The parallel_for_each function takes an extent object that specifies the shape of the data to be processed and a function to apply to each element of that extent. In the example above we used a lambda function, support for which appeared in the ISO C++ 2011 (C++11) standard. The restrict(amp) keyword instructs the compiler to check whether the function body can be executed on the GPU, and it disables most C++ features that cannot be compiled into GPU instructions.
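
As a hedged illustration (not taken from the article's examples), the following sketch shows the kind of code restrict(amp) accepts and, in the comments, the kind it rejects:

#include <amp.h>
using namespace concurrency;

// The lambda body is verified against the GPU's limitations by restrict(amp).
void DoubleInPlace(float* data, int length) {
    array_view<float, 1> av(length, data);
    parallel_for_each(av.extent, [=](index<1> i) restrict(amp) {
        // throw std::runtime_error("boom");  // rejected: exceptions are not allowed
        // char c = 'x';                      // rejected: char is not an amp-compatible type
        av[i] *= 2.0f;                        // plain arithmetic on supported types is fine
    });
    av.synchronize();
}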

The lambda's parameter, an index<1> object, represents a one-dimensional index. It must match the extent object being used: if we were to declare the extent as two-dimensional (for example, by defining the shape of the source data as a two-dimensional matrix), the index would also have to be two-dimensional. An example of such a situation is given below.

Finally, the call to synchronize() at the end of VectorAddExpPointwise ensures that the results computed by the GPU in the array_view avResult are copied back into the result array.

This concludes our first introduction to the world of C++ AMP, and now we are ready for a more detailed exploration, as well as more interesting examples demonstrating the benefits of parallel computing on a GPU. Vector addition is not the best candidate for demonstrating GPU usage, because the overhead of copying the data dominates the small amount of computation. The next subsections show two more interesting examples.

Matrix multiplication

The first "real" example we'll look at is matrix multiplication. For implementation, we will take a simple cubic matrix multiplication algorithm, and not the Strassen algorithm, which has a execution time close to cubic ~O(n 2.807). Given two matrices, an m x w matrix A and a w x n matrix B, the following program will multiply them and return the result, an m x n matrix C:

void MatrixMultiply(int* A, int m, int w, int* B, int n, int* C) {
  for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
      int sum = 0;
      for (int k = 0; k < w; ++k) {
        sum += A[i*w + k] * B[k*n + j];
      }
      C[i*n + j] = sum;
    }
  }
}

There are several ways to parallelize this implementation, and if you want to parallelize this code to run on a regular processor, the right choice would be to parallelize the outer loop. However, the GPU has a fairly large number of cores, and by parallelizing only the outer loop, we will not be able to create a sufficient number of jobs to load all the cores with work. Therefore, it makes sense to parallelize the two outer loops, leaving the inner loop untouched:

void MatrixMultiply(int* A, int m, int w, int* B, int n, int* C) {
  array_view<const int, 2> avA(m, w, A);
  array_view<const int, 2> avB(w, n, B);
  array_view<int, 2> avC(m, n, C);
  avC.discard_data();
  parallel_for_each(avC.extent, [=](index<2> idx) restrict(amp) {
    int sum = 0;
    for (int k = 0; k < w; ++k) {
      sum += avA(idx[0], k) * avB(k, idx[1]);
    }
    avC[idx] = sum;
  });
}

This implementation still closely resembles the sequential implementation of matrix multiplication and the vector addition example given above, with the exception of the index, which is now two-dimensional and is accessed in the inner loop using the [] operator. How much faster is this version than the sequential alternative running on a regular processor? Multiplying two integer matrices of size 1024 x 1024, the sequential version on a regular CPU takes 7350 milliseconds on average, while the GPU version - hold on tight - takes 50 milliseconds: 147 times faster!

Particle motion simulation

The examples of solving problems on the GPU presented above have a very simple inner loop. Clearly, this will not always be the case. The Native Concurrency blog, linked above, demonstrates an example of modeling gravitational interactions between particles. The simulation proceeds in an unbounded number of steps; at each step, new values of the acceleration vector are computed for each particle, and then the particles' new coordinates are determined. Here, the vector of particles is what gets parallelized: with a sufficiently large number of particles (several thousand and up), you can create enough work to load all the GPU cores.

The basis of the algorithm is the implementation of determining the result of interactions between two particles, as shown below, which can easily be transferred to the GPU:

// here float4 is a four-component vector type;
// the vectors represent the particles involved in the operations
void bodybody_interaction(float4& acceleration, const float4 p1, const float4 p2) restrict(amp) {
  float4 dist = p2 - p1;            // the w component is not used here
  float absDist = dist.x*dist.x + dist.y*dist.y + dist.z*dist.z;
  float invDist = 1.0f / sqrt(absDist);
  float invDistCube = invDist*invDist*invDist;
  acceleration += dist*PARTICLE_MASS*invDistCube;
}

The initial data at each modeling step is an array with the coordinates and velocities of particles, and as a result of calculations, a new array with the coordinates and velocities of particles is created:

struct particle {
  float4 position, velocity;
  // implementations of the constructor, copy constructor and
  // operator= with restrict(amp) omitted to save space
};

void simulation_step(array<particle>& previous, array<particle>& next, int bodies) {
  extent<1> ext(bodies);
  parallel_for_each(ext, [&](index<1> idx) restrict(amp) {
    particle p = previous[idx];
    float4 acceleration(0, 0, 0, 0);
    for (int body = 0; body < bodies; ++body) {
      bodybody_interaction(acceleration, p.position, previous[body].position);
    }
    p.velocity += acceleration*DELTA_TIME;
    p.position += p.velocity*DELTA_TIME;
    next[idx] = p;
  });
}

With the help of an appropriate graphical interface, the simulation can be very interesting to watch. The full example provided by the C++ AMP team can be found on the Native Concurrency blog. On my system with an Intel Core i7 processor and a GeForce GT 740M graphics card, a simulation of 10,000 particles runs at ~2.5 steps per second using the sequential version on the regular processor, and at 160 steps per second using the optimized version running on the GPU - a huge increase in performance.

Before we wrap up this section, there is one more important feature of the C++ AMP framework that can further improve the performance of code running on the GPU. GPUs support a programmable data cache (often called shared memory). The values stored in this cache are shared by all threads of execution within a single tile. Thanks to memory tiling, programs based on C++ AMP can read data from graphics card memory into the shared memory of a tile and then access it from multiple threads of execution without re-fetching it from graphics card memory. Accessing tile shared memory is approximately 10 times faster than accessing graphics card memory. In other words, you have reasons to keep reading.

To obtain a tiled version of the parallel loop, the parallel_for_each method is passed a tiled_extent domain, which divides a multidimensional extent into multidimensional tiles, and the lambda takes a tiled_index parameter, which specifies the global and local IDs of the thread within the tile. For example, a 16x16 matrix can be divided into 2x2 tiles and then passed to the parallel_for_each function:

extent<2> matrix(16, 16);
tiled_extent<2, 2> tiledMatrix = matrix.tile<2, 2>();
parallel_for_each(tiledMatrix, [=](tiled_index<2, 2> idx) restrict(amp) {
  // ...
});

Each of the four threads of execution belonging to the same tile can share the data stored in it.

When performing operations with matrices in the GPU kernel, instead of the standard index<2> used in the examples above, you can use idx.global. Proper use of local tiled memory and local indexes can provide significant performance gains. To declare tiled memory shared by all threads of execution in a single tile, local variables can be declared with the tile_static specifier.

In practice, a common technique is to declare shared memory and have different threads of execution initialize its individual elements:

parallel_for_each(tiledMatrix, [=](tiled_index<2, 2> idx) restrict(amp) {
  // an array shared by all threads of execution in the tile
  tile_static int local[2][2];
  // assign a value to the element belonging to this thread of execution
  local[idx.local[0]][idx.local[1]] = 42;
});

Obviously, any benefit from shared memory can only be obtained if access to it is synchronized; that is, threads must not access the memory until it has been initialized by one of them. Synchronization of threads within a tile is performed using tile_barrier objects (reminiscent of the Barrier class from the TPL library): threads can continue execution only after a call to the tile_barrier.wait() method, which returns control only when all threads have called tile_barrier.wait(). For example:

parallel_for_each(tiledMatrix, [=](tiled_index<2, 2> idx) restrict(amp) {
  // an array shared by all threads of execution in the tile
  tile_static int local[2][2];
  // assign a value to the element belonging to this thread of execution
  local[idx.local[0]][idx.local[1]] = 42;
  // idx.barrier is an instance of tile_barrier
  idx.barrier.wait();
  // now this thread can access the "local" array
  // using the indexes of other threads of execution!
});

Now is the time to put what we have learned into a concrete example. Let's return to the matrix multiplication implementation written without the tiled memory organization and add the described optimization to it. Let's assume that the matrix size is a multiple of 256: this will allow us to work with 16 x 16 blocks. The nature of matrices allows for block-by-block multiplication, and we can take advantage of this feature (in fact, dividing matrices into blocks is a typical optimization of the matrix multiplication algorithm, providing more efficient CPU cache usage).

The essence of the technique is as follows. To find C(i,j) (the element in row i and column j of the result matrix), you need to compute the dot product of A(i,*) (the i-th row of the first matrix) and B(*,j) (the j-th column of the second matrix). However, this is equivalent to computing partial dot products over fragments of the row and column and then summing the results. We can use this fact to convert the matrix multiplication algorithm into a tiled version:

void MatrixMultiply(int* A, int m, int w, int* B, int n, int* C) {
  array_view<const int, 2> avA(m, w, A);
  array_view<const int, 2> avB(w, n, B);
  array_view<int, 2> avC(m, n, C);
  avC.discard_data();
  parallel_for_each(avC.extent.tile<16, 16>(), [=](tiled_index<16, 16> idx) restrict(amp) {
    int sum = 0;
    int localRow = idx.local[0], localCol = idx.local[1];
    for (int k = 0; k < w; k += 16) {
      tile_static int localA[16][16], localB[16][16];
      // each thread loads one element of a 16 x 16 block of A and of B into tile memory
      localA[localRow][localCol] = avA(idx.global[0], k + localCol);
      localB[localRow][localCol] = avB(k + localRow, idx.global[1]);
      idx.barrier.wait();
      // accumulate the partial dot product over this 16-element fragment
      for (int t = 0; t < 16; ++t) {
        sum += localA[localRow][t] * localB[t][localCol];
      }
      idx.barrier.wait(); // make sure no thread overwrites the blocks prematurely
    }
    avC[idx.global] = sum;
  });
}

The essence of the described optimization is that each thread in the tile (256 threads are created for a 16 x 16 block) initializes its own element in 16 x 16 local copies of fragments of the original matrices A and B. Each thread in the tile needs only one row and one column of these blocks, but all threads together access each row and each column 16 times. This approach significantly reduces the number of accesses to main memory.

To compute element (i,j) of the result matrix, the algorithm requires the complete i-th row of the first matrix and the j-th column of the second matrix. With the 16x16 tiling, when k=0, a 16 x 16 block of the first matrix and a 16 x 16 block of the second matrix are read into shared memory. The thread computing element (i,j) of the result matrix then calculates the partial dot product over the first 16 elements of the i-th row and j-th column of the original matrices.

In this example, the tiled organization provides a huge performance boost: the tiled version of matrix multiplication runs in approximately 17 milliseconds (for the same 1024 x 1024 input matrices), which is 430 times faster than the version running on a conventional processor!

Before we end our discussion of the C++ AMP framework, we would like to mention the tools available to developers in Visual Studio. Visual Studio 2012 offers a GPU debugger that lets you set breakpoints, examine the call stack, and read and change local variable values (some accelerators support GPU debugging directly; for others, Visual Studio uses a software simulator), and a profiler that lets you evaluate the benefits an application receives from parallelizing operations on the GPU. For more information about debugging capabilities in Visual Studio, see the article "Walkthrough: Debugging a C++ AMP Application" on MSDN.

GPU Computing Alternatives in .NET

So far this article has only shown examples in C++, however, there are several ways to harness the power of the GPU in managed applications. One way is to use interop tools that allow you to offload work with GPU cores to low-level C++ components. This solution is great for those who want to use the C++ AMP framework or have the ability to use pre-built C++ AMP components in managed applications.
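
As a rough sketch of that interop route (the exported name and calling convention below are assumptions, not part of the article), the C++ AMP function shown earlier can be wrapped in a plain C entry point and exported from a native DLL; on the managed side, a matching DllImport declaration with float[] parameters is then enough to call it:

// Native DLL side: a C-callable wrapper around the C++ AMP implementation.
extern "C" __declspec(dllexport)
void __stdcall VectorAddExpPointwiseNative(float* first, float* second,
                                           float* result, int length) {
    VectorAddExpPointwise(first, second, result, length); // the C++ AMP version shown earlier
}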

Another way is to use a library that works with the GPU directly from managed code. Several such libraries currently exist, for example GPU.NET and CUDAfy.NET (both commercial offerings). Below is an example from the GPU.NET GitHub repository demonstrating a kernel that performs an element-wise multiply-and-add over vectors:

public static void MultiplyAddGpu(double[] a, double[] b, double[] c) {
  int ThreadId = BlockDimension.X * BlockIndex.X + ThreadIndex.X;
  int TotalThreads = BlockDimension.X * GridDimension.X;

  // grid-stride loop: each thread handles every TotalThreads-th element
  for (int ElementIdx = ThreadId; ElementIdx < a.Length; ElementIdx += TotalThreads) {
    a[ElementIdx] = b[ElementIdx] * c[ElementIdx] + a[ElementIdx];
  }
}

I am of the opinion that it is much easier and more efficient to learn a language extension (such as C++ AMP) than to try to orchestrate interactions at the library level or to perform significant manipulations of IL code.

So, having looked at the possibilities of parallel programming in .NET and of using the GPU, no one should doubt that organizing parallel computing is an important way to increase performance. In many servers and workstations around the world, the invaluable processing power of CPUs and GPUs goes unused because applications simply don't use it.

The Task Parallel Library gives us a unique opportunity to harness all available CPU cores, although doing so requires solving some interesting problems of synchronization, excessive task fragmentation, and uneven distribution of work between execution threads.

The C++ AMP framework and other multi-purpose GPU parallel computing libraries can be successfully used to parallelize calculations across hundreds of GPU cores. Finally, there is a previously unexplored opportunity to gain productivity gains from the use of cloud distributed computing technologies, which have recently become one of the main directions in the development of information technology.

Features of the AMD/ATI Radeon architecture

This is similar to the birth of new biological species, when, during the development of habitats, living beings evolve to improve their adaptability to the environment. Likewise, the GPU, starting with accelerating the rasterization and texturing of triangles, developed additional capabilities for executing shader programs for coloring these same triangles. And these abilities are also in demand in non-graphical computing, where in some cases they provide significant performance gains compared to traditional solutions.

Let us draw further analogies - after a long evolution on land, mammals penetrated into the sea, where they displaced ordinary marine inhabitants. In competition, mammals used both new advanced abilities that appeared on the earth’s surface, and those specially acquired for adaptation to life in water. Likewise, GPUs, based on the architecture's strengths for 3D graphics, are increasingly adding specialized functionality useful for non-graphics tasks.

So, what allows GPUs to claim their own sector in the general-purpose software space? The GPU microarchitecture is built completely differently from that of conventional CPUs, and it contains certain advantages from the start. Graphics tasks involve independent parallel processing, and the GPU is natively multithreaded; that parallelism works entirely to its advantage. The microarchitecture is designed to exploit the large number of threads waiting to be executed.

The GPU consists of several dozen (30 for Nvidia GT200, 20 for Evergreen, 16 for Fermi) processor cores, which are called Streaming Multiprocessor in Nvidia terminology, and SIMD Engine in ATI terminology. For the purposes of this article, we will call them miniprocessors, because they execute several hundred program threads and can do almost everything that a regular CPU can do, but still not everything.

Marketing names are confusing: for the sake of impressiveness they quote the number of functional units that can add and multiply, for example, 320 vector "cores". These "cores" are more like vector lanes than real cores. It is better to think of the GPU as a kind of multi-core processor with a large number of cores, each executing many threads simultaneously.

Each miniprocessor has local memory: 16 KB in GT200, 32 KB in Evergreen and 64 KB in Fermi (essentially a programmable L1 cache). Its access time is similar to that of the first-level cache of a conventional CPU, and it performs a similar function of delivering data to the functional units as quickly as possible. In the Fermi architecture, part of the local memory can be configured as a regular cache. In a GPU, local memory is used for fast data exchange between executing threads. One of the usual schemes of a GPU program is as follows: first, data from the GPU's global memory is loaded into local memory. Global memory is just ordinary video memory, located (like system memory) separately from "its" processor; in the case of a video card it consists of several chips soldered onto the card's PCB. Next, several hundred threads work with this data in local memory and write the result to global memory, after which it is transferred to the CPU. It is the programmer's responsibility to write the instructions for loading and unloading data to and from local memory; essentially, this is partitioning the data [of a specific task] for parallel processing. The GPU also supports atomic read/write instructions to memory, but they are inefficient and are usually needed at the final stage to "glue together" the calculation results of all the miniprocessors.

Local memory is common to all threads executing in a miniprocessor, which is why, for example, in Nvidia terminology it is even called shared, while the term local memory denotes the exact opposite: a private area of global memory belonging to an individual thread, visible and accessible only to it. But besides local memory, the miniprocessor has another memory area, which in all architectures is roughly four times larger. It is divided equally between all executing threads; these are the registers for storing variables and intermediate calculation results. Each thread gets a few dozen registers; the exact number depends on how many threads the miniprocessor is running. This number is very important, since the latency of global memory is very high, hundreds of cycles, and in the absence of caches there is nowhere else to keep intermediate results of calculations.

And one more important feature of the GPU: "soft" vectorization. Each miniprocessor has a large number of compute modules (8 in GT200, 16 in Radeon and 32 in Fermi), but they can all execute only the same instruction, at the same program address. The operands, however, can differ: each thread has its own. For example, take an instruction that adds the contents of two registers: it is executed simultaneously by all the computing devices, but each takes its own pair of registers. It is assumed that all threads of a GPU program, performing parallel data processing, generally move through the program code in lockstep, so all computing modules are loaded evenly. If the threads diverge in their execution paths because of branches in the program, so-called serialization occurs. Then not all computing modules are used, since the threads submit different instructions for execution, and a block of computing modules can execute, as we have already said, only an instruction at one address. And, of course, performance drops relative to the maximum.

The advantage is that vectorization is completely automatic; it is not programming with SSE, MMX and so on, and the GPU handles the divergences itself. Theoretically, you can write programs for the GPU without thinking about the vector nature of the execution modules at all, but the speed of such a program will not be very high. The downside is the large vector width. It is larger than the nominal number of functional modules: 32 for Nvidia GPUs and 64 for Radeon. Threads are processed in blocks of the appropriate size. Nvidia calls this block of threads a warp, AMD calls it a wave front; it is the same thing. Thus, on 16 computing devices, a "wave front" of 64 threads is processed in four clock cycles (assuming the usual instruction length). The author prefers the term warp in this case, due to the association with the nautical term warp, meaning a rope made of twisted strands. So the threads "twist" and form a tight bundle. However, "wave front" can also be associated with the sea: instructions arrive at the execution units the same way waves roll onto the shore, one after another.

If all threads have advanced equally far in the program (are located at the same place) and are thus executing the same instruction, everything is fine; if not, a slowdown occurs. In that case, threads from one warp or wave front are located at different places in the program; they are divided into groups of threads that share the same instruction pointer value. And only the threads of one group are executed at a time: all execute the same instruction, but with different operands. As a result, a warp runs as many times slower as the number of groups it is divided into, and the number of threads in a group does not matter. Even if a group consists of only one thread, it will still take as long to execute as a full warp. In hardware, this is implemented by masking out certain threads: instructions are formally executed, but the results are not written anywhere and are not used further.

Although at any given time each miniprocessor (Streaming MultiProcessor or SIMD Engine) executes instructions belonging to only one warp (a bunch of threads), it has several dozen active warps in the execution pool. Having executed the instructions of one warp, the miniprocessor executes not the next instruction of the threads of this warp, but the instructions of some other warp. That warp can be in a completely different place in the program, this will not affect the speed, since only inside the warp the instructions of all threads must be the same for execution at full speed.

(Diagram: each of the 20 SIMD Engines has four active wave fronts, each with 64 threads; every thread is shown as a short line. Total: 64×4×20 = 5120 threads.)

Thus, given that each warp or wave front consists of 32-64 threads, the miniprocessor has several hundred active threads that execute almost simultaneously. Below we will see what architectural benefits such a large number of parallel threads promises, but first we will consider what limitations the miniprocessors that make up the GPU have.

The main thing is that the GPU has no stack where function parameters and local variables could be stored. Because of the large number of threads there is simply no room on the chip for a stack. Indeed, since the GPU executes about 10,000 threads simultaneously, with a stack of 100 KB per thread the total volume would be 1 GB, equal to the standard amount of all video memory. Moreover, there is no way to place a stack of any significant size in the GPU core itself: even with 1000 bytes of stack per thread, a single miniprocessor would need 1 MB of memory, which is almost five times the combined amount of the miniprocessor's local memory and the memory allocated for registers.

Therefore, there is no recursion in a GPU program, and function calls are limited: all functions are inlined directly into the code when the program is compiled. This limits the scope of GPU applications to computational-type tasks. It is sometimes possible to use limited stack emulation in global memory for recursive algorithms with a known, small iteration depth, but this is not a typical GPU application. To do so, one has to design the algorithm specially and explore the feasibility of its implementation, without any guarantee of acceleration compared to the CPU.

Fermi introduced the ability to use virtual functions for the first time, but again their use is limited by the lack of a large, fast cache for each thread. The 1536 threads of a miniprocessor share 48 KB or 16 KB of L1, so virtual functions can only be used relatively rarely in a program; otherwise the stack will also end up in slow global memory, which will slow down execution and most likely bring no benefit compared to the CPU version.

Thus, the GPU is represented as a computing coprocessor into which data is loaded, it is processed by some algorithm, and the result is produced.

Architecture benefits

But the GPU computes very quickly, and its heavy multithreading helps it do so. A large number of active threads makes it possible to partially hide the high latency of the separately located global video memory, which is about 500 clock cycles. It is hidden especially well for code with a high density of arithmetic operations. Thus, the transistor-expensive L1-L2-L3 cache hierarchy is not required. Instead, many compute modules can be placed on the chip, providing outstanding arithmetic performance. While the instructions of one thread or warp are being executed, the other hundreds of threads quietly wait for their data.

Fermi introduced an L2 cache of about 1 MB, but it cannot be compared with the caches of modern CPUs; it is intended more for communication between cores and various software tricks. If its size is divided among all the tens of thousands of threads, each gets a negligible amount.

But besides global memory latency, there are many more latencies in a computing device that need to be hidden: the latency of on-chip data transfer from the computing devices to the first-level cache, that is, the GPU's local memory, and to the registers, as well as the instruction cache. The register file, like local memory, is located separately from the functional modules, and the access time to them is roughly fifteen cycles. And again, a large number of threads, of active warps, effectively hides this latency. Moreover, the aggregate bandwidth to the local memory of the whole GPU, taking into account the number of miniprocessors it comprises, is significantly greater than the bandwidth to the first-level cache of modern CPUs. The GPU can process significantly more data per unit of time.

We can say right away that if the GPU is not supplied with a large number of parallel threads, its performance will be almost zero: it will work at the same pace as if it were fully loaded, but do far less work. For example, suppose there is only one thread instead of 10,000: performance will drop by roughly a thousand times, because not only will most of the blocks sit idle, but all the latencies will be felt as well.

The problem of hiding latencies is also acute for modern high-frequency CPUs; sophisticated methods are used to eliminate it - deep pipelining, out-of-order execution of instructions. This requires complex instruction schedulers, various buffers, etc., which takes up space on the chip. This is all required for best single-threaded performance.

But none of this is needed in the GPU; it is architecturally faster for computing tasks with a large number of threads, and it transforms multithreading into performance the way the philosopher's stone turns lead into gold.

The GPU was originally designed for optimal execution of shader programs for triangle pixels, which are obviously independent and can be executed in parallel. From that starting point it evolved, by adding various capabilities (local memory and addressable access to video memory, as well as a more complex instruction set), into a very powerful computing device, which can still be used effectively only for algorithms that allow a highly parallel implementation using a limited amount of local memory.

Example

One of the most classic problems for the GPU is computing the interaction of N bodies creating a gravitational field. But if, for example, we need to calculate the evolution of the Earth-Moon-Sun system, the GPU is a poor helper: there are too few objects. For each object, interactions with all other objects must be computed, and there are only two of them. In the case of the motion of the Solar System with all its planets and moons (a couple of hundred objects), the GPU is still not very efficient. Admittedly, due to the high overhead of thread management, a multi-core processor will also not be able to show all its power and will effectively run in single-threaded mode. But if you also need to calculate the trajectories of comets and asteroid-belt objects, then this is already a task for the GPU, since there are enough objects to create the required number of parallel calculation threads.

The GPU will also perform well if you need to calculate the collision of globular clusters of hundreds of thousands of stars.

Another opportunity to use GPU power in an N-body problem arises when you need to compute many individual problems, albeit each with a small number of bodies: for example, variants of the evolution of one system for different initial velocities. Then you can use the GPU effectively without any problems.
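
A hedged sketch of that idea, using the C++ AMP syntax from the first part of the article (not from the original; the toy physics, names and constants are illustrative), with one GPU thread per independent simulation:

#include <amp.h>
using namespace concurrency;

// Run many small, independent simulations in parallel: thread i integrates
// its own tiny system for a fixed number of steps.
void SimulateManySystems(const float* initialVelocity, float* finalPosition, int numSimulations) {
    const float DT = 0.001f;
    const int STEPS = 1000;
    array_view<const float, 1> avV0(numSimulations, initialVelocity);
    array_view<float, 1> avX(numSimulations, finalPosition);
    avX.discard_data();
    parallel_for_each(avX.extent, [=](index<1> i) restrict(amp) {
        float x = 0.0f, v = avV0[i];
        for (int s = 0; s < STEPS; ++s) {   // toy system: one body under constant acceleration
            v += -9.8f * DT;
            x += v * DT;
        }
        avX[i] = x;
    });
    avX.synchronize();
}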

AMD Radeon microarchitecture details

We looked at the basic principles of GPU organization; they are common to video accelerators from all manufacturers, since these devices initially had one target task: shader programs. However, manufacturers have found room to differ in the details of their microarchitectural implementations, just as CPUs from different vendors are sometimes very different even when compatible, such as Pentium 4 and Athlon or Core. The Nvidia architecture is already quite widely known; now we will look at Radeon and highlight the main differences in the approaches of these vendors.

AMD video cards received full support for general-purpose computing starting with the Evergreen family, which also implemented DirectX 11 specifications for the first time. Cards from the 47xx family have a number of significant limitations, which will be discussed below.

The differences in the size of local memory (32 KB for Radeon versus 16 KB for GT200 and 64 KB for Fermi) are generally not significant, and neither is the wave front size of 64 threads for AMD versus 32 threads per warp for Nvidia. Almost any GPU program can be easily reconfigured and tuned to these parameters. Performance can change by tens of percent, but in the case of a GPU this is not so important, because a GPU program usually runs either ten times slower than its CPU counterpart, or ten times faster, or does not work at all.
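
For example, a hedged sketch of how such tuning usually looks in C++ AMP terms (the names and the trivial kernel are illustrative): keep the tile size as a single compile-time constant, so the kernel can be rebuilt for a different wave front width or local memory size.

#include <amp.h>
using namespace concurrency;

static const int TILE = 16;   // retune to 8, 32, ... to match the hardware

// Trivial tiled kernel, just to show where the tile size enters;
// assumes rows and cols are multiples of TILE.
void ScaleTiled(float* data, int rows, int cols, float factor) {
    array_view<float, 2> av(rows, cols, data);
    parallel_for_each(av.extent.tile<TILE, TILE>(),
                      [=](tiled_index<TILE, TILE> idx) restrict(amp) {
        av[idx.global] *= factor;
    });
    av.synchronize();
}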

More important is AMD's use of VLIW (Very Long Instruction Word) technology. Nvidia uses simple scalar instructions operating on scalar registers; its accelerators implement simple classical RISC. AMD video cards have the same number of registers as the GT200, but the registers are 128-bit vectors. Each VLIW instruction operates on several four-component 32-bit registers, which resembles SSE, but VLIW is much more capable. This is not SIMD (Single Instruction Multiple Data) like SSE: here the instructions for each pair of operands can be different and even dependent! For example, let the components of register A be called a1, a2, a3, a4, and similarly for register B. Then a single instruction executing in one clock cycle can compute, for example, the number a1×b1+a2×b2+a3×b3+a4×b4 or the two-dimensional vector (a1×b1+a2×b2, a3×b3+a4×b4).

This was made possible due to the lower frequency of the GPU than the CPU and the strong reduction in process technology in recent years. In this case, no scheduler is required; almost everything is executed in a clock cycle.

Thanks to vector instructions, Radeon's peak single-precision performance is very high, reaching teraflops.

One vector register can store one double-precision number instead of four single-precision ones. And one VLIW instruction can either add two pairs of doubles, or multiply two doubles, or multiply two doubles and add a third. Thus, peak performance in double is about five times lower than in float. For older Radeon models, it matches the performance of Nvidia Tesla cards on the new Fermi architecture and is much higher than the double-precision performance of GT200-based cards. In consumer Fermi-based GeForce video cards, the maximum speed of double-precision calculations was reduced by a factor of four.


(Diagram: schematic of Radeon operation; only one of the 20 miniprocessors running in parallel is shown.)

GPU manufacturers, unlike CPU manufacturers (primarily x86-compatible ones), are not bound by compatibility issues. A GPU program is first compiled into some intermediate code, and when the program runs, the driver compiles this code into model-specific machine instructions. As described above, GPU manufacturers have taken advantage of this by coming up with convenient ISA (Instruction Set Architecture) for their GPUs and changing them from generation to generation. In any case, this added some percentage of performance due to the absence (as unnecessary) of a decoder. But AMD went even further by coming up with its own format for arranging instructions in machine code. They are not arranged sequentially (according to the program listing), but in sections.

First comes a section of conditional branch instructions, which contain links to the sections of continuous arithmetic instructions corresponding to the different branches. The arithmetic sections, called VLIW bundles, contain only arithmetic instructions operating on data from registers or local memory. This organization simplifies managing the flow of instructions and delivering them to the execution units, which is all the more useful given that VLIW instructions are relatively large. There are also sections for memory access instructions.

Conditional jump instruction sections:
  Section 0: Branch 0, link to section 3 (continuous arithmetic instructions)
  Section 1: Branch 1, link to section 4
  Section 2: Branch 2, link to section 5

Continuous arithmetic instruction sections:
  Section 3: VLIW instructions 0, 1, 2, 3
  Section 4: VLIW instructions 4, 5
  Section 5: VLIW instructions 6, 7, 8, 9

GPUs from both Nvidia and AMD also have built-in instructions to quickly calculate basic math functions, square root, exponent, logarithms, sines, and cosines for single-precision numbers in a few clock cycles. There are special computing units for this. They “originate” from the need to implement fast approximation of these functions in geometry shaders.

Even if someone did not know that GPUs are used for graphics, and only read the technical characteristics, then by this sign he could guess that these computing coprocessors originated from video accelerators. Likewise, based on certain traits of marine mammals, scientists realized that their ancestors were land creatures.

But a more obvious feature that reveals the graphical origin of the device is the 2D and 3D texture reading units with support for bilinear interpolation. They are widely used in GPU programs, as they provide accelerated and simplified reading of read-only data arrays. One of the standard behaviors of a GPU application is to read arrays of source data, process them in the computing cores, and write the result to another array, which is then transferred back to the CPU. This scheme is standard and common because it is convenient for the GPU architecture. Tasks that require intensive reads and writes into one large region of global memory, thus containing data dependencies, are difficult to parallelize and implement efficiently on the GPU. Also, their performance will greatly depend on the latency of global memory, which is very high. But if the task is described by the pattern “reading data - processing - writing the result,” then you can almost certainly get a big boost from executing it on the GPU.

For texture data in the GPU, there is a separate hierarchy of small caches of the first and second levels. This is what provides acceleration from using textures. This hierarchy originally appeared in GPUs in order to take advantage of the locality of access to textures: obviously, after processing one pixel, a neighboring pixel (with a high probability) will require nearby texture data. But many algorithms for conventional calculations have a similar nature of data access. So texture caches from graphics will be very useful.

Although the size of the L1-L2 caches in Nvidia and AMD cards is approximately similar, which is obviously caused by the requirements for optimality in terms of game graphics, the access latency to these caches varies significantly. Nvidia has higher access latency, and texture caches in GeForce primarily help reduce the load on the memory bus, rather than directly speed up data access. This is not noticeable in graphics programs, but is important for general purpose programs. In Radeon, the latency of the texture cache is lower, but the latency of the local memory of miniprocessors is higher. We can give the following example: for optimal matrix multiplication on Nvidia cards, it is better to use local memory, loading the matrix there block by block, while for AMD it is better to rely on a low-latency texture cache, reading matrix elements as needed. But this is already a fairly subtle optimization, and for an algorithm that has already been fundamentally transferred to the GPU.

This difference also shows up when using 3D textures. One of the first GPU computing benchmarks that showed a serious advantage for AMD used 3D textures, since it worked with a three-dimensional data array. Texture access latency on Radeon is significantly lower, and the 3D case is additionally optimized in hardware.

To obtain maximum performance from hardware from various companies, some tuning of the application for a specific card is required, but this is an order of magnitude less significant than the development of an algorithm for the GPU architecture in principle.

Radeon 47xx Series Limitations

In this family, support for GPU computing is incomplete. Three important points can be noted. Firstly, there is no local memory; that is, it is physically present but lacks the universal access required by the modern standard of GPU programs. It is emulated in software on top of global memory, which means that using it, unlike on a full-featured GPU, brings no benefit. The second point is limited support for various atomic memory operations and synchronization instructions. And the third point is the rather small instruction cache: beyond a certain program size, execution slows down significantly. There are other minor restrictions as well. We can say that only programs ideally suited to the GPU will run well on this video card. Although a video card may show good Gigaflops numbers in simple test programs that operate only on registers, effectively programming something complex for it is problematic.

Advantages and disadvantages of Evergreen

If you compare AMD and Nvidia products from a GPU computing perspective, the 5xxx series looks like a very powerful GT200. So powerful that it surpasses Fermi in peak performance by roughly two and a half times, especially after the parameters of the new Nvidia video cards were cut down and the number of cores reduced. But the L2 cache introduced in Fermi simplifies the implementation of some algorithms on the GPU, thus expanding the GPU's scope. Interestingly, for CUDA programs well optimized for the previous GT200 generation, Fermi's architectural innovations often did nothing. They sped up in proportion to the increase in the number of computing modules, that is, by less than a factor of two (for single-precision numbers), or even less, because memory bandwidth did not increase (or for other reasons).

And in tasks that are well suited to the GPU architecture and have a pronounced vector nature (for example, matrix multiplication), Radeon shows performance relatively close to the theoretical peak and outperforms Fermi. Not to mention multi-core CPUs. Especially in problems with single precision numbers.

But Radeon has a smaller die area, lower heat dissipation and power consumption, higher yield and, accordingly, lower cost. And in 3D graphics tasks proper, Fermi's gain, if it exists at all, is much smaller than the difference in die area. This is largely because the Radeon computing architecture, with 16 compute units per miniprocessor, a wave front of 64 threads and vector VLIW instructions, is excellent for its main task: computing graphics shaders. For the vast majority of ordinary users, gaming performance and price are the priorities.

From a professional, scientific software perspective, the Radeon architecture provides the best price-performance, performance-per-watt, and absolute performance on tasks that are inherently well-matched to GPU architectures, allowing for parallelization and vectorization.

For example, in a fully parallel, easily vectorizable key brute-forcing task, Radeon is several times faster than GeForce and tens of times faster than a CPU.

This is consistent with the general concept of AMD Fusion, according to which GPUs should complement the CPU, and in the future be integrated into the CPU core itself, just as the math coprocessor was previously moved from a separate chip to the processor core (this happened twenty years ago, before the appearance of the first Pentium processors). The GPU will be an integrated graphics core and vector coprocessor for streaming tasks.

Radeon uses a clever technique of mixing instructions from different wave fronts when executed by function modules. This is easy to do since the instructions are completely independent. The principle is similar to the pipelined execution of independent instructions by modern CPUs. Apparently, this makes it possible to efficiently execute complex, multi-byte vector VLIW instructions. In a CPU, this requires a sophisticated scheduler to identify independent instructions or the use of Hyper-Threading technology, which also supplies the CPU with deliberately independent instructions from different threads.

Clock cycle        0          1          2          3          4          5          6          7
Wave front         front 0    front 1    front 0    front 1    front 0    front 1    front 0    front 1
VLIW module 0:     instr. 0   instr. 0   instr. 16  instr. 16  instr. 32  instr. 32  instr. 48  instr. 48
VLIW module 1:     instr. 1   instr. 1   instr. 17  instr. 17  instr. 33  instr. 33  instr. 49  instr. 49
VLIW module 2:     instr. 2   instr. 2   instr. 18  instr. 18  instr. 34  instr. 34  instr. 50  instr. 50
...
VLIW module 15:    instr. 15  instr. 15  instr. 31  instr. 31  instr. 47  instr. 47  instr. 63  instr. 63

(VLIW module k executes instructions k, k+16, k+32 and k+48 of the two alternating wave fronts.)

128 instructions from two wave fronts, each consisting of 64 operations, are executed by 16 VLIW modules over eight clock cycles. The interleaving means that each module actually has two clock cycles to execute an entire instruction, provided that on the second clock cycle it starts executing a new one in parallel. This probably helps to execute quickly a VLIW instruction like a1×a2+b1×b2+c1×c2+d1×d2, that is, to execute eight such instructions in eight clock cycles. (Formally, it works out to one per clock cycle.)

Nvidia apparently does not have such technology. And in the absence of VLIW, high performance using scalar instructions requires high frequency operation, which automatically increases heat dissipation and places high demands on the process (to force the circuit to operate at a higher frequency).

The disadvantage of Radeon from the GPU computing point of view is its great dislike of branching. GPUs in general do not favor branching, because of the instruction execution technique described above: instructions are issued at once to a group of threads sharing one program address. (By the way, this technique is called SIMT: Single Instruction - Multiple Threads, by analogy with SIMD, where one instruction performs one operation on different data.) However, Radeon dislikes branching particularly: this is caused by the larger size of its thread bundle. It is clear that if the program is not completely vectorizable, then the larger the warp or wave front, the worse, since when neighboring threads diverge in their program paths, more groups are formed that must be executed sequentially (serialized). Suppose the threads are completely scattered: with a warp size of 32 threads, the program will run 32 times slower, and with a size of 64, as on Radeon, 64 times slower.

This is a noticeable, but not the only, manifestation of this dislike. In Nvidia video cards, each functional module, also called a CUDA core, has a dedicated branch processing unit. In Radeon video cards, for 16 computing modules there are only two branch control units (they are separated from the domain of arithmetic units). So even simple processing of a conditional jump instruction, even when its result is the same for all threads in the wave front, takes additional time, and the speed sags.

AMD also produces CPUs. They believe that for programs with a large number of branches, the CPU is still better suited, while the GPU is intended for pure vector programs.

So Radeon provides lower overall programming efficiency, but better price/performance in many cases. In other words, there are fewer programs that can be efficiently (profitably) migrated from a CPU to a Radeon than programs that run efficiently on Fermi. But those that can be migrated will often run more efficiently on Radeon.

API for GPU computing

The technical specifications of Radeon by themselves look attractive, although there is no need to idealize and absolutize GPU computing. But no less important for performance is the software needed to develop and execute a GPU program: a compiler from a high-level language and a runtime, that is, a driver that manages the interaction between the part of the program running on the CPU and the GPU itself. It is even more important than in the case of a CPU: the CPU does not need a driver to manage data transfers, and from the compiler's point of view the GPU is more finicky. For example, the compiler must make do with a minimal number of registers to store intermediate results and must carefully inline function calls, again using a minimum of registers. After all, the fewer registers a thread uses, the more threads can be launched and the more fully the GPU can be loaded, hiding memory access time better.

And software support for Radeon products still lags behind the hardware. (Unlike the situation with Nvidia, where the release of hardware was delayed and the product came out in a stripped-down form.) Until quite recently, the OpenCL compiler produced by AMD was in beta status, with many flaws. It generated erroneous code too often, or refused to compile code from correct source, or itself reported an error and crashed. Only at the end of spring was a release with good performance published. It is not free of errors either, but there are noticeably fewer of them, and they tend to appear in edge cases when you try to program something on the verge of what is allowed. For example, take the uchar4 type, which defines a four-component 4-byte variable. This type is in the OpenCL specification, but it is not worth using it on Radeon, because the registers are 128-bit: the same four components, but 32-bit each. Such a uchar4 variable will still occupy an entire register, it will just require additional packing operations and access to the individual byte components. A compiler should not have any errors, but there are no compilers without flaws: even the Intel Compiler, after 11 versions, has compilation errors. The identified errors are fixed in the next release, which should come out closer to autumn.

But there are still many things that need improvement. For example, the standard Radeon GPU driver still does not support GPU computing using OpenCL. The user must download and install an additional special package.

But the most important thing is the absence of any function libraries. For double precision real numbers there is not even a sine, cosine or exponent. Well, this is not required for matrix addition and multiplication, but if you want to program something more complex, you need to write all the functions from scratch. Or wait for a new SDK release. ACML (AMD Core Math Library) for the Evergreen GPU family with support for basic matrix functions should be released soon.

At the moment, in the author's view, the most practical option for programming Radeon video cards is the DirectCompute 5.0 API, naturally with its limitation in mind: it targets only Windows 7 and Windows Vista. Microsoft has vast experience in building compilers and a direct interest in this area, so a fully functional release can be expected very soon. But DirectCompute is oriented toward the needs of interactive applications: compute something and immediately visualize the result, for example the flow of a liquid over a surface. This does not mean it cannot be used purely for computation, but that is not its natural purpose. Nor does Microsoft plan to add library functions to DirectCompute, precisely the ones AMD currently lacks. In other words, what can already be computed efficiently on Radeon today, fairly unsophisticated programs, can also be implemented with DirectCompute, which is much simpler than OpenCL and should be more stable. Plus, it is fully portable and will run on both Nvidia and AMD, so the program only needs to be compiled once, whereas the Nvidia and AMD OpenCL SDK implementations are not entirely compatible. (If you develop an OpenCL program on an AMD system using the AMD OpenCL SDK, it may not run as smoothly on Nvidia; you may have to recompile the same source with the Nvidia SDK, and, of course, vice versa.)

Then there is a lot of redundant functionality in OpenCL, since it is intended to be a universal programming language and API for a wide range of systems: GPU, CPU, Cell. So when you just need to write a program for a typical user system (processor plus video card), OpenCL does not feel especially productive, so to speak: every function takes ten parameters, nine of which must be set to zero, and each of those parameters has to be set by calling a special function that has parameters of its own.

And the most important current advantage of Direct Compute is that the user does not need to install a special package: everything that is needed is already in DirectX 11.

Problems of GPU computing development

If we take the sphere of personal computers, the situation is this: there are not many tasks that require a great deal of computing power and for which an ordinary dual-core processor is sorely insufficient. It is as if huge, voracious but clumsy monsters had crawled out of the sea onto land, where there is almost nothing to eat, while the land's original inhabitants shrink and learn to consume less, as always happens when natural resources are scarce. If the hunger for performance were the same now as it was 10-15 years ago, GPU computing would be a big hit. As it is, compatibility problems and the relative complexity of GPU programming come to the fore: it is better to write a program that runs on all systems than one that runs fast but only on a GPU.

The prospects for GPUs are somewhat better in terms of use in professional applications and the workstation sector, since there is a greater need for performance there. There are plugins for 3D editors with GPU support: for example, for rendering using ray tracing - not to be confused with regular GPU rendering! Something is also emerging for 2D and presentation editors, with faster creation of complex effects. Video processing programs are also gradually gaining GPU support. The above tasks, due to their parallel nature, fit well with the GPU architecture, but now a very large code base has been created, debugged and optimized for all the capabilities of the CPU, so it will take time for good GPU implementations to appear.

In this segment GPU weaknesses also show up, such as the limited amount of video memory, roughly 1 GB on typical cards. One of the main factors reducing the performance of GPU programs is the need to exchange data between CPU and GPU over a slow bus, and limited memory means more data has to be transferred. Here AMD's concept of combining GPU and CPU in one module looks promising: the high bandwidth of graphics memory can be traded for easy, low-latency access to shared memory. The high bandwidth of today's GDDR5 video memory matters much more to graphics programs than to most GPU computing programs. In general, shared GPU and CPU memory would simply and significantly widen the GPU's field of application, making it possible to use its computing capabilities in small subtasks of ordinary programs.

And GPUs are most in demand in the field of scientific computing. Several GPU-based supercomputers have already been built, which show very high results in the matrix operations test. Scientific problems are so diverse and numerous that there are always many that fit perfectly into the GPU architecture, for which the use of a GPU makes it easy to obtain high performance.

If one task had to be singled out among everything modern computers do, it would be computer graphics: depicting the world we live in. An architecture optimal for that purpose cannot be a bad one; the task is so important and fundamental that hardware specially designed for it is bound to be fairly universal and well suited to other tasks. Moreover, video cards keep evolving successfully.

A question that often comes up: why is there no GPU acceleration in Adobe Media Encoder CC? We found out that Adobe Media Encoder CC does use GPU acceleration, and we noted the nuances of using it. There is also a claim that GPU acceleration support was removed from Adobe Media Encoder CC. This is mistaken and stems from the fact that the main Adobe Premiere Pro CC application can now work without a certified, recommended video card, whereas to enable the GPU engine in Adobe Media Encoder CC the video card must be listed in the files cuda_supported_cards or opencl_supported_cards. With nVidia chipsets everything is simple: just take the name of the chipset and enter it in the cuda_supported_cards file. With AMD video cards, however, you need to enter not the chipset name but the code name of the core. So let's check in practice how to enable the GPU engine in Adobe Media Encoder CC on an ASUS N71JQ laptop with discrete ATI Mobility Radeon HD 5730 graphics. Technical data of the ATI Mobility Radeon HD 5730 graphics adapter as reported by the GPU-Z utility:

Launch Adobe Premiere Pro CC and turn on the engine: Mercury Playback Engine GPU Acceleration (OpenCL).

Three DSLR clips on the timeline, stacked one above the other; two of them create a picture-in-picture effect.

Press Ctrl+M, select the Mpeg2-DVD preset and remove the black bars on the sides with the Scale To Fill option. For the test without the GPU we also enable maximum quality: MRQ (Use Maximum Render Quality). Click the Export button. CPU load is up to 20% and RAM usage is 2.56 GB.


GPU load on the ATI Mobility Radeon HD 5730 chipset is 97%, with 352 MB of on-board video memory in use. The laptop was tested on battery power, so the graphics core/memory run at reduced frequencies: 375/810 MHz.

Total rendering time: 1 minute and 55 seconds (turning MRQ on or off when the GPU engine is used does not affect the final rendering time).
With the Use Maximum Render Quality checkbox checked, now click on the button: Queue.


Processor clock speed when running on battery power: 930 MHz.

Run AMEEncodingLog and look at the final rendering time: 5 minutes and 14 seconds.

We repeat the test, but with the Use Maximum Render Quality checkbox unchecked, click on the button: Queue.

Total rendering time: 1 minute and 17 seconds.

Now let's turn on the GPU engine in Adobe Media Encoder CC, launch the Adobe Premiere Pro CC program, press the key combination: Ctrl + F12, execute Console > Console View and enter GPUSniffer in the Command field, press Enter.


Select and copy the name shown under GPU Computation Info.

In the Adobe Premiere Pro CC program directory, open the opencl_supported_cards file, enter the code name of the core in alphabetical order, and press Ctrl+S to save.

Click on the button: Queue, and we get GPU acceleration of the rendering of an Adobe Premiere Pro CC project in Adobe Media Encoder CC.

Total time: 1 minute and 55 seconds.

We plug the laptop into mains power and repeat the tests. Queue, the MRQ checkbox unchecked, GPU engine off; RAM usage has grown a little:


Processor clock speed when running from mains power with the High Performance mode enabled: 1.6 GHz.

Total time: 46 seconds.

We turn on the Mercury Playback Engine GPU Acceleration (OpenCL) engine; on mains power the laptop video card runs at its base frequencies, and GPU load in Adobe Media Encoder CC reaches 95%.

The total rendering time dropped from 1 minute 55 seconds to 1 minute and 5 seconds.

*Adobe Media Encoder CC now uses the graphics processing unit (GPU) for rendering. Both the CUDA and OpenCL standards are supported. In Adobe Media Encoder CC, the GPU engine is used for the following rendering operations:
- Rescaling (from high definition to standard and vice versa).
- Timecode filter.
- Pixel format conversions.
- Deinterlacing.
If you render a Premiere Pro project, AME uses the GPU rendering settings specified for that project and takes advantage of all the GPU rendering capabilities available in Premiere Pro. AME's own projects render with a limited set of GPU rendering capabilities. If the sequence is rendered via native import, the GPU setting from AME is applied and the project setting is ignored; in this case all of Premiere Pro's GPU rendering capabilities are used directly in AME. If the project contains third-party VSTs, the project's GPU setting is used and the sequence is encoded through PProHeadless, as in earlier versions of AME. If Enable Native Premiere Pro Sequence Import is unchecked, PProHeadless and the project's GPU setting are always used.


You can never have too many cores...

Modern GPUs are monstrous, fast beasts capable of chewing gigabytes of data. However, man is cunning and, no matter how much computing power grows, he comes up with more and more complex problems, so the moment comes when we sadly have to admit that optimization is needed 🙁

This article describes the basic concepts needed to find your way around the theory of GPU optimization, together with the basic rules that let you consult that theory less often.

Reasons why GPUs are effective for working with large amounts of data that require processing:

  • they have great capabilities for parallel execution of tasks (many, many processors)
  • high memory bandwidth

Memory bandwidth is how much information (bits or gigabytes) can be transferred per unit of time (a second or a processor cycle).

One of the optimization goals is to make maximum use of this bandwidth, i.e. to raise the effective throughput of the program (ideally it should equal the memory bandwidth).

To improve bandwidth utilization:

  • increase the amount of useful data per transfer, using the bandwidth to the full (for example, have each thread work with float4)
  • reduce latency – delay between operations

Latency is the interval between the moment the controller requests a particular memory cell and the moment the data becomes available to the processor for executing instructions. We cannot influence this delay itself in any way; the limitation sits at the hardware level. But it is precisely thanks to this delay that the processor can service several threads at once: while thread A is waiting for the memory it requested, thread B can compute something and thread C can wait for its own data to arrive.

How to reduce the cost of latency when synchronization is used:

  • reduce the number of threads in a block
  • increase the number of block groups

Full use of GPU resources – GPU Occupancy

In highbrow conversations about optimization the term gpu occupancy, or kernel occupancy, often comes up; it reflects how fully the video card's resources are used. Note separately that even occupying all the resources does not mean you are using them effectively.

The computing power of a GPU is hundreds of computation-hungry processors, and when writing a program (a kernel) the burden of distributing the load across them falls on the programmer. A mistake can leave much of this precious capacity sitting idle. Now I will explain why; we'll have to start from afar.

Recall that a warp (warp in NVidia terminology, wavefront in AMD terminology) is a set of threads that execute the same kernel function on the processor at the same time. Threads that the programmer groups into blocks are split into warps by the thread scheduler (separately for each multiprocessor): while one warp is executing, another waits for its memory requests to be served, and so on. If some threads of a warp are still computing while others have already done everything they could, the computing resource is being used inefficiently, which is popularly called idle capacity.

Each synchronization point, each branch of logic can generate such an idle situation. The maximum divergence (branching of execution logic) depends on the size of the warp. For NVidia GPUs it is 32, for AMD it is 64.

To reduce multiprocessor downtime during warp execution:

  • minimize barrier waiting time
  • minimize the divergence of execution logic in the kernel function

To effectively solve this problem, it makes sense to understand how warps are formed (for the case with several dimensions). In fact, the order is simple - first in X, then in Y and, lastly, in Z.

Example: a kernel is launched with blocks of size 64x16. Threads are split into warps in X, Y, Z order, i.e. the first 64 threads (one full row in X) form two warps, then the next row forms the next two, and so on.

If the kernel runs with 16x64 blocks, the first and second rows of 16 threads go into the first warp, the third and fourth rows into the second warp, and so on.
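
Below is a minimal sketch in CUDA of this layout; the kernel and variable names are illustrative, and only the 64x16 block shape is taken from the example above. Each thread computes its linear index (X first, then Y, then Z) and from it its warp number.

#include <cstdio>

// Sketch: how a thread's position inside a block maps to a linear index and
// therefore to a warp. Linearization goes X first, then Y, then Z; every
// consecutive group of warpSize (32 on NVidia) threads forms one warp.
__global__ void warp_layout_demo()
{
    int linear = threadIdx.x
               + threadIdx.y * blockDim.x
               + threadIdx.z * blockDim.x * blockDim.y;

    int warp_id = linear / warpSize;   // which warp inside the block
    int lane_id = linear % warpSize;   // position inside that warp

    // Print one representative thread per warp to keep the output short.
    if (lane_id == 0)
        printf("thread (%d,%d) -> linear %d, warp %d\n",
               threadIdx.x, threadIdx.y, linear, warp_id);
}

int main()
{
    // 64x16 block from the example above: each row of 64 threads is two warps.
    warp_layout_demo<<<1, dim3(64, 16, 1)>>>();
    cudaDeviceSynchronize();
    return 0;
}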

How to reduce divergence (and remember, branching is not always the cause of a critical performance loss):

  • when adjacent threads follow different execution paths, i.e. there are many conditions and jumps along them, look for ways to restructure the code (see the sketch after this list)
  • look for an unbalanced load across threads and decisively remove it (this is when, not only are there conditions, but because of these conditions the first thread always computes something while the fifth never meets the condition and sits idle)
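
A small CUDA sketch of the first point, under the assumption that the data can be regrouped so that whole warps take the same path; the kernels and the arithmetic inside them are purely illustrative.

// In divergent_kernel the branch splits neighbouring threads of one warp onto
// different paths (odd/even), so both paths are executed one after the other.
// In uniform_kernel the condition depends on the warp index, so all 32 threads
// of a warp take the same branch and there is no divergence. Restructuring
// like this usually also means reordering the data accordingly.
__global__ void divergent_kernel(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)                  // neighbouring threads diverge
        out[i] = out[i] * 2.0f;
    else
        out[i] = out[i] + 1.0f;
}

__global__ void uniform_kernel(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / warpSize) % 2 == 0)     // the whole warp takes the same branch
        out[i] = out[i] * 2.0f;
    else
        out[i] = out[i] + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    divergent_kernel<<<n / 256, 256>>>(d);
    uniform_kernel<<<n / 256, 256>>>(d);
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}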

How to make the most of your GPU resources

GPU resources, unfortunately, also have their limitations. And, strictly speaking, before launching the kernel function, it makes sense to determine the limits and take these limits into account when distributing the load. Why is it important?

Video cards have limits on the total number of threads one multiprocessor can execute, the maximum number of threads in one block, the maximum number of warps on one processor, limits on the various kinds of memory, and so on. All of this information can be obtained either programmatically, through the appropriate API, or beforehand with utilities from the SDK (the deviceQuery utility for NVidia devices, CLInfo for AMD video cards).

General practice:

  • the number of thread blocks/workgroups must be a multiple of the number of stream processors
  • block/workgroup size must be a multiple of the warp size

It should be borne in mind that the absolute minimum is 3-4 warps/wavefronts in flight on each processor at once; wise guides advise planning for at least seven wavefronts. At the same time, do not forget the hardware restrictions!
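
A short CUDA sketch of this practice: the hardware limits mentioned above are queried through the runtime API and a launch configuration is derived from them. The multipliers (four warps per block, eight blocks per multiprocessor) are arbitrary starting points for illustration, not vendor recommendations.

#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("multiprocessors      : %d\n", prop.multiProcessorCount);
    printf("warp size            : %d\n", prop.warpSize);
    printf("max threads / block  : %d\n", prop.maxThreadsPerBlock);
    printf("max threads / SM     : %d\n", prop.maxThreadsPerMultiProcessor);
    printf("shared memory / block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("registers / block    : %d\n", prop.regsPerBlock);

    // Block size as a multiple of the warp size, grid size as a multiple of
    // the number of multiprocessors.
    int block = prop.warpSize * 4;              // e.g. 128 threads
    int grid  = prop.multiProcessorCount * 8;   // several blocks per SM

    printf("launch configuration : <<<%d, %d>>>\n", grid, block);
    return 0;
}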

Keeping all these details in your head quickly gets tedious, so to calculate gpu occupancy NVidia offered an unexpected tool: an Excel(!) calculator full of macros. You enter the maximum number of threads per SM, the number of registers and the amount of shared memory available on the stream processor, plus the launch parameters of your function, and it shows the efficiency of resource use as a percentage (and you tear your hair out realizing that you are short of registers to keep every core busy).

Usage information:
http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#calculating-occupancy
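
Newer CUDA toolkits (6.5 and later, if available to you) also expose the same calculation as a runtime call, so the spreadsheet is not the only route. A sketch, with a throwaway kernel used only to have something to measure:

#include <cstdio>

__global__ void dummy_kernel(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = (float)i;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int block = 256;
    int active_blocks = 0;

    // How many blocks of this size can be resident on one multiprocessor,
    // given the kernel's register and shared-memory usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &active_blocks, dummy_kernel, block, 0 /* dynamic shared memory */);

    float occupancy = (float)(active_blocks * block)
                    / prop.maxThreadsPerMultiProcessor;
    printf("active blocks per SM: %d, occupancy: %.0f%%\n",
           active_blocks, occupancy * 100.0f);
    return 0;
}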

GPU and memory operations

Video cards are optimized for 128-bit memory operations, i.e. ideally each memory access should change four 4-byte values at once. The main trouble for the programmer is that modern GPU compilers do not know how to optimize this for you: it has to be done in the kernel code itself, and on average it brings only a fraction of a percent of extra performance. The frequency of memory requests has a far greater impact on performance.

The problem is this: each request returns a chunk of data whose size is a multiple of 128 bits, and each thread uses only a quarter of it (in the case of an ordinary four-byte variable). When adjacent threads work simultaneously with data located sequentially in memory, the total number of memory accesses drops. This is called coalesced access (good, for both reads and writes), and with the right organization of the code it can noticeably improve performance, whereas strided access to a contiguous chunk of memory is bad. When laying out your kernel, remember: contiguous access within the elements of one row of memory is efficient; working with the elements of a column is no longer so. Want more detail? Search for “memory coalescing techniques”.
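
A CUDA sketch of the difference; the kernels just copy data and exist only to contrast the access patterns (their names are invented). In copy_coalesced thread k of a warp reads element k, so the 32 reads fall into one or two wide transactions; in copy_strided neighbouring threads are "stride" elements apart and almost every read becomes a separate transaction; copy_float4 additionally loads 128 bits per thread.

__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];                    // contiguous reads: good
}

__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(i * stride) % n];     // scattered reads: bad
}

__global__ void copy_float4(const float4 *in, float4 *out, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4)
        out[i] = in[i];                    // 128 bits per thread
}

int main()
{
    const int n = 1 << 22;
    float *in, *out;
    cudaMalloc((void **)&in,  n * sizeof(float));
    cudaMalloc((void **)&out, n * sizeof(float));

    copy_coalesced<<<n / 256, 256>>>(in, out, n);
    copy_strided  <<<n / 256, 256>>>(in, out, n, 32);
    copy_float4   <<<n / 4 / 256, 256>>>((float4 *)in, (float4 *)out, n / 4);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}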

The leading position in the “bottleneck” category is held by another memory operation: copying data from host memory to the GPU. The copy does not happen from just anywhere, but from a memory area specially allocated by the driver and the system: on a request to copy data, the system first copies it there and only then uploads it to the GPU. The transfer speed is limited by the bandwidth of the PCI Express xN bus (where N is the number of lanes) over which modern video cards communicate with the host.

This extra copy through slow host memory is, however, sometimes an unjustified cost. The solution is to use so-called pinned memory: a specially marked memory area that the operating system is not allowed to touch (it cannot dump it into swap or move it at its discretion). Data transfer from such memory to the video card is carried out without the operating system's involvement, asynchronously, via DMA (direct memory access).
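
A minimal CUDA sketch of the contrast, with invented buffer names: a copy from an ordinary malloc buffer goes through the driver's staging area, while cudaMallocHost returns page-locked (pinned) memory that the GPU can read directly over DMA.

#include <cstdlib>

int main()
{
    const size_t bytes = 64 << 20;                        // 64 MB

    float *pageable = (float *)malloc(bytes);             // ordinary host memory
    float *pinned   = nullptr;
    cudaMallocHost((void **)&pinned, bytes);              // page-locked memory

    float *device = nullptr;
    cudaMalloc((void **)&device, bytes);

    cudaMemcpy(device, pageable, bytes, cudaMemcpyHostToDevice); // staged copy
    cudaMemcpy(device, pinned,   bytes, cudaMemcpyHostToDevice); // direct DMA

    cudaFree(device);
    cudaFreeHost(pinned);
    free(pageable);
    return 0;
}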

And finally, a little more about memory. Shared memory on a multiprocessor is usually organized as memory banks holding 32-bit words of data. The number of banks, by good tradition, varies from one GPU generation to the next: 16 or 32. If each thread accesses a separate bank, everything is fine. Otherwise, several read/write requests hit the same bank and we get a conflict (shared memory bank conflict). Such conflicting accesses are serialized and therefore execute sequentially rather than in parallel. If all threads access the same bank, a broadcast response is used and there is no conflict. There are several techniques for dealing with access conflicts effectively; search for descriptions of how to avoid shared memory bank conflicts.
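
A classic CUDA illustration of such a conflict and the usual fix: a tiled matrix transpose. The tile size and the transpose itself are just an example; the only point here is the extra padding column.

#define TILE 32

// Column accesses into a 32x32 shared tile hit the same bank for every thread
// of a warp (stride of 32 words); padding each row by one element shifts the
// rows onto different banks and removes the conflict.
__global__ void transpose_conflicts(const float *in, float *out, int n)
{
    __shared__ float tile[TILE][TILE];           // column reads conflict

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];   // 32-way conflict
}

__global__ void transpose_padded(const float *in, float *out, int n)
{
    __shared__ float tile[TILE][TILE + 1];       // +1 column removes conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];   // conflict-free
}

int main()
{
    const int n = 1024;                          // multiple of TILE
    float *in, *out;
    cudaMalloc((void **)&in,  n * n * sizeof(float));
    cudaMalloc((void **)&out, n * n * sizeof(float));

    dim3 block(TILE, TILE);
    dim3 grid(n / TILE, n / TILE);
    transpose_conflicts<<<grid, block>>>(in, out, n);
    transpose_padded<<<grid, block>>>(in, out, n);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}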

How to make math operations even faster? Remember that:

  • double-precision calculations are a heavy operation: fp64 is much more expensive than fp32
  • constants such as 3.14 in the code are interpreted as fp64 by default unless the suffix is given explicitly (3.14f); see the sketch after this list
  • to optimize the mathematics, it is worth checking the guides for suitable compiler flags
  • vendors include functions in their SDKs that exploit device features to gain performance (often at the expense of portability)
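
A CUDA sketch of the first points; the kernel is illustrative, and the __sinf/__expf intrinsics and the --use_fast_math flag are nvcc-specific (less accurate but much cheaper than the standard calls).

__global__ void math_demo(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = out[i];

    // 3.14 would be a double constant and drag the expression into fp64;
    // 3.14f keeps everything in fp32.
    float a = x * 3.14f;

    // Fast hardware intrinsics instead of the full-precision sinf()/expf().
    // Compiling with nvcc --use_fast_math rewrites standard calls this way
    // globally.
    out[i] = __sinf(a) + __expf(-a);
}

int main()
{
    const int n = 1 << 20;
    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    math_demo<<<n / 256, 256>>>(d, n);
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}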

CUDA developers should pay close attention to the concept of a cuda stream, which makes it possible to run several kernel functions on one device at once, or to overlap asynchronous copying of data from the host to the device with kernel execution. OpenCL does not yet provide such functionality :)
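
A sketch of that overlap in CUDA, with invented names: the data are processed in two halves, each in its own stream, so the copy of one half overlaps with the computation on the other. Pinned host memory is required for the copies to be truly asynchronous.

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 22;
    const int half = n / 2;

    float *host, *dev;
    cudaMallocHost((void **)&host, n * sizeof(float));   // pinned buffer
    cudaMalloc((void **)&dev, n * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int k = 0; k < 2; ++k) {
        float *h = host + k * half;
        float *d = dev  + k * half;
        // Copy in, process, copy out - all queued in stream k.
        cudaMemcpyAsync(d, h, half * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);
        scale<<<half / 256, 256, 0, s[k]>>>(d, half);
        cudaMemcpyAsync(h, d, half * sizeof(float),
                        cudaMemcpyDeviceToHost, s[k]);
    }

    cudaDeviceSynchronize();        // wait for both streams to finish

    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}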

Tools for profiling:

NVidia Visual Profiler is an interesting utility that can analyze both CUDA and OpenCL kernels.

P.S. As a more extensive guide to optimization, I can recommend googling the various best practices guides for OpenCL and CUDA.


Today, news about the use of GPUs for general-purpose computing can be heard on every corner. Words such as CUDA, Stream and OpenCL have become almost the most-cited words on the IT internet in just two years. However, not everyone knows what these words mean or what the technologies behind them are. And to Linux users, who are used to being left out, all of this looks like a dark forest.

Birth of GPGPU

We are all used to thinking that the only component of a computer capable of executing any code it is told to do is the central processor. For a long time, almost all mass-produced PCs were equipped with a single processor that handled all conceivable calculations, including the operating system code, all our software and viruses.

Later, multi-core processors and multiprocessor systems appeared, which contained several such components. This allowed machines to perform multiple tasks simultaneously, and the overall (theoretical) system performance increased exactly as much as the number of cores installed in the machine. However, it turned out that producing and designing multi-core processors was too difficult and expensive.

Each core had to house a full-fledged processor of the complex and intricate x86 architecture, with its own (rather large) cache, instruction pipeline, SSE blocks, many optimization blocks, and so on. The process of adding cores therefore slowed down considerably, and the people in white university lab coats, for whom two or four cores were clearly not enough, found a way to harness other computing power for their scientific calculations, which was available in abundance on the video card (as a result the BrookGPU tool even appeared, emulating an additional processor through DirectX and OpenGL calls).

Graphics processors, devoid of many of the disadvantages of the central processor, turned out to be an excellent and very fast calculating machine, and very soon GPU manufacturers themselves began to take a closer look at the developments of scientific minds (and nVidia actually hired most of the researchers to work for them). As a result, nVidia CUDA technology appeared, which defines an interface with which it became possible to transfer the calculation of complex algorithms to the shoulders of the GPU without any crutches. Later it was followed by ATi (AMD) with its own version of the technology called Close to Metal (now Stream), and very soon a standard version from Apple appeared, called OpenCL.

Is the GPU everything?

Despite all its advantages, the GPGPU technique has several problems. The first is its very narrow scope of application. GPUs have gone far ahead of the central processor in raw computing power and total number of cores (video cards carry computing units of more than a hundred cores), but such density is achieved by simplifying the design of the chip itself as much as possible.

In essence, the GPU's main task comes down to mathematical calculations with simple algorithms that take not very large volumes of predictable data as input. For this reason, GPU cores have a very simple design, tiny caches and a modest instruction set, which ultimately makes them cheap to produce and possible to place very densely on the chip. GPUs are like a Chinese factory with thousands of workers: they do simple things quite well (and, most importantly, quickly and cheaply), but if you entrust them with assembling an airplane, the result will be, at best, a hang glider.

Therefore, the first limitation of GPUs is their focus on fast mathematical calculations, which limits the scope of application of GPUs to assisting in the operation of multimedia applications, as well as any programs involved in complex data processing (for example, archivers or encryption systems, as well as software involved in fluorescence microscopy, molecular dynamics, electrostatics and other things of little interest to Linux users).

The second problem with GPGPU is that not every algorithm can be adapted for execution on the GPU. Individual GPU cores are quite slow, and their power only shows when they work together. This means an algorithm will only be as effective as the programmer's ability to parallelize it, and in most cases only a good mathematician can handle that work, and there are very few of those among software developers.

And third, GPUs work with memory installed on the graphics card itself, so each time the GPU is used there will be two additional copy operations: input data from the RAM of the application itself and output data from GRAM back to application memory. As you can imagine, this can negate any benefit in application runtime (as is the case with the FlacCL tool, which we'll look at later).

But that's not all. Despite the existence of a generally accepted standard in the form of OpenCL, many programmers still prefer vendor-specific implementations of the GPGPU technique. CUDA proved especially popular: although it offers a more flexible programming interface (incidentally, OpenCL in nVidia drivers is implemented on top of CUDA), it tightly ties an application to one vendor's video cards.

KGPU or Linux kernel accelerated by GPU

Researchers at the University of Utah have developed KGPU, a system that allows some Linux kernel functions to be executed on a GPU using the CUDA framework. It uses a modified Linux kernel and a special daemon that runs in user space, listens for kernel requests and passes them to the video card driver through the CUDA library. Interestingly, despite the significant overhead such an architecture creates, the authors of KGPU managed to build an implementation of the AES algorithm that speeds up encryption in the eCryptfs file system six-fold.

What is there now?

Because of its youth, and also because of the problems described above, GPGPU has never become a truly widespread technology, but useful software that uses its capabilities does exist (albeit in modest quantities). Among the first to appear were crackers of various hashes, whose algorithms are very easy to parallelize.

Multimedia applications were also born, such as the FlacCL encoder, which transcodes audio tracks into the FLAC format. Some pre-existing applications have also acquired GPGPU support, the most notable being ImageMagick, which can now offload part of its work to the GPU using OpenCL. There are also projects to port archivers and other data-compression software to CUDA/OpenCL (Unixoids are not fond of ATi). We will look at the most interesting of these projects in the following sections of the article, but for now let's figure out what we need to get it all up and running stably.

GPUs have long surpassed x86 processors in performance

· Secondly, the latest proprietary drivers for the video card must be installed in the system; they will provide support for both GPGPU technologies native to the card and open OpenCL.

· And thirdly, since distribution developers have not yet begun to distribute application packages with GPGPU support, we will have to build applications ourselves, and for this we need official SDKs from manufacturers: CUDA Toolkit or ATI Stream SDK. They contain the header files and libraries necessary for building applications.

Install CUDA Toolkit

Follow the link above and download the CUDA Toolkit for Linux (you can choose from several versions, for the Fedora, RHEL, Ubuntu and SUSE distributions, there are versions for both x86 and x86_64 architectures). In addition, you also need to download driver kits for developers there (Developer Drivers for Linux, they are first on the list).

Launch the SDK installer:

$ sudo sh cudatoolkit_4.0.17_linux_64_ubuntu10.10.run

When the installation is complete, we proceed to installing the drivers. To do this, shut down the X server:

# sudo /etc/init.d/gdm stop

Open the console and run the driver installer:

$ sudo sh devdriver_4.0_linux_64_270.41.19.run

After the installation is complete, start X again:

$ sudo /etc/init.d/gdm start

In order for applications to be able to work with CUDA/OpenCL, we set the path to the directory with CUDA libraries in the LD_LIBRARY_PATH variable:

$ export LD_LIBRARY_PATH=/usr/local/cuda/lib64

Or, if you installed the 32-bit version:

$ export LD_LIBRARY_PATH=/usr/local/cuda/lib32

You also need to specify the path to the CUDA header files so that the compiler finds them at the application build stage:

$ export C_INCLUDE_PATH=/usr/local/cuda/include

That's it, now you can start building CUDA/OpenCL software.

Install ATI Stream SDK

The Stream SDK does not require installation: the archive downloaded from the AMD website can simply be unpacked into any directory (the best choice is /opt), and the path to it written into the same LD_LIBRARY_PATH variable:

$ wget http://goo.gl/CNCNo

$ sudo tar -xzf ~/AMD-APP-SDK-v2.4-lnx64.tgz -C /opt

$ export LD_LIBRARY_PATH=/opt/AMD-APP-SDK-v2.4-lnx64/lib/x86_64/

$ export C_INCLUDE_PATH=/opt/AMD-APP-SDK-v2.4-lnx64/include/

As with the CUDA Toolkit, x86_64 must be replaced with x86 on 32-bit systems. Now go to the root directory and unpack the icd-registration.tgz archive (this is a kind of free license key):

$ sudo tar -xzf /opt/AMD-APP-SDK-v2.4-lnx64/icd-registration.tgz -C /

We check the correct installation/operation of the package using the clinfo tool:

$ /opt/AMD-APP-SDK-v2.4-lnx64/bin/x86_64/clinfo

ImageMagick and OpenCL

OpenCL support has been available in ImageMagick for quite some time, but it is not enabled by default in any distribution. Therefore, we will have to compile IM ourselves from source. There is nothing complicated about this, everything you need is already in the SDK, so the assembly does not require installing any additional libraries from nVidia or AMD. So, download/unpack the archive with the sources:

$ wget http://goo.gl/F6VYV

$ tar -xjf ImageMagick-6.7.0-0.tar.bz2

$ cd ImageMagick-6.7.0-0

$ sudo apt-get install build-essential

We launch the configure script and grep its output for OpenCL support:

$ LDFLAGS=-L$LD_LIBRARY_PATH ./configure | grep -e cl.h -e OpenCL

The correct output from the command should look something like this:

checking CL/cl.h usability... yes

checking CL/cl.h presence... yes

checking for CL/cl.h... yes

checking OpenCL/cl.h usability... no

checking OpenCL/cl.h presence... no

checking for OpenCL/cl.h... no

checking for OpenCL library... -lOpenCL

The word "yes" should appear either in the first three lines or in the second three (or in both sets at once). If this is not the case, the C_INCLUDE_PATH variable was most likely not initialized correctly. If the word "no" appears in the last line, the problem is in the LD_LIBRARY_PATH variable. If everything is OK, start the build/installation process:

$ sudo make install clean

Let's check that ImageMagick was indeed compiled with OpenCL support:

$ /usr/local/bin/convert -version | grep Features

Features: OpenMP OpenCL

Now let's measure the resulting speed gain. The ImageMagick developers recommend using the convolve filter for this:

$ time /usr/bin/convert image.jpg -convolve "-1, -1, -1, -1, 9, -1, -1, -1, -1" image2.jpg

$ time /usr/local/bin/convert image.jpg -convolve "-1, -1, -1, -1, 9, -1, -1, -1, -1" image2.jpg

Some other operations, such as resizing, should now also work much faster, but you shouldn't expect ImageMagick to start processing graphics at breakneck speed. So far, a very small part of the package has been optimized using OpenCL.

FlacCL (Flacuda)

FlacCL is an encoder of audio files into the FLAC format that uses the capabilities of OpenCL in its work. It is included in the CUETools package for Windows, but thanks to Mono it can also be used on Linux. To obtain the archive with the encoder, run the following commands:

$ mkdir flaccl && cd flaccl

$ wget www.cuetools.net/install/flaccl03.rar

$ sudo apt-get install unrar mono

$ unrar x flaccl03.rar

So that the program can find the OpenCL library, we make a symbolic link:

$ ln -s $LD_LIBRARY_PATH/libOpenCL.so libopencl.so

Now let's run the encoder:

$ mono CUETools.FLACCL.cmd.exe music.wav

If the error message "Error: Requested compile size is bigger than the required workgroup size of 32" appears on the screen, the video card in the system is too weak, and the number of cores involved should be reduced to the stated number using the '--group-size XX' flag, where XX is the required number of cores.

I’ll say right away that due to the long initialization time of OpenCL, noticeable gains can only be obtained on sufficiently long tracks. FlacCL processes short audio files at almost the same speed as its traditional version.

oclHashcat or quick brute force

As I already said, developers of various crackers and brute-force password systems were among the first to add GPGPU support to their products. For them the new technology became a real holy grail: it made it easy to move naturally parallel code onto the shoulders of fast GPU processors. It is therefore not surprising that there are now dozens of different implementations of such programs. But in this article I will talk about only one of them: oclHashcat.

oclHashcat is a password cracker that can recover passwords from their hashes at extremely high speed, harnessing the power of the GPU through OpenCL. According to the measurements published on the project website, the rate of MD5 password guessing on an nVidia GTX580 reaches 15,800 million combinations per second, which lets oclHashcat find an eight-character password of average complexity in just 9 minutes.

The program supports OpenCL and CUDA and the MD5, md5($pass.$salt), md5(md5($pass)), vBulletin < v3.8.5, SHA1, sha1($pass.$salt), MySQL hash, MD4, NTLM, Domain Cached Credentials and SHA256 algorithms, and it supports distributed password cracking using the power of several machines.

$ 7z x oclHashcat-0.25.7z

$ cd oclHashcat-0.25

And run the program (we’ll use a sample list of hashes and a sample dictionary):

$ ./oclHashcat64.bin example.hash ?l?l?l?l example.dict

oclHashcat will display the text of the user agreement, which you must accept by typing "YES". After this the search begins; its progress can be checked by pressing . To pause the process, press .

To resume it, press . You can also use straight brute force (for example, from aaaaaaaa to zzzzzzzz):

$ ./oclHashcat64.bin hash.txt ?l?l?l?l ?l?l?l?l

Various modifications of the dictionary and of the brute-force mask, as well as combinations of them, are also possible (you can read about this in the file docs/examples.txt). In my case, going through the entire dictionary took 11 minutes, while the straight brute force (from aaaaaaaa to zzzzzzzz) lasted about 40 minutes. The average GPU speed (RV710 chip) was 88.3 million combinations per second.

Conclusions

Despite many different limitations and the complexity of software development, GPGPU is the future of high-performance desktop computers. But the most important thing is that you can use the capabilities of this technology right now, and this applies not only to Windows machines, but also to Linux.



