-
GPU "Task Manager"
Does anybody know if there's something like a "Task Manager" for nVIDIA GPUs so that I can monitor memory usage as something is running on the video card?
(I put it under the benchmarking section because I'm actually running a MATLAB code right now that I am using to benchmark between my CPU and the GPU.)
My CPU benchmark is showing a fairly consistent result, about 2.5e6 iterations/second. But for my GTX660 Superclocked, when I go from 6e3 iterations to 6e4 iterations, the (effective) computational rate drops by HALF, and I'm trying to figure out why.
Suggestions would be greatly appreciated.
-
-
It'll show the memory usage? Would it show it "per task" like the Windows Task Manager does, or is it just the total aggregate?
-
It doesn't look like GPU-Z will list it out by task or process, but it does have a data-logging capability that is helping me dig a LOT deeper into the issue.
-
So, for those that might be interested, here's the MATLAB code:
Code:
% for j=1:4
tic;
reset(gpuDevice(1)); clear all; % clean up
format long; % show double precision
R_i_gpu=gpuArray(6); % initial radius
dL_gpu=gpuArray(1.e-5); % delta length
n_gpu=R_i_gpu/dL_gpu; % calculate the number of steps or intervals
theta_i_gpu=dL_gpu/R_i_gpu; % calculate the initial theta
R_final_gpu=gpuArray(2); % final radius
dR_gpu=R_final_gpu-R_i_gpu; % calculate delta radius
d_theta_gpu=gpuArray(3*pi/2); % angle that the radius varies over (i.e. (R_final-R_initial)/d_theta))
dR_d_theta_gpu=dR_gpu/d_theta_gpu; % calculate the rate of change of radius with respect to theta
% Ri=R_i-(dR_d_theta*theta_i)
% thetai=dL/Ri
Ri_gpu=R_i_gpu; % initialise radius
thetai_gpu=theta_i_gpu; % initialise theta
n=gather(n_gpu);
itime=toc
tic;
for i=1:n-1 % for loop
    Ri_gpu=Ri_gpu+(dR_d_theta_gpu*thetai_gpu); % update radius
    thetai_gpu=dL_gpu/Ri_gpu; % update theta
    R_gpu(i)=Ri_gpu; % put the radius into a column array
    theta_gpu(i)=thetai_gpu; % put the theta into a column array
    % A_gpu=[R_gpu ;theta_gpu]'; % create the radius/theta array
end
rtime=toc
%% tic
% R_gpu=[R_i_gpu R_gpu]'; % horizontally concatenate the initial radius with the radius array that's calculated for each of the interval steps
theta_gpu=[theta_i_gpu theta_gpu]'; % hcat the initial theta with the theta array
theta_sum_gpu=sum(theta_gpu); % sum the theta (in radians)
theta_sum_deg_gpu=theta_sum_gpu*360/(2*pi); % convert the theta sum to degrees
% ptime=toc
% end
(this is the GPU version of it, I also have a CPU-based version of it, so instead of all those gpuArrays, it's just "regular" arrays.)
The code answers this question: if you have part of a circle (like a pulley that will only rotate 270 degrees) and the radius of the pulley changes, how much does the pulley have to rotate in order to move the end that you're pulling on a certain, known distance?
It's an iterative solution where I break the total distance I want it to move into a whole bunch of smaller pieces, calculate the angle required for each piece, and then add all of those results together.
After I wrote the CPU-based code, I converted it over into GPU-based code to see if I can make it run faster still.
It is a HIGHLY single-threaded, single-process computation where the result of the current step is highly dependent on the result from the previous step. (So where a lot of the CPU/GPU comparisons/benchmarks deal with matrices and other computations that can be highly parallelized, this is a highly linear solution that CAN'T be parallelized, because of the intrinsic dependency.)
So it made for a REALLY interesting benchmarking code.
And here are the resulting plots from the GPU-Z data logger.
http://imageshack.com/a/img843/132/4l4q.png
This is a detail of the first 200 data points or so from the data logger showing the GPU memory usage.
http://imageshack.com/a/img838/5184/udmv.png
This is a detail of the first 200 data points or so from the data logger showing the GPU load.
http://imageshack.com/a/img842/2227/a6v4.png
This is the total overall GPU memory usage where the small pieces of length that the problem is broken up into ranged from 1e-3 to 1e-5 inches.
http://imageshack.com/a/img841/4782/lsng.png
This is the total overall GPU load for the same comparison.
So this question came up because as I was running both the CPU and the GPU versions of this program, the CPU was showing to be SUBSTANTIALLY faster than the GPU, and conventional wisdom tells us that shouldn't be the case. So I started digging into why that might be. I've commented out sections of the code that I don't really need right now, or suppressed their output, and now I am able to get a much more linear relationship between the time it takes to solve and the size of the problem (the CPU shows pretty much a 1:1 relationship on the log-log plot of size vs. time). But the GPU plot showed some variability, the effective computational rate was also showing a slowdown on the GPU side, and I couldn't really explain why.
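One thing I want to test next (just a guess at this point, NOT a verified cause): `R_gpu(i)=` and `theta_gpu(i)=` grow the result arrays one element at a time inside the loop, and regrowing a gpuArray every iteration could be expensive. A minimal sketch of preallocating them instead:

```matlab
% Sketch (untested guess): preallocate the result arrays on the GPU before
% the loop so they aren't regrown on every iteration. The arrays are rows
% here to match how the loop above builds them.
n = 6/1e-5;                            % number of steps, as in the code above
R_gpu = zeros(1, n-1, 'gpuArray');     % preallocated radius array
theta_gpu = zeros(1, n-1, 'gpuArray'); % preallocated theta array
% (in older MATLAB releases the syntax is gpuArray.zeros(1, n-1))
```

If the slowdown at larger problem sizes shrinks after this change, the regrowth was at least part of the story.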
Couple of other notes about this analysis/code:
a) The hardware is as follows:
3930K OC'd to 4.5 GHz, 64 GB of RAM, 500 GB 7.2krpm SATA 6 Gbps drive, 240 GB Intel 530 Series SSD SATA 6 Gbps, EVGA GTX660 Superclocked (2 GB).
b) The code IS running in double precision, and I know from previous testing that in double precision the GTX660 Superclocked doesn't have as much of an advantage over the CPU (I think it worked out to be only about 15% faster on average) when solving various sizes of Ax=B.
What's weird is the spiky behaviour of the memory usage. I wanted to plot that because, as I am building the column arrays, I wanted to make sure that I wasn't exceeding the 2 GB of RAM on the video card, which would cause it to swap between the GPU RAM and the main system RAM (which doesn't seem to be happening here). But what is also weird is that the GPU load goes up as the run progresses (and, again, the conventional way of thinking says that if you're running a linear iterative code, the GPU load should max out immediately and stay maxed out for the duration of the run).
But that is not the case here, so it is making me think that there's something funky going on between MATLAB and the MATLAB GPU/CUDA interface.
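As a cross-check on the GPU-Z logs, MATLAB can also report the card's memory directly from inside the script, via the gpuDevice object (the property is FreeMemory in older releases, AvailableMemory in newer ones):

```matlab
% Query GPU memory from inside MATLAB as the run progresses.
% (Use g.AvailableMemory instead of g.FreeMemory on newer releases.)
g = gpuDevice(1);
fprintf('GPU memory in use: %.1f MB of %.1f MB\n', ...
    (g.TotalMemory - g.FreeMemory)/2^20, g.TotalMemory/2^20);
```

Printing that every few thousand iterations would show whether the spikes line up with the array growth.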
So here is just more detail/information (for those that might be interested) about testing/benchmarking GPUs using MATLAB GPU code (and comparing it back against the MATLAB CPU version of the same code).
And the current results are also telling me that GPUs are NOT necessarily faster for computationally intensive tasks, ESPECIALLY if the task is HIGHLY single-threaded and single-process, if it is an iterative 1-D solution, and if it uses double precision.
-
(*I MIGHT be able to parallelize it, but it would mean that I would have to solve it kind of recursively. If I can restart the iteration at any point along the way, then the idea running around in my head is that I can break it up into, say, 6 sections, so the closed intervals would run from a to b/6, b/6 to 2*b/6, etc., and solve each of those pieces separately. But that's super risky, it makes things EXTREMELY complicated (converting the linear iterative solution into a recursive one), and it would be VERY easy to make a mistake in doing that.)
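For what it's worth, here is roughly what I mean by the sectioned idea. This is an UNTESTED sketch, and seeding each section's starting radius is exactly the risky part: a coarse pass estimates the radius at each section boundary, then each section runs its fine iteration independently (so that inner work could, in principle, run in parallel).

```matlab
% Untested sketch: split the run into sections seeded by a coarse pass.
% Whether the coarse seeds are accurate enough is the open question.
R_i = 6; R_final = 2; dL = 1e-5; d_theta = 3*pi/2;
dR_d_theta = (R_final - R_i)/d_theta;
n_sections = 6;
steps = round((R_i/dL)/n_sections);   % fine steps per section

% Coarse pass: one big step per section, just to seed the boundary radii.
R_seed = zeros(1, n_sections); R_seed(1) = R_i;
for s = 2:n_sections
    R_seed(s) = R_seed(s-1) + dR_d_theta*(steps*dL/R_seed(s-1));
end

% Fine pass: each section now depends only on its own seed, so this outer
% loop is the part that could become a parfor.
theta_section = zeros(1, n_sections);
for s = 1:n_sections
    Ri = R_seed(s); acc = 0;
    for i = 1:steps
        thetai = dL/Ri;               % angle for this small piece
        acc = acc + thetai;
        Ri = Ri + dR_d_theta*thetai;  % update radius
    end
    theta_section(s) = acc;
end
theta_sum = sum(theta_section);       % total rotation, in radians
```

Comparing theta_sum from this against the straight linear run would show how much error the coarse seeding introduces.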