Does Matlab cause Cuda to leak memory due to CUcontext caching?
Is using cudaDeviceReset() after computations the normal way to use the GPU from Matlab? I can't use the GPU computation in the latest version of Matlab because my GPU doesn't support Compute Capability 1.3+, and I don't want to pay tons of money to Accelereyes Jacket for using a simple Cuda function like cudaMemGetInfo() or my simple Cuda kernels.
I've found some very frustrating behavior when calling Cuda from Matlab. In Visual Studio 2008, I wrote a trivial DLL which uses the standard MEX interface to run one Cuda query: how much RAM is free on the device (Listing 1).
// cudaMemoryCheck.cpp : Defines the exported functions for the DLL application.
#include <mex.h>
#include <cuda.h>
#include <driver_types.h>
#include <cuda_runtime_api.h>
void mexFunction( int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[] )
{
    size_t free = 0, total = 0;
    cudaError_t result = cudaMemGetInfo(&free, &total);
    mexPrintf("free memory in bytes %u (%u MB), total memory in bytes %u (%u MB). ", free, free/1024/1024, total, total/1024/1024);
    if( total > 0 )
        mexPrintf("%2.2f%% free\n", (100.0*free)/total );
    else
        mexPrintf("\n");
    // this is the critical line!
    cudaDeviceReset();
}
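A side note on the listing: the %u specifiers match size_t only on a 32-bit build like this one. The following is a minimal sketch of a width-independent way to build the same report line. formatMemoryReport is a hypothetical helper name that is not part of the original project; its result could be handed to mexPrintf("%s", ...).

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper (not in the original project): builds the same report
// line as Listing 1, but lets the stream pick the right width for size_t.
std::string formatMemoryReport(std::size_t freeBytes, std::size_t totalBytes)
{
    const std::size_t bytesPerMB = 1024 * 1024;
    std::ostringstream out;
    out << "free memory in bytes " << freeBytes
        << " (" << freeBytes / bytesPerMB << " MB), "
        << "total memory in bytes " << totalBytes
        << " (" << totalBytes / bytesPerMB << " MB). ";
    if (totalBytes > 0) {
        // Same idea as the "%2.2f%% free" format in Listing 1.
        out << std::fixed << std::setprecision(2)
            << (100.0 * freeBytes) / totalBytes << "% free";
    }
    out << "\n";
    return out.str();
}
```

Inside mexFunction this would be used as mexPrintf("%s", formatMemoryReport(free, total).c_str()); the CUDA calls themselves stay unchanged.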
I compile the project as a Win32 DLL (release mode), export mexFunction using a DEF file, and rename the DLL file extension to .mexw32.
When I run cudaMemoryCheck from Matlab, I find that my GPU will leak memory if the cudaDeviceReset() is commented out. Here's my trivial Matlab code (Listing 2):
addpath('C:\Users\admin\Documents\Visual Studio 2008\Projects\cudaMemoryCheck\Release')
for i=1:20
    clear mex
    cudaMemoryCheck;
end
Running this function in Matlab, I see:
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
free memory in bytes 57393152 (54 MB), total memory in bytes 244776960 (233 MB). 23.45% free
The output from Matlab is very different when cudaDeviceReset() is commented out:
free memory in bytes 37019648 (35 MB), total memory in bytes 244776960 (233 MB). 15.12% free
free memory in bytes 25092096 (23 MB), total memory in bytes 244776960 (233 MB). 10.25% free
free memory in bytes 13549568 (12 MB), total memory in bytes 244776960 (233 MB). 5.54% free
free memory in bytes 12107776 (11 MB), total memory in bytes 244776960 (233 MB). 4.95% free
free memory in bytes 8568832 (8 MB), total memory in bytes 244776960 (233 MB). 3.50% free
free memory in bytes 9617408 (9 MB), total memory in bytes 244776960 (233 MB). 3.93% free
free memory in bytes 6078464 (5 MB), total memory in bytes 244776960 (233 MB). 2.48% free
free memory in bytes 8044544 (7 MB), total memory in bytes 244776960 (233 MB). 3.29% free
free memory in bytes 5816320 (5 MB), total memory in bytes 244776960 (233 MB). 2.38% free
free memory in bytes 7520256 (7 MB), total memory in bytes 244776960 (233 MB). 3.07% free
free memory in bytes 8830976 (8 MB), total memory in bytes 244776960 (233 MB). 3.61% free
free memory in bytes 5292032 (5 MB), total memory in bytes 244776960 (233 MB). 2.16% free
free memory in bytes 3407872 (3 MB), total memory in bytes 244776960 (233 MB). 1.39% free
free memory in bytes 0 (0 MB), total memory in bytes 0 (0 MB).
free memory in bytes 0 (0 MB), total memory in bytes 0 (0 MB).
free memory in bytes 0 (0 MB), total memory in bytes 0 (0 MB).
free memory in bytes 0 (0 MB), total memory in bytes 0 (0 MB).
free memory in bytes 0 (0 MB), total memory in bytes 0 (0 MB).
free memory in bytes 0 (0 MB), total memory in bytes 0 (0 MB).
free memory in bytes 0 (0 MB), total memory in bytes 0 (0 MB).
So I've concluded that even though my MEX function allocates no memory on the GPU, the Cuda Runtime API is creating new CUcontexts every time the MEX function runs, and it never clears them until I close Matlab or I use cudaDeviceReset(). Eventually the GPU runs out of memory despite the fact that I did not allocate anything on it!
I do not like using cudaDeviceReset(). The API says, "The function cudaDeviceReset() will deinitialize the primary context for the calling thread's current device immediately" and "It is the caller's responsibility to ensure that the device is not being accessed by any other host threads from the process when this function is called." In other words, using cudaDeviceReset() could terminate other GPU calculations immediately and without warning. I have not found any documentation that using cudaDeviceReset() frequently is normal, so I don't want to do it. I will accept any answer here that proves that using cudaDeviceReset() is normal and required.
Version info: NVIDIA GPU Computing Toolkit 4.0, Matlab 7.8.0 (R2009a, 32-bit), Windows 7 Enterprise SP1 (64-bit), Nvidia Quadro NVS 420 (latest Nvidia drivers, 270.81).
I can also reproduce this problem on Windows XP (32-bit, SP3) with a GeForce 8400 GS, same Matlab, Visual Studio, and GPU Computing Toolkit.
Output of deviceQuery.exe:
deviceQuery.exe Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Found 2 CUDA Capable device(s)
Device 0: "Quadro NVS 420"
CUDA Driver Version / Runtime Version 4.0 / 4.0
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 233 MBytes (244776960 bytes)
( 1) Multiprocessors x ( 8) CUDA Cores/MP: 8 CUDA Cores
GPU Clock Speed: 1.40 GHz
Memory Clock rate: 700.00 Mhz
Memory Bus Width: 64-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: No with 0 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 3 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "Quadro NVS 420"
CUDA Driver Version / Runtime Version 4.0 / 4.0
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 234 MBytes (244908032 bytes)
( 1) Multiprocessors x ( 8) CUDA Cores/MP: 8 CUDA Cores
GPU Clock Speed: 1.40 GHz
Memory Clock rate: 700.00 Mhz
Memory Bus Width: 64-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: No with 0 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.0, CUDA Runtime Version = 4.0, NumDevs = 2, Device = Quadro NVS 420, Device = Quadro NVS 420
I don't think you should need to use cudaDeviceReset. What happens if you omit the call to clear mex? Why are you doing that in the first place? That will cause MATLAB to unload your MEX file, and I suspect that is at the root of the memory leak.
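If the MEX file does have to be unloaded now and then, one pattern that may help is to register a cleanup routine with mexAtExit, so the context is released once when MATLAB unloads the file rather than after every call. The sketch below is an illustration under assumptions, not a tested fix for this setup: releaseCudaContext and the g_ variables are made-up names, and the stub section exists only so the code compiles without MATLAB or a GPU. It assumes MATLAB_MEX_FILE is defined for real builds (a Visual Studio project may need to add it by hand).

```cpp
// cudaMemoryCheckAtExit.cpp : sketch only, not the original poster's code.
#include <cstddef>

#ifdef MATLAB_MEX_FILE
// Real build: use the actual MEX and CUDA runtime headers.
#include <mex.h>
#include <cuda_runtime_api.h>
#else
// Stand-in stubs (assumptions for illustration only) so the sketch can be
// compiled and exercised without MATLAB or a GPU.
#include <cstdio>
struct mxArray;
typedef int cudaError_t;
static const cudaError_t cudaSuccess = 0;
static int g_stubResetCalls = 0;          // counts stubbed cudaDeviceReset calls
static void (*g_stubExitFcn)(void) = 0;   // remembers the registered handler
static cudaError_t cudaMemGetInfo(std::size_t *freeBytes, std::size_t *totalBytes)
{
    *freeBytes = 57393152;
    *totalBytes = 244776960;
    return cudaSuccess;
}
static cudaError_t cudaDeviceReset(void) { ++g_stubResetCalls; return cudaSuccess; }
static int mexAtExit(void (*fcn)(void)) { g_stubExitFcn = fcn; return 0; }
#define mexPrintf std::printf
#endif

static bool g_exitHandlerRegistered = false;

// Called by MATLAB when the MEX file is unloaded (clear mex) or MATLAB exits.
static void releaseCudaContext(void)
{
    cudaDeviceReset();
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    (void)nlhs; (void)plhs; (void)nrhs; (void)prhs;

    // Register the cleanup once per load of the MEX file.
    if (!g_exitHandlerRegistered) {
        mexAtExit(releaseCudaContext);
        g_exitHandlerRegistered = true;
    }

    std::size_t freeBytes = 0, totalBytes = 0;
    cudaError_t result = cudaMemGetInfo(&freeBytes, &totalBytes);
    if (result != cudaSuccess) {
        // Report the failure instead of printing zeros.
        mexPrintf("cudaMemGetInfo failed with error code %d\n", (int)result);
        return;
    }
    mexPrintf("free %.0f bytes, total %.0f bytes\n",
              (double)freeBytes, (double)totalBytes);
}
```

With this layout, repeated calls without clear mex reuse the same context, and the reset happens only on unload. Whether a reset at unload time is safe when other code in the same MATLAB process is using the GPU is a separate question that this sketch does not settle.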