The impact of coalesced memory access and modulo operations on GPU speed
Published: 2019-05-25


#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <sys/time.h>
#include <cuda_runtime.h>

using namespace std;

#define IDX2C(i,j,rows) (((j)*(rows)+(i)))
#define IDX2R(i,j,cols) (((i)*(cols)+(j)))
#define BLOCK_SIZE 32

#define CHECK_EQ1(a,b) do { \
    if ((a) != (b)) { \
        cout << __FILE__ << " : " << __LINE__ << " : check failed because " \
             << (a) << "!=" << (b) << endl; \
        exit(1); \
    } \
} while (0)

#define CUDA_CHECK(condition) CHECK_EQ1((condition), cudaSuccess)

template <typename T>
inline void printMtx(T *mtx, int row, int col) {
    for (int i = 0; i < row; ++i) {
        for (int j = 0; j < col; ++j) {
            cout << mtx[IDX2C(i,j,row)] << " ";
        }
        cout << endl;
    }
}

// if mtx is a sub-matrix:
// 1. elements are stored contiguously, row and col are the sub-matrix size
// 2. elements are not contiguous, row is the full matrix row count
template <typename T>
inline void printMtxg(T *mtx, int row, int col) {
    T *c = (T*)malloc(sizeof(T)*row*col);
    CUDA_CHECK(cudaMemcpy(c, mtx, sizeof(T)*row*col, cudaMemcpyDeviceToHost));
    cudaDeviceSynchronize();
    printMtx(c, row, col);
    free(c);
}

template <typename T>
inline void printVec(T *vec, int len) {
    for (int i = 0; i < len; ++i) cout << vec[i] << " ";
    cout << endl;
}

template <typename T>
inline void printVecg(T *gvec, int len) {
    T *vec = (T*)malloc(sizeof(T)*len);
    CUDA_CHECK(cudaMemcpy(vec, gvec, sizeof(T)*len, cudaMemcpyDeviceToHost));
    printVec(vec, len);
    free(vec);
}

// compare a CPU result with a GPU result element by element
bool validate(double *rst, double *grst, int row, int col) {
    //cout << "cpu rst\n";
    //printMtxt(rst, row, col);
    //cout << "gpu rst\n";
    //printMtxgt(grst, row, col);
    double *crst = (double *)malloc(sizeof(double)*row*col);
    CUDA_CHECK(cudaMemcpy(crst, grst, sizeof(double)*row*col, cudaMemcpyDeviceToHost));
    bool flag = true;
    for (int i = 0; i < row; ++i) {
        for (int j = 0; j < col; ++j) {
            if (rst[IDX2C(i,j,row)] != crst[IDX2C(i,j,row)]) {
                //return false;
                flag = false;
                cout << i << "," << j << " "
                     << rst[IDX2C(i,j,row)] << " != " << crst[IDX2C(i,j,row)] << endl;
            }
        }
    }
    free(crst);
    return flag;
}

// the four kernel variants; they perform the same element-wise update and
// differ only in how threads are mapped onto the column-major matrix
__global__ void addMinusMtx0(double *mat, int row, int col, double *arr, int len);
__global__ void addMinusMtx1(double *mat, int row, int col, double *arr, int len);
__global__ void addMinusMtx2(double *mat, int row, int col, double *arr, int len);
__global__ void addMinusMtx3(double *mat, int row, int col, double *arr, int len);

int main(int argc, char **argv) {
    int row = atoi(argv[1]);
    int col = atoi(argv[2]);
    int len = row;                                   // one vector entry per matrix row

    double *mat = (double *)malloc(sizeof(double)*row*col);
    double *arr = (double *)malloc(sizeof(double)*len);
    for (int i = 0; i < row*col; ++i) mat[i] = 1.0;
    for (int i = 0; i < len; ++i) arr[i] = 0.5;

    double *gmat, *garr;
    CUDA_CHECK(cudaMalloc(&gmat, sizeof(double)*row*col));
    CUDA_CHECK(cudaMalloc(&garr, sizeof(double)*len));
    CUDA_CHECK(cudaMemcpy(gmat, mat, sizeof(double)*row*col, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(garr, arr, sizeof(double)*len, cudaMemcpyHostToDevice));

    struct timeval b1, e1;

    // 0: a block processes a column; the grid's x dimension maximum is very large,
    //    so use col blocks in the x direction. Data access is contiguous.
    gettimeofday(&b1, NULL);
    addMinusMtx0<<<col, 1024>>>(gmat, row, col, garr, len);
    CUDA_CHECK(cudaPeekAtLastError());
    CUDA_CHECK(cudaDeviceSynchronize());
    gettimeofday(&e1, NULL);
    cout << "gpu0 real time used: " << e1.tv_sec-b1.tv_sec + (double)(e1.tv_usec-b1.tv_usec)/1000000 << endl;

    // 1: a thread computes one element of gmat. Data access is very bad.
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid1((col + BLOCK_SIZE - 1)/BLOCK_SIZE, (row + BLOCK_SIZE - 1)/BLOCK_SIZE);
    gettimeofday(&b1, NULL);
    addMinusMtx1<<<dimGrid1, dimBlock>>>(gmat, row, col, garr, len);
    CUDA_CHECK(cudaPeekAtLastError());
    CUDA_CHECK(cudaDeviceSynchronize());
    gettimeofday(&e1, NULL);
    cout << "gpu1 real time used: " << e1.tv_sec-b1.tv_sec + (double)(e1.tv_usec-b1.tv_usec)/1000000 << endl;

    // 2: almost the same as addMinusMtx0; the only difference is that there are
    //    also blocks in the y direction.
    int y = (row + 1024 - 1)/1024;
    if (y > 65535) y = 65535;                        // grid.y maximum dimension
    dim3 dimGrid;
    dimGrid.x = col;
    dimGrid.y = y;
    //dim3 dimGrid(col, y);
    gettimeofday(&b1, NULL);
    addMinusMtx2<<<dimGrid, 1024>>>(gmat, row, col, garr, len);
    CUDA_CHECK(cudaPeekAtLastError());
    CUDA_CHECK(cudaDeviceSynchronize());
    gettimeofday(&e1, NULL);
    cout << "gpu2 real time used: " << e1.tv_sec-b1.tv_sec + (double)(e1.tv_usec-b1.tv_usec)/1000000 << endl;

    // 3: process gmat as a flat array, which needs a modulo operation. On the GPU,
    //    integer division and modulo are costly: tens of instructions on compute
    //    capability 1.0, but below 20 instructions for 2.x and higher.
    gettimeofday(&b1, NULL);
    addMinusMtx3<<<(row*col + 1024 - 1)/1024, 1024>>>(gmat, row, col, garr, len);
    CUDA_CHECK(cudaPeekAtLastError());
    CUDA_CHECK(cudaDeviceSynchronize());
    gettimeofday(&e1, NULL);
    cout << "gpu3 real time used: " << e1.tv_sec-b1.tv_sec + (double)(e1.tv_usec-b1.tv_usec)/1000000 << endl;

    cudaFree(gmat); cudaFree(garr);
    free(mat); free(arr);
    return 0;
}
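The listing above only declares the kernels; a minimal sketch of what the four variants could look like, consistent with the launch configurations and comments above, is given below. The exact element-wise update is an assumption here, written as mat(i,j) += arr(i) with one vector entry per matrix row, since only the memory-access pattern matters for the comparison.

// 0: one block per column (blockIdx.x = j); its 1024 threads stride down the
//    column, so a warp reads 32 consecutive doubles -> coalesced access
__global__ void addMinusMtx0(double *mat, int row, int col, double *arr, int len) {
    int j = blockIdx.x;
    for (int i = threadIdx.x; i < row; i += blockDim.x)
        mat[IDX2C(i, j, row)] += arr[i];            // stand-in update
}

// 1: one thread per element, with threadIdx.x running along a row of the
//    column-major matrix; neighbouring threads in a warp are row*sizeof(double)
//    bytes apart, so nothing coalesces
__global__ void addMinusMtx1(double *mat, int row, int col, double *arr, int len) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    if (i < row && j < col)
        mat[IDX2C(i, j, row)] += arr[i];
}

// 2: like 0, but the column is also split over blocks in the y direction; the
//    stride loop covers the case where grid.y was capped at 65535
__global__ void addMinusMtx2(double *mat, int row, int col, double *arr, int len) {
    int j = blockIdx.x;
    for (int i = blockIdx.y * blockDim.x + threadIdx.x; i < row;
         i += gridDim.y * blockDim.x)
        mat[IDX2C(i, j, row)] += arr[i];
}

// 3: treat gmat as one flat array; recovering the row index for arr needs an
//    integer modulo, which is the extra cost this version measures
__global__ void addMinusMtx3(double *mat, int row, int col, double *arr, int len) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < row * col)
        mat[idx] += arr[idx % row];
}

With this mapping, variants 0, 2 and 3 all let the 32 threads of a warp touch 32 consecutive doubles, while variant 1 makes each warp touch 32 locations a full column apart, which is exactly the difference the timings below expose.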

nvcc  -arch=sm_35  mtxOp.cu -o mtxOp

./mtxOp 30000 8000

gpu0 real time used: 0.028922

gpu1 real time used: 0.106911
gpu2 real time used: 0.028231
gpu3 real time used: 0.027024

Because the matrix is stored column-major, method 1 is the slowest: its global-memory accesses cannot be coalesced, since consecutive threads in a warp do not touch consecutive data (in a column-major layout, elements that sit next to each other in a row are row * sizeof(double) bytes apart in memory).

The idea of method 0 is to launch as many blocks as there are columns, using blocks only in the x direction; the grid can have up to 2147483647 blocks in x, and for any matrix that fits in device memory the number of columns will not hit that limit.

Each block has 1024 threads, also only in the x direction, and these 1024 threads loop over the block's column, which guarantees coalesced access.

Method 2 goes one step further than method 0 by also using blocks in the y direction.

Method 3 treats the matrix as a single vector, which requires a modulo operation. On compute capability 1.0 the modulo is very slow, but compute capability 3.5 optimizes it reasonably well, so this method turns out to be the fastest here.

If the modulo operation is removed from method 3, its cost can be isolated: the time drops to 0.025607 s, an improvement of about 0.0014 s. A sketch of such a variant follows.
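The post does not show the modulo-free version; one possible (hypothetical) way to measure just the modulo cost, at the price of no longer computing the same result, is to drop the per-element vector lookup:

// hypothetical variant of addMinusMtx3 with the modulo removed: same flat
// indexing and same global-memory traffic on mat, but arr is read at a fixed
// position instead of arr[idx % row]
__global__ void addMinusMtx3NoMod(double *mat, int row, int col, double *arr, int len) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < row * col)
        mat[idx] += arr[0];
}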

