
cuBLAS Learning Notes (Part 2)


I'm eager to switch to a different way of learning, so I'm uploading this first post and will slowly look for a way out of the pit.

The cuBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA CUDA runtime. It allows the user to access the computational resources of NVIDIA Graphics Processing Unit (GPU).

I originally wanted to write this up properly and show off a bit... well... newbie that I am.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>
#include "cublas_v2.h"

#define M 6
#define N 5
#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))

static __inline__ void modify (cublasHandle_t handle, float *m, int ldm, int n, int p, int q, float alpha, float beta)
{
    cublasSscal (handle, n-p+1, &alpha, &m[IDX2F(p,q,ldm)], ldm);
    cublasSscal (handle, ldm-p+1, &beta, &m[IDX2F(p,q,ldm)], 1);
}

int main (void)
{
    cudaError_t cudaStat;
    cublasStatus_t stat;
    cublasHandle_t handle;
    int i, j;
    float* devPtrA;
    float* a = 0;

    a = (float *)malloc (M * N * sizeof (*a));
    if (!a) {
        printf ("host memory allocation failed");
        return EXIT_FAILURE;
    }
    for (j = 1; j <= N; j++) {
        for (i = 1; i <= M; i++) {
            a[IDX2F(i,j,M)] = (float)((i-1) * M + j);
        }
    }
    cudaStat = cudaMalloc ((void**)&devPtrA, M*N*sizeof(*a));
    if (cudaStat != cudaSuccess) {
        printf ("device memory allocation failed");
        return EXIT_FAILURE;
    }
    stat = cublasCreate(&handle);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf ("CUBLAS initialization failed\n");
        return EXIT_FAILURE;
    }
    stat = cublasSetMatrix (M, N, sizeof(*a), a, M, devPtrA, M);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf ("data download failed");
        cudaFree (devPtrA);
        cublasDestroy(handle);
        return EXIT_FAILURE;
    }
    modify (handle, devPtrA, M, N, 2, 3, 16.0f, 12.0f);
    stat = cublasGetMatrix (M, N, sizeof(*a), devPtrA, M, a, M);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf ("data upload failed");
        cudaFree (devPtrA);
        cublasDestroy(handle);
        return EXIT_FAILURE;
    }
    cudaFree (devPtrA);
    cublasDestroy(handle);
    for (j = 1; j <= N; j++) {
        for (i = 1; i <= M; i++) {
            printf ("%7.0f", a[IDX2F(i,j,M)]);
        }
        printf ("\n");
    }
    free(a);
    return EXIT_SUCCESS;
}

1. cuBLAS uses a column-major data layout; the code above uses 1-based indexing on a matrix with 6 rows and 5 columns (M = 6, N = 5). A quick host-side sketch of this layout follows.
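To make the layout concrete, here is a small host-only sketch of my own (no cuBLAS calls, just the IDX2F macro from the code above) that fills the same 6×5 matrix and prints it row by row; consecutive memory locations run down a column, which is exactly what "column-major" means. Note that the example program itself prints with the column loop on the outside, so each line of its output is actually one column of the matrix.

#include <stdio.h>

#define M 6
#define N 5
/* 1-based (i,j) -> 0-based linear offset into a column-major array */
#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))

int main (void)
{
    float a[M * N];
    int i, j;
    /* Same initialization as the example: element (i,j) holds (i-1)*M + j */
    for (j = 1; j <= N; j++)
        for (i = 1; i <= M; i++)
            a[IDX2F(i,j,M)] = (float)((i-1) * M + j);
    /* Print row by row: row 1 is 1 2 3 4 5, row 2 is 7 8 9 10 11, ... */
    for (i = 1; i <= M; i++) {
        for (j = 1; j <= N; j++)
            printf ("%7.0f", a[IDX2F(i,j,M)]);
        printf ("\n");
    }
    return 0;
}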

2. Two macros are defined, #define IDX2F(i,j,ld) and #define IDX2C(i,j,ld) (the latter belongs to the 0-based variant of the example, so it does not appear in the code above).

Honestly I don't see a big difference: judging from the names, the former looks Fortran-related and the latter C-related, but both define column-major storage. The only real distinction is that IDX2F expects 1-based indices while IDX2C expects 0-based ones, as the sketch below shows.
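For reference, a minimal side-by-side sketch of the two macros (IDX2C taken from the 0-based variant of the example); both compute the same linear offset into a column-major array, they just expect different index origins:

#include <stdio.h>

#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))  /* 1-based indices, Fortran-style */
#define IDX2C(i,j,ld) (((j)*(ld))+(i))          /* 0-based indices, C-style */

int main (void)
{
    /* Row 2, column 3 of a matrix with leading dimension 6: */
    printf ("IDX2F(2,3,6) = %d\n", IDX2F(2,3,6));  /* prints 13 */
    printf ("IDX2C(1,2,6) = %d\n", IDX2C(1,2,6));  /* prints 13 as well */
    return 0;
}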

3. The custom modify function uses cublasSscal; this somewhat puzzling function is documented as follows.

void cublasSscal (int n, float alpha, float *x, int incx)

(That is the legacy-API signature; with cublas_v2.h, as in the code above, it becomes cublasStatus_t cublasSscal(cublasHandle_t handle, int n, const float *alpha, float *x, int incx), taking a handle and a pointer to the scalar.)

It replaces the single-precision vector x with alpha * x.

n: number of elements in the input vector;
alpha: single-precision scalar multiplier;
x: single-precision vector with n elements;
incx: storage spacing between elements of x.
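As a standalone illustration (my own sketch, not from the post), here is roughly what a bare cublasSscal call through the v2 API looks like on a plain 5-element vector, with error checking trimmed for brevity:

#include <stdio.h>
#include <cuda_runtime.h>
#include "cublas_v2.h"

int main (void)
{
    float h_x[5] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
    float alpha = 2.0f;
    float *d_x;
    cublasHandle_t handle;
    int i;

    cudaMalloc ((void**)&d_x, 5 * sizeof(float));
    cublasCreate (&handle);
    cublasSetVector (5, sizeof(float), h_x, 1, d_x, 1);  /* host -> device */

    /* x = alpha * x: scale all 5 elements, unit stride */
    cublasSscal (handle, 5, &alpha, d_x, 1);

    cublasGetVector (5, sizeof(float), d_x, 1, h_x, 1);  /* device -> host */
    for (i = 0; i < 5; i++)
        printf ("%5.1f", h_x[i]);                        /* 2 4 6 8 10 */
    printf ("\n");

    cublasDestroy (handle);
    cudaFree (d_x);
    return 0;
}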

It is used like this: modify (handle, devPtrA, M, N, 2, 3, 16.0f, 12.0f);

What this means is: modify some of the data in the devPtrA matrix on the GPU, starting at row 2, column 3.

Which data exactly? Looking inside modify, the following two calls are involved:

cublasSscal (handle, n-p+1, &alpha, &m[IDX2F(p,q,ldm)], ldm);

cublasSscal (handle, ldm-p+1, &beta, &m[IDX2F(p,q,ldm)], 1);

That is, effectively

cublasSscal(handle, 4, &alpha /* 16.0f */, &devPtrA[IDX2F(2,3,6)], 6);

cublasSscal(handle, 5, &beta /* 12.0f */, &devPtrA[IDX2F(2,3,6)], 1);

The former multiplies by 16 the 4 requested entries of devPtrA starting at row 2, column 3 and stepping one column at a time along the row (stride ld = 6);

the latter multiplies by 12 the 5 consecutive entries starting at row 2, column 3 and stepping down the column (stride 1), i.e. rows 2 through 6 of column 3.
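To make the two strides visible, here is a tiny host-side sketch of my own that prints the linear offsets each call touches, using the same IDX2F macro and the values ldm = 6, n = 5, p = 2, q = 3. Note that the fourth offset of the row-wise call, 31, already lies past the last element of the 6×5 = 30-element buffer, which is why only three entries along the row actually show up as modified in the output.

#include <stdio.h>

#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))

int main (void)
{
    int ldm = 6, n = 5, p = 2, q = 3, k;
    int start = IDX2F(p,q,ldm);                     /* = 13 */

    /* First call: n-p+1 = 4 elements, stride ldm = 6 (stepping along row p) */
    printf ("row-wise   :");
    for (k = 0; k < n-p+1; k++)
        printf (" %d", start + k*ldm);              /* 13 19 25 31 */
    printf ("\n");

    /* Second call: ldm-p+1 = 5 elements, stride 1 (stepping down column q) */
    printf ("column-wise:");
    for (k = 0; k < ldm-p+1; k++)
        printf (" %d", start + k);                  /* 13 14 15 16 17 */
    printf ("\n");
    return 0;
}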

The result:

Before "end" is the original matrix with 6 rows and 5 columns.

After "end" is the modified matrix. 1728 = 9 * 16 * 12: this entry was processed twice, since it is the starting point of both calls.

And because only 3 entries remain along the row (columns 3 through 5), fewer than the 4 that were requested, you can see that just 3 entries along the row were modified.

To verify the explanation above, I changed the modify call:

modify (handle, devPtrA, M, N, 2, 2, 10.0f, 12.0f);
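Working through the same reasoning as above (my own back-of-the-envelope check, not read off the screenshot): the new starting entry at row 2, column 2 is originally (2-1)*6 + 2 = 8, so after both scalings it should become 8 * 10 * 12 = 960. The row-wise call now asks for n-p+1 = 4 entries along row 2 starting at column 2, i.e. columns 2 through 5, which this time all lie inside the matrix; the column-wise call still scales the 5 entries of column 2 from row 2 down to row 6.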

The result is as follows: (^o^)/

