How to call sgemm inside a custom TensorFlow C++ op

I am following the tutorial on how to define my own op for TensorFlow in C++.

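I want to call sgemm inside my custom TensorFlow C++ op. I am writing two kernels, one for CUDA and one for the CPU. What would the sgemm call look like in each case? Or is there a generic way that works for both?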

I tried the following snippet, but I could not get it to work because of missing include files:

    auto dev_ctx = context->op_device_context();
    auto* dev_stream = dev_ctx->stream();
    OP_REQUIRES(context, dev_stream, errors::Internal("No stream available."));

    bool blas_launch_status =
        dev_stream
            ->ThenBlasGemm(...

Also, I am not sure whether this is generic or only works for CUDA.

Is this documented anywhere?

How do I call cublasSgemm in my GPU/CUDA implementation?
Or, more precisely, how do I get a cublasHandle_t?

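I searched the TF code a bit, and there is a class CUDABlas that seems to provide wrappers around the cuBLAS functions. Do I need to use it, or can I call cublasSgemm directly?
I suppose I need the wrapper, because it makes sure the CUDA stream executor stays in a sane state? How do I use the wrapper?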

I also found contrib/rnn/kernels/blas_gemm.cc and core/kernels/matmul_op.cc, which seem to do what I want. The code looks like this:

    #define EIGEN_USE_THREADS

    #if GOOGLE_CUDA
    #include "tensorflow/core/platform/stream_executor.h"
    #endif  // GOOGLE_CUDA

    #include "tensorflow/contrib/rnn/kernels/blas_gemm.h"
    #include "tensorflow/core/framework/op_kernel.h"

    namespace tensorflow {

    #if GOOGLE_CUDA
    namespace {
    template <typename T>
    perftools::gputools::DeviceMemory<T> AsDeviceMemory(const T* cuda_memory) {
      perftools::gputools::DeviceMemoryBase wrapped(const_cast<T*>(cuda_memory));
      perftools::gputools::DeviceMemory<T> typed(wrapped);
      return typed;
    }
    }  // namespace
    #endif  // GOOGLE_CUDA

    namespace functor {
    template <typename T>
    void TensorCuBlasGemm<T>::operator()(OpKernelContext* ctx, bool transa,
                                         bool transb, uint64 m, uint64 n,
                                         uint64 k, T alpha, const T* a, int lda,
                                         const T* b, int ldb, T beta, T* c,
                                         int ldc) {
    #if GOOGLE_CUDA
      perftools::gputools::blas::Transpose trans[] = {
          perftools::gputools::blas::Transpose::kNoTranspose,
          perftools::gputools::blas::Transpose::kTranspose};

      auto a_ptr = AsDeviceMemory(a);
      auto b_ptr = AsDeviceMemory(b);
      auto c_ptr = AsDeviceMemory(c);

      bool blas_launch_status =
          ctx->op_device_context()
              ->stream()
              ->ThenBlasGemm(trans[transa], trans[transb], m, n, k, alpha, a_ptr,
                             lda, b_ptr, ldb, beta, &c_ptr, ldc)
              .ok();
      OP_REQUIRES(ctx, blas_launch_status, errors::Aborted("CuBlasGemm failed!"));
    #else
      ctx->SetStatus(errors::InvalidArgument("CuBlasGemm needs CUDA."));
    #endif
    }
    // (The quoted excerpt ends here, before the closing namespace braces.)

That is, in my Compute(OpKernelContext* ctx), I would call

    ctx->op_device_context()
        ->stream()
        ->ThenBlasGemm(...)

I tried that, but some include headers seem to be missing (TensorFlow 0.12.0 with GPU for Linux). I get the error "fatal error: tensorflow/stream_executor/lib/status.h: No such file or directory". I reported this upstream.

Is there any documentation about all of this, i.e. how to deal with cuBLAS, this DeviceStream interface, the stream executor logic, and so on?

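My current solution is somewhat of a hack. For the CPU, I try to link against whatever BLAS library is available on the system and use sgemm from there. For CUDA, I link against tensorflow/contrib/rnn/python/ops/_lstm_ops.so, because that is where I found TensorCuBlasGemm, which I can use. Basically, that contrib code faced the same problem and came up with this approach. But it partly depends on include files that are not generally available; see the error above.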
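For the CPU path, it may help to spell out what an sgemm call computes. The following is a minimal sketch, assuming the standard BLAS convention of column-major storage with leading dimensions; `reference_sgemm` is a hypothetical name used for illustration only and is not part of BLAS or TensorFlow. A system BLAS would compute the same result, only much faster.

```cpp
#include <vector>

// Hypothetical reference routine (not a BLAS or TensorFlow API):
//   C = alpha * op(A) * op(B) + beta * C
// All matrices are column-major. op(X) is X or its transpose;
// op(A) is m x k, op(B) is k x n, and C is m x n.
void reference_sgemm(bool transa, bool transb, int m, int n, int k,
                     float alpha, const float* a, int lda,
                     const float* b, int ldb,
                     float beta, float* c, int ldc) {
  for (int j = 0; j < n; ++j) {
    for (int i = 0; i < m; ++i) {
      float acc = 0.0f;
      for (int p = 0; p < k; ++p) {
        // Element (r, s) of a column-major matrix X is stored at x[r + s * ldx].
        const float a_ip = transa ? a[p + i * lda] : a[i + p * lda];
        const float b_pj = transb ? b[j + p * ldb] : b[p + j * ldb];
        acc += a_ip * b_pj;
      }
      c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc];
    }
  }
}
```

The argument order mirrors the classic sgemm interface (transpose flags, then m, n, k, alpha, A, lda, B, ldb, beta, C, ldc), so the same call shape carries over when swapping in a real BLAS.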

Solution

You can try the following, which worked for me. At the top of the *.cu.cc file:

    #include <cassert>
    #include <cublas_v2.h>

    cublasHandle_t cublas_handle = NULL;

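In the functor implementation, in the same *.cu.cc file:

    if (cublas_handle == NULL) {
        // Keep the calls outside assert(): when NDEBUG is defined,
        // the argument of assert() is not evaluated at all.
        cublasStatus_t status = cublasCreate(&cublas_handle);
        assert(status == CUBLAS_STATUS_SUCCESS);
        status = cublasSetStream(cublas_handle, d.stream());
        assert(status == CUBLAS_STATUS_SUCCESS);
    }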

Here d is passed from the *.cc file into the functor as an argument, and its value is ctx->eigen_device<Eigen::GpuDevice>().
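Once the handle exists, one detail matters for the actual cublasSgemm call: cuBLAS assumes column-major storage, while TensorFlow tensors are row-major. A common way around this is to compute C^T = B^T * A^T by swapping the operands. The sketch below checks that index arithmetic on the host; `gemm_colmajor` and `matmul_rowmajor` are hypothetical helpers written for illustration, and the cublasSgemm call in the comment is the assumed device-side equivalent (with a, b, c as device pointers), not code from the answer.

```cpp
#include <vector>

// Hypothetical host-side stand-in with the column-major, no-transpose
// semantics of a gemm call: C (m x n) = A (m x k) * B (k x n).
void gemm_colmajor(int m, int n, int k, const float* a, int lda,
                   const float* b, int ldb, float* c, int ldc) {
  for (int j = 0; j < n; ++j) {
    for (int i = 0; i < m; ++i) {
      float acc = 0.0f;
      for (int p = 0; p < k; ++p) {
        acc += a[i + p * lda] * b[p + j * ldb];
      }
      c[i + j * ldc] = acc;
    }
  }
}

// Row-major C (rows x cols) = A (rows x inner) * B (inner x cols).
// A row-major buffer read as column-major is the transposed matrix, so the
// column-major routine is asked for C^T = B^T * A^T: operands are swapped,
// and each leading dimension is the row length of the row-major matrix.
void matmul_rowmajor(int rows, int cols, int inner,
                     const float* a, const float* b, float* c) {
  gemm_colmajor(cols, rows, inner, b, cols, a, inner, c, cols);
  // Assumed device-side equivalent, using the cuBLAS handle:
  //   const float alpha = 1.0f, beta = 0.0f;
  //   cublasSgemm(cublas_handle, CUBLAS_OP_N, CUBLAS_OP_N,
  //               cols, rows, inner, &alpha,
  //               b, cols, a, inner, &beta, c, cols);
}
```

Swapping the operands avoids any explicit transpose or copy; only the order of the arguments and the leading dimensions change.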

Hope this helps, cheers!


Source: http://outofmemory.cn/langs/1225006.html