#include <TensorReductionSycl.h>
Public Types | |
typedef Self::CoeffReturnType | CoeffReturnType |
Static Public Member Functions | |
static void | run (const Self &self, Op &reducer, const Eigen::SyclDevice &dev, CoeffReturnType *output) |
Static Public Attributes | |
static const bool | HasOptimizedImplementation = false |
For now, let's start with a full reducer. Self is of little use here because, during expression construction, the reduction is treated as a leaf node. We take the child of the reduction, build an expression from it, and apply the full reducer function to that expression. FullReducer applies the reduction operation to the child of the reduction. Once that is done, the reduction itself is an empty shell that can be thrown away and treated as a leaf node.
Definition at line 103 of file TensorReductionSycl.h.
typedef Self::CoeffReturnType Eigen::internal::FullReducer< Self, Op, const Eigen::SyclDevice, Vectorizable >::CoeffReturnType |
Definition at line 105 of file TensorReductionSycl.h.
inline static
This is the child of the reduction.
Initial reduction: if the input size is less than red_factor, only one thread is created.
If the shared memory is smaller than GRange, the shared_mem size is set to the TotalSize; in this case a single kernel is created for the recursion that reduces everything to one value.
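The sizing notes above can be sketched on the host without any SYCL dependency. This is a minimal illustration, not the code in TensorReductionSycl.h: the names `LaunchConfig`, `plan_launch` and `next_pow2`, the power-of-two rounding, and the rule of clamping the work-group size to the global range are all assumptions made for the example, and they represent only one reading of the notes.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical result type: how many work-items to launch and how large
// each work-group (and its local scratch buffer) should be.
struct LaunchConfig {
  std::size_t global_range;  // total number of work-items
  std::size_t group_size;    // work-items per group == local memory slots
};

// Round n up to the next power of two (for n >= 1).
inline std::size_t next_pow2(std::size_t n) {
  std::size_t p = 1;
  while (p < n) p <<= 1;
  return p;
}

// red_factor: number of input elements each work-item folds serially.
// max_group_size: assumed device limit on the work-group size.
inline LaunchConfig plan_launch(std::size_t input_size, std::size_t red_factor,
                                std::size_t max_group_size) {
  // Fewer elements than red_factor: the division yields 0, so use one thread.
  std::size_t threads = std::max<std::size_t>(1, input_size / red_factor);
  std::size_t global_range = next_pow2(threads);
  // When the global range fits inside one group, shrink the group to match,
  // so that a single group (one kernel launch) covers everything.
  std::size_t group_size = std::min(max_group_size, global_range);
  return LaunchConfig{global_range, group_size};
}
```

With these assumed rules, an input shorter than red_factor yields a single work-item in a single group, while a large input yields several groups whose partial results still need a further reduction pass.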
Creates the shared memory used for calculating the reduction. It collects all the reduced values from shared memory, since there is no global barrier on the GPU. Once they are saved, the reduction can be applied to them recursively in order to reduce the whole input.
The reduction cannot be captured automatically through our device conversion recursion, because a reduction has two behaviours. The first is when it is used as a root to launch the sub-kernel. The second is when it is treated as a leaf node that passes the calculated result to its parent kernel. The latter is detected automatically by our device expression generator; the former is created here.
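The two roles described above can be illustrated with a toy expression tree in plain C++. This is only a sketch of the idea, not Eigen's expression machinery: `Expr`, `TensorLeaf`, `SumReduction` and `Scale` are hypothetical types invented for the example, and the "kernel launch" is an ordinary loop.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical expression node interface.
struct Expr {
  virtual ~Expr() = default;
  virtual float coeff(std::size_t i) const = 0;
};

// A plain data leaf.
struct TensorLeaf : Expr {
  std::vector<float> data;
  explicit TensorLeaf(std::vector<float> d) : data(std::move(d)) {}
  float coeff(std::size_t i) const override { return data[i]; }
};

// A full sum-reduction node with two roles.
struct SumReduction : Expr {
  const Expr& child;
  std::size_t child_size;
  float result = 0.0f;
  SumReduction(const Expr& c, std::size_t n) : child(c), child_size(n) {}
  // Role 1 (root): run a separate pass over the child expression.
  void run() {
    result = 0.0f;
    for (std::size_t i = 0; i < child_size; ++i) result += child.coeff(i);
  }
  // Role 2 (leaf): hand the stored value to the parent expression.
  float coeff(std::size_t) const override { return result; }
};

// A parent expression that only ever sees the reduction as a leaf.
struct Scale : Expr {
  const Expr& arg;
  float factor;
  Scale(const Expr& a, float f) : arg(a), factor(f) {}
  float coeff(std::size_t i) const override { return factor * arg.coeff(i); }
};
```

In this sketch the caller must invoke `run()` before the parent is evaluated, which mirrors the point that the root role has to be set up explicitly while the leaf role needs no special handling by the parent.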
This is the evaluator for device_self_expr. It is equivalent to the self that was passed to the run function; the difference is that the device evaluator is detectable and recognisable on the device.
A const_cast is added as a naive workaround for the dropped-qualifier error.
This is used to recursively reduce the temporary values down to a single element.
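The overall scheme in these notes (per-group partial results collected in a temporary buffer, which is then reduced again until one element remains) can be illustrated sequentially on the CPU. The following is a sketch under stated assumptions rather than SYCL kernel code: `reduce_groups` and `full_reduce` are hypothetical names, and `init` is assumed to be the identity element of `op`.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One "kernel launch": each group of up to group_size elements is folded
// into a single partial result. On a GPU the groups would run in parallel
// and could only synchronise within their own local memory.
template <typename T, typename Op>
std::vector<T> reduce_groups(const std::vector<T>& in, std::size_t group_size,
                             Op op, T init) {
  std::vector<T> partial;
  for (std::size_t start = 0; start < in.size(); start += group_size) {
    std::size_t end = std::min(in.size(), start + group_size);
    T acc = init;  // init must be the identity of op (0 for +, 1 for *)
    for (std::size_t i = start; i < end; ++i) acc = op(acc, in[i]);
    partial.push_back(acc);
  }
  return partial;
}

// Re-apply the group reduction to the temporary buffer until a single
// value is left, standing in for the missing global barrier.
template <typename T, typename Op>
T full_reduce(std::vector<T> data, std::size_t group_size, Op op, T init) {
  if (data.empty()) return init;
  if (group_size < 2) group_size = 2;  // guarantee the buffer shrinks
  while (data.size() > 1) data = reduce_groups(data, group_size, op, init);
  return data.front();
}
```

Each pass shrinks the buffer by roughly a factor of group_size, so the number of passes grows only logarithmically with the input size.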
Definition at line 108 of file TensorReductionSycl.h.
static
Definition at line 106 of file TensorReductionSycl.h.