SyTen

◆ cukrn_dot_elems_per_worker

constexpr std::size_t cukrn_dot_elems_per_worker = 4

Number of elements each thread sums in the dot-product kernel. In a real-world test, 4 appeared to be the optimal value on a Tesla P100.
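This constant controls how the input vectors are divided among threads. The following host-side C++ sketch illustrates that split. It assumes each worker sums a contiguous block of `cukrn_dot_elems_per_worker` products and that the per-worker partial sums are reduced afterwards. SyTen's actual CUDA kernel may use a different access pattern, such as strided indexing. `dot_worker_count` and `chunked_dot` are hypothetical names introduced only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Value copied from the documentation above so the sketch is self-contained.
constexpr std::size_t cukrn_dot_elems_per_worker = 4;

// Hypothetical helper: number of workers (threads) needed for n elements,
// i.e. ceil(n / cukrn_dot_elems_per_worker).
constexpr std::size_t dot_worker_count(std::size_t n) {
    return (n + cukrn_dot_elems_per_worker - 1) / cukrn_dot_elems_per_worker;
}

// Hypothetical host-side model of a two-stage dot product.
// Stage 1: worker w sums the products in the contiguous block
//          [w * K, min(n, (w + 1) * K)), with K = cukrn_dot_elems_per_worker.
// Stage 2: the per-worker partial sums are reduced to a single value.
template <typename T>
T chunked_dot(std::vector<T> const& a, std::vector<T> const& b) {
    assert(a.size() == b.size());
    std::size_t const n = a.size();
    std::vector<T> partial(dot_worker_count(n), T{});
    for (std::size_t w = 0; w < partial.size(); ++w) {
        // On a GPU, each iteration of this outer loop would be one thread.
        std::size_t const begin = w * cukrn_dot_elems_per_worker;
        std::size_t const end = std::min(n, begin + cukrn_dot_elems_per_worker);
        for (std::size_t i = begin; i < end; ++i) {
            partial[w] += a[i] * b[i];
        }
    }
    // On a GPU this stage would typically be a shared-memory or atomic reduction.
    return std::accumulate(partial.begin(), partial.end(), T{});
}
```

With 10 elements and the documented value of 4, three workers are needed, and the last one sums only two products. A larger value reduces the thread count and shrinks the second-stage reduction, but it also lowers parallelism. This trade-off is presumably why the best value depends on the hardware.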