Number of threads per thread block for the dot kernel. In a real-world test, 16 seems to be the optimum on a Tesla P100.
dot