Number of elements each thread adds up in the dot kernel. In a real-world test, 4 seemed to be the optimum for a Tesla P100.
dot
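
As an illustration of what this parameter controls, here is a minimal CUDA sketch of a dot kernel in which each thread accumulates a fixed number of products before the block-level reduction. It is not the actual kernel this setting belongs to: the names `ELEMS_PER_THREAD`, `BLOCK_SIZE`, `dot_kernel` and `dot_host` are hypothetical, the block size of 256 is an assumption, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical tuning constants (assumptions, not taken from the original code).
constexpr int ELEMS_PER_THREAD = 4;   // elements each thread adds up
constexpr int BLOCK_SIZE = 256;       // threads per block

// Must be launched with blockDim.x == BLOCK_SIZE; *result must be zeroed first.
__global__ void dot_kernel(const float* x, const float* y, float* result, int n)
{
    __shared__ float partial[BLOCK_SIZE];

    // Each block covers BLOCK_SIZE * ELEMS_PER_THREAD consecutive elements.
    int base = blockIdx.x * BLOCK_SIZE * ELEMS_PER_THREAD + threadIdx.x;

    // Per-thread accumulation: this is the part the parameter controls.
    float sum = 0.0f;
    #pragma unroll
    for (int k = 0; k < ELEMS_PER_THREAD; ++k) {
        int i = base + k * BLOCK_SIZE;   // stride by BLOCK_SIZE so loads stay coalesced
        if (i < n) sum += x[i] * y[i];
    }
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction of the per-thread sums in shared memory.
    for (int s = BLOCK_SIZE / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    // One atomic add per block combines the block results.
    if (threadIdx.x == 0) atomicAdd(result, partial[0]);
}

// Host wrapper: copies the inputs to the device, launches the kernel,
// and returns the dot product.
float dot_host(const std::vector<float>& x, const std::vector<float>& y)
{
    int n = static_cast<int>(x.size());
    if (n == 0) return 0.0f;

    float *dx = nullptr, *dy = nullptr, *dr = nullptr;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMalloc(&dr, sizeof(float));
    cudaMemcpy(dx, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dr, 0, sizeof(float));

    int per_block = BLOCK_SIZE * ELEMS_PER_THREAD;
    int blocks = (n + per_block - 1) / per_block;
    dot_kernel<<<blocks, BLOCK_SIZE>>>(dx, dy, dr, n);

    float r = 0.0f;
    cudaMemcpy(&r, dr, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    cudaFree(dr);
    return r;
}
```

Raising `ELEMS_PER_THREAD` gives each thread more independent loads and reduces the number of blocks (and hence atomic adds), while lowering it increases parallelism. Where the balance falls depends on the device, which is presumably why the value was determined by measurement.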