Triton Vector Addition Kernel, part 4: Benchmarking vs PyTorch and tuning