Efficiency of CUDA vector types (float2, float3, float4)
https://stackoverflow.com/questions/26676806/efficiency-of-cuda-vector-types-float2-float3-float4
I’m expanding njuffa’s comment into a worked example. The example simply adds two arrays in three different ways: loading the data as float, float2, or float4.
These are the timings on a GT540M and on a Kepler K20c card:
GT540M
float - Elapsed time: 74.1 ms
float2 - Elapsed time: 61.0 ms
float4 - Elapsed time: 56.1 ms
Kepler K20c
float - Elapsed time: 4.4 ms
float2 - Elapsed time: 3.3 ms
float4 - Elapsed time: 3.2 ms
As can be seen, loading the data as float4 is the fastest approach on both cards, although most of the gain over plain float is already obtained with float2.
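For reference, the three variants compared above could look roughly like the sketch below. This is a reconstruction under assumptions, not the code that produced the timings: the kernel names (`add_float`, `add_float2`, `add_float4`), the helpers `count_mismatches` and `run_all`, the block size of 256, and the array length are all made up for illustration. It assumes the element count is a multiple of 4 and relies on `cudaMalloc` returning suitably aligned pointers, so the casts to `float2*` and `float4*` are safe. Error checking and timing are left out for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One element per thread: 32-bit loads and stores.
__global__ void add_float(float *c, const float *a, const float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Two elements per thread: 64-bit loads and stores (n2 = n / 2).
__global__ void add_float2(float2 *c, const float2 *a, const float2 *b, int n2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        float2 x = a[i], y = b[i];
        c[i] = make_float2(x.x + y.x, x.y + y.y);
    }
}

// Four elements per thread: 128-bit loads and stores (n4 = n / 4).
__global__ void add_float4(float4 *c, const float4 *a, const float4 *b, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 x = a[i], y = b[i];
        c[i] = make_float4(x.x + y.x, x.y + y.y, x.z + y.z, x.w + y.w);
    }
}

// Hypothetical helper: copies the result back and counts elements != 3.0f.
static int count_mismatches(const float *d_c, int n)
{
    std::vector<float> h(n);
    cudaMemcpy(h.data(), d_c, n * sizeof(float), cudaMemcpyDeviceToHost);
    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (h[i] != 3.0f) ++bad;
    return bad;
}

// Hypothetical driver: runs the three kernels on n elements
// (n assumed to be a multiple of 4) and returns the total mismatch count.
int run_all(int n)
{
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    std::vector<float> ha(n, 1.0f), hb(n, 2.0f);

    float *a, *b, *c;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMalloc(&c, bytes);
    cudaMemcpy(a, ha.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b, hb.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;  // assumed block size
    int bad = 0;

    cudaMemset(c, 0, bytes);
    add_float<<<(n + threads - 1) / threads, threads>>>(c, a, b, n);
    bad += count_mismatches(c, n);

    cudaMemset(c, 0, bytes);
    add_float2<<<(n / 2 + threads - 1) / threads, threads>>>(
        reinterpret_cast<float2 *>(c),
        reinterpret_cast<const float2 *>(a),
        reinterpret_cast<const float2 *>(b), n / 2);
    bad += count_mismatches(c, n);

    cudaMemset(c, 0, bytes);
    add_float4<<<(n / 4 + threads - 1) / threads, threads>>>(
        reinterpret_cast<float4 *>(c),
        reinterpret_cast<const float4 *>(a),
        reinterpret_cast<const float4 *>(b), n / 4);
    bad += count_mismatches(c, n);

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return bad;
}

int main()
{
    printf("mismatches: %d\n", run_all(1 << 20));
    return 0;
}
```

Each kernel performs the same arithmetic; the only difference is the width of the memory transactions issued per thread. To reproduce a comparison like the one above, each launch could be bracketed with `cudaEventRecord` calls and timed with `cudaEventElapsedTime`.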