Note: SIMD on x86 has unaligned instructions that used to be much slower (decoded differently) than their aligned counterparts.
For example, on Pentium 3 and Pentium Core 2, the unaligned instructions took twice as many cycles to execute. On modern x86 family processors, it’s the same cycle count either way. The only perf penalty one should account for is crossing of cache lines, generally a much smaller problem.
For example, on Pentium 3 and Pentium Core 2, the unaligned instructions took twice as many cycles to execute. On modern x86 family processors, it’s the same cycle count either way. The only perf penalty one should account for is crossing of cache lines, generally a much smaller problem.