
Intel AVX inconsistent _mm256_load_si256 integer operation in C

Question

To parallelize my array-based code, I am trying to figure out how to utilize the Intel AVX intrinsic functions to perform parallel operations on large arrays.

From the documentation I have read that 256-bit AVX vectors support up to 8 parallel 32-bit integers / 32-bit floats, or up to 4 parallel 64-bit doubles. The float portion gives me no issues and works fine, but the integer AVX functions are giving me a headache; let me use the following code to demonstrate:

The command-line option -mavx is used in conjunction with an AVX-compliant Intel processor. I will not be using AVX2 features. Compilation will be done using GNU99 C on Ubuntu 16.04.
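
For concreteness, the build command on that setup would presumably look something like the following (the source and output file names here are placeholders, not taken from the question):

gcc -std=gnu99 -mavx avx_int.c -o avx_int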

AVX FP:

#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>

int main() 
{ 
    float data[8] = {1.f,2.f,3.f,4.f,5.f,6.f,7.f,8.f};
    __m256 points = _mm256_loadu_ps(&data[0]);

    for(int i = 0; i < 8; i++)
        printf("%f\n",points[i]);

    return 0;
}

Output:

1.000000
2.000000
3.000000
4.000000
5.000000
6.000000
7.000000
8.000000

This is exactly as it should be; however, this is not the case when using the integer AVX load function:

AVX INT:

#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>

int main() 
{ 
    int data[8] = {1,2,3,4,5,6,7,8};
    __m256i points = _mm256_loadu_si256((__m256i *)&data[0]);

    for(int i = 0; i < 8; i++)
        printf("%d\n",points[i]);

    return 0;
}

Output:

1
3
5
7
1048576 [ out of bounds ]
0 [ out of bounds ]
1 [ out of bounds ]
3 [ out of bounds ]

As you can see, the load only produces 4 elements in the __m256i-type variable, of which only the first, third, fifth, and seventh elements of the original array appear. Beyond the fourth element, the reference goes out of bounds.

How do I produce the desired result of loading the entire data set, in order, into the integer AVX data type, much like with the AVX floating-point data type?

Solution

You're using a GNU C extension to index a vector with [] instead of storing it back to an array. Intel's documentation for intrinsics has nothing to say about this, and not all compilers support it (e.g. MSVC doesn't).

GCC defines __m256i as a GNU C native vector of long long. <immintrin.h> doesn't define different types for SIMD vectors of int versus short, and a __m256i doesn't remember anything about where it came from or how it was set. (Unlike FP vectors, where there are separate C types for ps and pd, so you have to use _mm_castps_pd (__m128 in, __m128d out) if you want to use shufpd or unpcklpd on a ps vector.)
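
That is why the loop above only sees four elements: with GCC, points[i] on a __m256i indexes 64-bit long long lanes, which is why %d happened to print only the low 32 bits of each lane (1, 3, 5, 7) and why indexing past i = 3 ran off the end of the vector. A minimal sketch of what is actually being indexed (assuming GCC with -mavx on x86-64):

#include <stdio.h>
#include <immintrin.h>

int main()
{
    int data[8] = {1,2,3,4,5,6,7,8};
    __m256i points = _mm256_loadu_si256((__m256i *)&data[0]);

    // GNU C extension: points[i] is a long long, because GCC defines
    // __m256i as a vector of 4 x long long. Lane 0 packs data[0] and
    // data[1] together, lane 1 packs data[2] and data[3], and so on.
    for(int i = 0; i < 4; i++)
        printf("lane %d = %lld (low 32 bits: %d)\n",
               i, (long long)points[i], (int)points[i]);

    return 0;
}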

You can typedef native vector types like v8si yourself (see the GCC documentation on vector extensions), or use a library like Agner Fog's VCL, which gives you types like Vec8i (8 signed int) or Vec32uc (32 unsigned char). They have operator overloads that let you write a + b instead of _mm256_add_epi32(a, b) or _mm256_add_epi8(a, b) depending on the type. Or use [] instead of _mm_extract_epi32 / epi8 / epi16 / epi64.
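
A minimal sketch of the typedef route, assuming GCC (which allows casting between vector types of the same size), using the v8si name from the GCC vector extension docs:

#include <stdio.h>
#include <immintrin.h>

// GNU C native vector of 8 x 32-bit int (32 bytes total).
typedef int v8si __attribute__((vector_size(32)));

int main()
{
    int data[8] = {1,2,3,4,5,6,7,8};
    v8si points = (v8si)_mm256_loadu_si256((__m256i *)&data[0]);

    // [] now indexes 32-bit int lanes, so all 8 elements print as expected.
    for(int i = 0; i < 8; i++)
        printf("%d\n", points[i]);

    return 0;
}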


See print a __m128i variable for portable and safe/correct ways to loop over / print out the elements of an Intel intrinsic SIMD variable. TL:DR: _mm_store / _mm256_store to a tmp array and index that. It's portable, and it optimizes away (to a pextrd for integer, or just a shuffle for FP), with no actual store/reload in simple cases.
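
A portable version of the loop from the question along those lines, storing the vector back to a plain int array first (this uses nothing beyond the intrinsics themselves, so it also works on MSVC):

#include <stdio.h>
#include <immintrin.h>

int main()
{
    int data[8] = {1,2,3,4,5,6,7,8};
    __m256i points = _mm256_loadu_si256((__m256i *)&data[0]);

    // Store the vector to a temporary int array and index that instead of
    // the vector itself. _mm256_storeu_si256 has no alignment requirement.
    int tmp[8];
    _mm256_storeu_si256((__m256i *)tmp, points);

    for(int i = 0; i < 8; i++)
        printf("%d\n", tmp[i]);

    return 0;
}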
