- Feature Name: vendor_intrinsics
- Start Date: 2018-02-04
- RFC PR: rust-lang/rfcs#2325
- Rust Issue: rust-lang/rust#48556
# Summary

The purpose of this RFC is to provide a framework for SIMD to be used on stable Rust. It proposes stabilizing x86-specific vendor intrinsics, but includes the scaffolding for other platforms as well as a future portable SIMD design.
# Motivation

Stable Rust today does not typically provide access to SIMD: the intrinsics and attributes involved are unstable, so programs wanting SIMD must use nightly. The goal of this RFC is to enable using SIMD intrinsics on stable Rust, and in general to provide stable access to architecture-specific CPU functionality.
Note that this is certainly not the first discussion to broach the topic of SIMD in Rust; rather, this has been an ongoing discussion for quite some time now! For example the simd crate started long ago, we've had RFCs, we've had a lot of discussions on internals, and the stdsimd crate has been implemented. This RFC draws from much of that historical feedback and design work.
# Guide-level explanation

Let's say you've just heard about this fancy feature called SIMD and you'd like to see whether your program is taking advantage of it. When inspecting the assembly you notice that rustc is making use of the `%xmmN` registers, which you've read are related to SSE on your CPU. You know, however, that your CPU supports up to AVX2, which has bigger registers, so you'd like to get access to them!
Your first solution to this problem is to compile with `-C target-feature=+avx2`, and after that you see the `%ymmN` registers being used, yay! Unfortunately though you're publishing this binary for others to run, and not everyone's CPU supports AVX2; if the program runs on such a CPU it will crash with an illegal instruction. Instead, you can use the `#[target_feature(enable = "avx2")]` attribute to enable AVX2 for just one function. And sure enough you see the `%ymmN` registers getting used in this function! Note, however, that because you've explicitly enabled a CPU feature, the function must be declared `unsafe`, as specified in RFC 2045.
To keep the program running on older CPUs you also add a runtime check, dispatching to the AVX2 implementation only when the feature is detected. And sure enough, once again we see that `foo` is dispatching at runtime to the appropriate function, and only `foo_avx2` is using our `%ymmN` registers!
Ok great! At this point we've seen how to enable CPU features for a function at a time, as well as how they can be combined with runtime feature detection in a larger program.
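A compact sketch of this pattern, using the `foo`/`foo_avx2` names from the narrative (hypothetical bodies; note also that stable Rust ultimately shipped the detection macro under the name `is_x86_feature_detected!`, which is what the code below uses):

```rust
// Hypothetical sketch of per-function feature enabling plus runtime
// dispatch. `foo_avx2` is compiled with AVX2 enabled regardless of the
// global target features; `foo` picks an implementation at runtime.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn foo_avx2(data: &mut [u32]) {
    // Same scalar body as the fallback; the compiler is free to
    // auto-vectorize it with the %ymmN registers here.
    for x in data.iter_mut() {
        *x = x.wrapping_mul(3).wrapping_add(1);
    }
}

fn foo_fallback(data: &mut [u32]) {
    for x in data.iter_mut() {
        *x = x.wrapping_mul(3).wrapping_add(1);
    }
}

pub fn foo(data: &mut [u32]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safe to call: we just verified AVX2 is available.
            return unsafe { foo_avx2(data) };
        }
    }
    foo_fallback(data)
}
```

The `unsafe` call is justified by the runtime check immediately before it, which is exactly the contract this RFC is designed around.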
For explicit and guaranteed use of SIMD intrinsics on stable Rust you'll be using a new module of the standard library, `std::arch`. The `std::arch` module is defined by vendors, not by us: the intrinsics and their signatures match the vendors' own definitions, with types translated to Rust (e.g. `int32_t` becomes `i32`). Vendor-specific types like `__m128i` on Intel will also live in `std::arch`.
For example, let's say that we're writing a function that encodes a `&[u8]` in ASCII hex: we want to convert `&[1, 2]` to `"0102"`. The stdsimd crate currently has this as an example, so let's take a look at it.
First up you'll see the dispatch routine like we wrote above:
Here we have some routine business about hex encoding in general, but the key point is that, using the `is_target_feature_detected!` macro in libstd we saw above, we dispatch to the correct implementation at runtime.
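As a hypothetical distillation (the real stdsimd example differs in its details), the scalar fallback that the dispatcher bottoms out in can look like this:

```rust
// Scalar fallback: emit two hex digits per input byte.
fn hex_encode_fallback(src: &[u8], dst: &mut String) {
    const HEX: &[u8; 16] = b"0123456789abcdef";
    for &byte in src {
        dst.push(HEX[(byte >> 4) as usize] as char); // high nibble
        dst.push(HEX[(byte & 0xf) as usize] as char); // low nibble
    }
}
```

With this in place, `hex_encode_fallback(&[1, 2], &mut out)` produces the `"0102"` result described above.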
Taking a closer look at `hex_encode_sse41`, we see that it starts out with a bunch of weird-looking function calls:
As it turns out though, these are all Intel SIMD intrinsics! For example `_mm_set1_epi8` is defined by Intel as creating an instance of `__m128i`, a 128-bit integer register, with all bytes set to the given value.
These functions are all imported through `std::arch::*` at the top of the example (in this case `stdsimd::vendor::*`). We go on to use a bunch of these intrinsics throughout the `hex_encode_sse41` function to actually do the hex encoding.
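To give a tiny taste of those intrinsics (a hypothetical distillation, not the exact stdsimd code): broadcasting a mask byte with `_mm_set1_epi8` and AND-ing it against the input extracts the low nibble of every byte at once. The runtime check below uses `is_x86_feature_detected!`, the name under which the detection macro was eventually stabilized:

```rust
// SIMD path: mask off the low 4 bits of 16 bytes in one instruction.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn low_nibbles_sse2(input: [u8; 16]) -> [u8; 16] {
    use std::arch::x86_64::*;
    let v = _mm_loadu_si128(input.as_ptr() as *const __m128i);
    let mask = _mm_set1_epi8(0x0f); // broadcast 0x0f into all 16 lanes
    let lo = _mm_and_si128(v, mask); // keep the low 4 bits of each byte
    let mut out = [0u8; 16];
    _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, lo);
    out
}

fn low_nibbles(input: [u8; 16]) -> [u8; 16] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse2") {
            return unsafe { low_nibbles_sse2(input) };
        }
    }
    // Scalar fallback with identical behavior.
    let mut out = input;
    for b in out.iter_mut() {
        *b &= 0x0f;
    }
    out
}
```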
The example in stdsimd also comes with benchmarks, which on one particular machine produce:

```
test benches::large_default ... bench: 73,432 ns/iter (+/- 12,526) = 14279 MB/s
test benches::large_fallback ... bench: 1,711,030 ns/iter (+/- 286,642) = 612 MB/s
test benches::small_default ... bench: 30 ns/iter (+/- 18) = 3900 MB/s
test benches::small_fallback ... bench: 204 ns/iter (+/- 74) = 573 MB/s
test benches::x86::large_avx2 ... bench: 69,742 ns/iter (+/- 9,157) = 15035 MB/s
test benches::x86::large_sse41 ... bench: 108,463 ns/iter (+/- 70,250) = 9667 MB/s
test benches::x86::small_avx2 ... bench: 25 ns/iter (+/- 8) = 4680 MB/s
test benches::x86::small_sse41 ... bench: 25 ns/iter (+/- 14) = 4680 MB/s
```
Or in other words, our runtime dispatch implementation is over 20x faster than the fallback on hardware with SIMD support. With `std::arch` and `is_target_feature_detected!` we've now written a program that's 20x faster on supported hardware, yet it also continues to run on older hardware as well! Not bad for a few dozen lines on each function!
Note that this RFC is explicitly not attempting to stabilize a set of portable SIMD operations: the contents of `std::arch` are platform-specific, and portable types are left to a future RFC. That said, LLVM already does quite a good job with a portable `u32x4` type, for example, in terms of platform support and code generation, and the interaction between portable types and the vendor intrinsics here is planned as follows:
- The intrinsics will not take portable types as arguments. For example `u32x4` and `__m128i` will be different types on x86. The two types, however, will be convertible between one another (either via transmutes or via explicit functions). This conversion will have zero run-time cost.
- The portable SIMD types will likely live in a module like `std::simd` rather than `std::arch`.
The design of these portable types is left for a future RFC introducing the `std::simd` module!
# Reference-level explanation

Stable SIMD in Rust ends up requiring a surprising number of both language and library features to be productive; each piece is described in turn below.
## The `#[target_feature]` attribute

The `#[target_feature]` attribute was specified in RFC 2045; this RFC proposes stabilizing a subset of it, namely `#[target_feature(enable = "...")]` on `unsafe` functions.
The only currently allowed key is `enable` (one day we may allow `disable`). The string values accepted by `enable` will be separately stabilized but are likely to be guided by vendor definitions; for example Intel calls a feature `avx2`, so we'll use `avx2` for Rust.

There's a good number of these features supported by the compiler today. It's expected that when stabilizing other pieces of this RFC, the names of the following features will be stabilized for use with `#[target_feature]` as well:
- `aes`
- `avx2`
- `avx`
- `bmi2`
- `bmi` (to be renamed to `bmi1`, the name Intel gives it)
- `fma`
- `fxsr`
- `lzcnt`
- `popcnt`
- `rdrnd`
- `rdseed`
- `sse2`
- `sse3`
- `sse4.1`
- `sse4.2`
- `sse`
- `ssse3`
- `xsave`
- `xsavec`
- `xsaveopt`
- `xsaves`
Note that AVX-512 names are missing from this list because AVX-512 intrinsics are not implemented yet, and `mmx` is missing from this list as well (see the dedicated section below). AMD has some additional features (`sse4a`, `tbm`), and so do ARM, MIPS, and PowerPC, but none of these feature names are proposed for becoming stable in the first pass.
## The `target_feature` value in `#[cfg]`

In addition to the `#[target_feature]` attribute, statically testing for a feature is supported through the unstable `cfg_target_feature` feature in rustc today, via attributes like `#[cfg(target_feature = "...")]` on items.
Additionally this is also made available to the `cfg!` macro:
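For instance (an illustrative snippet, not from the RFC itself):

```rust
// cfg! expands to a compile-time constant: it reflects the features the
// code was *compiled* with (-C target-feature / the target's defaults),
// performing no runtime detection whatsoever.
fn compiled_with_sse2() -> bool {
    cfg!(target_feature = "sse2")
}
```

On an `x86_64` target this returns `true` even without any `-C target-feature` flags, because SSE2 is part of that target's baseline.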
The `#[cfg]` attribute and `cfg!` macro statically resolve and perform no runtime feature detection; their result is controlled by the `-C target-feature` flag to the compiler. This flag accepts a similar set of strings to the attribute above and is already "stable".
## The `is_target_feature_detected!` macro

One of the most important modes of operation is detecting features at runtime: compiling multiple versions of a function and choosing between them based on what the CPU actually supports.
The crux of this support in libstd is a macro provided there, `is_target_feature_detected!`. The macro accepts one argument, a string literal, which can be any feature accepted by `#[target_feature(enable = ...)]` for the platform you're compiling for. Finally, the macro resolves to a `bool` result.
For example on x86 you could write:
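An illustrative snippet (using `is_x86_feature_detected!`, the name under which the macro was eventually stabilized):

```rust
// Runtime feature detection: the macro resolves to a bool.
fn simd_support() -> &'static str {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse4.1") {
            "this cpu has sse4.1 features enabled!"
        } else {
            "sse4.1 not detected at runtime"
        }
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        "not an x86 cpu"
    }
}
```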
It would, however, be an error to request a feature of a different architecture, such as ARM's `neon`, on x86 CPUs.
The macro is intended to be implemented in the `std` crate (not `core`) and made available via the normal macro preludes. The implementation is expected to mirror what `stdsimd` does today, notably:
- The first time the macro is invoked, all the local CPU features will be detected.
- The detected features will then be cached globally (when possible, currently in a bitset) for the rest of the execution of the program.
- Further invocations of `is_target_feature_detected!` are expected to be cheap runtime dispatches (i.e. load a value and check whether a bit is set).
- Exception: in some cases the result of the macro is statically known, for example `is_target_feature_detected!("sse2")` when the binary is being compiled with SSE2 enabled globally. In these cases none of the steps above are performed and the macro just expands to `true`.
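The caching strategy can be modeled with an atomic bitset along these lines (a simplified sketch; the real stdsimd implementation differs, and `detect` here is a stand-in for the actual `cpuid`/OS probing):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// One bit per feature, plus a bit recording "detection has run".
const INITIALIZED: usize = 1 << 0;
const AVX2: usize = 1 << 1;

static FEATURES: AtomicUsize = AtomicUsize::new(0);

// Stand-in for the real probing; always reports "no AVX2" here.
fn detect() -> usize {
    INITIALIZED
}

fn feature_enabled(bit: usize) -> bool {
    let mut cached = FEATURES.load(Ordering::Relaxed);
    if cached == 0 {
        // First query: run detection once and cache the bitset globally.
        cached = detect();
        FEATURES.store(cached, Ordering::Relaxed);
    }
    // Subsequent queries are just a load and a bit test.
    cached & bit != 0
}
```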
The exact method of CPU feature detection varies by platform: for example on x86 we make heavy use of the `cpuid` instruction, while on Linux other architectures may read `/proc`-mounted information. It's expected that the detection will vary for each particular target, as necessary.
Note that the implementation of detection may need OS services: when `/proc` is read, libc or `File` is required in one form or another. Technically the x86 implementation could live in libcore (it only requires the `cpuid` instruction), but for consistency across platforms the macro will only be available in libstd for now. This placement can of course be relaxed in the future if necessary.
## The `std::arch` module

This is where the real meat is. A new module will be added to the standard library, `std::arch`. This module will also be available in `core::arch` (and `std` will simply reexport it). The contents of this module provide no portability guarantees (like `std::os` and unlike the rest of `std`): APIs present on one platform may not be present on another.
The contents of the `arch` modules are defined by the architecture vendors themselves: for example Intel publishes a list of intrinsics, as does ARM, and these exact functions and their signatures will be available in the corresponding `arch` module. The standard library will not deviate in naming or type signature from what the vendors define.
For example most Intel intrinsics start with `_mm_` or `_mm256_` for 128- and 256-bit registers. While perhaps unergonomic, we'll be sticking to what Intel says. Note that all intrinsics will also be `unsafe`, according to RFC 2045.
Function signatures will be translated to use Rust's equivalents of the C types (e.g. `int32_t` becomes `i32`), but will otherwise match the vendor's definitions exactly.
The current proposed mapping for x86 intrinsics is:

| What Intel says | Rust Type |
|-----------------|-----------|
| `void*` | `*mut u8` |
| `char` | `i8` |
| `short` | `i16` |
| `int` | `i32` |
| `long long` | `i64` |
| `const int` | `i32` [0] |

[0] required to be compile-time constants.
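As a concrete illustration of this mapping: Intel defines `__m128i _mm_set1_epi8 (char a)`, so the Rust version takes an `i8`. The snippet below (illustrative, and safe to run on `x86_64` where SSE2 is a baseline feature) round-trips a broadcast byte back out with `_mm_cvtsi128_si32`:

```rust
#[cfg(target_arch = "x86_64")]
fn set1_demo() -> i32 {
    use std::arch::x86_64::*;
    // SSE2 is unconditionally available on x86_64, so calling these
    // unsafe intrinsics here cannot hit an unsupported instruction.
    unsafe {
        let v: __m128i = _mm_set1_epi8(3); // Intel's `char` became `i8`
        _mm_cvtsi128_si32(v) // low 32 bits: four copies of 0x03
    }
}
```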
Other than these exceptions, there will also be new types defined in the `std::arch` modules for SIMD registers! For example these new types will all be present in `std::arch` on x86 platforms:

- `__m128`
- `__m128d`
- `__m128i`
- `__m256`
- `__m256d`
- `__m256i`

(note that AVX-512 types will come in the future!)
Infrastructure-wise, the contents of `std::arch` are expected to continue to be defined in the external stdsimd crate/repository. Intrinsics defined there are checked against the vendors' own definitions: names, signatures, and, where possible, the instructions they generate. Currently on x86 and ARM platforms the stdsimd crate performs all these checks, but these checks are not yet implemented for all other platforms.
It's not expected that the contents of `std::arch` will remain static forever; rather, intrinsics will continue to be implemented in stdsimd and make their way into the main Rust repository over time. For example there are not currently any implemented AVX-512 intrinsics, but that doesn't mean there never will be.
## The types in `std::arch`

It's worth paying close attention to the types in `std::arch`. Types like `__m128i` are intended to represent the contents of a SIMD register, yet you can also place an `Option<__m128i>` in your program! Most generic containers and such probably aren't written with packed SIMD types in mind, and it'd be a bummer if everything stopped working once you used a packed SIMD type in one of them.
Instead, it will be required that the types defined in `std::arch` do indeed work when used in "nonstandard" contexts. For example `Option<__m128i>` should never produce a compile error or incorrect code.
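For instance, wrapping one of these types in `Option` behaves like wrapping any other Rust type (an illustrative snippet using `_mm_setzero_si128`):

```rust
#[cfg(target_arch = "x86_64")]
fn maybe_zero_vector(want: bool) -> Option<std::arch::x86_64::__m128i> {
    if want {
        // SSE2 is part of the x86_64 baseline, so this call is fine here.
        Some(unsafe { std::arch::x86_64::_mm_setzero_si128() })
    } else {
        // `None` must work just as it would for any other payload type.
        None
    }
}
```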
Implementation-wise, these packed SIMD types are implemented internally with the unstable `#[repr(simd)]` attribute. The Rust ABI will currently be implemented to pass these types through memory rather than by value, so that crossing a function boundary never requires particular CPU features to be enabled. Again though, note that this section describes implementation details which are not themselves proposed for stabilization.
## Intrinsics in `std::arch` and constant arguments

There are a number of intrinsics on x86 (and other) platforms that require some of their arguments to be constants rather than runtime values; for example `_mm_insert_pi16` requires its third argument to be a compile-time constant, as it is encoded as an immediate in the generated instruction.
Eventually we will likely have some form of `const` arguments or other `const` machinery to guarantee this, but in the meantime the stdsimd crate will have an unstable attribute through which the compiler can help provide this guarantee.
It's hoped that this restriction allows stdsimd to be forward compatible with a future const-powered world of Rust, but in the meantime does not otherwise block stabilization.
## Portable packed SIMD

So-called "portable" packed SIMD types are currently implemented in the stdsimd crate under names like `u8x16`, which explicitly encode the lane count and the lane type (`u8` in this case). These types are intended to be unconditionally available (like the rest of libstd) and simply optimized much more aggressively on platforms that have native support for the various operations.
For example `u8x16::add` may compile down to a single vector instruction on platforms with 128-bit SIMD support, while falling back to equivalent scalar code elsewhere.
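A hand-rolled model of what a portable `u8x16` addition means (illustrative only; the real portable types are left to a future RFC): lane-wise wrapping addition, which LLVM can typically lower to a single `paddb`-style instruction on SSE2 hardware.

```rust
// Sixteen u8 lanes; `add` is defined lane-by-lane with wrapping semantics.
#[derive(Clone, Copy, PartialEq, Debug)]
struct U8x16([u8; 16]);

impl U8x16 {
    fn add(self, other: U8x16) -> U8x16 {
        let mut out = [0u8; 16];
        for i in 0..16 {
            out[i] = self.0[i].wrapping_add(other.0[i]);
        }
        U8x16(out)
    }
}
```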
It's intended that this RFC neither includes nor rules out the addition of a portable `std::simd` module. These types will be orthogonal to scalable-vector types, which are expected to be proposed in another, also different, RFC. What this RFC does do, however, is explicitly specify that:

- The portable SIMD types (both packed and scalable) will not be used in intrinsics.
- The per-architecture SIMD types will be distinct types from the portable SIMD types.

Or, in other words, it's intended that portable SIMD types are entirely decoupled from intrinsics. If both end up being implemented, the two families of types will be convertible between one another.
## Not stabilizing MMX in this RFC

This RFC notably proposes omitting the MMX intrinsics: everything operating on the `__m64` type, in other words. The MMX type `__m64` and its intrinsics have been somewhat problematic in a number of ways. Known cases include:
- MMX intrinsics aren't always desirable to use in the first place.
- LLVM codegen errors occur when debuginfo is enabled together with MMX.
- LLVM codegen errors occur with MMX types on i586 targets.
Due to these issues having an unclear conclusion, as well as a seeming lack of desire to stabilize MMX intrinsics, the `__m64` type and all related intrinsics will not be stabilized via this RFC.
# Drawbacks

This RFC represents a very large addition to the standard library: thousands of vendor-defined intrinsics. Due to the enormity of what's being added, it's likely that at least some definitions will turn out to contain mistakes, and correcting a stabilized intrinsic after the fact is difficult.
# Rationale and alternatives

Over the years quite a few iterations of SIMD support in Rust have happened, and this design draws on the lessons learned from them. Some notable alternatives are discussed below.
## Portable types in architecture interfaces

It was initially attempted in the stdsimd crate that we would use the portable types in all of the intrinsics' signatures: instead of taking and returning the vendor type `__m128i`, an intrinsic would take and return a lane-typed portable type such as `i16x8`. The latter style is more self-documenting than an opaque bag of 128 bits (`__m128i`).
The downside of this approach, however, is that Intel isn't telling us what to do. While that may sound simple, this RFC is proposing the addition of thousands of intrinsics, and for each one we would have to make our own judgment call (should this take `i8x16` or `i16x8`?).

Furthermore, not all intrinsics from Intel actually have one natural interpretation: a single intrinsic may treat its input as `u8x16` in one mode and as `u16x8` in another (as an example). This effectively means that there isn't a correct choice in all situations for what portable type should be used.
Consequently, it's proposed that the vendor intrinsics match the vendors' own types exactly. There is interest by both current `stdsimd` maintainers and users to expose a "better-typed" SIMD API on crates.io that builds on top of the intrinsics proposed for stabilization here.
## Stabilizing SIMD implementation details

Another alternative to this RFC is to stabilize the underlying implementation details instead: `#[repr(simd)]`, or the ability to write `extern "platform-intrinsics" { ... }` blocks, or `#[link_llvm_intrinsic...]`. This is certainly a much smaller surface area to stabilize (i.e. not thousands of intrinsics).
This avenue was decided against, however, for a few reasons:
- Such raw interfaces may change over time, as they simply represent LLVM at a current point in time rather than what LLVM wants to do in the future.
- Alternate implementations of rustc, or alternate rustc backends like Cranelift, may not expose the same sort of functionality that LLVM provides, or implementing these interfaces may be much more difficult in alternate backends than in LLVM.
As a result, rather than stabilizing these raw interfaces (which would have allowed `stdsimd` to live on crates.io), we'll instead pull `stdsimd` into the standard library and expose it as the stable interface to SIMD in Rust.
# Unresolved questions

There are a number of unresolved questions around stabilizing SIMD today; none of them pose serious blockers, but they may wish to be considered before stabilization.
## Relying on unexported LLVM APIs

The static resolution of `cfg!(target_feature = ...)` and `#[cfg(target_feature = ...)]` currently relies on a Rust-specific patch to LLVM. LLVM internally knows all about hierarchies of features: for example if you compile with `-C target-feature=+avx2` then `cfg!(target_feature = "sse2")` also needs to resolve to `true`. Rustc, however, does not know about these feature hierarchies and relies on learning this information through LLVM.
Unfortunately though, LLVM does not actually export this information for us to consume (as far as we know). As a result, without the local patch the `cfg!` macro may not work correctly when used in conjunction with the `-C target-feature` or `-C target-cpu` flags.
It appears this will need to be resolved with LLVM upstream one way or another, but that work has not happened yet.
## Packed SIMD types in `extern` functions are not sound

The packed SIMD types have particular care paid to them with respect to their ABI in Rust and how they're passed between functions, notably to ensure that they are never passed in registers a function may not be allowed to use. A consequence, however, is that if these types are used with a non-Rust ABI via `extern` functions, then the same soundness bug can arise. It may be possible to implement a lint or an error for this situation, but that remains an open question.
## What if we're wrong?

Despite the CI infrastructure of the `stdsimd` crate, it seems inevitable that we'll get an intrinsic wrong at some point. What do we do in a situation like that? This situation is somewhat analogous to the `libc` crate, but there you can fix the problem downstream (just have a corrected type/definition); for vendor intrinsics in the standard library it's not so easy.
Currently it seems that our only recourse would be to add a `2` suffix to the function name or otherwise awkwardly rename the corrected intrinsic.