Summary

Lay the ground work for building powerful SIMD functionality.

Motivation

SIMD (Single-Instruction Multiple-Data) is an important part of performant modern applications. Most CPUs used for that sort of task provide dedicated hardware and instructions

命令
for operating on multiple
複数の
values in a single
単一の
instruction,
命令
and exposing this is an important part of being a low-level language.
言語

This RFC lays the ground-work for building nice SIMD functionality, but doesn't fill everything out. The goal here is to provide the raw

生の
types and access to the raw
生の
instructions
命令
on each platform.

(An earlier variant of this RFC was discussed as a pre-RFC.)

Where does this code go? Aka. why not in std?

This RFC is focused on building stable, powerful SIMD functionality in external crates, not std.

This makes it much easier to support functionality only "occasionally" available with Rust's preexisting cfg system. There's no way for std to conditionally provide an API based

基となる、基底(の)
on the target features used for the final artifact. Building std in every configuration is certainly untenable. Hence, if it were to be in std, there would need to be some highly delayed cfg system to support that sort of conditional
条件付き、条件的
API exposure.

With an external crate, we can leverage cargo's existing build infrastructure: compiling with some target features will rebuild with those features enabled.

Detailed design
設計(する)

The design

設計(する)
comes in three parts, all on the path to stabilisation:

  • types (feature(repr_simd))
  • operations
    演算、操作
    (feature(platform_intrinsics))
  • platform detection (feature(cfg_target_feature))

The general

一般
idea is to avoid
避ける、回避する
bad performance cliffs, so that an intrinsic call
呼び出し
in Rust maps to preferably one CPU instruction,
命令
or, if not, the "optimal" sequence
連なり、並び
required to do the given
与えられた
operation
演算、操作
anyway. This means exposing a lot of platform specific
特定の
details, since platforms behave
振る舞う
very differently: both across architecture families (x86, x86-64, ARM, MIPS, ...), and even within a family (x86-64's Skylake, Haswell, Nehalem, ...).

There is definitely a common core of SIMD functionality shared across many platforms, but this RFC doesn't try to extract

抽出する
that, it is just building tools that can be wrapped into a more uniform API later.

Types

There is a new attribute: repr(simd).

#![allow(unused)] fn main() { #[repr(simd)] struct f32x4(f32, f32, f32, f32); #[repr(simd)] struct Simd2<T>(T, T); }

The simd repr can be attached to a struct

構造、構造体
and will cause
起こす
such a struct
構造、構造体
to be compiled to a SIMD vector. It can be generic, but it is required that any fully
完全に
monomorphised instance
実例
of the type consist
構成される
of only a single
単一の
"primitive" type, repeated some number of times.

The repr(simd) may not enforce that any trait bounds

制限する、結び付けられて
exists/does the right thing at the type checking level for generic repr(simd) types. As such, it will be possible to get the code-generator to error out (ala the old transmute size errors), however, this shouldn't cause
起こす
problems in practice: libraries wrapping this functionality would layer type-safety on top (i.e. generic repr(simd) types would use some unsafe trait as a bound
制限する、結び付けられて
that is designed
設計(する)
to only be implemented
実装する
by types that will work).

Adding

たす
repr(simd) to a type may increase its minimum/preferred alignment,
揃えること
based
基となる、基底(の)
on platform behaviour. (E.g. x86 wants its 128-bit SSE vectors to be 128-bit aligned.)

Operations
演算、操作

CPU vendors usually offer "standard" C headers for their CPU specific

特定の
operations,
演算、操作
such as arm_neon.h and the ...mmintrin.h headers for x86(-64).

All of these would be exposed as compiler intrinsics with names very similar

似ている、同様の
to those that the vendor suggests (only difference would be some form
形式、形態、形作る
of manual
マニュアル、手動
namespacing, e.g. prefixing
接頭辞
with the CPU target), loadable via an extern block with an appropriate ABI. This subset
部分集合
of intrinsics would be on the path to stabilisation (that is, one can "import" them with extern in stable code), and would not be exported by std.

Example:

#![allow(unused)] fn main() { extern "platform-intrinsic" { fn x86_mm_abs_epi16(a: Simd8<i16>) -> Simd8<i16>; // ... } }

These all use entirely concrete

具体的な/具象的な
types, and this is the core interface to these intrinsics: essentially it is just allowing
許可する、可能にする
code to exactly
正確に
specify
特定する、指定する、規定する
a CPU instruction
命令
to use. These intrinsics only actually work on a subset
部分集合
of the CPUs that Rust targets, and will result
結果、戻り値
in compile time errors if they are called
呼び出し
on platforms that do not support them. The signatures
シグネチャ
are typechecked, but in a "duck-typed" manner: it will just ensure
保証する
that the types are SIMD vectors with the appropriate length and element
要素
type, it will not enforce a specific
特定の
nominal type.

NB. The structural typing is just for the declaration:

宣言
if a SIMD intrinsic is declared
宣言
to take
とる
a type X, it must always be called
呼び出し
with X, even if other types are structurally equal
等しい
to X. Also, within a signature,
シグネチャ
SIMD types that must be structurally equal
等しい
must be nominally equal.
等しい
I.e. if the add_... all refer
参照する
to the same intrinsic to add a SIMD vector of bytes,

#![allow(unused)] fn main() { // (same length) struct A(u8, u8, ..., u8); struct B(u8, u8, ..., u8); extern "platform-intrinsic" { fn add_aaa(x: A, y: A) -> A; // ok fn add_bbb(x: B, y: B) -> B; // ok fn add_aab(x: A, y: A) -> B; // error, expected B, found A fn add_bab(x: B, y: A) -> B; // error, expected A, found B } fn double_a(x: A) -> A { add_aaa(x, x) } fn double_b(x: B) -> B { add_aaa(x, x) // error, expected A, found B } }

There would additionally be a small set

セットする、集合
of cross-platform operations
演算、操作
that are either generally efficiently supported everywhere or are extremely useful. These won't necessarily map to a single
単一の
instruction,
命令
but will be shimmed as efficiently as possible.

  • shuffles and extracting/inserting elements
    要素
  • comparisons
    比較
  • arithmetic
    算術
  • conversions
    変換

All of these intrinsics are imported via an extern directive similar

似ている、同様の
to the process for pre-existing intrinsics like transmute, however, the SIMD operations
演算、操作
are provided
与える
under a special ABI: platform-intrinsic. Use of this ABI (and hence the intrinsics) is initially feature-gated under the platform_intrinsics feature name. Why platform-intrinsic rather than say simd-intrinsic? There are non-SIMD platform-specific instructions
命令
that may be nice to expose (for example, Intel defines
定義する
an _addcarry_u32 intrinsic corresponding
対応する
to the ADC instruction).

Shuffles & element
要素
operations
演算、操作

One of the most powerful features of SIMD is the ability to rearrange data within vectors, giving super-linear speed-ups sometimes. As such, shuffles are exposed generally: intrinsics that represent

表現する
arbitrary
任意の
shuffles.

This may violate the "one instruction

命令
per instrinsic" principal depending on the shuffle, but rearranging SIMD vectors is extremely useful, and providing
与える
a direct intrinsic lets the compiler (a) do the programmers work in synthesising the optimal (short) sequence
連なり、並び
of instructions
命令
to get a given
与えられた
shuffle and (b) track data through shuffles without having to understand all the details of every platform specific
特定の
intrinsic for shuffling.

#![allow(unused)] fn main() { extern "platform-intrinsic" { fn simd_shuffle2<T, U>(v: T, w: T, idx: [i32; 2]) -> U; fn simd_shuffle4<T, U>(v: T, w: T, idx: [i32; 4]) -> U; fn simd_shuffle8<T, U>(v: T, w: T, idx: [i32; 8]) -> U; fn simd_shuffle16<T, U>(v: T, w: T, idx: [i32; 16]) -> U; // ... } }

The raw

生の
definitions
定義
are only checked for validity at monomorphisation time, ensure
保証する
that T and U are SIMD vector with the same element
要素
type, U has the appropriate length etc. Libraries can use traits to ensure
保証する
that these will be enforced by the type checker too.

This approach has similar

似ている、同様の
type "safety"/code-generation errors to the vectors themselves.

These operations

演算、操作
are semantically:
意味論的に

#![allow(unused)] fn main() { // vector of double length let z = concat(v, w); return [z[idx[0]], z[idx[1]], z[idx[2]], ...] }

The index array

配列
idx has to be compile time constants.
定数
Out of bounds
制限する、結び付けられて
indices yield
産出する、出力する
errors.

Similarly,

同様に
intrinsics for inserting/extracting elements
要素
into/out of vectors are provided,
与える
to allow
許可する、可能にする
modelling the SIMD vectors as actual
実際の
CPU registers as much as possible:

#![allow(unused)] fn main() { extern "platform-intrinsic" { fn simd_insert<T, Elem>(v: T, i0: u32, elem: Elem) -> T; fn simd_extract<T, Elem>(v: T, i0: u32) -> Elem; } }

The i0 indices do not have to be constant.

定数
These are equivalent
等価
to v[i0] = elem and v[i0] respectively.
それぞれ
They are type checked similarly
同様に
to the shuffles.

Comparisons
比較

Comparisons

比較
are implemented
実装する
via intrinsics. The raw
生の
signatures
シグネチャ
would look like:

#![allow(unused)] fn main() { extern "platform-intrinsic" { fn simd_eq<T, U>(v: T, w: T) -> U; fn simd_ne<T, U>(v: T, w: T) -> U; fn simd_lt<T, U>(v: T, w: T) -> U; fn simd_le<T, U>(v: T, w: T) -> U; fn simd_gt<T, U>(v: T, w: T) -> U; fn simd_ge<T, U>(v: T, w: T) -> U; } }

These are type checked during code-generation similarly

同様に
to the shuffles: ensuring that T and U have the same length, and that U is appropriately "boolean"-y. Libraries can use traits to ensure
保証する
that these will be enforced by the type checker too.

Arithmetic
算術

Intrinsics will be provided

与える
for arithmetic
算術
operations
演算、操作
like addition
追加
and multiplication.
乗算

#![allow(unused)] fn main() { extern "platform-intrinsic" { fn simd_add<T>(x: T, y: T) -> T; fn simd_mul<T>(x: T, y: T) -> T; // ... } }

These will have codegen time checks that the element

要素
type is correct:

  • add, sub, mul: any float or integer
    整数
    type
  • div: any float type
  • and, or, xor, shl (shift left), shr (shift right): any integer
    整数
    type

(The integer

整数
types are i8, ..., i64, u8, ..., u64 and the float types are f32 and f64.)

Why not inline asm?

One alternative

代わりのもの、選択肢
to providing
与える
intrinsics is to instead just use inline-asm to expose each CPU instruction.
命令
However, this approach has essentially only one benefit (avoiding defining
定義する
the intrinsics), but several downsides, e.g.

  • assembly is generally a black-box to optimisers, inhibiting optimisations, like algebraic simplification/transformation,
  • programmers would have to manually synthesise the right sequence
    連なり、並び
    of operations
    演算、操作
    to achieve a given
    与えられた
    shuffle, while having a generic shuffle intrinsic lets the compiler do it (NB. the intention is that the programmer will still have access to the platform specific
    特定の
    operations
    演算、操作
    for when the compiler synthesis isn't quite right),
  • inline assembly is not currently stable in Rust and there's not a strong push for it to be so in the immediate future (although this could change).

Benefits of manual

マニュアル、手動
assembly writing, like instruction
命令
scheduling and register allocation
割当
don't apply
適用する
to the (generally) one-instruction asm! blocks that replace the intrinsics (they need to be designed
設計(する)
so that the compiler has full control
制御する
over register allocation,
割当
or else the result
結果、戻り値
will be strictly worse). Those possible advantages of hand written assembly over intrinsics only come in to play when writing longer blocks of raw
生の
assembly, i.e. some inner
内側の
loop might be faster when written as a single
単一の
chunk of asm rather than as intrinsics.

Platform Detection

The availability of efficient

効率のよい
SIMD functionality is very fine-grained, and our current cfg(target_arch = "...") is not precise enough. This RFC proposes a target_feature cfg, that would be set
セットする、集合
to the features of the architecture that are known to be supported by the exact target e.g.

  • a default x86-64 compilation would essentially only set
    セットする、集合
    target_feature = "sse" and target_feature = "sse2"
  • compiling with -C target-feature="+sse4.2" would set
    セットする、集合
    target_feature = "sse4.2", target_feature = "sse.4.1", ..., target_feature = "sse".
  • compiling with -C target-cpu=native on a modern CPU might set
    セットする、集合
    target_feature = "avx2", target_feature = "avx", ...

The possible values of target_feature will be a selected whitelist, not necessarily just everything LLVM understands. There are other non-SIMD features that might have target_features set

セットする、集合
too, such as popcnt and rdrnd on x86/x86-64.)

With a cfg_if! macro that expands to the first cfg that is satisfied (ala @alexcrichton's cfg-if), code might look like:

#![allow(unused)] fn main() { cfg_if_else! { if #[cfg(target_feature = "avx")] { fn foo() { /* use AVX things */ } } else if #[cfg(target_feature = "sse4.1")] { fn foo() { /* use SSE4.1 things */ } } else if #[cfg(target_feature = "sse2")] { fn foo() { /* use SSE2 things */ } } else if #[cfg(target_feature = "neon")] { fn foo() { /* use NEON things */ } } else { fn foo() { /* universal fallback */ } } } }

Extensions

  • scatter/gather operations

    演算、操作
    allow
    許可する、可能にする
    (partially) operating on a SIMD vector of pointers. This would require allowing
    許可する、可能にする
    pointers(/references?) in repr(simd) types.

  • allow

    許可する、可能にする
    (and ignore
    無視する
    for everything but type checking) zero-sized types in repr(simd) structs,
    構造、構造体
    to allow
    許可する、可能にする
    tagging them with markers

  • the shuffle intrinsics could be made more relaxed in their type checking (i.e. not require that they return their second type parameter), to allow

    許可する、可能にする
    more type safety when combined
    合体する、組み合わせる
    with generic simd types:

    #[repr(simd)] struct Simd2<T>(T, T); extern "platform-intrinsic" { fn simd_shuffle2<T, U>(x: T, y: T, idx: [u32; 2]) -> Simd2<U>; }

    This should be a backwards-compatible generalisation.

Alternatives
代わりのもの、選択肢

  • Intrinsics could instead by namespaced by ABI, extern "x86-intrinsic", extern "arm-intrinsic".

  • There could be more syntactic support for shuffles, either with true syntax,

    文法
    or with a syntax
    文法
    extension. The latter might look like: shuffle![x, y, i0, i1, i2, i3, i4, ...]. However, this requires that shuffles are restricted
    制限する
    to a single
    単一の
    type only (i.e. Simd4<T> can be shuffled to Simd4<T> but nothing else), or some sort of type synthesis. The compiler has to somehow work out the return value:

    #![allow(unused)] fn main() { let x: Simd4<u32> = ...; let y: Simd4<u32> = ...; // reverse all the elements. let z = shuffle![x, y, 7, 6, 5, 4, 3, 2, 1, 0]; }

    Presumably z should be Simd8<u32>, but it's not obvious how the compiler can know this. The repr(simd) approach means there may be more than one SIMD-vector type with the Simd8<u32> shape (or, in fact, there may be zero).

  • With type-level integers,

    整数
    there could be one shuffle intrinsic:

    fn simd_shuffle<T, U, const N: usize>(x: T, y: T, idx: [u32; N]) -> U;

    NB. It is possible to add this as an additional

    追加の
    intrinsic (possibly deprecating the simd_shuffleNNN forms) later.

  • Type-level values can be applied

    適用する
    more generally: since the shuffle indices have to be compile time constants,
    定数
    the shuffle could be

    fn simd_shuffle<T, U, const N: usize, const IDX: [u32; N]>(x: T, y: T) -> U;
  • Instead of platform detection, there could be feature detection (e.g. "platform supports something equivalent

    等価
    to x86's DPPS"), but there probably aren't enough cross-platform commonalities for this to be worth it. (Each "feature" would essentially be a platform specific
    特定の
    cfg anyway.)

  • Check vector operators

    演算子
    in debug mode just like the scalar versions.

  • Make fixed length arrays

    配列
    repr(simd)-able (via just flattening), so that, say, #[repr(simd)] struct
    構造、構造体
    u32x4([u32; 4]);
    and #[repr(simd)] struct
    構造、構造体
    f64x8([f64; 4], [f64; 4]);
    etc works. This will be most useful if/when we allow
    許可する、可能にする
    generic-lengths, #[repr(simd)] struct
    構造、構造体
    Simd<T, n>([T; n]);

  • have 100% guaranteed

    保証する
    type-safety for generic #[repr(simd)] types and the generic intrinsics. This would probably require a relatively complicated set
    セットする、集合
    of traits (with compiler integration).

Unresolved questions

  • Should integer
    整数
    vectors get division automatically? Most CPUs don't support them for vectors.
  • How should out-of-bounds shuffle and insert/extract indices be handled?