Summary

In string literal contexts, restrict \xXX escape sequences to just the range of ASCII characters, \x00 -- \x7F. \xXX inputs in string literals with higher numbers are rejected (with an error message suggesting that one use an \uNNNN escape).

Motivation

In a string literal context, the current \xXX character escape sequence is potentially confusing when given inputs greater than 0x7F, because it does not encode that byte literally, but instead encodes whatever the escape sequence \u00XX would produce.

Thus, for inputs greater than 0x7F, \xXX will encode multiple bytes into the generated string literal, as illustrated in the Rust example appendix.

This is different from what C/C++ programmers might expect (see Behavior of xXX in C appendix).

(It would not be legal to encode the single byte literally into the string literal, since then the string would not be well-formed UTF-8.)
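
The two points above can be checked with a minimal sketch. It is written in present-day Rust syntax, which is an assumption relative to this RFC: the brace form `\u{80}` stands in for the `\u0080` spelling used in this document, and the helper names (`str_bytes`, `bstr_bytes`, `is_well_formed_utf8`) are illustrative only.

```rust
use std::str;

/// Bytes stored for the code point U+0080 in a (UTF-8) string literal.
/// Assumption: modern `\u{..}` syntax, not the `\uNNNN` form of this RFC.
fn str_bytes() -> Vec<u8> {
    "\u{80}".as_bytes().to_vec()
}

/// Bytes stored for the escape \x80 in a byte-string literal.
fn bstr_bytes() -> Vec<u8> {
    b"\x80".to_vec()
}

/// Whether a byte sequence is well-formed UTF-8.
fn is_well_formed_utf8(bytes: &[u8]) -> bool {
    str::from_utf8(bytes).is_ok()
}

fn main() {
    println!("{:?}", str_bytes()); // [194, 128]: two bytes for one code point
    println!("{:?}", bstr_bytes()); // [128]: one byte, stored literally
    println!("{}", is_well_formed_utf8(&[0x80])); // false: lone continuation byte
    println!("{}", is_well_formed_utf8(&str_bytes())); // true
}
```

The lone byte 0x80 fails the UTF-8 check, which is why a string literal cannot simply store it the way a byte-string literal does.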

It has been suggested that the \xXX character escape should be removed entirely (at least from string literal contexts). This RFC is taking a slightly less aggressive stance: keep \xXX, but only for ASCII inputs when it occurs in string literals. This way, people can continue using this escape format (which is shorter than the \uNNNN format) when it makes sense.

Here are some links to discussions on this topic, including direct comments that suggest exactly the strategy of this RFC.

  • https://github.com/rust-lang/rfcs/issues/312
  • https://github.com/rust-lang/rust/issues/12769
  • https://github.com/rust-lang/rust/issues/2800#issuecomment-31477259
  • https://github.com/rust-lang/rfcs/pull/69#issuecomment-43002505
  • https://github.com/rust-lang/rust/issues/12769#issuecomment-43574856
  • https://github.com/rust-lang/meeting-minutes/blob/master/weekly-meetings/2014-01-21.md#xnn-escapes-in-strings
  • https://mail.mozilla.org/pipermail/rust-dev/2012-July/002025.html

Note in particular the meeting minutes bullet, where the team explicitly decided to keep things "as they are".

However, at the time of that meeting, Rust did not have byte string literals; people were converting string-literals into byte arrays via the bytes! macro. (Likewise, the rust-dev post is also from a time, summer 2012, when we did not have byte-string literals.)

We are in a different world now. The fact that now \xXX denotes a code unit in a byte-string literal, but in a string literal denotes a codepoint, does not seem elegant; it rather seems like a source of confusion. (Caveat: While Felix does believe this assertion, this context-dependent interpretation of \xXX does have precedent in both Python and Racket; see Racket example and Python example appendices.)

By restricting \xXX to the range 0x00--0x7F, we side-step the question of "is it a code unit or a code point?" entirely (which was the real context of both the rust-dev thread and the meeting minutes bullet). This RFC is a far more conservative choice that we can safely make for the short term (i.e. for the 1.0 release) than it would have been to switch to a "\xXX is a code unit" interpretation.

The expected outcome is reduced confusion for C/C++ programmers (who are, after all, our primary target audience for conversion), and for programmers coming from any other language where \xXX never results in more than one byte. The error message will point them to the syntax they need to adopt.

Detailed design

In string literal contexts, \xXX inputs with XX > 0x7F are rejected (with an error message that mentions either, or both, of \uNNNN escapes and the byte-string literal format b"..").

The full byte range remains supported when \xXX is used in byte-string literals, b"...".
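
The rule for the two literal kinds can be sketched as a small validation function. This is only an illustration of the rule as stated, not the compiler's lexer: `LiteralKind`, `check_hex_escape`, and the error wording are hypothetical names and text introduced here.

```rust
/// Hypothetical tag for the kind of literal an escape appears in.
#[derive(Debug, Clone, Copy, PartialEq)]
enum LiteralKind {
    Str,
    ByteStr,
}

/// Hypothetical check for the two hex digits that follow `\x`.
/// Returns the byte value, or an error message for the user.
fn check_hex_escape(kind: LiteralKind, digits: &str) -> Result<u8, String> {
    // Require exactly two ASCII hex digits.
    if digits.len() != 2 || !digits.chars().all(|c| c.is_ascii_hexdigit()) {
        return Err(format!("expected two hex digits after \\x, found `{}`", digits));
    }
    let value = u8::from_str_radix(digits, 16).map_err(|e| e.to_string())?;
    match kind {
        // String literals: only the ASCII range 0x00--0x7F is accepted.
        LiteralKind::Str if value > 0x7F => Err(format!(
            "\\x{} is out of range for a string literal; \
             use a \\u escape or a byte-string literal b\"..\"",
            digits
        )),
        // Byte-string literals keep the full range 0x00--0xFF.
        _ => Ok(value),
    }
}

fn main() {
    println!("{:?}", check_hex_escape(LiteralKind::Str, "7F"));
    println!("{:?}", check_hex_escape(LiteralKind::Str, "80"));
    println!("{:?}", check_hex_escape(LiteralKind::ByteStr, "FF"));
}
```

In this sketch the only difference between the two contexts is the single guarded match arm, which mirrors how narrow the proposed change is.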

Raw strings by design do not offer escape sequences, so they are unchanged.

Character and string escaping routines (such as core::char::escape_unicode, and such as used by the "{:?}" formatter) are updated so that string inputs that would previously have printed \xXX with XX > 0x7F are updated to use \uNNNN escapes instead.
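
A rough sketch of an escaping routine with this behavior follows. It is not the standard library's implementation: `escape_char` and `escape_str` are hypothetical helpers, the choice of which ASCII characters to print unescaped is an assumption, and the eight-digit `\U` form used for code points above U+FFFF is likewise an assumption, since this RFC only mentions \uNNNN.

```rust
/// Hypothetical escaping of one character, following the updated rule:
/// \xXX only for ASCII, \uNNNN for code points above 0x7F.
fn escape_char(c: char) -> String {
    let cp = c as u32;
    match c {
        '\\' => "\\\\".to_string(),
        '"' => "\\\"".to_string(),
        // Printable ASCII is emitted unchanged (assumed policy).
        ' '..='~' => c.to_string(),
        // Remaining ASCII (control characters, DEL) keeps the \xXX form.
        _ if cp <= 0x7F => format!("\\x{:02x}", cp),
        // Above ASCII: \uNNNN instead of \xXX.
        _ if cp <= 0xFFFF => format!("\\u{:04x}", cp),
        // Assumption: eight-digit form for code points beyond U+FFFF.
        _ => format!("\\U{:08x}", cp),
    }
}

/// Escape every character of a string with `escape_char`.
fn escape_str(s: &str) -> String {
    s.chars().map(escape_char).collect()
}

fn main() {
    // Input written with modern `\u{..}` syntax (an assumption relative to this RFC).
    println!("{}", escape_str("a\u{7f}\u{80}\u{ff}")); // a\x7f\u0080\u00ff
}
```

With this rule the escaped output can be pasted back into a string literal without tripping the new range restriction.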

Drawbacks

Some reasons not to do this:

  • we think that the current behavior is intuitive,
  • it is consistent with language X (and thus has precedent),
  • existing libraries are relying on this behavior, or
  • we want to optimize for inputting characters with codepoints in the range above 0x7F in string-literals, rather than optimizing for ASCII.

The thesis of this RFC is that the first bullet is a falsehood.

While there is some precedent for the "\xXX is code point" interpretation in some languages, the majority do seem to favor the "\xXX is code unit" point of view. The proposal of this RFC is side-stepping the distinction by limiting the input range for \xXX.

The third bullet is a strawman since we have not yet released 1.0, and thus everything is up for change.

This RFC makes no comment on the validity of the fourth bullet.

Alternatives

  • We could remove \xXX entirely from string literals. This would require people to use the \uNNNN escape format even for bytes in the range 0x00--0x7F, which seems annoying.
  • We could switch \xXX from meaning code point to meaning code unit in both string literal and byte-string literal contexts. This was previously considered and explicitly rejected in an earlier meeting, as discussed in the Motivation section.

Unresolved questions

None.

Appendices

Behavior of xXX in C

Here is a C program illustrating how \xXX escape sequences are treated in string literals in that context:

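    #include <stdio.h>

    int main() {
        char *s;

        /* A plain ASCII character. */
        s = "a";
        printf("s[0]: %d\n", s[0]);
        printf("s[1]: %d\n", s[1]);

        /* The same character written as a hex escape. */
        s = "\x61";
        printf("s[0]: %d\n", s[0]);
        printf("s[1]: %d\n", s[1]);

        /* The largest ASCII value. */
        s = "\x7F";
        printf("s[0]: %d\n", s[0]);
        printf("s[1]: %d\n", s[1]);

        /* Above ASCII: still a single byte (negative where char is signed). */
        s = "\x80";
        printf("s[0]: %d\n", s[0]);
        printf("s[1]: %d\n", s[1]);

        return 0;
    }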

Its output is the following:

    % gcc example.c && ./a.out
    s[0]: 97
    s[1]: 0
    s[0]: 97
    s[1]: 0
    s[0]: 127
    s[1]: 0
    s[0]: -128
    s[1]: 0

Rust example

Here is a Rust program that explores the various ways \xXX sequences are treated in both string literal and byte-string literal contexts.

    #![feature(macro_rules)]

    fn main() {
        macro_rules! print_str {
            ($r:expr, $e:expr) => { {
                println!("{:>20}: \"{}\"",
                         format!("\"{}\"", $r),
                         $e.escape_default())
            } }
        }

        macro_rules! print_bstr {
            ($r:expr, $e:expr) => { {
                println!("{:>20}: {}",
                         format!("b\"{}\"", $r),
                         $e)
            } }
        }

        macro_rules! print_bytes {
            ($r:expr, $e:expr) => {
                println!("{:>9}.as_bytes(): {}", format!("\"{}\"", $r), $e.as_bytes())
            }
        }

        // println!("{}", b"\u0000"); // invalid: \uNNNN is not a byte escape.
        print_str!(r"\0", "\0");
        print_bstr!(r"\0", b"\0");
        print_bstr!(r"\x00", b"\x00");
        print_bytes!(r"\x00", "\x00");
        print_bytes!(r"\u0000", "\u0000");
        println!("");
        print_str!(r"\x61", "\x61");
        print_bstr!(r"a", b"a");
        print_bstr!(r"\x61", b"\x61");
        print_bytes!(r"\x61", "\x61");
        print_bytes!(r"\u0061", "\u0061");
        println!("");
        print_str!(r"\x7F", "\x7F");
        print_bstr!(r"\x7F", b"\x7F");
        print_bytes!(r"\x7F", "\x7F");
        print_bytes!(r"\u007F", "\u007F");
        println!("");
        print_str!(r"\x80", "\x80");
        print_bstr!(r"\x80", b"\x80");
        print_bytes!(r"\x80", "\x80");
        print_bytes!(r"\u0080", "\u0080");
        println!("");
        print_str!(r"\xFF", "\xFF");
        print_bstr!(r"\xFF", b"\xFF");
        print_bytes!(r"\xFF", "\xFF");
        print_bytes!(r"\u00FF", "\u00FF");
        println!("");
        print_str!(r"\u0100", "\u0100");
        print_bstr!(r"\x01\x00", b"\x01\x00");
        print_bytes!(r"\u0100", "\u0100");
    }

In current Rust, it generates output as follows:

    % rustc --version && echo && rustc example.rs && ./example
    rustc 0.12.0-pre (d52d0c836 2014-09-07 03:36:27 +0000)

                    "\0": "\x00"
                   b"\0": [0]
                 b"\x00": [0]
       "\x00".as_bytes(): [0]
     "\u0000".as_bytes(): [0]

                  "\x61": "a"
                    b"a": [97]
                 b"\x61": [97]
       "\x61".as_bytes(): [97]
     "\u0061".as_bytes(): [97]

                  "\x7F": "\x7f"
                 b"\x7F": [127]
       "\x7F".as_bytes(): [127]
     "\u007F".as_bytes(): [127]

                  "\x80": "\x80"
                 b"\x80": [128]
       "\x80".as_bytes(): [194, 128]
     "\u0080".as_bytes(): [194, 128]

                  "\xFF": "\xff"
                 b"\xFF": [255]
       "\xFF".as_bytes(): [195, 191]
     "\u00FF".as_bytes(): [195, 191]

                "\u0100": "\u0100"
             b"\x01\x00": [1, 0]
     "\u0100".as_bytes(): [196, 128]
    %

Note that the behavior of \xXX on byte-string literals matches the expectations established by the C program in Behavior of xXX in C; that is good. The problem is the behavior of \xXX for XX > 0x7F in string-literal contexts, namely in the fourth and fifth examples where the .as_bytes() invocations are showing that the underlying byte array has two elements instead of one.

Racket example

    % racket
    Welcome to Racket v5.93.
    > (define a-string "\xbb\n")
    > (display a-string)
    »
    > (bytes-length (string->bytes/utf-8 a-string))
    3
    > (define a-byte-string #"\xc2\xbb\n")
    > (bytes-length a-byte-string)
    3
    > (display a-byte-string)
    »
    > (exit)
    %

The above code illustrates that in Racket, the \xXX escape sequence denotes a code unit in byte-string context (#".." in that language), while it denotes a code point in string context ("..").

Python example

    % python
    Python 2.7.5 (default, Mar 9 2014, 22:15:05)
    [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> a_string = u"\xbb\n";
    >>> print a_string
    »

    >>> len(a_string.encode("utf-8"))
    3
    >>> a_byte_string = "\xc2\xbb\n";
    >>> len(a_byte_string)
    3
    >>> print a_byte_string
    »

    >>> exit()
    %

The above code illustrates that in Python, the \xXX escape sequence denotes a code unit in byte-string context (".." in that language), while it denotes a code point in unicode string context (u"..").