- Start Date: 2014-09-26
- RFC PR: 326
- Rust Issue: rust-lang/rust#18062
Summary
In string literal contexts, restrict \xXX escape sequences to the ASCII range, \x00 -- \x7F. \xXX inputs in string literals with higher numbers are rejected (with an error message suggesting that one use a \uNNNN escape).
Motivation
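In a string literal context, the current \xXX character escape is potentially confusing for inputs greater than 0x7F, because it does not encode that byte literally but instead encodes whatever the escape sequence \u00XX would produce.
Thus, for inputs greater than 0x7F, \xXX will encode multiple bytes into the generated string literal, as illustrated in the Rust example appendix.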
This is different from what C/C++ programmers might expect (see the Behavior of xXX in C appendix).
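(It would not be legal to encode the single byte literally into the string literal, since the string would then not be well-formed UTF-8.)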
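The byte counts discussed above can be checked with a present-day compiler. The following is a minimal sketch, not part of the RFC: it assumes modern Rust syntax, where the Unicode escape is spelled \u{80} rather than the \uNNNN form used in this document, and the helper name `utf8_len_of_code_point` is hypothetical.

```rust
/// Number of UTF-8 bytes needed for the code point U+00XX.
/// (Hypothetical helper, for illustration only.)
fn utf8_len_of_code_point(xx: u8) -> usize {
    // Every u8 value maps to a valid Unicode scalar value (U+0000..=U+00FF).
    char::from(xx).len_utf8()
}

fn main() {
    // ASCII code points fit in a single byte...
    println!("U+007F -> {} byte(s)", utf8_len_of_code_point(0x7F));
    // ...but anything above 0x7F needs more than one byte in UTF-8.
    println!("U+0080 -> {} byte(s)", utf8_len_of_code_point(0x80));

    // A string literal holding U+0080 is stored as two bytes,
    println!("{:?}", "\u{80}".as_bytes());
    // while a byte-string literal stores exactly the byte that was written.
    println!("{:?}", b"\x80");

    // The lone byte 0x80 is not well-formed UTF-8, so it cannot be a `str`.
    println!("{}", std::str::from_utf8(&[0x80]).is_err());
}
```

The last line is the reason a string literal cannot simply store the single byte: the result would no longer be valid UTF-8.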
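It has been suggested that the \xXX character escape should be removed entirely (at least from string literal contexts). This RFC takes a slightly less aggressive stance: keep \xXX, but only for ASCII inputs when it occurs in string literals. This way, people can continue using this escape format (which is shorter than the \uNNNN format) when it makes sense.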
Here are some links to discussions on this topic, including direct comments that suggest exactly the strategy of this RFC:
- https://github.com/rust-lang/rfcs/issues/312
- https://github.com/rust-lang/rust/issues/12769
- https://github.com/rust-lang/rust/issues/2800#issuecomment-31477259
- https://github.com/rust-lang/rfcs/pull/69#issuecomment-43002505
- https://github.com/rust-lang/rust/issues/12769#issuecomment-43574856
- https://github.com/rust-lang/meeting-minutes/blob/master/weekly-meetings/2014-01-21.md#xnn-escapes-in-strings
- https://mail.mozilla.org/pipermail/rust-dev/2012-July/002025.html
Note in particular the meeting minutes bullet, where the team explicitly decided to keep things as they are.
However, at the time of that meeting, Rust did not have byte string literals; people were converting string literals into byte arrays via the bytes! macro. (Likewise, the rust-dev post is also from a time, summer 2012, when we did not have byte-string literals.)
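We are in a different world now. The fact that \xXX now denotes a code unit in a byte-string literal, but a code point in a string literal, seems like a source of confusion. (Caveat: this context-dependent interpretation of \xXX does have precedent in both Python and Racket; see the Racket example and Python example appendices.)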
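By restricting \xXX to the range 0x00 -- 0x7F, we side-step the question of "is it a code unit or a code point?" entirely (which was the real context of both the rust-dev thread and the meeting minutes bullet). This is a more conservative choice than switching to a "\xXX is a code unit" interpretation.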
The expected outcome is reduced confusion for C/C++ programmers (which is, after all, our primary target audience), and for programmers coming from any other language where \xXX never results in more than one byte.
Detailed design
In string literal contexts, \xXX inputs with XX > 0x7F are rejected (with an error message that mentions either, or both, of \uNNNN escapes and the byte-string literal format b"..").
The full byte range remains supported when \xXX is used in byte-string literals, b"...".
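The range rule for the two literal kinds can be sketched as a small checking function. This is an illustration under stated assumptions, not the compiler's implementation: the `LiteralKind` type, the `check_hex_escape` function, and the wording of the error messages are all hypothetical.

```rust
/// The two literal kinds that accept `\xXX` escapes.
/// (Hypothetical type, not taken from the compiler.)
#[derive(Clone, Copy, Debug, PartialEq)]
enum LiteralKind {
    Str,     // "..."
    ByteStr, // b"..."
}

/// Checks the two hex digits of a `\xXX` escape under the proposed rule.
/// Returns the escaped value, or an error message on rejection.
fn check_hex_escape(digits: &str, kind: LiteralKind) -> Result<u8, String> {
    if digits.len() != 2 || !digits.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err(format!("`\\x{}` is not of the form \\xXX", digits));
    }
    let value = u8::from_str_radix(digits, 16).map_err(|e| e.to_string())?;
    match kind {
        // Byte strings keep the full range 0x00..=0xFF.
        LiteralKind::ByteStr => Ok(value),
        // Strings accept only the ASCII range 0x00..=0x7F.
        LiteralKind::Str if value <= 0x7F => Ok(value),
        LiteralKind::Str => Err(format!(
            "`\\x{}` is out of range in a string literal; \
             use a \\u escape or a byte-string literal b\"..\"",
            digits
        )),
    }
}

fn main() {
    for digits in ["61", "7F", "80", "FF"] {
        println!("str  \\x{}: {:?}", digits, check_hex_escape(digits, LiteralKind::Str));
        println!("bstr \\x{}: {:?}", digits, check_hex_escape(digits, LiteralKind::ByteStr));
    }
}
```

Only the string-literal arm changes under this proposal; the byte-string arm accepts every two-digit value.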
Raw strings by design do not offer escape sequences, so they are unchanged.
Character and string escaping routines (such as core::char::escape_unicode, and such as used by the "{:?}" formatter) are updated so that string inputs that would previously have printed \xXX with XX > 0x7F now use \uNNNN escapes instead.
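For comparison, the escaping routines in present-day Rust behave in this spirit: code points above 0x7F are printed with a Unicode escape rather than \xXX. The sketch below assumes a modern toolchain, where the escape is spelled \u{NN} instead of the \uNNNN form in this RFC; the `escaped` helper is a hypothetical wrapper around the standard str::escape_default method.

```rust
/// Escapes a string with the standard library's default escaping.
/// (Hypothetical convenience wrapper, for illustration only.)
fn escaped(s: &str) -> String {
    s.escape_default().to_string()
}

fn main() {
    // Printable ASCII passes through unchanged.
    println!("{}", escaped("a"));
    // Code points above 0x7F are rendered with a \u{..} escape, never \xXX.
    for s in ["\u{80}", "\u{FF}", "\u{100}"] {
        println!("{}", escaped(s));
    }
}
```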
Drawbacks
Some reasons not to do this:
- we think that the current behavior is intuitive,
- it is consistent with language X (and thus has precedent),
- existing libraries are relying on this behavior, or
- we want to optimize for inputting characters with codepoints in the range above 0x7F in string-literals, rather than optimizing for ASCII.
The thesis of this RFC is that the first bullet is a falsehood.
While there is some precedent for the "\xXX is code point" interpretation in some languages, the majority do seem to favor the "\xXX is code unit" point of view. The proposal of this RFC is side-stepping the distinction by limiting the input range for \xXX.
The third bullet is a strawman since we have not yet released 1.0, and thus everything is still open to change.
This RFC makes no comment on the validity of the fourth bullet.
Alternatives
- We could remove \xXX entirely from string literals. This would require people to use the \uNNNN escape format even for bytes in the range 0x00 -- 0x7F, which seems annoying.
- We could switch \xXX from meaning code point to meaning code unit in both string literal and byte-string literal contexts. This was previously considered and explicitly rejected in an earlier meeting, as discussed in the Motivation section.
Unresolved questions
None.
Appendices
Behavior of xXX in C
Here is a C program illustrating how \xXX escape sequences are treated in string literals in that language:
#include <stdio.h>
int main() {
    char *s;
    s = "a";
    printf("s[0]: %d\n", s[0]);
    printf("s[1]: %d\n", s[1]);
    s = "\x61";
    printf("s[0]: %d\n", s[0]);
    printf("s[1]: %d\n", s[1]);
    s = "\x7F";
    printf("s[0]: %d\n", s[0]);
    printf("s[1]: %d\n", s[1]);
    /* \x80 is stored as the single byte 0x80; it prints as a negative
       number on platforms where plain char is signed. */
    s = "\x80";
    printf("s[0]: %d\n", s[0]);
    printf("s[1]: %d\n", s[1]);
    return 0;
}
Its output is the following:
% gcc example.c && ./a.out
s[0]: 97
s[1]: 0
s[0]: 97
s[1]: 0
s[0]: 127
s[1]: 0
s[0]: -128
s[1]: 0
Rust example
Here is a Rust program that explores the various ways \xXX sequences are treated in both string literal and byte-string literal contexts:
#![feature(macro_rules)]
fn main() {
    macro_rules! print_str {
        ($r:expr, $e:expr) => { {
            println!("{:>20}: \"{}\"",
                     format!("\"{}\"", $r),
                     $e.escape_default())
        } }
    }
    macro_rules! print_bstr {
        ($r:expr, $e:expr) => { {
            println!("{:>20}: {}",
                     format!("b\"{}\"", $r),
                     $e)
        } }
    }
    macro_rules! print_bytes {
        ($r:expr, $e:expr) => {
            println!("{:>9}.as_bytes(): {}", format!("\"{}\"", $r), $e.as_bytes())
        } }
    // println!("{}", b"\u0000"); // invalid: \uNNNN is not a byte escape.
    print_str!(r"\0", "\0");
    print_bstr!(r"\0", b"\0");
    print_bstr!(r"\x00", b"\x00");
    print_bytes!(r"\x00", "\x00");
    print_bytes!(r"\u0000", "\u0000");
    println!("");
    print_str!(r"\x61", "\x61");
    print_bstr!(r"a", b"a");
    print_bstr!(r"\x61", b"\x61");
    print_bytes!(r"\x61", "\x61");
    print_bytes!(r"\u0061", "\u0061");
    println!("");
    print_str!(r"\x7F", "\x7F");
    print_bstr!(r"\x7F", b"\x7F");
    print_bytes!(r"\x7F", "\x7F");
    print_bytes!(r"\u007F", "\u007F");
    println!("");
    print_str!(r"\x80", "\x80");
    print_bstr!(r"\x80", b"\x80");
    print_bytes!(r"\x80", "\x80");
    print_bytes!(r"\u0080", "\u0080");
    println!("");
    print_str!(r"\xFF", "\xFF");
    print_bstr!(r"\xFF", b"\xFF");
    print_bytes!(r"\xFF", "\xFF");
    print_bytes!(r"\u00FF", "\u00FF");
    println!("");
    print_str!(r"\u0100", "\u0100");
    print_bstr!(r"\x01\x00", b"\x01\x00");
    print_bytes!(r"\u0100", "\u0100");
}
In current Rust, it generates the following output:
% rustc --version && echo && rustc example.rs && ./example
rustc 0.12.0-pre (d52d0c836 2014-09-07 03:36:27 +0000)
"\0": "\x00"
b"\0": [0]
b"\x00": [0]
"\x00".as_bytes(): [0]
"\u0000".as_bytes(): [0]
"\x61": "a"
b"a": [97]
b"\x61": [97]
"\x61".as_bytes(): [97]
"\u0061".as_bytes(): [97]
"\x7F": "\x7f"
b"\x7F": [127]
"\x7F".as_bytes(): [127]
"\u007F".as_bytes(): [127]
"\x80": "\x80"
b"\x80": [128]
"\x80".as_bytes(): [194, 128]
"\u0080".as_bytes(): [194, 128]
"\xFF": "\xff"
b"\xFF": [255]
"\xFF".as_bytes(): [195, 191]
"\u00FF".as_bytes(): [195, 191]
"\u0100": "\u0100"
b"\x01\x00": [1, 0]
"\u0100".as_bytes(): [196, 128]
%
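Note that the behavior of \xXX on byte-string literals matches the expectation established by the C program in the Behavior of xXX in C appendix. The problem is the behavior of \xXX for XX > 0x7F in string-literal contexts, namely in the fourth and fifth examples, where the .as_bytes() invocations show that the underlying byte array has two elements instead of one.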
Racket example
% racket
Welcome to Racket v5.93.
> (define a-string "\xbb\n")
> (display a-string)
»
> (bytes-length (string->bytes/utf-8 a-string))
3
> (define a-byte-string #"\xc2\xbb\n")
> (bytes-length a-byte-string)
3
> (display a-byte-string)
»
> (exit)
%
The above code illustrates that in Racket, the \xXX escape sequence denotes a code unit in byte-string context (#".." in that language), while it denotes a code point in string context ("..").
Python example
% python
Python 2.7.5 (default, Mar 9 2014, 22:15:05)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> a_string = u"\xbb\n";
>>> print a_string
»
>>> len(a_string.encode("utf-8"))
3
>>> a_byte_string = "\xc2\xbb\n";
>>> len(a_byte_string)
3
>>> print a_byte_string
»
>>> exit()
%
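The above code illustrates that in Python, the \xXX escape sequence denotes a code unit in byte-string context (".." in that language), while it denotes a code point in unicode string context (u"..").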