Summary

Generalize the WTF-8 encoding to allow

許可する、可能にする
OsStr to use the pattern API methods.

Motivation

OsStr is missing many common string methods compared

比較する
to the standard str or even [u8]. There have been numerous attempts to expand the API surface, the latest one being RFC #1309, which leads to an attempt to revamp the std::pattern::Pattern API, but eventually closed due to inactivity and lack of resource.

Over the past several years, there has been numerous requests and attempts to implement

実装する
these missing functions in particular OsStr::starts_with (1, 2, 3, 4, 5, 6).

The main difficulty applying

適用する
str APIs to OsStr is WTF-8. A surrogate pair (e.g. U+10000 = d800 dc00) is encoded as a 4-byte sequence
連なり、並び
(f0 90 80 80) similar
似ている、同様の
to UTF-8, but an unpaired surrogate (e.g. U+D800 alone) is encoded as a completely distinct
区別された/独立した
3-byte sequence
連なり、並び
(ed a0 80). Naively extending
拡張する
the slice-based pattern API will not work, e.g. you cannot find any ed a0 80 inside f0 90 80 80, so .starts_with() is going to be more complex,
複素数、複文の
and .split() certainly cannot borrow a well-formed WTF-8 slice from it.

The solution proposed by RFC #1309 is to create two sets

セットする、集合
of APIs. One, .contains_os(), .starts_with_os(), .ends_with_os() and .replace() which do not require borrowing, will support using &OsStr as input. The rest like .split(), .matches() and .trim() which require borrowing, will only accept
受け付ける、受理する
UTF-8 strings as input.

The “pattern 2.0” API does not split into two sets

セットする、集合
of APIs, but will panic when the search string starts with or ends with an unpaired surrogate.

We feel that these designs

設計(する)
are not elegant enough. This RFC attempts to fix the problem by going one level lower,
下方の、小文字の
by generalizing
一般
WTF-8 so that splitting a surrogate pair is allowed,
許可する、可能にする
so we could search an OsStr with an OsStr using a single
単一の
Pattern API without panicking.

Guide-level explanation

The following new methods are now available to OsStr. They behave

振る舞う
the same as their counterpart in str.

#![allow(unused)] fn main() { impl OsStr { pub fn contains<'a, P>(&'a self, pat: P) -> bool where P: Pattern<&'a Self>; pub fn starts_with<'a, P>(&'a self, pat: P) -> bool where P: Pattern<&'a Self>; pub fn ends_with<'a, P>(&'a self, pat: P) -> bool where P: Pattern<&'a Self>, P::Searcher: ReverseSearcher<&'a Self>; pub fn find<'a, P>(&'a self, pat: P) -> Option<usize> where P: Pattern<&'a Self>; pub fn rfind<'a, P>(&'a self, pat: P) -> Option<usize> where P: Pattern<&'a Self>, P::Searcher: ReverseSearcher<&'a Self>; /// Finds the first range of this string which contains the pattern. /// /// # Examples /// /// ```rust /// let path = OsStr::new("/usr/bin/bash"); /// let range = path.find_range("/b"); /// assert_eq!(range, Some(4..6)); /// assert_eq!(path[range.unwrap()], OsStr::new("/b")); /// ``` pub fn find_range<'a, P>(&'a self, pat: P) -> Option<Range<usize>> where P: Pattern<&'a Self>; /// Finds the last range of this string which contains the pattern. /// /// # Examples /// /// ```rust /// let path = OsStr::new("/usr/bin/bash"); /// let range = path.rfind_range("/b"); /// assert_eq!(range, Some(8..10)); /// assert_eq!(path[range.unwrap()], OsStr::new("/b")); /// ``` pub fn rfind_range<'a, P>(&'a self, pat: P) -> Option<Range<usize>> where P: Pattern<&'a Self>, P::Searcher: ReverseSearcher<&'a Self>; // (Note: these should return a concrete iterator type instead of `impl Trait`. // For ease of explanation the concrete type is not listed here.) pub fn split<'a, P>(&'a self, pat: P) -> impl Iterator<Item = &'a Self> where P: Pattern<&'a Self>; pub fn rsplit<'a, P>(&'a self, pat: P) -> impl Iterator<Item = &'a Self> where P: Pattern<&'a Self>, P::Searcher: ReverseSearcher<&'a Self>; pub fn split_terminator<'a, P>(&'a self, pat: P) -> impl Iterator<Item = &'a Self> where P: Pattern<&'a Self>; pub fn rsplit_terminator<'a, P>(&'a self, pat: P) -> impl Iterator<Item = &'a Self> where P: Pattern<&'a Self>, P::Searcher: ReverseSearcher<&'a Self>; pub fn splitn<'a, P>(&'a self, n: usize, pat: P) -> impl Iterator<Item = &'a Self> where P: Pattern<&'a Self>; pub fn rsplitn<'a, P>(&'a self, n: usize, pat: P) -> impl Iterator<Item = &'a Self> where P: Pattern<&'a Self>, P::Searcher: ReverseSearcher<&'a Self>; pub fn matches<'a, P>(&'a self, pat: P) -> impl Iterator<Item = &'a Self> where P: Pattern<&'a Self>; pub fn rmatches<'a, P>(&self, pat: P) -> impl Iterator<Item = &'a Self> where P: Pattern<&'a Self>, P::Searcher: ReverseSearcher<&'a Self>; pub fn match_indices<'a, P>(&self, pat: P) -> impl Iterator<Item = (usize, &'a Self)> where P: Pattern<&'a Self>; pub fn rmatch_indices<'a, P>(&self, pat: P) -> impl Iterator<Item = (usize, &'a Self)> where P: Pattern<&'a Self>, P::Searcher: ReverseSearcher<&'a Self>; // this is new pub fn match_ranges<'a, P>(&'a self, pat: P) -> impl Iterator<Item = (Range<usize>, &'a Self)> where P: Pattern<&'a Self>; // this is new pub fn rmatch_ranges<'a, P>(&'a self, pat: P) -> impl Iterator<Item = (Range<usize>, &'a Self)> where P: Pattern<&'a Self>, P::Searcher: ReverseSearcher<&'a Self>; pub fn trim_matches<'a, P>(&'a self, pat: P) -> &'a Self where P: Pattern<&'a Self>, P::Searcher: DoubleEndedSearcher<&'a Self>; pub fn trim_left_matches<'a, P>(&'a self, pat: P) -> &'a Self where P: Pattern<&'a Self>; pub fn trim_right_matches<'a, P>(&'a self, pat: P) -> &'a Self where P: Pattern<&'a Self>, P::Searcher: ReverseSearcher<&'a Self>; pub fn replace<'a, P>(&'a self, from: P, to: &'a Self) -> Self::Owned where P: Pattern<&'a Self>; pub fn replacen<'a, P>(&'a self, from: P, to: &'a Self, count: usize) -> Self::Owned where P: Pattern<&'a Self>; } }

We also allow

許可する、可能にする
slicing an OsStr.

#![allow(unused)] fn main() { impl Index<RangeFull> for OsStr { ... } impl Index<RangeFrom<usize>> for OsStr { ... } impl Index<RangeTo<usize>> for OsStr { ... } impl Index<Range<usize>> for OsStr { ... } }

Example:

#![allow(unused)] fn main() { // (assume we are on Windows) let path = OsStr::new(r"C:\Users\Admin\😀\😁😂😃😄.txt"); // can use starts_with, ends_with assert!(path.starts_with(OsStr::new(r"C:\"))); assert!(path.ends_with(OsStr::new(".txt")); // can use rfind_range to get the range of substring let last_backslash = path.rfind_range(OsStr::new(r"\")).unwrap(); assert_eq!(last_backslash, 16..17); // can perform slicing. let file_name = &path[last_backslash.end..]; // can perform splitting, even if it results in invalid Unicode! let mut parts = file_name.split(&*OsString::from_wide(&[0xd83d])); assert_eq!(parts.next(), Some(OsStr::new(""))); assert_eq!(parts.next(), Some(&*OsString::from_wide(&[0xde01]))); assert_eq!(parts.next(), Some(&*OsString::from_wide(&[0xde02]))); assert_eq!(parts.next(), Some(&*OsString::from_wide(&[0xde03]))); assert_eq!(parts.next(), Some(&*OsString::from_wide(&[0xde04, 0x2e, 0x74, 0x78, 0x74]))); assert_eq!(parts.next(), None); }

Reference-level explanation

It is trivial to apply

適用する
the pattern API to OsStr on platforms where it is just an [u8]. The main difficulty is on Windows where it is an [u16] encoded as WTF-8. This RFC thus
それゆえに、従って、
focuses on Windows.

We will generalize the encoding of OsStr to OMG-WTF-8 which specifies

特定する、指定する、規定する
these two capabilities:

  1. Slicing a surrogate pair in half:

    #![allow(unused)] fn main() { let s = OsStr::new("\u{10000}"); assert_eq!(&s[..2], &*OsString::from_wide(&[0xd800])); assert_eq!(&s[2..], &*OsString::from_wide(&[0xdc00])); }
  2. Finding a surrogate code point, no matter paired or unpaired:

    #![allow(unused)] fn main() { let needle = OsString::from_wide(&[0xdc00]); assert_eq!(OsStr::new("\u{10000}").find(&needle), Some(2)); assert_eq!(OsString::from_wide(&[0x3f, 0xdc00]).find(&needle), Some(1)); }

These allow

許可する、可能にする
us to implement
実装する
the “Pattern 1.5” API for all OsStr without panicking. Implementation
実装
detail can be found in the omgwtf8 package.

Slicing

A surrogate pair is a 4-byte sequence

連なり、並び
in both UTF-8 and WTF-8. We support slicing it in half by representing
表現する
the high surrogate by the first 3 bytes, and the low surrogate by the last 3 bytes.

"\u{10000}" = f0 90 80 80 "\u{10000}"[..2] = f0 90 80 "\u{10000}"[2..] = 90 80 80

The index splitting the surrogate pair will be positioned at the middle of the 4-byte sequence

連なり、並び
(index "2" in the above example).

Note that this means:

  1. x[..i] and x[i..] will have overlapping
    重なる/重なり
    parts. This makes OsStr::split_at_mut (if exists) unable to split a surrogate pair in half. This also means Pattern<&mut OsStr> cannot be implemented
    実装する
    for &OsStr.
  2. The length of x[..n] may be longer than n.

Platform-agnostic guarantees
保証する

If an index points to an invalid position (e.g. \u{1000}[1..] or "\u{10000}"[1..] or "\u{10000}"[3..]), a panic will be raised, similar

似ている、同様の
to that of str. The following are guaranteed
保証する
to be valid
有効な、正しい
positions on all platforms:

  • 0.
  • self.len().
  • The returned indices from find(), rfind(), match_indices() and rmatch_indices().
  • The returned ranges
    範囲
    from find_range(), rfind_range(), match_ranges() and rmatch_ranges().

Index arithmetic

算術
is wrong for OsStr, i.e. i + n may not produce
産出する
the correct index (see Drawbacks).

For WTF-8 encoding on Windows, we define:

定義する

  • boundary of a character
    文字
    or surrogate byte sequence
    連なり、並び
    is Valid.
    有効な、正しい
  • middle (byte 2) of a 4-byte sequence
    連なり、並び
    is Valid.
    有効な、正しい
  • interior of a 2- or 3-byte sequence
    連なり、並び
    is Invalid.
  • byte 1 or 3 of a 4-byte sequence
    連なり、並び
    is Invalid.

Outside of Windows where the OsStr consists

構成される
of arbitrary
任意の
bytes, all numbers within 0 ..= self.len() are considered
みなす、考慮する
a valid
有効な、正しい
index. This is because we want to allow
許可する、可能にする
os_str.find(OsStr::from_bytes(b"\xff")), and thus
それゆえに、従って、
cannot use UTF-8 to reason with a Unix OsStr.

Note that we have never guaranteed

保証する
the actual
実際の
OsStr encoding, these should only be considered
みなす、考慮する
an implementation
実装
detail.

Comparison
比較
and storage
保存場所、保管スペース

All OsStr strings with sliced

スライスされた
4-byte sequence
連なり、並び
can be converted
変換する
back to proper WTF-8 with an O(1) transformation:

  • If the string starts with [\x80-\xbf]{3}, replace these 3 bytes with the canonical low surrogate encoding.
  • If the string ends with [\xf0-\xf4][\x80-\xbf]{2}, replace these 3 bytes with the canonical high surrogate encoding.

We can this transformation canonicalization”.

All owned OsStr should be canonicalized

正規化する
to contain
含む
well-formed WTF-8 only: Box<OsStr>, Rc<OsStr>, Arc<OsStr> and OsString.

Two OsStr are compared

比較する
equal
等しい
if they have the same canonicalization. This may slightly reduce the performance with a constant
定数
overhead, since there would be more checking involving the first and last three bytes.

Matching
一致する、マッチさせる

When an OsStr is used for matching,

一致する、マッチさせる
an unpaired low surrogate at the beginning and unpaired high surrogate at the end must be replaced by regular
普通の、正規の
expressions
that match
一致する、マッチさせる
all pre-canonicalization possibilities. For instance,
実例
matching
一致する、マッチさせる
for xxxx\u{d9ab} would create the following regex:

xxxx( \xed\xa6\xab # canonical representation | \xf2\x86[\xb0-\xbf] # split representation )

and matching

一致する、マッチさせる
for \u{dcef}xxxx with create the following regex:

( \xed\xb3\xaf # canonical representation | [\x80-\xbf][\x83\x93\xa3\xb3]\xaf # split representation )xxxx

After finding a match,

一致する、マッチさせる
if the end points to the middle of a 4-byte sequence,
連なり、並び
the search engine should move backward by 2 bytes before continuing. This ensure
保証する
searching for \u{dc00}\u{d800} in \u{10000}\u{10000}\u{10000} will properly yield
産出する、出力する
2 matches.
一致する、マッチさせる

Pattern API

As of Rust 1.25, we can search a &str using a character,

文字
a character
文字
set
セットする、集合
or another string, powered
累乗
by RFC #528 a.k.a. “Pattern API 1.0”.

There are some drafts to generalize this so that we could retain mutability and search in more types such as &[T] and &OsStr, as described

記述する
in various
さまざまな
comments (“v1.5 and v2.0”). A proper RFC has not been proposed so far.

This RFC assumes the target of generalizing

一般
the Pattern API beyond &str is accepted,
受け付ける、受理する
enabling us to provide a uniform search API between different types of haystack and needles. However, this RFC does not rely on a generalized Pattern API. If this RFC is stabilized without a generalized Pattern API, the new methods described
記述する
in the Guide-level explanation section
can take
とる
&OsStr instead of impl Pattern<&OsStr>, but this may hurt future compatibility
互換性
due to inference breakage if generalized Pattern API is indeed implemented.
実装する

Assuming we do want to generalize Pattern API, the implementor should note the issue of splitting a surrogate pair:

  1. A match
    一致する、マッチさせる
    which starts with a low surrogate will point to byte 1 of the 4-byte sequence
    連なり、並び
  2. An index always point to byte 2 of the 4-byte sequence
    連なり、並び
  3. A match
    一致する、マッチさせる
    which ends with a high surrogate will point to byte 3 of the 4-byte sequence
    連なり、並び

Implementation

実装
should note these different offsets when converting
変換する
between different kinds of cursors. In the omgwtf8::pattern module, based
基となる、基底(の)
on the “v1.5” draft, this behavior
ふるまい
is enforced in the API design
設計(する)
by using distinct
区別された/独立した
types for the start and end cursors.

The following outlines the generalized Pattern API which could work for &OsStr:

#![allow(unused)] fn main() { // in module `core::pattern`: pub trait Pattern<H: Haystack>: Sized { type Searcher: Searcher<H>; fn into_searcher(self, haystack: H) -> Self::Searcher; fn is_contained_in(self, haystack: H) -> bool; fn is_prefix_of(self, haystack: H) -> bool; fn is_suffix_of(self, haystack: H) -> bool where Self::Searcher: ReverseSearcher<H>; } pub trait Searcher<H: Haystack> { fn haystack(&self) -> H; fn next_match(&mut self) -> Option<(H::StartCursor, H::EndCursor)>; fn next_reject(&mut self) -> Option<(H::StartCursor, H::EndCursor)>; } pub trait ReverseSearcher<H: Haystack>: Searcher<H> { fn next_match_back(&mut self) -> Option<(H::StartCursor, H::EndCursor)>; fn next_reject_back(&mut self) -> Option<(H::StartCursor, H::EndCursor)>; } pub trait DoubleEndedSearcher<H: Haystack>: ReverseSearcher<H> {} // equivalent to SearchPtrs in "Pattern API 1.5" // and PatternHaystack in "Pattern API 2.0" pub trait Haystack: Sized { type StartCursor: Copy + PartialOrd<Self::EndCursor>; type EndCursor: Copy + PartialOrd<Self::StartCursor>; // The following 5 methods are same as those in "Pattern API 1.5" // except the cursor type is split into two. fn cursor_at_front(hs: &Self) -> Self::StartCursor; fn cursor_at_back(hs: &Self) -> Self::EndCursor; unsafe fn start_cursor_to_offset(hs: &Self, cur: Self::StartCursor) -> usize; unsafe fn end_cursor_to_offset(hs: &Self, cur: Self::EndCursor) -> usize; unsafe fn range_to_self(hs: Self, start: Self::StartCursor, end: Self::EndCursor) -> Self; // And then we want to swap between the two cursor types unsafe fn start_to_end_cursor(hs: &Self, cur: Self::StartCursor) -> Self::EndCursor; unsafe fn end_to_start_cursor(hs: &Self, cur: Self::EndCursor) -> Self::StartCursor; } }

For the &OsStr haystack, we define

定義する
both StartCursor and EndCursor as *const u8.

The start_to_end_cursor function will return cur + 2 if we find that cur points to the middle of a 4-byte sequence.

連なり、並び

The start_cursor_to_offset function will return cur - hs + 1 if we find that cur points to the middle of a 4-byte sequenced.

These type safety measures ensure

保証する
functions utilizing a generic Pattern can get the correctly overlapping
重なる/重なり
slices when splitting a surrogate pair.

#![allow(unused)] fn main() { // (actual code implementing `.split()`) match self.matcher.next_match() { Some((a, b)) => unsafe { let haystack = self.matcher.haystack(); let a = H::start_to_end_cursor(&haystack, a); let b = H::end_to_start_cursor(&haystack, b); let elt = H::range_to_self(haystack, self.start, a); // ^ without `start_to_end_cursor`, the slice `elt` may be short by 2 bytes self.start = b; // ^ without `end_to_start_cursor`, the next starting position may skip 2 bytes Some(elt) }, None => self.get_end(), } }

Drawbacks

  • It breaks the invariant x[..n].len() == n.

    Note that OsStr did not provide a slicing operator,

    演算子
    and it already violated
    違反する
    the invariant (x + y).len() == x.len() + y.len().

  • A surrogate code point may be 2 or 3 indices long depending on context.

    文脈、背景

    This means code using x[i..(i+n)] may give wrong result.

    結果、戻り値

    #![allow(unused)] fn main() { let needle = OsString::from_wide(&[0xdc00]); let haystack = OsStr::new("\u{10000}a"); let index = haystack.find(&needle).unwrap(); let matched = &haystack[index..(index + needle.len())]; // `matched` will contain "\u{dc00}a" instead of "\u{dc00}". }

    As a workaround, we introduced find_range and match_ranges. Note that this is already a problem to solve if we want to make Regex a pattern of strings.

Rationale and alternatives
代わりのもの、選択肢

Indivisible surrogate pair

This RFC is the only design

設計(する)
which allows
許可する、可能にする
borrowing a sub-slice of a surrogate code point from a surrogate pair.

An alternative

代わりのもの、選択肢
is keep using the vanilla WTF-8, and treat
取り扱う
a surrogate pair as an atomic entity:
実体、存在物
makes it impossible to split a surrogate pair after it is formed.
形式、形態、形作る
The advantages are that

  • The pattern API becomes a simple substring
    部分文字列
    search.
  • Slicing behavior
    ふるまい
    is consistent with str.

There are two potential implementations

実装
when we want to match
一致する、マッチさせる
with an unpaired surrogate:

  1. Declare

    宣言する
    that a surrogate pair does not contain
    含む
    the unpaired surrogate
    , i.e. make "\u{10000}".find("\u{d800}") return None. An unpaired surrogate can only be used to match
    一致する、マッチさせる
    another unpaired surrogate.

    If we choose this, it means x.find(z).is_some() does not imply (x + y).find(z).is_some().

  2. Disallow

    許可しない
    matching
    一致する、マッチさせる
    when the pattern contains
    含む
    an unpaired surrogate at the boundary
    , i.e. make "\u{10000}".find("\u{d800}") panic. This is the approach chosen by “Pattern API 2.0”.

Note that, for consistency, we need to make "\u{10000}".starts_with("\u{d800}") return false or panic.

Slicing at real byte offset

The current RFC defines

定義する
the index that splits a surrogate pair into half at byte 2 of the 4-byte sequence.
連なり、並び
This has the drawback of "\u{10000}"[..2].len() == 3, and caused index arithmetic
算術
to be wrong.

"\u{10000}" = f0 90 80 80 "\u{10000}"[..2] = f0 90 80 "\u{10000}"[2..] = 90 80 80

The main advantage of this scheme is we could use the same number as the start and end index.

#![allow(unused)] fn main() { let s = OsStr::new("\u{10000}"); assert_eq!(s.len(), 4); let index = s.find('\u{dc00}').unwrap(); let right = &s[index..]; // [90 80 80] let left = &s[..index]; // [f0 90 80] }

An alternative

代わりのもの、選択肢
make the index refer
参照する
to the real byte offsets:

"\u{10000}" = f0 90 80 80 "\u{10000}"[..3] = f0 90 80 "\u{10000}"[1..] = 90 80 80

However the question would be, what should s[..1] do?

  • Panic But this means we cannot get left. We could inspect the raw

    生の
    bytes of s itself and perform &s[..(index + 2)], but we never explicitly
    明示的に
    exposed the encoding of OsStr, so we cannot read a single
    単一の
    byte and thus
    それゆえに、従って、
    impossible to do this.

  • Treat

    取り扱う
    as same as s[..3] But then this inherits
    継承する
    all the disadvantages of using 2 as valid
    有効な、正しい
    index, plus we need to consider
    考える、みなす
    whether s[1..3] and s[3..1] should be valid.
    有効な、正しい

Given

与えられた
these, we decided not to treat
取り扱う
the real byte offsets as valid
有効な、正しい
indices.

Unresolved questions

None yet.