# Unicode conformance

This document describes the regex crate's conformance to Unicode's
-[UTS #18](http://unicode.org/reports/tr18/)
+[UTS #18](https://unicode.org/reports/tr18/)
report, which lays out 3 levels of support: Basic, Extended and Tailored.

Full support for Level 1 ("Basic Unicode Support") is provided with two
@@ -10,7 +10,7 @@ exceptions:
1. Line boundaries are not Unicode aware. Namely, only the `\n`
(`END OF LINE`) character is recognized as a line boundary.
2. The compatibility properties specified by
-[RL1.2a](http://unicode.org/reports/tr18/#RL1.2a)
+[RL1.2a](https://unicode.org/reports/tr18/#RL1.2a)
are ASCII-only definitions.

Little to no support is provided for either Level 2 or Level 3. For the most
@@ -61,18 +61,18 @@ provide a convenient way to construct character classes of groups of code
points specified by Unicode. The regex crate does not provide exhaustive
support, but covers a useful subset. In particular:

-* [General categories](http://unicode.org/reports/tr18/#General_Category_Property)
-* [Scripts and Script Extensions](http://unicode.org/reports/tr18/#Script_Property)
-* [Age](http://unicode.org/reports/tr18/#Age)
+* [General categories](https://unicode.org/reports/tr18/#General_Category_Property)
+* [Scripts and Script Extensions](https://unicode.org/reports/tr18/#Script_Property)
+* [Age](https://unicode.org/reports/tr18/#Age)
* A smattering of boolean properties, including all of those specified by
-[RL1.2](http://unicode.org/reports/tr18/#RL1.2) explicitly.
+[RL1.2](https://unicode.org/reports/tr18/#RL1.2) explicitly.

In all cases, property name and value abbreviations are supported, and all
names/values are matched loosely without regard for case, whitespace or
underscores. Property name aliases can be found in Unicode's
-[`PropertyAliases.txt`](http://www.unicode.org/Public/UCD/latest/ucd/PropertyAliases.txt)
+[`PropertyAliases.txt`](https://www.unicode.org/Public/UCD/latest/ucd/PropertyAliases.txt)
file, while property value aliases can be found in Unicode's
-[`PropertyValueAliases.txt`](http://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)
+[`PropertyValueAliases.txt`](https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)
file.

The syntax supported is also consistent with the UTS #18 recommendation:
@@ -149,10 +149,10 @@ properties correspond to properties required by RL1.2):

## RL1.2a Compatibility Properties

-[UTS #18 RL1.2a](http://unicode.org/reports/tr18/#RL1.2a)
+[UTS #18 RL1.2a](https://unicode.org/reports/tr18/#RL1.2a)

The regex crate only provides ASCII definitions of the
-[compatibility properties documented in UTS #18 Annex C](http://unicode.org/reports/tr18/#Compatibility_Properties)
+[compatibility properties documented in UTS #18 Annex C](https://unicode.org/reports/tr18/#Compatibility_Properties)
(sans the `\X` class, for matching grapheme clusters, which isn't provided
at all). This is because it seems to be consistent with most other regular
expression engines, and in particular, because these are often referred to as
@@ -165,7 +165,7 @@ Their traditional ASCII definition can be used by disabling Unicode. That is,

## RL1.3 Subtraction and Intersection

-[UTS #18 RL1.3](http://unicode.org/reports/tr18/#Subtraction_and_Intersection)
+[UTS #18 RL1.3](https://unicode.org/reports/tr18/#Subtraction_and_Intersection)

The regex crate provides full support for nested character classes, along with
union, intersection (`&&`), difference (`--`) and symmetric difference (`~~`)
@@ -178,7 +178,7 @@ For example, to match all non-ASCII letters, you could use either

## RL1.4 Simple Word Boundaries

-[UTS #18 RL1.4](http://unicode.org/reports/tr18/#Simple_Word_Boundaries)
+[UTS #18 RL1.4](https://unicode.org/reports/tr18/#Simple_Word_Boundaries)

The regex crate provides basic Unicode aware word boundary assertions. A word
boundary assertion can be written as `\b`, or `\B` as its negation. A word
@@ -196,9 +196,9 @@ the following classes:
* `\p{gc:Connector_Punctuation}`

In particular, this differs slightly from the
-[prescription given in RL1.4](http://unicode.org/reports/tr18/#Simple_Word_Boundaries)
+[prescription given in RL1.4](https://unicode.org/reports/tr18/#Simple_Word_Boundaries)
but is permissible according to
-[UTS #18 Annex C](http://unicode.org/reports/tr18/#Compatibility_Properties).
+[UTS #18 Annex C](https://unicode.org/reports/tr18/#Compatibility_Properties).
Namely, it is convenient and simpler to have `\w` and `\b` be in sync with
one another.

@@ -211,7 +211,7 @@ boundaries is currently sub-optimal on non-ASCII text.

## RL1.5 Simple Loose Matches

-[UTS #18 RL1.5](http://unicode.org/reports/tr18/#Simple_Loose_Matches)
+[UTS #18 RL1.5](https://unicode.org/reports/tr18/#Simple_Loose_Matches)

The regex crate provides full support for case insensitive matching in
accordance with RL1.5. That is, it uses the "simple" case folding mapping. The
@@ -226,7 +226,7 @@ then all characters classes are case folded as well.

## RL1.6 Line Boundaries

-[UTS #18 RL1.6](http://unicode.org/reports/tr18/#Line_Boundaries)
+[UTS #18 RL1.6](https://unicode.org/reports/tr18/#Line_Boundaries)

The regex crate only provides support for recognizing the `\n` (`END OF LINE`)
character as a line boundary. This choice was made mostly for implementation
@@ -239,7 +239,7 @@ well, and in theory, this could be done efficiently.

## RL1.7 Code Points

-[UTS #18 RL1.7](http://unicode.org/reports/tr18/#Supplementary_Characters)
+[UTS #18 RL1.7](https://unicode.org/reports/tr18/#Supplementary_Characters)

The regex crate provides full support for Unicode code point matching. Namely,
the fundamental atom of any match is always a single code point.