Question about \p script matching #351

Closed
elirnm opened this issue Mar 28, 2017 · 6 comments
elirnm commented Mar 28, 2017

The docs just say that \p{Blah} matches "Unicode character class (general category or script)." I took this to mean that it would match against anything listed under scripts on http://www.unicode.org/charts/, but there are a number of those that it doesn't recognize, including some I would have expected to work by default, such as CJK (or CJK Unified Ideographs).

Is there some documentation of which scripts are supported in \p matching?

@BurntSushi
Member

> Is there some documentation of which scripts are supported in \p matching?

The documentation is intended to be provided by Unicode.

More specifically, the scripts available in this crate are generated directly from this file: http://www.unicode.org/Public/UNIDATA/Scripts.txt

I don't know what the correspondence is between the link you provided and the Scripts.txt file. :-/ It seems like there is a lot of overlap, and maybe there is just a naming mismatch. For example, Scripts.txt contains a Han script, but I don't see a granular breakdown of CJK as listed in your chart page.
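Since the script names come from Scripts.txt, one way to see which names exist is to parse the file's data lines. The following is a minimal sketch using only the standard library. It assumes the usual layout of that file (`<code point or range> ; <Script> # comment`). The helper name `script_names` is hypothetical, and the sample lines are illustrative ones written in the file's format, not a copy of any particular Unicode version.

```rust
use std::collections::BTreeSet;

/// Collects the distinct script names from text in the `Scripts.txt` layout.
/// (Hypothetical helper for illustration; not part of the regex crate.)
fn script_names(scripts_txt: &str) -> BTreeSet<String> {
    let mut names = BTreeSet::new();
    for line in scripts_txt.lines() {
        // Everything after '#' is a comment; keep only the data part.
        let data = line.split('#').next().unwrap_or("").trim();
        if data.is_empty() {
            continue; // blank line or comment-only line
        }
        // Data lines look like "<code point or range> ; <Script>".
        if let Some((_range, script)) = data.split_once(';') {
            let script = script.trim();
            if !script.is_empty() {
                names.insert(script.to_string());
            }
        }
    }
    names
}

fn main() {
    // Illustrative excerpt in the file's format; exact ranges may differ
    // between Unicode versions.
    let sample = "\
# Scripts.txt (excerpt)

0041..005A    ; Latin # L&
0370..0373    ; Greek # L&
3400..4DBF    ; Han # Lo
4E00..9FFF    ; Han # Lo
";
    for name in script_names(sample) {
        println!("{}", name);
    }
}
```

Going by the explanation above, names found this way (for example `Han`) are the ones to try in `\p{...}`, whereas chart headings such as CJK do not appear as script names in this file.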

@elirnm
Author

elirnm commented Mar 28, 2017

Ok, thanks. It looks like Unicode explains the discrepancy on the help page for charts:

> Script and symbol groups are arranged in an order that provide the best fit for the table. The order of presentation may change as additional code charts are added. The names of blocks may be an abbreviated or otherwise modified form of the official block name. In some cases, a subhead may simply be a name which covers the content of several closely related blocks. The names of ranges are freely chosen and may change.

elirnm closed this as completed Mar 28, 2017

@BurntSushi
Member

@elirnm Well that seems... a little annoying. Maybe we should provide the full list in the regex docs somewhere if Unicode isn't going to make it more accessible themselves.

@elirnm
Author

elirnm commented Mar 29, 2017

Or at least provide a link to that scripts file so users can get some idea of what's supported?

@saona-raimundo

Hi!
I would like to revive this thread (I can open a new one if needed).
I was having trouble understanding the extent of the \p notation.
Here are my thoughts.

  • The crate-level documentation has a list. The first element links to UNICODE.md, while the second-to-last element talks about the \p notation.
    Since elements of a list are at the same level, I suggest moving the UNICODE.md link above the whole list.

  • The documentation in UNICODE.md includes a list of "all properties supported by the regex crate", but the list is missing "Greek", which is used in the examples in the documentation.
    Is the list out of date? Would you like the list to be generated from the implementation?

@BurntSushi
Member

@saona-raimundo While your comment is tangentially related, please don't bump issues closed years ago. If you have questions, you should open a new Discussion. If you click the "new issue" button, there is even an explicit category for "ask a question".

I copied your second bullet point into a distinct question here and tried to answer it: #1144

I don't understand your first bullet point. Please open a new discussion question and fill out the question more.
