JSON.parse fails for cweb/url-testing/master/urls.json #232
Closing based on the following paragraph of the JSON standard:

There is nothing we can sensibly return in this case. A fatal runtime exception is reasonable.
|
This is of course a consequence of using UTF-8, as opposed to the UTF-16 that WebKit (or whoever these tests are intended for) uses. The Unicode FAQ clarifies the situation of code points like
|
Node handles this file without issues:

```
$ node
> var fs = require("fs");
> var content = fs.readFileSync("urls.json.txt");
> var jsonContent = JSON.parse(content);
> jsonContent.tests.group[0].test[4]
{ id: '5',
  name: 'Illegal Unicode half-surrogate U+D800',
  url: 'http://www.example.com/#\\�\\好',
  expect_url: 'http://www.example.com/#\\�\\好' }
```

I don't think the underlying encoding (UTF-8 vs UTF-16) is relevant when dealing with `\u` escape sequences. My understanding is that there are no "invalid" Strings in Julia v0.7. @StefanKarpinski?

```julia
while true
    s = String(rand(UInt8, 100))
    println(s)
    println([c for c in s])
end
```

```julia
julia> VERSION
v"0.7.0-DEV.3429"

julia> x = "foo\uDEADbar"
"foo\udeadbar"

julia> Vector{Char}(x)
7-element Array{Char,1}:
 'f'
 'o'
 'o'
 '\udead'
 'b'
 'a'
 'r'
```

With the patch below, the file parses as expected:

```julia
julia> x = JSON.parse(String(read("urls.json.txt")))

julia> x["tests"]["group"][1]["test"][5]
Dict{String,Any} with 4 entries:
  "name"       => "Illegal Unicode half-surrogate U+D800"
  "expect_url" => "http://www.example.com/#\\�\\好"
  "id"         => "5"
  "url"        => "http://www.example.com/#\\\ud800\\好"
```

```diff
diff --git a/src/Parser.jl b/src/Parser.jl
index de04aa0..fd66ef3 100644
--- a/src/Parser.jl
+++ b/src/Parser.jl
@@ -265,6 +265,7 @@ end
 function parse_string(ps::ParserState)
     b = UInt8[]
+    surrogate = UInt16(0x0000)
     incr!(ps)  # skip opening quote
     while true
         c = advance!(ps)
@@ -272,19 +273,29 @@ function parse_string(ps::ParserState)
         if c == BACKSLASH
             c = advance!(ps)
             if c == LATIN_U  # Unicode escape
-                append!(b, Vector{UInt8}(string(read_unicode_escape!(ps))))
+                u1 = read_four_hex_digits!(ps)
+                if surrogate > 0
+                    append!(b, Vector{UInt8}(string(utf16_get_supplementary(surrogate, u1))))
+                elseif utf16_is_surrogate(u1)
+                    surrogate = u1
+                else
+                    append!(b, Vector{UInt8}(string(Char(u1))))
+                end
+                continue
             else
                 c = get(ESCAPES, c, 0x00)
                 c == 0x00 && _error(E_BAD_ESCAPE, ps)
-                push!(b, c)
             end
-            continue
         elseif c < SPACE
             _error(E_BAD_CONTROL, ps)
         elseif c == STRING_DELIM
             return String(b)
         end
+        if surrogate > 0
+            append!(b, Vector{UInt8}(string(Char(surrogate))))
+            surrogate = UInt16(0x0000)
+        end
         push!(b, c)
     end
 end
diff --git a/src/specialized.jl b/src/specialized.jl
index e139c80..d8a05fd 100644
--- a/src/specialized.jl
+++ b/src/specialized.jl
@@ -1,4 +1,5 @@
 # Specialized functions for increased performance when JSON is in-memory
+#=
 function parse_string(ps::MemoryParserState)
     # "Dry Run": find length of string so we can allocate the right amount of
     # memory from the start. Does not do full error checking.
@@ -20,6 +21,7 @@ function parse_string(ps::MemoryParserState)
     String(b)
 end
+=#
 """
 Scan through a string at the current parser state and return a tuple containing
@@ -43,6 +45,7 @@ This function will throw an error if:
 No error is thrown when other invalid backslash escapes are encountered.
 """
+#=
 function predict_string(ps::MemoryParserState)
     e = endof(ps.utf8data)
     fastpath = true  # true if no escapes in this string, so it can be copied
@@ -123,6 +126,7 @@ function parse_string(ps::MemoryParserState, b)
     ps.s = s + 1
     b
 end
+=#
 function parse_number(ps::MemoryParserState)
     s = p = ps.s
```
|
To me, this says that the specification allows "invalid" Unicode.
To me, "or even" says that runtime exceptions are the least desirable behaviour. |
As far as I know, Node encodes strings in UTF-16, where these code points do make "sense". I am aware that Julia v0.7 does support such code points, but whether we want to create them is another matter entirely. Even in Julia v0.7 we have the following:

```julia
julia> isvalid("\ud800")
false
```

(If we make this change, we would also want to discontinue support for v0.6, which treats such things as "invalid UTF-8 data".) My interpretation of "surrogate code points are not valid Unicode scalar values, are not valid in UTFs, and cannot be mapped through UTFs" is that mapping these is a violation of the Unicode standard.
|
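Even on the UTF-16 side of the fence, lone surrogates are only conditionally usable: JavaScript will happily hold one in a string, but an operation that must produce well-formed UTF-8, such as `encodeURIComponent`, rejects it with a `URIError`. A small illustration (the helper name is mine):

```javascript
// Returns true if encodeURIComponent refuses to UTF-8-encode the
// given string, i.e. it contains an unpaired surrogate.
function rejectsUtf8Encoding(s) {
  try {
    encodeURIComponent(s); // must percent-encode the UTF-8 bytes
    return false;
  } catch (e) {
    return e instanceof URIError;
  }
}

console.log(rejectsUtf8Encoding("\uD800")); // true
console.log(rejectsUtf8Encoding("foo"));    // false
```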
Out of curiosity, what is your use case for parsing these half-surrogates? I can't imagine any well-formed "real" data will have them. If the use case is to parse non-well-formed data, I would argue that this is a non-standard use case and should be supported via non-standard string parsing contexts, which can be implemented like the existing non-standard integer contexts. |
I think the fact that Julia separates
The JSON only comes into this because the URL test suite originates in JavaScript and JSON is JavaScript's serialisation format. It seems to me that the simplest way to think about this is that if Julia is trying to be compatible with JavaScript's serialisation format, then it should handle anything that JavaScript can handle.
A serialisation/deserialisation mechanism should be tested for corner cases as well as "expected likely real data". Anything that can be deserialised to Julia, then reserialised to JSON without changing it, should be handled. If Julia had no consistent way of representing U+D800 in a
See http://seriot.ch/parsing_json.php, under "Escaped Invalid Characters":
See also https://github.com/nst/JSONTestSuite
|
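The round-trip requirement is easy to check against Node: since ES2019's "well-formed `JSON.stringify`", lone surrogates survive a parse/stringify cycle as `\u` escapes rather than as ill-formed bytes. A sketch (assumes a reasonably modern Node):

```javascript
// Round trip: parse a JSON text containing a lone-surrogate escape,
// then serialise the resulting value again. Well-formed JSON.stringify
// (ES2019) re-escapes the unpaired surrogate instead of emitting
// ill-formed output, so the text is preserved.
function roundTrip(jsonText) {
  return JSON.stringify(JSON.parse(jsonText));
}

console.log(roundTrip('"\\ud800"')); // '"\ud800"' stays escaped
```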
I will reopen this for additional discussion. I am still not convinced that creating invalid data by default is a safe policy. Perhaps we can default to replacing them with U+FFFD? |
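A U+FFFD-replacement default could be implemented as a post-processing pass over parsed strings. A sketch of the idea (in JavaScript for illustration; the names are mine, and it assumes regex lookbehind support):

```javascript
// Match a high surrogate not followed by a low surrogate, or a low
// surrogate not preceded by a high surrogate (i.e. any lone surrogate).
const LONE_SURROGATE =
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g;

// Replace each lone surrogate with U+FFFD, the replacement character,
// leaving well-formed surrogate pairs untouched.
function scrub(s) {
  return s.replace(LONE_SURROGATE, "\uFFFD");
}
```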
What we've learned the hard way – i.e. by trying it and finding that it causes rampant problems – is that you should not have APIs where errors are unavoidably thrown due to bad data beyond your control. In such cases, the only solution is to use try/catch which is generally not considered good Julia code, especially in contexts where performance matters. So your other options here are:
The current situation in the JSON library seems to be neither of these and to fall very much into the "failing due to bad data beyond the user's control" camp, which is not good. In this case I can't see any reason not to treat invalid data the same way the new |
After thinking it over, WTF-8 is not too terrible, although I would still eventually like this to be customizable through a StringContext (certainly I would like the ability to enforce validity without having to recursively descend through the parsed object). @samoconnor would you be able to rebase the patch above without commenting out the performance optimization specialization, and submit it as a PR? |
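For reference, WTF-8 is simply UTF-8's encoding scheme with the surrogate restriction dropped: code points U+D800..U+DFFF get the ordinary three-byte pattern. A sketch of such a generalised encoder (illustrative only, not JSON.jl code):

```javascript
// Generalised UTF-8 ("WTF-8") encoder for a single code point.
// Unlike strict UTF-8 it does not reject surrogates U+D800..U+DFFF;
// they fall through to the normal three-byte pattern.
function wtf8Encode(cp) {
  if (cp < 0x80) return [cp];
  if (cp < 0x800) return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];
  if (cp < 0x10000)
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
  return [
    0xF0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
  ];
}

// U+D800 encodes as ED A0 80, bytes that a strict UTF-8 decoder rejects.
```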
What's so bad about recursively descending through the parsed object? If you want to find out if anything is invalid, that's easy and efficient to implement as a recursive function. If you want to replace invalid data with replacement characters or replace it with nothing, that's also easy to implement and the returned structure can share all objects that are fully valid without any copying, so unless you're expecting many objects to be invalid, it'll be efficient enough. |
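The recursive validity scan described here is indeed short. A sketch in JavaScript (the function name is mine):

```javascript
// Recursively scan a parsed JSON value for strings containing lone
// surrogates; returns true only if every string is well-formed.
function isWellFormedTree(value) {
  if (typeof value === "string") {
    // A high surrogate without a following low one, or vice versa.
    return !/[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/.test(value);
  }
  if (Array.isArray(value)) return value.every(isWellFormedTree);
  if (value !== null && typeof value === "object") {
    return Object.entries(value).every(
      ([k, v]) => isWellFormedTree(k) && isWellFormedTree(v)
    );
  }
  return true; // numbers, booleans, null
}
```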
Hi @TotalVerb, that patch is just a proof-of-concept showing that a simple hack can make the parser accept the test input. I don't know if it is suitable for production; e.g. it comments out most of
Something to consider: there might be advantages to adding a
|
@StefanKarpinski The direction that JSON.jl has moved in lately is to allow greater customization of the parser through the
@samoconnor I may have time later this week to look into it, and perhaps get started on ParserContext expansion at the same time. Thanks for the issue report and discussion!
|
Note also https://tools.ietf.org/html/rfc8259#section-9

So there is really no choice for a compliant parser: you MUST accept the text irrespective of the higher-level grouping or meaning of
|
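Concretely, the RFC 8259 grammar admits `\u` followed by any four hex digits, paired or not, so a grammar-level parser has to accept lone-surrogate escapes. Node's parser demonstrates the point (a small check, not JSON.jl code):

```javascript
// The JSON grammar accepts \u followed by any four hex digits, even
// when the escape denotes an unpaired surrogate code unit.
const s = JSON.parse('"a\\uDEADb"');
console.log(s.length);                    // 3 UTF-16 code units
console.log(s.charCodeAt(1).toString(16)); // "dead"
```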
FYI, this got me thinking enough to put together some proof of concept code: |
I think that kind of implementation would be pretty easy to integrate w/ JSON2.jl |
Implementation of |
This file implements Julia's
Feel free to borrow ideas/snippets/whole files for use in JSON.jl. Background info: https://discourse.julialang.org/t/announce-a-different-way-to-read-json-data-lazyjson-jl/9046
|
Input: https://raw.githubusercontent.com/cweb/url-testing/master/urls.json