Errors for invalid range indexing into strings? #5446

Closed
johnmyleswhite opened this issue Jan 19, 2014 · 10 comments
Labels
needs decision A decision on this change is needed

Comments

@johnmyleswhite
Member

I might not appreciate the reasoning behind it, but I find the following behaviors for range indexing into strings a little odd:

julia> s = string('ñ')
"ñ"

julia> s[2]
ERROR: invalid UTF-8 character index

julia> s[2:2]
""

julia> s[20:20]
""

@EyeOfPython

I would rather consider this a bug.

@ivarne
Member

ivarne commented Jan 20, 2014

Looking at the source, this seems deliberate.

function getindex(s::UTF8String, r::Range1{Int})
    a, b = first(r), last(r)
    i = isvalid(s,a) ? a : nextind(s,a)
    j = b < endof(s) ? nextind(s,b)-1 : endof(s.data)
    UTF8String(s.data[i:j])
end

It might be useful in some settings to ensure that any index range returns something, but this code seems written to hide the details of variable-length encodings rather than to help users deal with them.

The BoundsError() was hidden by @StefanKarpinski in 61a3d0d, without an explanation of why.
I lost track of the nextind() behaviour at 17d320b.

@johnmyleswhite
Member Author

I'm not sure we should allow every index range to return something. That's not how arrays work, for example:

julia> a = [1, 2, 3]
3-element Array{Int64,1}:
 1
 2
 3

julia> a[4:4]
ERROR: BoundsError()
 in copy! at array.jl:49
 in getindex at array.jl:296

@ivarne
Member

ivarne commented Jan 20, 2014

I should have mentioned that this is not how ASCIIString works either.

@ivarne
Member

ivarne commented Jan 20, 2014

This still seems funky:

julia> a = "æøå";
julia> a[1:1] * a[2:2] * a[3:3] * a[4:4] * a[5:5]
"æøå"
julia> a[1:1] * a[2:2] * a[3:3] * a[4:4] * a[5:5] * a[6:6]
ERROR: BoundsError()
julia> length(a.data)
6

@JeffBezanson
Member

Accessing a[6] is not valid. Also, range indexing had a behavior where if i and j were valid indexes, it also allowed a[i+1:j], and would adjust the first index to the next character. This is not strictly correct but provides some leeway for code that naively uses i+1. I decided to keep this behavior for maximum compatibility.

@ivarne
Member

ivarne commented Jan 20, 2014

Accessing a[2] and a[4] is also invalid, but a[2:2] and a[4:4] are valid and give an empty string.

Now that we are in the business of changing the behaviour, is there any reason not to throw the not-yet-defined InvalidUTF8Position exception for ranges that split a UTF-8 character?
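
For illustration, here is a minimal sketch of one possible reading of that proposal. It assumes both endpoints of the range must be character starts, which is stricter than the behaviour discussed in this thread. `strict_slice` is a hypothetical name, `ArgumentError` stands in for the not-yet-defined InvalidUTF8Position, and the code uses current Julia names (`ncodeunits`, `UnitRange`) rather than the ones in the source quoted above.

```julia
# Hypothetical strict range indexing: throw instead of silently adjusting.
function strict_slice(s::AbstractString, r::UnitRange{Int})
    isempty(r) && return ""
    a, b = first(r), last(r)
    # Reject ranges that reach outside the string's bytes.
    (1 <= a && b <= ncodeunits(s)) || throw(BoundsError(s, r))
    # Reject endpoints that fall inside a multi-byte character.
    # ArgumentError is a stand-in for the exception proposed in the thread.
    isvalid(s, a) || throw(ArgumentError("index $a splits a UTF-8 character"))
    isvalid(s, b) || throw(ArgumentError("index $b splits a UTF-8 character"))
    return s[a:b]
end

s = "ñ"
strict_slice(s, 1:1)       # "ñ"
# strict_slice(s, 2:2)     would throw ArgumentError (byte 2 is inside 'ñ')
# strict_slice(s, 20:20)   would throw BoundsError
```

Under this sketch the three cases from the top of the issue no longer disagree with scalar indexing: the in-bounds, mid-character range fails the same way `s[2]` does, and the far-out-of-bounds range fails like an array would.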

@JeffBezanson
Member

Giving that error is fine with me. I was just being cautious.

Again, in a[2:2] the first 2 is adjusted to the next valid character index, so you get an empty range. 6 is also adjusted to the "next" character, which is actually out of bounds. I agree this is a bit weird, but then again this implementation used to allow a[1000000:1000000].
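
The start-index adjustment described above can be sketched roughly as follows. `adjust_start` is a hypothetical helper written for this illustration, not the function in Base, and it relies on current Julia names (`ncodeunits`, `isvalid`).

```julia
# Hypothetical helper (not the Base implementation): advance a byte index
# to the next index that starts a character, or to one past the last byte
# if no later character start exists.
function adjust_start(s::AbstractString, i::Integer)
    while i <= ncodeunits(s) && !isvalid(s, i)
        i += 1
    end
    return i
end

a = "æøå"              # three 2-byte characters, 6 bytes in total
adjust_start(a, 2)     # 3: byte 2 is inside 'æ', so the start moves to 'ø'
adjust_start(a, 6)     # 7: byte 6 is inside 'å', and 7 is past the last byte
```

With the start of `a[2:2]` moved to 3, the range becomes 3:2 and is empty, while the start of `a[6:6]` moves to 7, which is out of bounds.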

@johnmyleswhite
Member Author

+1 for throwing an error

@nolta
Member

nolta commented Apr 4, 2014

This feels weird:

julia> s = "ααα"
"ααα"

julia> s[1:2]
"α"

julia> s[3:4]
"α"

julia> s[5:6]
ERROR: BoundsError()

Could we throw the BoundsError if the second index goes past sizeof, instead of endof?
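
To make the difference between the two bounds concrete, here is a small sketch in current Julia, where `endof` is named `lastindex`. `upper_bound_ok` is a hypothetical helper that illustrates the proposed check only; it is not an existing function.

```julia
s = "ααα"                # three 2-byte characters

sizeof(s)                # 6: total number of bytes
lastindex(s)             # 5: byte index where the last character starts
collect(eachindex(s))    # [1, 3, 5]

# Hypothetical check for the proposal: reject a range only when its
# last index lies beyond the final byte of the string.
upper_bound_ok(s::String, r::UnitRange{Int}) = last(r) <= sizeof(s)

upper_bound_ok(s, 5:6)   # true: byte 6 is the last byte of the final 'α'
upper_bound_ok(s, 5:7)   # false: byte 7 does not exist
```

With a check like this, `s[5:6]` would be treated the same way as `s[1:2]` and `s[3:4]`, since each range covers exactly the two bytes of one character.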


5 participants