Skip to content

checkbounds for strings #22548

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
bkamins opened this issue Jun 26, 2017 · 8 comments
Closed

checkbounds for strings #22548

bkamins opened this issue Jun 26, 2017 · 8 comments
Labels
strings "Strings!"

Comments

@bkamins
Copy link
Member

bkamins commented Jun 26, 2017

Currently checkbounds is inconsitent with getindex for String.

Example:

julia> x="∀"
"∀"

julia> checkbounds(x, 1:3)
ERROR: BoundsError: attempt to access "∀"
  at index [1:3]
Stacktrace:
 [1] checkbounds(::String, ::UnitRange{Int64}) at .\strings\basic.jl:241

julia> x[1:3]
"∀"

The consequence is that in particular in SubString we get inconsistent behavior of getindex:

julia> y = SubString(x, 1, 1)
"∀"

julia> y[1:3]
ERROR: BoundsError: attempt to access "∀"
  at index [1:3]
Stacktrace:
 [1] checkbounds(::SubString{String}, ::UnitRange{Int64}) at .\strings\basic.jl:241
 [2] getindex(::SubString{String}, ::UnitRange{Int64}) at .\strings\types.jl:82

I would recommend to make it consistent. The difference in current functionality for string x is:

  • getindex: allow indexing to sizeof(x);
  • chcekbounds allows indexing to endof(x).

I do not have a strong preference of one over the other but I feel that this should be consistent.

@tkelman tkelman added the strings "Strings!" label Jun 26, 2017
@nalimilan
Copy link
Member

I agree. I think the consensus is that indices should always point to the beginning of a codepoint. All other cases should be an error. See @stevengj's comments at #14158 (comment) and #14158 (comment), as well as the rest of the thread. I'm not sure why the PR was abandoned at the time, but a PR implementing the strict behavior would be welcome AFAICT. See also #5446.

@bkamins
Copy link
Member Author

bkamins commented Jun 26, 2017

I agree with @nalimilan. Actually #22511 it would be simpler and more consistent to implement if lower and upper index would be required to point to the beginning of a codepoint.

As this issue is revisited the additional question is if the following statements should throw an error or go through and return empty string:

For invalid indices:

x="café"
x[0:-1]
x[1:0]
x[20:1]

For valid indices but in empty range:

x[2:1]

@nalimilan
Copy link
Member

I'd use the same rules as for arrays. I just realized indexing an array with an empty range always returns an empty array, even if the range endpoints are completely out of bounds. That's kind of surprising to me, but better stick to that, and possibly change it later for both strings and arrays if we decide to.

@bkamins
Copy link
Member Author

bkamins commented Jun 26, 2017

That is exactly why I have asked. I understand that the desired rules in x[i:j] are:

  1. if i>j then always return an empty string;
  2. else
    1. if i and j are beginning of codepoint return a substring
    2. else raise error

Then the following definition should work:

function getindex(s::String, r::UnitRange{Int})
    isempty(r) && return ""
    i, j = first(r), last(r)
    isvalid(s, i) || throw(BoundsError(s, i))
    isvalid(s, j) || throw(BoundsError(s, j))
    j = nextind(s,j)-1
    unsafe_string(pointer(s,i), j-i+1)
end

@nalimilan
Copy link
Member

The semantics look correct, but this method is very performance-sensitive so we should probably stick to the current code as much as possible. AFAICT it should already throw an error when i doesn't point to the beginning of a code point, so we only need to check for invalid j by calling is_valid_continuation in the same way as for i. Or am I missing something? Anyway this will have to be benchmarked (especially for small strings) to ensure we don't pay a too high price just to make indexing rules stricter.

@bkamins
Copy link
Member Author

bkamins commented Jun 27, 2017

This is the redesigned code:

function g(s::String, r::UnitRange{Int})
    isempty(r) && return ""
    l = s.len
    i = first(r)
    (i < 1 || i > l) && throw(BoundsError(s, i))
    @inbounds si = codeunit(s, i)
    is_valid_continuation(si) && throw(BoundsError(s, i))
    j = last(r)
    j > l && throw(BoundsError(s, j))
    @inbounds sj = codeunit(s, j)
    is_valid_continuation(sj) && throw(BoundsError(s, j))
    j = nextind(s,j)-1
    unsafe_string(pointer(s,i), j-i+1)
end

and here is the benchmark:

x = "a∀"
for i in 1:2
    println("\nx[$i:$i]")
    display(@benchmark x[1:1])
    println("\ng(x,$i:$i)")
    display(@benchmark g(x, 1:1))
end

producing

x[1:1]                                  
BenchmarkTools.Trial:                   
  memory estimate:  64 bytes            
  allocs estimate:  2                   
  --------------                        
  minimum time:     49.752 ns (0.00% GC)
  median time:      53.505 ns (0.00% GC)
  mean time:        61.104 ns (8.49% GC)
  maximum time:     2.790 μs (92.57% GC)
  --------------                        
  samples:          10000               
  evals/sample:     994                 
                                        
g(x,1:1)                                
BenchmarkTools.Trial:                   
  memory estimate:  64 bytes            
  allocs estimate:  2                   
  --------------                        
  minimum time:     46.887 ns (0.00% GC)
  median time:      49.702 ns (0.00% GC)
  mean time:        57.250 ns (9.17% GC)
  maximum time:     2.368 μs (95.39% GC)
  --------------                        
  samples:          10000               
  evals/sample:     995                 
                                        
x[2:2]                                  
BenchmarkTools.Trial:                   
  memory estimate:  64 bytes            
  allocs estimate:  2                   
  --------------                        
  minimum time:     50.385 ns (0.00% GC)
  median time:      53.651 ns (0.00% GC)
  mean time:        61.642 ns (8.85% GC)
  maximum time:     2.443 μs (92.25% GC)
  --------------                        
  samples:          10000               
  evals/sample:     1000                
                                        
g(x,2:2)                                
BenchmarkTools.Trial:                   
  memory estimate:  64 bytes            
  allocs estimate:  2                   
  --------------                        
  minimum time:     49.919 ns (0.00% GC)
  median time:      53.185 ns (0.00% GC)
  mean time:        61.730 ns (8.83% GC)
  maximum time:     2.334 μs (92.88% GC)
  --------------                        
  samples:          10000               
  evals/sample:     1000                

so it seems to be roughly equivalent.
The only issue is that I understand that in the near future line l = s.len would have to be changed to something else as it is planned that String will loose this attribute.

@bkamins
Copy link
Member Author

bkamins commented Jun 27, 2017

And a small fix as codeunit can be optimized out.

function g(s::String, r::UnitRange{Int})
    isempty(r) && return ""
    l = s.len
    i = first(r)
    (i < 1 || i > l) && throw(BoundsError(s, i))
    @inbounds si = unsafe_load(pointer(s),i)
    is_valid_continuation(si) && throw(BoundsError(s, i))
    j = last(r)
    j > l && throw(BoundsError(s, j))
    @inbounds sj = unsafe_load(pointer(s),j)
    is_valid_continuation(sj) && throw(BoundsError(s, j))
    j = nextind(s,j)-1
    unsafe_string(pointer(s,i), j-i+1)
end

and corrected benchmark code:

x = "a∀"
for i in 1:2
    println("\nx[$i:$i]")
    display(@benchmark $x[1:1])
    println("\ng(x,$i:$i)")
    display(@benchmark g($x, 1:1))
end

giving

x[1:1]
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     24.259 ns (0.00% GC)
  median time:      26.125 ns (0.00% GC)
  mean time:        30.188 ns (9.53% GC)
  maximum time:     2.348 μs (96.74% GC)
  --------------
  samples:          10000
  evals/sample:     1000

g(x,1:1)
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     24.259 ns (0.00% GC)
  median time:      25.659 ns (0.00% GC)
  mean time:        30.032 ns (9.68% GC)
  maximum time:     2.290 μs (96.37% GC)
  --------------
  samples:          10000
  evals/sample:     1000

x[2:2]
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     23.793 ns (0.00% GC)
  median time:      26.125 ns (0.00% GC)
  mean time:        30.006 ns (9.61% GC)
  maximum time:     2.263 μs (96.15% GC)
  --------------
  samples:          10000
  evals/sample:     1000

g(x,2:2)
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     24.259 ns (0.00% GC)
  median time:      25.659 ns (0.00% GC)
  mean time:        30.052 ns (9.97% GC)
  maximum time:     3.207 μs (96.58% GC)
  --------------
  samples:          10000
  evals/sample:     1000

@nalimilan
Copy link
Member

Looks good, that's certainly worth a PR. But I would keep codeunit, anyway the compiler is going to inline it.

The only issue is that I understand that in the near future line l = s.len would have to be changed to something else as it is planned that String will loose this attribute.

You can just use sizeof.

bkamins added a commit to bkamins/julia that referenced this issue Jun 27, 2017
Make `getindex` for `String` check if indices are valid.
vtjnash pushed a commit to bkamins/julia that referenced this issue Sep 19, 2017
Closes JuliaLang#22548

fixes a bug with use of prevind in dates/io.jl
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
strings "Strings!"
Projects
None yet
Development

No branches or pull requests

3 participants