checkbounds for strings #22548

bkamins · 2017-06-26T07:36:05Z

Currently checkbounds is inconsitent with getindex for String.

Example:

julia> x="∀"
"∀"

julia> checkbounds(x, 1:3)
ERROR: BoundsError: attempt to access "∀"
  at index [1:3]
Stacktrace:
 [1] checkbounds(::String, ::UnitRange{Int64}) at .\strings\basic.jl:241

julia> x[1:3]
"∀"

The consequence is that in particular in SubString we get inconsistent behavior of getindex:

julia> y = SubString(x, 1, 1)
"∀"

julia> y[1:3]
ERROR: BoundsError: attempt to access "∀"
  at index [1:3]
Stacktrace:
 [1] checkbounds(::SubString{String}, ::UnitRange{Int64}) at .\strings\basic.jl:241
 [2] getindex(::SubString{String}, ::UnitRange{Int64}) at .\strings\types.jl:82

I would recommend to make it consistent. The difference in current functionality for string x is:

getindex: allow indexing to sizeof(x);
chcekbounds allows indexing to endof(x).

I do not have a strong preference of one over the other but I feel that this should be consistent.

The text was updated successfully, but these errors were encountered:

nalimilan · 2017-06-26T12:35:06Z

I agree. I think the consensus is that indices should always point to the beginning of a codepoint. All other cases should be an error. See @stevengj's comments at #14158 (comment) and #14158 (comment), as well as the rest of the thread. I'm not sure why the PR was abandoned at the time, but a PR implementing the strict behavior would be welcome AFAICT. See also #5446.

bkamins · 2017-06-26T15:20:51Z

I agree with @nalimilan. Actually #22511 it would be simpler and more consistent to implement if lower and upper index would be required to point to the beginning of a codepoint.

As this issue is revisited the additional question is if the following statements should throw an error or go through and return empty string:

For invalid indices:

x="café"
x[0:-1]
x[1:0]
x[20:1]

For valid indices but in empty range:

x[2:1]

nalimilan · 2017-06-26T15:27:09Z

I'd use the same rules as for arrays. I just realized indexing an array with an empty range always returns an empty array, even if the range endpoints are completely out of bounds. That's kind of surprising to me, but better stick to that, and possibly change it later for both strings and arrays if we decide to.

bkamins · 2017-06-26T21:29:05Z

That is exactly why I have asked. I understand that the desired rules in x[i:j] are:

if i>j then always return an empty string;
else
1. if i and j are beginning of codepoint return a substring
2. else raise error

Then the following definition should work:

function getindex(s::String, r::UnitRange{Int})
    isempty(r) && return ""
    i, j = first(r), last(r)
    isvalid(s, i) || throw(BoundsError(s, i))
    isvalid(s, j) || throw(BoundsError(s, j))
    j = nextind(s,j)-1
    unsafe_string(pointer(s,i), j-i+1)
end

nalimilan · 2017-06-27T08:26:14Z

The semantics look correct, but this method is very performance-sensitive so we should probably stick to the current code as much as possible. AFAICT it should already throw an error when i doesn't point to the beginning of a code point, so we only need to check for invalid j by calling is_valid_continuation in the same way as for i. Or am I missing something? Anyway this will have to be benchmarked (especially for small strings) to ensure we don't pay a too high price just to make indexing rules stricter.

bkamins · 2017-06-27T09:29:19Z

This is the redesigned code:

function g(s::String, r::UnitRange{Int})
    isempty(r) && return ""
    l = s.len
    i = first(r)
    (i < 1 || i > l) && throw(BoundsError(s, i))
    @inbounds si = codeunit(s, i)
    is_valid_continuation(si) && throw(BoundsError(s, i))
    j = last(r)
    j > l && throw(BoundsError(s, j))
    @inbounds sj = codeunit(s, j)
    is_valid_continuation(sj) && throw(BoundsError(s, j))
    j = nextind(s,j)-1
    unsafe_string(pointer(s,i), j-i+1)
end

and here is the benchmark:

x = "a∀"
for i in 1:2
    println("\nx[$i:$i]")
    display(@benchmark x[1:1])
    println("\ng(x,$i:$i)")
    display(@benchmark g(x, 1:1))
end

producing

x[1:1]                                  
BenchmarkTools.Trial:                   
  memory estimate:  64 bytes            
  allocs estimate:  2                   
  --------------                        
  minimum time:     49.752 ns (0.00% GC)
  median time:      53.505 ns (0.00% GC)
  mean time:        61.104 ns (8.49% GC)
  maximum time:     2.790 μs (92.57% GC)
  --------------                        
  samples:          10000               
  evals/sample:     994                 
                                        
g(x,1:1)                                
BenchmarkTools.Trial:                   
  memory estimate:  64 bytes            
  allocs estimate:  2                   
  --------------                        
  minimum time:     46.887 ns (0.00% GC)
  median time:      49.702 ns (0.00% GC)
  mean time:        57.250 ns (9.17% GC)
  maximum time:     2.368 μs (95.39% GC)
  --------------                        
  samples:          10000               
  evals/sample:     995                 
                                        
x[2:2]                                  
BenchmarkTools.Trial:                   
  memory estimate:  64 bytes            
  allocs estimate:  2                   
  --------------                        
  minimum time:     50.385 ns (0.00% GC)
  median time:      53.651 ns (0.00% GC)
  mean time:        61.642 ns (8.85% GC)
  maximum time:     2.443 μs (92.25% GC)
  --------------                        
  samples:          10000               
  evals/sample:     1000                
                                        
g(x,2:2)                                
BenchmarkTools.Trial:                   
  memory estimate:  64 bytes            
  allocs estimate:  2                   
  --------------                        
  minimum time:     49.919 ns (0.00% GC)
  median time:      53.185 ns (0.00% GC)
  mean time:        61.730 ns (8.83% GC)
  maximum time:     2.334 μs (92.88% GC)
  --------------                        
  samples:          10000               
  evals/sample:     1000

so it seems to be roughly equivalent.
The only issue is that I understand that in the near future line l = s.len would have to be changed to something else as it is planned that String will loose this attribute.

bkamins · 2017-06-27T09:34:12Z

And a small fix as codeunit can be optimized out.

function g(s::String, r::UnitRange{Int})
    isempty(r) && return ""
    l = s.len
    i = first(r)
    (i < 1 || i > l) && throw(BoundsError(s, i))
    @inbounds si = unsafe_load(pointer(s),i)
    is_valid_continuation(si) && throw(BoundsError(s, i))
    j = last(r)
    j > l && throw(BoundsError(s, j))
    @inbounds sj = unsafe_load(pointer(s),j)
    is_valid_continuation(sj) && throw(BoundsError(s, j))
    j = nextind(s,j)-1
    unsafe_string(pointer(s,i), j-i+1)
end

and corrected benchmark code:

x = "a∀"
for i in 1:2
    println("\nx[$i:$i]")
    display(@benchmark $x[1:1])
    println("\ng(x,$i:$i)")
    display(@benchmark g($x, 1:1))
end

giving

x[1:1]
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     24.259 ns (0.00% GC)
  median time:      26.125 ns (0.00% GC)
  mean time:        30.188 ns (9.53% GC)
  maximum time:     2.348 μs (96.74% GC)
  --------------
  samples:          10000
  evals/sample:     1000

g(x,1:1)
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     24.259 ns (0.00% GC)
  median time:      25.659 ns (0.00% GC)
  mean time:        30.032 ns (9.68% GC)
  maximum time:     2.290 μs (96.37% GC)
  --------------
  samples:          10000
  evals/sample:     1000

x[2:2]
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     23.793 ns (0.00% GC)
  median time:      26.125 ns (0.00% GC)
  mean time:        30.006 ns (9.61% GC)
  maximum time:     2.263 μs (96.15% GC)
  --------------
  samples:          10000
  evals/sample:     1000

g(x,2:2)
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     24.259 ns (0.00% GC)
  median time:      25.659 ns (0.00% GC)
  mean time:        30.052 ns (9.97% GC)
  maximum time:     3.207 μs (96.58% GC)
  --------------
  samples:          10000
  evals/sample:     1000

nalimilan · 2017-06-27T09:48:22Z

Looks good, that's certainly worth a PR. But I would keep codeunit, anyway the compiler is going to inline it.

The only issue is that I understand that in the near future line l = s.len would have to be changed to something else as it is planned that String will loose this attribute.

You can just use sizeof.

Make `getindex` for `String` check if indices are valid.

Closes JuliaLang#22548 fixes a bug with use of prevind in dates/io.jl

tkelman added the strings "Strings!" label Jun 26, 2017

bkamins mentioned this issue Jun 26, 2017

Fixes constructor of SubString from SubString #22511

Merged

bkamins added a commit to bkamins/julia that referenced this issue Jun 27, 2017

Closes JuliaLang#22548

d135378

Make `getindex` for `String` check if indices are valid.

bkamins mentioned this issue Jun 27, 2017

Make getindex for String check if indices are valid #22572

Merged

nalimilan mentioned this issue Sep 7, 2017

WIP: Remove RevString #23612

Closed

vtjnash pushed a commit to bkamins/julia that referenced this issue Sep 19, 2017

Make getindex for String check if all indices are valid

ed2d32f

Closes JuliaLang#22548 fixes a bug with use of prevind in dates/io.jl

KristofferC closed this as completed in 0058a4f Sep 20, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

checkbounds for strings #22548

checkbounds for strings #22548

bkamins commented Jun 26, 2017

nalimilan commented Jun 26, 2017

Uh oh!

bkamins commented Jun 26, 2017 •

edited

Loading

Uh oh!

nalimilan commented Jun 26, 2017

Uh oh!

bkamins commented Jun 26, 2017 •

edited

Loading

Uh oh!

nalimilan commented Jun 27, 2017

Uh oh!

bkamins commented Jun 27, 2017

Uh oh!

bkamins commented Jun 27, 2017

Uh oh!

nalimilan commented Jun 27, 2017

Uh oh!

Uh oh!

checkbounds for strings #22548

checkbounds for strings #22548

Comments

bkamins commented Jun 26, 2017

nalimilan commented Jun 26, 2017

Uh oh!

bkamins commented Jun 26, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nalimilan commented Jun 26, 2017

Uh oh!

bkamins commented Jun 26, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nalimilan commented Jun 27, 2017

Uh oh!

bkamins commented Jun 27, 2017

Uh oh!

bkamins commented Jun 27, 2017

Uh oh!

nalimilan commented Jun 27, 2017

Uh oh!

bkamins commented Jun 26, 2017 •

edited

Loading

bkamins commented Jun 26, 2017 •

edited

Loading