[julia-dev] Even faster string equality (see #16863)

Discussion:

Scott Jones

2016-06-09 03:55:11 UTC

I have made a branch with an even faster version of == for the String type
(it took 28%-40% less time in my testing).
Instead of calling lexcmp, which has to deal with strings of different
length, it calls memcmp directly if the string lengths are equal, and
checks if the result is 0.

https://github.com/ScottPJones/julia/tree/spj/fasteqstr

If somebody could make a PR out of this branch, it would be appreciated.

Thanks in advance,
Scott

Jeffrey Sarnoff

2016-06-09 04:46:46 UTC

Permalink

looks sane

Post by Scott Jones
I have made a branch with an even faster version of == for the String type
(it took 28%-40% less time in my testing).
Instead of calling lexcmp, which has to deal with strings of different
length, it calls memcmp directly if the string lengths are equal, and
checks if the result is 0.
https://github.com/ScottPJones/julia/tree/spj/fasteqstr
If somebody could make a PR out of this branch, it would be appreciated.
Thanks in advance,
Scott

Páll Haraldsson

2016-07-13 13:48:01 UTC

Permalink

Post by Jeffrey Sarnoff
looks sane

Are you sure? I'm not sure what lexcmp does (or should do..), but if say
one string has precomposed Unicode letters, and the other doesn't, should
it be possible for them to compare the same? Maybe it's not preferable by
default? Or is it..?

These things can all be tricky; E.g. minor recent upgrades of PostgreSQL
are generally considered safe (and recommended over not upgrading), with a
rare "exception" (just just need to be careful and REINDEX afterwards, as
broken in a former minor update..):

https://www.postgresql.org/docs/9.5/static/release-9-5-2.html

"PostgreSQL 9.5 introduced logic for speeding up comparisons of string data
types by using the standard C library function strxfrm() as a substitute
for strcoll(). It now emerges that most versions of glibc (Linux's
implementation of the C library) have buggy implementations of strxfrm()
that, in some locales, can produce string comparison results that do not
match strcoll(). Until this problem can be better characterized, disable
the optimization in all non-C locales. (C locale is safe since it uses
neither strcoll() nor strxfrm().)

Unfortunately, this problem affects not only sorting but also entry
ordering in B-tree indexes, which means that B-tree indexes on text, varchar,
or char columns may now be corrupt if they sort according to an affected
locale and were built or modified under PostgreSQL 9.5.0 or 9.5.1. Users
should REINDEX indexes that might be affected.

It is not possible at this time to give an exhaustive list of
known-affected locales. C locale is known safe, and there is no evidence of
trouble in English-based locales such as en_US, but some other popular
locales such as de_DE are affected in most glibc versions."

Post by Jeffrey Sarnoff

Post by Scott Jones
I have made a branch with an even faster version of == for the String
type (it took 28%-40% less time in my testing).
Instead of calling lexcmp, which has to deal with strings of different
length, it calls memcmp directly if the string lengths are equal, and
checks if the result is 0.
https://github.com/ScottPJones/julia/tree/spj/fasteqstr
If somebody could make a PR out of this branch, it would be appreciated.
Thanks in advance,
Scott

Jeffrey Sarnoff

2016-07-13 13:56:37 UTC

Permalink

PÃ¡ll Haraldsson --
Absolutely. Definitely. <hat tip>. Good catch.

Post by PÃ¡ll Haraldsson

Post by Jeffrey Sarnoff
looks sane

Are you sure? I'm not sure what lexcmp does (or should do..), but if say
one string has precomposed Unicode letters, and the other doesn't, should
it be possible for them to compare the same? Maybe it's not preferable by
default? Or is it..?
These things can all be tricky; E.g. minor recent upgrades of PostgreSQL
are generally considered safe (and recommended over not upgrading), with a
rare "exception" (just just need to be careful and REINDEX afterwards, as
https://www.postgresql.org/docs/9.5/static/release-9-5-2.html
"PostgreSQL 9.5 introduced logic for speeding up comparisons of string
data types by using the standard C library function strxfrm() as a
substitute for strcoll(). It now emerges that most versions of glibc
(Linux's implementation of the C library) have buggy implementations of
strxfrm() that, in some locales, can produce string comparison results
that do not match strcoll(). Until this problem can be better
characterized, disable the optimization in all non-C locales. (C locale
is safe since it uses neither strcoll() nor strxfrm().)
Unfortunately, this problem affects not only sorting but also entry
ordering in B-tree indexes, which means that B-tree indexes on text,
varchar, or char columns may now be corrupt if they sort according to an
affected locale and were built or modified under PostgreSQL 9.5.0 or
9.5.1. Users should REINDEX indexes that might be affected.
It is not possible at this time to give an exhaustive list of
known-affected locales. C locale is known safe, and there is no evidence
of trouble in English-based locales such as en_US, but some other popular
locales such as de_DE are affected in most glibc versions."

Post by Jeffrey Sarnoff

Post by Scott Jones
I have made a branch with an even faster version of == for the String
type (it took 28%-40% less time in my testing).
Instead of calling lexcmp, which has to deal with strings of different
length, it calls memcmp directly if the string lengths are equal, and
checks if the result is 0.
https://github.com/ScottPJones/julia/tree/spj/fasteqstr
If somebody could make a PR out of this branch, it would be appreciated.
Thanks in advance,
Scott

Páll Haraldsson

2016-07-13 14:36:52 UTC

Permalink

Páll Haraldsson --
Absolutely. Definitely. <hat tip>. Good catch.

This is already merged.. so needs to be reverted in case I'm right on
the default behavior..

Another thing, but might be to late to change the default bahavior of
sorts (comparisons):

https://blog.codinghorror.com/sorting-for-humans-natural-sort-order/

I'm partial to "natural sort" being the default.. Then we could have
non-default ASCIIbetical or Unicode disregarding precomposed etc.
whatever is fastest.
--
Palli.

*
*
looks sane
Are you sure? I'm not sure what lexcmp does (or should do..), but if
say one string has precomposed Unicode letters, and the other
doesn't, should it be possible for them to compare the same? Maybe
it's not preferable by default? Or is it..?
These things can all be tricky; E.g. minor recent upgrades of
PostgreSQL are generally considered safe (and recommended over not
upgrading), with a rare "exception" (just just need to be careful
https://www.postgresql.org/docs/9.5/static/release-9-5-2.html
<https://www.postgresql.org/docs/9.5/static/release-9-5-2.html>
"PostgreSQL 9.5 introduced logic for speeding up comparisons of
string data types by using the standard C library function
|strxfrm()| as a substitute for |strcoll()|. It now emerges that
most versions of glibc (Linux's implementation of the C library)
have buggy implementations of |strxfrm()| that, in some locales, can
produce string comparison results that do not match |strcoll()|.
Until this problem can be better characterized, disable the
optimization in all non-C locales. (C locale is safe since it uses
neither |strcoll()| nor |strxfrm()|.)
Unfortunately, this problem affects not only sorting but also entry
ordering in B-tree indexes, which means that B-tree indexes on text,
varchar, or char columns may now be corrupt if they sort according
to an affected locale and were built or modified under PostgreSQL
9.5.0 or 9.5.1. Users should REINDEX indexes that might be affected.
It is not possible at this time to give an exhaustive list of
known-affected locales. C locale is known safe, and there is no
evidence of trouble in English-based locales such as en_US, but some
other popular locales such as de_DE are affected in most glibc
versions."
I have made a branch with an even faster version of == for
the String type (it took 28%-40% less time in my testing).
Instead of calling lexcmp, which has to deal with strings of
different length, it calls memcmp directly if the string
lengths are equal, and checks if the result is 0.
https://github.com/ScottPJones/julia/tree/spj/fasteqstr
<https://github.com/ScottPJones/julia/tree/spj/fasteqstr>
If somebody could make a PR out of this branch, it would be
appreciated.
Thanks in advance,
Scott

Stefan Karpinski

2016-07-13 14:56:00 UTC

Permalink

We're moving towards the built-in String type being essentially just a
container for bytes with a UTF-8-like interpretation. Thus if two String
objects don't have the same underlying data, they are not equal. If you
want fancier Unicode-aware behavior like UTF-8 validation or normalized
comparison, then you'll need to use a package that provides a type like
EncodedString{UTF8} (or something). Care needs to be taken to make sure
that == is transitive, but that's a problem for the to-be-created external
Unicode strings package.

Post by Jeffrey Sarnoff
PÃ¡ll Haraldsson --
Absolutely. Definitely. <hat tip>. Good catch.

This is already merged.. so needs to be reverted in case I'm right on the
default behavior..
Another thing, but might be to late to change the default bahavior of
https://blog.codinghorror.com/sorting-for-humans-natural-sort-order/
I'm partial to "natural sort" being the default.. Then we could have
non-default ASCIIbetical or Unicode disregarding precomposed etc. whatever
is fastest.
--
Palli.
*

Post by Jeffrey Sarnoff
*
looks sane
Are you sure? I'm not sure what lexcmp does (or should do..), but if
say one string has precomposed Unicode letters, and the other
doesn't, should it be possible for them to compare the same? Maybe
it's not preferable by default? Or is it..?
These things can all be tricky; E.g. minor recent upgrades of
PostgreSQL are generally considered safe (and recommended over not
upgrading), with a rare "exception" (just just need to be careful
https://www.postgresql.org/docs/9.5/static/release-9-5-2.html
<https://www.postgresql.org/docs/9.5/static/release-9-5-2.html>
"PostgreSQL 9.5 introduced logic for speeding up comparisons of
string data types by using the standard C library function
|strxfrm()| as a substitute for |strcoll()|. It now emerges that
most versions of glibc (Linux's implementation of the C library)
have buggy implementations of |strxfrm()| that, in some locales, can
produce string comparison results that do not match |strcoll()|.
Until this problem can be better characterized, disable the
optimization in all non-C locales. (C locale is safe since it uses
neither |strcoll()| nor |strxfrm()|.)
Unfortunately, this problem affects not only sorting but also entry
ordering in B-tree indexes, which means that B-tree indexes on text,
varchar, or char columns may now be corrupt if they sort according
to an affected locale and were built or modified under PostgreSQL
9.5.0 or 9.5.1. Users should REINDEX indexes that might be affected.
It is not possible at this time to give an exhaustive list of
known-affected locales. C locale is known safe, and there is no
evidence of trouble in English-based locales such as en_US, but some
other popular locales such as de_DE are affected in most glibc
versions."
I have made a branch with an even faster version of == for
the String type (it took 28%-40% less time in my testing).
Instead of calling lexcmp, which has to deal with strings of
different length, it calls memcmp directly if the string
lengths are equal, and checks if the result is 0.
https://github.com/ScottPJones/julia/tree/spj/fasteqstr
<https://github.com/ScottPJones/julia/tree/spj/fasteqstr>
If somebody could make a PR out of this branch, it would be
appreciated.
Thanks in advance,
Scott

Páll Haraldsson

2016-07-13 17:11:34 UTC

Permalink

Post by Stefan Karpinski
We're moving towards the built-in String type being essentially just a
container for bytes with a UTF-8-like interpretation. Thus if two String
objects don't have the same underlying data, they are not equal. If you
want fancier Unicode-aware behavior

Ok, we can define it that way, then the optimization that is already merged
is ok.

This still does not imply anything for ordering. That must be defined.
ASCIIbetical is sometimes used (but is bad), similar for Unicode would also
be a disaster. Since I want my alphabet, e.g. a Ã¡ b c.., then it's not a
leap to correclty order Ã¡ weather it's precomposed or not.

Post by Stefan Karpinski
like UTF-8 validation or normalized comparison, then you'll need to use a
package that provides a type like EncodedString{UTF8} (or something). Care
needs to be taken to make sure that == is transitive, but that's a problem
for the to-be-created external Unicode strings package.

Is it at least easy to include a package with new behavior that substitutes
the default string type? If not with the same name, at least all literal
will get the new type?

If you must define some order, maybe it's better that equal is not dumb..

--
Palli.

Steven G. Johnson

2016-07-13 18:46:43 UTC

Permalink

Rather than defining a whole new string type, you can also just normalize
your strings (with the normalize_string function) before comparing them, in
the rare cases where you want equality that ignores e.g. differences in
encoding combining characters.

You could even overload a new infix operator (e.g. â) that performs
normalized equality tests of whatever flavor you like.

There's no single correct answer here, because it depends on what
differences you want to ignore. Do you want to ignore differences in
codepoint composition that lead to logically identical strings? Use NFC
normalization. Do you also want to ignore differences in "confusable"
characters like Âµ (micro) vs ÎŒ (greek mu)? Use NFKC normalization. Do
you want to ignore all diacritical marks, e.g. treat a and Ã¡ the same? Use
the stripmark=true option. Do you want to ignore case? Use the
casefold=true option.

Julia's default of == meaning "same codepoints" is reasonable and fast
(whereas anything requiring normalization will be much slower).

Steven G. Johnson

2016-07-13 18:48:59 UTC

Permalink

Similarly, you can use the "by" keyword option of the sort function if you
want to sort strings according to one of their normalized forms.

Scott Jones

2016-07-13 22:23:38 UTC

Permalink

Post by Steven G. Johnson
Rather than defining a whole new string type, you can also just normalize
your strings (with the normalize_string function) before comparing them, in
the rare cases where you want equality that ignores e.g. differences in
encoding combining characters.
You could even overload a new infix operator (e.g. â) that performs
normalized equality tests of whatever flavor you like.
There's no single correct answer here, because it depends on what
differences you want to ignore. Do you want to ignore differences in
codepoint composition that lead to logically identical strings? Use NFC
normalization. Do you also want to ignore differences in "confusable"
characters like Âµ (micro) vs ÎŒ (greek mu)? Use NFKC normalization. Do
you want to ignore all diacritical marks, e.g. treat a and Ã¡ the same? Use
the stripmark=true option. Do you want to ignore case? Use the
casefold=true option.
Julia's default of == meaning "same codepoints" is reasonable and fast
(whereas anything requiring normalization will be much slower).

Unfortunately, that is simply not true (and as far as I've seen has never
been) for UTF8String (and now String).

Since Julia never requires that strings be valid, there are many different
ways of encoding a single code point using either overly long encodings
(such as 0xc0 0x80, which is an invalid (overly long) encoding of 0x00), or
encoding non-BMP characters with 6 bytes, i.e. as two 3 byte surrogate
characters, which also cause strings to not sort correctly), and those will
give bad results. Fixing those issues would make the performance be far
worse than it already is.
At least, with convert and utf8 in v0.4.x (because of my PRs), you could
eliminate those problems and ensure that you had valid UTF-8 encoded
strings before comparisons, but now, in v0.5.x, you'd need to add the
LegacyStrings package to gain back that functionality.
It would be possible, although rather inefficient, to do something like
==(x::String,y::String) = convert(Vector{Char},x) ==
convert(Vector{Char},y), and then you also have to deal with all sorts of
hashing problems as well.
Just to be very clear here - I am *not* talking about composed Unicode
characters - which is another matter entirely, I'm talking about simply
trying to compare two Unicode strings accepted by String.

This is one of the reasons I felt that the string changes that have gone
into v0.5 were a mistake. (I'm not saying either that the way strings are
in v0.3 were good either, with 4 different inconsistent basic string types,
and my fixes in v0.4 didn't change that mess).

Steven G. Johnson

2016-07-17 19:07:59 UTC

Permalink

Post by Steven G. Johnson
Julia's default of == meaning "same codepoints" is reasonable and fast

Post by Steven G. Johnson
(whereas anything requiring normalization will be much slower).

Unfortunately, that is simply not true (and as far as I've seen has never
been) for UTF8String (and now String).

It's true for valid strings, which is what most applications are concerned
with. For invalid encodings, it's true that we compare encodings. I don't
see a problem with that. If the user just stuffed arbitrary binary data
into a string, comparing encodings (valid or not) seems to be the most
reasonable choice of == without knowing application-specific information.

(For example, if your application knows that you are importing "modified
UTF-8" from Java code that has (invalid) surrogate pairs, you can do an
application-specific preprocessing step to convert these into proper UTF-8
if you want to use the String type. But it would be inappropriate for
Base.String to do this preprocessing for you, because Base.String doesn't
know the origin of your invalid encoding.)

Scott Jones

2016-07-17 23:37:28 UTC

Permalink

Then what you are saying is that == for String type means simply "same code
units", which is not at all the same thing as "same codepoints".

This isn't the case of just arbitrary binary data, these are UTF-8
encodings that are going to be frequently found, whether coming from Java
(or languages using the JVM) or many databases.

I don't think it is inappropriate for `convert(String, x::Vector{UInt8})`
to make sure that the string is canonically encoded, it can save a lot of
grief when people UTF-8 data from a variety of sources.

Post by Steven G. Johnson

Post by Steven G. Johnson
Julia's default of == meaning "same codepoints" is reasonable and fast

Post by Steven G. Johnson
(whereas anything requiring normalization will be much slower).

Unfortunately, that is simply not true (and as far as I've seen has never
been) for UTF8String (and now String).

It's true for valid strings, which is what most applications are concerned
with. For invalid encodings, it's true that we compare encodings. I don't
see a problem with that. If the user just stuffed arbitrary binary data
into a string, comparing encodings (valid or not) seems to be the most
reasonable choice of == without knowing application-specific information.
(For example, if your application knows that you are importing "modified
UTF-8" from Java code that has (invalid) surrogate pairs, you can do an
application-specific preprocessing step to convert these into proper UTF-8
if you want to use the String type. But it would be inappropriate for
Base.String to do this preprocessing for you, because Base.String doesn't
know the origin of your invalid encoding.)

Scott Jones

2016-07-13 20:26:28 UTC

Permalink

Julia, AFAIK, has never compared UTF8String (or now String) in any way except for just checking if the encoded bytes are the same.
Before my fixes to handle different technically invalid (but common) UTF-8 encodings, you could easily get two strings like â\xc0\x80â and â\0â which contained the same Unicode code point, but with a different byte encoding (as well as have many completely invalid sequences that were simply not detected).

I think there is a problem now, in that the String constructor does not seem to be making sure that all strings produced are actually valid UTF-8, and methods that deal with String assume that a String is valid UTF-8.
In v0.4.x, utf8 (and convert) ensured that each character in the resulting UTF8String was encoded correctly, but now utf8 has been deprecated, and convert(String doesnât create a canonically encoded UTF-8 string.

This still does not imply anything for ordering. That must be defined. ASCIIbetical is sometimes used (but is bad), similar for Unicode would also be a disaster. Since I want my alphabet, e.g. a Ã¡ b c.., then it's not a leap to correclty order Ã¡ weather it's precomposed or not

Ordering like that is language dependent, and really needs something very complex like the ICU library to attempt to do it right (I wrote code to handle those issues before ICU was available - itâs not at all a trivial task).

like UTF-8 validation or normalized comparison, then you'll need to use a package that provides a type like EncodedString{UTF8} (or something). Care needs to be taken to make sure that == is transitive, but that's a problem for the to-be-created external Unicode strings package.
Is it at least easy to include a package with new behavior that substitutes the default string type? If not with the same name, at least all literal will get the new type?

That is not currently possible.

If you must define some order, maybe it's better that equal is not dumb..
--
Palli.

Scott Jones

2016-06-09 21:23:23 UTC

Permalink

Just wanted to thank Stefan (again!) for getting this pulled in as
https://github.com/JuliaLang/julia/pull/16855 !