[PATCH] Keep expected behaviour for [a-z] and [A-Z] (Bug 23393).
Carlos O'Donell
2018-07-19 19:43:37 UTC
In commit 9479b6d5e08eacce06c6ab60abc9b2f4eb8b71e4 we updated all of
the collation data to harmonize with the new version of ISO 14651
which is derived from Unicode 9.0.0. This collation update brought
with it some changes to locales which some users did not want. In
particular it altered the meaning of the locale-dependent range
expressions [a-z] and [A-Z], and for en_US it caused uppercase
letters to be matched by [a-z] for the first time. The matching of
uppercase letters by [a-z] is already familiar to users of other
locales which have this property, but this change could cause
significant problems for en_US and other similar locales that had
never behaved this way before.
Whether this behaviour is desirable or not is contentious and GNU Awk
has this to say on the topic:
https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html
The POSIX rationale also has this to say, under "RE Bracket
Expression":
http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html
"The current standard leaves unspecified the behavior of a range
expression outside the POSIX locale. ... As noted above, efforts were
made to resolve the differences, but no solution has been found that
would be specific enough to allow for portable software while not
invalidating existing implementations."
In glibc we implement the requirement of ISO POSIX-2:1993 and use
collation element order (CEO) to construct the range expression; the
API internally is __collseq_table_lookup(). The fact that we use CEO
and also have 4-level weights on each collation rule means that we can
in practice reorder the collation rules in iso14651_t1_common (the new
data) to provide consistent range expression resolution *and* the
weights should maintain the expected total order. Therefore this
patch does three things:

* Reorder the collation rules for the LATIN script in
iso14651_t1_common to deinterlace uppercase and lowercase letters in
the collation element orders.

* Add new test data en_US.UTF-8.in for sort-test.sh which exercises
strcoll* and strxfrm* and ensures the ISO 14651 collation remains.

* Add back tests to tst-fnmatch.input and tst-regexloc.c which
exercise that [a-z] does not match A or Z.
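To see why deinterlacing changes what a range matches, here is a toy model of range resolution by position in a collation element order. The two order strings and the helper name in_range_by_order are made up for illustration; they are not taken from iso14651_t1_common and this is not how __collseq_table_lookup() is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy collation element orders.  Illustrative only: the real data in
   iso14651_t1_common contains far more than the 52 ASCII letters.  */
static const char interlaced[] =
  "aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ";
static const char deinterlaced[] =
  "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

/* Return 1 if C lies between LO and HI (inclusive) by position in
   ORDER, 0 if it does not, and -1 if any of the three characters is
   missing from ORDER.  */
static int
in_range_by_order (const char *order, char lo, char hi, char c)
{
  assert (order != NULL);
  /* strchr would find the terminator for '\0', so reject it early.  */
  if (lo == '\0' || hi == '\0' || c == '\0')
    return -1;

  const char *plo = strchr (order, lo);
  const char *phi = strchr (order, hi);
  const char *pc = strchr (order, c);
  if (plo == NULL || phi == NULL || pc == NULL)
    return -1;

  return plo <= pc && pc <= phi;
}
```

With the interlaced order, in_range_by_order (interlaced, 'a', 'z', 'A') is 1 while the same call with 'Z' is 0; with the deinterlaced order neither 'A' nor 'Z' falls inside a-z.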

The reordering of the ISO 14651 data is done in an entirely mechanical
fashion using the following program attached to the bug:
https://sourceware.org/bugzilla/show_bug.cgi?id=23393#c28

It is up for discussion if the iso14651_t1_common data should be
refined further to have 3 very tight collation element ranges that
include only a-z, A-Z, and 0-9, which would implement the solution
sought after in:
https://sourceware.org/bugzilla/show_bug.cgi?id=23393#c12

No regressions on x86_64.
Verified that removal of the iso14651_t1_common change causes tst-fnmatch
to regress with:
422: fnmatch ("[a-z]", "A", 0) = 0 (FAIL, expected FNM_NOMATCH) ***
...
425: fnmatch ("[A-Z]", "z", 0) = 0 (FAIL, expected FNM_NOMATCH) ***
---
ChangeLog | 11 +
localedata/Makefile | 1 +
localedata/en_US.UTF-8.in | 2159 +++++++++++++++++++++++++++++++++
localedata/locales/iso14651_t1_common | 1928 ++++++++++++++---------------
posix/tst-fnmatch.input | 125 +-
posix/tst-regexloc.c | 8 +-
6 files changed, 3224 insertions(+), 1008 deletions(-)
create mode 100644 localedata/en_US.UTF-8.in

I'm suggesting this change immediately for 2.28 to avoid further
problems with users' expectations and sorting with [a-z] and [A-Z] until
a clearer consensus can be reached for a final solution.

File attached as .tar.gz to get past spam detectors. There is a lot
of UTF-8 data in en_US.UTF-8 (every possible character in the LATIN
set that can be sorted with the existing test case infrastructure).
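For anyone who wants to reproduce the fnmatch checks quoted above outside the test suite, a minimal sketch follows. The helper name matches_in_locale is hypothetical; only setlocale() and fnmatch() are real APIs, and the result for en_US.UTF-8 depends on the installed glibc version and locale data.

```c
#include <assert.h>
#include <fnmatch.h>
#include <locale.h>
#include <stddef.h>

/* Return 1 if STRING matches PATTERN under LOCALE, 0 if it does not,
   and -1 if the locale is not installed or fnmatch reports an error.
   Note that this changes the process-wide locale.  */
static int
matches_in_locale (const char *locale, const char *pattern,
                   const char *string)
{
  assert (locale != NULL && pattern != NULL && string != NULL);

  if (setlocale (LC_ALL, locale) == NULL)
    return -1;

  int rc = fnmatch (pattern, string, 0);
  if (rc == 0)
    return 1;
  if (rc == FNM_NOMATCH)
    return 0;
  return -1;
}
```

The regression described above would show up as matches_in_locale ("en_US.UTF-8", "[a-z]", "A") returning 1 rather than 0; in the C locale the result is expected to be 0.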
--
Cheers,
Carlos.
Florian Weimer
2018-07-19 20:39:26 UTC
Post by Carlos O'Donell
* Add back tests to tst-fnmatch.input and tst-regexloc.c which
exercise that [a-z] does not match A or Z.
[a-z] still matches ñ, 𝚗, but not 𝚣, which I doubt is useful. It's an
improvement, and it may be good enough for glibc 2.28, but I would
rather see us implement the rational ranges interpretation.

Thanks,
Florian
Carlos O'Donell
2018-07-20 18:49:07 UTC
Post by Carlos O'Donell
* Add back tests to tst-fnmatch.input and tst-regexloc.c which
exercise that [a-z] does not match A or Z.
[a-z] still matches ñ, 𝚗, but not 𝚣, which I doubt is useful.
Sorry, I don't follow, it absolutely matches ASCII z.

We deinterlace the collation element ordering (not sequence) to get
the right range expression resolution.

See the added fnmatch tests:

+en_US.UTF-8 "a" "[a-z]" 0
+en_US.UTF-8 "z" "[a-z]" 0
+en_US.UTF-8 "A" "[a-z]" NOMATCH
+en_US.UTF-8 "Z" "[a-z]" NOMATCH
+en_US.UTF-8 "a" "[A-Z]" NOMATCH
+en_US.UTF-8 "z" "[A-Z]" NOMATCH
+en_US.UTF-8 "A" "[A-Z]" 0
+en_US.UTF-8 "Z" "[A-Z]" 0
+en_US.UTF-8 "0" "[0-9]" 0
+en_US.UTF-8 "9" "[0-9]" 0

[a-z] matches a-z (including z), *and* all the lowercase in between,
and so behaves like :lower: effectively.

[A-Z] matches A-Z (including Z), *and* all the uppercase in between,
and so behaves like :upper: effectively.

I left in all the matches for the accented characters because it was
the most conservative thing to do for now.

I could be persuaded otherwise I think, just reading the old history
and seeing the new reports seems to indicate we should back down to
behaving like C/POSIX in these cases.
It's an improvement, and it may be good enough for glibc 2.28, but I would
rather see us implement the rational ranges interpretation.
That requires all ranges behave rationally?

We could fix a-z, A-Z, and 0-9 easily.

Patch attached.

It has no effect on collation sequence, but it will break scripts
that expect the new-style behaviour, and we knew that, but it
certainly aligns us with the pre-POSIX requirement and the rest of
the GNU tools implementing rational ranges, which is a much better
reason.
--
Cheers,
Carlos.
Rich Felker
2018-07-20 19:02:39 UTC
Post by Carlos O'Donell
Post by Florian Weimer
Post by Carlos O'Donell
* Add back tests to tst-fnmatch.input and tst-regexloc.c which
exercise that [a-z] does not match A or Z.
[a-z] still matches ñ, 𝚗, but not 𝚣, which I doubt is useful.
Sorry, I don't follow, it absolutely matches ASCII z.
That's not an ASCII z. It's some plane-1 mathematical z. :-)

Rich
Florian Weimer
2018-07-20 19:19:28 UTC
Post by Carlos O'Donell
Post by Florian Weimer
Post by Carlos O'Donell
* Add back tests to tst-fnmatch.input and tst-regexloc.c which
exercise that [a-z] does not match A or Z.
[a-z] still matches ñ, 𝚗, but not 𝚣, which I doubt is useful.
Sorry, I don't follow, it absolutely matches ASCII z.
The z I wrote above is one of the non-BMP math characters.
Post by Carlos O'Donell
We deinterlace the collation element ordering (not sequence) to get
the right range expression resolution.
+en_US.UTF-8 "a" "[a-z]" 0
+en_US.UTF-8 "z" "[a-z]" 0
+en_US.UTF-8 "A" "[a-z]" NOMATCH
+en_US.UTF-8 "Z" "[a-z]" NOMATCH
+en_US.UTF-8 "a" "[A-Z]" NOMATCH
+en_US.UTF-8 "z" "[A-Z]" NOMATCH
+en_US.UTF-8 "A" "[A-Z]" 0
+en_US.UTF-8 "Z" "[A-Z]" 0
+en_US.UTF-8 "0" "[0-9]" 0
+en_US.UTF-8 "9" "[0-9]" 0
[a-z] matches a-z (including z), *and* all the lowercase in between,
and so behaves like :lower: effectively.
There are characters equivalent to ASCII z (like the z above), but which
sort after z, so they are not matched. This is one reason why I think
this is a bad idea: it looks like [:lower:], but it's not. Same for
[0-9], I assume.
Post by Carlos O'Donell
Post by Florian Weimer
It's an improvement, and it may be good enough for glibc 2.28, but I would
rather see us implement the rational ranges interpretation.
That requires all ranges behave rationally?
We could fix a-z, A-Z, and 0-9 easily.
Patch attached.
(NB: Patch is relative to the previous patch.)

My enumeration tester likes it much more. 8-)

actual: "abcdefghijklmnopqrstuvwxyz"
actual: "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
actual: "0123456789"

That's for [a-z], [A-Z], [0-9], in en_US.UTF-8 and de_DE.ISO-8859-1.
However, I still get this:

tst-regex-classes.script:85:0: result character set difference in locale
tr_TR.ISO-8859-9
enumerate_chars '[a-z]' "abcdefghijklmnopqrstuvwxyz";
^
expected: "abcdefghijklmnopqrstuvwxyz"
actual: "abcdefghjklmnopqrstuvwxyz"
tst-regex-classes.script:86:0: result character set difference in locale
tr_TR.ISO-8859-9
enumerate_chars '[A-Z]' "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
^
expected: "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
actual: "ABCDEFGHJKLMNOPQRSTUVWXYZ"
error: 2 test failures

Can you fix this with data-only changes, too?
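The enumeration tester itself is not shown in this thread. As a rough sketch of the idea only, the following enumerates the single-byte characters that a bracket expression matches in a given locale. The function name enumerate_single_bytes is hypothetical, and it deliberately ignores multibyte characters, so it is far less thorough than the real tester.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Store in OUT every single-byte character (1..255) that BRACKET,
   e.g. "[a-z]", matches under LOCALE.  Returns the number of
   characters stored, or -1 if the locale is unavailable, the
   expression does not compile, or OUT is too small.  */
static int
enumerate_single_bytes (const char *locale, const char *bracket,
                        char *out, size_t outsz)
{
  assert (locale != NULL && bracket != NULL && out != NULL);

  if (outsz == 0 || setlocale (LC_ALL, locale) == NULL)
    return -1;

  regex_t re;
  if (regcomp (&re, bracket, REG_NOSUB) != 0)
    return -1;

  size_t n = 0;
  int overflow = 0;
  for (int c = 1; c < 256; c++)
    {
      /* A one-character subject: it matches only if C is in the set.  */
      char subject[2] = { (char) c, '\0' };
      if (regexec (&re, subject, 0, NULL, 0) != 0)
        continue;
      if (n + 1 >= outsz)
        {
          overflow = 1;
          break;
        }
      out[n++] = (char) c;
    }
  out[n] = '\0';
  regfree (&re);
  return overflow ? -1 : (int) n;
}
```

Called with "tr_TR.ISO-8859-9" and "[a-z]" on a system with the affected locale data, a result missing "i" would correspond to the failure reported above; that outcome depends entirely on the installed locales.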

posix/bug-regex17 regresses as well in the test for bug 9697, but I can
incorporate that into my enumeration tester. I don't think the bug is
actually regressing, it's just that the test objective is not expressed
properly in it.

posix/tst-rxspencer fails as well, presumably due to this:

UTF-8 aA FAIL regcomp failed: Invalid range end
UTF-8 aAcC FAIL regcomp failed: Invalid range end

I think this happens because the test blindly replaces ASCII characters
with non-ASCII characters, which causes issues if they are not ordered
as expected.

Thanks,
Florian
Carlos O'Donell
2018-07-20 21:56:22 UTC
Post by Florian Weimer
Post by Carlos O'Donell
Post by Carlos O'Donell
* Add back tests to tst-fnmatch.input and tst-regexloc.c which
exercise that [a-z] does not match A or Z.
[a-z] still matches ñ, 𝚗, but not 𝚣, which I doubt is useful.
Sorry, I don't follow, it absolutely matches ASCII z.
The z I wrote above is one of the non-BMP math characters.
Thanks :-}

It was a conservative solution.
Post by Florian Weimer
Post by Carlos O'Donell
We deinterlace the collation element ordering (not sequence) to get
the right range expression resolution.
+en_US.UTF-8     "a"                    "[a-z]"                0
+en_US.UTF-8     "z"                    "[a-z]"                0
+en_US.UTF-8     "A"                    "[a-z]"                NOMATCH
+en_US.UTF-8     "Z"                    "[a-z]"                NOMATCH
+en_US.UTF-8     "a"                    "[A-Z]"                NOMATCH
+en_US.UTF-8     "z"                    "[A-Z]"                NOMATCH
+en_US.UTF-8     "A"                    "[A-Z]"                0
+en_US.UTF-8     "Z"                    "[A-Z]"                0
+en_US.UTF-8     "0"                    "[0-9]"                0
+en_US.UTF-8     "9"                    "[0-9]"                0
[a-z] matches a-z (including z), *and* all the lowercase in between,
and so behaves like :lower: effectively.
There are characters equivalent to ASCII z (like the z above), but
which sort after z, so they are not matched. This is one reason why
I think this is a bad idea: it looks like [:lower:], but it's not.
Same for [0-9], I assume.
Again, conservatively, this is how it worked before, and now works again
the same, but retains the improvement of ISO 14651 data being added.
Post by Florian Weimer
Post by Carlos O'Donell
It's an improvement, and it may be good enough for glibc 2.28, but I would
rather see us implement the rational ranges interpretation.
That requires all ranges behave rationally?
We could fix a-z, A-Z, and 0-9 easily.
Patch attached.
(NB: Patch is relative to the previous patch.)
My enumeration tester likes it much more. 8-)
It was designed exactly for your enumerator ;-)
Post by Florian Weimer
  actual:   "abcdefghijklmnopqrstuvwxyz"
  actual:   "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
  actual:   "0123456789"
tst-regex-classes.script:85:0: result character set difference in locale tr_TR.ISO-8859-9
enumerate_chars '[a-z]' "abcdefghijklmnopqrstuvwxyz";
^
  expected: "abcdefghijklmnopqrstuvwxyz"
  actual:   "abcdefghjklmnopqrstuvwxyz"
tst-regex-classes.script:86:0: result character set difference in locale tr_TR.ISO-8859-9
enumerate_chars '[A-Z]' "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
^
  expected: "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
  actual:   "ABCDEFGHJKLMNOPQRSTUVWXYZ"
error: 2 test failures
Can you fix this with data-only changes, too?
Yes, I need to duplicate the rational range for A-Z in tr_TR and
remove 'i' since it's just fine the way it is, the existing

New patch attached with additional tests in tst-fnmatch.input to
test tr_TR.UTF-8, and ISO-8859-9.

Noticed equivalence class issues and filed a bug and added an XFAIL-ish
test case in tst-fnmatch.input:
https://sourceware.org/bugzilla/show_bug.cgi?id=23437
Post by Florian Weimer
posix/bug-regex17 regresses as well in the test for bug 9697, but I
can incorporate that into my enumeration tester. I don't think the
bug is actually regressing, it's just that the test objective is not
expressed properly in it.
Fixed.
Post by Florian Weimer
UTF-8 aA FAIL regcomp failed: Invalid range end
UTF-8 aAcC FAIL regcomp failed: Invalid range end
I think this happens because the test blindly replaces ASCII
characters with non-ASCII characters, which causes issues if they are
not ordered as expected.
Fixed.

v2
- Fixed tr_TR by duplicating A-Z rational range.
- Fixed tst-rxspencer.
- Fixed bug-regex17.

Tell me how the new version does.
--
Cheers,
Carlos.
Florian Weimer
2018-07-23 15:10:54 UTC
Post by Carlos O'Donell
v2
- Fixed tr_TR by duplicating A-Z rational range.
- Fixed tst-rxspencer.
- Fixed bug-regex17.
Tell me how the new version does.
My tester likes it. tr_TR.ISO-8859-9 is now fixed. I added fnmatch
support, too, and initial results look good as well.

Thanks,
Florian
Carlos O'Donell
2018-07-23 18:09:31 UTC
Post by Florian Weimer
Post by Carlos O'Donell
v2
- Fixed tr_TR by duplicating A-Z rational range.
- Fixed tst-rxspencer.
- Fixed bug-regex17.
Tell me how the new version does.
My tester likes it. tr_TR.ISO-8859-9 is now fixed. I added fnmatch
support, too, and initial results look good as well.
OK, so we have the capability to deploy rational ranges.

Florian,

Should we do so in 2.28? Avoiding all possible problems in the future
and making the ranges portable, rational, and safe from a security
perspective?

Rafal,

As localedata maintainer what is your opinion of changing the meaning
of [a-z], [A-Z], and [0-9] to be rational ranges for *all* locales
which mean exactly the latin character sequences you would expect
e.g. {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z} for [a-z],
[A-Z] likewise, and {0,1,2,3,4,5,6,7,8,9}?

Mike,

Same question to you.

For historical context in gawk:
https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html

For context from POSIX:
http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html
(see the section on "RE Bracket Expressions").

Support for rational ranges would make [a-z], [A-Z], [0-9] and other subranges
rational for all locales, and would no longer include mixed case, or accents.

I'd like to hear affirmatives from the localedata maintainers on this issue.

Cheers,
Carlos.
Rafal Luzynski
2018-07-24 20:45:15 UTC
Post by Carlos O'Donell
[...]
Rafal,
As localedata maintainer what is your opinion of changing the meaning
of [a-z], [A-Z], and [0-9] to be rational ranges for *all* locales
which mean exactly the latin character sequences you would expect
e.g. {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z} for [a-z],
[A-Z] likewise, and {0,1,2,3,4,5,6,7,8,9}?
Having discussed this off-list my answer is: I'm in favor of implementing
rational ranges treating [a-z], [A-Z], [0-9], and all their subsets as
code-point ranges. But I understand that this is possible only in 2.29.
Therefore for 2.28 I support this data-based solution.

Regards,

Rafal
Carlos O'Donell
2018-07-24 20:53:37 UTC
Post by Rafal Luzynski
Post by Carlos O'Donell
[...]
Rafal,
As localedata maintainer what is your opinion of changing the meaning
of [a-z], [A-Z], and [0-9] to be rational ranges for *all* locales
which mean exactly the latin character sequences you would expect
e.g. {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z} for [a-z],
[A-Z] likewise, and {0,1,2,3,4,5,6,7,8,9}?
Having discussed this off-list my answer is: I'm in favor of implementing
rational ranges treating [a-z], [A-Z], [0-9], and all their subsets as
code-point ranges. But I understand that this is possible only in 2.29.
Therefore for 2.28 I support this data-based solution.
From the perspective of the user of the library and the locales the
rational ranges we implement will look as-if they were code point ranges
for the ranges in question e.g. a-z, A-Z, 0-9 and their subranges.

For 2.28 we will implement rational ranges for [a-z], [A-Z], and [0-9],
and all of their subsets via a data-only solution. Just wanted to make
it clear that all subsets will be treated as rational ranges.

It is only for other subsets like [!-~] (ASCII range) where we will not
have a rational range until we switch to making ranges operate on code
points. That will be a 2.29 optimization.
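To make the "as-if code point ranges" behaviour concrete, here is a minimal sketch of a code point range test. It only illustrates the semantics described above; the function name is hypothetical, and it is neither what 2.28 does internally (that is the data-only approach) nor a design for 2.29.

```c
#include <assert.h>
#include <stdint.h>

/* Return 1 if code point C lies in the inclusive code point range
   LO..HI, else 0.  No locale data is consulted at all.  */
static int
codepoint_in_range (uint32_t lo, uint32_t hi, uint32_t c)
{
  assert (lo <= hi);
  return lo <= c && c <= hi;
}
```

Under this interpretation [a-z] contains 'm' but not U+00F1 (ñ) or U+1D6A3 (the mathematical monospace small z mentioned earlier in the thread), and [!-~] is simply the printable ASCII range.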

OK, I will prepare a patch to fix this.

Cheers,
Carlos.
Carlos O'Donell
2018-07-24 20:59:30 UTC
Post by Rafal Luzynski
Post by Carlos O'Donell
[...]
Rafal,
As localedata maintainer what is your opinion of changing the meaning
of [a-z], [A-Z], and [0-9] to be rational ranges for *all* locales
which mean exactly the latin character sequences you would expect
e.g. {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z} for [a-z],
[A-Z] likewise, and {0,1,2,3,4,5,6,7,8,9}?
Having discussed this off-list my answer is: I'm in favor of implementing
rational ranges treating [a-z], [A-Z], [0-9], and all their subsets as
code-point ranges. But I understand that this is possible only in 2.29.
Therefore for 2.28 I support this data-based solution.
I'll put together a final patch ASAP that provides:

* Deinterlace upper/lower
* Group a-z, A-Z, 0-9,
* NEWS entry for rational ranges.

Note: manual/stdio.texi also makes the mistake of saying [a-z] is lowercase
characters, so this will fix the manual bug with no change :-)

Cheers,
Carlos.
Mike FABIAN
2018-07-25 15:43:55 UTC
Post by Carlos O'Donell
Post by Florian Weimer
Post by Carlos O'Donell
v2
- Fixed tr_TR by duplicating A-Z rational range.
- Fixed tst-rxspencer.
- Fixed bug-regex17.
Tell me how the new version does.
My tester likes it. tr_TR.ISO-8859-9 is now fixed. I added fnmatch
support, too, and initial results look good as well.
OK, so we have the capability to deploy rational ranges.
Florian,
Should we do so in 2.28? Avoiding all possible problems in the future
and making the ranges portable, rational, and safe from a security
perspective?
Rafal,
As localedata maintainer what is your opinion of changing the meaning
of [a-z], [A-Z], and [0-9] to be rational ranges for *all* locales
which mean exactly the latin character sequences you would expect
e.g. {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z} for [a-z],
[A-Z] likewise, and {0,1,2,3,4,5,6,7,8,9}?
Mike,
Same question to you.
I agree that rational ranges are much more useful.

I cannot imagine any use case for [a-z] matching aAbB...z and not Z.

One never knows what [a-z] would match if it uses the locale sort order,
it is just too confusing.

In the long run, I think implementing ranges by code points would be
the best solution and would make updates of the iso14651_t1_common file
easier, because we would then need to make fewer changes to the upstream
version of that file.

But for 2.28 this cannot be done. Therefore, I think the solution
by Carlos is very good.
Post by Carlos O'Donell
https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html
http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html
(see the section on "RE Bracket Expressions").
Support for rational ranges would make [a-z], [A-Z], [0-9] and other subranges
rational for all locales, and would no longer include mixed case, or accents.
I'd like to hear affirmatives from the localedata maintainers on this issue.
Cheers,
Carlos.
--
Mike FABIAN <***@redhat.com>
睡眠不足はいい仕事の敵だ。
Carlos O'Donell
2018-07-25 15:54:37 UTC
Post by Florian Weimer
Post by Carlos O'Donell
v2
- Fixed tr_TR by duplicating A-Z rational range.
- Fixed tst-rxspencer.
- Fixed bug-regex17.
Tell me how the new version does.
My tester likes it. tr_TR.ISO-8859-9 is now fixed. I added fnmatch
support, too, and initial results look good as well.
OK, here is v3.

~~~ NEWS ~~~
* The GNU C Library now uses rational ranges for regular expression
matching of ranges that are within a-z, A-Z, and 0-9 for all
locales. This means that the range [a-c] will no longer match
accented letter a's and will only match exactly a, b, and c. Likewise
[0-9] will only include the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, and
no other characters. Rational ranges have been implemented by
several other GNU projects to provide straightforward rules for
regular expression ranges and to make them portable across locales.
The current rational ranges are implemented using collation element
ordering, which may yield unexpected results if the range includes
accented characters, e.g. [a-ñ]: such a range will include all of a-z
because ñ comes after the rational range in collation element order.
In the future the library may implement full rational ranges covering
all characters by using Unicode code point ordering which will make
the sequences faster to match and more portable.
~~~

We have approval from Mike and Rafal, the two localedata subsystem
maintainers.

This solution matches what you and Rich Felker both think is the
correct solution.

So for 2.28 we would use rational ranges for a-z, A-Z, and 0-9, until
we can implement code point ranges.

v3
- Merged lowercase/uppercase deinterlacing.
- Added NEWS entry.

Please run this through your checker, and ACK this for 2.28 and I'll
commit.

Attaching it as swbz23393v3.tar.gz to avoid spam rejection.

Cheers,
Carlos.
Florian Weimer
2018-07-25 20:18:50 UTC
Post by Carlos O'Donell
Attaching it as swbz23393v3.tar.gz to avoid spam rejection.
Quick comment. The middle line here adds trailing whitespace:

- { "[a-z]|[^a-z]", "\xcb\xa2", REG_EXTENDED, 2,
+
+ The U+02DA RING ABOVE is chosen because it's not in [s-㏜]. */

Florian
Carlos O'Donell
2018-07-25 20:25:24 UTC
Post by Carlos O'Donell
Attaching it as swbz23393v3.tar.gz to avoid spam rejection.
-  { "[a-z]|[^a-z]", "\xcb\xa2", REG_EXTENDED, 2,
+
+     The U+02DA RING ABOVE is chosen because it's not in [s-㏜].  */
Thanks. I'll fix this with v4.

I had to fix the following locales:

modified: localedata/locales/ar_SA
modified: localedata/locales/km_KH
modified: localedata/locales/lo_LA
modified: localedata/locales/or_IN
modified: localedata/locales/sl_SI
modified: localedata/locales/th_TH

They all re-arranged ASCII character collation element ordering like tr_TR,
and so they needed manual fixing.

Could you please add these locales to your tester?

c.
Florian Weimer
2018-07-25 20:31:43 UTC
Post by Carlos O'Donell
Post by Carlos O'Donell
Attaching it as swbz23393v3.tar.gz to avoid spam rejection.
-  { "[a-z]|[^a-z]", "\xcb\xa2", REG_EXTENDED, 2,
+
+     The U+02DA RING ABOVE is chosen because it's not in [s-㏜].  */
Thanks. I'll fix this with v4.
I have verified that localedata/locales/iso14651_t1_common is just a
reordering (except for the new comments).

localedata/locales/tr_TR is more complicated, but looks like an
order-only change to me too.
Post by Carlos O'Donell
modified: localedata/locales/ar_SA
modified: localedata/locales/km_KH
modified: localedata/locales/lo_LA
modified: localedata/locales/or_IN
modified: localedata/locales/sl_SI
modified: localedata/locales/th_TH
Do you have the actual locale names handy? localedata/SUPPORTED
contains charsets, but I'm not sure if the translation to locale names
is completely regular.
Post by Carlos O'Donell
They all re-arranged ASCII character collation element ordering like tr_TR,
and so they needed manual fixing.
Could you please add these locales to your tester?
I will try. I already have an xtests part, and these probably need to
go there as well.

Thanks,
Florian
Carlos O'Donell
2018-07-25 20:57:11 UTC
Post by Carlos O'Donell
Post by Carlos O'Donell
Attaching it as swbz23393v3.tar.gz to avoid spam rejection.
-  { "[a-z]|[^a-z]", "\xcb\xa2", REG_EXTENDED, 2,
+
+     The U+02DA RING ABOVE is chosen because it's not in [s-㏜].  */
Thanks. I'll fix this with v4.
I have verified that localedata/locales/iso14651_t1_common is just a reordering (except for the new comments).
localedata/locales/tr_TR is more complicated, but looks like an order-only change for me too.
Post by Carlos O'Donell
    modified:   localedata/locales/ar_SA
    modified:   localedata/locales/km_KH
    modified:   localedata/locales/lo_LA
    modified:   localedata/locales/or_IN
    modified:   localedata/locales/sl_SI
    modified:   localedata/locales/th_TH
Do you have the actual locale names handy?  localedata/SUPPORTED contains charsets, but I'm not sure if the translation to locale names is completely regular.
It is completely regular. In that ar_SA => ar_SA.UTF-8. And so forth.
Post by Carlos O'Donell
They all re-arranged ASCII character collation element ordering like tr_TR,
and so they needed manual fixing.
Could you please add these locales to your tester?
I will try.  I already have an xtests part, and these probably need to go there as well.
v4
- Fixed ar_SA, km_KH, lo_LA, or_IN, sl_SI, th_TH.
- Added range checking for a-z, A-Z for all supported UTF-8 locales.

All of my testers are clean.

So the question is now:

Do we commit to rational ranges for a-z, A-Z, 0-9 ... for 2.28?

or

Do we just do the deinterlacing of iso14651_t1_common to fix en_US.UTF-8?

Cheers,
Carlos.
Carlos O'Donell
2018-07-26 02:34:25 UTC
Post by Carlos O'Donell
v4
- Fixed ar_SA, km_KH, lo_LA, or_IN, sl_SI, th_TH.
- Added range checking for a-z, A-Z for all supported UTF-8 locales.
All of my testers are clean.
Attaching v4 on top of the current master.

This fixes all the locales.

All locales, even those with tailoring, have rational range support now.

If this passes your tests tomorrow I'm OK to put this into 2.28.

Cheers,
Carlos.
Florian Weimer
2018-07-26 14:50:57 UTC
Post by Carlos O'Donell
Post by Carlos O'Donell
v4
- Fixed ar_SA, km_KH, lo_LA, or_IN, sl_SI, th_TH.
- Added range checking for a-z, A-Z for all supported UTF-8 locales.
All of my testers are clean.
Attaching v4 on top of the current master.
This fixes all the locales.
I wrote another enumeration tester, this time covering all locales. It
found these issues:

az_AZ: U+000069 fails to match /[a-z]/
az_AZ: U+000049 fails to match /[A-Z]/
az_AZ.utf8: U+000069 fails to match /[a-z]/
az_AZ.utf8: U+000049 fails to match /[A-Z]/
crh_UA: U+000069 fails to match /[a-z]/
crh_UA: U+000049 fails to match /[A-Z]/
crh_UA.utf8: U+000069 fails to match /[a-z]/
crh_UA.utf8: U+000049 fails to match /[A-Z]/
ku_TR: U+000069 fails to match /[a-z]/
ku_TR: U+000049 fails to match /[A-Z]/
ku_TR.iso88599: U+000069 fails to match /[a-z]/
ku_TR.iso88599: U+000049 fails to match /[A-Z]/
ku_TR.utf8: U+000069 fails to match /[a-z]/
ku_TR.utf8: U+000049 fails to match /[A-Z]/
lv_LV: U+000079 fails to match /[a-z]/
lv_LV: U+000059 fails to match /[A-Z]/
lv_LV.iso885913: U+000079 fails to match /[a-z]/
lv_LV.iso885913: U+000059 fails to match /[A-Z]/
lv_LV.utf8: U+000079 fails to match /[a-z]/
lv_LV.utf8: U+000059 fails to match /[A-Z]/
shs_CA: U+0000E6 matches /[a-z]/ unexpectedly
shs_CA: U+0000C6 matches /[A-Z]/ unexpectedly
shs_CA.utf8: U+0000E6 matches /[a-z]/ unexpectedly
shs_CA.utf8: U+0000C6 matches /[A-Z]/ unexpectedly
slovene: U+00006A fails to match /[a-z]/
slovene: U+00006B fails to match /[a-z]/
slovene: U+00006C fails to match /[a-z]/
slovene: U+00006D fails to match /[a-z]/
slovene: U+00006E fails to match /[a-z]/
slovene: U+00006F fails to match /[a-z]/
slovenian: U+00006A fails to match /[a-z]/
slovenian: U+00006B fails to match /[a-z]/
slovenian: U+00006C fails to match /[a-z]/
slovenian: U+00006D fails to match /[a-z]/
slovenian: U+00006E fails to match /[a-z]/
slovenian: U+00006F fails to match /[a-z]/
sl_SI: U+00006A fails to match /[a-z]/
sl_SI: U+00006B fails to match /[a-z]/
sl_SI: U+00006C fails to match /[a-z]/
sl_SI: U+00006D fails to match /[a-z]/
sl_SI: U+00006E fails to match /[a-z]/
sl_SI: U+00006F fails to match /[a-z]/
sl_SI.iso88592: U+00006A fails to match /[a-z]/
sl_SI.iso88592: U+00006B fails to match /[a-z]/
sl_SI.iso88592: U+00006C fails to match /[a-z]/
sl_SI.iso88592: U+00006D fails to match /[a-z]/
sl_SI.iso88592: U+00006E fails to match /[a-z]/
sl_SI.iso88592: U+00006F fails to match /[a-z]/
sl_SI.utf8: U+00006A fails to match /[a-z]/
sl_SI.utf8: U+00006B fails to match /[a-z]/
sl_SI.utf8: U+00006C fails to match /[a-z]/
sl_SI.utf8: U+00006D fails to match /[a-z]/
sl_SI.utf8: U+00006E fails to match /[a-z]/
sl_SI.utf8: U+00006F fails to match /[a-z]/
sv_FI: U+000077 fails to match /[a-z]/
sv_FI: U+000057 fails to match /[A-Z]/
***@euro: U+000077 fails to match /[a-z]/
***@euro: U+000057 fails to match /[A-Z]/
sv_FI.iso88591: U+000077 fails to match /[a-z]/
sv_FI.iso88591: U+000057 fails to match /[A-Z]/
***@euro: U+000077 fails to match /[a-z]/
***@euro: U+000057 fails to match /[A-Z]/
sv_FI.utf8: U+000077 fails to match /[a-z]/
sv_FI.utf8: U+000057 fails to match /[A-Z]/
sv_SE: U+000077 fails to match /[a-z]/
sv_SE: U+000057 fails to match /[A-Z]/
sv_SE.iso88591: U+000077 fails to match /[a-z]/
sv_SE.iso88591: U+000057 fails to match /[A-Z]/
sv_SE.utf8: U+000077 fails to match /[a-z]/
sv_SE.utf8: U+000057 fails to match /[A-Z]/
swedish: U+000077 fails to match /[a-z]/
swedish: U+000057 fails to match /[A-Z]/
tt_RU: U+000069 fails to match /[a-z]/
tt_RU: U+000049 fails to match /[A-Z]/
***@iqtelif: U+000069 fails to match /[a-z]/
***@iqtelif: U+000049 fails to match /[A-Z]/
tt_RU.utf8: U+000069 fails to match /[a-z]/
tt_RU.utf8: U+000049 fails to match /[A-Z]/
***@iqtelif: U+000069 fails to match /[a-z]/
***@iqtelif: U+000049 fails to match /[A-Z]/
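The all-locale tester that produced this list is not included in the thread. A much reduced sketch of the same kind of check, limited to ASCII subjects and using fnmatch, could look like the following; the function name and the table of checks are assumptions for illustration, and only the report format mimics the lines above.

```c
#include <assert.h>
#include <fnmatch.h>
#include <locale.h>
#include <stddef.h>
#include <stdio.h>

/* For each ASCII character 1..127, compare what [a-z], [A-Z] and
   [0-9] match under LOCALE with the plain ASCII interpretation, and
   print one line per disagreement.  Returns the number of
   disagreements, or -1 if the locale is not installed.  */
static int
count_ascii_range_mismatches (const char *locale)
{
  assert (locale != NULL);

  if (setlocale (LC_ALL, locale) == NULL)
    return -1;

  static const struct { const char *pattern; char lo, hi; } checks[] = {
    { "[a-z]", 'a', 'z' },
    { "[A-Z]", 'A', 'Z' },
    { "[0-9]", '0', '9' },
  };

  int mismatches = 0;
  for (size_t i = 0; i < sizeof checks / sizeof checks[0]; i++)
    for (int c = 1; c < 128; c++)
      {
        char subject[2] = { (char) c, '\0' };
        int expected = c >= checks[i].lo && c <= checks[i].hi;
        int actual = fnmatch (checks[i].pattern, subject, 0) == 0;
        if (expected == actual)
          continue;
        printf ("%s: U+%06X %s /%s/%s\n", locale, (unsigned int) c,
                expected ? "fails to match" : "matches",
                checks[i].pattern, expected ? "" : " unexpectedly");
        mismatches++;
      }
  return mismatches;
}
```

Because it never looks past U+007F, this sketch could report the missing i, w, and y cases listed above but not the shs_CA ones, where U+00E6 and U+00C6 match unexpectedly.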

Thanks,
Florian
Carlos O'Donell
2018-07-26 14:59:27 UTC
Post by Florian Weimer
Post by Carlos O'Donell
Post by Carlos O'Donell
v4
- Fixed ar_SA, km_KH, lo_LA, or_IN, sl_SI, th_TH.
- Added range checking for a-z, A-Z for all supported UTF-8 locales.
All of my testers are clean.
Attaching v4 on top of the current master.
This fixes all the locales.
az_AZ: U+000069 fails to match /[a-z]/
az_AZ: U+000049 fails to match /[A-Z]/
az_AZ.utf8: U+000069 fails to match /[a-z]/
az_AZ.utf8: U+000049 fails to match /[A-Z]/
See it.
Post by Florian Weimer
crh_UA: U+000069 fails to match /[a-z]/
crh_UA: U+000049 fails to match /[A-Z]/
crh_UA.utf8: U+000069 fails to match /[a-z]/
crh_UA.utf8: U+000049 fails to match /[A-Z]/
See it.
Post by Florian Weimer
ku_TR: U+000069 fails to match /[a-z]/
ku_TR: U+000049 fails to match /[A-Z]/
ku_TR.iso88599: U+000069 fails to match /[a-z]/
ku_TR.iso88599: U+000049 fails to match /[A-Z]/
ku_TR.utf8: U+000069 fails to match /[a-z]/
ku_TR.utf8: U+000049 fails to match /[A-Z]/
See it.
Post by Florian Weimer
lv_LV: U+000079 fails to match /[a-z]/
lv_LV: U+000059 fails to match /[A-Z]/
lv_LV.iso885913: U+000079 fails to match /[a-z]/
lv_LV.iso885913: U+000059 fails to match /[A-Z]/
lv_LV.utf8: U+000079 fails to match /[a-z]/
lv_LV.utf8: U+000059 fails to match /[A-Z]/
See it.
Post by Florian Weimer
shs_CA: U+0000E6 matches /[a-z]/ unexpectedly
shs_CA: U+0000C6 matches /[A-Z]/ unexpectedly
shs_CA.utf8: U+0000E6 matches /[a-z]/ unexpectedly
shs_CA.utf8: U+0000C6 matches /[A-Z]/ unexpectedly
Good catch. These were the ones I was hoping your finder would catch.
Post by Florian Weimer
slovene: U+00006A fails to match /[a-z]/
slovene: U+00006B fails to match /[a-z]/
slovene: U+00006C fails to match /[a-z]/
slovene: U+00006D fails to match /[a-z]/
slovene: U+00006E fails to match /[a-z]/
slovene: U+00006F fails to match /[a-z]/
This is an alias for sl_SI.ISO-8859-2 and we see it below.
Post by Florian Weimer
slovenian: U+00006A fails to match /[a-z]/
slovenian: U+00006B fails to match /[a-z]/
slovenian: U+00006C fails to match /[a-z]/
slovenian: U+00006D fails to match /[a-z]/
slovenian: U+00006E fails to match /[a-z]/
slovenian: U+00006F fails to match /[a-z]/
This is an alias for sl_SI.ISO-8859-2 and we see it below.
Post by Florian Weimer
sl_SI: U+00006A fails to match /[a-z]/
sl_SI: U+00006B fails to match /[a-z]/
sl_SI: U+00006C fails to match /[a-z]/
sl_SI: U+00006D fails to match /[a-z]/
sl_SI: U+00006E fails to match /[a-z]/
sl_SI: U+00006F fails to match /[a-z]/
See it.
Post by Florian Weimer
sl_SI.iso88592: U+00006A fails to match /[a-z]/
sl_SI.iso88592: U+00006B fails to match /[a-z]/
sl_SI.iso88592: U+00006C fails to match /[a-z]/
sl_SI.iso88592: U+00006D fails to match /[a-z]/
sl_SI.iso88592: U+00006E fails to match /[a-z]/
sl_SI.iso88592: U+00006F fails to match /[a-z]/
See it (aliased above twice).
Post by Florian Weimer
sl_SI.utf8: U+00006A fails to match /[a-z]/
sl_SI.utf8: U+00006B fails to match /[a-z]/
sl_SI.utf8: U+00006C fails to match /[a-z]/
sl_SI.utf8: U+00006D fails to match /[a-z]/
sl_SI.utf8: U+00006E fails to match /[a-z]/
sl_SI.utf8: U+00006F fails to match /[a-z]/
See it.
Post by Florian Weimer
sv_FI: U+000077 fails to match /[a-z]/
sv_FI: U+000057 fails to match /[A-Z]/
See it.
Same as sv_FI.
Post by Florian Weimer
sv_FI.iso88591: U+000077 fails to match /[a-z]/
sv_FI.iso88591: U+000057 fails to match /[A-Z]/
Likewise.
Likewise.
Post by Florian Weimer
sv_FI.utf8: U+000077 fails to match /[a-z]/
sv_FI.utf8: U+000057 fails to match /[A-Z]/
Likewise.
Post by Florian Weimer
sv_SE: U+000077 fails to match /[a-z]/
sv_SE: U+000057 fails to match /[A-Z]/
See it.
Post by Florian Weimer
sv_SE.iso88591: U+000077 fails to match /[a-z]/
sv_SE.iso88591: U+000057 fails to match /[A-Z]/
Same as above.
Post by Florian Weimer
sv_SE.utf8: U+000077 fails to match /[a-z]/
sv_SE.utf8: U+000057 fails to match /[A-Z]/
Likewise.
Post by Florian Weimer
swedish: U+000077 fails to match /[a-z]/
swedish: U+000057 fails to match /[A-Z]/
Alias for sv_SE.
Post by Florian Weimer
tt_RU: U+000069 fails to match /[a-z]/
tt_RU: U+000049 fails to match /[A-Z]/
See it.
See it.
Post by Florian Weimer
tt_RU.utf8: U+000069 fails to match /[a-z]/
tt_RU.utf8: U+000049 fails to match /[A-Z]/
See it.
See it.

Thank you!

I increased tst-fnmatch.input coverage and I get this:

Line #3699: Test #3548 (az_AZ.UTF-8): fnmatch ("[a-z]", "i", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #3751: Test #3600 (az_AZ.UTF-8): fnmatch ("[A-Z]", "I", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #6819: Test #6668 (crh_UA.UTF-8): fnmatch ("[a-z]", "i", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #6871: Test #6720 (crh_UA.UTF-8): fnmatch ("[A-Z]", "I", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #18675: Test #18524 (ku_TR.UTF-8): fnmatch ("[a-z]", "i", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #18727: Test #18576 (ku_TR.UTF-8): fnmatch ("[A-Z]", "I", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #19835: Test #19684 (lv_LV.UTF-8): fnmatch ("[a-z]", "y", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #19887: Test #19736 (lv_LV.UTF-8): fnmatch ("[A-Z]", "Y", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #26684: Test #26533 (sl_SI.UTF-8): fnmatch ("[a-z]", "j", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #26685: Test #26534 (sl_SI.UTF-8): fnmatch ("[a-z]", "k", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #26686: Test #26535 (sl_SI.UTF-8): fnmatch ("[a-z]", "l", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #26687: Test #26536 (sl_SI.UTF-8): fnmatch ("[a-z]", "m", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #26688: Test #26537 (sl_SI.UTF-8): fnmatch ("[a-z]", "n", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #26689: Test #26538 (sl_SI.UTF-8): fnmatch ("[a-z]", "o", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #28049: Test #27898 (sv_FI.UTF-8): fnmatch ("[a-z]", "w", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #28101: Test #27950 (sv_FI.UTF-8): fnmatch ("[A-Z]", "W", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #28153: Test #28002 (sv_SE.UTF-8): fnmatch ("[a-z]", "w", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #28205: Test #28054 (sv_SE.UTF-8): fnmatch ("[A-Z]", "W", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #30427: Test #30276 (tt_RU.UTF-8): fnmatch ("[a-z]", "i", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #30479: Test #30328 (tt_RU.UTF-8): fnmatch ("[A-Z]", "I", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #30531: Test #30380 (tt_RU.UTF-8@iqtelif): fnmatch ("[a-z]", "i", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #30583: Test #30432 (tt_RU.UTF-8@iqtelif): fnmatch ("[A-Z]", "I", 0) = FNM_NOMATCH (FAIL, expected 0) ***

Which matches all the locales you saw failures in except for shs_CA, which is a real bug.

I'll fix these up quickly.
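
A per-letter report like the one above can also be produced outside the glibc test suite. The following is only a minimal sketch: it uses the standard setlocale() and fnmatch() calls, but the helper name missing_from_range and its return convention are hypothetical and not part of glibc or tst-fnmatch. It assumes the locale under test is installed on the machine.

```c
#include <fnmatch.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper, not a glibc interface.
   Collect the characters in [first, last] that PATTERN fails to match
   under LOCALE_NAME.  The characters are written to OUT as a
   NUL-terminated string, so OUT needs room for (last - first + 2) bytes.
   Returns the number of characters collected, or (size_t) -1 if the
   locale cannot be selected (for example because it is not installed).  */
size_t
missing_from_range (const char *locale_name, const char *pattern,
                    char first, char last, char *out)
{
  size_t n = 0;

  if (setlocale (LC_ALL, locale_name) == NULL)
    return (size_t) -1;

  for (int c = first; c <= last; ++c)
    {
      char s[2] = { (char) c, '\0' };

      /* Only FNM_NOMATCH counts as "missing"; other errors are ignored.  */
      if (fnmatch (pattern, s, 0) == FNM_NOMATCH)
        out[n++] = (char) c;
    }
  out[n] = '\0';
  return n;
}
```

Calling it with "[a-z]", 'a', 'z' for each affected locale should list the letters reported in the log, provided the same locale data is installed.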

Cheers,
Carlos.
Carlos O'Donell
2018-07-28 01:12:37 UTC
Post by Florian Weimer
shs_CA: U+0000E6 matches /[a-z]/ unexpectedly
shs_CA: U+0000C6 matches /[A-Z]/ unexpectedly
shs_CA.utf8: U+0000E6 matches /[a-z]/ unexpectedly
shs_CA.utf8: U+0000C6 matches /[A-Z]/ unexpectedly
This is a WIP, because the number of tests now is too big
to simply add them to tst-fnmatch.input, and so I'm writing
a new tester tst-rational-ranges.c. I'm parsing SUPPORTED,
expecting all of the locales to be built for testing, and
then running through all the rational ranges to test
inclusion of the required datums.

How slow is your tester? Should I do what you do to test
for the inclusion of characters that shouldn't be in the
range? Or will that take too long?

v5
- Add ~30k+ tests to tst-fnmatch.input.
- Fix broken locales:
  - Fix shs_CA to not reorder-after for no reason.

Could you run this through the tester please?

Cheers,
Carlos.
Florian Weimer
2018-07-30 17:39:56 UTC
Post by Carlos O'Donell
Post by Florian Weimer
shs_CA: U+0000E6 matches /[a-z]/ unexpectedly
shs_CA: U+0000C6 matches /[A-Z]/ unexpectedly
shs_CA.utf8: U+0000E6 matches /[a-z]/ unexpectedly
shs_CA.utf8: U+0000C6 matches /[A-Z]/ unexpectedly
This is a WIP, because the number of tests now is too big
to simply add them to tst-fnmatch.input, and so I'm writing
a new tester tst-rational-ranges.c. I'm parsing SUPPORTED,
expecting all of the locales to be built for testing, and
then running through all the rational ranges to test
inclusion of the required datums.
Let me repeat my suggestion that we should initially fix the locales
with the common collation order, where glibc 2.28 regresses.
Post by Carlos O'Donell
How slow is your tester? Should I do what you do to test
for the inclusion of characters that shouldn't be in the
range? Or will that take too long?
v5
- Add ~30k+ tests to tst-fnmatch.input.
- Fix shs_CA to not reorder-after for no reason.
Could you run this through the tester please?
It fails installation for me:

$ make localedata/install-locales DESTDIR=/tmp/locales
sl_SI.UTF-8...locales/sl_SI:1230: order for `U00000061' already defined
at locales/sl_SI:998
locales/sl_SI:1231: [error] symbol `S0062' not defined
locales/sl_SI:1231: [error] symbol `BASE' not defined
/bin/sh: line 17: 4148 Segmentation fault (core dumped) I18NPATH=.
GCONV_PATH=/home/fweimer/src/gnu/glibc/build/iconvdata LC_ALL=C
/home/fweimer/src/gnu/glibc/build/elf/ld-linux-x86-64.so.2
--library-path
/home/fweimer/src/gnu/glibc/build:/home/fweimer/src/gnu/glibc/build/math:/home/fweimer/src/gnu/glibc/build/elf:/home/fweimer/src/gnu/glibc/build/dlfcn:/home/fweimer/src/gnu/glibc/build/nss:/home/fweimer/src/gnu/glibc/build/nis:/home/fweimer/src/gnu/glibc/build/rt:/home/fweimer/src/gnu/glibc/build/resolv:/home/fweimer/src/gnu/glibc/build/mathvec:/home/fweimer/src/gnu/glibc/build/support:/home/fweimer/src/gnu/glibc/build/crypt:/home/fweimer/src/gnu/glibc/build/nptl
/home/fweimer/src/gnu/glibc/build/locale/localedef $flags
--alias-file=../intl/locale.alias -i locales/$input -f charmaps/$charset
--prefix=/tmp/locales $locale

GDB says this:

Core was generated by
`/home/fweimer/src/gnu/glibc/build/elf/ld-linux-x86-64.so.2
--library-path /home'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000000000419234 in output_weight (pool=pool@entry=0x7ffdf1550ce0,
collate=collate@entry=0x7fd5a8a03240,
elem=elem@entry=0x7fd5a8a9b300) at programs/ld-collate.c:1912
1912 len += utf8_encode (&buf[len],
(gdb) bt
#0 0x0000000000419234 in output_weight (pool=pool@entry=0x7ffdf1550ce0,
collate=collate@entry=0x7fd5a8a03240,
elem=elem@entry=0x7fd5a8a9b300) at programs/ld-collate.c:1912
#1 0x000000000041dc4a in collate_output () at programs/ld-collate.c:2180
#2 0x000000000042709f in write_all_categories
(definitions=0x7ffdf15513c0, charmap=charmap@entry=0x7fd5a71786a0,
locname=0x7ffdf1552e33 "sl_SI.UTF-8",
output_path=output_path@entry=0x7fd5a7178310
"/tmp/locales/usr/lib64/locale/sl_SI.utf8/")
at programs/locfile.c:337
#3 0x0000000000402f69 in main (argc=<optimized out>,
argv=0x7ffdf1551630) at programs/localedef.c:300
(gdb) l
1907 int i;
1908
1909 for (i = 0; i < elem->weights[cnt].cnt; ++i)
1910 /* Encode the weight value. We do nothing for IGNORE entries. */
1911 if (elem->weights[cnt].w[i] != NULL)
1912 len += utf8_encode (&buf[len],
1913 elem->weights[cnt].w[i]->mborder[cnt]);
1914
1915 /* And add the buffer content. */
1916 obstack_1grow (pool, len);
(gdb) print elem->weights[cnt].w[i]->mborder[cnt]
Cannot access memory at address 0x0
(gdb) print elem->weights[cnt].w[i]->mborder
$3 = (int *) 0x0
(gdb)

Any idea what is going on?

Thanks,
Florian
Carlos O'Donell
2018-07-30 17:45:51 UTC
Post by Florian Weimer
Post by Carlos O'Donell
Post by Florian Weimer
shs_CA: U+0000E6 matches /[a-z]/ unexpectedly
shs_CA: U+0000C6 matches /[A-Z]/ unexpectedly
shs_CA.utf8: U+0000E6 matches /[a-z]/ unexpectedly
shs_CA.utf8: U+0000C6 matches /[A-Z]/ unexpectedly
This is a WIP, because the number of tests now is too big
to simply add them to tst-fnmatch.input, and so I'm writing
a new tester tst-rational-ranges.c. I'm parsing SUPPORTED,
expecting all of the locales to be built for testing, and
then running through all the rational ranges to test
inclusion of the required datums.
Let me repeat my suggestion that we should initially fix the locales
with the common collation order, where glibc 2.28 regresses.
I do not think it is appropriate to release rational range support on
only a subset of the SUPPORTED set of locales. Either we support it on
all SUPPORTED locales or we work until we are ready.

At present glibc 2.28 does not regress because of commit
7cd7d36f1feb3ccacf476e909b115b45cdd46e77 to deinterlace lower and
uppercase.

In glibc 2.28 we simply have ~2500 characters in the range of a-z,
and in 2.27 we had ~250, it's still a large set of non-ASCII characters
accepted by the range, all because we caught up to Unicode 9.0.0 with
the ISO 14651 collation update (and will soon be updated to Unicode 10.0.0
with the next release, and probably always lagging a bit).

I don't see an urgent need to get rational range support into 2.28.
I was happy to get it in earlier, but now with deeper testing showing
that not all locales are working correctly, I'm not happy to see this
go out the door. I think it will be ready very shortly, and we can check
it in immediately into 2.29, and then continue our work on code point
ranges as the next step, which will require even more testing, and
internal API cleanup.
--
Cheers,
Carlos.
Florian Weimer
2018-07-30 17:54:47 UTC
Post by Carlos O'Donell
Post by Florian Weimer
Post by Carlos O'Donell
Post by Florian Weimer
shs_CA: U+0000E6 matches /[a-z]/ unexpectedly
shs_CA: U+0000C6 matches /[A-Z]/ unexpectedly
shs_CA.utf8: U+0000E6 matches /[a-z]/ unexpectedly
shs_CA.utf8: U+0000C6 matches /[A-Z]/ unexpectedly
This is a WIP, because the number of tests now is too big
to simply add them to tst-fnmatch.input, and so I'm writing
a new tester tst-rational-ranges.c. I'm parsing SUPPORTED,
expecting all of the locales to be built for testing, and
then running through all the rational ranges to test
inclusion of the required datums.
Let me repeat my suggestion that we should initially fix the locales
with the common collation order, where glibc 2.28 regresses.
I do not think it is appropriate to release rational range support on
only a subset of the SUPPORTED set of locales. Either we support it on
all SUPPORTED locales or we work until we are ready.
At present glibc 2.28 does not regress because of commit
7cd7d36f1feb3ccacf476e909b115b45cdd46e77 to deinterlace lower and
uppercase.
In glibc 2.28 we simply have ~2500 characters in the range of a-z,
and in 2.27 we had ~250, it's still a large set of non-ASCII characters
accepted by the range, all because we caught up to Unicode 9.0.0 with
the ISO 14651 collation update (and will soon be updated to Unicode 10.0.0
with the next release, and probably always lagging a bit).
Ahh. So it's more complex and a regression longer in the making.
Post by Carlos O'Donell
I don't see an urgent need to get rational range support into 2.28.
I was happy to get it in earlier, but now with deeper testing showing
that not all locales are working correctly, I'm not happy to see this
go out the door. I think it will be ready very shortly, and we can check
it in immediately into 2.29, and then continue our work on code point
ranges as the next step, which will require even more testing, and
internal API cleanup.
Sounds reasonable.

Thanks,
Florian
Carlos O'Donell
2018-07-30 18:25:56 UTC
Post by Carlos O'Donell
Post by Florian Weimer
Post by Carlos O'Donell
Post by Florian Weimer
shs_CA: U+0000E6 matches /[a-z]/ unexpectedly
shs_CA: U+0000C6 matches /[A-Z]/ unexpectedly
shs_CA.utf8: U+0000E6 matches /[a-z]/ unexpectedly
shs_CA.utf8: U+0000C6 matches /[A-Z]/ unexpectedly
This is a WIP, because the number of tests now is too big
to simply add them to tst-fnmatch.input, and so I'm writing
a new tester tst-rational-ranges.c. I'm parsing SUPPORTED,
expecting all of the locales to be built for testing, and
then running through all the rational ranges to test
inclusion of the required datums.
Let me repeat my suggestion that we should initially fix the locales
with the common collation order, where glibc 2.28 regresses.
I do not think it is appropriate to release rational range support on
only a subset of the SUPPORTED set of locales. Either we support it on
all SUPPORTED locales or we work until we are ready.
At present glibc 2.28 does not regress because of commit
7cd7d36f1feb3ccacf476e909b115b45cdd46e77 to deinterlace lower and
uppercase.
In glibc 2.28 we simply have ~2500 characters in the range of a-z,
and in 2.27 we had ~250, it's still a large set of non-ASCII characters
accepted by the range, all because we caught up to Unicode 9.0.0 with
the ISO 14651 collation update (and will soon be updated to Unicode 10.0.0
with the next release, and probably always lagging a bit).
Ahh.  So it's more complex and a regression longer in the making.
I'm worried I don't quite follow your statement of "longer in the making,"
but let me summarize what I think you wrote, and tell me if I have
it right.

The regression, from the perspective of en_US, is that [a-z] in master
accepts uppercase ASCII characters, and this breaks user expectations.

This is the only regression I'm considering serious enough to block the
release for and we've fixed it for now.

The regression which you say is "longer in the making" is that at some
point in the past the collation data for en_US contained only ASCII
ranges for a-z, A-Z, and 0-9. Then at some point in the past the ranges,
particularly those from a-z, and A-Z began accepting non-ASCII characters.

Thus the regression, from your perspective, happened far in the past.

As far as I can tell the regression has existed since the first import
for en_US which copied LC_COLLATE from en_DK (showing en_DK):
~~~
f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 967) <A> <A>;<NONE>;<CAPITAL>;IGNORE
f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 968) <a> <A>;<NONE>;<SMALL>;IGNORE
...
f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 1546) <Z> <Z>;<NONE>;<CAPITAL>;IGNORE
f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 1547) <z> <Z>;<NONE>;<SMALL>;IGNORE
f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 1548) <Z'> <Z>;<ACUTE>;<CAPITAL>;IGNORE
~~~
Is this what you mean by "longer in the making?"

I expect that en_US at some point along the way was switched to use the
iso14651_t1 data, and so gained non-interleaved a-z/A-Z CEO, but it's hard
to tell exactly whether CEO was fully functional, whether fnmatch worked as
expected, etc.
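
To make the effect of interleaving concrete, here is a toy model of a range resolved by position in an ordering. It is a sketch under stated assumptions only: the two position functions are hypothetical stand-ins for a collation element order restricted to ASCII letters (one shaped like the en_DK rules above, one deinterlaced), and none of this is how __collseq_table_lookup() is implemented.

```c
/* Toy model, ASCII letters only; not glibc code.  */

/* Position in an interleaved order A a B b ... Z z, the shape of the
   en_DK rules shown above.  Returns -1 for anything else.  */
int
interleaved_pos (int c)
{
  if (c >= 'A' && c <= 'Z')
    return 2 * (c - 'A');
  if (c >= 'a' && c <= 'z')
    return 2 * (c - 'a') + 1;
  return -1;
}

/* Position in a deinterlaced order a b ... z A B ... Z.
   Returns -1 for anything else.  */
int
deinterlaced_pos (int c)
{
  if (c >= 'a' && c <= 'z')
    return c - 'a';
  if (c >= 'A' && c <= 'Z')
    return 26 + (c - 'A');
  return -1;
}

/* Resolve the range [LO-HI] by position: C is a member when its
   position lies between the positions of the two endpoints.
   LO and HI are expected to be ASCII letters.  */
int
in_range (int (*pos) (int), int lo, int hi, int c)
{
  int p = pos (c);

  return p >= 0 && pos (lo) <= p && p <= pos (hi);
}
```

In the interleaved model [a-z] covers every uppercase letter except A, because B through Z all sit between a and z. In the deinterlaced model [a-z] covers only lowercase letters.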

Either way this is all a poorly understood and structured solution at this
point, and I hope that in 1 or 2 releases we go from "unusable interface" to
"rational ranges (data)" to "full rational ranges (code point ranges)" and
end up with a sensible portable solution.
Post by Carlos O'Donell
I don't see an urgent need to get rational range support into 2.28.
I was happy to get it in earlier, but now with deeper testing showing
that not all locales are working correctly, I'm not happy to see this
go out the door. I think it will be ready very shortly, and we can check
it in immediately into 2.29, and then continue our work on code point
ranges as the next step, which will require even more testing, and
internal API cleanup.
Sounds reasonable.
That sounds great. I will continue to update this patch set and get some
independent checking from your scripts, and my own testing. I also need
to add collation tests for all the locales I touch to ensure that the
reordering is just that, and that it doesn't materially change the collation
sequence (if it does it's a bug). This all adds more coverage to the
SUPPORTED set of languages which is a positive thing.
--
Cheers,
Carlos.
Florian Weimer
2018-07-30 18:34:43 UTC
Post by Carlos O'Donell
As far as I can tell the regression has existed since the first import
~~~
f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 967) <A> <A>;<NONE>;<CAPITAL>;IGNORE
f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 968) <a> <A>;<NONE>;<SMALL>;IGNORE
...
f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 1546) <Z> <Z>;<NONE>;<CAPITAL>;IGNORE
f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 1547) <z> <Z>;<NONE>;<SMALL>;IGNORE
f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 1548) <Z'> <Z>;<ACUTE>;<CAPITAL>;IGNORE
~~~
Is this what you mean by "longer in the making?"
Yes, that's what I meant. I didn't check whether it went back to 2.17,
2.12, or even earlier.

Thanks,
Florian
Carlos O'Donell
2018-07-31 02:18:15 UTC
I'm so sorry to waste your time like this.

I apparently failed to test sl_SI.
Post by Florian Weimer
$ make localedata/install-locales DESTDIR=/tmp/locales
sl_SI.UTF-8...locales/sl_SI:1230: order for `U00000061' already defined at locales/sl_SI:998
locales/sl_SI:1231: [error] symbol `S0062' not defined
locales/sl_SI:1231: [error] symbol `BASE' not defined
... this is a cascading set of errors.
Post by Florian Weimer
(gdb) print elem->weights[cnt].w[i]->mborder[cnt]
Cannot access memory at address 0x0
(gdb) print elem->weights[cnt].w[i]->mborder
$3 = (int *) 0x0
(gdb)
Any idea what is going on?
The parser should have stopped at the first error IMO; going any further
just results in problems. It's very hard to roll back the state of the parser
and data structures if there is an error in the source files. It should just
have stopped at the duplicate U0061 definition.
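
The stop-at-first-error idea can be sketched as follows. This is an illustration only: the table layout and the names order_table, define_order, and load_rules are hypothetical and bear no relation to the data structures in localedef's ld-collate.c.

```c
#include <stddef.h>
#include <string.h>

#define MAX_SYMBOLS 64

/* Hypothetical symbol table: the names that already have an order.  */
struct order_table
{
  const char *names[MAX_SYMBOLS];
  size_t count;
};

/* Record NAME.  Returns 0 on success, -1 if NAME already has an order
   or the table is full.  The table is left unchanged on failure.  */
int
define_order (struct order_table *t, const char *name)
{
  for (size_t i = 0; i < t->count; ++i)
    if (strcmp (t->names[i], name) == 0)
      return -1;
  if (t->count == MAX_SYMBOLS)
    return -1;
  t->names[t->count++] = name;
  return 0;
}

/* Feed N rule names into T and stop at the first error instead of
   carrying on with inconsistent state.  Returns the index of the
   offending rule, or N if every rule was accepted.  */
size_t
load_rules (struct order_table *t, const char *const *rules, size_t n)
{
  for (size_t i = 0; i < n; ++i)
    if (define_order (t, rules[i]) != 0)
      return i;
  return n;
}
```

A caller would report the offending index and exit rather than going on to write output from a partially built table.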

I'm testing a v6 with the sl_SI fixes, and a new test case.
--
Cheers,
Carlos.
Rafal Luzynski
2018-07-25 21:06:04 UTC
[...]
modified: localedata/locales/ar_SA
modified: localedata/locales/km_KH
modified: localedata/locales/lo_LA
modified: localedata/locales/or_IN
modified: localedata/locales/sl_SI
modified: localedata/locales/th_TH
They all re-arranged ASCII character collation element ordering like tr_TR,
and so they needed manual fixing.
Please check bg_BG. It also has a large reorder: puts all Cyrillic characters
before Latin. (However, this may not be relevant at all.)

Regards,

Rafal
Carlos O'Donell
2018-07-25 21:12:52 UTC
Post by Rafal Luzynski
[...]
modified: localedata/locales/ar_SA
modified: localedata/locales/km_KH
modified: localedata/locales/lo_LA
modified: localedata/locales/or_IN
modified: localedata/locales/sl_SI
modified: localedata/locales/th_TH
They all re-arranged ASCII character collation element ordering like tr_TR,
and so they needed manual fixing.
Please check bg_BG. It also has a large reorder: puts all Cyrillic characters
before Latin. (However, this may not be relevant at all.)
Right, that won't affect the rational range for ASCII.

The new tst-fnmatch.input has this:

886 bg_BG.UTF-8 "a" "[a-z]" 0
887 bg_BG.UTF-8 "z" "[a-z]" 0
888 bg_BG.UTF-8 "A" "[a-z]" NOMATCH
889 bg_BG.UTF-8 "Z" "[a-z]" NOMATCH
890 bg_BG.UTF-8 "A" "[A-Z]" 0
891 bg_BG.UTF-8 "Z" "[A-Z]" 0
892 bg_BG.UTF-8 "a" "[A-Z]" NOMATCH
893 bg_BG.UTF-8 "z" "[A-Z]" NOMATCH

Which tests the range extremes, and it passes.

It doesn't reorder any actual LATIN characters and so it's safe.
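
The eight range-extreme rows above can be expressed as a small table-driven check. This is a sketch, not the tst-fnmatch harness: the struct and function names are hypothetical, and only setlocale() and fnmatch() are real interfaces. It assumes the named locale is installed.

```c
#include <fnmatch.h>
#include <locale.h>
#include <stddef.h>

/* One row in the style of tst-fnmatch.input: string, pattern, and the
   fnmatch result we expect (0 for a match, FNM_NOMATCH otherwise).  */
struct range_case
{
  const char *string;
  const char *pattern;
  int expected;
};

static const struct range_case range_extremes[] = {
  { "a", "[a-z]", 0 },
  { "z", "[a-z]", 0 },
  { "A", "[a-z]", FNM_NOMATCH },
  { "Z", "[a-z]", FNM_NOMATCH },
  { "A", "[A-Z]", 0 },
  { "Z", "[A-Z]", 0 },
  { "a", "[A-Z]", FNM_NOMATCH },
  { "z", "[A-Z]", FNM_NOMATCH },
};

/* Run all rows under LOCALE_NAME.  Returns the number of rows whose
   result differs from the expectation, or -1 if the locale cannot be
   selected.  */
int
check_range_extremes (const char *locale_name)
{
  int failures = 0;

  if (setlocale (LC_ALL, locale_name) == NULL)
    return -1;

  for (size_t i = 0; i < sizeof range_extremes / sizeof range_extremes[0]; ++i)
    if (fnmatch (range_extremes[i].pattern, range_extremes[i].string, 0)
        != range_extremes[i].expected)
      ++failures;

  return failures;
}
```

Checking only the extremes is cheap, but as the sl_SI and sv_SE failures show, it will not catch a letter in the middle of the alphabet that has been reordered out of the range.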

Cheers,
Carlos.
Carlos O'Donell
2018-07-25 21:35:10 UTC
Post by Carlos O'Donell
In commit 9479b6d5e08eacce06c6ab60abc9b2f4eb8b71e4 we updated all of
the collation data to harmonize with the new version of ISO 14651
which is derived from Unicode 9.0.0. This collation update brought
with it some changes to locales which were not desirable by some
users, in particular it altered the meaning of the
locale-dependent-range regular expression, namely [a-z] and [A-Z], and
for en_US it caused uppercase letters to be matched by [a-z] for the
first time. The matching of uppercase letters by [a-z] is something
which is already known to users of other locales which have this
property, but this change could cause significant problems to en_US
and other similar locales that had never had this change before.
Whether this behaviour is desirable or not is contentious and GNU Awk
https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html
While the POSIX standard also has this further to say: "RE Bracket
http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html
"The current standard leaves unspecified the behavior of a range
expression outside the POSIX locale. ... As noted above, efforts were
made to resolve the differences, but no solution has been found that
would be specific enough to allow for portable software while not
invalidating existing implementations."
In glibc we implement the requirement of ISO POSIX-2:1993 and use
collation element order (CEO) to construct the range expression, the
API internally is __collseq_table_lookup(). The fact that we use CEO
and also have 4-level weights on each collation rule means that we can
in practice reorder the collation rules in iso14651_t1_common (the new
data) to provide consistent range expression resolution *and* the
weights should maintain the expected total order. Therefore this
* Reorder the collation rules for the LATIN script in
iso14651_t1_common to deinterlace uppercase and lowercase letters in
the collation element orders.
* Adds new test data en_US.UTF-8.in for sort-test.sh which exercises
strcoll* and strxfrm* and ensures the ISO 14651 collation remains.
* Add back tests to tst-fnmatch.input and tst-regexloc.c which
exercise that [a-z] does not match A or Z.
The reordering of the ISO 14651 data is done in an entirely mechanical
https://sourceware.org/bugzilla/show_bug.cgi?id=23393#c28
It is up for discussion if the iso14651_t1_common data should be
refined further to have 3 very tight collation element ranges that
include only a-z, A-Z, and 0-9, which would implement the solution
https://sourceware.org/bugzilla/show_bug.cgi?id=23393#c12
No regressions on x86_64.
Verified that removal of the iso14651_t1_common change causes tst-fnmatch
422: fnmatch ("[a-z]", "A", 0) = 0 (FAIL, expected FNM_NOMATCH) ***
...
425: fnmatch ("[A-Z]", "z", 0) = 0 (FAIL, expected FNM_NOMATCH) ***
---
ChangeLog | 11 +
localedata/Makefile | 1 +
localedata/en_US.UTF-8.in | 2159 +++++++++++++++++++++++++++++++++
localedata/locales/iso14651_t1_common | 1928 ++++++++++++++---------------
posix/tst-fnmatch.input | 125 +-
posix/tst-regexloc.c | 8 +-
6 files changed, 3224 insertions(+), 1008 deletions(-)
create mode 100644 localedata/en_US.UTF-8.in
I'm suggesting this change immediately for 2.28 to avoid further
problems with users expectations and sorting with [a-z] and [A-Z] until
a clearer consensus can be reached for a final solution.
File attached as .tar.gz to get past spam detectors. There is a lot
of UTF-8 data in en_US.UTF-8 (every possible character in the LATIN
set that can be sorted with the existing test case infrastructure).
I have committed only the most conservative fix for this issue, which is
to deinterlace the lower and upper case ranges.

I think we are too late to commit rational ranges, and we can do that in
2.29 when it opens. Right now I want to remove the blocker that is causing
regressions for en_US.UTF-8 scripts that use [a-z], and [A-Z].

We have consensus that this is the right direction for a solution,
and if anyone objects, please speak up before I cut the branch on August 1st
(if we can still achieve that and get good machine coverage).

Cheers,
Carlos.
Florian Weimer
2018-07-25 22:50:37 UTC
Post by Carlos O'Donell
I have committed only the most conservative fix for this issue, which is
to deinterlace the lower and upper case ranges.
I think we are too late to commit rational ranges, and we can do that in
2.29 when it opens. Right now I want to remove the blocker that is causing
regressions for en_US.UTF-8 scripts that use [a-z], and [A-Z].
How is this the most conservative fix, relative to glibc 2.27 upstream?

[a-z] still matches lots of non-ASCII characters, which it did not before.

When I said that we had left regression-fixing territory, I was talking
about the locales which had iso14651_t1_common customizations.

Thanks,
Florian
Carlos O'Donell
2018-07-26 01:20:13 UTC
Post by Florian Weimer
Post by Carlos O'Donell
I have committed only the most conservative fix for this issue,
which is to deinterlace the lower and upper case ranges.
I think we are too late to commit rational ranges, and we can do
that in 2.29 when it opens. Right now I want to remove the blocker
that is causing regressions for en_US.UTF-8 scripts that use [a-z],
and [A-Z].
How is this the most conservative fix, relative to glibc 2.27
upstream?
We have two solutions to fix the regression:

* Revert the entire ISO 14651 update.
- This is 13 commits for just the update.
- Several more commits for Rafal and Mike's work on locales on top of that.

* Fix the key issue of a-z interleaving with A-Z.

My opinion is that it is most conservative to fix the interleaving.

In 2.27 we accepted 297 characters between A-Z.

In 2.28 we accept 2280 characters between A-Z as part of the ISO 14651 update.
Post by Florian Weimer
[a-z] still matches lots of non-ASCII characters, which it did not before.
This is not true, we were already matching 297 characters between A-Z
in 2.27. It has always been the case that we accepted non-ASCII characters
in the range. With the ISO 14651 update the *key* issue was that lowercase
and uppercase were now mixed in collation element ordering, resulting in
surprising matches and failures like the reported xfs test failure where
[a-z] matched "Makefile" and broke their test infrastructure.
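
One way to get a rough feel for how wide a range is in a given locale is to walk code points and ask fnmatch() about each one. The sketch below does that. The helper name count_range_members is hypothetical, and it counts code points representable in the locale's charset rather than collation elements, so it is not claimed to reproduce the 297 and 2280 figures quoted here.

```c
#include <fnmatch.h>
#include <limits.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not a glibc interface.
   Count the code points in 1..LAST whose multibyte encoding under
   LOCALE_NAME matches PATTERN.  Code points the locale's charset
   cannot encode are skipped.  Returns -1 if the locale cannot be
   selected.  */
long
count_range_members (const char *locale_name, const char *pattern,
                     long last)
{
  long count = 0;

  if (setlocale (LC_ALL, locale_name) == NULL)
    return -1;

  for (long cp = 1; cp <= last; ++cp)
    {
      char buf[MB_LEN_MAX + 1];
      mbstate_t state;
      size_t n;

      /* Start each conversion from the initial shift state.  */
      memset (&state, 0, sizeof state);
      n = wcrtomb (buf, (wchar_t) cp, &state);
      if (n == (size_t) -1)
        continue;
      buf[n] = '\0';

      if (fnmatch (pattern, buf, 0) == 0)
        ++count;
    }
  return count;
}
```

Comparing the result for "[A-Z]" between two glibc builds with the same locale name and the same upper bound would show whether the range grew, without depending on the exact totals.
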
Post by Florian Weimer
When I said that we had left regression-fixing territory, I was talking
about the locales which had iso14651_t1_common customizations.
OK, so to be clear you think we *should* go forward with rational ranges?

I don't think it's too late. We could commit it tomorrow, and it should not
impact machine testing in any way.

My v4 fixes all of the locales that either have customizations on
iso14651_t1_common or have their own custom locales. No more locales
remain to be fixed, I tested all of them with tst-fnmatch.input additions
to catch the ones that needed fixing.

Cheers,
Carlos.
Andreas Schwab
2018-07-26 08:08:56 UTC
Post by Carlos O'Donell
surprising matches and failures like the reported xfs test failure where
[a-z] matched "Makefile"
??? [a-z] has always done that.

Andreas.
--
Andreas Schwab, SUSE Labs, ***@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE 1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."
Florian Weimer
2018-07-26 09:16:10 UTC
Post by Andreas Schwab
Post by Carlos O'Donell
surprising matches and failures like the reported xfs test failure where
[a-z] matched "Makefile"
??? [a-z] has always done that.
It's about the glob/fnmatch pattern “[a-z]*”.

Florian
Jonathan Nieder
2018-07-26 01:33:51 UTC
Hi,
Post by Carlos O'Donell
In commit 9479b6d5e08eacce06c6ab60abc9b2f4eb8b71e4 we updated all of
the collation data to harmonize with the new version of ISO 14651
which is derived from Unicode 9.0.0. This collation update brought
with it some changes to locales which were not desirable by some
users, in particular it altered the meaning of the
locale-dependent-range regular expression, namely [a-z] and [A-Z], and
for en_US it caused uppercase letters to be matched by [a-z] for the
first time.
The Debian system where it is most convenient for me to test has
Debian's libc6 package, version 2.24-12. [a-z] matches uppercase
letters. I've always considered that undesirable but I'm confused
about the described regression. Did one of Debian's patches to
localedata cause it to pick up the regression early (by which I mean,
more than 5 years ago)?
Post by Carlos O'Donell
In glibc we implement the requirement of ISO POSIX-2:1993 and use
collation element order (CEO) to construct the range expression, the
API internally is __collseq_table_lookup(). The fact that we use CEO
and also have 4-level weights on each collation rule means that we can
in practice reorder the collation rules in iso14651_t1_common (the new
data) to provide consistent range expression resolution *and* the
weights should maintain the expected total order.
[...]
Post by Carlos O'Donell
* Adds new test data en_US.UTF-8.in for sort-test.sh which exercises
strcoll* and strxfrm* and ensures the ISO 14651 collation remains.
Cool! Checking my understanding: does this mean that if I have files

lll
MMM
nnn

that with this patch,

echo [a-z]*

would no longer match MMM, and

ls | sort

would continue to sort in the order lll < MMM < nnn?

I wish we had done it 10 years ago. ;-) Thanks for getting it done.

Jonathan
Carlos O'Donell
2018-07-26 01:49:32 UTC
Post by Jonathan Nieder
Hi,
Post by Carlos O'Donell
In commit 9479b6d5e08eacce06c6ab60abc9b2f4eb8b71e4 we updated all of
the collation data to harmonize with the new version of ISO 14651
which is derived from Unicode 9.0.0. This collation update brought
with it some changes to locales which were not desirable by some
users, in particular it altered the meaning of the
locale-dependent-range regular expression, namely [a-z] and [A-Z], and
for en_US it caused uppercase letters to be matched by [a-z] for the
first time.
The Debian system where it is most convenient for me to test has
Debian's libc6 package, version 2.24-12. [a-z] matches uppercase
letters. I've always considered that undesirable but I'm confused
about the described regression. Did one of Debian's patches to
localedata cause it to pick up the regression early (by which I mean,
more than 5 years ago)?
It depends entirely on the locale you use. Some locales already have
[a-z] matching uppercase and have had it for years. The problem is that
this is new for en_US.UTF-8.

Which locale did you use? en_US.UTF-8? If so, then yes, Debian must have
done something different with iso14651_t1_common to change this, or added
something else. I did a quick look at the Debian patches for 2.24-12 and
didn't see anything that would change this materially for en_US.
Post by Jonathan Nieder
Post by Carlos O'Donell
In glibc we implement the requirement of ISO POSIX-2:1993 and use
collation element order (CEO) to construct the range expression, the
API internally is __collseq_table_lookup(). The fact that we use CEO
and also have 4-level weights on each collation rule means that we can
in practice reorder the collation rules in iso14651_t1_common (the new
data) to provide consistent range expression resolution *and* the
weights should maintain the expected total order.
[...]
Post by Carlos O'Donell
* Adds new test data en_US.UTF-8.in for sort-test.sh which exercises
strcoll* and strxfrm* and ensures the ISO 14651 collation remains.
Cool! Checking my understanding: does this mean that if I have files
lll
MMM
nnn
that with this patch,
echo [a-z]*
would no longer match MMM, and
Correct.
Post by Jonathan Nieder
ls | sort
would continue to sort in the order lll < MMM < nnn?
Yes.
Post by Jonathan Nieder
I wish we had done it 10 years ago. ;-) Thanks for getting it done.
The rational ranges follow code point order.

The sorting follows collation sequence.
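
As a rough illustration of the two statements above (a sketch, not glibc code): the range test below looks only at code points, while the sort uses a hypothetical `toy_collation_key` that stands in for the locale's collation weights. The key function and its tie-breaking rule (lowercase before uppercase) are assumptions made for this example.

```python
def in_rational_range(ch, lo, hi):
    """Range test by code point only, as a 'rational range' would do."""
    return ord(lo) <= ord(ch) <= ord(hi)


def toy_collation_key(s):
    """Hypothetical stand-in for collation weights: compare letters
    ignoring case first, then put lowercase before uppercase on ties."""
    return (s.lower(), [c.isupper() for c in s])


names = ["nnn", "MMM", "lll"]

# Which names would a code-point-based [a-z]* pick up?
matched = [n for n in names if in_rational_range(n[0], "a", "z")]

# How would a collation-based sort order them?
ordered = sorted(names, key=toy_collation_key)

print(matched)
print(ordered)
```

With these toy definitions, MMM is not matched by the range but still sorts between lll and nnn, which is the combination asked about above.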

I think this was never an issue because most locales following ISO 14651
were using an old data set which never exhibited this issue. However, thanks
to Mike Fabian's hard work (and no good deed goes unpunished) we have updated
collation all the way to Unicode 9.0.0-era and so encountered this problem.

Cheers,
Carlos.
Jonathan Nieder
2018-07-26 02:16:43 UTC
Permalink
Post by Carlos O'Donell
Post by Jonathan Nieder
The Debian system where it is most convenient for me to test has
Debian's libc6 package, version 2.24-12. [a-z] matches uppercase
letters. I've always considered that undesirable but I'm confused
about the described regression. Did one of Debian's patches to
localedata cause it to pick up the regression early (by which I mean,
more than 5 years ago)?
It depends entirely on the locale you use. Some locales already have
[a-z] matching uppercase and have had it for years. The problem is that
this is new for en_US.UTF-8.
Which locale did you use? en_US.UTF-8? If so, then yes, Debian must have
done something different with iso14651_t1_common to change this, or added
something else. I did a quick look at the debian patches for 2.24-12 and
didn't see anything that would change this materially for en_US.
I tried with the following locales:

en_US: matches (bad)
en_US.UTF-8: matches (bad)
C: does not match (good)
C.UTF-8: does not match (good)
fr_CH: matches (bad)
fr_CH.UTF-8: matches (bad)

Looking over
https://salsa.debian.org/glibc-team/glibc/tree/sid/debian/patches/localedata
and https://salsa.debian.org/glibc-team/glibc/tree/sid/debian/patches/locale,
I don't see any obvious culprits. Anyway, please just take this as more
feedback in favor of your approach.

See the user reports merged with https://bugs.debian.org/301717.

Thanks,
Jonathan
Carlos O'Donell
2018-07-26 03:48:26 UTC
Permalink
Post by Jonathan Nieder
Post by Carlos O'Donell
Post by Jonathan Nieder
The Debian system where it is most convenient for me to test has
Debian's libc6 package, version 2.24-12. [a-z] matches uppercase
letters. I've always considered that undesirable but I'm confused
about the described regression. Did one of Debian's patches to
localedata cause it to pick up the regression early (by which I mean,
more than 5 years ago)?
It depends entirely on the locale you use. Some locales already have
[a-z] matching uppercase and have had it for years. The problem is that
this is new for en_US.UTF-8.
Which locale did you use? en_US.UTF-8? If so, then yes, Debian must have
done something different with iso14651_t1_common to change this, or added
something else. I did a quick look at the debian patches for 2.24-12 and
didn't see anything that would change this materially for en_US.
en_US: matches (bad)
en_US.UTF-8: matches (bad)
C: does not match (good)
C.UTF-8: does not match (good)
fr_CH: matches (bad)
fr_CH.UTF-8: matches (bad)
Looking over
https://salsa.debian.org/glibc-team/glibc/tree/sid/debian/patches/localedata
and https://salsa.debian.org/glibc-team/glibc/tree/sid/debian/patches/locale,
I don't see any obvious culprits. Anyway, please just take this as more
feedback in favor of your approach.
See the user reports merged with https://bugs.debian.org/301717.
This is your shell doing the expansion and, worse, doing it
differently from glibc.

My bash shell also handles [a-z] expansion differently given
the locale data. It appears to be using collation sequence,
i.e. the order in which the elements sort.

Using grep doesn't result in these matches.

The fix is this: `shopt -s globasciiranges`, and we should
make it the default from now on. The option turns on rational
ranges for bash. Florian found this out when digging into
the issue.

We have a lot of cleanup to do to get rational ranges on
at each step of expansion.

Cheers,
Carlos.
Florian Weimer
2018-07-26 07:42:37 UTC
Permalink
Post by Jonathan Nieder
Looking over
https://salsa.debian.org/glibc-team/glibc/tree/sid/debian/patches/localedata
and https://salsa.debian.org/glibc-team/glibc/tree/sid/debian/patches/locale,
I don't see any obvious culprits. Anyway, please just take this as more
feedback in favor of your approach.
See the user reports merged with https://bugs.debian.org/301717.
The bash implementation of glob always uses strcoll/wcscoll ordering
when globasciiranges is not active. It does not use collation element
ordering, so rearranging collation data does not affect it. This means
that the changes discussed here will not affect bash (well, the glob
part at least).
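
A minimal sketch of what a range test based on string comparison looks like, as opposed to one based on code points. This is not bash's source code: `toy_strcoll` is a hypothetical stand-in for strcoll() under an en_US-like locale, and its ordering (a < A < b < B < ... < z < Z) is an assumption for illustration.

```python
def toy_strcoll(a, b):
    """Hypothetical stand-in for strcoll(): letters compare ignoring
    case first; on ties, lowercase sorts before uppercase."""
    ka = (a.lower(), [c.isupper() for c in a])
    kb = (b.lower(), [c.isupper() for c in b])
    return (ka > kb) - (ka < kb)


def in_range_by_collation(ch, lo, hi, coll=toy_strcoll):
    """ch is in [lo-hi] if it collates at or after lo and at or before hi."""
    return coll(lo, ch) <= 0 and coll(ch, hi) <= 0


def in_range_by_code_point(ch, lo, hi):
    """ch is in [lo-hi] if its code point lies between those of lo and hi."""
    return ord(lo) <= ord(ch) <= ord(hi)


for ch in "aAmMzZ":
    print(ch,
          in_range_by_collation(ch, "a", "z"),
          in_range_by_code_point(ch, "a", "z"))
```

Under this toy ordering the collation-based test accepts M for [a-z] but rejects Z, while the code-point test rejects every uppercase letter.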

Thanks,
Florian
Andreas Schwab
2018-07-26 08:18:30 UTC
Permalink
The bash implementation of glob always uses strcoll/wcscoll ordering when
globasciiranges is not active. It does not use collation element ordering,
so rearranging collation data does not affect it.
Why does strcoll not agree with the collation sequence?

Andreas.
--
Andreas Schwab, SUSE Labs, ***@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE 1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."
Florian Weimer
2018-07-26 09:15:38 UTC
Permalink
Post by Andreas Schwab
The bash implementation of glob always uses strcoll/wcscoll ordering when
globasciiranges is not active. It does not use collation element ordering,
so rearranging collation data does not affect it.
Why does strcoll not agree with the collation sequence?
The collation element ordering is encoded in the _NL_COLLATE_COLLSEQMB
and _NL_COLLATE_COLLSEQWC tables, and not the weights used by strcoll.

Thanks,
Florian
Carlos O'Donell
2018-07-26 13:25:33 UTC
Permalink
Post by Andreas Schwab
The bash implementation of glob always uses strcoll/wcscoll ordering when
globasciiranges is not active. It does not use collation element ordering,
so rearranging collation data does not affect it.
Why does strcoll not agree with the collation sequence?
There are two terms that mean very different things.

The strcoll output and collation sequence are the same.

The collation sequence is not the same as the collation element ordering
(the order of the rules in the source file).

POSIX mandated the use of collation element ordering (not sequence) for
regular expression ranges, and then decided this was a bad idea and instead
made it unspecified.

In glibc we continue to implement and support collation element ordering,
not collation sequence, for POSIX regular expression ranges.

Even collation sequence is a bad idea because [a-z] does not include all the
z's that are sorted after z, and you need special collation element markers
like AFTER-Z to find all the z's. Instead we should use rational ranges
and make everything based on code points to make it portable across all
locales.
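
The difference between collation element ordering and collation sequence described above can be sketched with a toy model. Nothing here is real ISO 14651 data or glibc code: the rule lists, the two-level weights, and the helper names (`make_rules`, `ceo_range`, `collation_sort`) are all hypothetical. The point is only that the position of a rule in the source and the weights attached to it are independent, so rules can be reordered without changing the sort order.

```python
def make_rules(chars):
    """Build toy collation rules in the given source order.
    Each rule is (character, weights); weights are
    (base-letter weight, case weight), lowercase before uppercase."""
    return [(c, (ord(c.lower()), c.isupper())) for c in chars]


def ceo_range(rules, lo, hi):
    """Collation element ordering: characters whose rule position
    lies between the positions of lo and hi in the source order."""
    order = [c for c, _ in rules]
    return order[order.index(lo): order.index(hi) + 1]


def collation_sort(rules, chars):
    """Collation sequence: sort by weights, ignoring rule position."""
    weights = dict(rules)
    return sorted(chars, key=lambda c: weights[c])


interleaved = make_rules("aAbBcC")  # rule order interleaves case
regrouped = make_rules("abcABC")    # same weights, rules reordered

print(ceo_range(interleaved, "a", "c"))
print(ceo_range(regrouped, "a", "c"))
print(collation_sort(interleaved, "CbAacB"))
print(collation_sort(regrouped, "CbAacB"))
```

In this model the range a..c contains uppercase letters only for the interleaved rule order, while both rule orders sort the same input identically.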

Cheers,
Carlos.