tr A-Z a-z in locales other than C


Jilles Tjoelker
A few years ago, when locale support was added to the tr utility,
character ranges (except ones containing one or two octal escapes) were
changed to use the collation order instead of the character code order.
At the time, this matched other implementations of tr and was apparently
somewhat generally accepted.

However, this behaviour is not intuitive, it is not portable because it
depends heavily on the collation order, and it is very hard to find a
practical use for it. Perhaps there is a use case in EBCDIC locales that
only contain the 2*26 basic Latin letters, but that is rather exotic.

The command tr A-Z a-z may do something unexpected even if there is a
1:1 mapping between upper and lower case, since it also assumes that 'z'
is the last letter.

This is not a POSIX issue as POSIX leaves character ranges in tr
unspecified for locales other than the POSIX locale (except for ranges
containing octal escapes).

If there is no reason to keep using the collation order, I would like to
change tr's character ranges back to character codes. GNU tr does this
and many ports wrongly take advantage of it, so following it will reduce
the need to patch ports.
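
As a rough illustration of what a code-order range means in practice, the
sketch below pins the locale to C, where POSIX defines the range, so that
A-Z covers exactly the 26 ASCII capitals regardless of the caller's
collation order. The helper name ascii_lower is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: lowercase only the 26 ASCII capitals.
# LC_ALL=C makes the range A-Z independent of the caller's
# collation order; all other bytes pass through unchanged.
ascii_lower() {
    printf '%s\n' "$1" | LC_ALL=C tr 'A-Z' 'a-z'
}

ascii_lower 'Hello World'   # prints: hello world
```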

The patch below demonstrates the new behaviour. The code could be
simplified further, as the flags for octal escapes are no longer needed.

The man page may need some additional change as well. In particular, the
command
  tr "[:upper:]" "[:lower:]"
in a user's locale is a good choice for text specified by the user, but
a poor choice for doing case-insensitive comparisons of constant
strings, because in Turkish locales the upper case version of 'i' is a
capital I with dot and the lower case version of 'I' is a lower case i
without dot. In such cases,
  LC_ALL=C tr "[:upper:]" "[:lower:]"
may be a better option (A-Z a-z could be used at the cost of breaking
EBCDIC support).
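
A minimal sketch of the constant-string case, assuming the strings being
compared are plain ASCII keywords. The helper name same_keyword is
hypothetical. Both sides are folded in the C locale, so 'I' always maps
to 'i' even when the user's locale is Turkish.

```shell
#!/bin/sh
# Hypothetical helper: compare two ASCII keywords ignoring case.
# Case folding is done in the C locale, not in the user's locale.
same_keyword() {
    a=$(printf '%s' "$1" | LC_ALL=C tr '[:upper:]' '[:lower:]')
    b=$(printf '%s' "$2" | LC_ALL=C tr '[:upper:]' '[:lower:]')
    [ "$a" = "$b" ]
}

if same_keyword 'INCLUDE' 'include'; then
    echo 'match'
fi
```

For text supplied by the user, the LC_ALL=C prefix would be dropped so
that the user's own case rules apply.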

There is a related issue with ranges in regular expressions, glob and
fnmatch (likewise unspecified by POSIX outside the POSIX locale), but
this is less likely to cause problems.


Index: usr.bin/tr/tr.1
===================================================================
--- usr.bin/tr/tr.1 (revision 222648)
+++ usr.bin/tr/tr.1 (working copy)
@@ -31,7 +31,7 @@
 .\"     @(#)tr.1 8.1 (Berkeley) 6/6/93
 .\" $FreeBSD$
 .\"
-.Dd October 13, 2006
+.Dd June 6, 2011
 .Dt TR 1
 .Os
 .Sh NAME
@@ -158,12 +158,7 @@
 .Pp
 A backslash followed by any other character maps to that character.
 .It c-c
-For non-octal range endpoints
-represents the range of characters between the range endpoints, inclusive,
-in ascending order,
-as defined by the collation sequence.
-If either or both of the range endpoints are octal sequences, it
-represents the range of specific coded values between the
+A range represents the range of specific coded values between the
 range endpoints, inclusive.
 .Pp
 .Bf Em
@@ -309,20 +304,18 @@
 .Pp
 .Dl "tr \*q[=e=]\*q \*qe\*q"
 .Sh COMPATIBILITY
-Previous
-.Fx
-implementations of
-.Nm
-did not order characters in range expressions according to the current
-locale's collation order, making it possible to convert unaccented Latin
+Some implementations of
+.Nm ,
+including the ones in previous versions of
+.Fx ,
+order characters in range expressions according to the current
+locale's collation order, making it impossible to convert unaccented Latin
 characters (esp.\& as found in English text) from upper to lower case using
 the traditional
 .Ux
 idiom of
 .Dq Li "tr A-Z a-z" .
-Since
-.Nm
-now obeys the locale's collation order, this idiom may not produce
+In such implementations, this idiom may not produce
 correct results when there is not a 1:1 mapping between lower and
 upper case, or when the order of characters within the two cases differs.
 As noted in the
Index: usr.bin/tr/str.c
===================================================================
--- usr.bin/tr/str.c (revision 222648)
+++ usr.bin/tr/str.c (working copy)
@@ -260,37 +260,13 @@
 		stopval = wc;
 		s->str += clen;
 	}
-	/*
-	 * XXX Characters are not ordered according to collating sequence in
-	 * multibyte locales.
-	 */
-	if (octal || was_octal || MB_CUR_MAX > 1) {
-		if (stopval < s->lastch) {
-			s->str = savestart;
-			return (0);
-		}
-		s->cnt = stopval - s->lastch + 1;
-		s->state = RANGE;
-		--s->lastch;
-		return (1);
-	}
-	if (charcoll((const void *)&stopval, (const void *)&(s->lastch)) < 0) {
+	if (stopval < s->lastch) {
 		s->str = savestart;
 		return (0);
 	}
-	if ((s->set = p = malloc((NCHARS_SB + 1) * sizeof(int))) == NULL)
-		err(1, "genrange() malloc");
-	for (cnt = 0; cnt < NCHARS_SB; cnt++)
-		if (charcoll((const void *)&cnt, (const void *)&(s->lastch)) >= 0 &&
-		    charcoll((const void *)&cnt, (const void *)&stopval) <= 0)
-			*p++ = cnt;
-	*p = OOBCH;
-	n = p - s->set;
-
-	s->cnt = 0;
-	s->state = SET;
-	if (n > 1)
-		mergesort(s->set, n, sizeof(*(s->set)), charcoll);
+	s->cnt = stopval - s->lastch + 1;
+	s->state = RANGE;
+	--s->lastch;
 	return (1);
 }
 

--
Jilles Tjoelker
_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-i18n
To unsubscribe, send any mail to "[hidden email]"

Re: tr A-Z a-z in locales other than C

Andrey Chernov
On Tue, Jun 07, 2011 at 12:41:05AM +0200, Jilles Tjoelker wrote:
>
> There is a related issue with ranges in regular expressions, glob and
> fnmatch (likewise unspecified by POSIX outside the POSIX locale), but
> this is less likely to cause problems.
>

You care about ports, but the suggested change is Americano-centrism that
kills tr usage for national-language documents, because it becomes
impossible to specify a whole national alphabet easily, by just two letters.

Moreover, having differently treated regex ranges in tr vs other places
you mention will produce additional chaos.

Back to the ports: it is not hard to run _any_ port's make or configure
with LANG=C directly by the ports Mk system to eliminate that problem.

--
http://ache.vniz.net/
_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-i18n
To unsubscribe, send any mail to "[hidden email]"
Reply | Threaded
Open this post in threaded view
|

Re: tr A-Z a-z in locales other than C

Jilles Tjoelker
On Tue, Jun 07, 2011 at 04:24:43AM +0400, Andrey Chernov wrote:
> On Tue, Jun 07, 2011 at 12:41:05AM +0200, Jilles Tjoelker wrote:

> > There is a related issue with ranges in regular expressions, glob and
> > fnmatch (likewise unspecified by POSIX outside the POSIX locale), but
> > this is less likely to cause problems.

> You care about ports, but the suggested change is Americano-centrism that
> kills tr usage for national-language documents, because it becomes
> impossible to specify a whole national alphabet easily, by just two letters.

Hmm, so that's with translation to a constant, or with the -d and/or -s
options. In such cases, there may be a range for all letters with
collation order, but not with codeset order (mainly if "all letters"
includes letters with diacritical marks).

In FreeBSD, upper case sorts before lower case, so cases can be
distinguished this way but all letters may require two ranges. In most
other operating systems the cases go together so a single range is
sufficient, but cases cannot be distinguished. Making such things work
on multiple operating systems requires careful testing.
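
For the delete and squeeze cases, a character class sidesteps the range
question, since class membership comes from LC_CTYPE rather than from
collation or codeset order. A small sketch follows; the helper name
strip_letters is hypothetical, and LC_ALL=C is set here only to make the
output predictable. It would be dropped to honour the user's locale, and
results in multibyte locales depend on the tr implementation.

```shell
#!/bin/sh
# Hypothetical helper: delete every letter from the argument.
# [:alpha:] names the locale's letters without relying on a range.
strip_letters() {
    printf '%s\n' "$1" | LC_ALL=C tr -d '[:alpha:]'
}

strip_letters 'a1b2c3'   # prints: 123
```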

> Moreover, having differently treated regex ranges in tr vs other places
> you mention will produce additional chaos.

I think this is already inconsistent because some programs do not enable
locale or use different locale code.

With UTF-8 or other multibyte character sets, this is even more so
because functions like isalpha work very poorly by definition and there
is no collation support for such character sets in FreeBSD.

> Back to the ports: it is not hard to run _any_ port's make or configure
> with LANG=C directly by the ports Mk system to eliminate that problem.

True, but some ports install scripts with problematic tr calls.

--
Jilles Tjoelker
_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-i18n
To unsubscribe, send any mail to "[hidden email]"
Reply | Threaded
Open this post in threaded view
|

Re: tr A-Z a-z in locales other than C

Atom Smasher
In reply to this post by Jilles Tjoelker
the man page makes it clear...

      Translate the contents of file1 to upper-case.

            tr "[:lower:]" "[:upper:]" < file1

      (This should be preferred over the traditional UNIX idiom of ``tr a-z
  A-Z'', since it works correctly in all locales.)


for any other uses, either build the port with locale specified as "C" as
mentioned, or patch the port so:
  tr '[a-z]' '[A-Z]'
  becomes:
  env LC_ALL=C tr '[a-z]' '[A-Z]'
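
A sketch of what the patched call does, wrapped in a made-up helper named
upper_c. In a POSIX tr the square brackets are ordinary characters, so
each bracket is simply mapped to itself and only the letters change.

```shell
#!/bin/sh
# Hypothetical helper wrapping the patched call shown above.
# '[' and ']' map to themselves; a-z maps to A-Z by coded value.
upper_c() {
    printf '%s\n' "$1" | env LC_ALL=C tr '[a-z]' '[A-Z]'
}

upper_c 'freebsd [ports]'   # prints: FREEBSD [PORTS]
```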

the only change that would be appropriate to the tr utility would be a
command-line option to select a locale... something like:
  tr -l C '[a-z]' '[A-Z]'

i don't think anyone would object to that, but it would still require
patching some ports under some locales...

maybe another option would be modifying tr to recognize other [new]
environment variables... TR_LANG, TR_LC_ALL, TR_LC_CTYPE and
TR_LC_COLLATE. done that way, things could be set in /etc/make.conf (or
sys.mk), not need any patching, and not interfere with other uses of
locale.


--
         ...atom

  ________________________
  http://atom.smasher.org/
  762A 3B98 A3C3 96C9 C6B7 582A B88D 52E4 D9F5 7808
  -------------------------------------------------

  "We in the West must bear in mind that the poor countries
  are poor primarily because we have exploited them through
  political or economic colonialism."
  -- Martin Luther King, Jr


Re: tr A-Z a-z in locales other than C

Jilles Tjoelker
On Wed, Jun 08, 2011 at 09:56:39AM +1200, Atom Smasher wrote:
> the man page makes it clear...

>       Translate the contents of file1 to upper-case.

>             tr "[:lower:]" "[:upper:]" < file1

>       (This should be preferred over the traditional UNIX idiom of ``tr a-z
>   A-Z'', since it works correctly in all locales.)

> for any other uses, either build the port with locale specified as "C" as
> mentioned, or patch the port so:
>   tr '[a-z]' '[A-Z]'
>   becomes:
>   env LC_ALL=C tr '[a-z]' '[A-Z]'

> the only change that would be appropriate to the tr utility would be a
> command-line option to select a locale... something like:
>   tr -l C '[a-z]' '[A-Z]'

> i don't think anyone would object to that, but it would still require
> patching some ports under some locales...

That new option would provide zero benefit. If things are going to be
patched anyway then patch them to be standards compliant.

> maybe another option would be modifying tr to recognize other [new]
> environment variables... TR_LANG, TR_LC_ALL, TR_LC_CTYPE and
> TR_LC_COLLATE. done that way, things could be set in /etc/make.conf (or
> sys.mk), not need any patching, and not interfere with other uses of
> locale.

That would be rather ugly.

If  tr a-z A-Z  is supposed to be deceiving in some locales, then let it
remain so unconditionally.

--
Jilles Tjoelker

Re: tr A-Z a-z in locales other than C

Atom Smasher
On Wed, 8 Jun 2011, Jilles Tjoelker wrote:

>> maybe another option would be modifying tr to recognize other [new]
>> environment variables... TR_LANG, TR_LC_ALL, TR_LC_CTYPE and
>> TR_LC_COLLATE. done that way, things could be set in /etc/make.conf (or
>> sys.mk), not need any patching, and not interfere with other uses of
>> locale.
>
> That would be rather ugly.
>
> If tr a-z A-Z is supposed to be deceiving in some locales, then let it
> remain so unconditionally.
=================

it can still be as ugly as one wants it to be, and in some ports that
might be fine. but this approach would provide a very simple way to rein
in how ugly it is.


--
         ...atom

  ________________________
  http://atom.smasher.org/
  762A 3B98 A3C3 96C9 C6B7 582A B88D 52E4 D9F5 7808
  -------------------------------------------------

  "The livestock sector is a major player [in climate
  change], responsible for 18% of greenhouse gas
  emissions measured in CO2 equivalent. This is a higher
  share than transport."
  -- Livestock's long shadow, 2006
  UN report sponsored by WTO, EU, AS-AID, FAO, et al


Re: tr A-Z a-z in locales other than C

Andrey Chernov
In reply to this post by Jilles Tjoelker
On Tue, Jun 07, 2011 at 11:17:12PM +0200, Jilles Tjoelker wrote:
> In FreeBSD, upper case sorts before lower case, so cases can be
> distinguished this way but all letters may require two ranges. In most
> other operating systems the cases go together so a single range is
> sufficient, but cases cannot be distinguished. Making such things work
> on multiple operating systems requires careful testing.

Such a thing can't work consistently on multiple operating systems by
definition, because POSIX states "unspecified" here. So the best we can do
is to concentrate on our system. No program should rely on that until POSIX
defines it somehow.

> > Moreover, having differently treated regex ranges in tr vs other places
> > you mention will produce additional chaos.
>
> I think this is already inconsistent because some programs do not enable
> locale or use different locale code.

That is what I am saying: producing additional chaos is not the same as
bringing chaos from nowhere.
AFAIK nobody uses different locale code, but different regex
implementations are common.

> > Back to the ports: it is not hard to run _any_ port's make or configure
> > with LANG=C directly by the ports Mk system to eliminate that problem.
>
> True, but some ports install scripts with problematic tr calls.

What do the numbers say: how many ports do that?

To summarize, I suggest considering two models:
1) Developers, programmers, etc.: code-order ranges in tr are good for them.
2) End users working with national-language documents: code-order ranges
in tr are bad for them.

Sacrificing model 2) for 1) is not what we need if the number of such ports
is low. If the number of such ports is significant, we can consider
additional options, like automatically searching for and replacing such tr
calls through pkg-plist (we already do similar scanning for security reasons).

--
http://ache.vniz.net/

Re: tr A-Z a-z in locales other than C

Perry Hutchison
In reply to this post by Jilles Tjoelker
Jilles Tjoelker <[hidden email]> wrote:

> On Tue, Jun 07, 2011 at 04:24:43AM +0400, Andrey Chernov wrote:
...
> > Back to the ports: it is not hard to run _any_ port's make
> > or configure with LANG=C directly by the ports Mk system to
> > eliminate that problem.
>
> True, but some ports install scripts with problematic tr calls.

So part of the porting effort may be to provide a patch that
prepends something along the lines of "env LANG=C" to tr calls in
those scripts.  It would surely not be the only kind of situation
in which a port needed to patch the ported code to get it to run
correctly on FreeBSD :)
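
One detail worth keeping in mind with the "env LANG=C" approach: under
the POSIX precedence rules, a non-empty LC_ALL overrides the per-category
variables, which in turn override LANG. So LANG=C only helps when the
user has not set LC_ALL or the relevant LC_* variable. The sketch below
merely models that lookup order for the collation category. The function
name and its argument convention are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: report which setting decides the collation
# locale, given the values of LC_ALL, LC_COLLATE and LANG as
# arguments 1 to 3 (an empty string counts as unset).
collate_source() {
    if [ -n "$1" ]; then
        echo "LC_ALL=$1"
    elif [ -n "$2" ]; then
        echo "LC_COLLATE=$2"
    elif [ -n "$3" ]; then
        echo "LANG=$3"
    else
        echo "implementation default"
    fi
}

# A user with LC_ALL set keeps that locale even under "env LANG=C":
collate_source 'ru_RU.KOI8-R' '' 'C'   # prints: LC_ALL=ru_RU.KOI8-R
# With only LANG in play, LANG=C is enough:
collate_source '' '' 'C'               # prints: LANG=C
```

This is presumably why the patches suggested earlier in the thread use
LC_ALL=C rather than LANG=C.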