Question about ASCII and nl_langinfo (locale work)

Question about ASCII and nl_langinfo (locale work)

Baptiste Daroussin-2
Hi all,

When the new collation code was merged, the locales were reworked.

ache@ raised a good point about the C and POSIX locales and, by extension, the
US-ASCII locales: should we take this opportunity to change what they report?

First, a description of the situation: the output of nl_langinfo() is not
standardised, so each OS can return whatever encoding name it wants. While it is
pretty obvious what should be returned for regular encodings (iso-8859* or
UTF-8), it is less obvious for the C and POSIX locales, where FreeBSD used to
return US-ASCII (and does so again as of today).

Lots of third-party applications (python, perl, tcl, etc.) try to figure out the
encoding by matching against a table of "known" outputs of nl_langinfo().

The thing is, not all of them are aware that FreeBSD uses US-ASCII; tcl, for
example, is not, which means tcl is not able to determine what encoding is
needed for the C and POSIX locales.
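
To make the failure mode concrete, here is a minimal sketch of such a lookup
table. The names KNOWN_CODESETS and guess_encoding, the table entries and the
fallback value are all made up for illustration; this is not the code of any
of the applications mentioned.

```python
# Hypothetical table of "known" nl_langinfo(CODESET) strings, mapped to the
# encoding names an application uses internally. Entries are illustrative.
KNOWN_CODESETS = {
    "ANSI_X3.4-1968": "ascii",   # what Linux reports for the C/POSIX locales
    "UTF-8": "utf-8",
    "ISO-8859-1": "iso8859-1",
}


def guess_encoding(codeset, fallback="iso8859-1"):
    """Map a codeset string to an internal encoding name.

    A string missing from the table silently yields the fallback (an assumed
    default here), which is how an application that was never taught about
    "US-ASCII" ends up with the wrong encoding.
    """
    return KNOWN_CODESETS.get(codeset, fallback)
```

With this table, "ANSI_X3.4-1968" resolves to "ascii", while "US-ASCII" is not
recognised and ends up on the fallback.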

Linux returns ANSI_X3.4-1968 (also known as US-ASCII), and most
applications know what Linux returns.

That means we have to keep teaching every upstream about US-ASCII.

The proposals are:
- Do not change what we have always done.
- Change it to something that makes sense, such as "C" (we tried "POSIX", which
  was a very bad idea, but "C" seems to be commonly recognised by applications
  as ASCII)
- Let's report the same as Linux; that will simplify portability
- Let's be obvious and report ASCII (also commonly recognised by applications)

The next question is: if we change the above, would it make sense to also report
ASCII for the ASCII locales:
- en_AU.US-ASCII
- en_CA.US-ASCII
- en_GB.US-ASCII
- en_NZ.US-ASCII
- en_US.US-ASCII
- en_ZA.US-ASCII

That would require some work. Or should we make them return ASCII, or even
ANSI_X3.4-1968?

Please share your opinion here.

Best regards,
Bapt

Re: Question about ASCII and nl_langinfo (locale work)

Andrey Chernov
On 11.11.2015 1:26, Baptiste Daroussin wrote:

> The thing is not all are aware that FreeBSD uses US-ASCII, for example tcl does
> not. which means tcl is not able to determine what encoding is needed for the C
> and POSIX locales.
>
> On Linux they to return ANSI_X3.4-1968 (also known as US-ASCII) and most
> application knows what linux returns.
>
> That means we need to teach all upstream about US-ASCII all the time.
>
> The proposals are:
> - Do not change what we have always done.
> - Change it to something that makes sense "C" (what we tried with "POSIX" which
>   was a very bad idea, but "C" seems to be commonly recognised by application as
>   ASCII)
> - Let's report the same as Linux, that will simplify portability
> - Let's be obvious and report ASCII (also commonly recognised by applications)
Just repeating my opinion in this new thread.

Since POSIX doesn't specify anything definite, we should be Linux-compatible
here to cause fewer surprises, i.e.:
1) Return "ANSI_X3.4-1968" for C/POSIX locale (was "US-ASCII").
2) Return "ASCII" for *.US-ASCII locales (was "US-ASCII").
A typical Linux program knows nothing about our "US-ASCII", and porting
rarely handles it.

Not doing that leads to hidden, hard-to-find bugs like the one still present
right now in our tcl ports. For all these years tcl has not understood the
FreeBSD-native nl_langinfo() result "US-ASCII" and has fallen back to
"iso8859-1" (it understands the Linux "ANSI_X3.4-1968" and "ASCII", of course).

--
http://ache.vniz.net/


Re: Question about ASCII and nl_langinfo (locale work)

John Marino-7
On 11/11/2015 5:59 PM, Andrey Chernov wrote:

> On 11.11.2015 1:26, Baptiste Daroussin wrote:
>> The thing is not all are aware that FreeBSD uses US-ASCII, for example tcl does
>> not. which means tcl is not able to determine what encoding is needed for the C
>> and POSIX locales.
>>
>> On Linux they to return ANSI_X3.4-1968 (also known as US-ASCII) and most
>> application knows what linux returns.
>>
>> That means we need to teach all upstream about US-ASCII all the time.
>>
>> The proposals are:
>> - Do not change what we have always done.
>> - Change it to something that makes sense "C" (what we tried with "POSIX" which
>>   was a very bad idea, but "C" seems to be commonly recognised by application as
>>   ASCII)
>> - Let's report the same as Linux, that will simplify portability
>> - Let's be obvious and report ASCII (also commonly recognised by applications)
>
> Just repeating my opinion in this new thread.
>
> Since POSIX don't tell anything certain, we should be Linux compatible
> here to have less surprise, i.e.:
> 1) Return "ANSI_X3.4-1968" for C/POSIX locale (was "US-ASCII").
> 2) Return "ASCII" for *.US-ASCII locales (was "US-ASCII").
> Typical Linux program knows nothing about our "US-ASCII", and porting
> handles it rarely.
>
> Not doing that leads to hidden, hard to find bugs like still present
> right now in our tcl ports. For all that years tcl don't understand
> FreeBSD-native nl_langinfo() "US-ASCII" and falls back to "iso8859-1"
> (it understands Linux "ANSI_X3.4-1968" and "ASCII" of course).
>

As a DragonFly representative (and probably the person that would
implement it), I can accept Andrey's proposal.

What it would mean:
1) "ANSI_X3.4-1968" would be the one return value of
nl_langinfo(CODESET) that is not in the output of "locale -m"

2) This would require an alteration to usr.bin/locale to add this
"ANSI_X3.4-1968" if not found (similar to how it's done for US-ASCII)

3) At the same time usr.bin/locale would be modified to change the check
from "US-ASCII" to "ASCII"

4) The locale tools would have to be modified to change all source and
map references from "US-ASCII" to "ASCII", and the six LC* generating
makefiles would have to be regenerated

5) nl_langinfo would be changed to return "ANSI_X3.4-1968" instead of
"US-ASCII" if the encoding equals "NONE"

6) the "make upgrade" utility would need to remove *.US-ASCII locales

7) Do we really need 6 ".ASCII" locales?  They have very limited use; I'd
suggest just having "en_US.ASCII" and that's it.  Dump en_AU, en_ZA,
en_GB, etc.  We can keep all 6 if we want, but if we are removing
US-ASCII anyway, we should limit the locales to what makes sense.
Alternatively FreeBSD could link US-ASCII => ASCII and have both
variations but I think DragonFly will just drop US-ASCII in this case.

What nl_langinfo(CODESET) returns has to be reflected in the locale name
(with the exception of "ANSI_X3.4-1968") so there has to be e.g.
en_US.ASCII as a valid locale if US-ASCII is changed.
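
As a rough illustration of item 5 and of the naming rule just stated, here is
a small sketch. The function names are hypothetical and the encoding is passed
in as a plain string; this models the rule only and is not the libc code.

```python
def reported_codeset(encoding):
    """Item 5: an encoding of "NONE" (C/POSIX) is reported as ANSI_X3.4-1968."""
    return "ANSI_X3.4-1968" if encoding == "NONE" else encoding


def name_matches_codeset(locale_name, encoding):
    """Check that the reported codeset is reflected in the locale name."""
    codeset = reported_codeset(encoding)
    if codeset == "ANSI_X3.4-1968":
        return True  # the one exception: C/POSIX carry no codeset suffix
    # Everything after the first "." in the locale name must be the codeset.
    _, _, suffix = locale_name.partition(".")
    return suffix == codeset
```

Under this rule "en_US.ASCII" is consistent with a reported codeset of "ASCII",
while "en_US.US-ASCII" is not, which is why the locales would have to be
renamed (or symlinked) if the reported string changes.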

There might be other changes necessary if "US-ASCII" is changed; I'd
have to do a thorough review.

To get started, I think this needs to be decided:
A) Confirm we want locale -m and nl_langinfo(CODESET) to return
"ANSI_X3.4-1968" for C/POSIX locales
B) Confirm renaming US-ASCII locales to ASCII
C) (FreeBSD only) Decide if you want to conserve US-ASCII locales with
symlinks.  nl_langinfo(CODESET) will return "ASCII" for these symlinked
locales
D) Decide which "ASCII" locales are really needed.  (I suggest one,
en_US.ASCII)

Thanks,
John

_______________________________________________
[hidden email] mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-arch
To unsubscribe, send any mail to "[hidden email]"

Re: Question about ASCII and nl_langinfo (locale work)

Ed Schouten-6
In reply to this post by Baptiste Daroussin-2
Hi Baptiste,

I personally think it would be a shame if we were to deviate from returning
"US-ASCII", for the reason that "US-ASCII" also happens to be the
preferred MIME name for the character set:

http://www.iana.org/assignments/character-sets/character-sets.xhtml

"ASCII" doesn't even seem to be an alias for this character set.
Though "ANSI_X3.4-1968" is an alias for ASCII, I wouldn't even know
that this is ASCII without doing a Google search.

In my opinion a decent implementation of newlocale() should support
any of the character set names and aliases provided on the IANA page,
but let nl_langinfo(CODESET) return the preferred MIME name.
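
A minimal sketch of that idea, assuming a hand-typed excerpt of the alias
table (a real one would be generated from the IANA registry) and a
hypothetical helper name:

```python
# Excerpt only: a few names and aliases mapped to the preferred MIME name.
# Keys are lower-cased so the lookup can be case-insensitive.
PREFERRED_NAME = {
    "us-ascii": "US-ASCII",
    "ansi_x3.4-1968": "US-ASCII",
    "iso646-us": "US-ASCII",
    "cp367": "US-ASCII",
    "utf-8": "UTF-8",
}


def canonical_codeset(name):
    """Return the preferred name for a charset name or alias, or None."""
    return PREFERRED_NAME.get(name.lower())
```

Any accepted alias would be usable on input, while the reported codeset would
always be the preferred name; a bare "ASCII" is deliberately left out of the
excerpt, matching the observation above that it is not a registered alias.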

> That means we need to teach all upstream about US-ASCII all the time.

Could you come up with a concrete list of pieces of software that need
to be changed? Is it just those three pieces of software that you
mentioned above? If so, then I think it would be a shame to make the
concession.

--
Ed Schouten <[hidden email]>
Nuxi, 's-Hertogenbosch, the Netherlands
KvK-nr.: 62051717

Re: Question about ASCII and nl_langinfo (locale work)

Andrey Chernov
On 16.11.2015 20:35, Ed Schouten wrote:
> I personally think it's a shame if we were to deviate from returning
> "US-ASCII", for the reason that "US-ASCII" also happens to be the
> preferred MIME name for the character set:
>
> http://www.iana.org/assignments/character-sets/character-sets.xhtml
>
> "ASCII" doesn't even seem to be an alias for this character set.

Yes, I overlooked it somehow. ASCII is not in the IANA registry, while both
ANSI_X3.4-1968 and US-ASCII are.

So I am revising the proposal. We can return ANSI_X3.4-1968 for POSIX/C
(for Linux compatibility reasons) and leave plain US-ASCII as it was
(since it is rarely used).

> In my opinion a decent implementation of newlocale() should support
> any of the character set names and aliases provided on the IANA page,
> but let nl_langinfo(CODESET) return the preferred MIME name.

BTW, we already have, and historically return, non-IANA codesets (inspired
by X11). I.e. we have ISO8859-* instead of the preferred names ISO-8859-*;
moreover, the ISO8859-* names are not even aliases (!), and IANA knows nothing
about them. Linux has the IANA preferred names here, i.e. ISO-8859-*.

So the question is: should we rename ISO8859-* to ISO-8859-* to be IANA
and Linux compatible?

We can strip the first (or all) "_" and "-" from the names taken from the
environment (as Linux does), so as not to violate POLA.

>> That means we need to teach all upstream about US-ASCII all the time.
>
> Could you come up with a concrete list of pieces of software that need
> to be changed? Is it just those three pieces of software that you
> mentioned above? If so, then I think it would be a shame to make the
> concession.

No, I have seen such checks many times in other programs too; tcl is just one
that can be found quickly. The proper way to examine the situation
would be to unpack _all_ ports and search through the code, but my
machine can't handle it.

--
http://ache.vniz.net/

Re: Question about ASCII and nl_langinfo (locale work)

Baptiste Daroussin-2
On Mon, Nov 16, 2015 at 10:00:29PM +0300, Andrey Chernov wrote:

> On 16.11.2015 20:35, Ed Schouten wrote:
> > I personally think it's a shame if we were to deviate from returning
> > "US-ASCII", for the reason that "US-ASCII" also happens to be the
> > preferred MIME name for the character set:
> >
> > http://www.iana.org/assignments/character-sets/character-sets.xhtml
> >
> > "ASCII" doesn't even seem to be an alias for this character set.
>
> Yes, I overlook it somehow. ASCII is not in the IANA, while both
> ANSI_X3.4-1968 and US-ASCII are.
>
> So, I reconsider the proposal. We can return ANSI_X3.4-1968 for POSIX/C
> (for Linux compatibility reasons) and left pure US-ASCII as it was
> (since it is used rarely).
To tell the truth, the locale changes I made were painful enough (mostly my
fault), and (for now) I won't do any further work besides fixing the fallout,
if any is left. But I do support this proposal!

>
> > In my opinion a decent implementation of newlocale() should support
> > any of the character set names and aliases provided on the IANA page,
> > but let nl_langinfo(CODESET) return the preferred MIME name.
>
> BTW, we already have and return non-IANA codesets historically (inspired
> by X11). I.e. we have ISO8859-* instead of preferred names ISO-8859-*,
> moreover, ISO8859-* even not the aliases (!) and IANA knows nothing
> about them. Linux have IANA preferred names here, i.e. ISO-8859-*.
>
> So the question is: should we rename ISO8859-* to ISO-8859-* to be IANA
> and Linux compatible?
>
> We can strip first (or all) "_" and "-" from the environment names (as
> Linux does), to not violate POLA.
I would like to see that as well; lots of newcomers I have seen set up the
locales the IANA way and are unhappy because that does not work. The first plan
in the collation branch was to introduce the IANA syntax via an alias, but in
the end I removed it because there were already too many changes.

If anyone wants to go further with locale changes like the above proposal,
please proceed.

Best regards,
Bapt

Re: Question about ASCII and nl_langinfo (locale work)

Andrey Chernov
On 17.11.2015 0:06, Baptiste Daroussin wrote:
> locales the IANA way and are unhappy because that does not work. The first plan
> in the collation branch was to introduce the IANA syntax via an alias but in the
> end I removed it, because there was already to many changes.

For the ISO case we don't need aliases and can keep our internal naming
hierarchy, honoring POLA. All we need is:
1) Convert "ISO-" and "ISO_" to "ISO" for setlocale(3) input.
2) Convert from "ISO" to "ISO-" for setlocale(3), nl_langinfo(3) and
locale(1) output.
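
The two conversions can be sketched as follows. This only models the string
handling; the function names are hypothetical, exact upper-case "ISO" is
assumed, and the output helper takes a bare codeset (a full locale name as
returned by setlocale(3) would need the same rewrite applied after the ".").

```python
import re


def normalize_input(locale_name):
    """Step 1: accept ".ISO-8859-x" and ".ISO_8859-x" as internal ".ISO8859-x"."""
    return re.sub(r"\.ISO[-_]", ".ISO", locale_name)


def format_output(codeset):
    """Step 2: report the internal "ISO8859-x" codeset as "ISO-8859-x"."""
    if codeset.startswith("ISO") and not codeset.startswith(("ISO-", "ISO_")):
        return "ISO-" + codeset[3:]
    return codeset
```

With these helpers, "en_US.ISO-8859-1" and "en_US.ISO_8859-1" both map to the
existing internal "en_US.ISO8859-1", and the internal codeset "ISO8859-15" is
reported as "ISO-8859-15"; names without an ISO prefix pass through unchanged.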

--
http://ache.vniz.net/


Re: Question about ASCII and nl_langinfo (locale work)

John Marino-7
On 11/16/2015 10:51 PM, Andrey Chernov wrote:

> On 17.11.2015 0:06, Baptiste Daroussin wrote:
>> locales the IANA way and are unhappy because that does not work. The
>> first plan
>> in the collation branch was to introduce the IANA syntax via an alias
>> but in the
>> end I removed it, because there was already to many changes.
>
> For ISO case we don't need aliases and can keep our internal names
> hierarchy honoring POLA. All we need is:
> 1) Convert "ISO-" and "ISO_" to "ISO" for setlocale(3) input.
> 2) Convert from "ISO" to "ISO-" for setlocale(3), nl_langinfo(3) and
> locale(1) output.

A huge patch just went into GCC libstdc++ testsuite to change all the
locale names to "ISO8859-" because it works for both Linux and *BSD.

This is change for change's sake.

locale -m lists the encodings.
locale -a lists the available locales.

This is true on Linux as well.
Nobody is getting POLA'D here.

Moreover, there is significant work to implement this.  We brought up
the possibility of hyphen- and case-sensitivity on DragonFly and the
idea was shot down.  The reasons were solid enough.

There is no standard for encoding names, period.  Using one source is as
valid as another.  I say leave it alone.

John

Re: Question about ASCII and nl_langinfo (locale work)

Enji Cooper

> On Nov 17, 2015, at 00:22, John Marino (FreeBSD) <[hidden email]> wrote:
>
>> On 11/16/2015 10:51 PM, Andrey Chernov wrote:
>>> On 17.11.2015 0:06, Baptiste Daroussin wrote:
>>> locales the IANA way and are unhappy because that does not work. The first plan
>>> in the collation branch was to introduce the IANA syntax via an alias but in the
>>> end I removed it, because there was already to many changes.
>> For ISO case we don't need aliases and can keep our internal names
>> hierarchy honoring POLA. All we need is:
>> 1) Convert "ISO-" and "ISO_" to "ISO" for setlocale(3) input.
>> 2) Convert from "ISO" to "ISO-" for setlocale(3), nl_langinfo(3) and
>> locale(1) output.
>
> A huge patch just went into GCC libstdc++ testsuite to change all the
> locale names to "ISO8859-" because it works for both Linux and *BSD.
>
> This is a change for changes sake.
>
> Locale -m lists the encodings.
> Locale -a lists the available locales
>
> This is true on Linux as well.
> Nobody is getting POLA'D here.
>
> Moveover, there is significant work to implement this.  We brought up
> the possibility of hyphen- and case- sensitivity on DragonFly and the
> idea was shot down.  The reasons were solid enough.
>
> There is no standard for encoding, period.  Using one source is as valid
> another another.  I say leave it alone.

Windows is probably the closest thing to a standard here. What does it use -- dashes or underscores?
Thanks,
-NGie