Re: login.conf --> UTF-8

Andrey Chernov
On 04.04.2014 16:46, Gleb Smirnoff wrote:

> On Thu, Apr 03, 2014 at 01:34:33AM +0400, Andrey Chernov wrote:
> A> On 02.04.2014 21:15, Gleb Smirnoff wrote:
> A> > S> + :lang=en_US.UTF-8:\
> A> > S> + :charset=UTF-8:
> A> >
> A> > And I'd like to do same change for the 'russian' login class
> A> > in /etc/login.conf.
> A>
> A> Please everybody remember that we don't have UTF-8 collation
> A> implemented, just fallback to bytecode comparison.
>
> Any objections on checking in FreeBSD-compatible[1] UTF-8 collation
> implementation from Alex Tutubalin?
>
> http://blog.lexa.ru/2008/03/03/freebsd_utf8_russian_collate_vtoraja_popitka.html
>

Even his "version 2" draws objections from me. I already replied to Alex
about this in 2008. In short:
1) There is an error: almost all single chars above ASCII should be
"chains", i.e. two bytes minimum, since there is almost no intersection
with ISO8859-1 as a UTF-8 subset.
2) The table itself is very incomplete, e.g. it covers neither the whole
of KOI8-R, nor ISO8859-5, nor CP866. It is made from CP1251, with all its
restrictions. So switching from, e.g., KOI8-R to UTF-8 will cause a
sorting regression. A Russian UTF-8 collation should be able to sort all
the major Russian charsets mentioned, i.e. we need a combined table.
3) The "charmap map.ISO8859-1" declaration is missing (needed mainly for
using the mnemonic names of pure ASCII chars).
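
To make the "chains" wording in point 1 concrete: a Cyrillic letter
occupies one byte in each legacy Russian charset but two bytes in UTF-8,
so a single-byte collation table can only describe it as a multi-byte
chain. A minimal Python sketch (the codec names are Python's, not
FreeBSD locale names, and the helper name is made up for illustration):

```python
# Codec names as spelled by Python; they stand in for the FreeBSD charsets.
LEGACY = ["koi8_r", "cp1251", "cp866", "iso8859_5"]

def encoded_lengths(ch):
    """Return the encoded length of one character in each charset."""
    lengths = {codec: len(ch.encode(codec)) for codec in LEGACY}
    lengths["utf-8"] = len(ch.encode("utf-8"))
    return lengths

for ch in "Жя":
    # One byte in every legacy charset, two bytes in UTF-8.
    print(ch, encoded_lengths(ch))
```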

Even if the errors mentioned above are fixed and the code is committed
afterwards, we should understand that this approach (implementing
multibyte collation via the single-byte one), while possible, is a
big hack and slows sorting down by up to 10 times.

A proper "Unicode collation algorithm" is already implemented by ICU and
other projects. See
http://unicode.org/reports/tr10/
It would be better if someone adopted it instead of hacks.
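
For readers unfamiliar with the byte-comparison fallback mentioned at
the top of the thread, a small Python sketch shows what it does to
Russian text. Sorting UTF-8 by raw bytes is the same as sorting by code
point, which places Ё before А and ё after я, whereas the Russian
alphabet puts ё right after е. The word list is made up for
illustration, and this is only a model of the fallback, not the libc
code.

```python
def bytewise_sorted(words):
    """Sort strings by their raw UTF-8 bytes, as a plain byte comparison would."""
    return sorted(words, key=lambda w: w.encode("utf-8"))

words = ["ёж", "яма", "арбуз", "Ёлка", "Анна"]

# Ё (U+0401) lands before А (U+0410); ё (U+0451) lands after я (U+044F).
print(bytewise_sorted(words))
# ['Ёлка', 'Анна', 'арбуз', 'яма', 'ёж']
```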

--
http://ache.vniz.net/
_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-i18n
To unsubscribe, send any mail to "[hidden email]"

Sean Bruno
On Sat, 2014-04-05 at 05:35 +0400, Andrey Chernov wrote:

> Proper "Unicode collation algorithm" is already implemented by ICU and
> other projects. See
> http://unicode.org/reports/tr10/
> It will be better if someone adopt it instead of hacks.
>

If you have a different patch, I'd appreciate seeing it.  

Sean


Andrey Chernov
On 05.04.2014 6:39, Sean Bruno wrote:

> If you have a different patch, I'd appreciate seeing it.  
I don't have a different patch. If you have enough time to fix the
obstacles mentioned above, I can review yours (or somebody else's).
A "no code" situation doesn't mean wrong code can be committed. Do it
properly even when it is a hack.

--
http://ache.vniz.net/



Andrey Chernov
On 05.04.2014 5:35, Andrey Chernov wrote:
> Even his "version 2" have my objections. I already reply Alex about this
> in 2008. In short:
> 1) It is error there: almost all single chars above ASCII should be
> "chains", i.t. two bytes minimum
...

I checked my whole correspondence with Alexey and withdraw objection #1,
which was related to the
\x80;...;\xff
line in his table. While these are illegal sequences in UTF-8, our
colldef(1) wants all single-byte characters mapped.
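
The claim about \x80;...;\xff can be checked directly: no byte in that
range is a complete UTF-8 sequence on its own. A minimal Python check
(the function name is made up for illustration):

```python
def is_valid_single_byte_utf8(b):
    """True if the single byte b decodes as UTF-8 by itself."""
    try:
        bytes([b]).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Bytes 0x80..0xFF are continuation bytes, lead bytes of longer
# sequences, or never valid, so none of them stands alone.
invalid = [b for b in range(0x80, 0x100) if not is_valid_single_byte_utf8(b)]
print(len(invalid))  # 128
```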

--
http://ache.vniz.net/

Andrey Chernov
A few explanations to clarify points that may be non-obvious:

On 05.04.2014 7:35, Andrey Chernov wrote:
>>> big hack and slowing sorting down up to 10 times.

This is because our search for chains is linear: a typical single-byte
table has no more than 2-3 chains. I don't think it is worth the effort
to optimize the search here, because that effort is better spent
implementing UCA:
>>> http://unicode.org/reports/tr10/
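
A rough way to see the cost of a linear chain search is to count
comparisons. The sketch below is not the libc implementation; it is a
toy model that assumes one two-byte chain per Russian letter (66 in
total) and scans the chain list from the start at every position. All
names in it are hypothetical.

```python
def find_chain(chains, data, pos):
    """Scan the chain list linearly; return (weight, length, comparisons)."""
    comparisons = 0
    for seq, weight in chains:
        comparisons += 1
        if data.startswith(seq, pos):
            return weight, len(seq), comparisons
    return None, 1, comparisons  # no chain matched: treat as a single byte

def count_comparisons(chains, data):
    """Total chain comparisons needed to walk through data once."""
    pos = total = 0
    while pos < len(data):
        _, length, n = find_chain(chains, data, pos)
        total += n
        pos += length
    return total

# А..я (U+0410..U+044F) plus Ё and ё, each a two-byte UTF-8 chain.
letters = [chr(c) for c in range(0x410, 0x450)] + ["Ё", "ё"]
chains = [(ch.encode("utf-8"), i) for i, ch in enumerate(letters)]

# 'я' is the 64th entry, so each occurrence costs 64 comparisons.
print(count_comparisons(chains, "яяяя".encode("utf-8")))  # 256
```

With only 2-3 chains in the table the same walk costs at most 3
comparisons per position, which is why the linear search was never a
problem for single-byte charsets.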

> "No code" situation doesn't mean wrong code can be committed.

Since we plan to change the default from KOI8-R to UTF-8 (the "russian"
login class), breaking the sort order for non-alphabetic chars will
violate POLA. The sort order will be broken because only CP1251 is used
to construct Alex's "chains" collation, without merging with the KOI8-R
table.

Merging the KOI8-R collation is the absolute minimum, but a proper hack
would merge CP866 and ISO8859-5 too, as I already mentioned.
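
The coverage gap behind this can be checked mechanically. The sketch
below uses Python's codec tables as a stand-in for the FreeBSD collation
sources (an assumption; the two may differ in detail). It collects the
characters each charset assigns to bytes 0x80-0xFF and lists what a
CP1251-only table leaves out.

```python
def repertoire(codec):
    """Characters that the charset assigns to bytes 0x80..0xFF."""
    chars = set()
    for b in range(0x80, 0x100):
        try:
            chars.add(bytes([b]).decode(codec))
        except UnicodeDecodeError:
            pass  # byte is unassigned in this charset
    return chars

CHARSETS = ["koi8_r", "cp866", "iso8859_5", "cp1251"]

# The "combined table" would have to cover the union of all four.
merged = set().union(*(repertoire(c) for c in CHARSETS))

# Characters KOI8-R users can sort today but a CP1251-only table omits,
# for example the box-drawing character U+2500.
missing = repertoire("koi8_r") - repertoire("cp1251")
print(sorted(hex(ord(ch)) for ch in missing))
```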

--
http://ache.vniz.net/

