POSIX regex VS. multi-byte characters

Gabor Kovesdan-3
Hi Folks,

While working on bringing new regex code into FreeBSD, I ran into an
issue. POSIX says here:
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09

"Matching shall be based on the bit pattern used for encoding the
character, not on the graphic representation of the character. This
means that if a character set contains two or more encodings for a
graphic symbol, or if the strings searched contain text encoded in more
than one codeset, no attempt is made to search for any other
representation of the encoded symbol. If that is required, the user can
specify equivalence classes containing all variations of the desired
graphic symbol."

According to my interpretation of this text, if someone specifies as the
pattern a single byte that can be a prefix of a multi-byte character,
it shall match, since matching is based on the bit pattern, not on
semantic meaning. Besides, in a consistent environment that uses a single
encoding, and assuming a user with common sense who would not enter
meaningless input, only whole characters should occur in the pattern.
However, GNU grep has a test in its regression test suite that
contradicts this and takes the opposite approach, i.e. it shall not
match a fragment of a character. Looking at the standard, I think GNU
grep is incorrect and my interpretation is the correct one.
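
To make the two readings concrete, here is a minimal sketch. It assumes
UTF-8 and uses plain Python byte and string operations rather than any
POSIX regex implementation; the helper names byte_level_match and
char_level_match are made up for illustration and say nothing about how
GNU grep or the new FreeBSD code actually work.

```python
# Illustration only: plain Python byte/str operations, not a regex engine.
text = "na\u00efve"                       # U+00EF takes two bytes in UTF-8
haystack = text.encode("utf-8")           # b'na\xc3\xafve'
fragment = "\u00ef".encode("utf-8")[:1]   # b'\xc3', the lead byte only


def byte_level_match(needle: bytes, data: bytes) -> bool:
    """Reading A: compare raw bytes, ignoring character boundaries."""
    return needle in data


def char_level_match(needle: bytes, data: bytes) -> bool:
    """Reading B: only complete, validly encoded characters can match."""
    try:
        needle_chars = needle.decode("utf-8")
    except UnicodeDecodeError:
        return False  # a fragment is not a character, so it matches nothing
    return needle_chars in data.decode("utf-8")


print(byte_level_match(fragment, haystack))                 # reading A
print(char_level_match(fragment, haystack))                 # reading B
print(char_level_match("\u00ef".encode("utf-8"), haystack)) # whole character
```

Under reading A the lone lead byte is found inside the text; under
reading B it is rejected and only the complete two-byte character matches.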

Could you please comment on this?

Thanks,
Gabor Kovesdan


_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-i18n
To unsubscribe, send any mail to "[hidden email]"

Re: POSIX regex VS. multi-byte characters

Wolfgang Zenker-2
Hi Gabor,

* Gabor Kovesdan <[hidden email]> [110902 04:08]:
> While working on bringing new regex code into FreeBSD, I ran into an
> issue. POSIX says here:
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09

> "Matching shall be based on the bit pattern used for encoding the
> character, not on the graphic representation of the character. This
> means that if a character set contains two or more encodings for a
> graphic symbol, or if the strings searched contain text encoded in more
> than one codeset, no attempt is made to search for any other
> representation of the encoded symbol. If that is required, the user can
> specify equivalence classes containing all variations of the desired
> graphic symbol."

> According to my interpretation of this text, if someone specifies as the
> pattern a single byte that can be a prefix of a multi-byte character,
> it shall match, since matching is based on the bit pattern, not on
> semantic meaning. Besides, in a consistent environment that uses a single
> encoding, and assuming a user with common sense who would not enter
> meaningless input, only whole characters should occur in the pattern.
> However, GNU grep has a test in its regression test suite that
> contradicts this and takes the opposite approach, i.e. it shall not
> match a fragment of a character. Looking at the standard, I think GNU
> grep is incorrect and my interpretation is the correct one.

I think you are misinterpreting the standard here. As I read it, the
phrase "bit pattern used for encoding the character" means the complete
byte sequence that encodes the character, not just a single byte. The
paragraph quoted above is about characters that have several different
encodings, e.g. characters that exist as a single code point but can also
be encoded as a base character plus diacritical marks.
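
As a small illustration of that case, the sketch below compares the
precomposed and decomposed forms of the same accented letter. It uses
Python's re and unicodedata modules, which are not a POSIX regex
implementation, so treat it as an analogy for "same graphic symbol,
different bit pattern" and nothing more.

```python
import re
import unicodedata

precomposed = "\u00e9"   # e with acute accent as a single code point
decomposed = "e\u0301"   # plain 'e' followed by COMBINING ACUTE ACCENT

# Same graphic symbol, two different byte sequences in UTF-8.
print(precomposed.encode("utf-8"))
print(decomposed.encode("utf-8"))

subject = "caf" + decomposed

# Matching follows the encoding, so the precomposed pattern does not
# find the decomposed form.
no_match = re.search(precomposed, subject)
print(no_match)

# Only after the subject is normalized to the same form does it match.
normalized = unicodedata.normalize("NFC", subject)
match = re.search(precomposed, normalized)
print(match is not None)
```

In POSIX terms, the standard leaves it to the user to list such variants
explicitly, for instance through equivalence classes.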

Wolfgang

Re: POSIX regex VS. multi-byte characters

Andrey Chernov
On Fri, Sep 02, 2011 at 08:03:38AM +0200, Wolfgang Zenker wrote:

> Hi Gabor,
>
> * Gabor Kovesdan <[hidden email]> [110902 04:08]:
> > While working on bringing new regex code into FreeBSD, I ran into an
> > issue. POSIX says here:
> > http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09
>
> > "Matching shall be based on the bit pattern used for encoding the
> > character, not on the graphic representation of the character. This
> > means that if a character set contains two or more encodings for a
> > graphic symbol, or if the strings searched contain text encoded in more
> > than one codeset, no attempt is made to search for any other
> > representation of the encoded symbol. If that is required, the user can
> > specify equivalence classes containing all variations of the desired
> > graphic symbol."
>
> > According to my interpretation of this text, if someone specifies as the
> > pattern a single byte that can be a prefix of a multi-byte character,
> > it shall match, since matching is based on the bit pattern, not on
> > semantic meaning. Besides, in a consistent environment that uses a single
> > encoding, and assuming a user with common sense who would not enter
> > meaningless input, only whole characters should occur in the pattern.
> > However, GNU grep has a test in its regression test suite that
> > contradicts this and takes the opposite approach, i.e. it shall not
> > match a fragment of a character. Looking at the standard, I think GNU
> > grep is incorrect and my interpretation is the correct one.
>
> I think you are misinterpreting the standard here. As I read it, the
> phrase "bit pattern used for encoding the character" means the complete
> byte sequence that encodes the character, not just a single byte. The
> paragraph quoted above is about characters that have several different
> encodings, e.g. characters that exist as a single code point but can also
> be encoded as a base character plus diacritical marks.

1) That is how I read it, too. "Bit pattern" means the complete pattern,
not a partial one. POSIX does not provide for matching partial or
fragmented characters; all characters there are always complete and
monolithic.

2) The overall intent is clear: e.g. the Russian 'a' should not match the
graphically identical English 'a' within a given character set such as
KOI8-R or Unicode.

3) Meaningless input should not match anything meaningful, so only one
question remains: should meaningless input match exactly the same
meaningless input, or should it fail with an error? POSIX grep says
nothing on this, and POSIX regexec() says no more than:
"The regcomp( ) and regexec( ) functions are required to accept any
null-terminated string as the pattern argument. If the meaning of the
string is 'undefined', the behavior of the function is 'unspecified'."
Currently GNU grep matches meaningless input against exactly the same
bytes in the file. A fragment of a character (an incomplete one) is
meaningless input, so I don't see where GNU grep takes the opposite
approach.
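
To illustrate point 2, a minimal sketch using Python's built-in codecs
(the helper name encodings is made up for this example): the Cyrillic
and Latin letters look the same but are encoded differently in both
KOI8-R and UTF-8, so matching on the encoding never confuses them.

```python
latin_a = "a"            # U+0061 LATIN SMALL LETTER A
cyrillic_a = "\u0430"    # U+0430 CYRILLIC SMALL LETTER A


def encodings(ch: str) -> dict:
    """Return the byte sequence of ch in KOI8-R and in UTF-8."""
    return {codec: ch.encode(codec) for codec in ("koi8_r", "utf-8")}


latin = encodings(latin_a)
cyrillic = encodings(cyrillic_a)

for codec in ("koi8_r", "utf-8"):
    # Print both encodings and whether they are identical.
    print(codec, latin[codec], cyrillic[codec], latin[codec] == cyrillic[codec])
```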
 
--
http://ache.vniz.net/