Christian Kruse: mbstring functions / UTF-8 in PHP

你好 wahsaga,

[...]
2. What about the string functions that are _not_ mentioned there,
such as trim() or preg_replace()? Are they multi-byte capable
"by nature", or must/should I expect complications here?

I have no idea about trim(), but the man page for libpcre, which PHP also
uses, says in pcrecompat:

1. PCRE does not have full UTF-8 support. Details of what it does have are
given in the section on UTF-8 support in the main pcre page.

The section it refers to reads:

UTF-8 SUPPORT

Starting at release 3.3, PCRE has had some support for character strings encoded in the UTF-8 format. For release 4.0 this has been
greatly extended to cover most common requirements.

In order to process UTF-8 strings, you must build PCRE to include UTF-8 support in the code, and, in addition, you must call
pcre_compile() with the PCRE_UTF8 option flag. When you do this, both the pattern and any subject strings that are matched against
it are treated as UTF-8 strings instead of just strings of bytes.

If you compile PCRE with UTF-8 support, but do not use it at run time, the library will be a bit bigger, but the additional run time
overhead is limited to testing the PCRE_UTF8 flag in several places, so should not be very large.

The following comments apply when PCRE is running in UTF-8 mode:

1. When you set the PCRE_UTF8 flag, the strings passed as patterns and subjects are checked for validity on entry to the relevant
functions. If an invalid UTF-8 string is passed, an error return is given. In some situations, you may already know that your strings
are valid, and therefore want to skip these checks in order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag at compile
time or at run time, PCRE assumes that the pattern or subject it is given (respectively) contains only valid UTF-8 codes. In this
case, it does not diagnose an invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE when PCRE_NO_UTF8_CHECK is set, the
results are undefined. Your program may crash.

2. In a pattern, the escape sequence \x{...}, where the contents of the braces is a string of hexadecimal digits, is interpreted as a
UTF-8 character whose code number is the given hexadecimal number, for example: \x{1234}. If a non-hexadecimal digit appears between
the braces, the item is not recognized. This escape sequence can be used either as a literal, or within a character class.

3. The original hexadecimal escape sequence, \xhh, matches a two-byte UTF-8 character if the value is greater than 127.

4. Repeat quantifiers apply to complete UTF-8 characters, not to individual bytes, for example: \x{100}{3}.

5. The dot metacharacter matches one UTF-8 character instead of a single byte.

6. The escape sequence \C can be used to match a single byte in UTF-8 mode, but its use can lead to some strange effects.

7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test characters of any code value, but the characters that PCRE
recognizes as digits, spaces, or word characters remain the same set as before, all with values less than 256.

8. Case-insensitive matching applies only to characters whose values are less than 256. PCRE does not support the notion of "case"
for higher-valued characters.

9. PCRE does not support the use of Unicode tables and properties or the Perl escapes \p, \P, and \X.
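
Notes 1, 2 and 4 can be tried out from PHP roughly as follows. This is only a sketch: it assumes a reasonably recent PHP whose PCRE was built with UTF-8 support, and the variable names are made up for illustration. The quoted notes describe an old PCRE release, so other details (notes 7 to 9 in particular) may behave differently on a current build.

```php
<?php
// Subjects are written as byte escapes so the result does not depend
// on the encoding of this source file.
$euro   = "\xE2\x82\xAC";              // U+20AC, three bytes in UTF-8
$threeA = str_repeat("\xC4\x80", 3);   // U+0100 three times, six bytes

// Note 2: \x{...} names a character by its code number.
// Single quotes pass the backslash through to PCRE unchanged.
$euroMatches = preg_match('/^\x{20AC}$/u', $euro);

// Note 4: the quantifier {3} counts whole characters, not bytes.
$quantMatches = preg_match('/^\x{100}{3}$/u', $threeA);

// Note 1: an invalid UTF-8 subject is rejected instead of being matched.
$invalid = "\xC3\x28";                 // lead byte without a continuation byte
$result  = preg_match('/./u', $invalid);
$isUtf8Error = ($result !== 1 && preg_last_error() === PREG_BAD_UTF8_ERROR);

var_dump($euroMatches, $quantMatches, $isUtf8Error);
```

Checking preg_last_error() after a failed call is the PHP-side counterpart of the "error return" mentioned in note 1.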

So you need to add the u flag to your preg_* patterns
(preg_match('/.../u', $txt)) and keep the notes above in mind.
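
To see what the u flag changes, here is a small sketch that counts what the dot matches with and without it (note 5 above). The test string and variable names are my own assumptions, not something from the man page.

```php
<?php
// "äöü" as UTF-8 byte escapes: 3 characters, 6 bytes.
$txt = "\xC3\xA4\xC3\xB6\xC3\xBC";

// Without u the dot matches single bytes.
$byteCount = preg_match_all('/./', $txt, $m);

// With u the dot matches whole UTF-8 characters.
$charCount = preg_match_all('/./u', $txt, $m);

var_dump($byteCount, $charCount);
```

If the two numbers differ, the string contains multi-byte characters and a pattern without the u flag is working on bytes.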

再见,
 CK

--
1 + 1 = 3 for large values of 1.
http://wwwtech.de/