Hello wahsaga,
[...]
2. What about the string functions that are _not_ listed at this
point, such as trim() or preg_replace()? Are they multi-byte
capable "by nature", or must/should I expect complications
here?
I don't know about trim(), but the man page for libpcre, which PHP also
uses, says in pcrecompat:
1. PCRE does not have full UTF-8 support. Details of what it does have are
given in the section on UTF-8 support in the main pcre page.
It goes on to say:
UTF-8 SUPPORT
Starting at release 3.3, PCRE has had some support for character strings encoded in the UTF-8 format. For release 4.0 this has been
greatly extended to cover most common requirements.
In order to process UTF-8 strings, you must build PCRE to include UTF-8 support in the code, and, in addition, you must call pcre_compile()
with the PCRE_UTF8 option flag. When you do this, both the pattern and any subject strings that are matched against it are
treated as UTF-8 strings instead of just strings of bytes.
If you compile PCRE with UTF-8 support, but do not use it at run time, the library will be a bit bigger, but the additional run time
overhead is limited to testing the PCRE_UTF8 flag in several places, so should not be very large.
The following comments apply when PCRE is running in UTF-8 mode:
1. When you set the PCRE_UTF8 flag, the strings passed as patterns and subjects are checked for validity on entry to the relevant
functions. If an invalid UTF-8 string is passed, an error return is given. In some situations, you may already know that your strings
are valid, and therefore want to skip these checks in order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag at compile
time or at run time, PCRE assumes that the pattern or subject it is given (respectively) contains only valid UTF-8 codes. In this
case, it does not diagnose an invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE when PCRE_NO_UTF8_CHECK is set, the
results are undefined. Your program may crash.
2. In a pattern, the escape sequence \x{...}, where the contents of the braces is a string of hexadecimal digits, is interpreted as a
UTF-8 character whose code number is the given hexadecimal number, for example: \x{1234}. If a non-hexadecimal digit appears between
the braces, the item is not recognized. This escape sequence can be used either as a literal, or within a character class.
3. The original hexadecimal escape sequence, \xhh, matches a two-byte UTF-8 character if the value is greater than 127.
4. Repeat quantifiers apply to complete UTF-8 characters, not to individual bytes, for example: \x{100}{3}.
5. The dot metacharacter matches one UTF-8 character instead of a single byte.
6. The escape sequence \C can be used to match a single byte in UTF-8 mode, but its use can lead to some strange effects.
7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test characters of any code value, but the characters that PCRE
recognizes as digits, spaces, or word characters remain the same set as before, all with values less than 256.
8. Case-insensitive matching applies only to characters whose values are less than 256. PCRE does not support the notion of "case"
for higher-valued characters.
9. PCRE does not support the use of Unicode tables and properties or the Perl escapes \p, \P, and \X.
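To make points 2, 4 and 5 concrete, here is a small sketch in PHP. It assumes a PHP build whose PCRE was compiled with UTF-8 support, and it uses the u pattern modifier, which is how PHP sets PCRE_UTF8. The variable names and the sample character are only illustrative.

```php
<?php
// "ä" is U+00E4; in UTF-8 it is the two bytes 0xC3 0xA4.
$subject = "\xC3\xA4";

// Byte mode (no u modifier): the dot matches one byte,
// so the single character needs two dots to match.
$oneDotBytes  = preg_match('/^.$/',  $subject);
$twoDotsBytes = preg_match('/^..$/', $subject);

// UTF-8 mode (u modifier): the dot matches one whole character (point 5).
$oneDotUtf8 = preg_match('/^.$/u', $subject);

// \x{...} names a character by its code point (point 2).
$escape = preg_match('/^\x{E4}$/u', $subject);

// A quantifier repeats the whole character, not a single byte (point 4).
$repeat = preg_match('/^\x{E4}{3}$/u', str_repeat($subject, 3));

var_dump($oneDotBytes, $twoDotsBytes, $oneDotUtf8, $escape, $repeat);
```

Only the first call should fail to match; the other four should report one match each.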
So you have to add the u flag to your preg_* patterns
(preg_match('/.../u',$txt)) and keep the notes above in mind.
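A rough sketch of what the u flag changes for preg_replace(), and of the validity check from point 1 of the man page. The sample string and variable names are made up for illustration, and I am assuming a PHP build whose PCRE has UTF-8 support.

```php
<?php
// Hypothetical input: "Grüße" as UTF-8 (ü = C3 BC, ß = C3 9F).
$txt = "Gr\xC3\xBC\xC3\x9Fe";

// Without u, each of the four non-ASCII bytes is replaced separately.
$withoutU = preg_replace('/[^a-zA-Z]/', '_', $txt);

// With u, each of the two non-ASCII characters is replaced once.
$withU = preg_replace('/[^a-zA-Z]/u', '_', $txt);

// With u, the subject is checked for valid UTF-8 first. A lone lead
// byte is invalid, so preg_match() returns false instead of a count.
$invalid = "\xC3";
$result  = preg_match('/./u', $invalid);
$error   = preg_last_error();

var_dump($withoutU, $withU, $result, $error === PREG_BAD_UTF8_ERROR);
```

Because preg_match() returns false (not 0) on invalid input, it is worth comparing with === and looking at preg_last_error() when the text comes from outside.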
Goodbye,
CK