From Unicode to iso-8859-1, by Calocybe, 04.09.2003 19:51


perl

Hi!

If you just want to quickly convert UTF-8 to Latin-1 without having to install the latest Perl version or various modules, you may find the following code helpful. It only works correctly if the input really is a UTF-8 string and every character in it can be represented in Latin-1.

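sub utf8_to_latin1($) {
    my ($i, @s);

    # split the input into its byte values
    @s = unpack('C*', $_[0]);
    for ($i=0; $i<@s; $i++) {
        # lead byte 110000xx: merge it and the following byte into one Latin-1 char
        splice(@s, $i, 2, (($s[$i] & 0x03) << 6) | ($s[$i+1] & 0x3F))
            if (($s[$i] & 0xFC) == 0xC0);
    }
    return pack('C*', @s);
}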
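
To make the bit fiddling above a bit more tangible, here is a small worked example for a single character. It is only a sketch: the variable names are made up for illustration, and it assumes the two bytes 0xC3 0xA4, which are the UTF-8 encoding of "ä" (U+00E4).

```perl
use strict;
use warnings;

# "ä" is U+00E4; in UTF-8 that is the two bytes 0xC3 0xA4.
my @utf8 = (0xC3, 0xA4);

# For code points below 0x100 the lead byte has the form 110000xx.
my $is_lead = (($utf8[0] & 0xFC) == 0xC0) ? 1 : 0;

# The low two bits of the lead byte become bits 7..6 of the result,
# the low six bits of the continuation byte become bits 5..0.
my $high   = ($utf8[0] & 0x03) << 6;   # 0xC0
my $low    =  $utf8[1] & 0x3F;         # 0x24
my $latin1 = $high | $low;             # 0xE4, "ä" in Latin-1

printf "lead=%d high=0x%02X low=0x%02X latin1=0x%02X\n",
    $is_lead, $high, $low, $latin1;
# prints: lead=1 high=0xC0 low=0x24 latin1=0xE4
```

The same mask-and-shift step is what the splice() call in the function performs for every two-byte sequence it finds.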

To make it easier to follow, here is the function once more in long form:

sub utf8_to_latin1($) {
    my ($i, @s);

    @s = unpack('C*', $_[0]);

    $i = 0;
    while ($i < @s) {
        if ($s[$i] & 0x80) {
            # bit 7 set: part of a multi-byte UTF-8 sequence
            if (($s[$i] & 0xFC) == 0xC0) {
                # lead byte 110000xx: the sequence yields a valid Latin-1 char
                $s[$i] = (($s[$i] & 0x03) << 6) | ($s[$i+1] & 0x3F);
                splice(@s, $i+1, 1);
                $i++;
            } else {
                # Any other Unicode char. We could determine the number of bytes
                # of this sequence and skip them all at once, but in a valid UTF-8
                # stream the following bytes all have bit 7 set and bit 6 unset,
                # so we can just skip this byte and the following ones will be
                # skipped automatically as well. There are faster approaches, but
                # this case is not expected to happen at all; after all, the
                # string should be encodable in iso-8859-1.
                $i++;
            }
        } else {
            # ASCII - leave unchanged
            $i++;
        }
    }

    return pack('C*', @s);
}
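
Since the function relies on the input really being UTF-8 that fits into Latin-1, one might want to check that beforehand. The following is only a sketch of such a check, not part of the code above: the name is_latin1_utf8 is hypothetical, and it assumes the string holds raw bytes (not a string Perl has already decoded). It is also a bit stricter than the converter, because it rejects the overlong lead bytes 0xC0 and 0xC1.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the byte string consists only of
# ASCII bytes and two-byte UTF-8 sequences for U+0080..U+00FF, else 0.
sub is_latin1_utf8 {
    my ($bytes) = @_;
    return $bytes =~ /\A(?:[\x00-\x7F]|[\xC2\xC3][\x80-\xBF])*\z/ ? 1 : 0;
}

print is_latin1_utf8("Gr\xC3\xBC\xC3\x9Fe"), "\n";  # "Grüße" as UTF-8: prints 1
print is_latin1_utf8("\xE2\x82\xAC"), "\n";         # Euro sign, three bytes: prints 0
print is_latin1_utf8("\xC3"), "\n";                 # truncated sequence: prints 0
```

If the check fails, the converter would pass the offending bytes through unchanged, so it is probably better to report an error at that point than to convert anyway.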

HTH && So long

--
I'm sorry. It has to end here.


SELFHTML Forum - Supplement to the documentation
