[Israel.pm] UTF-8
Mikhael Goikhman
migo at homemail.com
Thu May 3 17:42:48 EEST 2007
On 03 May 2007 12:41:30 +0300, Pinkhas Nisanov wrote:
>
> I need to convert some text to UTF-8, problem is that in text I have
> some characters encoded in ISO-8859-1 and some in UTF-8.
So your problem is this: you have a binary string (with no utf8 flag
set) in mixed encoding, and you want to convert it to a utf-8 encoded
string (with or without the utf8 flag set). Here is an example of such a
string:
# this example should work on any modern unix and GNU date
my $mixed_string = join("",
`env LANG=he_IL.utf8 date -d 2007-02-28`,
`env LANG=fr_FR.iso-8859-1 date -d 2007-02-28`,
);
# when you print it, expect invalid chars in utf8 or iso-8859-1
# part or in both, depending on your locale and font
print $mixed_string;
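If GNU date or those locales are not available, the same kind of string can be built by hand. This is only a sketch with made-up sample bytes (the names $utf8_part, $latin1_part and $mixed_bytes are mine, not from the original question); it dumps the octets in hex so the mixture is visible whatever the terminal does:

```perl
use strict;
use warnings;

# UTF-8 encoding of the Hebrew letter shin (U+05E9): two bytes
my $utf8_part   = "\xD7\xA9";
# "fevrier" with e-acute as the single ISO-8859-1 byte 0xE9
my $latin1_part = "f\xE9vrier";

# A byte string in mixed encoding, no utf8 flag set
my $mixed_bytes = $utf8_part . " " . $latin1_part;

# Print the raw octets in hex
print join(" ", map { sprintf "%02X", ord } split //, $mixed_bytes), "\n";
# prints: D7 A9 20 66 E9 76 72 69 65 72
```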
> When I run "encode_utf8" function it convert ISO characters to UTF, but
> it also convert UTF characters to something unreadable. Is there some
> way to convert iso->utf and leave utf without change?
This is fairly easy. To parse a binary string into utf8 chars you should
use decode_utf8[*], and it will fail on non-utf8 bytes. How it fails
depends on the optional CHECK parameter: it can substitute invalid bytes
with special codes (FB_DEFAULT), die (FB_CROAK), decode only the head of
the string up to the first invalid byte (FB_QUIET), or do the same with a
warning (FB_WARN). See the Encode man page for details.
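To make the failure modes concrete, here is a small sketch using the FB_DEFAULT, FB_CROAK and FB_QUIET constants from Encode. The sample bytes and variable names are my own assumptions. It prints only counts, since the exact substitution output may differ between Encode versions:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: valid UTF-8 for U+05E9, then "fevrier" in ISO-8859-1
my $bytes = "\xD7\xA9 f\xE9vrier";

# FB_DEFAULT: decoding continues, invalid input is replaced by U+FFFD
my $substituted = decode("UTF-8", $bytes, Encode::FB_DEFAULT());
my $replaced = () = $substituted =~ /\x{FFFD}/g;

# FB_CROAK: decoding dies at the first invalid byte
my $copy = $bytes;    # decode may modify its input when CHECK is set
my $ok = eval { decode("UTF-8", $copy, Encode::FB_CROAK()); 1 };
my $croaked = !$ok;

# FB_QUIET: returns the decoded head and leaves the undecoded tail in $rest
# (FB_WARN behaves the same way but also emits a warning)
my $rest = $bytes;
my $head = decode("UTF-8", $rest, Encode::FB_QUIET());

printf "replacement chars: %d\n", $replaced;
printf "croaked: %s\n", $croaked ? "yes" : "no";
printf "head: %d chars, rest: %d bytes\n", length($head), length($rest);
```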
So the following two lines should solve your problem completely:
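use Encode qw(decode decode_utf8);
# with HTMLCREF, invalid bytes are expected to come back as &#NNN;
my $utf8_string = decode_utf8($mixed_string, Encode::HTMLCREF);
# turn each &#NNN; back into the iso-8859-1 char it stands for
$utf8_string =~ s/&#(\d+);/decode("iso-8859-1", chr($1))/eg;
#use encoding ":locale", STDOUT => 'utf8';
print $utf8_string;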
The doc says that it is possible to pass your own failure handler to
decode_utf8, so this could be done directly in one line, but passing a
coderef does not work for me as documented (the sub is not called in
Encode 2.12 or 2.20). However, the solution above should be good enough.
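Another way to get the same result, sketched here as an alternative rather than anything Encode provides, is to walk the byte string with a regex. Runs of well-formed UTF-8 are decoded as UTF-8 and every other byte is decoded as iso-8859-1. This does not depend on how a given Encode version renders undecodable bytes, and it leaves alone any literal &#NNN; already present in the text. The pattern $utf8_char and the helper decode_mixed are hypothetical names of mine. Note that two iso-8859-1 bytes which happen to form a valid UTF-8 sequence are still taken as UTF-8; no approach can tell those apart.

```perl
use strict;
use warnings;
use Encode qw(decode);

# One well-formed UTF-8 sequence (byte ranges per RFC 3629)
my $utf8_char = qr/
      [\x00-\x7F]
    | [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Hypothetical helper: decode a byte string of mixed UTF-8 and iso-8859-1
sub decode_mixed {
    my ($bytes) = @_;
    my $out = "";
    # Each step takes either a run of valid UTF-8 or one stray byte
    while ($bytes =~ /\G(?:((?:$utf8_char)+)|(.))/gs) {
        my ($valid_run, $stray_byte) = ($1, $2);
        $out .= defined $valid_run
            ? decode("UTF-8", $valid_run)
            : decode("iso-8859-1", $stray_byte);
    }
    return $out;
}

my $text = decode_mixed("\xD7\xA9 f\xE9vrier");
# Print code points rather than the chars, to stay terminal-independent
print join(" ", map { sprintf "U+%04X", ord } split //, $text), "\n";
```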
[*] In practice I almost never use the encode_utf8 function; on an
already decoded string it just unsets the utf8 flag, which I don't
usually need to do. I usually need the opposite, i.e. to set the utf8
flag on a string using decode_utf8.
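A short illustration of the difference, with sample bytes of my own choosing: decode_utf8 turns octets into flagged characters, encode_utf8 goes the other way, and calling encode_utf8 on bytes that are already UTF-8 encodes them a second time, which is the "unreadable" output described in the question.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

my $octets = "\xD7\xA9";              # UTF-8 bytes for U+05E9, no utf8 flag
my $chars  = decode_utf8($octets);    # one character, utf8 flag set
my $again  = encode_utf8($chars);     # back to two bytes, flag unset
my $twice  = encode_utf8($octets);    # bytes treated as chars: double-encoded

printf "lengths: %d %d %d %d\n",
    length($octets), length($chars), length($again), length($twice);
# prints: lengths: 2 1 2 4
printf "flags: %s\n",
    join " ", map { utf8::is_utf8($_) ? "set" : "unset" }
        $octets, $chars, $again, $twice;
# prints: flags: unset set unset unset
```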
Regards,
Mikhael.
--
perl -e 'print+chr(64+hex)for+split//,d9b815c07f9b8d1e'