Re: [freebsd-fdp] OT: charset mess
On Wed, Apr 14, 2004 at 07:29:21PM +0200, Viktor Vasilev wrote:
> On Wed, Apr 14, 2004 at 12:53:38PM -0400, lou wrote:
> > On Wed, Apr 14, 2004 at 05:49:01PM +0200, Viktor Vasilev wrote:
> > > On Wed, Apr 14, 2004 at 08:26:31AM -0400, Miroslav Pendev wrote:
> > > > On Wed, Apr 14, 2004 at 09:13:34AM +0300, Peter Pentchev wrote:
> > > > > On Wed, Apr 14, 2004 at 01:31:30AM -0400, lou wrote:
> In my opinion this is the only problem with archiving in a fixed encoding:
> what happens when the MIME content-type is wrong. If everything is fine with
> the message, re-encoding it before it enters the archive is trivial.
exactly!
> > <pseudo>
> > wrap ezmlm-archive
> > get declared content-type
> > get detected enc
> > if they do not match
> >     fix content-type according to enc.
> > pipe back to ezmlm-archive.
> > </pseudo>
> >
> > This would be a 5-line script, IMHO. What detects the encoding is another matter :)
>
> Go for it, tigger! :-)
> To see whether the encoding is wrong, you can compare the content-type from the
> header with konwert's opinion. It might even work :-)
Well, that's the idea. converters/enca is a reasonable tool, and it doesn't drag in much cruft either:
% ldd /usr/local/bin/enca
/usr/local/bin/enca:
libm.so.2 => /usr/lib/libm.so.2 (0x2806f000)
libiconv.so.3 => /usr/local/lib/libiconv.so.3 (0x2808b000)
libenca.so.3 => /usr/local/lib/libenca.so.3 (0x28179000)
libc.so.4 => /usr/lib/libc.so.4 (0x2819a000)
see.
-- cut here --
% cd web/
% enca -L bg index.html
MS-Windows code page 1251
LF line terminators
% cd ../web-utf8
% enca -L bg index.html
Universal transformation format 8 bits; UTF-8
% enca --list languages
Belarussian: CP1251 IBM866 ISO-8859-5 KOI8-UNI maccyr IBM855
Bulgarian: CP1251 ISO-8859-5 IBM855 maccyr ECMA-113
Czech: ISO-8859-2 CP1250 IBM852 KEYBCS2 macce KOI-8_CS_2 CORK
Estonian: ISO-8859-4 CP1257 IBM775 ISO-8859-13 macce baltic
Croatian: CP1250 ISO-8859-2 IBM852 macce CORK
Hungarian: ISO-8859-2 CP1250 IBM852 macce CORK
Lithuanian: CP1257 ISO-8859-4 IBM775 ISO-8859-13 macce baltic
Latvian: CP1257 ISO-8859-4 IBM775 ISO-8859-13 macce baltic
Polish: ISO-8859-2 CP1250 IBM852 macce ISO-8859-13 ISO-8859-16 baltic CORK
Russian: KOI8-R CP1251 ISO-8859-5 IBM866 maccyr
Slovak: CP1250 ISO-8859-2 IBM852 KEYBCS2 macce KOI-8_CS_2 CORK
Slovene: ISO-8859-2 CP1250 IBM852 macce CORK
Ukrainian: CP1251 IBM855 ISO-8859-5 CP1125 KOI8-U maccyr
none:
%
-- cut here --
Same result if the ru language is used instead.
-- cut here --
% enca -L ru index.html
Universal transformation format 8 bits; UTF-8
% cd ../web
% enca -L ru index.html
MS-Windows code page 1251
LF line terminators
%
-- cut here --
And here is a test with the emails:
-- cut here --
; echo $list
01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
; for (s in $list) enca -L bg $s
7bit ASCII characters
7bit ASCII characters
Universal transformation format 8 bits; UTF-8
Doubly-encoded to UTF-8 from ISO-8859-5
Universal transformation format 8 bits; UTF-8
Doubly-encoded to UTF-8 from CP1251
Universal transformation format 8 bits; UTF-8
Universal transformation format 8 bits; UTF-8
Doubly-encoded to UTF-8 from CP1251
Universal transformation format 8 bits; UTF-8
Doubly-encoded to UTF-8 from ISO-8859-5
Universal transformation format 8 bits; UTF-8
Universal transformation format 8 bits; UTF-8
Doubly-encoded to UTF-8 from ISO-8859-5
Universal transformation format 8 bits; UTF-8
Doubly-encoded to UTF-8 from ISO-8859-5
Universal transformation format 8 bits; UTF-8
Doubly-encoded to UTF-8 from ISO-8859-5
Universal transformation format 8 bits; UTF-8
Doubly-encoded to UTF-8 from CP1251
Universal transformation format 8 bits; UTF-8
Doubly-encoded to UTF-8 from ISO-8859-5
Universal transformation format 8 bits; UTF-8
Doubly-encoded to UTF-8 from ISO-8859-5
7bit ASCII characters
7bit ASCII characters
MS-Windows code page 1251
LF line terminators
MS-Windows code page 1251
LF line terminators
Universal transformation format 8 bits; UTF-8
Doubly-encoded to UTF-8 from ISO-8859-5
Universal transformation format 8 bits; UTF-8
Universal transformation format 8 bits; UTF-8
7bit ASCII characters
Universal transformation format 8 bits; UTF-8
Universal transformation format 8 bits; UTF-8
Universal transformation format 8 bits; UTF-8
-- cut here --
At the end of the day, we know the list is Bulgarian, so we expect
mostly CP1251, UTF-8, KOI8-R and others from the Bulgarian/Russian/Ukrainian rows.
Is that good enough?
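
A minimal sketch of the wrapper from the pseudocode quoted above, in Python. The
names fix_content_type, detect_with_enca and EXPECTED are made up for this
illustration, and the allow-list is only a guess at the charsets worth trusting
on a Bulgarian list. It assumes enca's -m option prints the MIME name of the
detected charset. Multipart messages are passed through untouched, and how the
result is handed back to ezmlm-archive is not shown.

```python
import subprocess
from email import message_from_bytes
from typing import Callable, Optional

# Hypothetical allow-list: MIME charset names we are willing to trust here.
EXPECTED = {"windows-1251", "utf-8", "koi8-r", "koi8-u", "iso-8859-5"}

Detector = Callable[[bytes], Optional[str]]


def detect_with_enca(body: bytes, language: str = "bg") -> Optional[str]:
    """Return enca's MIME charset name for body, or None if enca is
    missing or could not decide."""
    try:
        result = subprocess.run(
            ["enca", "-L", language, "-m"],  # assumed: -m prints the MIME name
            input=body,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        return None
    name = result.stdout.decode("ascii", "replace").strip()
    return name if result.returncode == 0 and name else None


def fix_content_type(raw: bytes, detect: Detector = detect_with_enca) -> bytes:
    """Rewrite the charset parameter when it disagrees with the detector."""
    msg = message_from_bytes(raw)
    if msg.is_multipart():
        return raw  # out of scope for this sketch
    body = msg.get_payload(decode=True) or b""
    declared = msg.get_content_charset()  # lower-cased, or None if absent
    detected = detect(body)
    # Leave the message alone unless the detector names an expected charset.
    if detected is None or detected.lower() not in EXPECTED:
        return raw
    if declared == detected.lower():
        return raw  # header and detector agree
    msg.set_param("charset", detected)  # creates Content-Type if missing
    return msg.as_bytes()
```

Passing the detector in as a parameter keeps the header logic testable without
enca installed, and makes it easy to try konwert or anything else in its place.
Pure ASCII messages are left as they are, since US-ASCII is not on the allow-list.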
l