
Re: [freebsd-fdp] OT: charset mess



On Wed, Apr 14, 2004 at 07:29:21PM +0200, Viktor Vasilev wrote:
> On Wed, Apr 14, 2004 at 12:53:38PM -0400, lou wrote:
> > On Wed, Apr 14, 2004 at 05:49:01PM +0200, Viktor Vasilev wrote:
> > > On Wed, Apr 14, 2004 at 08:26:31AM -0400, Miroslav Pendev wrote:
> > > > On Wed, Apr 14, 2004 at 09:13:34AM +0300, Peter Pentchev wrote:
> > > > > On Wed, Apr 14, 2004 at 01:31:30AM -0400, lou wrote:

> In my opinion that is the only problem with archiving in a fixed encoding:
> what happens when the MIME content-type is wrong. If everything is fine with
> the message, recoding it before it enters the archive is trivial.

exactly!

> > <pseudo>
> > wrap ezmlm-archive
> > 	get content-type 
> > 	get enc detected
> > 	if they do not match
> > 		fix content-type according to enc.
> > 	pipe back to ezmlm-archive.
> > </pseudo>
> > 
> > That would be a 5-line script, IMHO. What detects the encoding is another matter :)
> 
> Go for it, tigger! :-)
> To see whether the encoding is wrong, you can compare the content-type from
> the header with konwert's opinion. It might even work :-)
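
The wrapper from the pseudocode above could look roughly like the following
Python sketch. It is only an illustration under a few assumptions: the message
is a single-part text message, and `detect` is a hypothetical callable (for
example a thin wrapper around enca or konwert) that takes the raw body bytes
and returns a MIME charset name, or None when it has no opinion. Nothing here
is ezmlm-specific; the returned bytes would still have to be piped on to
ezmlm-archive.

```python
import codecs
from email import message_from_bytes


def canonical(name):
    """Return Python's canonical codec name, or None if the name is unknown."""
    if not name:
        return None
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def fix_declared_charset(raw, detect):
    """Rewrite the Content-Type charset when it disagrees with `detect`.

    `detect` is a hypothetical detector: bytes -> MIME charset name or None.
    """
    msg = message_from_bytes(raw)
    if msg.is_multipart() or msg.get_content_maintype() != "text":
        return raw  # out of scope for this sketch
    body = msg.get_payload(decode=True) or b""
    detected = detect(body)
    if canonical(detected) is None:
        return raw  # no usable opinion from the detector: leave it alone
    if canonical(msg.get_content_charset()) == canonical(detected):
        return raw  # header and detector agree (aliases count as equal)
    msg.set_param("charset", detected)
    return msg.as_bytes()
```

Comparing canonical codec names rather than raw strings keeps aliases such as
cp1251 and windows-1251 from being treated as a mismatch.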

Well, that's the idea. converters/enca is a reasonable tool, and it doesn't drag in much cruft either:

% ldd /usr/local/bin/enca
/usr/local/bin/enca:
	libm.so.2 => /usr/lib/libm.so.2 (0x2806f000)
	libiconv.so.3 => /usr/local/lib/libiconv.so.3 (0x2808b000)
	libenca.so.3 => /usr/local/lib/libenca.so.3 (0x28179000)
	libc.so.4 => /usr/lib/libc.so.4 (0x2819a000)

see.

-- cut here --

% cd web/
% enca -L bg index.html
MS-Windows code page 1251
  LF line terminators
% cd ../web-utf8
% enca -L bg index.html
Universal transformation format 8 bits; UTF-8
% enca --list languages
Belarussian: CP1251 IBM866 ISO-8859-5 KOI8-UNI maccyr IBM855
  Bulgarian: CP1251 ISO-8859-5 IBM855 maccyr ECMA-113
      Czech: ISO-8859-2 CP1250 IBM852 KEYBCS2 macce KOI-8_CS_2 CORK
   Estonian: ISO-8859-4 CP1257 IBM775 ISO-8859-13 macce baltic
   Croatian: CP1250 ISO-8859-2 IBM852 macce CORK
  Hungarian: ISO-8859-2 CP1250 IBM852 macce CORK
 Lithuanian: CP1257 ISO-8859-4 IBM775 ISO-8859-13 macce baltic
    Latvian: CP1257 ISO-8859-4 IBM775 ISO-8859-13 macce baltic
     Polish: ISO-8859-2 CP1250 IBM852 macce ISO-8859-13 ISO-8859-16 baltic CORK
    Russian: KOI8-R CP1251 ISO-8859-5 IBM866 maccyr
     Slovak: CP1250 ISO-8859-2 IBM852 KEYBCS2 macce KOI-8_CS_2 CORK
    Slovene: ISO-8859-2 CP1250 IBM852 macce CORK
  Ukrainian: CP1251 IBM855 ISO-8859-5 CP1125 KOI8-U maccyr
       none:
% 
-- cut here --

Same result if the ru language is used instead.

-- cut here --
% enca -L ru index.html
Universal transformation format 8 bits; UTF-8
% cd ../web
% enca -L ru index.html
MS-Windows code page 1251
  LF line terminators
% 

-- cut here --
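
The names enca prints in the transcripts above are human-readable ones, not
what goes into a Content-Type header. A minimal sketch of a lookup table
follows; the left-hand strings are copied from the output above, while the
MIME names on the right are my own mapping and not something enca printed
here.

```python
# Keys: human-readable names as they appear in the enca output above.
# Values: MIME charset names chosen for this sketch (an assumption).
ENCA_TO_MIME = {
    "7bit ASCII characters": "US-ASCII",
    "MS-Windows code page 1251": "windows-1251",
    "Universal transformation format 8 bits; UTF-8": "UTF-8",
}


def mime_name(enca_line):
    """Map one enca result line to a MIME charset name, or None if unknown."""
    return ENCA_TO_MIME.get(enca_line.strip())
```

Anything not in the table comes back as None, so an unrecognized detection
can be left alone rather than guessed at.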

And here is a test with the emails:

-- cut here --

; echo $list
01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
; for (s in $list) enca -L bg $s
7bit ASCII characters
7bit ASCII characters
Universal transformation format 8 bits; UTF-8
  Doubly-encoded to UTF-8 from ISO-8859-5
Universal transformation format 8 bits; UTF-8
  Doubly-encoded to UTF-8 from CP1251
Universal transformation format 8 bits; UTF-8
Universal transformation format 8 bits; UTF-8
  Doubly-encoded to UTF-8 from CP1251
Universal transformation format 8 bits; UTF-8
  Doubly-encoded to UTF-8 from ISO-8859-5
Universal transformation format 8 bits; UTF-8
Universal transformation format 8 bits; UTF-8
  Doubly-encoded to UTF-8 from ISO-8859-5
Universal transformation format 8 bits; UTF-8
  Doubly-encoded to UTF-8 from ISO-8859-5
Universal transformation format 8 bits; UTF-8
  Doubly-encoded to UTF-8 from ISO-8859-5
Universal transformation format 8 bits; UTF-8
  Doubly-encoded to UTF-8 from CP1251
Universal transformation format 8 bits; UTF-8
  Doubly-encoded to UTF-8 from ISO-8859-5
Universal transformation format 8 bits; UTF-8
  Doubly-encoded to UTF-8 from ISO-8859-5
7bit ASCII characters
7bit ASCII characters
MS-Windows code page 1251
  LF line terminators
MS-Windows code page 1251
  LF line terminators
Universal transformation format 8 bits; UTF-8
  Doubly-encoded to UTF-8 from ISO-8859-5
Universal transformation format 8 bits; UTF-8
Universal transformation format 8 bits; UTF-8
7bit ASCII characters
Universal transformation format 8 bits; UTF-8
Universal transformation format 8 bits; UTF-8
Universal transformation format 8 bits; UTF-8

-- cut here --
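
In the output above each result is one unindented charset line, optionally
followed by indented qualifier lines ("Doubly-encoded ...", "LF line
terminators"). A small sketch that groups such output per file, assuming that
layout holds for every result; the function names are made up for this
example.

```python
def group_enca_output(text):
    """Group enca's output into (charset_line, [qualifier_lines]) pairs.

    Assumes each result starts with an unindented line and that any
    indented lines after it belong to that result.
    """
    results = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0].isspace():
            if results:
                results[-1][1].append(line.strip())
        else:
            results.append((line.strip(), []))
    return results


def doubly_encoded_from(qualifiers):
    """Return the charset named in a 'Doubly-encoded' qualifier, if any."""
    prefix = "Doubly-encoded to UTF-8 from "
    for q in qualifiers:
        if q.startswith(prefix):
            return q[len(prefix):]
    return None


sample = """\
7bit ASCII characters
Universal transformation format 8 bits; UTF-8
  Doubly-encoded to UTF-8 from ISO-8859-5
MS-Windows code page 1251
  LF line terminators
"""

for charset, qualifiers in group_enca_output(sample):
    print(charset, "|", doubly_encoded_from(qualifiers))
```

The sample is three results lifted from the run above; feeding it the whole
transcript would give one pair per message.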

At the end of the day, we know the list is Bulgarian, so we expect mostly
CP1251, UTF-8, KOI8-R and the others from the Bulgarian/Russian/Ukrainian rows.
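
Since the set of plausible charsets is small, a detection outside it is more
likely a misfire than a real finding. A minimal sketch of such a guard; the
exact set below is an assumption built from the charsets named here plus two
from the enca language rows, and would need adjusting per list.

```python
import codecs


def _canon(name):
    """Canonical Python codec name, or None for unknown or empty names."""
    if not name:
        return None
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


# Assumed allowlist for a Bulgarian list; not an exhaustive or official set.
EXPECTED = {
    _canon(n) for n in ("CP1251", "UTF-8", "KOI8-R", "KOI8-U", "ISO-8859-5")
}


def is_expected(detected):
    """True if the detected charset is one we would trust for this list."""
    c = _canon(detected)
    return c is not None and c in EXPECTED
```

Plain ASCII messages need no such check, since they decode the same way under
every charset in the set.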


Is that good enough?

l