unix linux shell uniq unicode bug

Perm URL with updates: http://xahlee.info/comp/unix_uniq_unicode_bug.html

Here's a bug in uniq, from the GNU shell utilities (coreutils) on unix/linux.

Create a file of the following text:

═
═
═
║
║
║
╒
╓
╔
╕
╖
╗
╘
╙
╚
╛
╜
╝
╞
╟
╠
╡
╢
╣
╤
╥
╦
╧
╨
╩
╪
╫
╬

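Save it as 〔unicode.txt〕, then run cat unicode.txt | uniq -c. You get “33 ═”. That is, uniq treats all 33 lines as identical, when the correct result is 3 for ═, 3 for ║, and 1 for each of the other 27 characters. Idiotic unix.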

◆ uniq --version
uniq (GNU coreutils) 8.13
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Richard M. Stallman and David MacKenzie.

The man page doesn't mention anything about Unicode. Here's my locale setting anyhow.

◆ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
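The locale is probably what matters here. GNU uniq compares lines using the locale's collation rules, and some UTF-8 locale definitions give these box-drawing characters no collation weight, so different lines compare as equal. A possible workaround, sketched below on the assumption that this is the cause (it was not tested on the coreutils 8.13 system above), is to force the C locale for that one command so lines are compared byte by byte. The temp file and the variable names tmpfile and counts are made up for illustration.

```shell
# Build a small test file: three identical lines, then two distinct ones.
tmpfile=$(mktemp)
printf '%s\n' '═' '═' '═' '║' '╔' > "$tmpfile"

# LC_ALL=C applies only to this one uniq invocation.
# In the C locale, lines are compared as raw bytes, not by collation rules.
counts=$(LC_ALL=C uniq -c "$tmpfile")
printf '%s\n' "$counts"

rm -f "$tmpfile"
```

With byte comparison the output should have three lines, with counts 3, 1, 1. The same prefix works on the original file: LC_ALL=C uniq -c unicode.txt.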

I think that since about 2005, unix utils have been in a frenzy of patches trying to become Unicode compatible. It looks like the state is still shitty.

A related problem is grep. 〔☛ Problems of Calling Unix grep in Emacs〕 I thought that problem came from the complexity of the emacs + cygwin + environment variable layers. But now I know, it's unix!

I'm not sure how many unix utils still have Unicode problems.

See also: Complexity & Tedium of Software Engineering
