Chinese Pinyin Letter Frequency and Dvorak Layout

Perm url with updates: http://xahlee.org/Periodic_dosage_dir/bangu/pinyin_frequency.html

Xah Lee, 2005-09-01

The following is a letter frequency table for Chinese written in pinyin. The purpose of this study is to find out whether the Dvorak Keyboard Layout is also efficient for inputting Chinese with pinyin.

Tones
 4   9714
 2   7137
 1   6805
 3   5125
 5   1547

Letters
 i  12620
 n  11269
 a   9314
 u   7075
 g   6922
 e   6851
 h   6815
 o   5519
 z   3545
 d   3363
 s   2585
 y   2571
 j   2299
 l   1522
 b   1422
 x   1361
 c   1150
 w   1097
 r   1073
 m    930
 f    925
 t    881
 q    717
 k    448
 p    255
 v     12 (v stands for ü, as in nv = nü, "woman")

This table was compiled by Dylan Sung and is taken from his post in the newsgroup sci.lang of 2005-08-27, subject: “Letter frequency of Chinese pinyin”.

I originally became curious about pinyin letter frequency because I wondered whether the Dvorak keyboard is also more efficient than QWERTY for typing pinyin.

[Figures: Pinyin on Dvorak Keyboard; Pinyin on QWERTY]

Arrangement of the Dvorak Keyboard Layout and the traditional QWERTY layout. Keys in red are the most frequent letters used in pinyin, followed by yellow, then green.
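One rough way to put a number on the comparison is to ask what share of all pinyin letters falls on each layout's home row. The sketch below does that with the letter counts from the table above. The names PINYIN_COUNTS, HOME_ROW and home_row_share are made up for this illustration, and home-row share is only one crude measure of typing efficiency (it ignores finger travel, hand alternation, and the tone digits).

```python
# Letter counts copied from the pinyin frequency table above
# (compiled by Dylan Sung, sci.lang, 2005-08-27).
PINYIN_COUNTS = {
    "i": 12620, "n": 11269, "a": 9314, "u": 7075, "g": 6922,
    "e": 6851, "h": 6815, "o": 5519, "z": 3545, "d": 3363,
    "s": 2585, "y": 2571, "j": 2299, "l": 1522, "b": 1422,
    "x": 1361, "c": 1150, "w": 1097, "r": 1073, "m": 930,
    "f": 925, "t": 881, "q": 717, "k": 448, "p": 255, "v": 12,
}

# Home-row letter keys of each layout (the QWERTY semicolon key is omitted
# because it is not a letter).
HOME_ROW = {
    "dvorak": "aoeuidhtns",
    "qwerty": "asdfghjkl",
}


def home_row_share(counts, home_keys):
    """Return the fraction of all counted letters that sit on the home row."""
    total = sum(counts.values())
    on_home = sum(n for letter, n in counts.items() if letter in home_keys)
    return on_home / total


if __name__ == "__main__":
    for layout, keys in HOME_ROW.items():
        share = home_row_share(PINYIN_COUNTS, keys)
        print(f"{layout}: {share:.1%} of pinyin letters on the home row")
```

With these counts, roughly 72% of the letters land on the Dvorak home row, against roughly 37% on the QWERTY home row.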

For the list of letter frequencies of English text, see Wikipedia: Letter frequencies.

2010-09-22

The following data are from http://fatduck.org/dvorak/, accessed on 2010-09-22. The author is 潘永之.

The following tables show how the letters are distributed across the keyboard rows on QWERTY and Dvorak. The input is a 403-word Chinese blog post written in pinyin. The last figure in each row is the row total.

QWERTY

Top row:    q 0.56%  w 2.01%  e 6.51%  r 0.32%  t 1.77%  y 2.81%  u 7.40%  i 12.70%  o 6.51%  p 0.16%  (row total 40.76%)
Home row:   a 13.75%  s 1.53%  d 3.54%  f 0.72%  g 4.18%  h 6.27%  j 1.77%  k 0.80%  l 2.41%  (row total 34.97%)
Bottom row: z 1.93%  x 1.61%  c 3.22%  v 0.00%  b 2.01%  n 8.52%  m 1.45%  , 1.61%  . 3.94%  (row total 24.28%)

Dvorak

Top row:    ' 0.00%  , 1.61%  . 3.94%  p 0.16%  y 2.81%  f 0.72%  g 4.18%  c 3.22%  r 0.32%  l 2.41%  (row total 19.37%)
Home row:   a 13.75%  o 6.51%  e 6.51%  u 7.40%  i 12.70%  d 3.54%  h 6.27%  t 1.77%  n 8.52%  s 1.53%  (row total 68.49%)
Bottom row: q 0.56%  j 1.77%  k 0.80%  x 1.61%  b 2.01%  m 1.45%  w 2.01%  v 0.00%  z 1.93%  (row total 12.14%)
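Tables like the two above can be produced by counting the keystrokes of a pinyin text and grouping them by keyboard row. The following is a minimal sketch, not the method used at fatduck.org. It assumes that only the letters, comma and period are counted, which is a guess based on the keys shown in the tables. LAYOUT_ROWS, TRACKED and row_distribution are hypothetical names, and the sample sentence is invented.

```python
from collections import Counter

# Rows of each layout, top to bottom, limited to the keys shown in the tables.
LAYOUT_ROWS = {
    "qwerty": ["qwertyuiop", "asdfghjkl", "zxcvbnm,."],
    "dvorak": ["',.pyfgcrl", "aoeuidhtns", "qjkxbmwvz"],
}

# Assumption: only letters, comma and period count as keystrokes.
# The apostrophe sits on the Dvorak top row but is never counted here.
TRACKED = set("abcdefghijklmnopqrstuvwxyz,.")


def row_distribution(text, rows):
    """Return the percentage of counted keystrokes that fall on each row."""
    counts = Counter(ch for ch in text.lower() if ch in TRACKED)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(rows)
    # Counter returns 0 for keys that never occurred, so untracked keys add nothing.
    return [100.0 * sum(counts[key] for key in row) / total for row in rows]


if __name__ == "__main__":
    # Invented sample text, not taken from the blog post measured above.
    sample = "ni hao, wo xihuan yong pinyin da zi."
    for name, rows in LAYOUT_ROWS.items():
        top, home, bottom = row_distribution(sample, rows)
        print(f"{name}: top {top:.2f}%  home {home:.2f}%  bottom {bottom:.2f}%")
```

Because both layouts are scored against the same set of counted characters, the three percentages sum to 100 for each layout and the per-key figures stay identical between the two, as they do in the tables.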

The following is the same row distribution on QWERTY and Dvorak for a different input: the pinyin of all characters in GB2312, a total of 6727 characters. (chinese_characters_GB2312.txt)

QWERTY

Top row:    q 1.54%  w 0.97%  e 4.94%  r 0.53%  t 1.29%  y 2.78%  u 9.94%  i 13.26%  o 6.11%  p 1.10%  (row total 42.46%)
Home row:   a 11.80%  s 2.29%  d 1.54%  f 0.97%  g 6.53%  h 6.25%  j 2.49%  k 0.94%  l 2.25%  (row total 35.06%)
Bottom row: z 2.63%  x 1.93%  c 2.06%  v 0.12%  b 1.52%  n 12.88%  m 1.35%  (row total 22.48%)

Dvorak

Top row:    ' 0.00%  , 0.00%  . 0.00%  p 1.10%  y 2.78%  f 0.97%  g 6.53%  c 2.06%  r 0.53%  l 2.25%  (row total 16.22%)
Home row:   a 11.80%  o 6.11%  e 4.94%  u 9.94%  i 13.26%  d 1.54%  h 6.25%  t 1.29%  n 12.88%  s 2.29%  (row total 70.30%)
Bottom row: q 1.54%  j 2.49%  k 0.94%  x 1.93%  b 1.52%  m 1.35%  w 0.97%  v 0.12%  z 2.63%  (row total 13.48%)

See also: Chinese Input with Dvorak Layout (Microsoft Pinyin IME).
