emacs lisp function to decode url percent encoding?

Perm url with updates: http://xahlee.org/js/url_encoding_unicode.html

URL Percent Encoding and Unicode

Xah Lee, 2010-05-24

This page discusses which characters should be percent-encoded in a URL, and how different browsers behave.

Browser Behavior

Here are some tests of browser behavior on URL encoding and decoding. Apparently, some browsers automatically decode parts of the percent encoding.

Copy this line:

http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_(Dürer)

Then go to your browser and open a new tab or window. Press 【Alt+d】 to select the URL field, 【Ctrl+v】 to paste, then Enter to go to the page.

Then press 【Alt+d】 to select the URL field, 【Ctrl+a】 to select all, and 【Ctrl+c】 to copy. Paste the result into a text editor. Here are the results:

• Google Chrome
http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_(D%C3%BCrer)

• Safari
http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_(Dürer)

• Firefox
http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_%28D%C3%BCrer%29

• Opera
http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_(Dürer)

• IE
http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_(Dürer)

Now, try again, starting with this line:

http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_%28D%C3%BCrer%29

• Google Chrome
http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_(D%C3%BCrer)

• Safari
http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_%28Dürer%29

• Firefox
http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_%28D%C3%BCrer%29

• Opera
http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_(Dürer)

• IE
http://en.wikipedia.org/wiki/St._Jerome_in_His_Study_%28D%C3%BCrer%29

Another example. Start with:

http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem

• Google Chrome
http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem

• Safari
http://en.wikipedia.org/wiki/Sylvester–Gallai_theorem

• Firefox
http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem

• Opera
http://en.wikipedia.org/wiki/Sylvester–Gallai_theorem

• IE
http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem

All results are from Windows Vista, using the latest publicly released version of each browser as of 2010-05-24.

Summary

Here's a summary of the behavior, as it appears from the tests above:

  • Firefox (v 3.6.3) is the most aggressive in turning characters in the URL into percent-encoded form.
  • Google Chrome (4.1.249.1064 (45376)) changes Unicode characters into percent-encoded form, but not parentheses.
  • Safari (4.0.5 (531.22.7)) converts some percent-encoded characters into plain Unicode characters, but not all.
  • Opera (v 10.10, build 1893) is the best: it shows Unicode characters, parentheses, and the en dash as is.
  • IE (8.0.6001.18904) seems to do nothing to the URL. Whatever you paste in remains unchanged.

Emacs Question

Is there an Emacs Lisp function that decodes URL percent encoding? For example:

http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem

should become

http://en.wikipedia.org/wiki/Sylvester–Gallai_theorem

That's an EN DASH (Unicode code point 8211, #o20023, #x2013).

I know there's a

 (require 'gnus-util)
 (gnus-url-unhex-string ...)

but that only unhexes the string, and generates gibberish if the URL contains Unicode characters.

Some study shows that “%E2%80%93” stands for the hexadecimal bytes E2 80 93, which is the byte sequence of the en dash character in the UTF-8 encoding.
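The byte sequence can be checked from inside Emacs. Here is a minimal sketch, assuming a reasonably recent Emacs; `my-utf8-percent-bytes` is a hypothetical helper name made up for this illustration, not a built-in. It encodes a string as UTF-8 and writes every byte as a %XX escape.

```elisp
(defun my-utf8-percent-bytes (string)
  "Return the UTF-8 bytes of STRING, each written as a %XX escape.
Hypothetical helper for illustration only.  It escapes every byte,
ASCII included, so it is not a real URL encoder."
  (mapconcat (lambda (byte) (format "%%%02X" byte))
             ;; `encode-coding-string' returns a unibyte string,
             ;; so each element seen by the lambda is one byte (0-255).
             (encode-coding-string string 'utf-8)
             ""))

(my-utf8-percent-bytes "\u2013") ; en dash => "%E2%80%93"
(my-utf8-percent-bytes "\u00FC") ; u with umlaut => "%C3%BC"
```

Both results match what the browsers produced in the tests above for “–” and “ü”.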

So, I guess I could parse the URL, interpret each %xx string as a UTF-8 hex byte, then turn the bytes back into Unicode characters. Any idea if there's a built-in function that helps with this?
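One way to sketch that idea, assuming a reasonably recent Emacs where `url-unhex-string` (from url-util) returns the raw bytes, and assuming the input URL is pure ASCII apart from its %xx escapes. The name `my-url-decode-utf8` is hypothetical, chosen for this example. It is not necessarily the solution that came out of the discussions or the bug report.

```elisp
(require 'url-util)

(defun my-url-decode-utf8 (url)
  "Return URL with %XX escapes decoded, reading the bytes as UTF-8.
Hypothetical helper for illustration.  Assumes URL contains only
ASCII characters besides the percent escapes."
  ;; Step 1: `url-unhex-string' turns each %XX into one raw byte.
  ;; Step 2: `decode-coding-string' reads those bytes as UTF-8 text.
  (decode-coding-string (url-unhex-string url) 'utf-8))

(my-url-decode-utf8
 "http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem")
;; => "http://en.wikipedia.org/wiki/Sylvester–Gallai_theorem"
```

Note that this decodes every escape, including ones such as %28 and %29 for parentheses, so the result is meant for display, not for passing back to a server unchanged.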

Some discussion and temporary solutions at:

Reported to FSF: bug#6252.

From the above discussions, you can see that it is not clear which characters should be percent-encoded. In fact, different browsers behave differently.