2010-01-29

anti-bot: CAPTCHA!

perm url with updates: http://xahlee.org/js/captcha.html

Anti-bot Test: CAPTCHA!

Xah Lee, 2010-01-29

You've seen CAPTCHAs on the web, like this:

(image: a sample CAPTCHA)

It is a test designed to stop bots that spam websites. Software can be written to automatically fill web forms. That means it can leave blog comments or create new web accounts. So, spammers use such software to create hundreds of comments or accounts by the second, and leave their advertisement or other malware.

To prevent that, one needs a test based on something computers cannot do well, some sort of bot test. So you get the distorted image, which computers still cannot recognize reliably.

The name CAPTCHA stands for: Completely Automated Public Turing test to tell Computers and Humans Apart.

Google's reCAPTCHA

There is a captcha service called reCAPTCHA, now owned by Google and free to use, at http://recaptcha.net/. It allows webmasters to put captchas on their site.

There is one aspect about reCAPTCHA that's interesting. The distorted text actually comes from the process of digitizing books. When OCR cannot read a word, that word becomes the source of Google's captcha image. When humans give the answer, the data is used to statistically determine the correct reading. So, reCAPTCHA serves as an anti-bot test and also helps in digitizing books. (OCR means Optical Character Recognition. It is software designed to recognize text in image form, kinda the opposite of a captcha.)

Google has a blog explaining their captcha service, at: Source. The blog also features a video of Luis von Ahn explaining reCAPTCHA. Reading Wikipedia on Luis, he turns out to be a well-recognized genius with many awards.

Though, i must say, my experience with reCAPTCHA is that it is often hard to read. Often, even after several tries i cannot pass. Apparently, many people feel the same, as you can see from the comments on Google's blog. For a captcha, that's a critical problem.

For more detail, check Wikipedia ReCAPTCHA.

Artificial Intelligence

Captchas are quite interesting in several respects. It is a simple artificial intelligence problem: devising a scheme so that a machine can tell whether the party on the other end is human. It is also a problem in image recognition. And it is an interesting problem in web site security.

Note, the Wikipedia article cites many research projects that have broken captchas, and there are also alternatives such as image captchas. For example, showing you several images of animals and asking you to pick the one that's a cat. Spammers also have several other methods to defeat captchas, including cheap human labor farms.

The history of the cops-and-robbers game in the computing realm is itself a fascinating story.

Overall, i really don't think captchas are a good solution to the web spam problem, at all. They are frustrating to use, waste time, and aren't that effective at preventing spam. In fact, i'm quite surprised that spam has just increased and increased, everywhere, over the past 20 years i've used the web. In my yahoo email account, gmail account, in my several instant message chats, in my web logs, in spam blogs that use randomized snippets of text from my website. Today i even get spam from skype, a few times a week now. Captchas and spam are a phenomenon of the tech geekers trying to solve a human problem by technology. (See: Tech Geekers vs Spammers)

interesting tidbits about domain names

perm url with updates: http://xahlee.org/js/domain_names.html

Domain Names, Cybersquatting, Resell Market

Xah Lee, 2010-01-22

Some interesting tidbits about domains today.

Domain Name Registry vs Registrar

So what's the diff between Domain name registry and Domain name registrar? Here are quotes:

A domain name registry, is a database of all domain names registered in a top-level domain. A registry operator, also called a Network Information Center (NIC), is the part of the Domain Name System (DNS) of the Internet that keeps the database of domain names, and generates the zone files which convert domain names to IP addresses. Each NIC is an organisation that manages the registration of Domain names within the top-level domains for which it is responsible, controls the policies of domain name allocation, and technically operates its top-level domain. It is potentially distinct from a domain name registrar. [1]

A domain name registrar is an organization or commercial entity, accredited by the Internet Corporation for Assigned Names and Numbers (ICANN) or by a national country code top-level domain (ccTLD) authority, to manage the reservation of Internet domain names in accordance with the guidelines of the designated domain name registries and offer such services to the public.

So, basically, the “registry” is a database (technical), and its operator is called a Network Information Center (NIC). A “registrar” works with the NIC, and is basically a commercial business. So, when you want to buy a domain, you go to a registrar. (Actually, when you buy a domain name from a web hosting company, the web hosting company is just a reseller; it buys the domain name from a registrar for you.)

Also of interest: Independent Domain Registrars. They are basically independent entities that function as both registry and registrar, but for particular second level names assigned to them (e.g. “uk.com”, “au.com”, “hk.com”).

(image: Domain name registrar market share. Source)

Domain Name Market

Cybersquatting

In the Dot-com days, a well known phenomenon was Cybersquatting. (i'm a dot-com-er, 1998-2002.) Today, purchasing domains for actual use or for reselling has become mostly legitimate, and the term cybersquatting mostly refers to domain name buyers with devious intentions of reselling at high prices.

There are a few famous domain dispute cases cited in the article. Some that i've read and found interesting are:

Also of interest: Drop registrar.

A drop registrar is a domain name registrar who registers expiring Internet domain names immediately after they expire and are deleted by the domain name registry. A drop registrar will typically use automated software to send up to 250 simultaneous domain name registration requests in an attempt to register the domain name first.[1]

I'm particularly interested in this because my domain name xahlee.org was going to expire on 2010-01-24, and i only realized it 4 days before. I tried to go to the registrar's website to renew the domain name, but the site was down on both days i tried. I tried to contact the company by phone, but the phone lines were down too. I thought the company was out of business or there was some foul play. Then, finally, i sent an urgent email to the company's support, and unexpectedly i got an answer within a couple of hours and my domain name was renewed. I was quite panicked for 3 hours, because, according to domain market sites, my domain name is worth some USD$30k. (see: XahLee.org Web Traffic Report)

Domain Tasting

Also interesting: Domain tasting. Basically, domain name registration has a 5-day grace period. If the buyer doesn't want the name within 5 days, he doesn't have to pay. (The free grace period problem has been fixed since 2009-04.) So, abusers used automated software to buy hundreds or thousands of domain names and test them out for a few days, to see which actually generated traffic (from users' typos, etc.).

In April 2006, out of 35 million registrations, only a little more than 2 million were permanent or actually purchased. By February 2007, the CEO of Go Daddy reported that of 55.1 million domain names registered, 51.5 million were canceled and refunded just before the 5 day grace period expired and only 3.6 million domain names were actually kept.[4]

In January 2008, Network Solutions was publicly accused of this practice when the company began reserving all domain names searched on their website for five days.[7], a practice known as Domain name front running.

Domain tasting finally ended. See: “Domain tasters” bitter as new fees put an end to their games (2009-08-13), by John Timmer. Source

Domain name speculation

Domain name speculation.

The secondary market for domain names covers previously registered domain names that have not been renewed by registrants or are available for resale. Sometimes these dropped domain names can be more valuable due to their having had high-profile websites associated with them. They will have links from other websites and could still have users searching for the websites because of these links. Others can be valuable because of the generic nature of the domain name or the length of the domain name, with two and three character names being the most sought after.

The business of registering the domain names as they are deleted by the registries is known as drop catching. It is a highly competitive business. The main operators in this business typically set up a number of front companies as registrars. VeriSign, in the case of TLDs COM and NET, allows each registrar a slice of the resources that may be used to register dropped domains. VeriSign drops domains in a random order, giving registrars only a vague idea of the particular drop time of a particular domain. Sometimes a group of drop registrars often work in confederation to increase their possibility of registering a dropped domain immediately after it is deleted by the registry. If the domain is caught by a confederation of registrars attempting to fulfill a domain backorder, then whichever domain registrar caught the domain will register it to the entity who backordered the domain. If the newly reregistered domain is captured by a company that has no customers who backordered it, the domain may be auctioned to the highest bidder by the registrar who captured it or an auction intermediary. The time between a drop and a capture is often measured in seconds or fractions thereof.

Some registrars do not allow domains to drop in the normal fashion, instead introducing an intermediary (e.g., Snapnames and Namejet) that auction the domain prior to their deletion. If nobody buys the domain at auction, it will pass through the normal deletion process.

On the section “Domain name speculation and the rise of Pay per click websites”:

The latest statistics for domain name usage quoted in the Verisign Domain Brief for June 2009 states that of the 92 million COM and NET domain names, 24 % of these domains have one page websites, 64% have multipage websites and 12% have no associated websites.

Resale Record

From Domain name:

To date, and according to Guinness World Records and MSNBC, the most expensive domain name sales on record as of 2004 were[7]:

  • Business.com for $7.5 million in December 1999
  • AsSeenOnTv.com for $5.1 million in January 2000
  • Altavista.com for $3.3 million in August 1998
  • Wine.com for $2.9 million in September 1999
  • CreditCards.com for $2.75 million in July 2004
  • Autos.com for $2.2 million in December 1999

Domain parking


IDNA

Internationalized domain name: allowing domain names in unicode. IDN turns out rather stupid. This scheme does not modify DNS's domain names to broaden them beyond alphanumerics, but relies on applications translating the host name to a real ASCII-based one, a scheme called Internationalizing Domain Names in Applications (IDNA). The consequences of IDNA: complex implementation in browsers (in part using Punycode), failure when a host name in Chinese is long, increased risk of Homograph spoofing attacks, and unpredictability in the browser's url field (because each browser opts for different methods of displaying the url to prevent the spoofing attack).

What it solves is to allow domain names in non-latin langs such as Chinese, Arabic, Russian to be used in host names. The latin alphabet in computing is rather popular and international, has been used throughout the history of the internet, and is widely used in non-English speaking countries. The latin alphabet is simple too, and English is popular and considered an international language. The need for local langs in domain names is not that great, and the cost of introducing IDNA's complexity outweighs the benefits.


For some interesting read about what domain names are taken, see:

Interesting Facts About Domain Names (2006-03-29), by Dennis Forbes. Source

2010-01-28

Blocking Image Leechers

perm url with updates: http://xahlee.org/js/image_leechers.html

Blocking Image Leechers

Xah Lee, 2010-01-22

This page gives some tips on blocking other websites from inlining images hosted on your website.

Image leeching is a problem. Basically, some other website inlines an image that is hosted on your website. Besides copyright issues, it causes a bandwidth problem on your site. There are a lot of websites these days that allow their users to insert images from a url. The user may not be aware that it is a problem, since most of them are not technical people, and they simply want to show their friends some images they found.

Image leeching often takes significant bandwidth from your site. Say you have an image of a beautiful girl. Many porn or otherwise shady sites, such as 4chan and lots of others infested by teens, high school and college students, and gamers, have huge amounts of traffic for rather useless content (mostly teen drivel and bantering). If they inline one of your images, the image will get a few hundred hits a day, or thousands. If you get leeched, more than 50% of your site's bandwidth can be from leechers; more likely, the bandwidth usage will be a few times your normal.

My website did not have image leecher protection until 2003 or so. Then i noticed image leechers; they caused my site to go over its bandwidth limit for the month. This happened to me several times. That means i had to pay extra hosting fees, by the megabyte. (See: XahLee.org Web Traffic Report)

The following is Apache HTTP Server config code for blocking leechers. You need to place it in a file named “.htaccess”, usually at the web root dir.

RewriteEngine on

# block image leechers
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://www\.xahlee\.org.+|^http://xahlee\.org.+$ [NC]
RewriteCond %{HTTP_REFERER} !^http://www\.xahlee\.info.+|^http://xahlee\.info.+$ [NC]
RewriteCond %{HTTP_REFERER} !^http://www\.ergoemacs\.org.+|^http://ergoemacs\.org.+$ [NC]
RewriteCond %{HTTP_REFERER} !^http://xahlee\.blogspot\.com.+$ [NC]
RewriteRule \.(png|gif|jpg|jpeg|mov|mp3|wav|ico)$ - [NC,F]

http://httpd.apache.org/docs/1.3/mod/mod_rewrite.html

What the above code does is this: overall, it tries to match the text of the HTTP_REFERER the browser sends. (HTTP_REFERER contains the url of the page the request came from.) If all the conditions are met, the rewrite rule is applied (that is, access is denied). Here are the conditions: the HTTP_REFERER is not blank (if it is blank, it usually means the visitor typed the image url directly into the browser), and the HTTP_REFERER does not match any of xahlee.org, xahlee.info, ergoemacs.org, xahlee.blogspot.com. In that case (the image is inlined from some other website), access is denied for urls ending in “png”, “gif”, “jpg”, etc. The “NC” means ignore letter case. The “F” means deny access (Forbidden).
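
After deploying the rules, you can sanity check them by sending requests with different Referer headers: a blank or same-site referer should come back 200, a foreign referer should come back 403. Here's a minimal sketch in Python using the “requests” library (the image url is a made-up example):

import requests

img = "http://xahlee.org/some_dir/example.png"   # hypothetical image url

# no Referer header: not blocked by the rules above
r1 = requests.get(img)

# Referer from an allowed site: not blocked
r2 = requests.get(img, headers={"Referer": "http://xahlee.org/index.html"})

# Referer from some other site: should be 403 Forbidden
r3 = requests.get(img, headers={"Referer": "http://example.com/forum/thread1"})

print(r1.status_code, r2.status_code, r3.status_code)   # expect 200 200 403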

Note that this protection is not absolute. The very nature of the web, from its original conception, is to allow anyone to share info, without much thought about copyright or commercial development. Leechers can easily just mirror your image on their server (which steals your image but not your bandwidth), or their pages can have js code that bypasses this protection. Anyhow, that's getting into hacking. The above code should prevent the vast majority of bandwidth theft.

There are several more advanced blocking methods. One is http://www.alistapart.com/articles/hotlinking/; others use js to prevent people from knowing your image url, or embed images in Flash. But these usually get a bit complicated.

Site Whackers

Another related problem is site whackers. Some people, typically programing geekers, when they like your site, download everything to their computer, so they can read it offline, or out of some packrat habit. My website xahlee.org today is 679 MB. A few whacks, and your monthly bandwidth will be gone. A whack also causes a huge spike in your traffic, making your site slow.

Here's the code to prevent web whackers:

# forbid bots that whack entire site
RewriteCond  %{HTTP_USER_AGENT}  ^Wget|Teleport\ Pro|webreaper.net|WebCopier|HTTrack|WebCapture
RewriteRule  ^.+$                -  [F,L]

Note that the above does not provide absolute protection. Dedicated whackers can easily bypass it by masking their user agent id. But it should stop the majority of casual web whackers.

Some popular website downloaders are: HTTrack, Wget. Note: cURL does not let you download a site recursively or a page with all linked pages. It lets you download one single page, or a set of pages or images.

Of course there are legitimate uses. In the older days (say, late 1990s or early 2000s), internet speeds were not as fast as today, many were still using 28.8 kbit/s modems, and websites were not that reliable or accessible. So people needed to download sites to read offline, or to archive them in case the info disappeared the next day. However, today, with video sites and 100 megabyte movie trailers and all that, these reasons are mostly gone.

Note that there's the Robots exclusion standard, which instructs web crawlers how to behave. However, web downloading software usually ignores it, or can easily be set to ignore it.

Random Notes On Nicolas Bourbaki

perm url with updates: http://xahlee.org/math/nicolas_bourbaki.html

Random Notes On Nicolas Bourbaki

Xah Lee, 2010-01-28

Spent about 3 hours reading about Bourbaki.

Nicolas Bourbaki is an influential math group that used to be mysterious. I didn't know much about the group until recent years, from Wikipedia.

“MacTutor History of Mathematics Archive” written by J J O'Connor and E F Robertson, University of St Andrews. Source

Read also: Twenty-Five Years with Nicolas Bourbaki, 1949–1973, by Armand Borel, Notices of the AMS, Volume 45, Number 3 (1998-03). http://www.ams.org/notices/199803/borel.pdf

Some juicy quotes:

... Cartan was frequently bugging Weil with questions on how to present this material, so that at some point, to get it over with once and for all, Weil suggested they write themselves a new Traité d’Analyse. This suggestion was spread around, and soon a group of about ten mathematicians began to meet regularly to plan this treatise. It was soon decided that the work would be collective, without any acknowledgment of individual contributions. In summer 1935 the pen name Nicolas Bourbaki was chosen.

At this point let me simply mention that the true “founding fathers”, those who shaped Bourbaki and gave it much of their time and thoughts until they retired, are: Henri Cartan, Claude Chevalley, Jean Delsarte, Jean Dieudonné, André Weil.

born respectively in 1904, 1909, 1903, 1906, 1906— all former students at the École Normale Supérieure in Paris.

I was rather put off by the very dry style, without any concession to the reader, the apparent striving for the utmost generality, the inflexible system of internal references and the total absence of outside ones (except in Historical Notes).

LOL. I rather prefer this approach. Also, note some quotes from Wikipedia on criticism:

  • combinatorics is not discussed
  • logic is treated minimally[18]

Furthermore, Bourbaki make only limited use of pictures in their presentation.[19] In general, Bourbaki has been criticized for reducing geometry as a whole to abstract algebra and soft analysis.[20]

LOL. How dare they! I love combinatorics and logic and geometry.

also see:

Dieudonné at one point said “one can do nothing serious without them [Lie algebras]”, for which he was reproached.

This is typical arrogance from mathematicians. What fucking bullshit. The Bourbaki books do not cover much of discrete math, for example as used in computer science. (Computational math came too late for their time.) But you can often see a similar attitude in academic mathematicians today. The rise of discrete math, or computational math, is arguably a new era in math, overturning the traditional set-theory-based tower of foundations worked by humans, as treated in Bourbaki. See, for example: State Of Theorem Proving Systems 2008, Notes on A New Kind of Science.

File Aliases Considered A Plague

perm url with updates: http://xahlee.org/UnixResource_dir/writ/hardlink_softlink_alias_junction_plague.html

File Aliases Considered A Plague

Xah Lee, 2009-09-01

Learned today about Windows' Junction. Basically, it's a file aliasing mechanism in NTFS, much like unix's hardlinks and softlinks, or Mac OS X HFS+'s alias.

All these i have avoided like a plague in software. They create a lot of problems.

I noticed this Windows junction because i was using rsync thru cygwin to copy files in my Documents folder to my Mac. Rsync kept telling me permission denied. Here's the error:

xah@xah-PC ~
$ rsync -z -r -v -t --exclude=".DS_Store" --delete --rsh="ssh -l xah" ~/Documents/ xah@169.254.153.147:~/Documents_PC/
Password:
building file list ... rsync: opendir "/cygdrive/c/Users/xah/Documents/My Music" failed: Permission denied (13)
rsync: opendir "/cygdrive/c/Users/xah/Documents/My Pictures" failed: Permission denied (13)
rsync: opendir "/cygdrive/c/Users/xah/Documents/My Videos" failed: Permission denied (13)

In my “Documents” folder, i don't have a folder named “My Music” or such. But of course, apparently i do. They are hidden. In Explorer, you won't see these files in your Documents dir (unless you have turned on viewing system files in Folder Options).

These hidden items are links to other dirs: “~/Music”, “~/Pictures”, and “~/Videos”. I get permission denied because i haven't made the link destination folders shared.

If you use PowerShell “dir -force”, you see:

    Directory: C:\Users\xah\Documents

Mode                LastWriteTime     Length Name
----                -------------     ------ ----
d--hs         5/23/2009   7:15 PM            My Music
d--hs         5/23/2009   7:15 PM            My Pictures
d----         6/18/2009   8:25 PM            My Received Files
d--hs         5/23/2009   7:15 PM            My Videos

Note those “hs” there, probably meaning some Hidden and System attributes in NTFS. If you use cmd.exe's “dir /A”, you see:

08/31/2009  06:27 AM    <DIR>          emacs_stuff
05/23/2009  07:15 PM    <JUNCTION>     My Music [C:\Users\xah\Music]
05/23/2009  07:15 PM    <JUNCTION>     My Pictures [C:\Users\xah\Pictures]
06/18/2009  08:25 PM    <DIR>          My Received Files
05/23/2009  07:15 PM    <JUNCTION>     My Videos [C:\Users\xah\Videos]

At that point, the word JUNCTION gave me a hook to search the web, and i found the answer.

All these different redirect mechanisms in unix, Mac, and, now i know, Windows, create lots of complexities and problems: problems in security, in copying and deleting dirs, in deceptive dir structures. One thing i particularly hate is unix's “hard link”. Though, i guess in some situations they are convenient and probably the best solution.

The rsync problem is solved by adding “--exclude="**/My *" ”. (I was using “--exclude="*/My *" ” to work around the perm denied error, but no go. Annoying, and i lived with it for days. Eventually, i bit the bullet and spent time to resolve it, which led me to this junction business. For rsync, it turns out i need two asterisks.)

2009-09-30 Addendum: See also this blog Why doesn't Explorer have an interface for creating hard links? by Raymond Chen, at Source.

Vista VirtualStore

Xah Lee, 2010-01-28

Discovered Windows Vista's “File and Registry Virtualization”, or, “VirtualStore”.

In 2009-09, i was installing a new version of ErgoEmacs, which David Capello and i are developing. The new version didn't work for me because, apparently, elisp files from some previous version were still in my program installation dir at “C:/Program Files (x86)/ErgoEmacs”. After reinstalling several times, still the same problem. I was quite frustrated. After hours of looking into it, apparently the problem wasn't the new release, just my machine. It seemed there was some file caching going on, but i just couldn't figure out what. At one point, i suspected that my file system was corrupted. This whole incident cost me some 20 hours. (See this discussion at the ergoemacs forum: Source)

Today, David wrote to me pointing out that the problem could be Vista's file virtualization. Bingo, that was it. Here's what it means.

In Vista, dir paths such as “C:\Program Files (x86)\” are not supposed to be writable by normal users. But a lot of old programs still write user data to that dir. So, when a program tries to do that and the OS detects that the user has no privilege, the write goes to “C:\Users\‹your account name›\AppData\Local\VirtualStore\Program Files (x86)\” instead. This means some applications that view directories will show confusing or misleading results. For example, emacs dired shows me several files in “c:\Program Files (x86)\ErgoEmacs\bin”, but when you use Windows Explorer or PowerShell or cmd.exe to look, those files aren't there.
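
Here's a small sketch, in Python, to check both locations for a given file; the ErgoEmacs file name here is just a made-up example, adjust as needed:

import os

name = r"ErgoEmacs\bin\some_file.el"   # hypothetical file name
install_dir = r"C:\Program Files (x86)"
virtual_store = os.path.expandvars(r"%LOCALAPPDATA%\VirtualStore\Program Files (x86)")

# print where the file actually lives: the real install dir, the VirtualStore, or neither
for root in (install_dir, virtual_store):
    p = os.path.join(root, name)
    print(p, "exists" if os.path.exists(p) else "missing")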

The masking of files is also the source of a mysterious problem i had when trying to work with the IntelliType software (see: Microsoft IntelliType Hacks), where i worked for hours but the expected result mysteriously did not happen.

The file virtualization happens to the registry too. For details, see:

  • Technet: Inside Windows Vista User Account Control (2007-06), by Mark Russinovich. Source
  • Technet: Achieve the Non-Admin Dream with User Account Control (2006-11), by Alex Heaton. Source
  • Blog: Vista Virtual Store or “Where did my files go?” (2008-09-29), by Darrell Source
  • Blog: File and Registry Virtualization – the good, the bad, and the ugly (2005-12-19), by Jerry. Source.

NoSQL Movement

perm url with updates: http://xahlee.org/comp/nosql.html

The NoSQL Movement

Xah Lee, 2010-01-26

In the past few years, there's a new fashionable thinking against relational databases, now blessed with a catchy term: NoSQL. Basically, it considers relational databases outdated and not “horizontally” scalable. I'm quite dubious of these claims.

According to the Wikipedia Scalability article, vertical scalability means adding more resources to a single node, such as more cpu or memory. (You can easily do this by running your db server on a more powerful machine.) “Horizontal scalability” means adding more machines. (And indeed, this is not simple with sql databases, but again, it is the same situation with any software, not just databases. To have more machines run one single piece of software, the software must have some sort of grid computing infrastructure built in. This is not a problem of the software per se; it is just the way things are. It is not a problem specific to databases.)

I'm quite old fashioned when it comes to computer technology. In order to convince me of some revolutionary new-fangled technology, i must see improvement based on a math foundation. I am an expert in SQL, and believe that relational databases are pretty much the essence of databases with respect to math. Sure, a tight definition of the relations in your data may not be necessary for many applications that simply need to store, retrieve, and modify data without much concern about the relations among them. But still, relational database technology does that too; you just don't worry about normalization when you design your table schema.

The NoSQL movement is really a scaling movement, about adding more machines, about some so-called “cloud computing” and services with simple interfaces. (Like so many fashionable movements in the computing industry, it is not well defined.) It is not really about anti-relational designs in your data. It's more about adding features for practical needs: easy-to-use APIs (so your users don't have to know SQL or schemas), the ability to add more nodes, commercial interface services to your database, parallel systems that access your data. Of course, these needs have all been met by the big old relational database companies such as Oracle over the years, as they constantly adapt to the industry's changing needs and cheaper computing power. If you need any relations in your data, you can't escape the relational database model. That is just the cold truth of math.

Important data, such as that used in bank transactions, has relations. You have to have tight relational definitions and assurance of data integrity.

Here's a second hand quote from Microsoft's Technical Fellow David Campbell. Source

I've been doing this database stuff for over 20 years and I remember hearing that the object databases were going to wipe out the SQL databases. And then a little less than 10 years ago the XML databases were going to wipe out.... We actually ... you know... people inside Microsoft, [have said] 'let's stop working on SQL Server, let's go build a native XML store because in five years it's all going....'

LOL. That's exactly my thought.

Though, i'd have to have some hands on experience with one of those new database services to see what it's all about.

Amazon S3 and Dynamo

Look at Structured storage. That seems to be what these nosql databases are. Most are just a key-value pair structure, or just storage of documents with no relations. I don't see how this differs from a sql database using one single table as the schema.

Amazon S3 is another storage service; it uses Amazon's Dynamo (storage system), indicated by Wikipedia to be one of those NoSQL dbs. Looking at the S3 and Dynamo articles, it appears the db is just a Distributed hash table system, with an added http access interface. So, basically, little or no relations. Again, i don't see how this is different from, say, MySQL with one single table of 2 columns, plus distributed infrastructure. (Distributed operation is often an integrated feature of commercial dbs; e.g., the Wikipedia Oracle database article cites Oracle Real Application Clusters.)
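
To make the “single table of 2 columns” point concrete, here's a minimal sketch of a key-value store sitting on top of a relational engine, using Python's built-in sqlite3 (the table and file names are made up for illustration):

import sqlite3

# a 2-column table acting as a key-value store, much like what many
# “NoSQL” stores expose, minus the distribution layer
db = sqlite3.connect("kv_demo.db")
db.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")

def put(k, v):
    db.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (k, v))
    db.commit()

def get(k):
    row = db.execute("SELECT v FROM kv WHERE k = ?", (k,)).fetchone()
    return row[0] if row else None

put("user:42", '{"name": "xah"}')
print(get("user:42"))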

Here's an interesting quote on S3:

Bucket names and keys are chosen so that objects are addressable using HTTP URLs:

  • http://s3.amazonaws.com/bucket/key
  • http://bucket.s3.amazonaws.com/key
  • http://bucket/key (where bucket is a DNS CNAME record pointing to bucket.s3.amazonaws.com)

Because objects are accessible by unmodified HTTP clients, S3 can be used to replace significant existing (static) web hosting infrastructure.

So this means, for example, that i can store all my images in S3, and in my html documents the inline images are just normal img tags with normal urls. This applies to any other type of file: pdf, audio, and html too. So, S3 becomes the web host server as well as the file system.

Here's Amazon's instruction on how to use it as image server. Seems quite simple: How to use Amazon S3 for hosting web pages and media files? Source

Google BigTable

Another is Google's BigTable. I can't make much comment. To make a sensible comment, one must have some experience of actually implementing a database. For example, a file system is a sort of database. Suppose i created a scheme that lets me access my data as files in NTFS, distributed over hundreds of PCs, communicating thru http with Apache. That would let me access my files. To insert or delete data, one could have cgi scripts on each machine. Would this be considered a fantastic new NoNoSQL?

python 3 adoption

perm url with updates: http://xahlee.org/comp/python3.html.

Python 3 Adoption

Xah Lee, 2010-01-26

Some notes of Wikipedia readings related to Python.

Unladen Swallow is a new project from Google. It is a new python compiler with the goal of being 5 times faster than the de facto standard implementation, CPython. Also note Stackless Python, which has already been used in some major commercial projects.

Was looking into what's new in Python 3. See: http://docs.python.org/dev/3.0/whatsnew/3.0.html. From a quick reading, i don't really like it. Here are some highlights:

  • print is now a function. Great, much improvement.
  • Many functions that used to return lists now return “views” or “iterators” instead. A fucking fuck all fucked up shit. An extraneous “oop engineering” complication. (See: Lambda in Python 3000)
  • The cmp() function used in sort is basically gone; users are now supposed to use the “key” parameter instead. This is a flying-face-fuck to computer science. This would be the most serious fuckup in python 3. (See: Sorting in Python and Perl, and the sketch after this list.)
  • Integers are long by default. Great!
  • Much more integrated unicode support, with a rewrite of most of its text and string semantics. Fantastic. Finally.
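
Here's a minimal sketch of a few of these changes, runnable under Python 3 (the data is made up):

# print is now a function, not a statement
print("hello", "world", sep=", ")

# range() and dict.keys() now return lazy views/iterators, not lists
r = range(5)
print(list(r))   # materialize explicitly: [0, 1, 2, 3, 4]

# cmp-style comparison functions are gone from sort; use a key function instead
words = ["Banana", "apple", "Cherry"]
print(sorted(words, key=str.lower))   # case-insensitive sort via key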

Am looking because i wonder if i should switch to python 3 for my own few scripts, and rewrite my Python Tutorial for version 3. Am also interested to know how python 3 is received by the computing industry. Apparently, a little search on the web indicates that the vast majority of the python base has not switched, as expected, for many good reasons. The vast majority of major python modules and tools have not switched. Most linux distros have not switched, and i can't find any large corporation that has adopted Python 3 (Google, Yahoo, Facebook, NASA, ...). (sources: Source, Source) Basically, such an incompatible change, with trivial, ideological improvements, is too costly to adopt.

I wonder whether, by 2015, most large corporate users will have switched to python 3. I give it a maybe. In today's Proliferation of Computing Languages, such a major antic by Guido can only hurt the language. What is he thinking? He of course thinks of himself as a god of lang design, who sincerely wants to push towards perfection, all future-looking. Unfortunately, the tens of other major language designers all think similarly.

Learning Notes Of Symmetric Space and Differential Geometry Topics

perm url with updates: http://xahlee.org/math/symmetric_space.html

Learning Notes Of Symmetric Space and Differential Geometry Topics

Xah Lee, 2010-01-27

Spent an hour chatting with Richard Palais by voice yesterday. He was teaching me some math about transvections. This page is some learning notes on some differential geometry related topics spurred by the chat.

Spent about 6 hours reading Wikipedia and writing this.

Here's Wikipedia article Symmetric space, some quotes:

Symmetric Space

In differential geometry, representation theory and harmonic analysis, a symmetric space is a smooth manifold whose group of symmetries contains an "inversion symmetry" about every point. There are two ways to make this precise, via Riemannian geometry or via Lie theory; the Lie theoretic definition is more general and more algebraic.

In Riemannian geometry, the inversions are geodesic symmetries, and these are required to be isometries, leading to the notion of a Riemannian symmetric space...

Here's a brief explanation, assuming the manifold is 2-dimensional. A symmetric space is a smooth surface such that every point on the surface can serve as the center of a reflection-thru-a-point, in such a way that the shortest distance between any two points on the surface is the same before and after the reflection.

Think of a 2-dimensional euclidean plane. There, we have the concept of reflection thru a point. After the reflection, all distances are preserved. So, it is a symmetric space. The case gets a bit more complex when the surface is not a flat plane. Say, a sphere. A sphere is also a symmetric space. Every point on the sphere can serve as the center for reflection thru a point. After the operation, every geodesic is preserved. That is, for any 2 points P and G, the shortest distance between them on the surface is the same before and after the operation.
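
In symbols, for the euclidean plane case (a standard formulation, just for illustration): the reflection thru the point p is

s_p(x) = 2p - x,    s_p(p) = p,    d(s_p(x), s_p(y)) = d(x, y)

so s_p fixes p and preserves all distances, which is exactly the “geodesic symmetry” the quote above asks for at every point.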

So, what could other symmetric spaces be for surfaces, besides a plane or a sphere? One example i'm given is the flat torus. (See: Flat manifold)

Richard particularly spoke about Elie Cartan, the man behind symmetric spaces. Richard said that when he was studying in France, he read Elie's work and found it beautiful.

Flat manifold

In mathematics, a Riemannian manifold is said to be flat if its curvature is everywhere zero. Intuitively, a flat manifold is one that "locally looks like" Euclidean space in terms of distances and angles, e.g. the interior angles of a triangle add up to 180°.

The universal cover of a complete flat manifold is Euclidean space. This can be used to prove the theorem of Bieberbach (1911, 1912) that all compact flat manifolds are finitely covered by tori; the 3-dimensional case was proved earlier by Schoenflies (1891).

Geodesics

Geodesic.

In mathematics, a geodesic (pronounced /ˌdʒiː.ɵˈdiːzɨk, ˌdʒiː.ɵˈdɛsɨk/) is a generalization of the notion of a "straight line" to "curved spaces". In the presence of a metric, geodesics are defined to be (locally) the shortest path between points on the space. In the presence of an affine connection, geodesics are defined to be curves whose tangent vectors remain parallel if they are transported along it.

The term "geodesic" comes from geodesy, the science of measuring the size and shape of Earth; in the original sense, a geodesic was the shortest route between two points on the Earth's surface, namely, a segment of a great circle. The term has been generalized to include measurements in much more general mathematical spaces; for example, in graph theory, one might consider a geodesic between two vertices/nodes of a graph.

Representation theory

Representation theory

Representation theory is a branch of mathematics that studies abstract algebraic structures by representing their elements as linear transformations of vector spaces.[1] In essence, a representation makes an abstract algebraic object more concrete by describing its elements by matrices and the algebraic operations in terms of matrix addition and matrix multiplication. The algebraic objects amenable to such a description include groups, associative algebras and Lie algebras. The most prominent of these (and historically the first) is the representation theory of groups, in which elements of a group are represented by invertible matrices in such a way that the group operation is matrix multiplication.[2]

Hilbert space

Hilbert space.

The mathematical concept of a Hilbert space, named after David Hilbert, generalizes the notion of Euclidean space. It extends the methods of vector algebra and calculus from the two-dimensional Euclidean plane and three-dimensional space to spaces with any finite or infinite number of dimensions. A Hilbert space is an abstract vector space possessing the structure of an inner product that allows length and angle to be measured. Hilbert spaces are in addition required to be complete, a property that stipulates the existence of enough limits in the space to allow the techniques of calculus to be used.

Complete metric space

What does the “complete” above mean? Here: Complete metric space.

In mathematical analysis, a metric space M is said to be complete (or Cauchy) if every Cauchy sequence of points in M has a limit that is also in M or alternatively if every Cauchy sequence in M converges in M.

Intuitively, a space is complete if there are no "points missing" from it (inside or at the boundary). Thus, a complete metric space is analogous to a closed set. For instance, the set of rational numbers is not complete, because √2 is "missing" from it, even though one can construct a Cauchy sequence of rational numbers that converges to it. (See the examples below.) It is always possible to "fill all the holes", leading to the completion of a given space, as will be explained below.

Definition

A Hilbert space H is a real or complex inner product space that is also a complete metric space with respect to the distance function induced by the inner product.

Cauchy sequence

What the fuck is a Cauchy sequence?

In mathematics, a Cauchy sequence, named after Augustin Cauchy, is a sequence whose elements become arbitrarily close to each other as the sequence progresses. To be more precise, by dropping enough (but still only a finite number of) terms from the start of the sequence, it is possible to make the maximum of the distances from any of the remaining elements to any other such element smaller than any preassigned, necessarily positive, value.

(image: Cauchy sequence illustration)

The plot of a Cauchy sequence X[n], shown in blue, as n versus X[n]. If the space containing the sequence is complete, the "ultimate destination" of this sequence, that is, the limit, exists.
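
In symbols (the standard ε-N form, equivalent to the quote above): a sequence x_1, x_2, x_3, ... in a metric space (M, d) is Cauchy if

for every ε > 0 there is an N such that d(x_m, x_n) < ε for all m, n > N.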

Inner product space

Inner product space

In mathematics, an inner product space is a vector space with the additional structure called an inner product. This additional structure associates each pair of vectors in the space with a scalar quantity known as the inner product of the vectors. Inner products allow the rigorous introduction of intuitive geometrical notions such as the length of a vector or the angle between two vectors. They also provide the means of defining orthogonality between vectors (zero inner product). Inner product spaces generalize Euclidean spaces (in which the inner product is the dot product, also known as the scalar product) to vector spaces of any (possibly infinite) dimension, and are studied in functional analysis.

An inner product space is sometimes also called a pre-Hilbert space, since its completion with respect to the metric, induced by its inner product, is a Hilbert space. That is, if a pre-Hilbert space is complete with respect to the metric arising from its inner product (and norm), then it is called a Hilbert space.
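
In symbols (standard definitions, assuming a real inner product space): the inner product <u, v> gives both length and angle, by

norm(v) = sqrt(<v, v>),    cos(θ) = <u, v> / (norm(u) norm(v)),    u ⊥ v when <u, v> = 0.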

Lie Group

Lie group.

A Lie group is a group which is also a differentiable manifold, with the property that the group operations are compatible with the smooth structure. Lie groups are named after the nineteenth century Norwegian mathematician Sophus Lie, who laid the foundations of the theory of continuous transformation groups. Lie groups represent the best-developed theory of continuous symmetry of mathematical objects and structures, which makes them indispensable tools for many parts of contemporary mathematics, as well as for modern theoretical physics. They provide a natural framework for analysing the continuous symmetries of differential equations (Differential Galois theory), in much the same way as permutation groups are used in Galois theory for analysing the discrete symmetries of algebraic equations. An extension of Galois theory to the case of continuous symmetry groups was one of Lie's principal motivations.

Differential Galois Theory

Differential Galois theory.

In mathematics, the antiderivatives of certain elementary functions cannot themselves be expressed as elementary functions. A standard example of such a function is e^(-x^2), whose antiderivative is (up to constants) the error function, familiar from statistics. Other examples include the functions Sin[x]/x and x^x.

It should be realized that the notion of an elementary function is merely a matter of convention. One could choose to add the error function to the list of elementary functions, and with this new list, the antiderivative of e^(-x^2) is elementary. However, no matter how long the list of so called elementary functions, as long as it is finite, there will still be functions on the list whose antiderivatives are not.

The machinery of differential Galois theory allows one to determine when an elementary function does or does not have an antiderivative that can be expressed as an elementary function. Differential Galois theory is a theory based on the model of Galois theory. Whereas algebraic Galois theory studies extensions of algebraic fields, differential Galois theory studies extensions of differential fields, i.e. fields that are equipped with a derivation, D. Much of the theory of differential Galois theory is parallel to algebraic Galois theory. One difference between the two constructions is that the Galois groups in differential Galois theory tend to be matrix Lie groups, as compared with the finite groups often encountered in algebraic Galois theory. The problem of finding which integrals of elementary functions can be expressed with other elementary functions is analogous to the problem of solutions of polynomial equations by radicals in algebraic Galois theory.

Topological space

Topological space

Topological spaces are mathematical structures that allow the formal definition of concepts such as convergence, connectedness, and continuity. They appear in virtually every branch of modern mathematics and are a central unifying notion.

A topological space is a set X together with τ, a collection of subsets of X, satisfying the following axioms:

  • The empty set and X are in τ.
  • The union of any collection of sets in τ is also in τ.
  • The intersection of any finite collection of sets in τ is also in τ.
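
A quick sanity check of these axioms on a tiny example (my own, not from the article): take X = {a, b} and τ = {∅, {a}, X}. Then ∅ and X are in τ; unions: ∅ ∪ {a} = {a} and {a} ∪ X = X, both in τ; finite intersections: {a} ∩ X = {a} and ∅ ∩ {a} = ∅, both in τ. So (X, τ) is a topological space (the so-called Sierpiński space). Note that {b} is not open in this topology.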

Space

This article: Space (mathematics), is quite illuminating.

(image: the hierarchy of mathematical spaces)

The article is particularly interesting because it gives a sense of the organization of all these spaces.

Basically, the most basic is an inner product space (a vector space with an inner product defined). The inner product is required, in the modern approach, to give you the definition of angles. Then you have normed vector spaces; that means the length concept is defined based on the inner product. (Again, not necessarily natural, but kinda logically the simplest from a modern foundational point of view.) So now you have angles and length, both defined on top of a Field. A normed vector space basically gives you the familiar euclidean space. More generalization, with different ways to measure the “distance” between elements, gives you metric spaces. And more abstract, rigorous, or different definitions on issues of boundary, continuity, differentiability give you topological spaces.

Note that the above loose grouping is just one perspective. Categorization or classification is a matter of Taxonomy and Ontology. Basically, there are many perspectives, with regard to different needs, and there's no one universal categorization into one absolute hierarchy.

In particular, the above hierarchy of spaces is based on the most popular modern practice of math foundations. That is, you start with sets, then integers, then rational numbers, then with Dedekind cuts you have real numbers, then with binary operators (functions of 2 parameters) you have groups, then rings and fields, and vector spaces. Though at this point you don't have the familiar angles and distances of euclidean space. So comes a fiat definition of a function called the inner product, which gives you the definition of angle; then length comes from another function called the “norm”, conveniently defined on top of the inner product. The definition and generalization of area and volume become Measure spaces.

Some items of particular personal interest:

The original space investigated by Euclid is now called "the three-dimensional Euclidean space". Its axiomatization, started by Euclid 23 centuries ago, was finalized in the 20 century by David Hilbert, Alfred Tarski and George Birkhoff. This approach describes the space via undefined primitives (such as "point", "between", "congruent") constrained by a number of axioms. Such a definition "from scratch" is now of little use, since it hides the standing of this space among other spaces. The modern approach defines the three-dimensional Euclidean space more algebraically, via linear spaces and quadratic forms, namely, as an affine space whose difference space is a three-dimensional inner product space.

Also a three-dimensional projective space is now defined non-classically, as the space of all one-dimensional subspaces (that is, straight lines through the origin) of a four-dimensional linear space.

css problems

perm url with updates: http://xahlee.org/js/css_problems.html

CSS Problems

Xah Lee, 2010-01-28

Some random thoughts about CSS.

Am reading the Wikipedia article on Cascading Style Sheets again.

Here's an interesting quote:

While new additions to CSS3 provide a stronger, more robust layout feature-set, CSS is still very much rooted as a styling language, not a layout language. This problem has also meant that creating fluid layouts is still very much done by hand-coding CSS, and make the development of a standards-based WYSIWYG editor more difficult than expected.

This is so very much true. For example, if you want text to flow in 2 columns, basically you have to manually move the text to the appropriate block. (As opposed to, for example, text being automatically word-wrapped to a specified width when it is long. See: CSS Text Wrapping)

Also, although you can make a page's layout using CSS instead of Tables, if you want more fine-grained layout, such as with nested tables, CSS pretty much fails. You'd spend several hours trying to do it and come out with an unsatisfactory result. (see also: Tableless Layout with CSS) I'd say, just use tables.

CSS's tag matching scheme (so-called Selectors) is also pretty weak and ad hoc. For example, there's “:first-child” to match the first child of a tag, but you can't match the second child, third, etc., or the last. “AAA + BBB” will match BBB only if there exists, at the same level, an AAA that comes immediately before it. But you cannot specify a match where there must be a CCC that comes after.

Generally speaking, html and xml are tree structures. With this perspective, you can see that css selectors are just a way to match tree branches. With a tree, you have the concepts of root, level/depth, parent/child/siblings, and ordering of siblings. For a tree matcher in its full generality, you can consider a scheme where all these tree properties can be specified (in a way, similar to pattern matching in functional languages). Of course, css isn't a computing language, so, in designing its Selectors, efficiency needs to be considered. In any case, the way css's selectors are today is rather ad hoc and very weak.

Also, the selector expression cannot use parens to specify precedence. This is a need i actually had a few times for my own site. (It'll take some time to write an explanation; will have to add an example here later.)

Two other criticisms from Wikipedia i particularly find to be important are:

CSS offers no way to select a parent or ancestor of an element that satisfies certain criteria. A more advanced selector scheme (such as XPath) would enable more sophisticated style sheets. However, the major reasons for the CSS Working Group rejecting proposals for parent selectors are related to browser performance and incremental rendering issues.

While horizontal placement of elements is generally easy to control, vertical placement is frequently unintuitive, convoluted, or impossible. Simple tasks, such as centering an element vertically or getting a footer to be placed no higher than bottom of viewport, either require complicated and unintuitive style rules, or simple but widely unsupported rules.

For a short tutorial of css selectors, see: What's New in CSS2.

2010-01-25

unix pipe as functional language

perm url with updates: http://xahlee.org/comp/unix_pipes_and_functional_lang.html.

Unix Pipe As Functional Language

Xah Lee, 2010-01-25

Found the following juicy interview snippet today:

Is there a connection between the idea of composing programs together from the command line through pipes and the idea of writing little languages, each for a specific domain?

Alfred Aho: I think there's a connection. Certainly in the early days of Unix, pipes facilitated function composition on the command line. You could take an input, perform some transformation on it, and then pipe the output into another program. ...

When you say “function composition”, that brings to mind the mathematical approach of function composition.

Alfred Aho: That's exactly what I mean.

Was that mathematical formalism in mind at the invention of the pipe, or was that a metaphor added later when someone realized it worked the same way?

Alfred Aho: I think it was right there from the start. Doug McIlroy, at least in my book, deserves the credit for pipes. He thought like a mathematician and I think he had this connection right from the start. I think of the Unix command line as a prototypical functional language.

It is from an interview with Alfred Aho, one of the creators of AWK. The source is this book: Masterminds of Programming: Conversations with the Creators of Major Programming Languages (2009), by Federico Biancuzzi et al. (amazon)

Since about 1998, when i got into the unix programing industry, i have seen the pipe as a postfix notation, and sequencing pipes as a form of functional programing, while finding it overall extremely badly designed. I've written a few essays explaining the functional programing connection and exposing the lousy syntax (mostly in the years around 2000). However, i've never seen another person express the idea that unix pipes are a form of postfix notation and functional programing. It is a great satisfaction to see one of the main unix authors state so.

Unix Pipe As Functional Programing

The following email content (slightly edited) was posted to a Mac OS X mailing list, 2002-05. Source

From: xah / xahlee.org
Subject: Re: mail handling/conversion between OSes/apps
Date: May 12, 2002 8:41:58 PM PDT
Cc: macosx-talk / omnigroup.com

Yes, unix has this beautiful philosophy. The philosophy is functional programing. For example, define:

power(x) := x*x

so “power(3)” returns “9”.

Here “power” is a function that takes one argument and returns its square.

functions can be nested,

f(g(h(x)))

or composed

compose(f,g,h)(x)

Here “compose” is itself a function, which takes other functions as arguments, and the output of compose is a new function that is equivalent to nesting f, g, h.

Nesting does not necessarily involve nested syntax. Here's postfix notation in Mathematica, for example:

x // h // g // f

or prefix notation:

f @ g @ h @ x

or in lisp

(f (g (h x)))

The principle is that everything is either a function definition or a function application, and a function's behavior is strictly determined by its arguments.
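
Here's a minimal sketch of such a “compose” in Python, just for illustration:

def compose(*fs):
    # compose(f, g, h)(x) is equivalent to f(g(h(x)))
    def composed(x):
        for f in reversed(fs):
            x = f(x)
        return x
    return composed

def power(x): return x * x

print(compose(str, power, abs)(-3))   # "9": apply abs, then power, then str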

Apple, around 1997 or so, had this OpenDoc technology, which is a similar idea applied more broadly across the OS. That is, instead of one monolithic browser or big image editor or other software, you have lots of small tools or components, each doing one specific thing, and all can call each other or be embedded in an application framework as services or the like. For example, in an email app, you could use BBEdit to write your email, use Microsoft's spell checker, use XYZ brand of recorder to record a message, without having to open many applications or use the Finder the way we would do today. This multiplies flexibility. (OpenDoc was killed when Steve Jobs became the iCEO around 1998 and did some serious house-keeping, against the ghastly anger of Mac developers and fanatics; I'm sure many of you remember this piece of history.)

The unix pipe syntax “|” is a postfix notation for nesting, e.g.

ps auwwx | awk '{print $2}' | sort -n | xargs echo

in conventional syntax it might look like this:

xargs(  echo, sort(n, awk('print $2', ps(auwwx)))  )
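
For illustration, here's roughly the same pipeline built as explicit composition with Python's subprocess module (a sketch; the commands are the unix ones above):

import subprocess

# ps auwwx | awk '{print $2}' | sort -n | xargs echo, built step by step
p1 = subprocess.Popen(["ps", "auwwx"], stdout=subprocess.PIPE)
p2 = subprocess.Popen(["awk", "{print $2}"], stdin=p1.stdout, stdout=subprocess.PIPE)
p3 = subprocess.Popen(["sort", "-n"], stdin=p2.stdout, stdout=subprocess.PIPE)
p4 = subprocess.Popen(["xargs", "echo"], stdin=p3.stdout, stdout=subprocess.PIPE)

print(p4.communicate()[0].decode())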

So when you use “pipe” to string many commands together in unix, you are doing supreme functional programing. That's why it is so flexible and useful: each component or function does one thing, and you can combine them in a myriad of ways. However, this beautiful functional programing idea, when implemented by the unix heads, becomes a fucking mess. Nothing works and nothing works right.

I don't feel like writing a comprehensive exposition on this at the moment. Here's a quick summary:

  • Fantastically stupid syntax.
  • Inconsistencies everywhere. Everywhere.
  • Fucking stupid reliance on global variables, called environment variables, which fucks up the whole functional programing paradigm.
  • Implicit stuff everywhere.
  • Totally incompetent commands and their parameters. (Promiscuously non-orthogonal, missing things, and fucked up in more ways than one can possibly imagine. There are a million ways to do one thing, and none are correct, and many simple needs CANNOT be done! (That's why there are a gazillion shells, each smart-ass improving upon the other, and that's why Perl was born too! But asinine Larry Wall doesn't know shit and smugly created another complexity that doesn't do much.))

Maybe some other day when i'm pissed, i'll write a better exposition on this issue. I've been wanting to write a final-say essay on this for long. Don't feel like it now.

Unix Syntactical and Semantical Stupidity Exposition

The following was posted to a Mac OS X mailing list. Source

From: xah@xahlee.org
Subject: unix inanity: shell env syntax
Date: June 7, 2002 12:00:29 AM PDT
To: macosx-talk@omnigroup.com

Unix Syntactical and Semantical Stupidity Exposition. (this is one of the many technical expositions of unix stupidity)

(this is currently unpolished, but the meat is there. Input welcome.)

arguments are given with a dash prefix. e.g.

ls -a -l

Order (usually) does not matter. So,

ls -a -l

is the same as

ls -l -a

but arguments can be combined, e.g.

ls -al

means the same thing as

ls -a -l

However, some option consists of more than one character. e.g.

perl -version
perl -help

therefore, the meaning of an option string "-ab" is ad hoc, dependent on the program. It can be "-a -b" or just an option named "ab".

Then, sometimes there are two versions of the same optional argument. e.g.

perl -help
perl -h
perl -version
perl -v

this equivalence is ad hoc.

Different programs will disagree on common options. For example, to get the version, here are common varieties:

-v
-V
-version

sometimes v/V stands for "verbose mode", i.e. to output more detail.

Sometimes, if an option is given more than once, it specifies a degree of that option. For example, some commands accept -v for "verbose", meaning that the command will output more detail. Sometimes there are a few levels of detail. The number of times the option is given determines the level of detail. e.g. on Solaris 8,

/usr/ucb/ps -w
/usr/ucb/ps -w -w

Thus, a repeated option may have a special meaning, depending on the program.

Oftentimes some options automatically turn on or suppress a bunch of others. e.g. Solaris 8,

/usr/bin/ls -f

When a named optional parameter is of a boolean type, that is, a toggle of yes/no, true/false, exists/doesn't exist, then oftentimes, instead of taking a boolean value, its sole presence or absence defines its value.

Toggle options are sometimes represented by one option name for yes and another option name for no; when both are present, the behavior is program dependent.

Toggle options are represented by different option names.

For named options, the syntax is slack and the behavior usually depends on the program, i.e. not all of the following work for every program:

command -o="myvalue"
command -omyvalue
command -o myvalue

Often one option may have many synonyms...

An example of a better design... (Mathematica, Scheme, Dylan, Python, Ruby... there's quite a lot of elegance and practicality, yet distinct designs and purposes and styles ...)
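
For contrast, here's a minimal sketch of a more uniform option scheme, using Python's argparse module (just an illustration; the tool and option names are made up):

import argparse

# every option has one canonical long name, an optional short alias,
# and a uniform value syntax: --output FILE or --output=FILE
parser = argparse.ArgumentParser(prog="mytool")
parser.add_argument("--version", action="version", version="mytool 1.0")
parser.add_argument("-v", "--verbose", action="count", default=0,
                    help="repeat for more detail")
parser.add_argument("-o", "--output", metavar="FILE", help="output file")

args = parser.parse_args(["-vv", "--output=result.txt"])
print(args.verbose, args.output)   # 2 result.txt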

(Recall that unix has a bad design to begin with; it's a donkey shit pile from its beginning and continuation. Again, unix is not simply technically incompetent. If it were just that, that would be easy to improve, and i wouldn't have a problem with it, since there are things considered horrendous by today's standards, like COBOL or FORTRAN or DOS, etc. But unix is a brain-washing, idiot-making machine, churning out piles and piles of religiously idiotic and pigheaded keyboard punchers. For EVERY aspect of good engineering methodology improvement or language design progress opportunity, unixers will unanimously turn it down.)

Inevitably someone will ask me what my point is. My point, in my series of unix-showcasing articles, has always been clear to those who study them: Unix is a crime that has caused society inordinate harm, and i want unix geeks to wake up and realize it.

Microsoft PowerShell

Note: Microsoft's new shell programing language, PowerShell (born 2006), adopted much of unix shell's syntax and the pipe paradigm, but with a much more consistent and formal design. (see: Xah's PowerShell Tutorial)