Blocking Image Leechers

Xah Lee, 2010-01-22

This page gives some tips on preventing other websites from hotlinking images hosted on your website.

Image leeching is a problem. Basically, some other website inlines an image hosted on your website. Besides the copyright issue, it causes a bandwidth problem on your site. There are a lot of websites these days that let their users insert images from a url. The user may not be aware that it is a problem, since most of them are not technical people; they simply want to show their friends some images they found.

Image leeching often takes significant bandwidth from your site. Say you have an image of a beautiful girl. Many porn or otherwise shady sites, such as 4chan and lots of others infested by teens, high school and college students, and gamers, have huge amounts of traffic for rather useless content (mostly teen drivel and bantering). If one of them inlines one of your images, that image will get a few hundred hits a day, or thousands. If you get leeched, more than 50% of your site's bandwidth can go to leechers; more likely, your bandwidth usage will be a few times normal.

My website did not have image leecher protection until 2003 or so. Then I noticed image leechers; they caused my site to go over its bandwidth limit for the month. This happened to me several times in the past, which meant I had to pay extra hosting fees, by the megabyte. (See: Web Traffic Report)

The following is Apache HTTP Server config code for blocking leechers. You need to place it in a file named “.htaccess”, usually at the web root directory.

RewriteEngine on

# block image leechers
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?xahlee\.org [NC]
RewriteCond %{HTTP_REFERER} !^http://(www\.)?xahlee\.info [NC]
RewriteCond %{HTTP_REFERER} !^http://(www\.)?ergoemacs\.org [NC]
RewriteCond %{HTTP_REFERER} !^http://xahlee\.blogspot\.com [NC]
RewriteRule \.(png|gif|jpg|jpeg|mov|mp3|wav|ico)$ - [NC,F]

What the above code does is this: overall, it matches against the HTTP_REFERER text the browser sends. (HTTP_REFERER contains the url of the page the request came from.) If all the conditions are met, then a “rewrite” is applied to the url (here, denying access). Here are the conditions: the HTTP_REFERER is not blank (if it is blank, it usually means the visitor typed the image url directly into the browser), and the HTTP_REFERER does not match any of xahlee.org, xahlee.info, ergoemacs.org, or xahlee.blogspot.com. Otherwise (the image is inlined from some other website), access is denied for urls ending in “png”, “gif”, “jpg”, etc. The “NC” means ignore letter case. The “F” means deny access (Forbidden).
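
To illustrate the logic, here is a rough Python sketch of the same decision the Apache rules make. The function and names here are made up for illustration; on a real server, Apache's mod_rewrite does this work.

```python
import re

# Referers from our own domains are allowed (mirrors the RewriteCond lines).
ALLOWED_REFERER = re.compile(
    r"^http://(www\.)?(xahlee\.org|xahlee\.info|ergoemacs\.org)"
    r"|^http://xahlee\.blogspot\.com",
    re.IGNORECASE,
)
# File extensions to protect (mirrors the RewriteRule pattern).
BLOCKED_EXT = re.compile(r"\.(png|gif|jpg|jpeg|mov|mp3|wav|ico)$", re.IGNORECASE)

def is_leech(referer, request_path):
    """Return True when the request should be denied (403 Forbidden)."""
    if referer == "":
        # Empty referer: visitor typed the url directly. Allow.
        return False
    if ALLOWED_REFERER.match(referer):
        # Request came from one of our own pages. Allow.
        return False
    # Foreign referer asking for a media file: deny.
    return bool(BLOCKED_EXT.search(request_path))
```

For example, a request for a jpg with a 4chan referer is denied, while the same request with an empty referer or a referer from your own site goes through.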

Note that this protection is not absolute. The very nature of the web, from its original conception, is to let anyone share info, without much thought about copyright or commercial development. A leecher can easily just mirror your image on their own server (which steals your image but not your bandwidth), or the leecher's pages can have js code that bypasses this protection. Anyhow, that's getting into hacking. The above code should prevent the vast majority of bandwidth theft.

There are several more advanced blocking methods. One is using js to hide your image urls from visitors; another is embedding images in Flash. But these usually get a bit complicated.

Site Whackers

Another related problem is site whackers. Some people, typically programing geekers, when they like your site, download everything to their computer, so they can read it offline, or out of some packrat habit. My website today is 679 MB. A few whacks, and your monthly bandwidth will be gone. A whack also causes a huge spike in your traffic, making your site slow.

Here's the code to prevent web whackers:

# forbid bots that whack entire site
RewriteCond  %{HTTP_USER_AGENT}  ^(Wget|Teleport\ Pro|WebCopier|HTTrack|WebCapture)
RewriteRule  ^.+$                -  [F,L]
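
As a sanity check on the pattern, here is a Python sketch of the same user-agent test (names made up for illustration). One pitfall worth noting: the regex must not contain an empty alternative (a stray “||”), because an empty alternative matches every user agent and would lock out all visitors.

```python
import re

# Mirrors the corrected HTTP_USER_AGENT pattern above.
WHACKER_AGENTS = re.compile(r"^(Wget|Teleport Pro|WebCopier|HTTrack|WebCapture)")

def is_whacker(user_agent):
    """Return True when the user agent looks like a whole-site downloader."""
    return WHACKER_AGENTS.match(user_agent) is not None
```

So “Wget/1.21” is blocked, while an ordinary browser's “Mozilla/5.0 …” agent string passes.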

Note that the above does not provide absolute protection. Dedicated whackers can easily bypass it by masking their user agent id. But it should stop the majority of casual web whackers.

Some popular website downloaders are: HTTrack, Wget. Note: cURL does not let you download a site recursively, or a page together with all its linked pages. It lets you download one single page, or a set of pages or images.

Of course there are legitimate uses. In the older days (say, late 1990s or early 2000s), internet speeds were not as fast as today, many people were still on 28.8 kbit/s modems, and websites were not that reliable or accessible. So people needed to download sites to read offline, or to archive them in case the info disappeared the next day. However, today, with video sites and 100 megabyte movie trailers and all that, these reasons are mostly gone.

Note that there's the Robots exclusion standard, which instructs how web crawlers should behave. However, web downloading software usually ignores it, or can easily be set to ignore it.
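
For reference, the standard works by a plain-text file named “robots.txt” at the web root. A minimal example asking all crawlers to stay out of an images directory might look like this (the directory name here is just an example):

```
User-agent: *
Disallow: /images/
```

Well-behaved crawlers such as search engine bots honor this; as said, downloaders generally do not.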
