2008-03-29 16:37 - Tech
Back in 2005, Google released their Web Authoring Statistics survey. It covered a wide variety of interesting topics, from HTML elements and attributes to HTTP headers and beyond. The analysis was wonderfully in-depth, with a self-reported sample set of a billion documents. I truly enjoyed reading through this data the first time I found it, and found myself starting to wish that I could run a similar survey. But of what?
A few years later, I found Andrew Wooster's robots.txt adventure, a similar survey. This time, however, it was an analysis of the contents of the ubiquitous robots.txt file, again with interesting and entertaining results. I still wanted to do something like it. I still didn't know what to survey.
Then, in January of 2008, there was a funny post to reddit about the favicon on panasonic.com. It was quickly fixed after the post, but at the time, Panasonic had the Internet Explorer icon as their favicon. Comments on the post quickly pointed out the equally humorous frequency with which the Netscape and Sun logos, apparently provided as defaults with those vendors' HTTP server software, get left on public sites for all the world to see.
Then I had it! It's not nearly as meaty, but it might be the only one left. I decided to survey the frequency of common favicons, across the web.
Please note: The various links scattered through this article worked at the time of the survey. That doesn't mean they'll still work forever, so take them with a grain of salt.
I needed to start with a big list of sites. It's a complicated topic, but for this purpose, I decided to equate "domain name" with "site". I ended up with a list of all .com, .net, .org, and .info domains, totaling 83,670,574 sites. (Source) I split the data up into about 350 chunks, then interleaved the chunks to equalize the progress. The interleaved list contained line 1 from each source chunk, then line 2 from each source chunk, and so on, and was itself split into about 350 new chunks.
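The interleaving step could look something like the following in Python. This is only a sketch, not the script used for the survey; the function names `split_chunks` and `interleave` and the sample domain names are made up for illustration.

```python
from itertools import zip_longest


def split_chunks(items, n_chunks):
    """Split a list into contiguous chunks of near-equal size."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


def interleave(chunks):
    """Yield item 1 of every chunk, then item 2 of every chunk, and so on."""
    missing = object()  # sentinel for chunks that have run out of items
    for row in zip_longest(*chunks, fillvalue=missing):
        for item in row:
            if item is not missing:
                yield item


# Hypothetical sample data: ten domains, three chunks instead of ~350.
domains = [f"site{i}.com" for i in range(10)]
source_chunks = split_chunks(domains, 3)
mixed = list(interleave(source_chunks))
work_chunks = split_chunks(mixed, 3)
```

Each work chunk now holds a slice from every part of the original list, so progress through any one chunk is roughly representative of the whole.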
Then I started downloading! For each domain, I tried http://example.com/favicon.ico first and http://www.example.com/favicon.ico second. An attempt counted as failed if DNS did not resolve or if the server returned anything other than an HTTP 200 code. If both attempts failed, the failure was simply recorded.
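A rough Python sketch of that lookup order is below. It is an illustration under assumptions, not the original downloader: the helper names are invented, the 10-second timeout is arbitrary, and `urllib` follows redirects by default, which the survey script may or may not have done. A status of 0 stands for "no HTTP response at all".

```python
import http.client
import urllib.error
import urllib.request


def candidate_urls(domain):
    """Bare domain first, then the www. variant."""
    return [f"http://{domain}/favicon.ico", f"http://www.{domain}/favicon.ico"]


def http_get(url, timeout=10):
    """Return (status, body); status 0 means no HTTP response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code, b""
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # DNS failure, refused connection, timeout, or a malformed response.
        return 0, b""


def fetch_favicon(domain, get=http_get):
    """Return (url, body) for the first candidate answering 200, else None."""
    for url in candidate_urls(domain):
        status, body = get(url)
        if status == 200:
            return url, body
    return None
```

Passing the fetch function in as `get` keeps the lookup order testable without touching the network.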
For each success, the favicon.ico file was downloaded and saved. It was stored under its SHA1 hash, to establish global "uniqueness" of an actual image. This process repeated, around twenty million times.
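Storing each file under its SHA1 hash means identical icons collapse into a single file on disk. A minimal sketch, assuming a flat output directory (the function name `store_icon` is made up here):

```python
import hashlib
from pathlib import Path


def store_icon(data, directory):
    """Save icon bytes under their SHA1 hex digest and return the digest."""
    digest = hashlib.sha1(data).hexdigest()
    path = Path(directory) / digest
    if not path.exists():  # identical content is only written once
        path.write_bytes(data)
    return digest
```

The returned digest can then be recorded next to the domain name, so counting how often each icon occurs is just a matter of counting digests.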
I ended up downloading 1,814,959 favicons, 554,552 of them unique, from 28,044,394 domains. I skipped 55,626,180 domains in my survey. This means, assuming a linear scale, I surveyed 33.52% of the domains on the 'net. Multiply by three (or, to be too precise, 2.9835044394255763) to find the "real" global frequency.
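The coverage figure and the scaling factor follow directly from the two domain counts above. A quick check in Python (variable names are illustrative):

```python
surveyed = 28_044_394
skipped = 55_626_180

total = surveyed + skipped   # matches the 83,670,574 domains in the list
fraction = surveyed / total  # share of the list actually surveyed
scale = total / surveyed     # multiplier to extrapolate to the full list

print(total)
print(f"{fraction:.2%}")
print(f"{scale:.4f}")
```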
First, the HTTP response codes from the entire survey:
The standard long tail. It's slightly interesting to note that the 5th-most-frequent value is "0"; in other words, the domain doesn't have a web server answering at all. That's 3.73% of the domains with no website. We also see one "clever" server returning HTTP/1.1 666 OK, plus HTTP/1.1 999 No Hacking and other unusual values down at the end of the tail. Some of them are valid, just extremely rare, like HTTP/1.1 409 Conflict and HTTP/1.1 201 OK.
Of slightly interesting note: although I made sure to record the response code as a string, just in case, all responses came with an integer, as expected.
It's hard to really draw much meaning from that list, though, so let's group it together.
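One plausible way to group the codes is by their hundreds class, with separate buckets for "0" (no answer) and for out-of-range values like 666 and 999. The sketch below assumes that grouping; it is not necessarily the one used here, and the sample counts in the usage lines are invented.

```python
from collections import Counter


def status_class(code):
    """Map a recorded status code (string or int) to a coarse group."""
    code = int(code)
    if code == 0:
        return "no response"
    if 100 <= code <= 599:
        return f"{code // 100}xx"
    return "nonstandard"


def group_by_class(code_counts):
    """Sum per-code frequencies into per-class frequencies."""
    grouped = Counter()
    for code, count in code_counts.items():
        grouped[status_class(code)] += count
    return grouped


# Hypothetical counts, keyed by the code as a string.
sample = {"200": 5, "404": 3, "403": 1, "0": 2, "666": 1, "999": 1}
grouped = group_by_class(sample)
```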
|Major Type||Minor Type||Frequency|
Who would have guessed it would be such a giant list? No surprise, following the general trend, the number one value is invalid: text/html, with 2,879,503 responses, covers 42.25% of the 200 OK responses. Thankfully, image/x-icon comes in second with 2,530,980, or 37.14%. The total of all possible image types, counting misspellings and border cases, makes up 3,001,452, or 44.04%. Less than half of the servers that send back a "favicon.ico" response actually do so with something that is (or might be) an image. Even then, many of them are not icon files.
The single largest file was an 82,895,006-byte tarball of (apparently) the entire site's contents. (Though at that size my script often failed to download, so there might be a larger file out there.)
The largest valid image file was 7,047,018 bytes. Sadly, as an Adobe Photoshop file, very few browsers out there are going to display it correctly. It was one of 20 Photoshop files in the top 100 by size. The largest valid image file, which a browser will successfully display, was 3,802,526 bytes. It's a confusing mish-mash of random white and yellow pixels, with a dash of red thrown in. The largest really valid (actual icon file, 16x16 size) file was 3,463,616 bytes.
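Two small checks capture the distinction drawn above: what the server claims in its Content-Type header, and whether the bytes actually begin like an icon file. The sketch below is illustrative only, with made-up function names. The signature test relies on the ICO header starting with a reserved 16-bit zero followed by a 16-bit type of 1, and it does not validate the rest of the file.

```python
def split_content_type(header):
    """Split a Content-Type header into (major, minor), ignoring parameters."""
    media = header.split(";", 1)[0].strip().lower()
    major, _, minor = media.partition("/")
    return major, minor


def looks_like_ico(data):
    """True if the bytes start with the ICO signature 00 00 01 00."""
    return data[:4] == b"\x00\x00\x01\x00"
```

A response labelled text/html fails the first check, and a Photoshop file or tarball served as image/x-icon fails the second.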
Some people just put entire photos, or large artwork, in the favicon's place. Things that clearly don't belong: 1 (2244x1530 pixels) 2 (2514x2112 pixels) 3 (6850x7087 pixels) 4 (675x911 pixels) 5 (1200x1600 pixels) 6 (562x583 pixels) 7 (640x466 pixels) 8 (450x493 pixels) 9 (1100x1200 pixels) 10 (528x528 pixels) 11 (468x572 pixels) 12 (583x445 pixels) 13 (720x348 pixels) 14 (600x600 pixels) 15 (538x403 pixels) 16 (501x417 pixels) 17 (1600x1200 pixels) . These, again, are just out of the top 100 by (byte) size.
Common Icons, By Site Type
I went into this project thinking of the Sun and Netscape logos, and expected to find those as common icons. I did, but there were a whole bunch of other, more common, icons out there.
The Sun logo came in at #262 with 231 occurrences. The Netscape logo came in at #468 with 108 occurrences.
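Ranks like these fall out of counting how many domains share each hash. A small sketch, assuming a mapping from domain to the SHA1 of its favicon (the name `rank_icons` and the sample data are hypothetical):

```python
from collections import Counter


def rank_icons(hash_by_domain):
    """Return (rank, digest, occurrences) tuples, most common first."""
    counts = Counter(hash_by_domain.values())
    return [
        (rank, digest, n)
        for rank, (digest, n) in enumerate(counts.most_common(), start=1)
    ]


# Hypothetical sample: two domains share one icon.
sample = {"a.com": "h1", "b.com": "h1", "c.com": "h2"}
ranking = rank_icons(sample)
```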
So what was everything that was so much more common? I tracked the top 50 most common icons, and figured out what each of them was and what kind of sites they appeared on. I came up with the following categories (listed most to least common among the top 50):
- MFA: Made For Ads, AKA "made for AdSense". A site which is clearly automatically generated from a small set of keywords. Often, a really lame set of search results.
- Host Default: The default icon left behind by a particular web host provider. Apparently, some hosts choose to set this up (intentionally?) and some users never change/override it. Sometimes, a frameset wrapping a page not really hosted at the domain is the culprit.
- Software Default: Various software packages provide a default icon, and again, users don't always replace it.
- Service Provider: Some companies provide a service which is a sort of automatic, or generally low effort, site set up for you. I saw, for example, real estate services providing a simple page for agents, and Yahoo! Stores leaving the "Y!" icon showing.
- Real Content: Real sites, with some kind of real content. This ended up often being just a transparent image, which a bunch of separate sites individually chose.
- Mass Content: Occasionally, this was a bone-headed move by a company thinking it's a great idea to register all the domain names they can come up with, and host the same site (with minor content differences) at each one. Sometimes, it really was just mass hosting of the exact same site on many different domains. I couldn't rightly call this "real" content.
- Parking: An unused domain can be "parked" so that the owner keeps it. The registrar usually offers this feature.
The 50 most common favicon.ico files:
|1||252744||MFA||Hitfarm.com http://www.hitfarm.com/ or their customers ALL IP 220.127.116.11|
"MDNH, Inc." or "MYDOMAIN, INC." or "Marchex" or more? http://www.marchex.com/
Including "OpenList" with many zip codes registered, i.e. http://12182.net/
|3||104766||MFA||DIRECTNIC.COM (?). ALL IP 18.104.22.168 . That IP sends you to: http://dotzup.com/|
|4||80110||Host Default||Bluehost http://www.bluehost.com/|
|5||51957||Host Default||Host Monster http://www.hostmonster.com/|
An old version of Plesk's favicon. See: http://plesk21.net/
|7||36508||Service Default||Yahoo! http://www.yahoo.com/ -- Mostly yahoo stores|
|8||28781||Host Default||Google Pages http://pages.google.com/|
|9||22848||MFA||ALL IP 22.214.171.124 . Not sure who. http://126.96.36.199/|
ALL in 188.8.131.52/24
Tagline: "Another XSite by a la mode, inc." (http://www.alamode.com/) Also: http://www.xsitesnetwork.com/
ALL IP 184.108.40.206 .
They all say at the bottom:
"Powered by Point2 Real Estate Websites (http://nls.point2.com/)
The Point2 Homes Real Estate Network (http://homes.point2.com/)"
|13||13272||Software Default||DotNetNuke http://www.dotnetnuke.com/|
They're all in 220.127.116.11/24 except http://alamarbythesea.com/ .
|15||11660||Software Default||Plesk default icon again, another (older?) version. See, for example: http://artwooooooooorker.net/|
Provides sites (at subdomains) and emails, at many domains. Seems to always be MFA inside the ugly frame.
Weird algorithmic "content" for the most part.
|18||9595||Service Default||Blogger http://www.blogger.com/|
133 for 420 sites
They're (almost) all 403 forbidden or "this page has been suspended", but some real sites it seems.
dot mac http://www.apple.com/dotmac/
ALL IP 18.104.22.168 except http://www.mcvsd.org/
http://www.ddc.com/ Domain Development Corp
ALL IP 22.214.171.124 but one
|22||7584||Parking||http://net4.in/ ALL IP 126.96.36.199|
|23||7524||Mass Content||"WN Network" http://www.wn.com/ ALL IP 188.8.131.52|
|24||5620||Mass Content||Cities Unlimited http://www.citiesunlimited.com/|
Seems to be http://webinceptions.com/
Algo/RSS based "content", but MFA nonetheless.
|27||4547||Software Default||Xoops http://www.xoops.org/|
ALL IP 184.108.40.206/24 -- spread out in that class C
Guess: first-clickonline http://first-clickmedia.com/ http://www.searchbizonline.com/
ALL IP 220.127.116.11
|30||3603||Software Default||VBulletin http://www.vbulletin.com/|
|31||3401||Software Default||Plesk http://www.parallels.com/plesk/|
|34||2894||Service Provider||I think? http://innuity.com/|
|37||2420||Software Default||cPanel http://www.cpanel.net/|
|38||2358||MFA||Independent/multi? All in 18.104.22.168/24 all the same content|
|40||2117||Service Provider||E-zekiel http://e-zekiel.com/|
|41||1990||Mass Content||Some sort of astroturf, I guess. WHOIS all report something related to http://www.cleanenergypublications.com/|
|42||1967||Software Default||Confixx control panels|
|43||1906||Software Default||Drupal http://drupal.org/|
|44||1895||MFA||Well masked. Not sure. Appears to be the registrar http://www.siteurl.com/ snapping up domains. Almost all at IP 22.214.171.124 .|
|46||1825||Software Default||iwms (nee DvNews) http://www.xmlasp.net/|
|47||1805||Software Default||Wordpress MU http://mu.wordpress.org/|
|48||1672||Software Default||Older DotNetNuke http://www.dotnetnuke.com/|
|49||1569||Unknown||Hard to tell, Asian text I can't read. Possibly: http://www.shaidc.com/|
There isn't much of a conclusion to draw, really. Despite my initial impressions, the Sun and Netscape logos aren't really all that common. On the other hand, I have started noticing the icons that really are common (especially the Plesk logo) as I surf around the web.
Including only the MFA sites that return a valid favicon.ico file, a minimum of 2.03% of the domains on the internet are such sites. Given the limited number of sites with favicons, it turns out that 1 of every 3.19 favicons out there is on an MFA site. However, the survey was initiated with the Sun and Netscape logos in mind. The Sun logo turned up on one of every 121,404 surveyed domains, and the Netscape logo on one of every 259,670. And a whole lot of favicon.ico responses aren't really favicons at all!
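As a sanity check, the one-in-N figures can be reproduced from counts reported earlier in the article. Dividing the number of surveyed domains by each logo's occurrences gives the Sun and Netscape numbers, so those two ratios appear to be per surveyed domain, while the MFA ratio is per downloaded favicon. The variable names below are illustrative, and the MFA domain count is only an estimate derived from the rounded 2.03% figure.

```python
surveyed_domains = 28_044_394
favicons_downloaded = 1_814_959
sun_occurrences = 231
netscape_occurrences = 108

# Per surveyed domain: one Sun / Netscape logo for every N domains.
sun_per_domain = round(surveyed_domains / sun_occurrences)
netscape_per_domain = round(surveyed_domains / netscape_occurrences)

# Per downloaded favicon, the same logos are far less rare.
sun_per_favicon = round(favicons_downloaded / sun_occurrences)
netscape_per_favicon = round(favicons_downloaded / netscape_occurrences)

# MFA: roughly 2.03% of surveyed domains, compared against all favicons.
mfa_domains_estimate = 0.0203 * surveyed_domains
favicons_per_mfa = round(favicons_downloaded / mfa_domains_estimate, 2)

print(sun_per_domain, netscape_per_domain)
print(sun_per_favicon, netscape_per_favicon)
print(favicons_per_mfa)
```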