FavIcon Survey

2008-03-29 16:37 - Tech

Back in 2005, Google released their Web Authoring Statistics survey. It covered a wide variety of interesting topics, from HTML elements and attributes to HTTP headers and beyond. The analysis was wonderfully in-depth, with a self-reported sample set of a billion documents. I truly enjoyed reading through this data the first time I found it, and found myself starting to wish that I could run a similar survey. But of what?

A few years later, I found Andrew Wooster's robots.txt adventure, a similar survey. This time, however, it was an analysis of the contents of the ubiquitous robots.txt file, again with interesting and entertaining results. I still wanted to do something like it. I still didn't know what to survey.

Then, in January of 2008, there was a funny post to reddit about the favicon on panasonic.com. It was quickly fixed after the post, but at the time, Panasonic had the Internet Explorer icon as their favicon. Comments on the post quickly pointed out the equally humorous frequency with which the Netscape and Sun logos turn up: they are apparently provided as default favicons with those companies' HTTP server software, and then get left on public sites for all the world to see.

Then I had it! It's not nearly as meaty, but it might be the only one left. I decided to survey the frequency of common favicons, across the web.

Please note: The various links scattered through this article worked at the time of the survey. That doesn't mean they'll still work forever, so take them with a grain of salt.

The Process

I needed to start with a big list of sites. It's a complicated topic, but for this purpose, I've decided to equate "domain name" with "site". I ended up with a list of all .com, .net, .org, and .info domains, totaling 83,670,574 sites. (Source) I split the data up into about 350 chunks, then interleaved the chunks to equalize the progress. The result was about 350 chunks containing line 1 from each source chunk, then line 2 from each source chunk, and so on, chunked separately.
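The survey's own scripts are not shown here, so the following is only a minimal sketch of the split-then-interleave idea described above. The helper names `split_chunks` and `interleave` are hypothetical, and the tiny domain list stands in for the real 83-million-line file.

```python
from itertools import zip_longest

def split_chunks(lines, n_chunks):
    """Split lines into at most n_chunks contiguous chunks of near-equal size."""
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division, never zero
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def interleave(chunks):
    """Yield line 1 of every chunk, then line 2 of every chunk, and so on."""
    missing = object()  # marker for chunks that have run out of lines
    for row in zip_longest(*chunks, fillvalue=missing):
        for item in row:
            if item is not missing:
                yield item

# Illustrative data only; the real input was a list of registered domains.
domains = ["a.com", "b.com", "c.com", "d.net", "e.net", "f.net", "g.org"]
source_chunks = split_chunks(domains, 3)
work_chunks = split_chunks(list(interleave(source_chunks)), 3)
```

Each work chunk now draws from every part of the original list, so stopping early still leaves a sample spread across the whole list rather than only its beginning.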

Then I started downloading! For each domain, I first tried to fetch http://example.com/favicon.ico and, failing that, http://www.example.com/favicon.ico . An attempt that failed to resolve in DNS, or that did not return an HTTP 200 code, counted as a failure. If both attempts failed, the failure was simply recorded.
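As a rough illustration of that fallback logic (not the script actually used), here is a sketch built on Python's standard `urllib`. The function names, the result dictionary, the use of status `0` for "no HTTP response at all", and keeping the last status when both attempts fail are all assumptions made for the example.

```python
import http.client
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # do not follow 3xx; it surfaces as an HTTPError instead

def http_fetch(url, timeout=10):
    """Return (status, content_type, body); status 0 means no HTTP response."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status, resp.headers.get("Content-Type", ""), resp.read()
    except urllib.error.HTTPError as err:  # any non-2xx status, including 3xx
        return err.code, err.headers.get("Content-Type", ""), b""
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        return 0, "", b""  # DNS failure, refused connection, timeout, ...

def probe(domain, fetch=http_fetch):
    """Try the bare domain first, then the www. host; succeed only on HTTP 200."""
    status = 0
    for host in (domain, "www." + domain):
        status, content_type, body = fetch("http://%s/favicon.ico" % host)
        if status == 200:
            return {"domain": domain, "ok": True, "status": status,
                    "content_type": content_type, "body": body}
    # Both attempts failed; keeping the last status seen is an assumption.
    return {"domain": domain, "ok": False, "status": status,
            "content_type": "", "body": b""}
```

Passing `fetch` as a parameter keeps the fallback logic separate from the network code, so it can be exercised with a stand-in function before being pointed at millions of real domains.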

For each success, the favicon.ico file was downloaded and saved. It was stored under its SHA-1 hash, to establish global "uniqueness" of an actual image. This process repeated around twenty million times.
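Storing a file under its own hash means identical icons collapse into a single file automatically. A minimal sketch of the idea, assuming a flat output directory and a hypothetical `store_favicon` helper:

```python
import hashlib
from pathlib import Path

def store_favicon(body, directory):
    """Write body to <directory>/<sha1 hex digest> and return the digest."""
    digest = hashlib.sha1(body).hexdigest()
    path = Path(directory) / digest
    if not path.exists():  # the same bytes were already stored once
        path.write_bytes(body)
    return digest
```

Recording the returned digest next to each domain is enough to count later how many sites share one image.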

The Data

I ended up downloading 1,814,959 favicons, 554,552 of them unique, from 28,044,394 domains. I skipped 55,626,180 domains in my survey. This means, assuming a linear scale, I surveyed 33.52% of the domains on the 'net. Multiply by three (or, to be too precise, 2.9835044394255763) to find the "real" global frequency.
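The coverage and scale figures follow directly from the domain counts quoted in this article. A quick check, with variable names chosen just for this example:

```python
total_domains = 83_670_574  # all .com, .net, .org and .info domains in the list
surveyed = 28_044_394
skipped = 55_626_180

# The surveyed and skipped domains account for the whole list.
assert surveyed + skipped == total_domains

coverage = surveyed / total_domains  # fraction of the list that was surveyed
scale = total_domains / surveyed     # factor to extrapolate survey counts

print("coverage: %.2f%%" % (coverage * 100))
print("scale factor: %.4f" % scale)
```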

Favicon Presence

First, the HTTP response codes from the entire survey:

Code  Frequency
404   11357338
200   6959148
302   6662373
400   1943682
0     1082346
301   487433
410   229701
500   118099
403   105561
401   33870
303   26803
503   17415
406   11142
307   2891
501   2074
502   718
999   607
300   495
304   418
202   393
504   114
412   77
419   74
402   64
204   37
490   28
508   27
550   13
405   9
509   7
407   6
201   6
470   4
205   4
416   4
510   3
408   2
450   1
499   1
409   1
666   1
3     1
420   1

The standard long tail. It's slightly interesting to note that the 5th-most-frequent value is "0"; in other words, the domain doesn't have a web server answering at all. That's 3.73% of the domains with no website. We also see one "clever" server returning HTTP/1.1 666 OK, along with HTTP/1.1 999 No Hacking and other unusual values down at the end of the tail. Some of them are valid, just extremely rare, like HTTP/1.1 409 Conflict and HTTP/1.1 201 OK.

Of slightly interesting note: although I made sure to record the response code as a string, just in case, all responses came with an integer, as expected.

It's hard to really draw much meaning from that list, though, so let's group it together.

Code  Frequency  Percent
4xx   13681566   47.11%
3xx   7180413    24.72%
2xx   6959588    23.96%
0     1082347    3.73%
5xx   138470     0.48%
9xx   607        0.00%
6xx   1          0.00%
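The grouped totals can be reproduced from the per-code frequencies listed earlier. The sketch below assumes the grouping is integer division of the code by 100, which is a guess, though it does reproduce every total in the grouped table. It also reads the per-code row printed as "31" as code 3 with a frequency of 1, an assumption that accounts for the "0" group being one larger than the "0" row.

```python
from collections import Counter

# Per-code frequencies from the survey: {status code: frequency}.
CODES = {
    404: 11357338, 200: 6959148, 302: 6662373, 400: 1943682, 0: 1082346,
    301: 487433, 410: 229701, 500: 118099, 403: 105561, 401: 33870,
    303: 26803, 503: 17415, 406: 11142, 307: 2891, 501: 2074,
    502: 718, 999: 607, 300: 495, 304: 418, 202: 393,
    504: 114, 412: 77, 419: 74, 402: 64, 204: 37,
    490: 28, 508: 27, 550: 13, 405: 9, 509: 7,
    407: 6, 201: 6, 470: 4, 205: 4, 416: 4,
    510: 3, 408: 2, 450: 1, 499: 1, 409: 1,
    666: 1, 3: 1, 420: 1,  # "3: 1" is an assumed reading of the row "31"
}

def group_by_class(codes):
    """Sum frequencies per hundreds class, e.g. 404 and 410 both go to '4xx'."""
    groups = Counter()
    for code, freq in codes.items():
        klass = code // 100
        groups["0" if klass == 0 else "%dxx" % klass] += freq
    return groups

groups = group_by_class(CODES)
total = sum(groups.values())
for label, freq in groups.most_common():
    print("%-4s %10d %6.2f%%" % (label, freq, 100 * freq / total))
```

The percentages come out relative to the sum of all recorded responses, 29,042,992, rather than to the number of surveyed domains.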

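So, 76.04% of responses were something other than a 2xx, meaning those sites do not have a favicon file by my definition, which includes not following 3xx redirections. The remaining 2xx responses number 6,959,588. Out of the 28,044,394 domains surveyed, that means 24.82% of sites do have a favicon. (The table's percentages are taken over all 29,042,992 recorded responses, which is why its 2xx row reads 23.96% rather than 24.82%.)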

You might notice, however, that I said above that I downloaded 1,814,959 favicons, significantly fewer than 6,959,588. That's because I filtered on both response code and content type. Let's take a look at the content type of those 200 code responses.

Major Type   Minor Type                Frequency
*            *                         677
applicaion   octet-stream              145
application  octet-stream              302371
application  x-httpd-php               36
application  x-icon                    23
application  x-trash                   13
application  x-ico                     11
application  octet_stream              10
application  x-macbinary               9
application  force-download            8
application  binary                    7
application  x-shockwave-flash         7
application  x-backup                  7
application  zip                       6
application  ico                       5
application  x-zope-edit               5
application  x-stuffit                 4
application  bin                       3
application  icon-library              3
application  pdf                       3
application  unknown                   3
application  xhtml+xml                 2
application  xml                       2
application  rss+xml                   2
application  vnd.wap.xhtml+xml         2
application  x-empty                   2
application  x-executable-file         2
application  x-not-regular-file        1
application  x-zip-compressed          1
application  octet-scream              1
application  postscript                1
application  vcard                     1
application  x-httpd-php5              1
audio        unknown                   3
audio        mpeg                      2
content      unknown                   8
file         ico                       8
gradpics     jpg                       1
ico          *                         1
imade        x-icon                    1
image        x-icon                    2530980
image        vnd.microsoft.icon        152167
image        gif                       6331
image        bmp                       3311
image        x-ico                     2563
image        jpeg                      1922
image        ico                       610
image        png                       474
image        icons                     163
image        x-xbm                     95
image        icon                      85
image        xicon                     28
image        x                         22
image        pjpeg                     11
image        tiff                      11
image        x-ms-bmp                  11
image        x-                        7
image        favicon                   6
image        vnd.microsoft             6
image        x-portable-pixmap         4
image        *                         4
image        x-icor                    4
image        x-os2-icon                2
image        x-png                     2
image        x-icon.ico                2
image        x-photoshop               1
image        x-rgb                     1
image        image/vnd.microsoft.icon  1
image        x-win-bitmap              1
image        pcx                       1
image        xx-icon                   1
image        rv                        1
image        x-bmp                     1
image        a-icon                    1
image        application               1
images       x-icon                    10
images       ico                       2
message      rfc822                    6
msie         unknown-type              1
non          non                       5
text         html                      2879503
text         plain                     933002
text         vnd.wap.wml               46
text         x-unknown-content-type    26
text         xml                       14
text         css                       9
text         x-icon                    1
text         x-perl                    1
text         ico                       1
text         plainmutascu.com          1
text         plainwebalizer            1
text         unknown                   1
unknown      nmmm                      15
video        x-ms-asf                  103
video        unknown                   5
video        mpeg                      1
video        x-ms-wmv                  1
www          unknown                   166
x            unknown                   7
xxx          yyy                       1

Who would have guessed it would be such a giant list? No surprise, following the general trend, the number one value is invalid: text/html, with 2,879,503, covers 42.25% of the 200 OK responses. Thankfully, image/x-icon comes in second with 2,530,980, or 37.14%. The total of all possible image types, counting misspellings and border cases, makes up 3,001,452, or 44.04%. Less than half of the servers that send back a "favicon.ico" response actually do so with something that is (or might be) an image. Even then, many of them are not icon files.
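A tally like the one above starts with splitting each Content-Type header into its major and minor parts. The following is a sketch only: dropping parameters such as `charset`, lowercasing, and splitting on the first slash are assumptions, not a description of how the survey data was actually processed.

```python
from collections import Counter

def split_content_type(header):
    """Return (major, minor) from a Content-Type header, ignoring parameters."""
    media_type = header.split(";", 1)[0].strip().lower()
    major, _, minor = media_type.partition("/")  # split on the first slash only
    return major, minor

def tally(headers):
    """Count how often each (major, minor) pair occurs."""
    return Counter(split_content_type(h) for h in headers)

# Illustrative headers, not survey data.
sample = ["image/x-icon", "text/html; charset=UTF-8", "IMAGE/X-ICON", "*/*"]
counts = tally(sample)
```

Splitting on the first slash only is what lets a malformed value such as `image/image/vnd.microsoft.icon` land in the `image` major type with the remainder as its minor type.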

Big Icons

The single largest was an 82,895,006-byte tarball of (apparently) the entire site's contents. (Though at that size my script often failed to download, so there might be a larger file out there.)

The largest valid image file was 7,047,018 bytes. Sadly, as an Adobe Photoshop file, very few browsers out there are going to display it correctly. It was one of 20 Photoshop files in the top 100 by size. The largest valid image file that a browser will successfully display was 3,802,526 bytes. It's a confusing mish-mash of random white and yellow pixels, with a dash of red thrown in. The largest really valid file (an actual icon file, 16x16 in size) was 3,463,616 bytes.
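One way to test for an "actual icon file, 16x16 size" is to read the ICO directory at the start of the file: a 6-byte header (reserved, type, image count, all little-endian 16-bit values) followed by one 16-byte entry per image whose first two bytes are width and height. This sketch checks only that directory, not the pixel data, and the function names are made up for the example. It is not necessarily the check used for the survey.

```python
import struct

def ico_sizes(data):
    """Return [(width, height), ...] from an ICO directory, or None if not ICO."""
    if len(data) < 6:
        return None
    reserved, kind, count = struct.unpack_from("<HHH", data, 0)
    if reserved != 0 or kind != 1 or count == 0:  # type 1 marks an icon
        return None
    if len(data) < 6 + 16 * count:  # directory is truncated
        return None
    sizes = []
    for i in range(count):
        width, height = struct.unpack_from("<BB", data, 6 + 16 * i)
        sizes.append((width or 256, height or 256))  # a stored 0 means 256
    return sizes

def has_16x16(data):
    """True if the data looks like an ICO file containing a 16x16 image."""
    sizes = ico_sizes(data)
    return bool(sizes) and (16, 16) in sizes
```

A Photoshop file or a tarball served as favicon.ico fails the header check immediately, no matter what Content-Type the server claimed.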

There was an amazing assortment in the "big icons" set of images made of random noise, 16 of the top 100: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 . I have no idea why this is so common.

Some people just put entire photos, or large artwork, in the favicon's place. Things that clearly don't belong: 1 (2244x1530 pixels) 2 (2514x2112 pixels) 3 (6850x7087 pixels) 4 (675x911 pixels) 5 (1200x1600 pixels) 6 (562x583 pixels) 7 (640x466 pixels) 8 (450x493 pixels) 9 (1100x1200 pixels) 10 (528x528 pixels) 11 (468x572 pixels) 12 (583x445 pixels) 13 (720x348 pixels) 14 (600x600 pixels) 15 (538x403 pixels) 16 (501x417 pixels) 17 (1600x1200 pixels) . These, again, are just out of the top 100 by (byte) size.

Common Icons, By Site Type

I went into this project thinking of the Sun and Netscape logos, and expected to find those as common icons. I did, but there was a whole bunch of other, more common, icons out there.

The Sun logo came in at #262 with 231 occurrences. The Netscape logo came in at #468 with 108 occurrences.

So what was everything that was so much more common? I tracked the top 50 most common icons and figured out what each one was and what kind of sites it appeared on. I came up with the following categories (listed most to least common among the top 50):

MFA
Made For Ads. AKA "made for adsense". A site which is clearly automatically generated from a small set of keywords. Often, a really lame set of search results.
Host Default
The default icon left behind by a particular web host provider. Apparently, some hosts choose to set this up (intentionally?) and some users never change/override it. Sometimes, a frameset wrapping a page not really hosted at the domain is the culprit.
Software Default
Various software packages provide a default icon, and again, users don't always replace it.
Service Provider
Some companies provide a service which is a sort of automatic, or generally low effort, site set up for you. I saw, for example, real estate services, providing a simple page for agents, and Yahoo! Stores leave the "Y!" icon showing.
Real Content
Real sites, with some kind of real content. This ended up often being just a transparent image, which a bunch of separate sites individually chose.
Mass Content
Occasionally, this was a bone-headed move by a company thinking it's a great idea to register all the domain names they can come up with, and host the same site (with minor content differences) at each one. Sometimes, it really was just mass hosting of the exact same site on many different domains. I couldn't rightly call this "real" content.
Parking
An unused domain can be "parked" so that the owner keeps it. The registrar usually offers this feature.

Type              Frequency  Percent
MFA               569323     56.26%
Host Default      185420     18.32%
Service Provider  104353     10.31%
Software Default  100682     9.95%
Real Content      16972      1.68%
Mass Content      15134      1.50%
Unknown           10705      1.06%
Parking           9423       0.93%

Common Icons

The 50 most common favicon.ico files:

Rank Frequency Type Notes
1 252744 MFA Hitfarm.com http://www.hitfarm.com/ or their customers ALL IP 72.51.27.51
2 157513 MFA "MDNH, Inc." or "MYDOMAIN, INC." or "Marchex" or more? http://www.marchex.com/
Including "OpenList" with many zip codes registered, i.e. http://12182.net/
3 104766 MFA DIRECTNIC.COM (?). ALL IP 66.165.42.3 . That IP sends you to: http://dotzup.com/
4 80110 Host Default Bluehost http://www.bluehost.com/
5 51957 Host Default Host Monster http://www.hostmonster.com/
6 39681 Software Default Plesk http://www.parallels.com/plesk/
An old version of Plesk's favicon. See: http://plesk21.net/
7 36508 Service Provider Yahoo! http://www.yahoo.com/ -- Mostly Yahoo! Stores
8 28781 Host Default Google Pages http://pages.google.com/
9 22848 MFA ALL IP 72.35.4.17 . Not sure who. http://72.35.4.17/
10 20646 Service Provider ALL in 66.210.173.0/24
Tagline: "Another XSite by a la mode, inc." (http://www.alamode.com/) Also: http://www.xsitesnetwork.com/
11 16972 Real Content  
12 13482 Service Provider ALL IP 64.141.48.209 .
They all say at the bottom:
"Powered by Point2 Real Estate Websites (http://nls.point2.com/)
The Point2 Homes Real Estate Network (http://homes.point2.com/)"
13 13272 Software Default DotNetNuke http://www.dotnetnuke.com/
14 12174 Service Provider http://www.superpages.com/
They're all in 192.31.222.0/24 except http://alamarbythesea.com/ .
15 11660 Software Default Plesk default icon again, another (older?) version. See, for example: http://artwooooooooorker.net/
16 11291 Host Default http://www.netidentity.com/
Provides sites (at subdomains) and emails, at many domains. Seems to always be MFA inside the ugly frame.
17 10597 MFA http://www.whypark.com/
Weird algorithmic "content" for the most part.
18 9595 Service Provider Blogger http://www.blogger.com/
19 9136 Unknown 133 for 420 sites
They're (almost) all 403 forbidden or "this page has been suspended", but some real sites it seems.
20 8211 Host Default dot mac http://www.apple.com/dotmac/
ALL IP 17.250.248.34 except http://www.mcvsd.org/
21 8205 MFA http://www.ddc.com/ Domain Development Corp
ALL IP 64.255.172.50 but one
22 7584 Parking http://net4.in/ ALL IP 202.71.128.225
23 7524 Mass Content "WN Network" http://www.wn.com/ ALL IP 195.149.84.100
24 5620 Mass Content Cities Unlimited http://www.citiesunlimited.com/
25 5486 Software Default Lotus Domino
Ref: http://media.arstechnica.com/journals/apple.media/lotus_icon.jpg
26 4675 MFA Seems to be http://webinceptions.com/
Algo/RSS based "content", but MFA nonetheless.
27 4547 Software Default Xoops http://www.xoops.org/
28 3866 Service Provider http://www.weddingtracker.com/
ALL IP 216.127.53.0/24 -- spread out in that class C
29 3722 MFA Guess: first-clickonline http://first-clickmedia.com/ http://www.searchbizonline.com/
ALL IP 208.52.146.19
30 3603 Software Default VBulletin http://www.vbulletin.com/
31 3401 Software Default Plesk http://www.parallels.com/plesk/
32 3149 Software Default http://www.discuz.net/
33 3071 Service Provider http://www.reynoldswebsolutions.com/
34 2894 Service Provider I think? http://innuity.com/
35 2832 Host Default http://myhosting.com/
36 2729 Software Default http://e107.org/
37 2420 Software Default cPanel http://www.cpanel.net/
38 2358 MFA Independent/multi? All in 64.182.149.0/24 all the same content
39 2238 Host Default http://www.active24.cz/
40 2117 Service Provider E-zekiel http://e-zekiel.com/
41 1990 Mass Content Some sort of astroturf, I guess. WHOIS all report something related to http://www.cleanenergypublications.com/
42 1967 Software Default Confixx control panels
43 1906 Software Default Drupal http://drupal.org/
44 1895 MFA Well masked. Not sure. Appears to be the registrar http://www.siteurl.com/ snapping up domains. Almost all at IP 66.79.189.124 .
45 1839 Parking http://www.forpsi.com/
46 1825 Software Default iwms (nee DvNews) http://www.xmlasp.net/
47 1805 Software Default Wordpress MU http://mu.wordpress.org/
48 1672 Software Default Older DotNetNuke http://www.dotnetnuke.com/
49 1569 Unknown Hard to tell, asian text I can't read. Possibly: http://www.shaidc.com/
50 1559 Software Default http://www.supesite.com/
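A ranking like this one falls out of counting how many domains share each stored hash. As a sketch, assuming the survey results are available as (domain, SHA-1 hex digest) pairs (the record layout and the `rank_icons` name are hypothetical):

```python
from collections import Counter

def rank_icons(records, top=50):
    """records: iterable of (domain, sha1_hex). Return [(rank, sha1_hex, count)]."""
    counts = Counter(digest for _domain, digest in records)
    return [(rank, digest, n)
            for rank, (digest, n) in enumerate(counts.most_common(top), start=1)]

# Illustrative records only.
records = [("a.com", "h1"), ("b.com", "h1"), ("c.com", "h2")]
ranking = rank_icons(records, top=2)
```

Working out what each of the top icons actually is, and which category it belongs to, remains manual work.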

Conclusion

There isn't much conclusion to make, really. Despite my initial impressions, the Sun and Netscape logos aren't really all that common. On the other hand, I have started noticing the icons that really are common (especially the Plesk logo) as I surf around the web.

Including only the MFA sites that return a valid favicon.ico file, a minimum of 2.03% of the domains on the internet are such sites. Given the limited number of sites with favicons, it turns out that 1 of every 3.19 favicons out there is on an MFA site. However, the survey was initiated with the Sun and Netscape logos in mind. The Sun logo appears on one of every 121,404 surveyed domains, and the Netscape logo on one of every 259,670. And a whole lot of them aren't really favicons at all!
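These ratios can be reproduced from figures quoted earlier in the article. In this check, note that the Sun and Netscape ratios come out when the surveyed domain count is divided by the number of occurrences, and that the MFA count covers only the top 50 icons, hence "a minimum". The variable names are made up for the example.

```python
surveyed_domains = 28_044_394
favicons_downloaded = 1_814_959
mfa_sites = 569_323  # MFA total among the top 50 icons
sun_logo = 231       # occurrences of the Sun logo
netscape_logo = 108  # occurrences of the Netscape logo

mfa_share_of_domains = 100 * mfa_sites / surveyed_domains
favicons_per_mfa_site = favicons_downloaded / mfa_sites
domains_per_sun_logo = surveyed_domains // sun_logo
domains_per_netscape_logo = surveyed_domains // netscape_logo

print("MFA share of surveyed domains: %.2f%%" % mfa_share_of_domains)
print("one MFA favicon per %.2f favicons" % favicons_per_mfa_site)
print("one Sun logo per %d domains" % domains_per_sun_logo)
print("one Netscape logo per %d domains" % domains_per_netscape_logo)
```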
