Wednesday, February 20, 2008

why we're blocking Microsoft Live

While going through server logs I noticed something funny: Microsoft Live has surpassed Google in the number of hits we've received. Weird, eh? I didn't think anyone actually used it.

The terms used by people arriving at our site through search.live.com are just... weird, though. Most are outright vulgar: searches for obscure pornography, celebrity names, drugs, sex aids... Here are a few examples:

65.55.165.36 http://search.live.com/result.aspx?q=valtrex&mrt=en-us&FORM=LVSP /log/trunk Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)
131.107.0.95 http://search.live.com/result.aspx?q=breast+enhancement&mrt=en-us&FORM=LVSP / Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; Win64; x64; SV1)
65.55.165.122 http://search.live.com/result.aspx?q=ferarri&mrt=en-us&FORM=LVSP /browser/media/tutorials Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)

These hits were to small, obscure pages on our site, such as svn changelogs. Then I noticed the source IP addresses fell within just a few IP ranges, so I ran a whois to see if one ISP connected them all:

arc@sobek ~/work/pysoy $ whois 65.55.165.83

OrgName: Microsoft Corp
OrgID: MSFT
Address: One Microsoft Way
City: Redmond
StateProv: WA
PostalCode: 98052
Country: US

NetRange: 65.52.0.0 - 65.55.255.255
CIDR: 65.52.0.0/14
NetName: MICROSOFT-1BLK

You read that correctly. Microsoft, in a desperate attempt to make themselves seem more important, or perhaps just to flood free software project's websites with unwanted traffic, is running bots which act like normal web crawlers. Indeed, over 97% of the hits we got from search.live.com were from Microsoft's own IP subnets. Searching Google, I found this story was previously covered by others more observant of their logs.
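
For anyone who wants to check their own logs, here is a rough sketch of how a share like that could be computed. It is not the script used for the figure above. It assumes each log entry has already been reduced to a (client IP, referrer) pair, and the function names and the fourth sample entry are made up for illustration.

```python
import ipaddress
from urllib.parse import urlparse

# The two Microsoft netblocks named in this post.
MICROSOFT_NETS = [
    ipaddress.ip_network("65.52.0.0/14"),
    ipaddress.ip_network("131.107.0.0/16"),
]


def from_microsoft(ip):
    """Return True if the address falls inside one of the netblocks above."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in MICROSOFT_NETS)


def live_share(entries):
    """Fraction of search.live.com referrals that came from Microsoft's own IPs.

    `entries` is an iterable of (client_ip, referrer) pairs; that shape is an
    assumption, so adapt the parsing to your own log format.
    """
    live_ips = [ip for ip, ref in entries
                if urlparse(ref).hostname == "search.live.com"]
    if not live_ips:
        return 0.0
    return sum(from_microsoft(ip) for ip in live_ips) / len(live_ips)


sample = [
    ("65.55.165.36", "http://search.live.com/result.aspx?q=valtrex&mrt=en-us&FORM=LVSP"),
    ("131.107.0.95", "http://search.live.com/result.aspx?q=breast+enhancement&mrt=en-us&FORM=LVSP"),
    ("65.55.165.122", "http://search.live.com/result.aspx?q=ferarri&mrt=en-us&FORM=LVSP"),
    # Hypothetical outside visitor (documentation address), not from our logs.
    ("192.0.2.10", "http://search.live.com/result.aspx?q=pysoy"),
]

print(live_share(sample))  # 0.75 for this toy sample
```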

In response, I'm adding a special rule to block all future traffic from the offending netblocks, including MICROSOFT 131.107.0.0 - 131.107.255.255 and MICROSOFT-1BLK 65.52.0.0 - 65.55.255.255.
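
The actual rule lives in the server configuration, but the check it performs can be sketched in a few lines of Python. This is only an illustration, assuming the two ranges listed above. `BLOCKED_RANGES`, `to_networks` and `is_blocked` are hypothetical names, not part of our setup.

```python
import ipaddress

# The two ranges from the whois output, as (first, last) address pairs.
BLOCKED_RANGES = [
    ("131.107.0.0", "131.107.255.255"),  # MICROSOFT
    ("65.52.0.0", "65.55.255.255"),      # MICROSOFT-1BLK
]


def to_networks(first, last):
    """Turn a first-last address range into the CIDR networks covering it."""
    return list(ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)))


BLOCKED_NETS = [net
                for first, last in BLOCKED_RANGES
                for net in to_networks(first, last)]


def is_blocked(ip):
    """Return True if the client address is inside a blocked netblock."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETS)


print([str(net) for net in BLOCKED_NETS])  # ['131.107.0.0/16', '65.52.0.0/14']
print(is_blocked("65.55.165.36"))          # True
```

Each range collapses to a single CIDR block, so two deny entries are enough in whatever firewall or web server syntax you use.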

2 comments:

Dan O'H said...

From reading the Microsoft comment, they do give a decent reason for it, i.e. they're identifying sites that are trying to game search engines by sending different pages to bots vs. humans. Being annoyed and blocking them is fine, but I don't think it's reasonable to claim Microsoft are doing it "to make themselves seem more important, or perhaps just to flood free software project's websites with unwanted traffic".

Arc Riley said...

If they were checking to see if people were gaming them, as Google does as well, they wouldn't be claiming to have been referred by search.live.com search queries.

This is statistics skewing, and because they were ignoring robots.txt while doing it, they were hammering our server's CPU in the process. Most of the pages they were hitting were not even in their own index.