NEWS

AOL Goofup Leads to Google Keywords Leak

August 07, 2006
Suyog

In what must be a stunning development for the web, AOL in its foolhardiness (depending on how you look at it) has released a research paper with data on the search queries made on AOL by 650,000 users, resulting in a data set of nearly 20 million search terms.

According to AOL:

500k User Session Collection

This collection is distributed for NON-COMMERCIAL RESEARCH USE ONLY.
Any application of this collection for commercial purposes is STRICTLY PROHIBITED.

Brief description:

This collection consists of ~20M web queries collected from ~650k users over three months.

The data is sorted by anonymous user ID and sequentially arranged.

The goal of this collection is to provide real query log data that is based on real users. It could be used for personalization, query reformulation or other types of search research.

Why is this important?

Because AOL uses Google as its search engine. This means anyone who lays hands on this data potentially has nearly 20 million queries to work with to come up with a list of the top keywords people are searching for!
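To make the "top keywords" idea concrete, here is a minimal sketch of how such a list could be tallied. The tab-separated layout (anonymous user ID, then query), the sample rows, and the name top_queries are all assumptions made up for illustration - they are not taken from the AOL release itself.

```python
from collections import Counter

# Hypothetical sample in an ASSUMED "user ID <tab> query" layout.
# These rows are invented; they are not from the AOL data.
SAMPLE_LOG = (
    "101\tcheap flights\n"
    "101\tcheap flights to goa\n"
    "202\tbollywood news\n"
    "303\tcheap flights\n"
    "202\tBollywood News\n"
)


def top_queries(log_text, n=3):
    """Return the n most frequent queries, ignoring case and outer spaces."""
    counts = Counter()
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        _user_id, query = line.split("\t", 1)
        counts[query.strip().lower()] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    for query, count in top_queries(SAMPLE_LOG, n=2):
        print(count, query)
```

On a real log the same loop would simply read the file line by line instead of a string, but the counting step would stay the same.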

Imagine the possibilities - anyone can mine the search terms in this data and build websites targeting those queries. It's obvious that Google is now going to be majorly pissed at AOL for leaking this data. But what is more concerning is the way AOL has released this keyword list, with too little concern for privacy. (Links on this below.)

I will be keeping tabs on this story as it develops and as people come up with more analysis of this data and general views on the subject.

More resources on this huge development:

Forums.Digitalpoint.com is carrying a thread on this topic, and naturally everyone is stunned at being able to access this goldmine.

Plentyoffish.wordpress.com has already begun to analyze the search data and is posting the results. Some of the queries are downright disturbing. Specific posts include "Aol data shows users planning to commit murder", "AOL data showing Myspace growing SEO spam" and "Myspace killing dating sites".

Head over to http://www.gregsadetsky.com/aol-data/ to find links to download this huge 500 MB behemoth of a file.

Techcrunch is on top of the story as well, and echoes my sentiments on why this is a huge concern:

The most serious problem is the fact that many people often search on their own name, or those of their friends and family, to see what information is available about them on the net. Combine these ego searches with porn queries and you have a serious embarrassment. Combine them with "buy ecstasy" and you have evidence of a crime. Combine it with an address, social security number, etc., and you have an identity theft waiting to happen. The possibilities are endless.
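The worry about combining searches comes down to one simple step: grouping queries by the anonymous ID. The sketch below shows that step on invented rows; the IDs, the queries and the name queries_by_user are hypothetical and only illustrate why a random number does not hide much once all of a user's searches sit side by side.

```python
from collections import defaultdict

# Invented (anonymous ID, query) pairs for illustration only.
# None of this comes from the AOL release.
SAMPLE_ROWS = [
    (1001, "jane doe springfield"),
    (1001, "pizza delivery elm street"),
    (2002, "bollywood news"),
    (1001, "jane doe"),
]


def queries_by_user(rows):
    """Collect every query under the anonymous ID that issued it."""
    grouped = defaultdict(list)
    for user_id, query in rows:
        grouped[user_id].append(query)
    return dict(grouped)


if __name__ == "__main__":
    for user_id, queries in queries_by_user(SAMPLE_ROWS).items():
        print(user_id, queries)
```

Read together, the three searches under the made-up ID 1001 already suggest a name and a street, which is exactly the kind of combination the quote above describes.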

It will be interesting to see how Google and AOL react to this leak. I will also be posting updates on my blog here.

I began writing online around 6 years ago, when I earned 1 Rs per article; unfortunately that website went bust. I wrote for another website, whose hypocritical policies drove me away. I have now taken to blogging at http://suyogdeshpande.net/blog - my personal blog - and recently began two new projects: nokjhok.com, where I ramble on Bollywood, and techb.org, where I write about Technology, my other interest. Yeah I know, I have nothing better to do :D

Comments! Feedback! Speak and be heard!

#1
Aaman
August 7, 2006
01:05 PM

Google has a display in their main office that shows random selections of searches people are doing from around the world - some real weird stuff - will be interesting to see what's in these logs.

How does AOL benefit from releasing these? Is there an expectation that search queries be kept private? I understand the IP addresses need to be private, but why the queries, per se?

#2
Suyog
August 7, 2006
01:09 PM

@Aaman: I echo what techcrunch says about this:

"The utter stupidity of this is staggering. AOL has released very private data about its users without their permission. While the AOL username has been changed to a random ID number, the ability to analyze all searches by a single user will often lead people to easily determine who the user is, and what they are up to. The data includes personal names, addresses, social security numbers and everything else someone might type into a search box.

The most serious problem is the fact that many people often search on their own name, or those of their friends and family, to see what information is available about them on the net. Combine these ego searches with porn queries and you have a serious embarrassment. Combine them with "buy ecstasy" and you have evidence of a crime. Combine it with an address, social security number, etc., and you have an identity theft waiting to happen. The possibilities are endless."

Actually AOL took it down after a while - they no longer have the paper and data on their website. However, the internet being the internet, it's all over the place.

Suyog

#3
Aaman
August 8, 2006
04:29 AM

Some dude has provided a web interface to query the AOL Search Database - spiffy, and interesting...

#4
Intrepid
August 9, 2006
04:28 PM

I did a brief analysis at Everyday Entrepreneurs.

#5
Sujatha
August 17, 2006
09:35 AM

ok, here's a rather stupid question (I'm assuming it's stupid because better technological minds than mine are raising a stink about the invasion of privacy): From this release can anyone tell who is searching for what? The article mentions that this search term release is potentially embarrassing for some internet users. You mention ego searches plus porn as being embarrassing. But how can anyone tell that both searches originated from the same person? I thought AOL did not release identifying information.

#6
Aaman
August 17, 2006
09:46 AM

The User ID (internal number, not AOL username) is part of the data, and many people do ego-surf, so it's likely some research can pull up the actual person doing the search.
