About five years ago, every keynote I attended seemed to be given by an evangelist from one of the big Web companies: Google, Yahoo!, Twitter, Facebook and the like. At each of these, the general sentiment seemed, to me, to be that they had the crawl data, they could do the big analysis, they were ‘big data’, and look at all the cool things they could tell from access to that data. Of course, they’d throw us researchers a ‘sop’ in the interests of being good Web citizens, with Twitter offering 15% of their tweets as a corpus for research, but the feeling among the researchers I spoke with was that we could never compete against the sheer mass of data crawled (and therefore, it seemed, owned) by the big guys!
Well, thankfully, we were to some extent wrong. Open source at the data level rides again with the Common Crawl!
The Common Crawl data set contains approximately 6 billion web documents stored on a publicly accessible, scalable computer cluster. Here is some more information on the content and storage of the data set.
Of course, the Common Crawl won’t solve all problems. It won’t let us look at user interactivity or user behaviour, as we’d need a search engine and tracking for that. But it is a very good start and, who knows, we may get behaviour data in the future via some means not yet considered.
Either way, before the Common Crawl we thought only the big commercial engines would have access to crawl data at this scale. The Common Crawl teaches us that this is not the case, and that there is always hope. Well done to the Common Crawl!
Common Crawl Foundation is a California 501(c)(3) registered non-profit founded by Gil Elbaz with the goal of democratizing access to web information by producing and maintaining an open repository of web crawl data that is universally accessible and analyzable.
As the largest and most diverse collection of information in human history, the web grants us tremendous insight if we can only understand it better. For example, web crawl data can be used to spot trends and identify patterns in politics, economics, health, popular culture and many other aspects of life. It provides an immensely rich corpus for scientific research, technological advancement, and innovative new businesses. It is crucial for our information-based society that the web be openly accessible to anyone who desires to utilize it.