To establish a relationship with Common Crawl, I first studied the organization's structure, services, and community activities. After a status report, I will summarize the findings that led to my approach.
So far I have:
- subscribed to their mailing list
- exchanged emails with a Berkeley MIMS alumnus (Dave Lester) who interned at Common Crawl last year and is still listed on their "Volunteer" page. He answered some questions about the organization and introduced me by email to their Director.
- emailed the Director of Common Crawl on 9/16, offering some possible plans and asking if we might talk about my contributing (while using them as a case study for this class).
My communication so far has been private, since their website suggests that volunteers and interns email Lisa Green directly. They are not a typical "open source software" community, since they provide a specific web crawl dataset as a service.
Here are the differences between Common Crawl and typical open source software (these informed my approach):
They have discussion forums and post all of their code on GitHub, but the forums are not very active and there is little transparent communication about how the dataset is produced, possibly because production is handled mainly by one or two internal people. Their dataset is therefore not community-produced, but produced internally by staff and volunteers (who may monitor the discussion forums). Not all management processes and decisions are made public. The organization is a 501(c)(3) nonprofit.
Their datasets over the past few years consist mainly of annual web crawls, which seem quite comprehensive; the latest is 200+ TB.
Their community seems to consist of entrepreneurs and researchers. The main contributions have been explanations of how the datasets have been used (with some analytic overview) and a few GitHub postings of data analysis code (mainly Java). I'm still taking inventory of their internally released and community-released code.
The repositories shared on GitHub each seem to have one contributor (except for the organization's own sample code), and they do not seem to have been forked. So the code is released and usable, but it is not the "living, evolving, community" codebase typical of open source software.
The organization itself may not be the main user of the code, which is unusual: open source software is usually produced by people who use it themselves. Their mission is to encourage research, entrepreneurship, and any other kind of innovation through others' use of their datasets.
Considerations about the organization (more soon...):
- Since my contributions may involve private emails, I will need to clarify with them that I would like to publish some of these materials. To honor their privacy, I may have to set up a process whereby they clear my working communications with them for publication, possibly on a case-by-case basis.
Interesting opportunities for the organization (more soon...):
- This year, Professor Jim Hendler joined the advisory board. He has a long history with open data initiatives for government. Perhaps their dataset could be even more valuable in combination with the government datasets that are being released (possibly in partnership with other open data communities).
MORE SOON: Ways that I might contribute (will update based on my solicitation email to Lisa Green):
read by Thomas 11/6/2013