I love Aaron's idea to create a community around people "build[ing] a Web of data". There may be potential for a startup to help them out, also...
Well-organized data can be very valuable. When data are so valuable that the companies providing them have revenues in the billions [1], distribution costs are negligible by comparison, and there is plenty of competition in such fields. However, there must be many not-as-valuable data sets whose sale would cover the costs of gathering and organizing them, but whose distribution costs (hosting, coding a payment-processing backend, keeping track of legal issues, etc.) make selling them not worthwhile.
If so, would it make sense to create a website which lets people scrape their own data streams (for example, tagged and organized texts of political speeches from around the world), focusing on the quality of the data, and letting the site take care of hosting and distribution? At the least, it would be a searchable repository of organized data sets. Optimistically, it could be the search engine of the semantic web... ;)
What do you think, ladies and gents?
1. The Thomson Corporation had revenues of $6.6 billion in 2006: http://en.wikipedia.org/wiki/The_Thomson_Corporation
The crucial thing is data quality. You basically have three kinds of public datasets:
1) Academic ones, which are mostly high quality but tend to gather dust and not be kept up to date.
2) High quality commercial datasets, which are expensive and tightly guarded.
3) Free datasets of mostly low quality. Yes, you can use dapper to scrape them and freebase to store them, but what's missing is a process to assure data quality. That's what a community effort could provide or coordinate: something like apache.org for data. And there would have to be a way for non-programmers to help, because with most datasets the programmers are not the ones who know the data best, and the coding can be extremely dull. It's unbelievable how many different ways there are to screw up data and how difficult it is to clean. There's always some manual work left, and you can't (yet) beat a pair of eyes for spotting errors.
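To make concrete how many distinct ways data can be screwed up, here's a toy sketch of automated quality checks over a tiny made-up dataset (all records, field names, and rules are hypothetical, purely for illustration):

```python
# Hypothetical rows as a scraper might produce them. Each one after the
# first hides a different, very common kind of error.
rows = [
    {"country": "Germany", "year": "2006", "gdp_usd": "2.9e12"},
    {"country": "germany", "year": "2006", "gdp_usd": "2.9e12"},  # duplicate, different casing
    {"country": "France",  "year": "06",   "gdp_usd": "2.3e12"},  # inconsistent year format
    {"country": "Italy",   "year": "2006", "gdp_usd": ""},        # missing value
    {"country": "Spain ",  "year": "2006", "gdp_usd": "1,2e12"},  # stray whitespace, comma decimal
]

def check(rows):
    """Return (row_index, problem) pairs found by a few simple rules."""
    problems = []
    seen = set()
    for i, r in enumerate(rows):
        key = r["country"].strip().lower()
        if key in seen:
            problems.append((i, "duplicate country"))
        seen.add(key)
        if r["country"] != r["country"].strip():
            problems.append((i, "untrimmed whitespace"))
        if len(r["year"]) != 4:
            problems.append((i, "bad year format"))
        try:
            float(r["gdp_usd"])
        except ValueError:
            problems.append((i, "unparseable or missing gdp"))
    return problems

for i, msg in check(rows):
    print(i, msg)
```

Rules like these catch the mechanical errors, but they only flag what someone thought to check for; the subtle stuff (a plausible-looking but wrong number, say) is exactly where the pair of human eyes still wins.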
There would also have to be a way for users of datasets to pay a reasonable amount of money to have a particular dataset brought up to a high quality standard.