I love Aaron's idea to create a community around people "build[ing] a Web of data". There may be potential for a startup to help them out, also...
Well-organized data can be very valuable. When data are so valuable that the companies providing them have revenues in the billions [1], distribution costs are negligible by comparison, and there is plenty of competition in such fields. However, there must be many not-as-valuable data sets whose sale would cover the costs of gathering and organizing them, but whose distribution costs (hosting, coding a payment-processing backend, keeping track of legal issues, etc.) make selling them not worthwhile.
If so, would it make sense to create a website which lets people scrape their own data streams (for example, tagged and organized texts of political speeches from around the world), focusing on the quality of the data, and letting the site take care of hosting and distribution? At the least, it would be a searchable repository of organized data sets. Optimistically, it could be the search engine of the semantic web... ;)
What do you think, ladies and gents?
1. The Thomson Corporation had revenues of $6.6 billion in 2006: http://en.wikipedia.org/wiki/The_Thomson_Corporation
The crucial thing is data quality. You basically have three kinds of public datasets:
1) Academic ones, which are mostly high quality but tend to gather dust and not be kept up to date.
2) High quality commercial datasets, which are expensive and tightly guarded.
3) Free datasets of mostly low quality. Yes, you can use dapper to scrape them and freebase to store them, but what's missing is a process to assure data quality. That's what a community effort could provide or coordinate: something like apache.org for data. And there would have to be a way for non-programmers to help, because with most datasets the programmers are not the ones who know the data best, and the coding can be extremely dull. It's unbelievable how many different ways there are to screw up data and how difficult it is to clean. There's always some manual work left, and you can't (yet) beat a pair of eyes for spotting errors.
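To make concrete how many distinct ways data can be screwed up, here's a toy sketch of automated quality checks over a tiny made-up dataset (all records, field names, and rules are hypothetical, purely for illustration):

```python
# Hypothetical rows as a scraper might produce them. Each one after the
# first hides a different, very common kind of error.
rows = [
    {"country": "Germany", "year": "2006", "gdp_usd": "2.9e12"},
    {"country": "germany", "year": "2006", "gdp_usd": "2.9e12"},  # duplicate, different casing
    {"country": "France",  "year": "06",   "gdp_usd": "2.3e12"},  # inconsistent year format
    {"country": "Italy",   "year": "2006", "gdp_usd": ""},        # missing value
    {"country": "Spain ",  "year": "2006", "gdp_usd": "1,2e12"},  # stray whitespace, comma decimal
]

def check(rows):
    """Return (row_index, problem) pairs found by a few simple rules."""
    problems = []
    seen = set()
    for i, r in enumerate(rows):
        key = r["country"].strip().lower()
        if key in seen:
            problems.append((i, "duplicate country"))
        seen.add(key)
        if r["country"] != r["country"].strip():
            problems.append((i, "untrimmed whitespace"))
        if len(r["year"]) != 4:
            problems.append((i, "bad year format"))
        try:
            float(r["gdp_usd"])
        except ValueError:
            problems.append((i, "unparseable or missing gdp"))
    return problems

for i, msg in check(rows):
    print(i, msg)
```

Rules like these catch the mechanical errors, but they only flag what someone thought to check for; the subtle stuff (a plausible-looking but wrong number, say) is exactly where the pair of human eyes still wins.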
There would also have to be a way for users of datasets to pay a reasonable amount of money to have a particular dataset brought up to a high quality standard.