Before you become a professional data jerk (like myself), people tell you 80% of the job is data scraping, cleaning, organizing, piping, etc., and the last 20% is the analysis and stats (or predictive whosiwhatsits). They're lying. It's 85% cleaning, organizing, etc., 5% doing "real" stats, and 10% convincing people you're not lying. This site seems to nicely outline many of the tasks that fall outside of the "fun" 5%.
I love your breakdown, especially the 10%; so sadly true. "But is it right?" is the question I get asked a lot, and it's a tough one to answer without getting into all the intricacies of what you actually did to the data to get it into a form where you could answer the question asked.
Loving it. This is really easy and accessible stuff, and it uses real-world data and questions early on in the process. Perfect way to get people to experience the subject and bring a bit more fun and inspiration to an otherwise not-that-exciting area.
After quite a few years studying 'data', I can confidently say that you can take the mini-courses from School of Data, then plunge into Coursera courses, then stop by your local library, throw money (if you have it) at cherry-picked books from Amazon.com, and bribe friends from college to get you papers from science journals, and at the end of all this you will still find things you don't know.
Statistics, Probability, Data Analysis, Data Mining, Decision Theory, (Digital) Signal Analysis, Machine Learning, Algorithmics, Graph Theory etc.
I think it really depends on what you're trying to do. I'm all for statistical rigour, and stats will help you with a nice structured dataset. A lot of the time, though, knowing how to scrape data and convert between formats opens opportunities that a pure statistician wouldn't have. Most of the interesting visualizations I've seen lately don't involve much stats at all.
Neither a conventional undergrad class nor this is "better." They just have different focuses.
But many (perhaps "most") intro stats classes don't involve any programming. So if you want to "implement" anything, an intro stats class may not get you there (even if it gives you a better foundation to understand what various statistical manipulations actually mean).
I think this is aimed at an audience outside the academic system, and I like the focus on active involvement with data. I think there could be an activist underpinning here - Paulo Freire style data literacy. (http://www.infed.org/thinkers/et-freir.htm)
Having said that, my Maths teacher self wants to do some work on the glossary. In the spirit of 'code talks' I'll post some definitions up and link them to the issue tracker and see what happens...
Why not? Scraping is part of the data acquisition and cleanup process. You need to do it unless you're working with Bloomberg terminals or Census data.
I agree. If I want to engage with my local government on a local issue (e.g. anti-social behaviour) I need data. The data is increasingly available on Web sites. Hence scraping and format conversion become important...
Just the other day I was thinking: I end up losing debates because I'm unable to cite data. Scraping and acquiring data is a key part of research, so I'm very much looking for a text that presents the big picture as well as the nitty-gritty details from beginning to end.