This is all the Python code you need to mine the complete StackExchange dump. The data is the full StackExchange dump up to September 2011, released under a Creative Commons license. I used it for a course project that is not even remotely commercial.
You can download the data over HTTP, which is what I did, or via torrent. The dump is quite large, so I am uploading the smallest XML files to get you started while your download finishes ;)
The data comes as big XML files. Some questions you can answer:
- how many users there are
- how badges are distributed
- how many producers/consumers there are
- what the most popular tags are

and so on.
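As a taste, here is a minimal sketch of the popular-tags case. It assumes the dump's usual schema, where each question row in posts.xml stores its tags in a `Tags` attribute written like `<tag1><tag2>`; the file path is illustrative, and the file name may be capitalized differently in your dump.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

def popular_tags(posts_path, top=10):
    """Tally tag occurrences across all <row> elements of a posts file."""
    counts = Counter()
    context = ET.iterparse(posts_path, events=("start", "end"))
    _, root = next(context)  # keep a handle on the root element
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            # question rows carry tags like "<bike><repair>"; answers have none
            counts.update(re.findall(r"<([^<>]+)>", elem.get("Tags", "")))
            root.clear()  # free rows we are done with (see the note at the end)
    return counts.most_common(top)

print(popular_tags("Bicycles/posts.xml"))
```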
There was a slightly higher purpose to this project: I wanted to find out what kinds of patterns show up in people's contributions. Those details can be found in the blog posts here: http://www.rohitdholakia.com/blog/categories/stackoverflow/
I have added the data for the Bicycles StackExchange in the folder. Let's find out how many users there are. Earlier, I had a separate script for each question; now, to get an overall picture, you can run Scripts/Summary.py:
```
Rohits-MacBook-Pro:Analysis-StackExchange rohitdholakia$ python Scripts/Summary.py Bicycles/
yo ! num users and epic users are 2002 0
num questions and answers and accepted answers are 1208 4396 862
and finally, num famous in this are 0
```
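For a sense of where those numbers come from, here is a minimal sketch of the kind of counts such a summary pass computes; it is not the actual Summary.py. It assumes the usual dump schema, where posts.xml rows carry `PostTypeId` ("1" for a question, "2" for an answer) and questions with an accepted answer carry `AcceptedAnswerId`.

```python
import xml.etree.ElementTree as ET

def count_posts(posts_path):
    """Tally questions, answers, and accepted answers in a posts file."""
    questions = answers = accepted = 0
    context = ET.iterparse(posts_path, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event != "end" or elem.tag != "row":
            continue
        kind = elem.get("PostTypeId")
        if kind == "1":                              # a question
            questions += 1
            if elem.get("AcceptedAnswerId") is not None:
                accepted += 1
        elif kind == "2":                            # an answer
            answers += 1
        root.clear()  # keep memory flat; explained below
    return questions, answers, accepted

print(count_posts("Bicycles/posts.xml"))
```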
The script is very simple, but it uses a couple of techniques you will see throughout. Note that all of this runs on a 4 GB MacBook Pro with a modest processor, nothing fancy, so keeping these gigantic XML files entirely in memory is not an option. That is where iterparse and clearElem() come in: we parse the file one element at a time and, once we are done with an element, we clear it along with the references its parents hold to it. The parents part becomes important when you are dealing with highly nested XML files.
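Spelled out, the pattern looks something like the sketch below, the same shape as the snippets above. It uses only the standard library; I have not reproduced the repo's clearElem() itself, but a helper in the same spirit might read:

```python
import xml.etree.ElementTree as ET

def stream_rows(path):
    """Walk a huge dump file one <row> element at a time."""
    context = ET.iterparse(path, events=("start", "end"))
    _, root = next(context)  # the very first event is the start of the root
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            yield elem
            # elem.clear() alone is not enough: the root, as the parent,
            # still holds a reference to the cleared element, so the tree
            # keeps growing. Clearing the root drops those references too,
            # which is the "parents" part mentioned above.
            root.clear()

# e.g. count the users in the Bicycles dump (file name assumed)
print(sum(1 for _ in stream_rows("Bicycles/users.xml")))
```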