masters-projects_trec-social-media-search

This project was a proof of concept for my thesis. I wanted to gauge the viability of using frequent itemset mining in search tasks. The outcome was a submission to the TREC 2012 microblog track (http://trec.nist.gov/pubs/trec21/papers/waterloo.microblog.nb.pdf), which showed that the results of frequent itemset mining can be used for query expansion. The code builds on Mahout's implementation of FP-Growth to compute the frequent patterns. I made some tweaks to it to work around the noisy patterns that the characteristics of text data produce, and to make it more scalable (on a single machine).
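To make the idea of itemset-based query expansion concrete, here is a minimal sketch in Python. It is not the code in this repository: it swaps FP-Growth for a brute-force counter that is only practical on toy data, and the function names (`frequent_itemsets`, `expand_query`), the sample tweets, and the ranking rule (highest support first) are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(docs, min_support, max_size=3):
    """Brute-force miner: count every term set of up to max_size terms
    and keep those that occur in at least min_support documents.
    (Illustrative stand-in for FP-Growth; exponential in document length.)"""
    counts = Counter()
    for doc in docs:
        terms = sorted(set(doc))  # each term counts once per document
        for size in range(1, max_size + 1):
            for itemset in combinations(terms, size):
                counts[itemset] += 1
    return {frozenset(k): v for k, v in counts.items() if v >= min_support}


def expand_query(query_terms, itemsets, max_new_terms=3):
    """Append terms that share a frequent itemset with any query term,
    ranked by the best support of such an itemset (ties broken alphabetically)."""
    query = set(query_terms)
    scores = Counter()
    for itemset, support in itemsets.items():
        if itemset & query:
            for term in itemset - query:
                scores[term] = max(scores[term], support)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return list(query_terms) + [term for term, _ in ranked[:max_new_terms]]


# Hypothetical, already-tokenized tweets.
tweets = [
    ["obama", "speech", "healthcare"],
    ["obama", "healthcare", "reform"],
    ["obama", "speech", "tonight"],
    ["superbowl", "tonight", "giants"],
    ["healthcare", "reform", "vote"],
]

itemsets = frequent_itemsets(tweets, min_support=2)
expanded = expand_query(["obama"], itemsets)
print(expanded)
```

With these toy tweets the query `obama` picks up `healthcare` and `speech`, the two terms that appear with it in at least two tweets. A real run would mine the patterns once with a scalable algorithm and look them up at query time.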

This project showed the viability of the thesis idea, and it also uncovered many challenges in applying frequent itemset mining to social media text. Many of the contributions of my thesis were geared towards overcoming those challenges, even though the biggest challenge I faced was discovering that the FP-Growth implementation distributed in Mahout is buggy. As soon as I switched to another implementation of frequent itemset mining, the effects of my additions to the algorithm became obvious. Before that, what I did didn't matter because Mahout's FP-Growth would generate out-of-whack results anyway. This was the most important lesson I took from that work, and now I don't trust any implementation of any algorithm unless it is something as well known as Joachims' SVMlight.

The code also has some utilities: for enriching Lucene 3.6 indexes with corpus-wide term statistics (now available in Lucene 4 indexes), for reading and writing TREC formats, etc. There are also some R scripts for calculating performance measures (area under the ROC curve, for example) and for testing the significance of differences in performance measures between runs.
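As a rough illustration of one of the measures mentioned above, the area under the ROC curve can be computed directly from its pairwise definition: the probability that a randomly chosen relevant item is scored above a randomly chosen non-relevant one. The sketch below is in Python rather than R and is not taken from the repository's scripts; the function name `roc_auc` and the sample labels and scores are made up for the example.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for relevant (positive), 0 for non-relevant (negative).
    scores: higher means the system considers the item more relevant.
    Ties between a positive and a negative count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative item")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical judgments and retrieval scores for five items.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 4))
```

The double loop is quadratic in the number of items, which is fine for a single topic's judged pool; a rank-based formulation would be the usual choice for larger inputs.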