Summary

This code is an attempt to build a vote-score predictor for Hacker News stories.

The best accuracy just reached 70%, which is not good enough for any practical use.

Takeaway lesson: to use neural networks effectively, one has to become the meta neural network that learns which model to choose under different circumstances. This requires lots of experience, a bit of intuition, and even more trial and error. There are very few systematic ways to pick the best model.

Comparison of Classifiers

Classifier	Accuracy
Metamind¹	68%
seq2-CNN²	70%
NearestCentroid³	63%
Perceptron³	60%

I also tried other classifiers from scikit-learn, but their accuracy wasn't much better than a random guess.
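To make the NearestCentroid row more concrete, here is a minimal standard-library sketch of the nearest-centroid idea on bag-of-words vectors. It is not the code from this repository and not scikit-learn's implementation; the function names and the toy texts and labels are made up for illustration.

```python
from collections import Counter
import math


def bag_of_words(text):
    """Turn text into a term-count vector of lowercased whitespace tokens."""
    return Counter(text.lower().split())


def fit_centroids(texts, labels):
    """Average the bag-of-words vectors of each class into one centroid."""
    sums, counts = {}, Counter()
    for text, label in zip(texts, labels):
        sums.setdefault(label, Counter()).update(bag_of_words(text))
        counts[label] += 1
    return {
        label: {word: n / counts[label] for word, n in vec.items()}
        for label, vec in sums.items()
    }


def distance(vec, centroid):
    """Euclidean distance between two sparse vectors stored as dicts."""
    words = set(vec) | set(centroid)
    return math.sqrt(sum((vec.get(w, 0) - centroid.get(w, 0)) ** 2 for w in words))


def predict(centroids, text):
    """Return the label whose centroid is closest to the text."""
    vec = bag_of_words(text)
    return min(centroids, key=lambda label: distance(vec, centroids[label]))


def accuracy(centroids, texts, labels):
    """Fraction of texts whose predicted label matches the true label."""
    hits = sum(predict(centroids, t) == y for t, y in zip(texts, labels))
    return hits / len(labels)


# Hypothetical toy data, only to show the call sequence.
train_texts = [
    "rust compiler release notes",
    "new rust compiler benchmark",
    "my weekend blog post",
    "random blog post about lunch",
]
train_labels = ["high", "high", "low", "low"]
centroids = fit_centroids(train_texts, train_labels)
```

Calling `predict(centroids, "rust compiler update")` picks the class whose averaged word counts are closest to the new text. A real run would use the crawled page content and a held-out test set rather than training accuracy.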

Experiment Data

I used the Hacker News Firebase API to collect story posts and crawled the content of 20000+ web pages linked from them. There are far more posts with a score of 15 or less, so I stopped collecting those after crawling for a while. The content, score, and word_count are stored in a PostgreSQL database. Posts with about 100 words or fewer are excluded from the analysis, since such a low count usually indicates a crawler problem or that the original page is a video or image.
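The collection step described above could be sketched roughly as follows. This is an assumption about the shape of the crawler, not the contents of get_article.py: the helper names and the `MIN_WORDS` constant are hypothetical, and only the public item endpoint of the Hacker News Firebase API is taken as given.

```python
import json
import urllib.request

# Public Hacker News Firebase endpoint for a single item.
HN_ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{id}.json"

# Hypothetical constant mirroring the word_count > 100 filter in the query.
MIN_WORDS = 100


def item_url(item_id):
    """Build the API URL for one item id."""
    return HN_ITEM_URL.format(id=item_id)


def fetch_item(item_id, timeout=10):
    """Download one item as a dict (performs a network request)."""
    with urllib.request.urlopen(item_url(item_id), timeout=timeout) as resp:
        return json.load(resp)


def is_usable_story(item, page_text):
    """Keep link stories whose crawled page has more than MIN_WORDS words.

    The API returns null for missing items, which arrives here as None.
    """
    if not item or item.get("type") != "story" or "url" not in item:
        return False
    return len(page_text.split()) > MIN_WORDS
```

Separating the network call from the filter keeps the filter easy to check without touching the API.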

select
      case when s.score >= 0 and s.score <= 15    then '  0 - 15'
           when s.score > 15 and s.score <= 40   then ' 15+ - 40'
           when s.score > 40 and s.score <= 80  then ' 40+ - 80'
           when s.score > 80 and s.score <= 150  then ' 80+ - 150'
           else 'over 150'
      end ScoreRange,
      count(*) as TotalWithinRange
   from
      story_contents sc LEFT JOIN story s ON s.id = sc.id
   where sc.word_count > 100 group by 1;

scorerange	totalwithinrange
80+ - 150	2532
over 150	2530
40+ - 80	2765
0 - 15	10200
15+ - 40	2359
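As a rough aid for reading these counts, the sketch below mirrors the CASE expression in Python and computes the share of the largest bucket, which is the accuracy a classifier would get by always predicting that bucket. The function and variable names are hypothetical. The README does not say which label scheme each accuracy figure was measured on, so this is context rather than a direct comparison.

```python
def score_range(score):
    """Mirror the SQL CASE expression (labels without the padding spaces)."""
    if 0 <= score <= 15:
        return "0 - 15"
    if 15 < score <= 40:
        return "15+ - 40"
    if 40 < score <= 80:
        return "40+ - 80"
    if 80 < score <= 150:
        return "80+ - 150"
    return "over 150"


# Counts copied from the query result above.
COUNTS = {
    "0 - 15": 10200,
    "15+ - 40": 2359,
    "40+ - 80": 2765,
    "80+ - 150": 2532,
    "over 150": 2530,
}

total = sum(COUNTS.values())
majority_label = max(COUNTS, key=COUNTS.get)
majority_baseline = COUNTS[majority_label] / total

print(f"{total} stories, majority class {majority_label!r}: {majority_baseline:.1%}")
```

With these counts the "0 - 15" bucket holds about half of the 20386 stories, so a two-way split at a score of 15 is close to balanced, while the four higher buckets are each roughly an eighth of the data.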

Package and Reference

¹ Metamind has a very easy-to-use API and an easy-to-understand reporting page for classifier performance. My guess, though, is that the classifier doesn't work that well out of the box.
² Rie Johnson and Tong Zhang. Effective use of word order for text categorization with convolutional neural networks. To appear in NAACL-HLT 2015. Also available as arXiv:1412.1058v2. To run this, I used a public AMI that had CUDA 6.5 pre-installed on AWS. Source
³ scikit-learn: a Python package with lots of good stuff. Easy-to-understand example code and fast prototyping.


zhengwy888/HN_rank

Folders and files

Latest commit

History

Repository files navigation

Summary

Comparison of Classifiers

Experiment Data

Package and Reference

Footnotes

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages