- script to go to Hacker News and fetch article titles with their links
- if an article is already in the db, skip it
- for pagination, once all articles on a page are done, the next page is https://news.ycombinator.com/?p=2, where 2 is the page number
- script should be separate from the model, so that if at any point I want to change the website to get news from, only the script needs to change, not the model
- when adding an article, update the stats
- aiohttp to parallelize website requests (see the sketch after this list)
- [/] 2000/6000 articles
- Maybe it's not needed
- At least for now it's not needed
- Ensure categories are consistent, don't create a category for a single article
- Parallelize web requests using threads
- Only save to file after every page has been parsed
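- a minimal sketch of the parallel, paginated fetch, assuming aiohttp + BeautifulSoup (the .titleline > a selector matches current HN markup but may need adjusting; seen_urls stands in for the real db check):

```python
import asyncio

import aiohttp
from bs4 import BeautifulSoup

BASE_URL = "https://news.ycombinator.com/?p={page}"  # ?p=2 -> page 2

async def fetch_page(session: aiohttp.ClientSession, page: int) -> list[tuple[str, str]]:
    """Fetch one listing page and return (title, link) pairs."""
    async with session.get(BASE_URL.format(page=page)) as resp:
        html = await resp.text()
    soup = BeautifulSoup(html, "html.parser")
    # ".titleline > a" matches the title anchors on current HN markup
    return [(a.text, a.get("href", "")) for a in soup.select(".titleline > a")]

async def scrape(pages: int, seen_urls: set[str]) -> list[tuple[str, str]]:
    """Fetch all pages in parallel, skipping articles already in the db."""
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(fetch_page(session, p) for p in range(1, pages + 1)))
    # flatten, drop known urls; the caller saves to file once at the end
    return [(t, u) for page in results for t, u in page if u not in seen_urls]

if __name__ == "__main__":
    new_articles = asyncio.run(scrape(pages=3, seen_urls=set()))
    print(len(new_articles), "new articles")
```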
- One-hot encode the titles, or find some other way to turn text into numbers
- ensure all titles are of the same length
- preprocess before splitting the data
- actually LabelEncoder is better for categories
- Encode titles into numbers, probably as a tensor
- csr_matrix has no attribute 'to' when trying to move the tfidf vector to cuda - fix sketched below
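- a sketch of the encoding step, assuming scikit-learn + PyTorch; the sample data is made up, and densifying the matrix works around the csr_matrix error:

```python
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

titles = ["Show HN: my side project", "Rust 1.80 released"]  # made-up sample
categories = ["show", "programming"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(titles)   # scipy csr_matrix, fixed-length rows

le = LabelEncoder()
y = le.fit_transform(categories)       # integer labels instead of one-hot

# csr_matrix has no .to(): densify first, then move to the GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
X_tensor = torch.tensor(X.toarray(), dtype=torch.float32).to(device)
y_tensor = torch.tensor(y, dtype=torch.long).to(device)
```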
- Create custom dataset in src/categorize/dataset.py (sketch below)
- Uses BERT
- Ideally, it should be able to both categorize the article and judge whether it's interesting, and on what scale
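- one possible shape for src/categorize/dataset.py, a thin PyTorch Dataset over the encoded titles and labels (class and attribute names are assumptions):

```python
import torch
from torch.utils.data import Dataset

class ArticleDataset(Dataset):
    """Pairs each encoded title with its label for use in a DataLoader."""

    def __init__(self, titles: torch.Tensor, labels: torch.Tensor):
        self.titles = titles    # (n_articles, features) float tensor
        self.labels = labels    # (n_articles,) long tensor

    def __len__(self) -> int:
        return len(self.labels)

    def __getitem__(self, idx: int) -> tuple[torch.Tensor, torch.Tensor]:
        return self.titles[idx], self.labels[idx]
```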
-
[/] Model to categorize article based on title
- Use already existing categories from file
- At least 90% accuracy - currently 30%
- RNN, especially LSTM or GRU, might be good, but computationally expensive
- CNN can be faster but less effective
- Uses GRU with 98% accuracy (rough shape sketched below)
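- a rough sketch of what the GRU classifier could look like (layer names and sizes are assumptions, not the actual implementation):

```python
import torch
import torch.nn as nn

class TitleGRU(nn.Module):
    """Embeds token ids, runs a GRU, classifies from the final hidden state."""

    def __init__(self, vocab_size: int, embed_dim: int, hidden: int, n_classes: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len) of token ids, padded to the same length
        embedded = self.embedding(x)
        _, h = self.gru(embedded)          # h: (1, batch, hidden)
        return self.fc(h.squeeze(0))       # (batch, n_classes) logits
```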
-
Find optimal hyperparameters using grid search (sketch after the loops)
- for lr in lrs
- for epochs in epoch_options
- for batch_size in batch_sizes
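- a sketch of the exhaustive grid search; the value grids and the train_and_eval hook are placeholders, not real results:

```python
from itertools import product

def train_and_eval(lr: float, epochs: int, batch_size: int) -> float:
    """Hypothetical hook: train with these settings, return validation accuracy."""
    return 0.0  # replace with the real training loop

lrs = [1e-2, 1e-3, 1e-4]
epoch_options = [10, 20, 50]
batch_sizes = [16, 32, 64]

best_acc, best_params = 0.0, None
for lr, epochs, batch_size in product(lrs, epoch_options, batch_sizes):
    acc = train_and_eval(lr, epochs, batch_size)
    if acc > best_acc:
        best_acc, best_params = acc, (lr, epochs, batch_size)
print(f"best accuracy {best_acc} with (lr, epochs, batch_size) = {best_params}")
```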
- [/] Model to predict the interest rating of an article based on its title
- Needs at least 80% accuracy
- MLP is easy to implement but worse than an RNN
- RNN, better than an MLP but more computationally expensive
- How to embed title + category? (one option sketched below)
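- one option for embedding title + category: concatenate the GRU's title representation with a learned category embedding before the output layer (all names and sizes are assumptions):

```python
import torch
import torch.nn as nn

class InterestModel(nn.Module):
    """Scores interest from the title's GRU state plus a category embedding."""

    def __init__(self, vocab_size: int, n_categories: int,
                 embed_dim: int = 64, hidden: int = 128, cat_dim: int = 16):
        super().__init__()
        self.title_embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden, batch_first=True)
        self.cat_embedding = nn.Embedding(n_categories, cat_dim)
        self.fc = nn.Linear(hidden + cat_dim, 1)   # one interest score

    def forward(self, title_ids: torch.Tensor, category_id: torch.Tensor) -> torch.Tensor:
        _, h = self.gru(self.title_embedding(title_ids))         # h: (1, batch, hidden)
        combined = torch.cat([h.squeeze(0), self.cat_embedding(category_id)], dim=1)
        return self.fc(combined).squeeze(1)                       # (batch,) ratings
```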
- main.py should run the scraper
- categorize articles that don't have a label
- almost getting labels, just need to extract each value from the tensor
- predict interest rating
- bad size when loading the model
- the input size is equal to how many articles there are, so it changes as the dataset grows - fix sketched below
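- a sketch of the fix: persist the fitted vectorizer next to the weights so the input size matches at load time (Categorizer is a stand-in for the real model class; model, vectorizer, le, and titles are assumed from the encoding sketch above):

```python
import pickle

import torch
import torch.nn as nn

class Categorizer(nn.Module):  # stand-in for the real model class
    def __init__(self, input_size: int, n_classes: int):
        super().__init__()
        self.fc = nn.Linear(input_size, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(x)

# at save time: persist the vectorizer too, since it defines the input size
torch.save(model.state_dict(), "categorizer.pt")
with open("vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)

# at load time: rebuild with the exact same input size, then load weights
with open("vectorizer.pkl", "rb") as f:
    vectorizer = pickle.load(f)
model = Categorizer(input_size=len(vectorizer.vocabulary_), n_classes=10)  # 10 is a placeholder
model.load_state_dict(torch.load("categorizer.pt"))
model.eval()

# getting the labels out: argmax per row, then .item() for a plain int
with torch.no_grad():
    X = torch.tensor(vectorizer.transform(titles).toarray(), dtype=torch.float32)
    preds = model(X).argmax(dim=1)
labels = [le.inverse_transform([p.item()])[0] for p in preds]
```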
- once all data is labeled and accuracy is good, load the models and only retrain on feedback
- Feedback should update the data in the .csv, then there's an option to retrain the model (update sketched below)
- Create one model to do it all or somehow join them together
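- a sketch of the feedback update with pandas; the column names (link, category, interesting_rating, feedback) are assumptions:

```python
import pandas as pd

def apply_feedback(csv_path: str, url: str, category: str, rating: int) -> None:
    """Overwrite one article's labels in the .csv and flag it for retraining."""
    df = pd.read_csv(csv_path)
    mask = df["link"] == url
    df.loc[mask, "category"] = category
    df.loc[mask, "interesting_rating"] = rating
    df.loc[mask, "feedback"] = True        # rows to retrain on later
    df.to_csv(csv_path, index=False)
```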
- What frontend will I use?
- supposedly it needs a Python backend, but if I only need the .csv file of data the model spits out, I could get away with using Next.js or HTMX
- maybe Streamlit? or Mesop?
- or Flask and HTMX? (minimal route sketched below)
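- if the Flask + HTMX option wins, the newest-articles page could stay this small (file and column names are assumptions):

```python
import pandas as pd
from flask import Flask, render_template

app = Flask(__name__)

@app.route("/")
def newest():
    # serve rows straight from the model's .csv output, best-rated first
    df = pd.read_csv("articles.csv")
    df = df.sort_values("interesting_rating", ascending=False)
    return render_template("index.html", articles=df.to_dict("records"))

if __name__ == "__main__":
    app.run(debug=True)
```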
-
Create UI in Figma
- minimalistic
- page for newest articles
- detailed page for articles
- feedback button, interesting or not, change rating, change category, remove article
- link to article
- sort by category
- sort by interesting_rating
- [/] select tag to choose category and rating
- then save it to file
- [/] page to control the model
- retrain model? maybe not needed if feedback works in realtime
- statistics (derivable from the .csv - sketch after this list)
- how many articles fetched
- how many articles discarded
- how many articles accepted
- accuracy based on user feedback
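- the statistics could all come from the same .csv; the status, predicted_category, and category columns are assumptions:

```python
import pandas as pd

def stats(csv_path: str) -> dict:
    """Assumed columns: status ('accepted'/'discarded'), predicted_category, category."""
    df = pd.read_csv(csv_path)
    corrected = df["predicted_category"] != df["category"]  # user changed the label
    return {
        "fetched": len(df),
        "accepted": int((df["status"] == "accepted").sum()),
        "discarded": int((df["status"] == "discarded").sum()),
        "feedback_accuracy": float(1.0 - corrected.mean()),
    }
```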
-
Implement it
-
Rewrite the scraper in Go