- script to go to Hacker News and fetch article titles with their links
- if an article is already in the db, skip it
- for pagination, once all articles on a page are done, the next page is https://news.ycombinator.com/?p=2, where 2 is the page number
- script should be separate from the model, so that if at any point I want to change the website to get news from, only the script needs to change, not the model
- when adding an article, update the stats
- aiohttp to parallelize website requests (see the sketch after this list)
- [/] 2000/6000 articles
- Maybe it's not needed
- At least for now it's not needed
- Ensure categories are consistent, don't create a category for a single article
- Parallelize web requests using threads
- Only save to file after every page has been parsed
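- a minimal sketch of the parallel, paginated fetch, assuming aiohttp + BeautifulSoup (the .titleline > a selector matches current HN markup but may need adjusting; seen_urls stands in for the real db check):

```python
import asyncio

import aiohttp
from bs4 import BeautifulSoup

BASE_URL = "https://news.ycombinator.com/?p={page}"  # ?p=2 -> page 2

async def fetch_page(session: aiohttp.ClientSession, page: int) -> list[tuple[str, str]]:
    """Fetch one listing page and return (title, link) pairs."""
    async with session.get(BASE_URL.format(page=page)) as resp:
        html = await resp.text()
    soup = BeautifulSoup(html, "html.parser")
    # ".titleline > a" matches the title anchors on current HN markup
    return [(a.text, a.get("href", "")) for a in soup.select(".titleline > a")]

async def scrape(pages: int, seen_urls: set[str]) -> list[tuple[str, str]]:
    """Fetch all pages in parallel, skipping articles already in the db."""
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(fetch_page(session, p) for p in range(1, pages + 1)))
    # flatten, drop known urls; the caller saves to file once at the end
    return [(t, u) for page in results for t, u in page if u not in seen_urls]

if __name__ == "__main__":
    new_articles = asyncio.run(scrape(pages=3, seen_urls=set()))
    print(len(new_articles), "new articles")
```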
- One-hot encode the titles, or find some other way to turn text into numbers
- ensure all titles are of the same length
- preprocess before splitting the data
- actually LabelEncoder is better for categories
- Encode titles into numbers, probably as a tensor
- csr_matrix has no attribute 'to' when trying to move the tfidf vector to cuda - fix sketched below
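- a sketch of the encoding step, assuming scikit-learn + PyTorch; the sample data is made up, and densifying the matrix works around the csr_matrix error:

```python
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

titles = ["Show HN: my side project", "Rust 1.80 released"]  # made-up sample
categories = ["show", "programming"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(titles)   # scipy csr_matrix, fixed-length rows

le = LabelEncoder()
y = le.fit_transform(categories)       # integer labels instead of one-hot

# csr_matrix has no .to(): densify first, then move to the GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
X_tensor = torch.tensor(X.toarray(), dtype=torch.float32).to(device)
y_tensor = torch.tensor(y, dtype=torch.long).to(device)
```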
- Create custom dataset in src/categorize/dataset.py (sketch below)
- Uses BERT
- Ideally, it should be able to both categorize the article and judge whether it's interesting, and on what scale
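- one possible shape for src/categorize/dataset.py, a thin PyTorch Dataset over the encoded titles and labels (class and attribute names are assumptions):

```python
import torch
from torch.utils.data import Dataset

class ArticleDataset(Dataset):
    """Pairs each encoded title with its label for use in a DataLoader."""

    def __init__(self, titles: torch.Tensor, labels: torch.Tensor):
        self.titles = titles    # (n_articles, features) float tensor
        self.labels = labels    # (n_articles,) long tensor

    def __len__(self) -> int:
        return len(self.labels)

    def __getitem__(self, idx: int) -> tuple[torch.Tensor, torch.Tensor]:
        return self.titles[idx], self.labels[idx]
```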
-
[/] Model to categorize article based on title
- Use already existing categories from file
- At least 90% accuracy - currently 30%
- RNN, especially LSTM or GRU, might be good, but computationally expensive
- CNN can be faster but less effective
- Uses GRU with 98% accuracy (rough shape sketched below)
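- a rough sketch of what the GRU classifier could look like (layer names and sizes are assumptions, not the actual implementation):

```python
import torch
import torch.nn as nn

class TitleGRU(nn.Module):
    """Embeds token ids, runs a GRU, classifies from the final hidden state."""

    def __init__(self, vocab_size: int, embed_dim: int, hidden: int, n_classes: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len) of token ids, padded to the same length
        embedded = self.embedding(x)
        _, h = self.gru(embedded)          # h: (1, batch, hidden)
        return self.fc(h.squeeze(0))       # (batch, n_classes) logits
```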
-
Find optimal hyperparameters using grid search (sketch after the loops)
- for lr in lrs
- for epochs in epoch_options
- for batch_size in batch_sizes
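- a sketch of the exhaustive grid search; the value grids and the train_and_eval hook are placeholders, not real results:

```python
from itertools import product

def train_and_eval(lr: float, epochs: int, batch_size: int) -> float:
    """Hypothetical hook: train with these settings, return validation accuracy."""
    return 0.0  # replace with the real training loop

lrs = [1e-2, 1e-3, 1e-4]
epoch_options = [10, 20, 50]
batch_sizes = [16, 32, 64]

best_acc, best_params = 0.0, None
for lr, epochs, batch_size in product(lrs, epoch_options, batch_sizes):
    acc = train_and_eval(lr, epochs, batch_size)
    if acc > best_acc:
        best_acc, best_params = acc, (lr, epochs, batch_size)
print(f"best accuracy {best_acc} with (lr, epochs, batch_size) = {best_params}")
```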
- [/] Model to predict the interest rating of an article based on its title
- Needs at least 80% accuracy
- MLP is easy to implement but worse than an RNN
- RNN, better than an MLP but more computationally expensive
- How to embed title + category? (one option sketched below)
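- one option for embedding title + category: concatenate the GRU's title representation with a learned category embedding before the output layer (all names and sizes are assumptions):

```python
import torch
import torch.nn as nn

class InterestModel(nn.Module):
    """Scores interest from the title's GRU state plus a category embedding."""

    def __init__(self, vocab_size: int, n_categories: int,
                 embed_dim: int = 64, hidden: int = 128, cat_dim: int = 16):
        super().__init__()
        self.title_embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden, batch_first=True)
        self.cat_embedding = nn.Embedding(n_categories, cat_dim)
        self.fc = nn.Linear(hidden + cat_dim, 1)   # one interest score

    def forward(self, title_ids: torch.Tensor, category_id: torch.Tensor) -> torch.Tensor:
        _, h = self.gru(self.title_embedding(title_ids))         # h: (1, batch, hidden)
        combined = torch.cat([h.squeeze(0), self.cat_embedding(category_id)], dim=1)
        return self.fc(combined).squeeze(1)                       # (batch,) ratings
```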
- main.py should run the scraper
- categorize articles that don't have a label
- almost getting labels, just need to extract each value from the tensor
- predict interest rating
- bad size when loading the model
- the input size is equal to how many articles there are, so it changes as the dataset grows - fix sketched below
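- a sketch of the fix: persist the fitted vectorizer next to the weights so the input size matches at load time (Categorizer is a stand-in for the real model class; model, vectorizer, le, and titles are assumed from the encoding sketch above):

```python
import pickle

import torch
import torch.nn as nn

class Categorizer(nn.Module):  # stand-in for the real model class
    def __init__(self, input_size: int, n_classes: int):
        super().__init__()
        self.fc = nn.Linear(input_size, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(x)

# at save time: persist the vectorizer too, since it defines the input size
torch.save(model.state_dict(), "categorizer.pt")
with open("vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)

# at load time: rebuild with the exact same input size, then load weights
with open("vectorizer.pkl", "rb") as f:
    vectorizer = pickle.load(f)
model = Categorizer(input_size=len(vectorizer.vocabulary_), n_classes=10)  # 10 is a placeholder
model.load_state_dict(torch.load("categorizer.pt"))
model.eval()

# getting the labels out: argmax per row, then .item() for a plain int
with torch.no_grad():
    X = torch.tensor(vectorizer.transform(titles).toarray(), dtype=torch.float32)
    preds = model(X).argmax(dim=1)
labels = [le.inverse_transform([p.item()])[0] for p in preds]
```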
- once all data is labeled and accuracy is good, load the models and only retrain on feedback
- Feedback should update the data in the .csv, then there's an option to retrain the model (update sketched below)
- Create one model to do it all or somehow join them together
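- a sketch of the feedback update with pandas; the column names (link, category, interesting_rating, feedback) are assumptions:

```python
import pandas as pd

def apply_feedback(csv_path: str, url: str, category: str, rating: int) -> None:
    """Overwrite one article's labels in the .csv and flag it for retraining."""
    df = pd.read_csv(csv_path)
    mask = df["link"] == url
    df.loc[mask, "category"] = category
    df.loc[mask, "interesting_rating"] = rating
    df.loc[mask, "feedback"] = True        # rows to retrain on later
    df.to_csv(csv_path, index=False)
```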
- What frontend will I use?
- supposedly it needs a Python backend, but if I only need the .csv file of data the model spits out, I could get away with using Next.js or HTMX
- maybe Streamlit? or Mesop?
- or Flask and HTMX? (minimal route sketched below)
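- if the Flask + HTMX option wins, the newest-articles page could stay this small (file and column names are assumptions):

```python
import pandas as pd
from flask import Flask, render_template

app = Flask(__name__)

@app.route("/")
def newest():
    # serve rows straight from the model's .csv output, best-rated first
    df = pd.read_csv("articles.csv")
    df = df.sort_values("interesting_rating", ascending=False)
    return render_template("index.html", articles=df.to_dict("records"))

if __name__ == "__main__":
    app.run(debug=True)
```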
-
Create UI in Figma
- minimalistic
- page for newest articles
- detailed page for articles
- feedback button, interesting or not, change rating, change category, remove article
- link to article
- sort by category
- sort by interesting_rating
- [/] select tag to choose category and rating
- then save it to file
- [/] page to control the model
- retrain model? maybe not needed if feedback works in realtime
- statistics (derivable from the .csv - sketch after this list)
- how many articles fetched
- how many articles discarded
- how many articles accepted
- accuracy based on user feedback
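- the statistics could all come from the same .csv; the status, predicted_category, and category columns are assumptions:

```python
import pandas as pd

def stats(csv_path: str) -> dict:
    """Assumed columns: status ('accepted'/'discarded'), predicted_category, category."""
    df = pd.read_csv(csv_path)
    corrected = df["predicted_category"] != df["category"]  # user changed the label
    return {
        "fetched": len(df),
        "accepted": int((df["status"] == "accepted").sum()),
        "discarded": int((df["status"] == "discarded").sum()),
        "feedback_accuracy": float(1.0 - corrected.mean()),
    }
```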
-
Implement it
-
Rewrite the scraper in Go