benchmark StorySniffer as potential tool to help identify parked EN domains

The [StorySniffer](https://github.com/palewire/storysniffer/) module uses a trained model and some URL-based heuristics to guess whether an EN-language URL is a news story. Having read the docs and the training method, I think the model is actually guessing whether a URL is a homepage or not.

We often end up with parked domains whose RSS feeds spam us with links to things that look like homepages (see the recent issues with `sktoday.com` in our system). What if we used StorySniffer to classify 100 random stories from the last week for an English-language source? Would that be a useful signal of whether a source has turned into a parked spam domain?

The clear limitations here are that it is English-only and that the model may not age well as HTML and writing norms keep shifting. But it is an interesting question, and it is an off-the-shelf module we can try out.

I'd say the tasks here are to:
1. verify the module can be pulled and works on a single URL of a story and a single URL of a homepage
2. pick 5-10 English sources we've had issues with and write a script to see what it says about a sample of 500 recent stories from each
3. report back on run time, compute usage, and accuracy
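
A rough sketch of what the script for tasks 2 and 3 might look like. Everything named here is hypothetical (`non_story_rate`, `looks_parked`, `storysniffer_guess`), and the 0.5 threshold is a placeholder to tune against real data, not a recommendation. It assumes `StorySniffer().guess(url)` returns a boolean, as shown in the project's README; that is worth confirming in task 1.

```python
import random
import time
from typing import Callable, Iterable, Tuple


def non_story_rate(
    urls: Iterable[str],
    guess: Callable[[str], bool],
    sample_size: int = 100,
    seed: int = 0,
) -> Tuple[float, float]:
    """Classify a random sample of URLs.

    Returns (share of sampled URLs judged NOT to be stories,
    average seconds spent per URL).
    """
    pool = list(urls)
    if not pool:
        raise ValueError("no URLs to classify")
    rng = random.Random(seed)  # fixed seed so reruns are comparable
    sample = rng.sample(pool, min(sample_size, len(pool)))

    start = time.perf_counter()
    verdicts = [bool(guess(url)) for url in sample]
    elapsed = time.perf_counter() - start

    rate = verdicts.count(False) / len(sample)
    return rate, elapsed / len(sample)


def looks_parked(rate: float, threshold: float = 0.5) -> bool:
    """Flag a source when most sampled URLs do not look like stories.

    The threshold is an assumption made for illustration only.
    """
    return rate >= threshold


def storysniffer_guess() -> Callable[[str], bool]:
    """Build a classifier backed by StorySniffer (imported lazily).

    Assumes the README-style API: StorySniffer().guess(url) -> bool.
    """
    from storysniffer import StorySniffer  # pip install storysniffer

    return StorySniffer().guess
```

To try it on one source, pass that source's recent story URLs (pulled however we normally query them) along with `storysniffer_guess()`, then feed the resulting rate to `looks_parked`. The per-URL timing gives a first number for the run-time part of task 3; accuracy still needs hand-labeled examples to compare against.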
