benchmark StorySniffer as potential tool to help identify parked EN domains #7

@rahulbot

The StorySniffer module uses a trained model and some URL-based heuristics to guess whether an English-language URL points to a news story. Having read the docs and the training method, I think the model is really guessing whether a URL is a homepage or not.

We often end up with parked domains whose RSS feeds spam us with links to things that look like homepages (see the recent issues with sktoday.com in our system). What if we used StorySniffer to classify 100 random stories from the last week for an English-language source? Would the share of URLs it flags as non-stories be a useful signal that a source has turned into a parked spam domain?

The clear limitation here is that it is English-only, and there is a potential worry that the model will not age well as HTML and writing norms keep shifting. But it is an interesting question, and it is an off-the-shelf module we can try out.

I'd say the tasks here are to:

  1. verify the module can be pulled and works on a single URL of a story and a single URL of a homepage
  2. pick 5-10 English sources we've had issues with and write a script to see what it says about a sample of 500 recent stories from each
  3. report back on run time, compute usage, and accuracy
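
To make tasks 2 and 3 concrete, here is a rough sketch of what the benchmark script could look like. It is only a starting point under some assumptions: the helper names (`sample_urls`, `benchmark`, `looks_parked`, `storysniffer_classifier`) are made up for illustration, the 0.5 cutoff is an arbitrary placeholder rather than a tuned value, and I am assuming the module exposes `StorySniffer().guess(url)` returning a boolean as its README describes. The classifier is passed in as a plain callable so the harness can be tried without the module installed.

```python
import random
import time
from typing import Callable, Iterable, Optional


def sample_urls(urls: Iterable[str], k: int, seed: int = 0) -> list:
    """Return up to k URLs chosen at random (seeded so runs are repeatable)."""
    pool = list(urls)
    if len(pool) <= k:
        return pool
    return random.Random(seed).sample(pool, k)


def benchmark(urls: list, classify: Callable[[str], bool]) -> dict:
    """Run classify over every URL, recording wall-clock time and tallies."""
    start = time.perf_counter()
    stories = 0
    errors = 0
    for url in urls:
        try:
            if classify(url):
                stories += 1
        except Exception:
            # Count failures separately so they do not skew the fraction.
            errors += 1
    elapsed = time.perf_counter() - start
    scored = len(urls) - errors
    return {
        "total": len(urls),
        "errors": errors,
        "stories": stories,
        "story_fraction": stories / scored if scored else None,
        "seconds": elapsed,
    }


def looks_parked(result: dict, min_story_fraction: float = 0.5) -> bool:
    """Flag a source when too few sampled URLs look like stories.

    The 0.5 default is a placeholder assumption, not a validated threshold.
    """
    fraction: Optional[float] = result["story_fraction"]
    return fraction is not None and fraction < min_story_fraction


def storysniffer_classifier() -> Callable[[str], bool]:
    """Build the real classifier.

    Assumes the storysniffer package provides StorySniffer().guess(url) -> bool;
    imported lazily so the rest of the harness works without it.
    """
    from storysniffer import StorySniffer

    return StorySniffer().guess
```

Usage would be something like `result = benchmark(sample_urls(recent_urls, 500), storysniffer_classifier())` per source, where `recent_urls` stands in for however we pull recent story URLs from our system. This only covers run time and the story/non-story split; measuring accuracy would additionally need a hand-labelled subset, and compute usage would need to be tracked outside the script.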
