@dgoiko dgoiko commented Jan 25, 2020

I've prepared a branch with some modifications I had to make to the codebase in order to support custom WebURLs with additional fields (and the different WebURLTupleBindings they require).

In order to achieve that, I had to introduce the following modifications:

  • WorkQueues creates a WebURLTupleBinding. I added a protected constructor which accepts it as a parameter, so subclasses can provide custom WebURLTupleBinding instances.
  • Frontier creates the WorkQueues in its constructor. It now has a createWorkQueues method that subclasses can override to create custom WorkQueues subclasses.
  • The same applies to CrawlController, which now has overridable createFrontier and createEmptyWebURL methods.
  • The WebCrawler logic that follows redirections and outgoing URLs now lives in protected methods that can be overridden.
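The hooks listed above can be sketched as follows. This is a minimal, self-contained illustration, not the code in the branch: `Url`, `Binding`, `Queue`, `Frontier` and `Controller` are simplified stand-ins for WebURL, WebURLTupleBinding, WorkQueues, Frontier and CrawlController, and `TaggedUrl`, its `tag` field, and the other `Tagged*` classes are hypothetical names invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class FactoryHookSketch {

    // Stand-in for WebURL; the real class carries many more fields.
    static class Url {
        String address;
    }

    // Hypothetical custom URL with one additional field.
    static class TaggedUrl extends Url {
        String tag;
    }

    // Stand-in for WebURLTupleBinding: turns a URL into a stored record.
    static class Binding {
        String toRecord(Url url) {
            return url.address;
        }
    }

    // Hypothetical binding that also persists the additional field.
    static class TaggedBinding extends Binding {
        @Override
        String toRecord(Url url) {
            String tag = (url instanceof TaggedUrl) ? ((TaggedUrl) url).tag : "";
            return url.address + "|" + tag;
        }
    }

    // Stand-in for WorkQueues: the protected constructor accepts the binding.
    static class Queue {
        private final Binding binding;
        final List<String> records = new ArrayList<>();

        Queue() {
            this(new Binding());
        }

        protected Queue(Binding binding) {
            this.binding = binding;
        }

        void put(Url url) {
            records.add(binding.toRecord(url));
        }
    }

    static class TaggedQueue extends Queue {
        TaggedQueue() {
            super(new TaggedBinding());
        }
    }

    // Stand-in for Frontier: the constructor delegates to an overridable factory method.
    static class Frontier {
        final Queue queue;

        Frontier() {
            this.queue = createQueue();
        }

        protected Queue createQueue() {
            return new Queue();
        }
    }

    static class TaggedFrontier extends Frontier {
        @Override
        protected Queue createQueue() {
            return new TaggedQueue();
        }
    }

    // Stand-in for CrawlController with the two factory hooks.
    static class Controller {
        final Frontier frontier;

        Controller() {
            this.frontier = createFrontier();
        }

        protected Frontier createFrontier() {
            return new Frontier();
        }

        protected Url createEmptyUrl() {
            return new Url();
        }

        Url addSeed(String address) {
            Url url = createEmptyUrl(); // subclasses decide the concrete URL type
            url.address = address;
            frontier.queue.put(url);
            return url;
        }
    }

    static class TaggedController extends Controller {
        @Override
        protected Frontier createFrontier() {
            return new TaggedFrontier();
        }

        @Override
        protected Url createEmptyUrl() {
            TaggedUrl url = new TaggedUrl();
            url.tag = "seed";
            return url;
        }
    }

    public static void main(String[] args) {
        Controller controller = new TaggedController();
        controller.addSeed("https://example.com/");
        System.out.println(controller.frontier.queue.records); // prints [https://example.com/|seed]
    }
}
```

One caveat with this pattern: because the factory methods are called from the superclass constructor, an override runs before the subclass's own fields are initialized, so it should not depend on them.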

WorkQueues now has a protected constructor that accepts a WebURLTupleBinding as a parameter.

This makes it possible to use custom WebURLs with additional parameters that require a custom WebURLTupleBinding.
The Frontier constructor now calls createWorkQueues to get a new WorkQueues instance. This allows subclasses to override this behaviour and create custom work queues.
CrawlController subclasses can now create custom Frontiers.
createEmptyWebURL was added so that subclasses of CrawlController can create their custom WebURLs in addSeed operations.
scheduleOutgoingUrls and performRedirect were created so that subclasses can modify this behaviour.

schedule and scheduleAll wrap the frontier functions so subclasses can schedule manually.
redirectionPhase is separated into a protected method.
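
A rough sketch of how a subclass might use the scheduling hooks in WebCrawler. The classes below are simplified stand-ins rather than the real crawler4j types: URLs are plain strings, the method signatures are assumptions made for illustration, and `SameSiteCrawler` with its prefix filter is a hypothetical example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CrawlerHookSketch {

    // Stand-in for Frontier: just records what was scheduled.
    static class Frontier {
        final List<String> scheduled = new ArrayList<>();

        void schedule(String url) {
            scheduled.add(url);
        }

        void scheduleAll(List<String> urls) {
            scheduled.addAll(urls);
        }
    }

    // Stand-in for WebCrawler with overridable scheduling hooks.
    static class Crawler {
        private final Frontier frontier;

        Crawler(Frontier frontier) {
            this.frontier = frontier;
        }

        // Called once per fetched page with the links found on it.
        void processPage(List<String> outgoingUrls) {
            scheduleOutgoingUrls(outgoingUrls);
        }

        // Default behaviour: schedule every outgoing URL.
        protected void scheduleOutgoingUrls(List<String> outgoingUrls) {
            scheduleAll(outgoingUrls);
        }

        // Thin wrappers so subclasses can schedule without touching the frontier.
        protected void schedule(String url) {
            frontier.schedule(url);
        }

        protected void scheduleAll(List<String> urls) {
            frontier.scheduleAll(urls);
        }
    }

    // Hypothetical subclass that only follows links under one prefix.
    static class SameSiteCrawler extends Crawler {
        private static final String PREFIX = "https://example.com/";

        SameSiteCrawler(Frontier frontier) {
            super(frontier);
        }

        @Override
        protected void scheduleOutgoingUrls(List<String> outgoingUrls) {
            for (String url : outgoingUrls) {
                if (url.startsWith(PREFIX)) {
                    schedule(url);
                }
            }
        }
    }

    public static void main(String[] args) {
        Frontier frontier = new Frontier();
        Crawler crawler = new SameSiteCrawler(frontier);
        crawler.processPage(Arrays.asList(
                "https://example.com/a",
                "https://other.example.org/b"));
        System.out.println(frontier.scheduled); // prints [https://example.com/a]
    }
}
```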