[PyWB2] Remove "source" and "source-coll" fields from results #7

Open
sebastian-nagel opened this issue Sep 26, 2019 · 1 comment
Labels
pywb2 Upgrade to PyWB 2

Comments

@sebastian-nagel

With PyWB 2.x, every result record contains two extra fields, "source" and "source-coll", which are absent from the original index, e.g.

{
  "url": "http://commoncrawl.org/",
  "mime": "text/html",
  "mime-detected": "text/html",
  "status": "200",
  "digest": "FM7M2JDBADOQIHKCSFKVTAML4FL2HPHT",
  "length": "5413",
  "offset": "42695747",
  "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027313617.6/warc/CC-MAIN-20190818042813-20190818064813-00014.warc.gz",
  "charset": "UTF-8",
  "languages": "eng",
  "source": "CC-MAIN-2019-35/indexes/cluster.idx",
  "source-coll": "CC-MAIN-2019-35"
}

This is redundant because the collection (a.k.a. "source") is explicitly queried, and it means about 20% more content with Content-Encoding "identity" (which is used in most requests). The 20% matters, given that the index server answers 10 million requests per month, sending multiple TiB of results.
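The per-record overhead can be estimated roughly from the example record above. The following is a minimal sketch, not a measurement from the index server: it assumes the record is serialized with Python's default `json.dumps` separators, and the helper name `serialized_size` is made up for illustration. Real responses also carry the URL key and timestamp, so the actual ratio will differ somewhat.

```python
import json

# The example record from this issue, including the two extra fields.
record = {
    "url": "http://commoncrawl.org/",
    "mime": "text/html",
    "mime-detected": "text/html",
    "status": "200",
    "digest": "FM7M2JDBADOQIHKCSFKVTAML4FL2HPHT",
    "length": "5413",
    "offset": "42695747",
    "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027313617.6/warc/"
                "CC-MAIN-20190818042813-20190818064813-00014.warc.gz",
    "charset": "UTF-8",
    "languages": "eng",
    "source": "CC-MAIN-2019-35/indexes/cluster.idx",
    "source-coll": "CC-MAIN-2019-35",
}

EXTRA_FIELDS = ("source", "source-coll")


def serialized_size(rec):
    """Size in bytes of the record as uncompressed ("identity") JSON."""
    return len(json.dumps(rec).encode("utf-8"))


# Same record without the two fields added by PyWB 2.x.
stripped = {k: v for k, v in record.items() if k not in EXTRA_FIELDS}

extra_bytes = serialized_size(record) - serialized_size(stripped)
overhead = extra_bytes / serialized_size(stripped)
print(f"extra bytes per record: {extra_bytes}, relative overhead: {overhead:.1%}")
```

For this one record the overhead comes out in the same range as the 20% quoted above; the exact figure depends on the length of the filename and URL.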

Note: there is a nosource param in BaseAggregator; it must be passed permanently or made configurable in config.yaml.
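To illustrate the intended behaviour, here is a minimal sketch of a configurable filter. It is not PyWB's implementation: the function `strip_source_fields`, the `config` dict (standing in for a parsed config.yaml), and the top-level `nosource` key are all assumptions made for this example.

```python
EXTRA_FIELDS = ("source", "source-coll")


def strip_source_fields(records, nosource=True):
    """Yield result records, dropping the extra fields when nosource is set.

    Hypothetical helper for illustration; PyWB's BaseAggregator handles
    the nosource param internally.
    """
    for rec in records:
        if nosource:
            rec = {k: v for k, v in rec.items() if k not in EXTRA_FIELDS}
        yield rec


# Stand-in for a parsed config.yaml; the key name is an assumption.
config = {"nosource": True}

results = [
    {
        "url": "http://commoncrawl.org/",
        "status": "200",
        "source": "CC-MAIN-2019-35/indexes/cluster.idx",
        "source-coll": "CC-MAIN-2019-35",
    }
]

filtered = list(strip_source_fields(results, nosource=config.get("nosource", False)))
print(filtered)
```

Defaulting the flag to off in `config.get` keeps the upstream behaviour unless the deployment opts in, which is one way to make the setting configurable rather than hard-coded.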

@sebastian-nagel sebastian-nagel added the pywb2 Upgrade to PyWB 2 label Sep 26, 2019
@sebastian-nagel
Author

Addressed in commoncrawl/pywb@00a84c9
