Python script to create CDX index files of WARC data

travisfw/CDX-Writer

 
 


Python script to create CDX index files of WARC data.

The --format flag specifies the list of fields to include.

The format syntax is specified at the following locations, and the legend is copied below:

  • http://www.archive.org/web/researcher/cdx_legend.php

  • https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/cdx/format/CDXFormat.java

    The default first line of a CDX file is:

    CDX A b e a m s c k r V v D d g M n

    The letters used in dat files and cdx files are as follows:

    A canonized url
    B news group
    C rulespace category ***
    D compressed dat file offset
    F canonized frame
    G multi-column language description (* soon)
    H canonized host
    I canonized image
    J canonized jump point
    K Some weird FBIS what's changed kinda thing
    L canonized link
    M meta tags (AIF) *
    N massaged url
    P canonized path
    Q language string
    R canonized redirect
    U uniqueness ***
    V compressed arc file offset *
    X canonized url in other href tags
    Y canonized url in other src tags
    Z canonized url found in script
    a original url **
    b date **
    c old style checksum *
    d uncompressed dat file offset
    e IP **
    f frame *
    g file name
    h original host
    i image *
    j original jump point
    k new style checksum *
    l link *
    m mime type of original document *
    n arc document length *
    o port
    p original path
    r redirect *
    s response code *
    t title *
    v uncompressed arc file offset *
    x url in other href tags *
    y url in other src tags *
    z url found in script *
    # comment

    * in alexa-made dat file
    ** in alexa-made dat file meta-data line
    *** future data
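To make the legend concrete, here is a minimal Python sketch of how a CDX header line could be turned into a list of field letters and used to label the columns of a record. It is not taken from CDX-Writer: the names CDX_FIELDS, parse_cdx_header, describe_header and parse_cdx_record are hypothetical, the table covers only the letters of the default header, and the sketch assumes fields are separated by whitespace and contain no embedded spaces.

```python
# Hypothetical helpers for illustration only; not part of CDX-Writer.

# Descriptions for the letters in the default header, copied from the legend.
CDX_FIELDS = {
    "A": "canonized url",
    "b": "date",
    "e": "IP",
    "a": "original url",
    "m": "mime type of original document",
    "s": "response code",
    "c": "old style checksum",
    "k": "new style checksum",
    "r": "redirect",
    "V": "compressed arc file offset",
    "v": "uncompressed arc file offset",
    "D": "compressed dat file offset",
    "d": "uncompressed dat file offset",
    "g": "file name",
    "M": "meta tags (AIF)",
    "n": "arc document length",
}


def parse_cdx_header(header_line):
    """Return the field letters that follow the CDX marker.

    Assumption: the header is whitespace-separated.
    """
    tokens = header_line.split()
    if not tokens or tokens[0] != "CDX":
        raise ValueError("not a CDX header line: %r" % header_line)
    return tokens[1:]


def describe_header(header_line):
    """Pair each header letter with its legend description."""
    return [
        (letter, CDX_FIELDS.get(letter, "unknown"))
        for letter in parse_cdx_header(header_line)
    ]


def parse_cdx_record(header_line, record_line):
    """Label the columns of one record with the header letters.

    Assumption: the record has exactly one whitespace-separated
    value per header letter.
    """
    letters = parse_cdx_header(header_line)
    values = record_line.split()
    if len(values) != len(letters):
        raise ValueError("record has %d fields, header has %d"
                         % (len(values), len(letters)))
    return dict(zip(letters, values))


if __name__ == "__main__":
    default_header = " CDX A b e a m s c k r V v D d g M n"
    for letter, meaning in describe_header(default_header):
        print(letter, meaning)
```

The same approach would apply to a custom field list passed through --format, provided the letters come from the legend above; any letter missing from the small CDX_FIELDS table is reported as "unknown" rather than rejected.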
