This repository was archived by the owner on Oct 29, 2019. It is now read-only.

Commit 6392146

test(Sanitize): add initial test for record sanitization
working on getting formats like docs & pdf's to properly encode & decode to warcs
1 parent 2c1a114 commit 6392146

9 files changed: +133 -642 lines changed

Diff for: README.md

+35
@@ -0,0 +1,35 @@
+# warc
+[![GitHub](https://img.shields.io/badge/project-Data_Together-487b57.svg?style=flat-square)](http://github.com/datatogether)
+[![Slack](https://img.shields.io/badge/slack-Archivers-b44e88.svg?style=flat-square)](https://archivers-slack.herokuapp.com/)
+[![GoDoc](https://godoc.org/github.com/datatogether/warc?status.svg)](http://godoc.org/github.com/datatogether/warc)
+[![License](https://img.shields.io/github/license/mashape/apistatus.svg)](./LICENSE)
+
+warc is an implementation of ISO 28500 (WARC 1.0), the Web ARChive specification.
+It provides readers, writers, and structs for working with WARC records.
+
+From the spec:
+> The WARC (Web ARChive) file format offers a convention for concatenating
+multiple resource records (data objects), each consisting of a set of
+simple text headers and an arbitrary data block into one long file. The
+WARC format is an extension of the ARC File Format [ARC] that has
+traditionally been used to store "web crawls" as sequences of content
+blocks harvested from the World Wide Web. Each capture in an ARC file is
+preceded by a one-line header that very briefly describes the harvested
+content and its length. This is directly followed by the retrieval
+protocol response messages and content. The original ARC format file is
+used by the Internet Archive (IA) since 1996 for managing billions of
+objects, and by several national libraries.
+package warc
+
+## License & Copyright
+
+[Affero General Public License v3](http://www.gnu.org/licenses/agpl.html)
+
+## Getting Involved
+
+We would love involvement from more people! If you notice any errors or would like to submit changes, please see our [Contributing Guidelines](./.github/CONTRIBUTING.md).
+
+We use GitHub issues for [tracking bugs and feature requests](https://github.com/datatogether/REPONAME/issues) and Pull Requests (PRs) for [submitting changes](https://github.com/datatogether/REPONAME/pulls).
+
+## Usage
+`import "github.com/datatogether/warc"`

Diff for: doc.go

+16
@@ -0,0 +1,16 @@
+// Package warc is an implementation of ISO 28500 (WARC 1.0), the Web ARChive specification.
+// It provides readers, writers, and structs for working with WARC records.
+// From the spec:
+//
+// The WARC (Web ARChive) file format offers a convention for concatenating
+// multiple resource records (data objects), each consisting of a set of
+// simple text headers and an arbitrary data block into one long file. The
+// WARC format is an extension of the ARC File Format [ARC] that has
+// traditionally been used to store "web crawls" as sequences of content
+// blocks harvested from the World Wide Web. Each capture in an ARC file is
+// preceded by a one-line header that very briefly describes the harvested
+// content and its length. This is directly followed by the retrieval
+// protocol response messages and content. The original ARC format file is
+// used by the Internet Archive (IA) since 1996 for managing billions of
+// objects, and by several national libraries.
+package warc

Diff for: reader.go

+1 -1
@@ -195,7 +195,7 @@ func readBlockBody(data []byte) ([]byte, error) {
 	if start == -1 {
 		return data, nil
 	}
-	return data[start:], nil
+	return data[start+1:], nil
 }
 
 const (
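The one-character change to `readBlockBody` matters because slicing at a separator's index keeps the separator in the result. The diff does not show how `start` is computed, so the following is only a sketch under an assumption: that `start` is the index of a single delimiter byte. The helper names `bodyFrom` and `bodyAfter` and the newline delimiter are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
)

// bodyFrom mirrors the old slicing: the result begins at the separator,
// so the separator byte itself is included.
func bodyFrom(data []byte, sep byte) []byte {
	start := bytes.IndexByte(data, sep)
	if start == -1 {
		return data
	}
	return data[start:]
}

// bodyAfter mirrors the new slicing: the result begins one byte past the
// separator. start+1 is at most len(data), so the slice is always valid.
func bodyAfter(data []byte, sep byte) []byte {
	start := bytes.IndexByte(data, sep)
	if start == -1 {
		return data
	}
	return data[start+1:]
}

func main() {
	data := []byte("header\nbody")
	fmt.Printf("%q\n", bodyFrom(data, '\n'))  // separator retained
	fmt.Printf("%q\n", bodyAfter(data, '\n')) // separator dropped
}
```

For binary payloads, a stray leading byte like this would corrupt the decoded block, which fits the commit's stated goal of round-tripping docs and PDFs.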

Diff for: reader_test.go

+6
@@ -1,7 +1,9 @@
 package warc
 
 import (
+	"io/ioutil"
 	"os"
+	"path/filepath"
 	"testing"
 )
 
@@ -34,3 +36,7 @@ func TestReadAll(t *testing.T) {
 	// fmt.Println(r.Type().String())
 	// }
 }
+
+func readTestFile(path string) ([]byte, error) {
+	return ioutil.ReadFile(filepath.Join("testdata", path))
+}
