From b64550d3c37902bde3b22e0c690174218e7994d1 Mon Sep 17 00:00:00 2001 From: Tyler Nickerson Date: Tue, 25 May 2021 16:17:25 -0700 Subject: [PATCH] feat(*): added in-repo docs (#22) --- README.md | 96 ++++++++++++++++++++++++++++++++++++--- cli/merge.go | 7 +-- docs/api.md | 121 ++++++++++++++++++++++++++++++++++++++++++++++++++ docs/cli.md | 67 ++++++++++++++++++++++++++++ docs/odxml.md | 62 ++++++++++++++++++++++++++ go/write.go | 7 ++- 6 files changed, 349 insertions(+), 11 deletions(-) create mode 100644 docs/api.md create mode 100644 docs/cli.md create mode 100644 docs/odxml.md diff --git a/README.md b/README.md index dc0a1b80..69bd88d7 100644 --- a/README.md +++ b/README.md @@ -7,12 +7,96 @@ -The Open Dictionary Project (ODict for short), is an open-source alternative to proprietary dictionary file formats such -as [Babylon](http://www.babylon-software.com/free-dictionaries/) and -[Apple Dictionaries](https://developer.apple.com/library/content/documentation/UserExperience/Conceptual/DictionaryServicesProgGuide/Introduction/Introduction.html). +The Open Dictionary Project (ODict for short) is an open-source alternative to proprietary dictionary file formats such as Babylon and Apple Dictionaries. Like other dictionary formats, Open Dictionary files are converted from XML to compressed binary files that can be easily indexed, searched, and modified using the Open Dictionary [CLI tool](docs/cli.md) or via its open-source Go bindings. -Similar to Apple dictionaries, Open Dictionary files are converted from XML (sometimes referred to as ODXML) to compressed, serialized, bite-sized files. Originally written in C++, ODict has since been ported to Go for portability and maintainability purposes. Each compiled dictionary consists some basic header information, as well as a [Snappy](https://github.com/google/snappy)-compressed [Flatbuffers](https://github.com/google/flatbuffers) that contains all of the dictionary's entries and definitions. 
+## Motivation -The ODict CLI uses [Bleve](https://github.com/blevesearch/bleve) to perform ad-hoc indexing on the local file system for rapid full-text searching of entries. ODict has a number of sister repos of varying completeness. As of this writing, there is a pretty comprehensive [Java port](https://github.com/TheOpenDictionary/odict-java) of the project as well as an example of how to use the ODict CGo extension [in Python](https://github.com/TheOpenDictionary/freedict/blob/master/odictlib.py). +So your first thought is most likely, why would I need this? Why would someone take the time to build this? Who gives a shit about dictionaries? -Full documentation available at https://odict.org. +Turns out, a lot of people. Amazing products like [freedict](https://freedict.org), [Wiktionary](https://wiktionary.org), and [Linguee](https://linguee.com) are aimed at providing people with completely free dictionaries to aid NLP research and language learning, and to simply share knowledge. The only issue is... this data is either only accessible online, not machine readable, not well structured, or some combination of the three. + +So... how does ODict help? + +ODict was developed out of the need for a fast, readily available, open, and cross-platform alternative to dictionary formats such as Apple Dictionaries, Babylon, Dictd, StarDict and others. Current formats are designed to work with specific applications (usually developed by the same company that made the format), and as a result are somewhat uni-directional (there is official documentation on how to write a dictionary in their format, but not on how to read one). This forces users to write dictionaries that only work with one specific dictionary app. + +Certain formats, like StarDict or Slab, can be read by multiple dictionary apps but, like most dictionary formats, are HTML-based. The data inside is not structured, and the HTML markup might be inconsistent between dictionaries. 
+ +_Wouldn't it be nice if there was a completely open-source format, with documentation on both reading and writing the format, that anyone could use in **any** dictionary app?_ + +That's where ODict comes in. + +## Key Advantages + +1. **ODict is open-source.** Yep. 100% available for anyone to help build or extend, not to mention 100% free. Most development happens in the repository you're looking at right now! + +2. **ODict is layout agnostic.** Because ODict is structured lexical data as opposed to a bunch of indexed HTML, you can display ODict dictionary entries in any way you want, without any special CSS selectors. Both the CLI and language-specific ODict bindings return JSON for all fuzzy searches and entry lookups. + +3. **ODict is tiny (relatively speaking).** Because ODict stores compressed binary data instead of large HTML text bodies, ODict files can be expected to be smaller than your standard dictionary files. + +## ODXML + +ODict XML (ODXML) is the XML variant used to define and structure ODict dictionaries. All compiled ODict dictionaries originate from ODXML files, and can easily be converted back to XML via the CLI [`dump` command](docs/cli.md#dumping-dictionaries). + +An example dictionary might look like this: + +```xml + + + + + + + + + + A cute pupper + + + Common way of saying a dog is a cutie + + + + + +``` + +For more info on writing ODXML, check out the [official specification](docs/odxml.md). + +## Language Bindings + +ODict can currently be used in [Go](docs/api.md#go), [Java](docs/api.md#java), and [Python](docs/api.md#python). If you're just interested in compiling or reading dictionaries without writing any code, you can just use the official command-line tool below. + +## CLI + +The ODict command-line interface (CLI) is a Go program you can execute in any terminal to create, dump, merge, search, and index ODict dictionaries. 
+ +The CLI is currently distributed primarily through its [Homebrew](https://brew.sh) formula: + +``` +$ brew tap TheOpenDictionary/odict +$ brew install odict +``` + +While you most likely would interface with dictionaries via a language-specific library, the CLI exists as a convenience tool that can be used to help debug or rapidly produce new dictionaries. For full docs on the CLI, [see here](docs/cli.md). + +## File Format (For Nerds) + +> **NOTE:** You can probably skip this section unless you're trying to debug changes to this code-base, or are writing an ODict parser in a language not currently supported. + +Compiled .odict files are relatively straightforward, as they utilize the [ODict Flatbuffer schema](schema/schema.fbs). + +The buffer generated by this schema takes up over 90% of the compiled file; however, additional header information still exists. The table below illustrates the full breakdown of a compiled .odict file, in the order in which the values are written to the file. + +All values are written in little-endian byte order. + +| Name | Type | Bytes | Description | +| -------------- | ------- | -------- | ------------------------------------------------------------------------------------------------------- | +| Signature | char[6] | 6 | Signature for the ODict format. Assertions fail if this signature is missing. Should always be `ODICT`. | +| Version | ushort | 2 | Represents the major version of ODict with which the file was created. | +| Content Length | long | 8 | Size (in bytes) of the compressed content to read. Used in assertions to validate file length. | +| Content | []byte | Variable | Snappy-compressed FlatBuffer object. Must be decompressed by Snappy before it can be used. 
| + +A design decision was made to keep the structural data of the ODict format as a cross-platform [Flatbuffers](https://google.github.io/flatbuffers/) schema as opposed to simply encoding a +Go struct so that the format could be used by anyone, even without necessarily using any of the core ODict libraries. + +For an example of how the files are written, you can look at the official [Go code that does so](go/read.go). diff --git a/cli/merge.go b/cli/merge.go index 2957ca68..b049f10d 100644 --- a/cli/merge.go +++ b/cli/merge.go @@ -10,9 +10,10 @@ import ( func merge(c *cli.Context) error { inputFile1 := c.Args().Get(0) inputFile2 := c.Args().Get(1) + outputFile := c.Args().Get(2) - if len(inputFile1) == 0 || len(inputFile2) == 0 { - return fmt.Errorf("Usage: odict merge [dictionary1] [dictionary2]") + if len(inputFile1) == 0 || len(inputFile2) == 0 || len(outputFile) == 0 { + return fmt.Errorf("Usage: odict merge [dictionary1] [dictionary2] [outputFile]") } t(func() { @@ -21,7 +22,7 @@ func merge(c *cli.Context) error { result := odict.MergeDictionaries(dict1, dict2) - fmt.Println(odict.DumpDictionary(result)) + odict.CreateODictFile(outputFile, result) }) return nil diff --git a/docs/api.md b/docs/api.md new file mode 100644 index 00000000..5678d187 --- /dev/null +++ b/docs/api.md @@ -0,0 +1,121 @@ +# Using the API + +## Installing + +Currently, it is only possible to use language bindings from another Bazel project, as the ODict JAR is not yet on Maven Central and Python's dependency on the shared ODict library makes it difficult to distribute through `pip`. Fortunately, setting up ODict in another Bazel project is easy. 
+ +Just add the following to your `WORKSPACE` file: + +```python +http_archive( + name = "odict", + sha256 = "b58fd3432a6f84865c67a16ef6718be12ecd6b9b32c12dfd917c0a899807062f", + strip_prefix = "odict-1.4.5", + url = "https://github.com/TheOpenDictionary/odict/archive/1.4.5.tar.gz", +) + +load("@odict//bazel:odict_deps.bzl", "odict_deps") + +odict_deps() + +load("@odict//bazel:odict_extra_deps.bzl", "odict_extra_deps") + +odict_extra_deps() +``` + +then require either `@odict//java` or `@odict//python` in your respective Bazel rules. + +If any API usage is unclear, you may be able to get a better idea of how to use the APIs by looking at ODict's unit tests. + +## Go + +ODict is built in Go, so naturally it supports a public API out of the box: + +```go +import ( + odict "github.com/TheOpenDictionary/odict/go" +) + +func main() { + // Write a dictionary from a local ODXML file + odict.CompileDictionary("mydict.xml") + + // Write an XML string to a binary + odict.WriteDictionary("", "mydict.odict") + + // Read a compiled dictionary into memory + dict := odict.ReadDictionary("mydict.odict") + + odict.IndexDictionary( + "mydict.odict", + false, // Set to "true" to force-index + ) + + // Search an indexed dictionary by ID + results := odict.SearchDictionary( + dict.id, + "dog", + false, // Set to "true" if you need to match the given word exactly + ) +} + +``` + +## Java + +While a [standalone Java client](https://github.com/TheOpenDictionary/odict-java) for ODict _used_ to exist, it has since been superseded by a Java binding that uses the ODict core's cgo dynamic library that lives in this repo. + +Fortunately, the new ODict Java interface is extremely easy to use and will always stay up-to-date with the latest upstream changes to the ODict format. 
+ +```java +// Import statement +import org.odict.Dictionary; + +void main() { + // Compile a dictionary + Dictionary.compile("path/to/file"); + + // Write a new dictionary + Dictionary.write("an XML string", "path/to/output.odict"); + + // Load a dictionary + Dictionary dict = new Dictionary("path/to/dictionary.odict"); + + // Lookup an entry by word + System.out.println(dict.lookup("giraffe")); + + // Index the dictionary + dict.index(); + + // Perform a fuzzy-search + System.out.println(dict.search("full text")); +} +``` + +## Python + +The Python interface for ODict is similar to that of Java and Go: + +```python +# Import statement +from python.odict import Dictionary + +def main(): + # Compile a dictionary + Dictionary.compile("path/to/file") + + # Write a new dictionary + Dictionary.write("an XML string", "path/to/output.odict") + + # Load a dictionary + dict = Dictionary("path/to/dictionary.odict") + + # Lookup an entry by word + print(dict.lookup("giraffe")) + + # Index the dictionary + dict.index() + + # Perform a fuzzy-search + print(dict.search("full text")) +``` diff --git a/docs/cli.md b/docs/cli.md new file mode 100644 index 00000000..2eeea04e --- /dev/null +++ b/docs/cli.md @@ -0,0 +1,67 @@ +# CLI Reference + +## Creating Dictionaries + +To create a dictionary, you'll first need to write a dictionary using the ODict [markup language](odxml.md) and save it as an ".xml" file. Once you're confident your file is in the correct format, compiling it to an ".odict" dictionary is as simple as running: + +``` +$ odict c mydictionary.xml +``` + +The output file will always be a corresponding ".odict" dictionary that appears in the same directory as the source file. + +## Searching Dictionaries + +There are two ways to search a compiled .odict file: via a case-insensitive entry lookup, which outputs a single entry, or via a full text fuzzy search, which will output an array of matching entries. + +Let's look at each respectively. 
+ +### Entry Lookup + +Looking up entries is super-duper easy. Just run: + +``` +$ odict l mydictionary.odict "word" +``` + +and a full-bodied JSON object will print out if there is a match. + +### Fuzzy Search + +Fuzzy searching uses [Bleve](https://blevesearch.com/) under the hood, so an index of your dictionary is required before searching. Dictionary indexes are stored in a temporary directory that differs depending on your OS, so you'll need to re-index the dictionary on any new device you decide to access it on. Indexing can take quite some time if you have a particularly large dictionary. + +There are two ways to index a dictionary. You can index it ahead of time by running: + +``` +$ odict i mydictionary.odict +``` + +or you can index it before you run your first search query by passing an `-i` flag to the search command: + +``` +$ odict s -i mydictionary.odict "my query" +``` + +If you omit this flag, ODict will automatically use the correct index for the provided file if one already exists. Each .odict file has a unique identifier baked into it, so once you have indexed a dictionary, ODict will always know where to find that index in the future. + +## Dumping Dictionaries + +Often while developing an ODict application, it may be helpful to understand the underlying structure of the dictionary at hand without picking through code. As a result, the ODict CLI has a `dump` command which can be used to convert a compiled binary back into a rough approximation of its [original XML](odxml.md). I say rough approximation because the library might add back certain XML attributes or ID fields that were not present in the original document used to create the file. + +Using `dump` is easy: + +``` +$ odict d mydictionary.odict outputfile.xml +``` + +## Merging Dictionaries + +The ODict CLI also has the ability to merge two compiled dictionaries and blend their definitions together. 
This feature uses the [mergo](https://github.com/imdario/mergo) library to merge the underlying dictionary structs and is currently not as customizable as it should be. Right now ODict performs a full merge, so you may wind up with an entry with duplicate definitions if your two dictionaries contain similar definitions for the same word. + +However, you can always `dump` the merged file, edit it, then re-compile it. + +To merge two dictionaries, run: + +``` +$ odict m mydictionary1.odict mydictionary2.odict output.odict +``` diff --git a/docs/odxml.md b/docs/odxml.md new file mode 100644 index 00000000..60d4ab5e --- /dev/null +++ b/docs/odxml.md @@ -0,0 +1,62 @@ +# ODXML + +ODict XML is the XML variant used to define and structure ODict dictionaries. All compiled ODict dictionaries originate from ODXML files, and can easily be converted back to XML via the CLI [`dump` command](cli.md#dumping-dictionaries). + +An example dictionary might look like this: + +```xml + + + + + + + + + + A cute pupper + + + Common way of saying a dog is a cutie + + + + + +``` + +Pretty easy to read, right? + +Now let's break this down. + +--- + +## `<dictionary>` + +A dictionary node sits at the base of every source file, and a file will not compile without one. ODict looks for these nodes by default when compiling. + +### Attributes + +| Name | Description | Required? | +| ---- | ------------------------------------- | --------- | +| name | A descriptive name for the dictionary | :x: | + +### Children + +- [`<entry>`](#entry) + +--- + +## `<entry>` + +Entries are the primary entry point to the dictionary and represent **unique terms**. They are used as lookup keys internally by ODict, so it is important there are no duplicate entries. + +### Attributes + +| Name | Description | Required? 
| +| ---- | ----------------------------- | ------------------ | +| term | The word the entry represents | :white_check_mark: | + +### Children + +- [``](#entry) diff --git a/go/write.go b/go/write.go index 6a9cd9b3..e3d6eb21 100644 --- a/go/write.go +++ b/go/write.go @@ -270,7 +270,9 @@ func dictionaryToBytes(dictionary Dictionary) []byte { return builder.FinishedBytes() } -func createODictFile(outputPath string, dictionary Dictionary) { +// CreateODictFile writes a new .odict binary from a +// Dictionary struct +func CreateODictFile(outputPath string, dictionary Dictionary) { dictionaryBytes := dictionaryToBytes(dictionary) compressed := snappy.Encode(nil, dictionaryBytes) file, err := os.Create(outputPath) @@ -310,7 +312,7 @@ func createODictFile(outputPath string, dictionary Dictionary) { // WriteDictionary generates an ODict binary file given // a ODXML input file path func WriteDictionary(xmlStr, outputPath string) { - createODictFile(outputPath, xmlToDictionary(xmlStr)) + CreateODictFile(outputPath, xmlToDictionary(xmlStr)) } // CompileDictionary compiles an XML file into an ODict binary @@ -327,5 +329,6 @@ func CompileDictionary(xmlPath string) { xmlStr, err := ioutil.ReadAll(xmlFile) Check(err) + WriteDictionary(string(xmlStr), outputPath) }
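
The header layout in the README's "File Format" table (signature, version, content length, then content, all little-endian) can be sketched as a small standalone parser. This is a minimal sketch and not the code in `go/write.go` or the ODict reader: `Header` and `ParseHeader` are hypothetical names, the sixth signature byte is assumed to be padding after `ODICT`, and the 8-byte length is treated as unsigned. Snappy decompression and FlatBuffers decoding of the returned content are left out.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// Header holds the fixed-size fields from the File Format table.
// The type and field names are illustrative, not part of the ODict API.
type Header struct {
	Version       uint16 // major ODict version that wrote the file
	ContentLength uint64 // size in bytes of the compressed content
}

const (
	signatureSize = 6                     // char[6]
	headerSize    = signatureSize + 2 + 8 // signature + ushort + long
)

// ParseHeader validates the signature, reads the version and content
// length, and returns the still-compressed content that follows them.
func ParseHeader(data []byte) (Header, []byte, error) {
	if len(data) < headerSize {
		return Header{}, nil, fmt.Errorf("file too short: %d bytes", len(data))
	}
	// Assumption: the 6-byte field is "ODICT" followed by one padding byte.
	if !bytes.HasPrefix(data[:signatureSize], []byte("ODICT")) {
		return Header{}, nil, fmt.Errorf("missing ODICT signature")
	}
	h := Header{
		Version:       binary.LittleEndian.Uint16(data[6:8]),
		ContentLength: binary.LittleEndian.Uint64(data[8:16]),
	}
	content := data[headerSize:]
	// The table says the length is used to validate the file.
	if uint64(len(content)) != h.ContentLength {
		return Header{}, nil, fmt.Errorf("content length mismatch: header says %d, found %d",
			h.ContentLength, len(content))
	}
	return h, content, nil
}
```

A caller would pass the full contents of a `.odict` file to `ParseHeader`, then hand the returned bytes to a Snappy decoder before reading them with the FlatBuffers schema.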