feat(*): added in-repo docs (#22)
Nickersoft authored May 25, 2021
1 parent 4f52d28 commit b64550d
Showing 6 changed files with 349 additions and 11 deletions.
96 changes: 90 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,12 +7,96 @@

</div>

The Open Dictionary Project (ODict for short) is an open-source alternative to proprietary dictionary file formats such
as [Babylon](http://www.babylon-software.com/free-dictionaries/) and
[Apple Dictionaries](https://developer.apple.com/library/content/documentation/UserExperience/Conceptual/DictionaryServicesProgGuide/Introduction/Introduction.html).
The Open Dictionary Project (ODict for short) is an open-source alternative to proprietary dictionary file formats such as Babylon and Apple Dictionaries. Similar to other dictionary formats, Open Dictionary files are converted from XML to compressed binary files that can be easily indexed, searched, and modified using the Open Dictionary [CLI tool](cli.md) or via its open-source Go bindings.

Similar to Apple dictionaries, Open Dictionary files are converted from XML (sometimes referred to as ODXML) to compressed, serialized, bite-sized files. Originally written in C++, ODict has since been ported to Go for portability and maintainability purposes. Each compiled dictionary consists of some basic header information, as well as a [Snappy](https://github.com/google/snappy)-compressed [FlatBuffers](https://github.com/google/flatbuffers) buffer that contains all of the dictionary's entries and definitions.
## Motivation

The ODict CLI uses [Bleve](https://github.com/blevesearch/bleve) to perform ad-hoc indexing on the local file system for rapid full-text searching of entries. ODict has a number of sister repos of varying completeness. As of this writing, there is a pretty comprehensive [Java port](https://github.com/TheOpenDictionary/odict-java) of the project as well as an example of how to use the ODict CGo extension [in Python](https://github.com/TheOpenDictionary/freedict/blob/master/odictlib.py).
So your first thought is most likely, why would I need this? Why would someone take the time to build this? Who gives a shit about dictionaries?

Full documentation available at https://odict.org.
Turns out, a lot of people. Amazing products like [freedict](https://freedict.org), [Wiktionary](https://wiktionary.org), and [Linguee](https://linguee.com) are aimed at providing people with completely free dictionaries to aid NLP research, language learning, and just share knowledge. The only issue is... this data is either only accessible online, not machine readable, not well structured, or some combination of the three.

So... how does ODict help?

ODict was developed out of the need for a fast, readily available, and open solution to cross-platform dictionary formats such as Apple Dictionaries, Babylon, Dictd, StarDict and others. Current formats are specifically designed to work with specific applications (usually developed by the same company that made the format), and as a result are somewhat uni-directional (there is official documentation on how to write a dictionary in their format, but not on how to read one). This forces users to write dictionaries that only work with one specific dictionary app.

Certain formats, like StarDict or Slab, can be read by multiple dictionary apps, but like most dictionary formats they are HTML-based. The data inside is not structured, and the HTML markup might be inconsistent between dictionaries.

_Wouldn't it be nice if there was a completely open-source format, with documentation on both reading and writing the format, that anyone could use in **any** dictionary app?_

That's where ODict comes in.

## Key Advantages

1. **ODict is open-source.** Yep. 100% available for anyone to help build or extend, and not to mention 100% free. Most development happens in the repository you're looking at right now!

2. **ODict is layout agnostic.** Because ODict is structured lexical data as opposed to a bunch of indexed HTML, you can display ODict dictionary entries in any way you want, without any special CSS selectors. Both the CLI and language-specific ODict bindings return JSON for all fuzzy search and entry lookups.

3. **ODict is tiny (relatively speaking).** Seeing as we're not storing any large HTML text bodies and instead just storing compressed binary data, ODict files can be expected to be smaller than your standard dictionary files.

## ODXML

ODict XML (ODXML) is the XML used to define and structure ODict dictionaries. All compiled ODict dictionaries originate from ODXML files, and can easily be reverted back to XML via the CLI [`dump` command](cli.md#dumping-dictionaries).

An example dictionary might look like this:

```xml
<!-- Dictionary Root -->
<dictionary name="My Dictionary">
  <!-- Entry -->
  <entry term="Doggo">
    <!-- Etymology -->
    <ety>
      <!-- Usage (typically determined by part-of-speech) -->
      <usage pos="n">
        <!-- Definition -->
        <definition>A cute pupper</definition>
        <!-- Definition Group -->
        <group description="Slang for dog">
          <definition>Common way of saying a dog is a cutie</definition>
        </group>
      </usage>
    </ety>
  </entry>
</dictionary>
```

For more info on writing ODXML, check out the [official specification](docs/odxml.md).
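
As a rough illustration of how the nesting in the example above maps to data, the sketch below walks an ODXML string with Python's standard `xml.etree.ElementTree` module. It is not part of ODict; the function name `list_definitions` and the tuple layout it returns are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET

# The example dictionary from above, without the comments.
ODXML = """
<dictionary name="My Dictionary">
  <entry term="Doggo">
    <ety>
      <usage pos="n">
        <definition>A cute pupper</definition>
        <group description="Slang for dog">
          <definition>Common way of saying a dog is a cutie</definition>
        </group>
      </usage>
    </ety>
  </entry>
</dictionary>
"""


def list_definitions(odxml: str):
    """Return (term, part_of_speech, definition) tuples found in an ODXML string."""
    root = ET.fromstring(odxml)
    rows = []
    for entry in root.iter("entry"):
        for usage in entry.iter("usage"):
            # iter() visits every descendant, so definitions nested
            # inside a <group> are picked up as well.
            for definition in usage.iter("definition"):
                rows.append((entry.get("term"), usage.get("pos"), definition.text))
    return rows


if __name__ == "__main__":
    for row in list_definitions(ODXML):
        print(row)
```

Because the data is structured rather than pre-rendered HTML, a consumer can flatten or regroup it however the display layer needs.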

## Language Bindings

ODict can currently be used in [Go](docs/api.md#go), [Java](docs/api.md#java), and [Python](docs/api.md#python). If you're just interested in compiling or reading dictionaries without writing any code, you can just use the official command-line tool below.

## CLI

The ODict command-line interface (CLI) is a Go program you can execute in any terminal to create, dump, merge, search, and index ODict dictionaries.

The CLI's primary distribution channel is currently through its [Homebrew](https://brew.sh) formula:

```
$ brew tap TheOpenDictionary/odict
$ brew install odict
```

While you most likely would interface with dictionaries via a language-specific library, the CLI exists as a convenience tool that can be used to help debug or rapidly produce new dictionaries. For full docs on the CLI, [see here](docs/cli.md).

## File Format (For Nerds)

> **NOTE:** You can probably skip this section unless you're trying to debug changes to this codebase or are writing an ODict parser in a language not currently supported.

Compiled .odict files are relatively straightforward, as they utilize the [ODict Flatbuffer schema](../schema/schema.fbs).

The buffer generated by this schema takes up over 90% of the compiled file; however, additional header information still exists. The table below illustrates the full breakdown of a compiled .odict file, in the order in which the values are written to the file.

All values are written in little-endian byte order.

| Name | Type | Bytes | Description |
| -------------- | ------- | -------- | ------------------------------------------------------------------------------------------------------- |
| Signature | char[6] | 6 | Signature for the ODict format. Assertions fail if this signature is missing. Should always be `ODICT`. |
| Version | ushort | 2 | Represents the major version of ODict with which the file was created. |
| Content Length | long | 8 | Size (in bytes) of the compressed content to read. Used in assertions to validate file length. |
| Content | []byte | Variable | Snappy-compressed FlatBuffer object. Must be decompressed by Snappy before it can be used. |
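
For anyone writing a parser in another language, the header layout in the table above can be read roughly as follows. This is a minimal Python sketch, not the official reader: the function name `parse_odict_header` is hypothetical, treating `long` as a signed 64-bit integer is an assumption, and so is the idea that the sixth signature byte is NUL padding. Snappy decompression needs a third-party package and is left out, so the function returns the still-compressed content.

```python
import struct

# Layout from the table: char[6] signature, ushort version, long content
# length, all little-endian. Reading "long" as a signed 64-bit integer is
# an assumption of this sketch.
HEADER = struct.Struct("<6sHq")


def parse_odict_header(data: bytes):
    """Split raw .odict bytes into (version, compressed_content)."""
    if len(data) < HEADER.size:
        raise ValueError("file is too short to contain an ODict header")
    signature, version, content_length = HEADER.unpack_from(data, 0)
    # char[6] leaves one spare byte after the five letters "ODICT";
    # stripping trailing NULs is an assumption about how it is filled.
    if signature.rstrip(b"\x00") != b"ODICT":
        raise ValueError("missing ODICT signature")
    content = data[HEADER.size:HEADER.size + content_length]
    if len(content) != content_length:
        raise ValueError("content length does not match file size")
    return version, content
```

The two checks mirror the assertions described in the table: one on the signature and one on the declared content length.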

A design decision was made to keep the structural data of the ODict format as a cross-platform [Flatbuffers](https://google.github.io/flatbuffers/) schema as opposed to simply encoding a
Go struct so that the format could be used by anyone, even without necessarily using any of the core ODict libraries.

For an example of how the files are written, you can look at the official [Go code that does so](go/read.go).
7 changes: 4 additions & 3 deletions cli/merge.go
Original file line number Diff line number Diff line change
Expand Up @@ -10,9 +10,10 @@ import (
func merge(c *cli.Context) error {
inputFile1 := c.Args().Get(0)
inputFile2 := c.Args().Get(1)
outputFile := c.Args().Get(2)

if len(inputFile1) == 0 || len(inputFile2) == 0 {
return fmt.Errorf("Usage: odict merge [dictionary1] [dictionary2]")
if len(inputFile1) == 0 || len(inputFile2) == 0 || len(outputFile) == 0 {
return fmt.Errorf("Usage: odict merge [dictionary1] [dictionary2] [outputFile]")
}

t(func() {
Expand All @@ -21,7 +22,7 @@ func merge(c *cli.Context) error {

result := odict.MergeDictionaries(dict1, dict2)

fmt.Println(odict.DumpDictionary(result))
odict.CreateODictFile(outputFile, result)
})

return nil
Expand Down
121 changes: 121 additions & 0 deletions docs/api.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,121 @@
# Using the API

## Installing

Currently, it is only possible to use language bindings from another Bazel project, as the ODict JAR is not yet on Maven Central and Python's dependency on the shared ODict library makes it difficult to distribute through `pip`. Fortunately, setting up ODict in another Bazel project is easy.

Just add the following to your `WORKSPACE` file:

```python
http_archive(
    name = "odict",
    sha256 = "b58fd3432a6f84865c67a16ef6718be12ecd6b9b32c12dfd917c0a899807062f",
    strip_prefix = "odict-1.4.5",
    url = "https://github.com/TheOpenDictionary/odict/archive/1.4.5.tar.gz",
)

load("@odict//bazel:odict_deps.bzl", "odict_deps")

odict_deps()

load("@odict//bazel:odict_extra_deps.bzl", "odict_extra_deps")

odict_extra_deps()
```

then require either `@odict//java` or `@odict//python` in your respective Bazel rules.
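
For instance, a Python target might declare the dependency along these lines. This is only a sketch of a `BUILD` file: the target name `lookup_tool` and the source file `lookup_tool.py` are hypothetical, and only the `@odict//python` label comes from the text above.

```python
# BUILD (hypothetical target and file names)
py_binary(
    name = "lookup_tool",
    srcs = ["lookup_tool.py"],
    deps = ["@odict//python"],
)
```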

If any API usage is unclear, you may be able to get a better idea of how to use the APIs by looking at ODict's unit tests.

## Go

ODict is built in Go, so naturally it supports a public API out-of-the-box:

```go
import (
	odict "github.com/TheOpenDictionary/odict/go"
)

func main() {
	// Write a dictionary from a local ODXML file
	odict.CompileDictionary("mydict.xml")

	// Write an XML string to a binary
	odict.WriteDictionary("<dictionary></dictionary>", "mydict.odict")

	// Read a compiled dictionary into memory
	dict := odict.ReadDictionary("mydict.odict")

	// Index the dictionary so it can be searched
	odict.IndexDictionary(
		"mydict.odict",
		false, // Set to "true" to force-index
	)

	// Search an indexed dictionary by ID
	results := odict.SearchDictionary(
		dict.id,
		"dog",
		false, // Set to "true" if you need to match the given word exactly
	)
}

```

## Java

While a [standalone Java client](https://github.com/TheOpenDictionary/odict-java) for ODict _used_ to exist, it has since been superseded by a Java binding that uses the ODict core's cgo dynamic library that lives in this repo.

Fortunately, the new ODict Java interface is extremely easy to use and will always stay up-to-date with the latest upstream changes to the ODict format.

```java
// Import statement
import org.odict.Dictionary;

void main() {
    // Compile a dictionary
    Dictionary.compile("path/to/file");

    // Write a new dictionary
    Dictionary.write("an XML string", "path/to/output.odict");

    // Load a dictionary
    Dictionary dict = new Dictionary("path/to/dictionary.odict");

    // Lookup an entry by word
    System.out.println(dict.lookup("giraffe"));

    // Index the dictionary
    dict.index();

    // Perform a fuzzy-search
    System.out.println(dict.search("full text"));
}
```

## Python

The Python interface for ODict is similar to that of Java and Go:

```python
# Import statement
from python.odict import Dictionary

def main():
    # Compile a dictionary
    Dictionary.compile("path/to/file")

    # Write a new dictionary
    Dictionary.write("an XML string", "path/to/output.odict")

    # Load a dictionary
    dict = Dictionary("path/to/dictionary.odict")

    # Lookup an entry by word
    print(dict.lookup("giraffe"))

    # Index the dictionary
    dict.index()

    # Perform a fuzzy-search
    print(dict.search("full text"))
```
67 changes: 67 additions & 0 deletions docs/cli.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,67 @@
# CLI Reference

## Creating Dictionaries

To create a dictionary, you'll first need to write a dictionary using the ODict [markup language](odxml.md) and save it as an ".xml" file. Once you're confident your file is in the correct format, compiling it to an ".odict" dictionary is as simple as running:

```
$ odict c mydictionary.xml
```

The output file will always be a corresponding ".odict" dictionary that appears in the same directory as the source file.
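
If you script around the CLI, the output location can be predicted from the rule above. A tiny sketch, assuming "corresponding" means the same file name with its extension swapped for `.odict` (the helper name `compiled_path` is made up for this example):

```python
from pathlib import Path


def compiled_path(source: str) -> Path:
    """Guess where `odict c` writes its output: same directory, ".odict" suffix."""
    return Path(source).with_suffix(".odict")
```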

## Searching Dictionaries

There are two ways to search a compiled .odict file: via a case-insensitive entry lookup, which outputs a single entry, or via a full-text fuzzy search, which outputs an array of matching entries.

Let's look at each respectively.

### Entry Lookup

Looking up entries is super-duper easy. Just run:

```
$ odict l mydictionary.odict "word"
```

and a full-bodied JSON object will print out if there is a match.
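
Since the output is JSON, it is straightforward to consume from a script. The sketch below shells out to the `odict l` command shown above and decodes whatever it prints. The wrapper name `lookup` and its injectable `run` parameter are assumptions of this example, and so is treating empty output as "no match", since the CLI's behaviour on a miss is not described here.

```python
import json
import subprocess


def lookup(dictionary_path: str, word: str, run=subprocess.run):
    """Run `odict l <dictionary> <word>` and decode the JSON it prints.

    `run` is injectable so the function can be exercised without the CLI installed.
    """
    completed = run(
        ["odict", "l", dictionary_path, word],
        capture_output=True,
        text=True,
        check=True,
    )
    output = completed.stdout.strip()
    # Empty output is treated as "no match" (an assumption of this sketch).
    return json.loads(output) if output else None
```

The shape of the returned object depends on the ODict version, so inspect a real lookup before relying on specific keys.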

### Fuzzy Search

Fuzzy searching uses [Bleve](https://blevesearch.com/) under the hood, so an index of your dictionary is required before searching. Dictionary indexes are stored in a temporary directory that differs depending on your OS, so you'll need to re-index the dictionary on any new device you decide to access it on. Indexing can take quite some time if you have a particularly large dictionary.

There are two ways to index a dictionary. You can index it ahead of time by running:

```
$ odict i mydictionary.odict
```

or you can index it before you run your first search query by passing an `-i` flag to the search command:

```
$ odict s -i mydictionary.odict "my query"
```

If you omit this flag, ODict will automatically use the correct index for the provided file if one already exists. Each .odict file has a unique identifier baked into it, so once you have indexed a dictionary, ODict will always know where to find that index in the future.

## Dumping Dictionaries

Oftentimes while developing an ODict application, it may be helpful to understand the underlying structure of the dictionary at hand without picking through code. For this reason, the ODict CLI has a `dump` command which can be used to convert a compiled binary back into a rough estimation of its [original XML](odxml.md). I say rough estimation because the library might add back certain XML attributes or ID fields that were not present in the original document used to create the file.

Using `dump` is easy:

```
$ odict d mydictionary.odict outputfile.xml
```

## Merging Dictionaries

The ODict CLI also has the ability to merge two compiled dictionaries and blend their definitions together. This feature uses the [mergo](https://github.com/imdario/mergo) library to merge the underlying dictionary structs and is currently not as customizable as it should be. Right now ODict performs a full merge, so you may wind up with an entry with duplicate definitions if your two dictionaries contain similar definitions for the same word.

However, you can always `dump` the merged file, edit it, then re-compile it.

To merge two dictionaries, run:

```
$ odict m mydictionary1.odict mydictionary2.odict output.odict
```
62 changes: 62 additions & 0 deletions docs/odxml.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,62 @@
# ODXML

ODict XML is the XML variant used to define and structure ODict dictionaries. All compiled ODict dictionaries originate from ODXML files, and can easily be reverted back to XML via the CLI [`dump` command](cli.md#dumping-dictionaries).

An example dictionary might look like this:

```xml
<!-- Dictionary Root -->
<dictionary name="My Dictionary">
  <!-- Entry -->
  <entry term="Doggo">
    <!-- Etymology -->
    <ety>
      <!-- Usage (typically determined by part-of-speech) -->
      <usage pos="n">
        <!-- Definition -->
        <definition>A cute pupper</definition>
        <!-- Definition Group -->
        <group description="Slang for dog">
          <definition>Common way of saying a dog is a cutie</definition>
        </group>
      </usage>
    </ety>
  </entry>
</dictionary>
```

Pretty easy to read, right?

Now let's break this down.

---

## `<dictionary>`

A dictionary node occurs at the root of every source file, and a file will not compile without one. ODict looks for this node by default when compiling.

### Attributes

| Name | Description | Required? |
| ---- | ------------------------------------- | --------- |
| name | A descriptive name for the dictionary | :x: |

### Children

- [`<entry>`](#entry)

---

## `<entry>`

Entries are the primary entry point to the dictionary and represent **unique terms**. They are used as lookup keys internally by ODict, so it is important there are no duplicate entries.
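
Because duplicates matter, it can be worth checking a source file before compiling it. The following is a minimal sketch using Python's standard library, not an ODict feature: the function name `duplicate_terms` is hypothetical, and comparing terms exactly (case-sensitively) is an assumption about how ODict treats keys.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_terms(odxml: str):
    """Return the sorted `term` values that appear on more than one <entry>."""
    root = ET.fromstring(odxml)
    counts = Counter(
        entry.get("term")
        for entry in root.iter("entry")
        if entry.get("term") is not None  # skip entries missing the attribute
    )
    return sorted(term for term, n in counts.items() if n > 1)
```

An empty list means every entry term in the file is unique under that comparison.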

### Attributes

| Name | Description | Required? |
| ---- | ----------------------------- | ------------------ |
| term | The word the entry represents | :white_check_mark: |

### Children

- [`<ety>`](#ety)