Feature: Export as JSON #1031

Open
vtempest opened this issue Jul 30, 2024 · 7 comments
Labels
enhancement New feature or request
Milestone

Comments

@vtempest

vtempest commented Jul 30, 2024

25MB JSON: https://raw.githubusercontent.com/vtempest/wiki-phrase-tokenizer/master/data/dictionary-152k.json

Example


 "ability": {
    "cat": 7,
    "defs": [
      "the quality of being able to perform; a quality that permits or facilitates achievement or accomplishment",
      "possession of the qualities (especially mental qualities) required to do something or get something done Example: danger heightened his powers of discrimination"
    ],
    "pos": "n",
    "syns": "power"
  },

Script to download, decompress, parse and process into JSON (300 lines)
https://github.com/vtempest/wiki-phrase-tokenizer/blob/master/src/dataset-import/dictionary-import.js

Original PR: #1029

The exact format is up to you. The importer is highly customizable, and we can include synonyms and whatever else is needed. JSON is the best fit because it is what web apps and JavaScript need, and those are the majority of use cases.
In other words, we can create a lossless 120 MB JSON for use cases that need everything, and allow compression and attribute selection to produce a smaller JSON, which is vital for web apps. There seems to be no reason not to support JSON, given that it is what is needed to make the data useful in AI search and apps.
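
The attribute selection mentioned above could look roughly like the sketch below. It assumes the dictionary is an object keyed by term, with entries shaped like the "ability" example; `pickAttributes` is a hypothetical helper, not part of the linked importer.

```javascript
// Hypothetical helper: keep only the listed attributes of every entry,
// producing a smaller object to serialize for a web app.
function pickAttributes(dictionary, keep) {
  const slim = {};
  for (const [term, entry] of Object.entries(dictionary)) {
    slim[term] = Object.fromEntries(
      Object.entries(entry).filter(([key]) => keep.includes(key))
    );
  }
  return slim;
}

// Tiny sample in the shape of the "ability" entry shown earlier.
const sample = {
  ability: {
    cat: 7,
    defs: ["the quality of being able to perform"],
    pos: "n",
    syns: "power",
  },
};

// Keep definitions and part of speech only; the input is left untouched.
const slimSample = pickAttributes(sample, ["defs", "pos"]);
console.log(JSON.stringify(slimSample));
```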

Another advantage of a JSON prefix trie is that a lookup takes time proportional to the length of the word, regardless of dictionary size, instead of having to loop through the index each time. There is "unanimous consensus" that this feature alone makes it better than any other data type for storing dict data. Source: https://johnresig.com/blog/javascript-trie-performance-analysis/
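
For readers unfamiliar with the structure, here is a minimal sketch of a prefix trie held in nested plain objects, which serializes directly to JSON. The `$` end-of-word marker and the function names are assumptions for illustration; this is not the format used in the linked article or in the importer. A lookup visits one node per character, so its cost depends on the word's length rather than on how many words are stored.

```javascript
const hasOwn = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Insert a word into a trie made of nested plain objects.
// "$" is an illustrative end-of-word marker holding the entry's value;
// this sketch assumes words never contain the "$" character.
function trieInsert(root, word, value) {
  let node = root;
  for (const ch of word) {
    if (!hasOwn(node, ch)) node[ch] = {};
    node = node[ch];
  }
  node.$ = value;
}

// Walk one node per character; return the stored value or undefined.
function trieLookup(root, word) {
  let node = root;
  for (const ch of word) {
    if (!hasOwn(node, ch)) return undefined;
    node = node[ch];
  }
  return hasOwn(node, "$") ? node.$ : undefined;
}

const trieRoot = {};
trieInsert(trieRoot, "able", { pos: "a" });
trieInsert(trieRoot, "ability", { pos: "n" });

// The trie survives a JSON round trip because it is plain objects only.
const restored = JSON.parse(JSON.stringify(trieRoot));
console.log(trieLookup(restored, "ability"));
console.log(trieLookup(restored, "abil")); // a prefix only, not a stored word
```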

@vtempest vtempest added the enhancement New feature or request label Jul 30, 2024
@jmccrae
Member

jmccrae commented Aug 20, 2024

We already use a YAML format internally that is easily convertible to JSON, and the data is available as JSON on the web interface:

https://en-word.net/json/lemma/autodidact

Perhaps you could be more specific about what you want that is not already delivered?
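
For reference, a minimal sketch of reading that endpoint from JavaScript. Only the URL pattern comes from the link above; the helper names are hypothetical, and since the shape of the returned JSON is not described in this thread, the sketch hands back the parsed body without assuming any fields.

```javascript
// Build the lemma URL following the pattern of the link above.
// Percent-encoding the lemma is an assumption about how the site
// expects multi-word lemmas to be passed.
function lemmaUrl(lemma) {
  return "https://en-word.net/json/lemma/" + encodeURIComponent(lemma);
}

// Fetch the JSON for one lemma. Requires a runtime with a global fetch
// (browsers, Node 18+). The response structure is not assumed here.
async function fetchLemma(lemma) {
  const response = await fetch(lemmaUrl(lemma));
  if (!response.ok) {
    throw new Error("Request failed with status " + response.status);
  }
  return response.json();
}

// Usage (performs a network request):
// fetchLemma("autodidact").then((data) => console.log(data));
```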

@vtempest
Author

vtempest commented Aug 21, 2024

It is not that easy to convert: it needs various XML parsers and deep knowledge of the schema, such as the oewn @tags. My 300-line file is the importer needed to get a JSON file that is usable within JS web apps. The other formats are not directly importable into JavaScript and Python, which are the dominant languages for all these NLP apps.

Example of schema:

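  // Each LexicalEntry becomes { writtenForm, default_pos, senses }, where
  // senses holds the numeric part of every synset id the entry points to.
  const processedLexicalEntry = LexicalEntry.map((lex) => ({
    writtenForm: lex.Lemma["@_writtenForm"],
    default_pos: lex.Lemma["@_partOfSpeech"],
    // Sense is an array for several senses and a bare object for one.
    senses: Array.isArray(lex.Sense)
      ? lex.Sense.map((lex_s) =>
          parseInt(lex_s["@_synset"].replace("oewn-", ""), 10)
        )
      : [parseInt(lex.Sense["@_synset"].replace("oewn-", ""), 10)],
  }));

  // Key the dictionary by lowercased written form.
  processedLexicalEntry.forEach((lex) => {
    dictionaryObj[lex.writtenForm.toLowerCase()] = {
      defs: lex.senses,
      pos: lex.default_pos,
    };
  });

  // Flatten each Synset: numeric id, definition, example, readable synonyms.
  const processedSynset = Synset.map((s) => ({
    id: parseInt(s["@_id"]?.replace(/oewn-/g, ""), 10),
    def: s.Definition?.Definition,
    example: s.Example?.Example,
    // Guard the attribute itself, so a missing list yields undefined.
    synonyms: s["@_members"]
      ?.replace(/oewn-/g, "")
      .split(" ")
      .map((syn) => syn.replace(/-.$/g, "").replace(/_/g, " "))
      .join(", "),
    pos: s["@_partOfSpeech"],
    cat: categories.indexOf(s["@_lexfile"]),
  }));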

Does that look simple and intuitive? No. It took days of work to grok the schema, simplify it for JSON, and get it working without errors. Everyone else has to replicate these steps.
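
To make the string handling in the snippet above concrete, here is a small worked sketch of its two transformations in isolation. The function names are hypothetical and the sample values are illustrative, written in the style the snippet appears to expect; they were not checked against a release file.

```javascript
// Strip the "oewn-" prefix and keep the leading digits as a number.
// parseInt stops at the first non-digit, so a trailing "-n" is ignored.
function synsetNumber(id) {
  return parseInt(id.replace("oewn-", ""), 10);
}

// Turn a space-separated members attribute into a readable synonym list:
// drop the "oewn-" prefix, drop the trailing "-<one character>" suffix,
// and replace underscores with spaces.
function memberList(members) {
  return members
    .replace(/oewn-/g, "")
    .split(" ")
    .map((m) => m.replace(/-.$/, "").replace(/_/g, " "))
    .join(", ");
}

console.log(synsetNumber("oewn-05207437-n"));           // 5207437
console.log(memberList("oewn-ability-n oewn-power-n")); // "ability, power"
console.log(memberList("oewn-a_cappella-a"));           // "a cappella"
```

Note that the numeric id drops both the leading zero and the part-of-speech suffix, so ids that differ only in those parts would no longer be distinguishable.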

@jmccrae
Member

jmccrae commented Aug 23, 2024

I think the reason that we provide a JavaScript interface at https://en-word.net/ is precisely to allow these kinds of use cases.

For example, this JS fiddle is a simple (if not great) app that looks up the definition of a word using our API

https://jsfiddle.net/vkjd0x9L/

I can assure you that the YAML version of the source is very easy to work with in Python, and there are libraries such as @goodmami's wn for working with the XML releases.

@vtempest
Author

vtempest commented Aug 23, 2024

Right, but some people need a single JSON data file they can modify and reuse in an app. Calling the remote API is not efficient, and the only alternative is to scrape everything, which is just a waste of bandwidth when the JSON can be provided.

There is no documentation of the schema of the YAML oewn files, so it is not "very easy to work with", and I spent days wading through it to understand the synset @id's etc. before I could get a workable data type to be used in a JS web app. We should provide this, since otherwise everyone has to do these conversions.

Another reason YAML is not good is that it is split across files; it should be all in one. Neither YAML nor XML is directly usable; both have to be converted. At least provide datatypes, e.g. in TypeScript or JSDoc, for each TermEntry.
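
As a starting point for the requested datatypes, here is a sketch of a JSDoc typedef inferred only from the "ability" example at the top of this issue. The name `TermEntry` comes from the comment above; the field descriptions are guesses from that single example, and `isTermEntry` is a hypothetical helper.

```javascript
/**
 * One dictionary entry, inferred from the "ability" example in this issue.
 * @typedef {Object} TermEntry
 * @property {number} cat - Category index (7 in the example).
 * @property {string[]} defs - Definition strings; some carry an "Example:" part.
 * @property {string} pos - Part-of-speech code, e.g. "n".
 * @property {string} syns - Synonyms as a single string.
 */

/**
 * Runtime check that a value has the TermEntry shape sketched above.
 * @param {unknown} value
 * @returns {boolean}
 */
function isTermEntry(value) {
  return (
    typeof value === "object" &&
    value !== null &&
    Number.isInteger(value.cat) &&
    Array.isArray(value.defs) &&
    value.defs.every((d) => typeof d === "string") &&
    typeof value.pos === "string" &&
    typeof value.syns === "string"
  );
}

console.log(
  isTermEntry({
    cat: 7,
    defs: ["the quality of being able to perform"],
    pos: "n",
    syns: "power",
  })
);
```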

@jmccrae
Member

jmccrae commented Aug 26, 2024

Okay, so I see two requests here:

  • Release JSON data as a complete database (alongside existing XML, RDF and WNDB)
  • Document the data structure used in our YAML (possibly with TypeScript)

@vtempest
Author

vtempest commented Aug 29, 2024

https://airesearch.wiki/functions/src_dataset_import_dictionary_import.importDictionary.html
It is done in typedoc. Better typedef

Please help if you can. My code shows 151k terms, but it is supposed to be 160k. Where am I going wrong in reading the schema?

@jmccrae
Member

jmccrae commented Aug 29, 2024

Your link does not seem to work.

I am not sure why there would be a discrepancy in the number of terms.
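
One possible cause, offered as a guess rather than a diagnosis: the snippet posted earlier stores each entry under `writtenForm.toLowerCase()`, so entries that share a lowercased written form (the same lemma with two parts of speech, or forms differing only in capitalization) overwrite each other, and the key count ends up below the entry count. The sample data below is made up to show the effect; whether it accounts for the gap reported here has not been checked.

```javascript
// Made-up entries: two share a lemma across parts of speech,
// two differ only in capitalization.
const entries = [
  { writtenForm: "run", pos: "n" },
  { writtenForm: "run", pos: "v" },
  { writtenForm: "Polish", pos: "a" },
  { writtenForm: "polish", pos: "v" },
  { writtenForm: "ability", pos: "n" },
];

// Same keying as the posted snippet: later entries replace earlier ones.
const byLowercase = {};
for (const e of entries) {
  byLowercase[e.writtenForm.toLowerCase()] = { pos: e.pos };
}

// Alternative: collect every entry that maps to the same key.
const grouped = new Map();
for (const e of entries) {
  const key = e.writtenForm.toLowerCase();
  if (!grouped.has(key)) grouped.set(key, []);
  grouped.get(key).push({ writtenForm: e.writtenForm, pos: e.pos });
}

console.log(entries.length);                  // 5
console.log(Object.keys(byLowercase).length); // 3
console.log(grouped.get("run").length);       // 2
```

Comparing the number of source entries with the number of keys in the output is a quick way to see how many entries were merged this way.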

@jmccrae jmccrae added this to the 2025 Release milestone Nov 29, 2024
2 participants