Feature: Export as JSON #1031

vtempest · 2024-07-30T19:24:42Z

25MB JSON: https://raw.githubusercontent.com/vtempest/wiki-phrase-tokenizer/master/data/dictionary-152k.json

Example:

  "ability": {
    "cat": 7,
    "defs": [
      "the quality of being able to perform; a quality that permits or facilitates achievement or accomplishment",
      "possession of the qualities (especially mental qualities) required to do something or get something done Example: danger heightened his powers of discrimination"
    ],
    "pos": "n",
    "syns": "power"
  },
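
For illustration, once the file is parsed, an entry shaped like the one above can be read with a plain property lookup. This is only a minimal sketch: the `lookupTerm` helper and the inline sample are hypothetical, and lowercasing the key is an assumption based on the importer excerpt further down this thread.

```javascript
// Sample with the same shape as the "ability" entry (definition shortened).
const dictionary = {
  ability: {
    cat: 7,
    defs: ["the quality of being able to perform"],
    pos: "n",
    syns: "power",
  },
};

// Hypothetical helper: returns the entry for a word, or null if it is absent.
// Assumes the keys of the dictionary object are stored lowercased.
function lookupTerm(dict, word) {
  const key = word.trim().toLowerCase();
  // Own-property check so inherited names like "constructor" never match.
  return Object.prototype.hasOwnProperty.call(dict, key) ? dict[key] : null;
}

console.log(lookupTerm(dictionary, "Ability").pos); // "n"
console.log(lookupTerm(dictionary, "missing")); // null
```

In a real app the `dictionary` object would come from `JSON.parse` on the downloaded file rather than from an inline literal.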

Script to download, decompress, parse and process into JSON (300 lines)
https://github.com/vtempest/wiki-phrase-tokenizer/blob/master/src/dataset-import/dictionary-import.js

Original PR: #1029

The exact format is up to you. The importer is highly customizable, and we can include all synonyms and anything else that is needed. JSON is the best fit because it is what web apps and JavaScript need, and those are the majority of use cases.
In other words, we can create a 120 MB JSON file that is lossless for a certain type of use case, and allow compression and selection of specific attributes to create a smaller JSON file, which is vital for web apps. I see no reason not to support JSON, given that it is what is needed to make the data useful in AI search and apps.

Another advantage of a JSON prefix trie is that a lookup takes time proportional to the length of the word, independent of the number of entries, instead of having to loop through the index each time. In my view this feature alone makes it a strong choice for storing dictionary data. Background: https://johnresig.com/blog/javascript-trie-performance-analysis/
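
To make the trie idea concrete, here is a minimal sketch of a prefix trie built from nested plain objects, so that it can be serialized as JSON. The function names and the `$` end-of-word marker are assumptions made for this example (they are not taken from the importer), and the sketch assumes no word contains a `$` character.

```javascript
// Build a prefix trie as nested plain objects so it survives JSON.stringify.
// The "$" key marks the end of a word; it is an arbitrary choice for this sketch.
function buildTrie(words) {
  const root = {};
  for (const word of words) {
    let node = root;
    for (const ch of word) {
      if (!Object.prototype.hasOwnProperty.call(node, ch)) node[ch] = {};
      node = node[ch];
    }
    node.$ = true;
  }
  return root;
}

// Walk one node per character: the cost depends on the word length,
// not on how many words the trie holds.
function trieHas(root, word) {
  let node = root;
  for (const ch of word) {
    node = node[ch];
    if (node === undefined) return false;
  }
  return node.$ === true;
}

const trie = buildTrie(["able", "ability", "abacus"]);
console.log(trieHas(trie, "ability")); // true
console.log(trieHas(trie, "abil")); // false, only a prefix
```

A plain object keyed by the whole word, like the example entry above, already gives hash-based lookups. The trie mainly adds prefix queries and shared storage for common prefixes.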


jmccrae · 2024-08-20T06:20:35Z

We already use a YAML format internally that is easily convertible to JSON, and the data is available as JSON on the web interface:

https://en-word.net/json/lemma/autodidact

Perhaps you could be more specific about what you want that is not already delivered?

vtempest · 2024-08-21T23:13:29Z

It is not that easy to convert: it needs various XML parsers and deep knowledge of the schema, such as the oewn @tags. My 300-line file is the importer needed to get a JSON file that is usable within JS web apps. The other formats are not directly importable into JavaScript and Python, which are the dominant languages for these NLP apps.

Example of schema:

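  // Excerpt: LexicalEntry, Synset, dictionaryObj and categories are
  // defined elsewhere in the importer.

  // One record per lexical entry, with synset ids reduced to integers.
  const processedLexicalEntry = LexicalEntry.map((lex) => ({
    writtenForm: lex.Lemma["@_writtenForm"],
    default_pos: lex.Lemma["@_partOfSpeech"],
    // Sense is an array for several senses and a single object for one.
    senses: Array.isArray(lex.Sense)
      ? lex.Sense.map((lex_s) =>
          parseInt(lex_s["@_synset"].replace("oewn-", ""), 10)
        )
      : [parseInt(lex.Sense["@_synset"].replace("oewn-", ""), 10)],
  }));

  // Key the dictionary by the lowercased written form.
  processedLexicalEntry.forEach((lex) => {
    dictionaryObj[lex.writtenForm.toLowerCase()] = {
      defs: lex.senses,
      pos: lex.default_pos,
    };
  });

  const processedSynset = Synset.map((s) => ({
    id: parseInt(s["@_id"]?.replace(/oewn-/g, ""), 10),
    def: s.Definition?.Definition,
    example: s.Example?.Example,
    // "oewn-some_word-n" becomes "some word": drop the prefix and the
    // trailing part-of-speech suffix, then turn underscores into spaces.
    synonyms: s["@_members"]
      ?.replace(/oewn-/g, "")
      .split(" ")
      .map((syn) => syn.replace(/-.$/g, "").replace(/_/g, " "))
      .join(", "),
    pos: s["@_partOfSpeech"],
    cat: categories.indexOf(s["@_lexfile"]),
  }));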

Does that look simple and intuitive? No. It took days of work to grok the schema, simplify it for JSON, and get it working without errors. Everyone else has to replicate these steps.
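
The excerpt stops with `defs` still holding numeric synset ids, while the published JSON holds definition strings. One way the remaining join could look is sketched below. The `joinEntriesWithSynsets` name, the sample records, and the ids 1 and 2 are all made up for illustration; the linked importer may do this differently.

```javascript
// Hypothetical inputs, shaped like the results of the two map() calls above.
// The ids 1 and 2 are placeholders, not real synset ids.
const sampleEntries = [
  { writtenForm: "Ability", default_pos: "n", senses: [1, 2] },
];
const sampleSynsets = [
  { id: 1, def: "the quality of being able to perform", pos: "n", cat: 7 },
  { id: 2, def: "possession of the qualities required to do something", pos: "n", cat: 7 },
];

// Replace each numeric synset id with its definition text.
function joinEntriesWithSynsets(entries, synsets) {
  // Index synsets by id once so each sense resolves without a scan.
  const byId = new Map(synsets.map((s) => [s.id, s]));
  const out = {};
  for (const lex of entries) {
    out[lex.writtenForm.toLowerCase()] = {
      pos: lex.default_pos,
      // Ids with no matching synset are dropped rather than kept as numbers.
      defs: lex.senses
        .map((id) => byId.get(id)?.def)
        .filter((def) => def !== undefined),
    };
  }
  return out;
}

const joined = joinEntriesWithSynsets(sampleEntries, sampleSynsets);
console.log(joined.ability.defs.length); // 2
```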

jmccrae · 2024-08-23T01:35:42Z

I think this is precisely the reason we provide a JavaScript interface at https://en-word.net/: to allow these kinds of use cases.

For example, this JS fiddle is a simple (if not great) app that looks up the definition of a word using our API:

https://jsfiddle.net/vkjd0x9L/

I can assure you that the YAML version of the source is very easy to work with in Python, and there are libraries such as @goodmami's wn for working with the XML releases.

vtempest · 2024-08-23T20:33:45Z

Right, but some people need a single JSON data file they can modify and reuse in an app. Calling the remote API for every lookup is not efficient, and the only alternative is to scrape everything, which is just a waste of bandwidth when the JSON can be provided.

There is no documentation of the schema of the YAML oewn files, so it is not "very easy to work with", and I spent days wading through it to understand the synset @id's etc. before I could get a workable data type to use in a JS web app. We should provide this, since otherwise everyone has to do these conversions.

Another reason YAML is not good is that it is split up; it should be all in one file. Neither YAML nor XML is directly usable; both have to be converted. At least provide data types, e.g. in TypeScript or JSDoc, for each TermEntry.
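
As a rough idea of what such a type could look like, here is a JSDoc sketch inferred only from the "ability" example and the importer excerpt in this thread. The field descriptions and the `isTermEntry` check are assumptions for illustration, not an official schema.

```javascript
/**
 * Hypothetical shape of one dictionary entry, inferred from the example JSON.
 * @typedef {Object} TermEntry
 * @property {number} cat - Index of the synset's lexfile in the importer's category list.
 * @property {string[]} defs - Definition strings, one per sense.
 * @property {string} pos - Part-of-speech code, e.g. "n".
 * @property {string} syns - Synonyms as one comma-separated string.
 */

/**
 * Runtime check that a parsed value matches the TermEntry sketch.
 * @param {unknown} value
 * @returns {boolean}
 */
function isTermEntry(value) {
  if (typeof value !== "object" || value === null) return false;
  const entry = /** @type {Record<string, unknown>} */ (value);
  return (
    Number.isInteger(entry.cat) &&
    Array.isArray(entry.defs) &&
    entry.defs.every((d) => typeof d === "string") &&
    typeof entry.pos === "string" &&
    typeof entry.syns === "string"
  );
}

console.log(
  isTermEntry({ cat: 7, defs: ["the quality of being able to perform"], pos: "n", syns: "power" })
); // true
```

The typedef gives editor autocompletion in plain JavaScript, and the same four fields would translate directly into a TypeScript interface.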

jmccrae · 2024-08-26T08:23:38Z

Okay, so I see two requests here:

1. Release JSON data as a complete database (alongside existing XML, RDF and WNDB)
2. Document the data structure used in our YAML (possibly with TypeScript)

vtempest · 2024-08-29T04:13:28Z

https://airesearch.wiki/functions/src_dataset_import_dictionary_import.importDictionary.html
It is done in typedoc. Better typedef

Please help if you can: my code shows 151k terms but it is supposed to be 160k. Where am I going wrong in reading the schema?

jmccrae · 2024-08-29T08:41:13Z

Your link does not seem to work.

I am not sure why there would be a discrepancy in the number of terms.
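
One thing that could be checked, going only by the importer excerpt earlier in this thread: entries are stored under `writtenForm.toLowerCase()`, so two lexical entries whose written forms differ only by case, or that share a form but differ in part of speech, would overwrite each other. Whether this explains the gap between 151k and 160k is not confirmed. The `countKeyCollisions` helper and the sample forms below are hypothetical.

```javascript
// Count how many entries would be overwritten when keyed by lowercased form.
function countKeyCollisions(writtenForms) {
  const perKey = new Map(); // lowercased key -> number of entries using it
  for (const form of writtenForms) {
    const key = form.toLowerCase();
    perKey.set(key, (perKey.get(key) ?? 0) + 1);
  }
  let lost = 0;
  for (const count of perKey.values()) lost += count - 1;
  return { total: writtenForms.length, uniqueKeys: perKey.size, lost };
}

// "May"/"may" collide by case; the two "run" entries stand for a noun and a verb.
console.log(countKeyCollisions(["May", "may", "run", "run", "ability"]));
// { total: 5, uniqueKeys: 3, lost: 2 }
```

Running the same count over the real `writtenForm` values would show how many of the missing terms, if any, come from key collisions.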

vtempest added the enhancement New feature or request label Jul 30, 2024

jmccrae added this to the 2025 Release milestone Nov 29, 2024

jmccrae mentioned this issue Nov 29, 2024

[FR] JSON API #1110

Open
