r0ller edited this page Aug 30, 2021 · 3 revisions

14.02.2020:

During the machine learning integration I managed to introduce a bug in the interpreter that made it take only one morphological analysis of a constant into account; yesterday's commit fixed it. Besides that, I changed the db schema a bit again. This should not interfere with any existing db, as I only replaced the partial db indices used to enforce integrity via foreign keys, which affects only newly created dbs. I decided to do so as I realized that partial indices do not really fulfill my expectations, so I added some key-only tables that are filled by triggers and thus don't need to be filled manually.
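The trigger-filled key table idea can be sketched as follows. This is a minimal illustration using Python's sqlite3, assuming hypothetical table and column names; the actual hi_db.sql schema differs.

```python
import sqlite3

# Illustrative sketch: a key-only helper table (gcat_keys) is kept in sync
# by a trigger, so other tables can reference (lid, gcat) with a plain
# foreign key instead of relying on a partial index. Names are invented.
db = sqlite3.connect(":memory:")
db.executescript("""
PRAGMA foreign_keys = ON;
CREATE TABLE gcat(lid TEXT, gcat TEXT, feature TEXT,
                  PRIMARY KEY(lid, gcat, feature));
CREATE TABLE gcat_keys(lid TEXT, gcat TEXT, PRIMARY KEY(lid, gcat));
-- Fill the key-only table automatically, so it never needs manual inserts:
CREATE TRIGGER gcat_ai AFTER INSERT ON gcat BEGIN
  INSERT OR IGNORE INTO gcat_keys(lid, gcat) VALUES(NEW.lid, NEW.gcat);
END;
""")
db.execute("INSERT INTO gcat VALUES('ENG','Verb','Stem')")
print(db.execute("SELECT * FROM gcat_keys").fetchall())  # [('ENG', 'Verb')]
```

The point is only the mechanism: every insert into the main table transparently maintains the key table that foreign keys can target.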

Concerning the Android app, I recently discovered that Google offline speech recognition no longer works (at least on certain phones, see: https://support.google.com/assistant/thread/2438314?hl=en ) and simply returns a network error even if there is network access, which offline speech recognition would not even need. So I had to turn it off for the time being. That version is not yet out in the Play Store, as there are some minor changes I'd like to carry out (e.g. turning on the C++14 build) before pushing it out there. An important thing to mention is that I'll drop 32-bit support on Android with that change. It'll still be possible to build a 32-bit ARM library if one wants, but I won't bundle it in the apk myself.

01.08.2019:

I spent the last months improving the performance of prep_abl, which is used to preprocess the text corpus for the machine learning tool (ABL). As preprocessing a big text corpus required too much memory, I rewrote prep_abl to write files instead. The calculation of token paths for all possible combinations of word analyses returned by the morphological analyser (foma) was originally designed for analysis (interpretation) only, not for generation. So calculating every possible token path for a big text corpus in the ml preprocessing phase was slow and used too much memory as well. Therefore, I started to redesign it so that only valid token paths get generated, thus spending no time on invalid ones and requiring no memory for them. Thanks to NG's idea of calculating a path for a unique path number in an array representation of a Cartesian product (see the int2indices() function in NG's solution or its reimplementation as tokenpaths::path_nr_to_indices in tokenpaths.cpp in this project), the performance has improved dramatically. In addition, prep_abl got multithreading support as well to make it even faster. I haven't had time to deal with anything else (like updating the Android or js clients), and testing is still in progress: till now mainly functional/performance tests have been carried out to make sure that both the machine learning scenario and the interpreter work as well as they did earlier. I'll now prepare a bigger text corpus to see how the ml processes cope with that.
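The path-number-to-indices idea can be sketched in a few lines. This is a simplified Python reimplementation of the concept behind tokenpaths::path_nr_to_indices, not the actual C++ code; the function name and signature here are illustrative.

```python
# Sketch: map a single path number onto one analysis index per word,
# where sizes[i] is the number of morphological analyses of word i.
# This is plain mixed-radix decomposition of the Cartesian product.
def path_nr_to_indices(path_nr, sizes):
    indices = []
    for size in reversed(sizes):
        indices.append(path_nr % size)  # index for the current word
        path_nr //= size                # shift to the next word
    indices.reverse()
    return indices

# Two words with 2 analyses each give 4 paths, numbered 0..3:
print(path_nr_to_indices(3, [2, 2]))  # [1, 1]
```

Because any path is computable directly from its number, worker threads can pick path numbers independently without materializing the whole product in memory, which is what makes the approach attractive for big corpora.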

15.03.2019:

With the last few commits an initial machine learning integration has landed in the project. The machine learning framework and the ways you can use it are described in the Unsupervised machine learning aided language modelling section, and the test tools have been extended to test the generated grammar as well, which is described in the corresponding Tests (NLTK based) section. As mentioned, it is for generating the grammar for a given corpus, but having a foma fst for the language of the corpus is a prerequisite. It also won't give you the semantics, so there are still some modelling tasks that you need to do yourself. Theoretically, using a machine learning tool for morphology like Morfessor would be an option, but you'd still need to find a way to create a foma fst out of that (or extend hilib to handle other morphological analyzers besides foma). Concerning semantics, the dependency graphs of certain ml/ai platforms may prove useful, at least for setting up the DEPOLEX content (DEPendencies-Of-LEXemes, exactly as its name suggests), but the rule-to-rule map still has to be maintained to connect the grammar to the semantics, as I doubt that there's any way to generate it. Anyway, till the project gets far enough to generate everything out of the box, you can use the new ml_tools make target to build the tools you need for machine learning. Of course, you'll first need to build the Alignment Based Learning framework itself.

19.12.2018:

Managed to get rid of numbering the tokens manually in the GCAT db table, which has its advantages and disadvantages. Fortunately, the disadvantages are only short term, as there are some adjustments one needs to make to language models already existing in a db. First, if you have NULLs in any of the key fields of the gcat db table, you must put a value there. It was never a good idea to allow NULL in a key, but sqlite allows it and so did I; that's now hardened. If you had NULL in the feature field, you need to put the value 'Stem' there, as NULL was anyway interpreted as a lazy fallback for 'Stem'. If your model handles constants or unknown words (or, in hilib terms, concealed words), you'll need to make sure that you have an entry in gcat with the reserved 'CON' gcat and 'Stem' feature with a token number greater than zero for the modelled language, in order to get a token generated for it in bison. Till now, you could have only one entry for constants/concealed words for all the languages in the db, but now it has become language specific. So instead of the language-independent t_Con token in bison, you'll have language-specific ones like t_ENG_CON_Stem for English. This is another thing you have to adjust in your grammar, i.e. replace t_Con with a language-specific symbol. The good news is that you don't need to number your terminal tokens manually any more. You don't even need to change anything in the token column of the gcat db table, as the parser generator (gensrc) interprets that field in a way that it generates a token in the bison source for values greater than zero, which was the case for the token numbers anyway, except for the entry with the reserved gcat 'CON', as that had to be zero. (That's why I mentioned that if you have any such entry, you need to set a token number greater than zero.)
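The token-generation rule described above can be sketched like this. This is a hypothetical reconstruction of the idea, not gensrc's actual code; the column layout and token naming scheme are assumptions based on the t_ENG_CON_Stem example.

```python
# Sketch: emit a bison %token declaration for every gcat row whose token
# number is greater than zero; rows with token <= 0 get no terminal.
# Row layout (lid, gcat, feature, token) is illustrative.
def bison_tokens(gcat_rows):
    tokens = []
    for lid, gcat, feature, token in gcat_rows:
        if token > 0:
            tokens.append(f"%token t_{lid}_{gcat}_{feature}")
    return tokens

rows = [("ENG", "CON", "Stem", 1),   # constants: must now be > 0
        ("ENG", "Verb", "Stem", 0)]  # 0 means: no terminal generated
print(bison_tokens(rows))  # ['%token t_ENG_CON_Stem']
```

This also shows why an old 'CON' entry with token number zero would silently lose its terminal under the new scheme.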

13.12.2018:

Finally added a makefile to the project which aims to be cross-platform. I have not yet tested it on Linux, only on my own NetBSD, but both with (bsd)make and gmake (i.e. GNU make). First I tried to get the job done by writing a cmake script, which worked, but I just didn't like it, so I threw it away and wrote a POSIX makefile. As indicated in the 'How to build' section, for the time being I keep the build scripts I used till now, just to give the makefile some time to prove that it works fine. I built a help target for the makefile, so until I write the documentation for it here, you can only rely on that; simply invoke it by typing 'make help'. Another small step forward is switching the C bison parser over to C++, which turned out to be simpler than I thought. As a kind of heads-up: I'll now try to get rid of numbering the tokens, which means that in case of success, the hi_db.sql db schema will need to be changed, as the corresponding column will be removed from the GCAT db table.

30.10.2018:

Although I didn't want to add more features till the first release is out, I could not avoid doing so. Mainly, I wanted to make sure that the logical operators (not, and, or) can be modeled not only in English but in Hungarian as well, which turned out to be a bit more difficult than in English:) So I had to add support for bison operator precedence and context-dependent precedence. This means that the GCAT table got extended with two fields (precedence and precedence_level) due to the operator precedence support, which makes previous model dbs incompatible, so you'll have to adjust your content sql files and rebuild your db. The technical documentation does not yet reflect these changes, so please refer to the example contents and the db schema file (hi_db.sql). The GRAMMAR table had to be extended as well with a new field (called precedence) to support context-dependent precedence. Last but not least, the RULE_TO_RULE_MAP table also got extended with two fields (main_set_op and dependent_set_op) which make it possible to carry out set operations on the set of symbols collected for the main or the dependent nodes. Note that set operations are applied on two sets of symbols collected for the same node, so don't expect that you can e.g. merge (take the union of) the symbols collected for the main node and those of the dependent node. This leads to simpler syntactic models, as in many cases a syntax rule had to be introduced just to declare a new symbol which you could use as a restriction during the semantic validation.
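The per-node set operations can be pictured with a toy helper. This is only an illustration of the idea, assuming hypothetical operation names; the actual values accepted by main_set_op/dependent_set_op and their semantics live in the library and may differ.

```python
# Sketch: combine two symbol sets collected for the SAME node (either the
# main node or the dependent node, never across the two). Operation names
# are invented for illustration.
def apply_set_op(op, previous, current):
    if op == "union":
        return previous | current
    if op == "intersection":
        return previous & current
    if op == "difference":
        return previous - current
    return current  # no operation configured: keep the latest set

# Two symbol sets collected for the same node at different rules:
print(apply_set_op("intersection", {"Noun", "Pl"}, {"Noun", "Acc"}))  # {'Noun'}
```

The benefit claimed above follows directly: instead of inventing an extra syntax rule just to declare a restriction symbol, the restriction can be expressed as a set operation on symbols that were collected anyway.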

A small improvement worth mentioning is that I introduced the END symbol for $end (end of file) so that you can use it in your grammar. If you want to use it, just add an entry to the SYMBOL table with the value "END" as symbol and the necessary language id (lid).

I also introduced some tests (based on NLTK) to be able to check the language models, of which you can read more at the Tests section.

Another new piece of functionality that makes functor development easier is that the gensrc tool can now copy functor definitions from files into the db. You may have already noticed the functors subdirectories in the platform-specific directories where the shell script and javascript files reside. The content of a functor implementation file gets copied into the FUNCTOR_DEFS definition field if a file can be found with the name you put in the definition field (without quotes) within the directory you specify as the fourth parameter when invoking gensrc. Using quotes means the definition field is left untouched. As a corollary, you'll need to invoke gensrc each time you rebuild your db.
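The quoted-vs-unquoted rule can be sketched as follows. This is a simplified model of the behaviour described above, not gensrc's actual implementation; the function name and file names are made up.

```python
import os
import tempfile

# Sketch: a quoted definition field is left untouched; an unquoted one is
# treated as a file name in the functor directory (gensrc's fourth
# parameter) and replaced by that file's content.
def resolve_definition(definition, functor_dir):
    if definition.startswith('"') and definition.endswith('"'):
        return definition  # quoted: keep the field as-is
    with open(os.path.join(functor_dir, definition)) as f:
        return f.read()    # unquoted: copy the file content

with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "my_functor.sh"), "w") as f:
        f.write("echo hello")
    print(resolve_definition("my_functor.sh", d))  # echo hello
    print(resolve_definition('"echo inline"', d))  # "echo inline"
```

The corollary in the text follows from this: since the copy happens at gensrc time, rebuilding the db from the sql files loses the copied content until gensrc runs again.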

06.07.2018:

What got pushed yesterday is the result of squeezing out of the framework what I have wanted to achieve in the last few years: introducing logical operators (not, and, or). It took a long time to fix all the bugs that got in the way of achieving this, and I must admit that it's still not perfect, but at least it's possible now. This time only the desktop client got updated to make use of this feature, but as the library now supports it, you can build any other client to make use of it as well. The tricky thing was to preserve the logical order of the dependencies involved in a logical expression. In a sentence like:

list symlinked or executable directories in directory abc that are not empty

this is not at all trivial. Don't ponder much about the adjectives used here; I just wanted something to play with, and the file flags came in handy. The functor implementation is done in POSIX shell script as usual, but as I'm not a shell script guru, I'm pretty sure that there are many bugs in the functors besides the ones I know of, for which I'll create some issues. The list of words that can be used in the desktop client finally got updated below. The biggest challenge currently when building a model that accepts logical operators while preserving the logical order of dependencies is that the framework can only bind a dependency to its direct ancestor, e.g.:

list directories that are not empty and symlinked

had to be modelled in depolex in a way that the verb "are" (be) has two dependencies: "not", and a logical group of adjectives describing directories (like "empty" and "symlinked") which I called "dirbeprop". At the same time, "not" also has "dirbeprop" as its dependency. This is necessary because once the interpreter finds "are" and looks for its dependencies, it'll find "not" and bind it to "are". Then it finds "empty", which gets bound to "not". (I'm ignoring the "and" operator now in order not to complicate matters.) So when the interpreter finds "symlinked", it can only bind it to "are" if "are" is a direct ancestor of it in the dependency hierarchy (depolex). If I had modelled it so that "dirbeprop" was added as a dependency only to "not" (marking "not" as optional) but not to "are", then once "not" had been found for "empty", "symlinked" could not have been bound to "are" across "not": "are" would have been two levels higher in the hierarchy, and as "symlinked" wasn't negated, it of course did not get bound to "not". So the interpreter currently cannot bind dependencies across levels, but for the time being I treat this as a feature, not a bug, as there's a workaround and the solution is not trivial. This means that I had to apply the same strategy wherever negation popped up, so I had to create negative and positive paths for the dependencies of each logical operator. It does work, but I don't know yet how it scales. Nevertheless, I'll make use of the logical operators as soon as possible in the android client and the js client as well.
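The direct-ancestor restriction can be pictured with a toy depolex. This is only an illustration of the binding rule described above, with made-up data structures; the real depolex traversal is far more involved.

```python
# Sketch: a toy dependency hierarchy for "list directories that are not
# empty and symlinked". Both "are" and "not" list "dirbeprop" as a
# dependency, which is the workaround described in the text.
depolex = {
    "are": ["not", "dirbeprop"],
    "not": ["dirbeprop"],
    "dirbeprop": ["empty", "symlinked"],
}

# The interpreter can only bind a dependency to its DIRECT ancestor:
def can_bind(parent, child):
    return child in depolex.get(parent, [])

print(can_bind("not", "dirbeprop"))  # True: direct dependency
print(can_bind("are", "empty"))      # False: two levels apart
```

With "dirbeprop" hanging off "are" directly as well, a non-negated adjective still finds a direct ancestor to bind to, which is exactly why the positive path has to be modelled alongside the negative one.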

The android client also got an update to support 64-bit libraries, which was a must anyway due to Google pushing this. So I'll need to bring this change into the Play Store as well, otherwise they'll kick the app out soon. Let's see if I manage to add some new features at the same time, like searching the contact list with conditions, to make use of the logical operators.

04.01.2018:

Nothing peculiar, just creating an entry to mark today's commit as RC1 alpha:) At least, I consider it feature complete, so no new features are in the pipeline for the first release unless something turns out to be inevitable. The features added now are: support for more than one target language, functor tags, syntactic analysis and analysis switches.

By supporting more than one target language, one can implement a functor in different languages. The language chosen as the target must be passed when calling the interpreter.

Functor tags are useful when parsing the semantic analysis you get back from the interpreter. E.g. you may want to tag the functors of verbs like 'go', 'drive', etc., representing a certain type of activity, as 'navigate'. To achieve this, you'll need to add entries to the FUNCTOR_TAGS db table with your choice of tag-value pairs. You can add any number of tag-value pairs to a functor, and if you provide a trigger tag, they'll only be added if the trigger tag is present in the feature set of the morpheme belonging to the functor at the time of creating the analysis.
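The trigger-tag filtering can be sketched in a few lines. This is an illustrative model of the rule just described, assuming a hypothetical row layout for FUNCTOR_TAGS; the real columns may differ.

```python
# Sketch: keep a tag-value pair only if it has no trigger tag, or its
# trigger tag appears in the morpheme's feature set at analysis time.
# Entry layout (tag, value, trigger) is invented for illustration.
def collect_tags(tag_entries, morpheme_features):
    return {tag: value
            for tag, value, trigger in tag_entries
            if trigger is None or trigger in morpheme_features}

entries = [("activity", "navigate", None),     # unconditional
           ("mode", "car", "Drive"),           # needs 'Drive' feature
           ("mode", "foot", "Walk")]           # needs 'Walk' feature
print(collect_tags(entries, {"Verb", "Drive"}))
```

A client parsing the JSON analysis can then dispatch on stable tags like 'activity' instead of matching individual verbs.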

Support for syntactic analysis finally got added, so now you can get that as well from the interpreter, along with the morphological and semantic analyses.

Adding switches for the different types of analyses makes it possible to get only the type of analysis you want. It's not about simply creating the requested type of analysis at the end: the implementation executes only the code that is necessary for the requested analysis whenever possible. E.g. in case of requesting only morphological analysis, no syntactic or semantic analysis will be carried out. Similarly, when requesting morphological and syntactic analysis, no semantic analysis is carried out at all. Of course, requesting semantic analysis implies carrying out the morphological and syntactic analyses as well.
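The implication between the switches can be made explicit with a tiny sketch. The flag names here are hypothetical; the library's actual switch interface differs.

```python
# Sketch: which analysis stages must run for a given set of requested
# analyses. Semantic analysis implies syntactic and morphological
# analysis; syntactic analysis implies morphological analysis.
def required_stages(morph=False, syntax=False, semantics=False):
    if semantics:
        return ["morphological", "syntactic", "semantic"]
    if syntax:
        return ["morphological", "syntactic"]
    if morph:
        return ["morphological"]
    return []

print(required_stages(morph=True))  # ['morphological']
```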

No client code has been updated but I'll focus more now on preparing the project for a release instead of adding new features:)

12.12.2017:

Today's commit is a huge one, and not only considering its size. Unfortunately, I didn't have time to fix all issues for all clients, as I was mainly focusing on the library itself and the Android client, whose Hungarian version can now be accessed in the Play Store to check out the details or for alpha testing. Concerning the library, I made the design decision to take a step back in order to be able to take two steps forward in a different direction: I abandoned supporting different transcriptors inside the library and left only one, for JSON. This does not mean that the functors cannot be implemented in any language; it only means that no executable code will be assembled by the library. Instead, an analysis is returned in a JSON structure. The analysis needs to be parsed by the platform-specific clients, which create the executable code out of it. As mentioned, I was focusing on Android this time, so you can find an example implementation of such a client in the hi_android folder.

Another important improvement is the database design hardening, though it's rather a hardening of the implementation, as many foreign key constraints were not turned on till now, simply because the SQLite version (3.8) supporting e.g. partial indexes only came with Android API level 21.

There's also a significant performance improvement due to adding a cache to the lexer and restructuring the process of analysis by carrying out tokenization completely before starting the syntactic analysis. This makes it possible to build in some switches influencing the type of analysis to be carried out, be it morphological, syntactic or semantic. However, this is not yet complete. Another thing this change paves the way for is the parallel processing of different token paths. As one word may have more than one morphological analysis, it can easily happen that if A and B are words both having two morphological analyses, then A1-B1, A1-B2, A2-B1 and A2-B2 all need to be analysed. Now that the morphological analyses are done before the syntactic analysis starts, a separate thread can be started for each different token path built from the different morphological analyses.
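The token-path explosion mentioned above is just the Cartesian product of the per-word analyses, which a short sketch makes concrete (the analysis names A1, B2, etc. are of course illustrative):

```python
from itertools import product

# Sketch: words A and B each have two morphological analyses, so four
# token paths must be analysed; each path could go to its own thread.
analyses = {"A": ["A1", "A2"], "B": ["B1", "B2"]}
paths = list(product(*analyses.values()))
print(paths)  # [('A1', 'B1'), ('A1', 'B2'), ('A2', 'B1'), ('A2', 'B2')]
```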

From now on, the interpreter will traverse all possible token paths that can be constructed from the morphological analyses, reporting all successful interpretations and all emerging errors. This means that in case of getting more than one successful interpretation, the caller has to decide which of them is more relevant.

Partial analysis has been enabled as well, so you can still get a full morphological and semantic analysis even if the sentence is ungrammatical. This makes it possible for the client to figure out what the sentence could have meant. E.g. 'list contacts with Peter' is a grammatical sentence while 'list contact with Peter' is not, but as you get back the functors along with their dependencies in the analysis, the client can try to solve the puzzle. Check out the Android client for details.

Another new feature has been introduced by which one can bind lexemes not existing in the lexicon to a functor by their grammatical categories. This solves e.g. the problem of interpreting numbers, as they don't need to be enumerated in the lexicon to assign them to a/one functor. Furthermore, if a stem has neither a lexeme registered in the lexicon nor a dependency entry in the depolex table, that poses no problem any more, unless the node of the stem is combined. Along with that, I made the usage of foma files easier by getting rid of the [lfea] and [gcat] tags. So if you want to use an existing foma fst with this interpreter, you only need to change the rules that access the stems to add the [stem] tag.

Last but not least, a few words about the clients. The desktop client currently only prints the JSON analysis so that you can verify it. The morphological model has been updated to get rid of the [lfea] and [gcat] tags, but neither the syntactic nor the semantic part has been touched or tested thoroughly. Roughly the same can be said about the JS client; however, the GitHub Pages site now runs the code of this commit, along with all the bugs that may occur. I'll fix/improve these clients as soon as possible. The Android client got a major update, making the project Gradle-aware so that it can be developed in Android Studio 3.0. It can be considered a demo of all these changes. The English part is not that useful in real life though, as you can only ask your phone to list your contacts with regard to a certain name using the words: list, contact(s), with, name. E.g. 'list contacts with (name) peter' or just 'list contacts'. However, you can now use it offline as well, since the EXTRA_PREFER_OFFLINE option got introduced for the speech recognizer intent in API level 23, which is therefore now the minimum API level (i.e. Android version 6.0) required by the app. The Hungarian part is a bit more useful, as you can make real phone calls by using the words: hív(d), fel, a, az. E.g. 'hívd fel Pétert', 'hívd Pétert', 'hívd az orvost', 'hívd a 112-t', 'hívd fel a 00 36 1 234 56 78-at', etc. The client will give you a list of numbers if more than one is found for the specified contact name, and you can choose which of them to call simply by saying the sequence number assigned to the phone number in the list, like: 'hívd az elsőt', 'hívd az utolsót', 'hívd a másodikat' or just 'a harmadikat'. This also works with names if Google speech recognition delivers a wrong result.
If you already had a successful interpretation but the contact name did not match any of your contacts, you just have to repeat the name, part of the name or a spelled variant, and the client will try to figure out a match in your contacts.

16.05.2017:

Most of the build scripts have been reworked so that some arguments can be passed to them. Another major change is that the design efforts finally bear fruit: I could write a simple program called gensrc that can generate the bison source from the grammar db table, where you only need to provide the syntactic rules as if they were usual A->B C or A->B bison rules, without any coding. So now those who don't like coding can model a language more easily, but in order to add linguistic features at runtime (like the obligatory main_verb symbol), you'll still need code snippets that take care of it. I also adjusted the documentation slightly on the main wiki page to reflect these changes; see the "Modelling a language" and "How to build" sections. Another important thing got improved in the meantime, namely the runtime error reporting, which now provides better error messages about missing/inconsistent model configuration in the db file, replacing the old slothful behavior of simply quitting with EXIT_FAILURE.
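The rule-generation idea can be pictured with a toy renderer. This is a hypothetical sketch of turning A->B C style rows into bison rule text, not gensrc's actual logic or output format.

```python
# Sketch: render (head, body) pairs from a grammar table as bison rules.
# Row layout and the generated formatting are invented for illustration;
# real bison rules generated by gensrc also carry semantic actions.
def render_rules(rows):
    return [f"{head}: {' '.join(body)};" for head, body in rows]

grammar_rows = [("S", ["NP", "VP"]),            # A -> B C
                ("NP", ["t_ENG_Noun_Stem"])]    # A -> B
print(render_rules(grammar_rows))
```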

29.07.2016:

Changed the transcriptor for js so that instead of real js, now only a JSON structure is generated. I wouldn't drop the other possibility of generating js code directly either, but for online use cases the JSON structure seems more suitable, just like api.ai does it. So now you can check out on the GitHub Pages of the project what you get back if you submit any of these examples (where 'abc', 'def', 'erding', 'thai' are not part of the dictionary, just constants, so you can use whatever you like there):

  • show abc
  • show restaurant abc
  • show thai restaurants
  • show abc in def
  • show location of thai restaurants
  • show location of thai restaurants in erding
  • show location of restaurant abc
  • show location of abc
  • show location of abc in erding
  • show location of restaurant abc in erding
  • show thai restaurants in erding
  • show restaurant abc in erding
  • show restaurants in erding
  • show location of restaurants
  • show location of restaurants in erding
  • show restaurants

If you copy the JSON result from the browser's debugger console, you can simply validate it at e.g. http://jsonlint.com

12.04.2016:

Added support for javascript thanks to emscripten. You can even try it on the GitHub Pages of the project. The words you can use are: show, restaurant, location, of, in. Even though it's only a handful of words, you can already create sentences like 'show locations of thai restaurants in madrid', or just 'show restaurants in paris', or even just 'show mcdonalds'. It spits out an alert with a generated javascript translation of the command, but without any functors being implemented, so currently you won't get anything useful except the process of analysis in the browser's debugger console:)

Multiple language support has been validated on Android by adding a small Hungarian lexicon and morphological analyzer derived from the comprehensive work of Eleonora, whose project Hunmorph-foma is hosted by me among my projects. Thanks to that, one can now make phone calls by saying 'hívd fel ...t' where ... is the name of the person:) There are of course many things to improve, but the multiple language support seems to work.

Improved error handling: until now, the interpreter gave back either a string with an executable script or nothing. For the time being, if it cannot interpret the input, it gives back a string containing something like 'interpreted phrase/stuck at word'. Besides that, you get feedback about the error from bison as well on the standard output.

And finally: hopefully managed to commit more bugfixes than new bugs:)

24.07.2015:

Added support for Android. Check out the Android cross-compile steps file in the corresponding folder for details. Only a smoke test has been done, on my own phone with Android 4.2.2. It accepts only the following voice commands in English: "list contacts", "list contacts with <name>" and "list contacts with name <name>". Instead of executing shell scripts and native executables on Android, which is actually possible but not the most convenient, javascript and java shall be used to implement the functors. Technical details about developing on Android are described here, especially in the section 'Binding JavaScript code to Android code': https://developer.android.com/guide/webapps/webview.html#BindingJavaScript

Have fun:)

24.06.2015:

Translation capability is back in the framework, and support has been added for defining relative clauses, with the restriction that they can only have an auxiliary but no main verb. So the following sentence can now be interpreted: "list files that are in directory abc".

02.10.2014:

Managed to screw up the whole repository and had to restore it from the available versions, so the changes between the initial commits are not that gradual any more, and the commit dates of course don't reflect the dates when the changes were originally committed.

22.07.2013:

Since the first commit the framework has changed a lot, and the current version is not capable of translating the commands into shell scripts but simply validates their feasibility according to the model set up in customizing. However, the feature will be back as soon as possible.

Support for morphosyntactic rules has been added using the foma library. In addition, a second goal has been set and already partially achieved (while breaking the translation capability for the time being): providing a reusable framework for development or educational purposes in natural language processing.

29.01.2012:

Moving to C++.

14.12.2010:

Uploading the initial version of the interpreter, which was my degree thesis in programming.