A Grammar engine for Nguni natural language interfaces (GeNi)



Project funded by the National Research Foundation of South Africa under the Competitive Programme for Rated Researchers (CPRR) -- Y-rated development grant, 2015-2017 (3 years)

Overview

Introduction and background

The use of natural languages in applications is ubiquitous. Canned, unchangeable text can be used for some scenarios, but not when the information to be communicated depends on the context and large amounts of text. This is addressed by controlled natural languages and natural language generation (NLG) systems, which take structured data or knowledge as domain input that is matched at runtime with templates or a grammar engine to generate the text. NLG systems mainly focus on generating English, however, and neither an NLG system nor sufficient theoretical foundations exist for the indigenous South African languages, despite the need for them. Preliminary results in isiZulu NLG have shown that a template-based approach is infeasible for Bantu languages, mainly due to their complex grammar rules, noun class system, and agglutination. Thus, extant NLG systems cannot be adopted for Bantu languages, and a grammar engine is required to obtain automatically generated understandable text.

Aims

The aims of this project are to define the formal and algorithmic foundations for an isiZulu/isiXhosa grammar engine and to implement it to realize a (controlled) NLG system. The project will uncover sentence and linguistic realization patterns, postulated to be very similar for isiZulu and isiXhosa, and it will ensure incorporation of multilingualism. The rules and modular, efficient algorithms will make the grammar usable for computation. This will be optimized on linguistic annotations of the input and text generation at runtime. A proof-of-concept grammar engine for isiZulu/isiXhosa will be developed to validate the theory. To ensure broad usability and interoperability with related theoretical and technological advances, such as linguistic linked data and ontology-driven information systems, it will use as input files domain knowledge that is represented in ontologies serialized in the Semantic Web language OWL, which also facilitates incremental system development.

Participants and collaborators


Outputs

A simplified view is as follows. One has the data, information, or knowledge represented in a structured way, e.g., in a Description Logic (DL; right-hand side of the figure below). These axioms serve as input to certain algorithms (arrows pointing to their respective names). Each algorithm determines how its axiom type is verbalised (implemented as a set of functions written in Python in this case). The automatically generated output of each algorithm, a sentence in isiZulu, is shown in the line below it. This involves a set of core functions for the axioms [RuleML14, CNL14, LRE16], how to pluralise isiZulu nouns [CICLing16], and how to handle part-whole relations [INLG16].

The above figure shows the various components being linked up 'conceptually', i.e., which axiom types are linked to which functions in Python (well, a subset of what is supported). This has been implemented in the meantime. That is: the "DL axiom" on the right-hand side of the figure is serialised in an OWL file so that a computer can process it, and Owlready is used to process that OWL file and link it to the implemented verbalisation algorithms. A graphical user interface is wrapped around it. This GUI is shown in the following screenshot, which also has some annotations added to it afterward to explain what is going on.

There is a bit of a disconnect between how the relations (verbs, object properties) are represented in that structured knowledge representation and what we need for isiZulu. For instance, in the figure above, on the right-hand side, it says "dla" (eat), but the output on the left-hand side shows it as, e.g., "zidla" and "azidli". The algorithm takes care of that by using the noun class of the noun and whether the verb is negated or not. There are more such issues, notably with prepositions, such as in 'part of' and 'contained in' [INLG16]. This is now dealt with using a new model for annotations and a separate data structure [EKAW16 and examples].
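The step from "dla" to "zidla" or "azidli" can be illustrated with a minimal Python sketch. It is not the project's actual code: the table SUBJECT_CONCORD and the function conjugate are hypothetical names introduced here, the table covers only a handful of noun classes, and it assumes a consonant-initial verb root in the present tense, where the negative form takes a negative subject concord and changes the final vowel -a to -i.

```python
# Illustrative subset of isiZulu subject concords, keyed by noun class.
# Each entry is (positive concord, negative concord).
# Assumption: this table and the function below are a sketch for this page,
# not the data structures used in the GeNi implementation.
SUBJECT_CONCORD = {
    1: ("u", "aka"),
    2: ("ba", "aba"),
    7: ("si", "asi"),
    8: ("zi", "azi"),
    9: ("i", "ayi"),
    10: ("zi", "azi"),
}


def conjugate(verb_root: str, noun_class: int, negated: bool = False) -> str:
    """Attach the subject concord for the noun class to a present-tense verb root."""
    positive, negative = SUBJECT_CONCORD[noun_class]
    if negated:
        # Present-tense negation: negative concord, final vowel -a becomes -i.
        stem = verb_root[:-1] + "i" if verb_root.endswith("a") else verb_root
        return negative + stem
    return positive + verb_root


if __name__ == "__main__":
    print(conjugate("dla", 10))                # zidla
    print(conjugate("dla", 10, negated=True))  # azidli
```

Vowel-initial verb roots, other tenses, and the remaining noun classes need additional phonological rules that this sketch deliberately leaves out.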
There are some indications as to how well these fundamentals will, or will not, work with languages related to isiZulu. Runyankore, a language spoken several thousand kilometres to the north in Uganda, was experimented with, and the bootstrapping approach from isiZulu was promising [CNL16]. While this was initially surprising, an orthographic analysis showed Runyankore to be fairly similar to isiZulu regarding agglutination, as were several other languages outside the Nguni language cluster, such as chiShona, but not Kiswahili [arxiv16].