Appointments Using Turkish Natural Language Sentences
Abstract
We have implemented an interface for saving and querying appointments via Turkish sentences, built on an existing infrastructure. The developed program, TuSA (Turkish Speaking Assistant), takes Turkish sentences that contain appointment details and applies morphological, syntactic and semantic analysis to convert them into logical formulas. The program also provides a way to query the appointment data set using these formulas. Although TuSA is based on an existing infrastructure, it has its own additions, which do not prevent it from cooperating with other applications built on that infrastructure.
1 Introduction
NLP (Natural Language Processing) provides an effective communication interface between human beings and computer systems, and it has a very large scope across application areas and languages. NLP converts natural language sentences into a form that computers can handle by using morphological, syntactic and semantic analyses, which are usually performed separately but operate in coordination. We have implemented an interface for saving and querying appointments via Turkish sentences, built on an existing infrastructure.
The developed program, TuSA (Turkish Speaking Assistant), takes Turkish sentences that contain appointment details and applies morphological, syntactic and semantic analysis to convert them into logical formulas. The program provides a way to query the desired appointment data set using these formulas. Although TuSA is based on an existing infrastructure, it has its own additions, which do not prevent it from cooperating with other applications built on that infrastructure.
2 Knowledge Representation
For an NLP application, analyzing all possible Turkish sentences would be a very ambitious, perhaps impossible, task, so we have limited ourselves to Turkish sentences that deal with appointments. Since TuSA is an NLP application, even though it covers only a subset of Turkish sentence structures, problems related to character sets, lexicons and morphology are inherently problems for TuSA. Most of the work in the development phase of TuSA was done at the syntactic and semantic levels, and a few modifications were applied to the morphology used in TOY (Çetinoğlu, 2001).
| Phrase Name | English | Explanation |
|---|---|---|
| Fiil | Verb | The command to TuSA |
| Konu | Subject | The subject of appointment. |
| Person | Person | The person in the appointment. |
| Yer | Location | The location of appointment. |
| Zamanustu | Date | Time of appointment in the names of days, months or years |
| Zamanalti | Time | Time of appointment in the names of hour or minute |
| Uzunluk | Duration | How long is this appointment? |
| Tekrar | Recurring | Recurring information about appointment. |
| TS | Sentence | Biggest phrase that can include all of the above. |
Table 1. Syntactic Functions in TuSA
2.1 Syntax in TuSA
At the syntax level of TuSA, the most important concept is the phrase. A phrase is a group of words. In its most general form a sentence is itself a phrase, but a sentence can be built from many phrases, and each phrase can in turn be built from further phrases or words.
Sentence → Phrase
Phrase → Phrase / Word
Phrases in TuSA correspond to the data fields found in a typical calendar program. Since such programs keep the subject of a meeting, the person in the meeting, its location, its date and time, its duration and its recurrence information, we have built our phrase list as shown in Table 1.
The syntactic analysis starts with the phrase Turkish Statement (TS), which is the root for all decision rules. Starting from this phrase, many decision rules are applied in order to find a possible syntax for the given input.
For instance, suppose "Haftaya ona iki kala Aliyle Cananda yedi saatlik toplanti var" (Next week at two to ten we have a seven-hour meeting with Ali at Canan's) is given to TuSA as input.
It is caught by TS, and the syntax analyzer tries to find possible phrases for each word and word group.
If the syntax of the sentence is correct, each word or word group is eventually assigned to a phrase, as shown below:
Search all phrases for the word "Haftaya" (Next week) and match it to the Zamanustu phrase.
Continue with the word "ona" (to ten), which matches no phrase; extend the search to "ona iki" (to ten two), again with no match; then extend it to "ona iki kala" (two to ten), which matches the Zamanalti phrase.
Continue with the remaining words until the end of the sentence, using the same combination process.
At the end TS yields a list of phrases as below.
S->Zu Za P Y U K F
In the long form: sentence -> Zamanustu Zamanalti Person Yer Uzunluk Konu Fiil
Zu = Haftaya (Next week)
Za = Ona iki kala (2 to 10)
P = Aliyle (with Ali)
Y = Cananda (at Canan)
U = Yedi saatlik (for 7 hours)
K = Toplanti (meeting)
F = Var (There is)
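The word-group search described above can be sketched as follows. This is a minimal Python illustration, not TuSA's Prolog code: the `PHRASES` table and the function `segment` are hypothetical, and the sketch simply grows a word group one word at a time until some phrase accepts it, without the backtracking that Prolog would provide.

```python
# Toy phrase lexicon mapping a word group to a phrase label from Table 1.
# The entries are illustrative assumptions, not TuSA's actual lexicon.
PHRASES = {
    ("haftaya",): "Zu",
    ("ona", "iki", "kala"): "Za",
    ("aliyle",): "P",
    ("cananda",): "Y",
    ("yedi", "saatlik"): "U",
    ("toplanti",): "K",
    ("var",): "F",
}


def segment(sentence):
    """Assign each word or word group to a phrase, growing the group
    one word at a time until some phrase accepts it."""
    words = sentence.lower().split()
    result = []
    start = 0
    while start < len(words):
        for end in range(start + 1, len(words) + 1):
            group = tuple(words[start:end])
            if group in PHRASES:
                result.append((PHRASES[group], " ".join(group)))
                start = end
                break
        else:
            # No extension of the current group matched any phrase.
            return None
    return result


if __name__ == "__main__":
    parsed = segment("Haftaya ona iki kala Aliyle Cananda yedi saatlik toplanti var")
    print(" ".join(label for label, _ in parsed))
```

For the example sentence the sketch produces the label sequence Zu Za P Y U K F, and it returns `None` when some word group cannot be matched.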
Since Turkish is an agglutinative language and a phrase can go almost anywhere in the sentence, the phrases can be permuted. The phrase combinations are listed in Table 2.
| Pattern | Turkish | English |
|---|---|---|
| SKF | Toplanti var. | A meeting. |
| SZKF | Onda toplanti var. | A meeting at 10 o’clock. |
| SYKF | Okulda gorusme var. | A meeting in the school. |
| SPKF | Aliyle toplanti var. | There is a meeting with Ali. |
| SYZKF | Okulda onda toplanti var. | A meeting at school at 10 o’clock. |
| SZYKF | Onda okulda toplanti var. | At 10 o’clock a meeting in the school. |
| SZPKF | Onda Ali ile toplanti var. | At 10 o’clock with Ali there is a meeting. |
| SPZKF | Ali ile onda toplanti var. | With Ali at 10 o’clock there is a meeting. |
| SZPYKF | Onda Ali ile okulda toplanti var. | At 10 o’clock with Ali in the school there is a meeting. |
| SPZYKF | Ali ile onda okulda toplanti var. | With Ali at 10 o’clock in the school there is a meeting. |
| SYZPKF | Okulda onda Ali ile toplanti var. | In the school at 10 o’clock with Ali there is a meeting. |
| SZYPKF | Onda okulda Ali ile toplanti var. | At 10 o’clock in the school with Ali there is a meeting. |
| SPYZKF | Ali ile okulda saat onda toplanti var. | With Ali in the school at 10 o’clock there is a meeting. |
Table 2. Phrases in TuSA
Some simplifying assumptions are made in the program. First, we assume that each command (sentence) in TuSA contains a verb and that this verb is at the end of the sentence, as is common in Turkish. Second, we noticed that most commands contain a subject field (a meeting without a subject is meaningless), so we assume that every command has a subject phrase just before the verb phrase (the few exceptions without a subject are also covered).
Under these assumptions, the sentence with the smallest number of phrases contains only a subject and the verb at the end, as in “Toplanti var” (There is a meeting).
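Under the two assumptions above, the admissible phrase orders can be enumerated mechanically. The following Python sketch is only an illustration (the function name `sentence_patterns` is hypothetical, and Duration and Recurring phrases are left out for brevity): it places any subset of the time (Z), location (Y) and person (P) phrases, in any order, before the fixed subject (K) and verb (F).

```python
from itertools import combinations, permutations

OPTIONAL = ["Z", "Y", "P"]  # time, location, person


def sentence_patterns():
    """Enumerate phrase orders: any subset of the optional phrases in any
    order, followed by the fixed subject (K) and verb (F)."""
    patterns = []
    for size in range(len(OPTIONAL) + 1):
        for subset in combinations(OPTIONAL, size):
            for order in permutations(subset):
                patterns.append("S" + "".join(order) + "KF")
    return patterns


if __name__ == "__main__":
    for pattern in sentence_patterns():
        print(pattern)
```

The enumeration yields 16 patterns, among them the 13 listed in Table 2.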
In Turkish, combining words and phrases does not change the phrase type, provided that the combination obeys the grammar rules.
For example, the word “gel” (come) is a verb phrase, and “Hemen gel” (come immediately), a combination of the adverb “hemen” (immediately) and the verb phrase “gel” (come), is still a verb phrase. Such combinations are taken into account in TuSA as shown in Table 3. They are most numerous in the time (Zaman) phrases.
During the development phase we observed that there is no difference between the person phrase of “Ali ile toplanti var” (There is a meeting with Ali) and that of “Dün görüştüğüm Ali ile toplantı var” (There is a meeting with Ali, whom I met yesterday). In the second sentence, the word group “Dün görüştüğüm” acts as an adjective of Ali. We therefore concluded that any word group that comes before a proper noun can be attached to the proper noun as its description, and this is how it is implemented in TuSA. The main idea is a simple recursion, as below.
Person → Anything / Proper-noun
The analysis of the proper noun is passed to the morphological level, and if the input word fulfils the proper-noun condition, all the words before it are considered the determinative of this proper noun.
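The rule Person → Anything / Proper-noun can be illustrated with a small Python sketch. The set `PROPER_NOUNS` and the function `parse_person` are hypothetical names used only here; in TuSA the proper-noun check is delegated to the morphological level, whereas this sketch uses a plain set lookup on an unsuffixed word.

```python
# Hypothetical set of proper-noun roots; TuSA takes these from its lexicon.
PROPER_NOUNS = {"ali", "canan", "kemal", "ahmet"}


def parse_person(words):
    """Return (preceding_words, proper_noun) if the word group ends in a
    proper noun, else None. Everything before the proper noun is treated
    as its description."""
    if not words:
        return None
    *preceding, head = words
    if head.lower() in PROPER_NOUNS:
        return preceding, head
    return None


if __name__ == "__main__":
    print(parse_person(["Dün", "görüştüğüm", "Ali"]))
    print(parse_person(["kalabalik", "toplanti"]))
```

A word group such as “kalabalik toplanti” is rejected because its last word is not a proper noun, so the analyzer would go on to try the other phrases.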
| Phrase type | Rule | Example |
|---|---|---|
| Verb phrases | FV | Goster (Show) |
| | FNV | Hemen goster. (Show immediately) |
| Subject phrases | KN | toplanti (the appointment) |
| | KSK | kalabalik toplanti (the crowded appointment) |
| | KKconjK | toplanti ve yemek (the appointment and the dinner) |
| Proper noun phrases | PI | Ali (proper noun) |
| | PSI | butun Aliler (all Ali’s) |
| | PPconjP | Ali ve Ahmet (Ali and Ahmet) |
| Location phrases | YL | Ankarada (at Ankara) |
| | YSY | kucuk salonda (in the little saloon) |
| | YYconjY | evde ve ofiste (at home and in the office) |
| Time phrases | ZZa | |
| | ZZu | |
| | ZD | |
| | ZZuZaD | onbir aralik, on ikide, iki saatlik (on 11 December at 12.00 for two hours) |
| | ZaMH | on ellide (at fifty past ten) |
| | ZaRelativeH | iki saat sonra (two hours later) |
| | ZuGAY | on ocak ikibinde (at 10 January 2002) |
| | ZuRelativeG | iki gun sonra (two days later) |
Table 3. Shortened version of Verb, Subject, Proper-noun, Location and Time Phrases
Unfortunately, the recursive approach used for the person phrase has some disadvantages. The mechanism has to be implemented for all the basic phrases (Person, Location and Subject), because each of them can take determinatives in the same way as the person phrase shown above, and keeping all these recursive predicates synchronized is very difficult.
2.2 Morphology Level
In Turkish, words are constructed by appending inflectional and derivational suffixes to roots, taking vowel harmony into account. Because of this, Turkish morphology is difficult to handle in applications that build on it.
In TuSA, finite state machines (one for nouns and one for verbs) are used for parsing words: the initial nodes are the possible roots, and the final nodes are reached by following arcs, each of which represents the addition of a suffix. These FSMs are based on Oflazer’s work on Turkish morphology (Oflazer, 1993). For instance, the verb
geliyor musun? (Are you coming?)
given to the FSM starting from possible verb root, reaches one of the final nodes by producing the following deconstruction:
gel + Hyor +mH + sun
(verb-root)(tense)(question)(person2singular)
During this parsing operation, enforcing vowel harmony is the most difficult part of the work. To apply this restriction, some letters are used to represent vowels that have to be determined according to the preceding vowels, like H in the example above. H represents a high vowel (“i” or “ı”, or their rounded counterparts “u” and “ü”, as in “mu” above) and is resolved during parsing by taking the preceding vowels into account.
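How a placeholder such as H is resolved can be shown with a short Python sketch. It runs in the generation direction (stem plus suffix template to surface form) rather than in TuSA's parsing direction, and the names `HIGH_VOWEL`, `last_vowel` and `attach` are hypothetical. It relies only on the standard four-way harmony rule for high vowels and ignores other phenomena, such as vowel-final stems.

```python
VOWELS = "aeıioöuü"

# Four-way harmony for the high-vowel placeholder H: the last vowel
# before the placeholder decides which high vowel is used.
HIGH_VOWEL = {
    "a": "ı", "ı": "ı",
    "e": "i", "i": "i",
    "o": "u", "u": "u",
    "ö": "ü", "ü": "ü",
}


def last_vowel(text):
    """Return the last vowel in text, or None if there is none."""
    for ch in reversed(text):
        if ch in VOWELS:
            return ch
    return None


def attach(stem, suffix):
    """Append a suffix template to a stem, replacing each H placeholder
    with the high vowel that harmonizes with the preceding vowel."""
    out = stem
    for ch in suffix:
        if ch == "H":
            prev = last_vowel(out)
            if prev is None:
                raise ValueError("no vowel to harmonize with")
            out += HIGH_VOWEL[prev]
        else:
            out += ch
    return out


if __name__ == "__main__":
    word = attach("gel", "Hyor")             # gel + Hyor
    question = attach(word + " ", "mH")      # + mH, written as a separate word
    print(attach(question, "sun"))           # + sun
```

Applied to the deconstruction gel + Hyor + mH + sun, the sketch rebuilds “geliyor musun”: H becomes “i” after “e” and “u” after “o”.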
During the construction of a word, each suffix can change the type or the role of the word in the sentence, as in:
Çocuk (child: noun) gel (come: verb root)
gel (come) + -en (suffix) çocuk (child: noun)
In the first line “gel” (come) is used as a verb, but after the suffix -en is added the word becomes an adjective (“gelen”, coming) that modifies “çocuk” (child). This is the derivational effect of suffixes on a word in Turkish. These additions become more important in the syntactic phase, where words are matched to phrases.
At the morphology level, TuSA searches for the most suitable words for the relevant phrases. Each phrase has its own morphological specifications. A word can be matched to a phrase by looking at its root or at the final state reached by adding suffixes to the root, so both a search on roots and a search on suffixes are implemented.
For instance, if the phrase to be matched is Person, the program first searches for words with proper-noun roots. This search alone is not enough to find the correct match, so another search is done on the suffixes of these words, and then the correct words are found for the relevant phrases. An ambiguous example is given below:
“Aliyle Cananda toplanti var.”
(There is a meeting with Ali at Canan)
In this example both the location and the person phrases have proper-noun roots, so the correct phrase matching cannot be done by looking only at the roots. As explained above, TuSA makes the correct match by looking at the suffixes of these words.
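The suffix-based disambiguation can be sketched in Python as follows. The suffix lists and the function `classify` are illustrative assumptions that cover only the suffixes needed for this example. TuSA performs this check with its finite state machines, whereas the sketch simply strips a known suffix from the right-hand end of the word and looks up the remaining root.

```python
# Hypothetical proper-noun roots; in TuSA these come from the lexicon.
PROPER_NOUNS = {"ali", "canan"}

# Illustrative suffix lists, without full harmony handling.
LOCATIVE = ("da", "de", "ta", "te")
COMITATIVE = ("yla", "yle", "la", "le")


def classify(word):
    """Return (label, root): 'Y' (location) for a proper noun with a
    locative suffix, 'P' (person) for a bare proper noun or one with a
    comitative suffix, and None if the word fits neither phrase."""
    w = word.lower()
    if w in PROPER_NOUNS:
        return ("P", w)
    for label, suffixes in (("Y", LOCATIVE), ("P", COMITATIVE)):
        for suffix in suffixes:
            root = w[: -len(suffix)]
            if w.endswith(suffix) and root in PROPER_NOUNS:
                return (label, root)
    return None


if __name__ == "__main__":
    for token in "Aliyle Cananda toplanti var".split():
        print(token, classify(token))
```

Here “Aliyle” is labelled as a person and “Cananda” as a location, although both have proper-noun roots.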
In Turkish, a word can match both the person and the location phrase. The selection between these possibilities is handled by the execution strategy of Prolog: clauses are tried in the order in which they are written in the program, and once a clause succeeds, the remaining alternatives are not examined unless a later failure forces backtracking. So the first possibility in the code is chosen and the match is made.
Another problem concerns the examination of these suffixes. As discussed for the syntax level, some predicates work recursively, and if the predicates are ordered wrongly these recursive predicates cause left recursion, which prevents the program from terminating.
Since most words are constructed by adding suffixes to the right of the root, for some special purposes in TuSA the parsing operation on the FSM is reversed, and the words are parsed starting from each final node in order to find the root. (Only in some exceptional cases are words given without any suffixes.)
2.3 Lexicon
The Turkish lexicon mostly rests on combinations of a root and suffixes. Most words are constructed by adding suffixes to the root, where vowel harmony is the most important requirement. In TuSA, each lexicon entry consists of the root, the type of the root (used for selecting the initial node in the FSM), the type of the word, its semantic representation and the last vowel of the word. Lexicon entries in TuSA are kept in the following structures.
Proper noun representation:
tr_morph_entry('AdKök',[['K',e,m,a,l],
[type(propernoun),sem(kemal)]],_,_,_,e,l,ok).
where
the first entry ('AdKök', noun root) represents the type of the root,
the second entry (['K',e,m,a,l]) represents the word,
the third entry ([type(propernoun),sem(kemal)]) represents the type of the word and its semantic representation.
Location representation :
tr_morph_entry('AdKök',[[o,k,u,l],
[type(yer),sem(school)]],_,_,_,u,l,ok).
Time representation:
tr_morph_entry('ZamanKök',[[b,u,g,ü,n],
[type(zamanu),sem(today)]],_,_,_,ü,n,ok).
Number representation:
tr_morph_entry('AdKök',[[ç,e,y,r,e,k],
[type(number),sem(number([1,5]))]],_,_,_,e,k,ok).
The semantic representation of each lexicon entry is implemented following the descriptions in Covington (1994).
2.4 Execution Steps of TuSA
2.4.1 Data insertion to TuSA
The user enters the sentence “Haftaya ona iki kala arkadasim Aliyle komsumuz Cananda yedi saatlik yemekli bir toplanti var” (Next week, at two to ten, we have a dinner meeting with my friend Ali at our neighbor Canan for seven hours) using the XPCE input screen offered by TuSA.
The sentence is sent to the syntax analyzer (starting from the TS predicate explained in Section 2), which matches each word or word group to a phrase. During this matching process TS tries many predicates in the program, and all but one of them fail, such as the attempts to match “Haftaya” (next week) to the person or location phrases. The attempts end with a successful match, here to the Date phrase. In most cases, more than one predicate has to be checked in order to decide whether a word matches a phrase. In this example, although “Haftaya” is a word that is accepted by one of the Zu predicates, it does not match the first of them, ZuGAY. To handle this, TuSA checks the same word against all 8 different Zu predicates and only then concludes whether it matches the date phrase or not.
The syntax analyzer stops execution when all words are matched to phrases.
After the syntax analyzer returns the results in the form shown in Table 4, the semantic representation of the sentence is constructed. The example below shows how a word group is translated into its semantic representation.
“komsumuz Cananda” (at our neighbor Canan) is the input for TS in this case.
TS first tries the phrases that precede the person phrase, finds no match, and then tries to satisfy the person phrase. Canan is naturally a candidate for a person phrase because it is a proper noun. TS tries three different sub-rules and finally reaches Person → Anything Propernoun, in which the Anything part (here “komsumuz”, our neighbor) is ignored. One more check is needed to satisfy this rule: TS checks the suffixes of the word, in this case Canan (proper-noun root) + -da (locative suffix). It does not accept this word group as a person phrase, because no suitable person phrase definition exists. The only definitions on proper nouns in the person phrase are
Person → Propernoun (without any suffixes)
Person → Propernoun (with relative)
So TS continues with the other phrases and finds the location phrase, which is satisfied by the rule Location → Anything Propernoun (locative).
So the semantics of the input “komsumuz Cananda” (at our neighbor Canan) is returned to TS. In this case it is [komsumuz, Canan],
and TS yields the following semantic representation:
takvim(11,9,58,22,11,2002,[komsumuz,canan],[arkadasim,ali],[yemekli,bir,gorusme],7,'saat').
where,
takvim(id of this record, hour, minute, day, month, year, location, person, subject, duration, duration unit).
Some comments on the semantics:
The id is an automatically incremented value that keeps each meeting unique. The minute, hour, day, month and year values are calculated by simple arithmetic operations. For instance, if the input is a relative time (e.g. “ikibin saat sonra” (after two thousand hours)), TuSA gets the current time from the operating system, converts 2000 hours to years, months and days, adds this value to the current time, and carries any overflow (for example, when the day value becomes larger than 30, the day is taken modulo 30 and the quotient is added to the month, and so on).
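The carry arithmetic for relative times can be sketched in Python as below. The function `add_hours` is a hypothetical illustration of the idea: it treats every month as 30 days long, as the description above suggests, so its results can differ from the real calendar by a few days.

```python
def add_hours(hour, minute, day, month, year, hours_to_add):
    """Add a number of hours to a date using simple carry arithmetic.
    Every month is treated as 30 days long, as a simplification."""
    total_hours = hour + hours_to_add
    hour = total_hours % 24
    total_days = (day - 1) + total_hours // 24     # zero-based day count
    day = total_days % 30 + 1
    total_months = (month - 1) + total_days // 30  # zero-based month count
    month = total_months % 12 + 1
    year = year + total_months // 12
    return hour, minute, day, month, year


if __name__ == "__main__":
    # 2000 hours after 22/11/2002 09:58, with 30-day months
    print(add_hours(9, 58, 22, 11, 2002, 2000))
```

A calendar-accurate variant could instead rely on `datetime.timedelta` from the Python standard library, which accounts for the actual month lengths.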
| Input in Turkish | English | Phrase | Phrase name |
|---|---|---|---|
| Haftaya | Next week | Zu | Date |
| ona iki kala | at two to ten | Za | Time |
| arkadasim Aliyle | with my friend Ali | P | Person |
| komsumuz Cananda | at our neighbor Canan | Y | Location |
| yedi saatlik | for seven hours | D | Duration |
| yemekli bir toplanti | a dinner meeting | K | Subject |
| var | there is | F | Verb |
Table 4. Parsed output of the input sentence
The location, person and subject fields are lists. As discussed before, several words can be entered in these fields, so we keep them as lists. Duration is kept as an integer value, and the duration unit keeps the unit of this duration (minute, hour, day, month or year). If the duration is a very high number (greater than 100), it is automatically converted to a higher unit.
This semantic output is saved to a file. Because it consists of Prolog facts, this file can itself be used as a relational database.
2.4.2 Querying in TuSA
TuSA can be used not only for inserting appointments but also for querying them. The same graphical user interface is used for getting the user’s query sentences. For instance, if the following sentence is given,
“Haftaya Aliyle olan yemekli toplantilari goster”
(Show next week's dinner meetings with Ali)
the program first converts the sentence to the semantic representation explained in Section 2.4.1 and yields the following:
takvim(_,_,_,22,11,2002,_,[ali],[yemekli,toplanti],_,_).
This semantic representation is used as a template: the fields given in the query are fixed, and the program searches the database file for records that fill the remaining empty fields. Records are eliminated by comparing them with the phrases present in the semantic representation.
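The template search can be illustrated with the following Python sketch. It is not TuSA's Prolog code: `None` stands in for Prolog's anonymous variable `_`, the records in `RECORDS` are made-up examples, and the rule that a list field matches when every queried word occurs in the stored list is an assumption, since the exact matching rule is not specified here.

```python
# Field order follows the takvim example: id, two time-of-day fields,
# day, month, year, location, person, subject, duration, duration unit.
# Both records are hypothetical examples.
RECORDS = [
    (11, 9, 58, 22, 11, 2002, ["komsumuz", "canan"], ["arkadasim", "ali"],
     ["yemekli", "bir", "toplanti"], 7, "saat"),
    (12, 14, 0, 23, 11, 2002, ["okul"], ["ahmet"], ["toplanti"], 1, "saat"),
]


def field_matches(wanted, stored):
    """None plays the role of Prolog's '_'. A list field matches when
    every wanted word occurs in the stored list (an assumed rule)."""
    if wanted is None:
        return True
    if isinstance(wanted, list):
        return all(word in stored for word in wanted)
    return wanted == stored


def query(template, records=RECORDS):
    """Return the records whose fields all match the template."""
    return [record for record in records
            if all(field_matches(w, s) for w, s in zip(template, record))]


if __name__ == "__main__":
    template = (None, None, None, 22, 11, 2002, None,
                ["ali"], ["yemekli", "toplanti"], None, None)
    for record in query(template):
        print(record)
```

With this template only the first record is returned, because the second one falls on a different day and involves a different person.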
If any results are found with the help of the search mechanisms, they are shown to the user in the following format:
---------------------------------------------------------
ID:11
Tarih:22/Kasim/2002|09:58|Cuma (Date:22/November/2002|09:58|Friday)
Kisi: arkadasim ali (Person: my friend ali)
Yer:komsumuz Canan (Location: Neighbour Canan)
Konu:yemekli bir toplanti (Subject: Dinner)
Suresi:7 saat (Duration: 7 hours)
Semantic:
takvim(11,9,58,22,11,2002,[komsumuz,canan], [arkadasim,ali],[yemekli,bir,gorusme],7,'saat').
---------------------------------------------------------
3 Environment Specifications
Programming Language: Prolog, which is based on first-order predicate calculus, is used as the programming language; by its nature it can also be used as a relational database.
Development Environment: SWI-Prolog, developed at the University of Amsterdam, is used as the programming environment. XPCE is used as the graphical user interface module.
Operating System: SWI-Prolog is an open-source programming environment, so it can run on almost every operating system, but TuSA has been tested only under Windows 98/2000, Unix and Linux.
4 Conclusion
TuSA (Turkish Speaking Assistant) is an initiative in its area. It is implemented for storing appointment information using Turkish natural language sentences and can cooperate with applications based on the same infrastructure, developed by Çetinoğlu (2001). TuSA is an ongoing project, and planned additions include a web interface, MS Outlook synchronization and phonetic usage. The chosen language, the environment and the extensibility of the code will ease these additions.
General information, usage examples and milestones can be accessed via the TuSA homepage on the internet.
References
Covington M. A. (1994) Natural Language Processing for Prolog Programmers. Prentice Hall, Englewood Cliffs, NJ. 348 p.
Çetinoğlu Ö. (2001) A Prolog Based Natural Language Processing Infrastructure for Turkish, M.S. Thesis, Boğaziçi University.
Oflazer K. (1993) Two-Level Description of Turkish Morphology, Proc. Second Turkish Symposium on Artificial Intelligence and Neural Networks, BU Press, pp. 86-93.