T.C.

YEDİTEPE UNIVERSITY

FACULTY OF ENGINEERING AND ARCHITECTURE

DEPARTMENT OF COMPUTER ENGINEERING

 

TEMPLATE GENERATOR USING NATURAL LANGUAGE

(TEG-NALAN)

 

by

Z.İlknur Karadeniz

ENGINEERING PROJECT REPORT

Approved by:

            Asst. Prof. Dr. Ender Özcan              

            (Supervisor)

            Prof. Dr. Şebnem Baydere                 

               Dr. Birol Aygün

Date of Approval: 30 / 12  /  2003

TABLE of CONTENTS

LIST OF FIGURES

LIST OF TABLES

LIST OF ABBREVIATIONS

ACKNOWLEDGEMENTS

ABSTRACT

ÖZET

1. INTRODUCTION

2. PREVIOUS STUDIES

2.1. Natural Language Processing and Machine Learning

2.1.1. TOY

2.1.2. TuSA

2.2. Template and Code Generation

2.2.1. Rational Rose/C++

2.2.2. A Natural Language Interface for Programming in Java (NaturalJava)

3. NATURAL LANGUAGE PROCESSING (NLP)

3.1. What is NLP?

3.2. Turkish

4. THE COMPONENTS OF TEG-NALAN

4.1. Augmented Transition Network (ATN)

4.2. Concept Hierarchies

4.2.1. Inclusion and ISA

4.2.2. Membership and HASA

4.3. Schemata

5. REQUIREMENTS and PROGRAMMING ENVIRONMENT

5.1. SWI-Prolog



 

LIST OF FIGURES

 

 

Figure 2. 1 Target Scenario of TOY

Figure 2. 2 Results of a Query Given to TuSA

Figure 2. 3 Rational Rose Interface

Figure 2. 4 Architecture of NaturalJava

Figure 3. 1 Turkish Letters

Figure 3. 2 Turkish Morphology Example

Figure 3. 3 Suffix Examples

Figure 3. 4 Example Sentences (Ambiguity)

Figure 3. 5 Turkish Syntactic Categories

Figure 3. 6 Word Order

Figure 3. 7 Inverted Sentences

Figure 4. 1 ISA Hierarchy

Figure 4. 2 HASA Hierarchy

Figure 6. 1 Library Hierarchies in TEG-NALAN

Figure 6. 2 Data Flow Diagram

Figure 6. 3 Architecture of TEG-NALAN

Figure 7. 1 Word Entries

Figure 7. 2 Suffix Changes

Figure 7. 3 Example of a Class Definition Sentence

Figure 7. 4 Example of an Interface Definition Sentence

Figure 7. 5 ATN for Class Declaration

Figure 7. 6 Example of an Undetailed Attribute Declaration Sentence

Figure 7. 7 Example of a Detailed Attribute Declaration Sentence

Figure 7. 8 ATN for Attribute Declaration

Figure 7. 9 ATN for Detailed Method Declaration

Figure 7. 10 Method Declaration with Parameters

Figure 7. 11 Example of Detailed Method Declaration Sentence

Figure 7. 12 ATN for Detailed Method Declaration

Figure 7. 13 Example of Hierarchy Declaration Sentence

Figure 7. 14 ATN for ISA Relationship Declaration

Figure 8. 1 Data Structure for Attributes

Figure 8. 2 Data Structure for Methods

Figure 8. 3 Data Structure for Classes

 

 

 

 

LIST OF TABLES

 

 

 

 

LIST OF ABBREVIATIONS

 

AI          Artificial Intelligence

ATN         Augmented Transition Network

NLP         Natural Language Processing

TEG-NALAN   Template Generator Using Natural Language

TuSA        Turkish Speaking Assistant

 

ACKNOWLEDGEMENTS

 

First, I would like to thank my advisor, Ender Özcan, for his guidance and encouragement from the very beginning until the end. I am truly grateful for Şadi Evren Şeker’s valuable suggestions, comments and assistance at every step of my study. Special thanks are due to Şeniz Demir for her guidance in Prolog and natural language concepts. Finally, I want to thank my family for their love and support throughout my education and life.

 

ABSTRACT

TEG-NALAN: Template Generator Using Natural Language

 

Natural Language Processing (NLP) is a subfield of Artificial Intelligence whose aim is to use computers to understand natural languages such as Turkish, English, or Italian. This report describes the design of an intelligent natural language interface, based on Turkish, for creating Java class skeletons that list each class and its members. This interface, called TEG-NALAN (Template Generator Using Natural Language), was developed as part of the TUJA (Java Code Generator Using Turkish) project, a tool for producing Java programs from Turkish sentences. TEG-NALAN relies on three main components: an augmented transition network (ATN), a knowledge database, and a Java code generator. The ATN, which utilizes concept hierarchies, converts Turkish sentences into instances of schemata (attribute, method, or class) and inserts them into the knowledge database. The Java code generator then retrieves the required knowledge from that database and produces the output, a Java class skeleton.

 

ÖZET

TEG-NALAN: Java Skeleton Code Generator Using Natural Language

Natural language processing is a subfield of artificial intelligence. Its aim is to enable computers to understand natural languages and to let people use their native languages when interacting with computers. This report describes an intelligent natural language application based on Turkish, developed to create Java class skeletons and to list the members of each class. This application, named TEG-NALAN (Java Skeleton Code Generator Using Natural Language), is part of the TUJA (Java Code Generator Using Turkish) software tool project, which produces Java programs from Turkish sentences. TEG-NALAN first splits the given Turkish sentence into its parts; it then builds from these parts a structure that reflects the meaning of the sentence and stores the required knowledge in a data warehouse. Finally, at the user's request, the knowledge is retrieved from the data warehouse and the Java class skeleton is generated. The user can then have a Java source file created or query the class schemata.

 

1. INTRODUCTION

Programming languages are precise and mostly unambiguous, with predefined syntax and semantics. Still, a programmer spends considerable effort learning syntactic rules while also developing general programming skills. Even an experienced programmer faces the same problem when the programming language is new. Moreover, the programmer may not know the language in which the learning resources, such as books, are written. Natural languages, on the other hand, are more declarative, flexible, powerful and rich, and are usable even by occasional users.

 

There are visual tools, such as Rational Rose (an IBM product) [10], for creating object-oriented designs and generating skeletal Java/C++ programs from them. TEG-NALAN is a natural language processing (NLP) application that provides an interface for creating a skeletal Java program, including all classes, their attributes (data), and the prototypes of the member methods of each class. Through a conversational front end, TEG-NALAN accepts Turkish sentences that describe a class, a member method, or a member attribute of a class. The input is then fed into an augmented transition network (ATN) for parsing and semantic analysis. At the end of this process, the knowledge database is updated according to the current command. Knowledge is represented using schemata. At any time, the user can ask TEG-NALAN to produce the skeletal Java code and save it to a file.

 

The next chapter briefly summarizes previous work on the topics examined in this report. Chapter 3 explains natural language processing at two levels: the definition of NLP and NLP in Turkish. Chapter 4 explains the AI concepts used in TEG-NALAN. Chapter 5 presents the required software packages and Prolog, the programming language in which our software is written. The design of TEG-NALAN is given in Chapter 6. The syntax, semantics and morphology of TEG-NALAN are explained in Chapter 7, and the knowledge database in Chapter 8. The testing of TEG-NALAN is examined in Chapter 9. Finally, the results obtained are discussed and possible future work is stated.

 

2. PREVIOUS STUDIES

In this chapter, some concepts and previous work related to this report are discussed. The related work can be categorized into two groups:

· Natural Language Processing and Machine Learning

· Template and Code Generation

 

          2.1. Natural Language Processing and Machine Learning

 

          2.1.1. TOY

           

 

Çetinoğlu [4] developed the conversation application TOY (“Türkçe Okur Yazar”, literally “reads and writes Turkish”, that is, “literate in Turkish”) as her M.S. thesis. The lexicon and the morphological, syntactic and semantic layers of TOY form the basis of this work, although the syntactic and semantic layers have been modified and extended. The target scenario of TOY is shown in Figure 2. 1.

 

Canan uyudu mu? (Did Canan fall asleep?)

Bilmiyorum (I don’t know.)

Kemal uyudu mu? (Did Kemal fall asleep?)

Evet, Kemal uyudu (Yes, Kemal fell asleep.)

Canan küçük bir çocuktur. Kemal küçük bir çocuktur. Bütün küçük çocuklar 10 saat uyurlar.

(Canan is a little child. Kemal is a little child. All little children sleep for 10 hours.)

Kemal ne zaman uyudu? (When did Kemal fall asleep?)

Kemal yirmiüçte uyudu (Kemal fell asleep at 23:00.)

Kemal ne zaman uyandı? (When did Kemal wake up?)

Kemal saat 9’da uyandı (Kemal woke up at 9 o’clock.)


Canan kahvaltıda ne yiyecek? (What will Canan eat for breakfast?)

Bilmiyorum (I don’t know.)

Canan kahvaltıda peynir,ekmek,zeytin yiyecek (Canan will eat cheese, bread and olives for breakfast.)

Teşekkürler öğrendim (Thanks, I have learned that.)

Küçük çocuklar hariç herkes kahvaltıda çay içer (Everybody except little children drinks tea at breakfast.)

Canan kahvaltıda ne içecek? (What will Canan drink at breakfast?)

Bilmiyorum (I don’t know.)

Canan kahvaltıda çay içecek mi? (Will Canan drink tea at breakfast?)

Bilmiyorum (I don’t know.)


Kaç kişi kahvaltıda çay içmeyecek? (How many people will not drink tea at breakfast?)

Bilmiyorum (I don’t know.)

Kim kahvaltıda çay içmeyecek? (Who will not drink tea at breakfast?)

Bilmiyorum (I don’t know.)

Kaç kişi kahvaltıda peynir, ekmek, zeytin yiyecek?

(How many people will eat cheese, bread and olives for breakfast?)

Bir kişi kahvaltıda peynir, zeytin, ekmek yiyecek

(One person will eat cheese, olives and bread for breakfast.)

Kim kahvaltıda peynir, ekmek, zeytin yiyecek?

(Who will eat cheese, bread and olives for breakfast?)

Canan kahvaltıda peynir, ekmek, zeytin yiyecek.

(Canan will eat cheese, bread and olives for breakfast.)

 

Figure 2. 1 Target Scenario of TOY

 

          2.1.2. TuSA

 

Şeker [9] developed TuSA, a Turkish speaking assistant that stores and retrieves appointment details. In TuSA, the appointments, whose details the user enters as Turkish sentences, are stored as takvim (calendar) schemata and written to a file. A takvim schema has the following structure:

 

takvim(ID,Minute,Hour,Day,Month,Year,Person,Location,Subject,Duration,Recurring)

where

            ID: the unique number given to the appointment.

            Minute and Hour: the time of the appointment, taken from the time phrases.

            Day, Month and Year: the date of the appointment.

            Person, Location and Subject: with whom, where and about what the appointment will be.

            Duration: the length of the appointment, in either minutes or hours.

            Recurring: recurring event information, as in “aliyle iki günde bir toplantı var” (there is a meeting with Ali every two days).

 

If a sentence such as “Haftaya Aliyle olan toplantıları göster” (Show the meetings with Ali next week) is given as input, TuSA converts the query into an internal formula such as takvim(_,_,_,15,9,2003,_,[ali],[yemekli,toplanti],_,_), in which each _ leaves a field unconstrained, and runs it against the database. The matching results, if any, are shown to the user in the format of Figure 2. 2.

 

ID:11

Tarih:15/Eylul/2003|10:05|Pazartesi (Date: 15/September/2003|10:05|Monday)

Kisi: arkadasim ali (Person: my friend ali)

Yer:komsumuz Canan (Location: neighbour Canan)

Konu: yemekli bir toplanti (Subject: Dinner)

Suresi: 7 saat (Duration: 7 hours)

 

Semantics:

 

takvim(11,10,5,15,9,2003,[komsumuz,canan],[arkadasim,ali],[yemekli,bir,toplanti],7,'saat').

 

Figure 2. 2 Results of a Query Given to TuSA
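
The way such a partially filled formula selects stored appointments can be sketched as follows. This is only an illustrative Python sketch, not TuSA's Prolog code: the field names follow the takvim structure given in the text, the three records are hypothetical, None plays the role of the Prolog anonymous variable _, and fields are compared by simple equality, which may differ from how TuSA matches word lists.

```python
from collections import namedtuple

# Field names follow the takvim structure quoted in the text.
Takvim = namedtuple(
    "Takvim",
    "id minute hour day month year person location subject duration recurring",
)

# Hypothetical appointment records (not taken from TuSA's database).
DATABASE = [
    Takvim(11, 5, 10, 15, 9, 2003, ["ali"], ["ofis"], ["toplanti"], 7, None),
    Takvim(12, 30, 14, 16, 9, 2003, ["canan"], ["ev"], ["yemek"], 2, None),
    Takvim(13, 0, 9, 15, 9, 2003, ["kemal"], ["ofis"], ["toplanti"], 1, None),
]


def matches(pattern, record):
    """A field matches if the pattern leaves it open (None) or is equal."""
    return all(p is None or p == r for p, r in zip(pattern, record))


def query(pattern, database=DATABASE):
    """Return every stored record that is compatible with the pattern."""
    return [record for record in database if matches(pattern, record)]


# Only day, month, year and person are fixed; all other fields are open.
pattern = Takvim(None, None, None, 15, 9, 2003, ["ali"], None, None, None, None)
results = query(pattern)
```

In Prolog the same effect comes from unification of the query term with the stored facts, so no explicit matching function is needed there.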

 

 

          2.2. Template and Code Generation

 

There are tools for creating object-oriented designs and for generating skeletal Java/C++ programs from them. They are examined in this section.

 

          2.2.1. Rational Rose/C++

Rational Rose/C++ [10] is application software that generates C++ code from Unified Modeling Language (UML) diagrams. Rational Rose uses the notation and syntax of UML, which comprises a set of specialized shapes for constructing different kinds of software diagrams, such as class, state and activity diagrams. The class skeleton generated by TEG-NALAN resembles the one generated by Rational Rose/C++. The difference is that Rational Rose generates C++ code from UML diagrams drawn by the user, whereas TEG-NALAN generates a Java class skeleton from Turkish sentences given by the user. The Rational Rose interface is shown in Figure 2. 3.

 

Figure 2. 3 Rational Rose Interface

 

          2.2.2. A Natural Language Interface for Programming in Java (NaturalJava)

 

NaturalJava [11] is a prototype of an intelligent, natural-language-based user interface for creating, modifying, and examining Java programs. The interface exploits three subsystems:

· The Sundance natural language processing system accepts English sentences as input and uses information extraction techniques to generate case frames representing program construction and editing directives.

· A knowledge-based case frame interpreter, PRISM, uses a decision tree to infer program modification operations from the case frames.

· A Java abstract syntax tree manager, TreeFace, provides the interface that PRISM uses to build and navigate the tree representation of an evolving Java program.

 

 

Figure 2. 4 Architecture of NaturalJava

 

3. NATURAL LANGUAGE PROCESSING (NLP)

This chapter first gives the definition and the levels of NLP, and then discusses Turkish and some of the difficulties faced when developing NLP applications for Turkish.

 

          3.1. What is NLP?

Natural Language Processing (NLP) is a subfield of artificial intelligence whose ultimate aim is to enable computers to use natural languages at performance levels comparable to those of humans. Natural language communication with computers has long been a major research area of artificial intelligence (AI), both for the information it can give about intelligence in general and for its practical utility.

There has been some research on Turkish as a natural language, such as morphological analyzers, sentence structure builders, chat robots and algorithm analyzers, but each study targets its own narrow area, and it is almost impossible to combine them.

Practical NLP depends on limiting the need for outside knowledge and human experience, on cheap computing power, and on exact knowledge of how human languages work.

 

NLP is commonly divided into five levels [1]:

o Phonology (sounds of words)

o Morphology (structure of words)

o Syntax (order of words)

o Semantics (meaning of words)

o Pragmatics (use of language)

Phonology

Phonology is the study of how sounds are used in language. Every language has an alphabet of sounds that it distinguishes. These are called its phonemes, and each phoneme has one or more physical realizations called allophones. As an example, consider the “t” sounds in the words “top” and “stop”. They are physically different; however, in English these two sounds are allophones of the same phoneme, because the language does not distinguish them.

 

Morphology

Morphology is the study of word formation. Every language has two kinds of word formation processes: inflection, which provides the various forms of a single word, such as singular “man” and plural “men”, and derivation, which creates new words from old ones. For example, the creation of “dogcatcher” from “dog”, “catch” and “-er” is a derivational process.

 

Syntax

Syntax is the lowest level at which human language is constantly creative. People do not often create new speech sounds or new words, but everyone who speaks a language is constantly inventing new sentences that he or she has never heard before. In this respect, syntax is quite unlike phonology or morphology.

 

Semantics

Semantics is the level at which language makes contact with the real world. As a field of study, semantics has only recently started to mature. For a long time it was unclear how to describe the meanings of natural-language utterances. Suitable tools have now been provided by mathematical logic and set theory, and since the 1970s the study of semantics has made great strides.

 

Pragmatics

Pragmatics is the use of language in context. The boundary between semantics and pragmatics is uncertain, and different authors use the terms somewhat differently. Broadly, pragmatics includes those aspects of communication that go beyond the literal truth conditions of each sentence.

The aim of the study reported in this thesis is to build a software infrastructure for the computational processing of Turkish that smoothly integrates the above-mentioned levels and that can be used in the construction of various Natural Language Processing applications for the language. The Prolog logic programming language was selected for the implementation.

Since the set of all possible Turkish sentences is huge, we restrict the input to a special subset of Turkish. This subset is given in Appendix A.

 

          3.2. Turkish

Turkish [5] is a member of the Ural-Altaic language family. This section examines Turkish from a linguistic perspective and presents its important aspects. Turkish is characterized by certain morphophonemic, morphotactic, and syntactic features: vowel harmony, agglutination of all-suffixing morphemes, free order of constituents, and head-final structure of phrases.
Turkish uses Latin characters. The Turkish alphabet has 29 letters, which are divided into two categories: vowels and consonants. As seen in Figure 3. 1, there are 8 vowels and 21 consonants. Vowels can be further divided into sub-categories according to their phonetics or shape. Similarly, consonants have sub-categories, some of which are fricative, nasal or liquid.
 

Figure 3. 1 Turkish Letters

 

Turkish Morphology
Turkish morphology is quite complicated to handle in applications, because Turkish is an agglutinative language [6] whose word structures are formed by productive affixation of derivational and inflectional suffixes to root words. This extensive use of suffixes makes the morphological parsing of words rather complicated and results in ambiguous lexical interpretations in many cases.
 

Figure 3. 2 Turkish Morphology Example

 

For example in Figure 3. 2, “annesi” (his or her mother) may be interpreted as their child. This type of ambiguity can often be resolved at the phrase and sentence levels with the help of agreement requirements, though this is not always possible.

Consider the examples in Figure 3. 3. The first word takes several suffixes, and although its root is a verb (“gör”, see), it turns into a noun. In the second word, when the suffix “-a” is added to the word “ağaç” (tree), the letter “ç” changes to “c”. The last one is an example of letter drop: when the suffix “-ıyor” is added to the verb “ağla” (cry), the letter “a” drops and the verb becomes “ağlıyor”.

· görünürlerde → gör + ün + ür + ler + de

· ağaca → ağaç + a

· ağlıyor → ağla + ıyor

Figure 3. 3 Suffix Examples
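
The two surface changes in these examples can be written down as a small rule. The following Python sketch is an illustration only, not the morphological analyzer used in TEG-NALAN: the function name attach and the softens flag are hypothetical, and the rule covers just the changes described above. Real Turkish morphology needs many more rules (for instance, a buffer consonant is often inserted between two vowels instead of dropping one, and not every root softens).

```python
VOWELS = set("aeıioöuü")

# Consonant softening before a vowel-initial suffix (for example ç -> c).
SOFTEN = {"ç": "c", "p": "b", "t": "d", "k": "ğ"}


def attach(root, suffix, softens=True):
    """Attach a suffix to a root, applying the two changes described above."""
    if suffix[0] in VOWELS:
        if root[-1] in VOWELS:
            # Letter drop: the final vowel of the root is dropped (ağla + ıyor).
            root = root[:-1]
        elif softens and root[-1] in SOFTEN:
            # Consonant softening: the final consonant changes (ağaç + a).
            root = root[:-1] + SOFTEN[root[-1]]
    return root + suffix


# The three examples of Figure 3. 3, built from their parts.
word = "gör"
for suffix in ["ün", "ür", "ler", "de"]:
    word = attach(word, suffix)

examples = [word, attach("ağaç", "a"), attach("ağla", "ıyor")]
```

The softens flag is there because softening depends on the individual root; a dictionary-based system would store this information with each lexicon entry.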

 

Moreover, the typical heuristics used in English to disambiguate between noun and verb readings of the same lexical form (such as checking whether the previous word is a determiner) are in general not applicable in Turkish, because the determiner “bir” (a, one) may also function as an adverb. Consider the sentences in Figure 3. 4. In the first sentence “giderim” means “my expense”, whereas in the second it means “I go”. An NLP application should recognize this kind of morphological ambiguity.

 

Figure 3. 4 Example Sentences (Ambiguity)

Syntactic Categories

As shown in Figure 3. 5, the syntactic categories of Turkish are nouns, proper nouns, compound nouns, adjectives, verbs, adverbs and conjunctions. Notice that determiners are not in this list.

 

Figure 3. 5 Turkish Syntactic Categories

 

Word Order

The typical word order in Turkish is subject-object-verb (SOV). However, orders other than SOV are also commonly used. In Turkish, the grammatical function of a noun phrase (NP) is determined regardless of its position in the sentence. Therefore, the noun phrases can change order freely without affecting the grammaticality of the sentence; only the verb keeps its position at the end of the sentence. Figure 3. 6 gives an example of this situation. All three sentences have the same meaning. The first one is an example of the typical word order. In the second one, the subject “I” is emphasized. Although the order of the subject and the object changes, the position of the verb remains unchanged in all three sentences.

Figure 3. 6 Word Order

If the verb is also moved from its typical place at the end, the sentence is called an inverted sentence. Figure 3. 7 gives examples of inverted sentences. Inverted sentences are generally used to emphasize the verb. However, this type of change in word order changes the grammatical functions in the sentence. In other words, these sentences do not mean the same, whereas the grammatical functions remain unchanged if only the order of the noun phrases is changed.

Figure 3. 7 Inverted Sentences

 

4. THE COMPONENTS OF TEG-NALAN

In this chapter, some AI concepts (“Augmented Transition Network”, “Concept Hierarchy”, “Schemata”) [13] utilized in the project are explained.

 

          4.1. Augmented Transition Network (ATN)

Augmented Transition Networks (ATNs) were developed in an attempt to provide a practical framework for natural-language understanding. In order to combine parsing with semantic analysis, it should be possible to attach semantic routines to specific parts of the parsing mechanism or grammar. An ATN offers the following advantages: (1) The basic parsing scheme is easy to understand; the grammatical information is represented in a transition network, and consequently an ATN is relatively easy to design. (2) Semantic analysis proceeds simultaneously with syntactic analysis, and semantics may easily be used to constrain parsing and resolve ambiguities.

The ATN framework does not place any restrictions on the kinds of actions one can specify. Thus by deciding to use an ATN, one does not narrow the design alternatives for a system very much. However, the ATN approach seems to provide enough structure to a natural-language system to be helpful.
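
The idea of attaching semantic routines to the arcs of a transition network can be illustrated with a very small example. The Python sketch below is not one of the ATNs of TEG-NALAN (those are given in Chapter 7 and implemented in Prolog): the states S0 to S2, the lexicon, the register names and the three actions are all hypothetical, and the words are borrowed from the TOY scenario only for illustration.

```python
# Hypothetical lexicon mapping words to syntactic categories.
LEXICON = {
    "kemal": "noun",
    "çocuk": "noun",
    "küçük": "adj",
    "uyudu": "verb",
}


# Semantic actions: each returns a new register set extended with one word.
def add_modifier(registers, word):
    return {**registers, "modifiers": registers.get("modifiers", []) + [word]}


def set_subject(registers, word):
    return {**registers, "subject": word}


def set_predicate(registers, word):
    return {**registers, "predicate": word}


# Transition network: state -> list of (category, next state, action).
ARCS = {
    "S0": [("adj", "S0", add_modifier), ("noun", "S1", set_subject)],
    "S1": [("verb", "S2", set_predicate)],
}
FINAL_STATES = {"S2"}


def parse(tokens, state="S0", registers=None):
    """Follow arcs depth-first; return the registers, or None on failure."""
    registers = dict(registers or {})
    if not tokens:
        return registers if state in FINAL_STATES else None
    word, rest = tokens[0], tokens[1:]
    category = LEXICON.get(word)
    for label, target, action in ARCS.get(state, []):
        if label == category:
            result = parse(rest, target, action(registers, word))
            if result is not None:
                return result
    return None


meaning = parse(["küçük", "çocuk", "uyudu"])
```

Because the actions run while the arcs are being followed, the registers already hold a partial meaning representation when parsing succeeds, which is the sense in which syntactic and semantic analysis proceed together.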

 

          4.2. Concept Hierarchies

Much of our knowledge about the world is organized hierarchically. We group all the “things” we know of into classes and sets. These classes are grouped into superclasses, and the superclasses into even bigger ones. With most of these classes we associate names, which we use to identify the classes. There is a group we call “dogs” and another we call “cats”. These are grouped, with some other classes, into a superclass called “mammals”. Plants, minerals, machines, emotions, information, and ideas are treated similarly. Much of our knowledge consists of an understanding of the inclusion relationship on all these classes and cognizance of various properties shared by all members of particular classes. “All horses have four legs” states that the property “has four legs” is shared by each member of the class of horses.

 

          4.2.1. Inclusion and ISA

The inclusion relation on a set of classes is very important in AI. “A bear is a mammal” expresses that the class of bears is a subclass of the class of mammals, as shown in Figure 4. 1. For this reason, the data structures used to represent inclusion relations are often called “ISA” hierarchies.

 

Figure 4. 1 ISA Hierarchy

 

          4.2.2. Membership and HASA

“All horses have four legs” expresses that the class of horses has the member “legs” and that all horses share this property. The data structures used to represent membership relations are often called “HASA” hierarchies.

 

 

Figure 4. 2  HASA Hierarchy
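
A minimal sketch of how ISA and HASA hierarchies might be stored and queried is given below. It is written in Python for illustration and is not the representation used in TEG-NALAN: the dictionaries, the function names and the entries “animal” and “fur” are assumptions added so that inheritance along the ISA chain can be shown.

```python
# ISA hierarchy: each class points to its direct superclass.
# "animal" is a hypothetical extra level added for illustration.
ISA = {
    "bear": "mammal",
    "horse": "mammal",
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
}

# HASA hierarchy: each class lists the members it has.
# "fur" is a hypothetical property added for illustration.
HASA = {
    "horse": ["four legs"],
    "mammal": ["fur"],
}


def is_a(cls, ancestor):
    """True if cls equals ancestor or lies below it in the ISA hierarchy."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = ISA.get(cls)
    return False


def members(cls):
    """Collect the HASA members of cls and of all its superclasses."""
    collected = []
    while cls is not None:
        collected.extend(HASA.get(cls, []))
        cls = ISA.get(cls)
    return collected


horse_members = members("horse")
```

Walking up the ISA chain is what lets a class share the members recorded for its superclasses, which mirrors inheritance between a Java class and its parent.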

          4.3. Schemata

Representing knowledge requires organizational structures, and the schema is one of them. A schema commonly consists of two parts: a name and a list of attribute-value pairs. The attributes are sometimes called “slots” and the values “fillers”. An example schema is given in Table 1.

 

Slots          Fillers

FRAME NAME     KITCHEN

DISHWASHER     (5,4)

FRIDGE-LOC     (2,1)

STOVE-LOC      (3,5)

 

Table 1 Schema representing a kitchen with its attributes
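
The kitchen schema of Table 1 can be expressed directly as a name plus slot-filler pairs. The following Python sketch is illustrative only; the class Schema and its method names are hypothetical and do not come from TEG-NALAN, which stores its schemata as Prolog terms.

```python
from dataclasses import dataclass, field


@dataclass
class Schema:
    """A schema: a name plus a collection of slot-filler pairs."""

    name: str
    slots: dict = field(default_factory=dict)

    def fill(self, slot, filler):
        """Store a filler in the given slot, replacing any earlier one."""
        self.slots[slot] = filler

    def filler(self, slot, default=None):
        """Return the filler of a slot, or default if the slot is empty."""
        return self.slots.get(slot, default)


# The schema of Table 1.
kitchen = Schema("KITCHEN")
kitchen.fill("DISHWASHER", (5, 4))
kitchen.fill("FRIDGE-LOC", (2, 1))
kitchen.fill("STOVE-LOC", (3, 5))
```

An empty slot simply has no filler yet, so a schema can be created first and completed later as more sentences are processed.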

 

5. REQUIREMENTS and PROGRAMMING ENVIRONMENT

TEG-NALAN is application software written entirely in Prolog. Therefore, in order to run TEG-NALAN on your machine, you need to install the SWI-Prolog software package. No special hardware is required; all you need is a computer capable of running Prolog.

 

          5.1. SWI-Prolog

TEG-NALAN is implemented mainly in SWI-Prolog, a free software Prolog system licensed under the GNU Lesser General Public License (LGPL). It is one of the most commonly used Prolog systems, especially for educational purposes. SWI-Prolog can be obtained from [7], where both Linux and Windows versions are available.

 

          5.2. Prolog

Among the programming languages available today, Prolog [2] [12] may be the most suitable for natural language processing purposes. Some of the reasons are as follows:

· It is possible to define, build, and modify large, complex data structures easily. This makes it easy to represent syntactic and semantic structures and lexical entries in Prolog.

· List manipulation is well supported in Prolog, and lists are the preferred data structures for representing natural language structures at any level.

· The program can examine and modify itself. The user is able to modify the program as well as the knowledge base dynamically while the program runs.

· Prolog is based on first-order predicate logic. Logic rules and knowledge representation are integrated within the system. Extensions to this logic are relatively easy to implement.

· The ability to store the knowledge base in terms of predicates and facts allows the programmer to easily integrate query systems using rules of inference.

· The depth-first search algorithm is built into Prolog and is easily used in all kinds of parsers. In fact, Prolog has a built-in, ready-to-use parser. These features of Prolog ease the implementation of morphological-level and syntactic-level parsing algorithms and improve the efficiency relative to a hand-coded depth-first parser.

· The backtracking property of Prolog means that the user does not need to handle the alternatives of a clause explicitly. Whenever a clause fails, Prolog backtracks to find an alternative solution that does not fail. This property can also be used to find the entire solution set of a given query.

· Pattern matching (unification) is built into Prolog. With this property, the arguments of data structures can be constructed in different steps within the clauses and without any strict order.

· Most Prolog programs are reversible; that is, they can work in both directions with no or only slight changes to the code, in the sense that the output arguments of a Prolog predicate can be used as input arguments of the same predicate in another call. This feature of Prolog allows us to develop applications that perform not only analysis but also generation with the same code.
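
To make the depth-first search and backtracking points concrete, the following Python sketch imitates by hand what Prolog provides for free: it enumerates every way of splitting a word into known morphemes. It is an illustration under stated assumptions, not code from TEG-NALAN: the small surface-form lexicon is hypothetical (“gid” stands for the softened form of the verb root “git”), and no morphotactic ordering rules are checked.

```python
# Hypothetical surface-form lexicon, for illustration only.
MORPHEMES = {
    "gider": "noun root (expense)",
    "gid": "verb root (go), softened form of git",
    "er": "aorist suffix",
    "im": "first person singular suffix",
}


def segmentations(word, start=0):
    """Yield every way to split word[start:] into known morphemes.

    Each loop iteration is a choice point: when the recursive call yields
    nothing, the loop moves on to the next candidate, which corresponds to
    Prolog backtracking to an alternative clause.
    """
    if start == len(word):
        yield []
        return
    for end in range(start + 1, len(word) + 1):
        piece = word[start:end]
        if piece in MORPHEMES:
            for rest in segmentations(word, end):
                yield [piece] + rest


# Collecting all results corresponds to asking Prolog for the whole
# solution set of a query.
all_parses = list(segmentations("giderim"))
```

For the ambiguous word “giderim” discussed in Section 3.2, the search finds both readings, the verb reading (gid + er + im) and the noun reading (gider + im), without any explicit bookkeeping of the alternatives beyond the loop itself.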

Lisp shares only a few of these advantages. Conventional languages such as Pascal and C lack all of them. Of course, natural language processing can be done in any programming language, but it is much easier in Prolog than in the others.

Besides all of these advantages, choosing Prolog also has some disadvantages. The backtracking property sometimes causes the generation of unwanted solutions. To prevent such problems, the cut predicate '!' is used, which imposes an artificial restriction. Such restrictions prevent the user from testing whether the formalism completely represents the entire theory it is based on.

Another disadvantage of using Prolog can be efficiency. In most cases, Prolog programs run slower than equivalent programs implemented in procedural languages such as C or Pascal.

However, the advantages listed above are so important and convenient for NLP that Prolog remains the most suitable and widely used programming language for it, in spite of its disadvantages.

 

6. DESIGN

In this chapter, the main design issues considered in TEG-NALAN are explained. These are Component Level Design, Data Flow and Data Design.

 

          6.1. Component Level Design

TEG-NALAN is composed of several libraries, which are shown in Figure 6. 1 below. It depends on two main libraries, tuja and sohbetson.

 

Figure 6. 1 Library Hierarchies in TEG-NALAN

 

Ø                    The tuja file is the main library of our program. It contains the vital components of TEG-NALAN shown in Figure 6. 3 (ATN Manager, Knowledge Database and Java Code Generator).

Ø                         The sohbetson file contains the main Turkish grammar rules, for example:

   Sentence → noun phrase + verb phrase.

This file was originally written by [4]. However, all of the grammar rules have been changed in order to adapt them to TEG-NALAN. Semantic creation, which is an important part of TEG-NALAN, is also implemented in sohbetson.

Ø                                   Morphoson is used as a morpheme database (a Turkish dictionary). TEG-NALAN is a dictionary-based application; therefore, all the words it needs (not only nouns but also verbs, adjectives, numbers and so on) must be defined beforehand. Otherwise, the given word cannot be recognized.

Ø          The formula library contains the rules that transform the given semantic input into a set of Prolog facts and rules. The semantic formulas are applied in this file, and sohbetson needs it to finalize the semantics.

Ø          Arcson is the implementation of Oflazer’s finite state machine (FSM) [8] for Turkish. The file contains the arc rules that realize the FSM. Each arc rule has two arguments, which represent an arc from one state to another.

Ø          Misc contains some predicates and rules that are still at the design and test stage. It is therefore not as stable as the other libraries.
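The two-argument arc rules described for arcson above can be pictured with the following minimal Python sketch. The state names are hypothetical placeholders and are not taken from the arcson file; the sketch only shows how a set of (from, to) pairs is enough to decide whether a sequence of states is a legal path through a finite state machine.

```python
from typing import List, Set, Tuple

# Hypothetical arcs; the real arcson file defines its own states.
ARCS: Set[Tuple[str, str]] = {
    ("noun_root", "plural"),
    ("noun_root", "case"),
    ("plural", "case"),
}


def is_valid_path(states: List[str], arcs: Set[Tuple[str, str]] = ARCS) -> bool:
    """Return True if every consecutive pair of states is joined by an arc."""
    return all((src, dst) in arcs for src, dst in zip(states, states[1:]))


ok_path = is_valid_path(["noun_root", "plural", "case"])   # follows two arcs
bad_path = is_valid_path(["plural", "noun_root"])          # no such arc
```

In Prolog, the same information would be a list of two-argument facts, and the path check would be left to backtracking over them.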

 

          6.2. Data Flow and Data Design

TEG-NALAN takes a Turkish sentence, applies Turkish grammar rules to build its semantic representation, and finally creates Java class templates and writes them into an output Java file, as shown in Figure 6. 2.

 

Figure 6. 2 Data Flow Diagram

            TEG-NALAN achieves its task with the following components:

Ø                         ATN Manager

An augmented transition network (ATN) was developed for the TEG-NALAN interface. The HASA relationship is used for composing classes and the ISA relationship is used for building the class hierarchy.

Ø                         Knowledge Database

The knowledge database keeps the entire class hierarchy resulting from the object-oriented design, together with the skeleton of each class.

Ø                         Java Code Generator

The Java Code Generator retrieves information from the knowledge database and generates the Java class skeletons.
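The interplay of the components above can be sketched as follows. This is a simplified Python illustration under our own assumptions, not the Prolog implementation of TEG-NALAN: the names ClassRecord, KnowledgeBase, add_isa, add_hasa and generate_java are hypothetical, and the example classes Car, Vehicle and Engine are invented for the demonstration. ISA facts set the parent of a class, HASA facts add member attributes, and the generator turns one stored record into a Java class skeleton.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ClassRecord:
    """One class skeleton as it might be kept in the knowledge database."""
    name: str
    parent: Optional[str] = None                     # set by an ISA relation
    parts: List[str] = field(default_factory=list)   # set by HASA relations


class KnowledgeBase:
    def __init__(self) -> None:
        self.classes: Dict[str, ClassRecord] = {}

    def _get(self, name: str) -> ClassRecord:
        # Create the record on first mention so every referenced class exists.
        return self.classes.setdefault(name, ClassRecord(name))

    def add_isa(self, child: str, parent: str) -> None:
        self._get(parent)
        self._get(child).parent = parent

    def add_hasa(self, owner: str, part: str) -> None:
        self._get(part)
        self._get(owner).parts.append(part)


def generate_java(record: ClassRecord) -> str:
    """Render a Java class skeleton from one stored record."""
    header = f"public class {record.name}"
    if record.parent:
        header += f" extends {record.parent}"
    lines = [header + " {"]
    for part in record.parts:
        # Assumption: the attribute is named after its type, in lower camel case.
        lines.append(f"    private {part} {part[0].lower() + part[1:]};")
    lines.append("}")
    return "\n".join(lines)


kb = KnowledgeBase()
kb.add_isa("Car", "Vehicle")
kb.add_hasa("Car", "Engine")
car_source = generate_java(kb.classes["Car"])
```

Keeping the facts separate from the rendering step mirrors the division between the Knowledge Database and the Java Code Generator: sentences can arrive in any order, and the skeleton is produced only when the stored information is read back.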

 

Figure 6. 3 Architecture of TEG-NALAN

 

7.              MORPHOLOGY, SYNTAX AND SEMANTICS

 

The morphology of TEG-NALAN is inherited from a previous Prolog-based project, TuSA [9].

For the syntax part, a set of grammatical rules was created; they can be seen in APPENDIX – B. All the possible syntax types are supported in order to create an abstract model representing the classes. Sample sentences other than those given in this section can be seen in APPENDIX – A.

The sentences accepted by our program can be categorized into four groups:

– Class Declaration Sentences

– Attribute Declaration Sentences

– Method Declaration Sentences

– Hierarchy Declaration Sentences

 

 

7.1.Morphology

 

First, the required Turkish words were added to morphoson in tr_morph_entry form. The reason is that our program is dictionary-based software, and it is assumed that there is a library containing all Turkish words in tr_morph_entry form. Figure 7. 1 shows examples of different types of tr_morph_entry. In this example, “İlknur” is used as a proper noun, “liste” (list) as a noun, “tüket” (consume) as a verb and “yüksek” (high) as an adjective.

 

Each tr_morph_entry has eight parameters. Three of them are explicitly left empty to leave room for improvement. The first parameter is a string that shows the root type of the word (for example, 'AdKök' for a noun root and 'FiilKök' for a verb root). The second parameter is a list of two elements: the first is the word as a list of letters, and the second is another list holding the type and the semantics of the word. The third, fourth and fifth parameters are left empty. The sixth parameter is the last vowel and the seventh is the last letter of the word. The final parameter is the state value.

 

            The last vowel, the last letter and the state value are used to determine the suffixes that the word may take. The last vowel and the last letter are stored because the form of some suffixes changes according to the ending of the word they are attached to.
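To make the role of the stored last vowel concrete, the following Python sketch models an entry with the non-empty tr_morph_entry arguments and selects the plural suffix from the last vowel. It is only an illustration of the idea, not the morphoson code: the names MorphEntry, plural_suffix and pluralize are hypothetical, the entry for “liste” follows Figure 7. 1, and the entries for “araba” and “çiçek” are built by analogy for this example.

```python
from dataclasses import dataclass
from typing import Dict, List

BACK_VOWELS = set("aıou")    # words ending in these take "lar"
FRONT_VOWELS = set("eiöü")   # words ending in these take "ler"


@dataclass
class MorphEntry:
    root_type: str             # argument 1, e.g. 'AdKök'
    letters: List[str]         # argument 2, first element
    features: Dict[str, str]   # argument 2, second element (type and sem)
    last_vowel: str            # argument 6
    last_letter: str           # argument 7
    state: str                 # argument 8
    # Arguments 3 to 5 are left empty in the original and are omitted here.


def plural_suffix(entry: MorphEntry) -> str:
    """Choose the plural suffix from the stored last vowel (vowel harmony)."""
    if entry.last_vowel in BACK_VOWELS:
        return "lar"
    if entry.last_vowel in FRONT_VOWELS:
        return "ler"
    raise ValueError(f"unknown vowel: {entry.last_vowel!r}")


def pluralize(entry: MorphEntry) -> str:
    return "".join(entry.letters) + plural_suffix(entry)


liste = MorphEntry("AdKök", list("liste"), {"type": "noun", "sem": "liste"}, "e", "e", "ok")
araba = MorphEntry("AdKök", list("araba"), {"type": "noun", "sem": "araba"}, "a", "a", "ok")
cicek = MorphEntry("AdKök", list("çiçek"), {"type": "noun", "sem": "cicek"}, "e", "k", "ok")
plurals = [pluralize(liste), pluralize(araba), pluralize(cicek)]
```

Storing the last vowel in the entry means the suffix can be chosen with a single lookup instead of scanning the letters of the word each time.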

 

Figure 7. 2 illustrates some examples of suffix changes caused by the ending of the word, and Table 2 shows some examples of state values. More detailed explanations of suffix addition and word formation are available in [4] [9].

tr_morph_entry('AdKök',[[l,i,s,t,e],[type(noun),sem(liste)]],_,_,_,e,e,ok).

tr_morph_entry('AdKök',[['İ',l,k,n,u,r],[type(propernoun),sem(ilknur)]],_,_,_,u,r,ok).

tr_morph_entry('FiilKök',[[t,ü,k,e,t],[type(verb),sem(tuket-al)]],_,_,_,e,t,ok). 

Figure 7. 1 Word Entries

→ lar (plural suffix)

 

“araba + lar” (cars)

“çiçek + ler” (flowers)

Figure 7. 2 Suffix Changes

 

State Value | Description                        | Example
Ok          | Regular words                      | “abla” (sister)
Specok      | Standard forms of special words    | “çocuk” (child)
Spec        | Exceptional forms of special words | “çocuğ” (child)

Table 2 State Values

 

7.2.Class Declaration Sentences

 

This group of sentences is used to create a new class, as shown in Figure 7. 3. Note that the declaration of abstract classes and Java interfaces is also supported, as shown in Figure 7. 4.

Figure 7. 3 Example of a Class Definition Sentence

Figure 7. 4 Example of an Interface Definition Sentence

 

The part of the ATN that detects class declaration sentences is shown in Figure 7. 5. Note that the figure covers only class declaration sentences, not the sentences required to define interfaces.

Figure 7. 5 ATN for Class Declaration

 

 

7.3. Attribute Declaration Sentences

 

This group of sentences is used to define the attributes of an existing class or to define a new class with specified attributes.

 

The attributes of a class can be declared via two types of sentences, depending on the level of detail that the user wants to give:

 

Ø               For undetailed programming:

 

Users who do not program in any object-oriented language, or professionals who do not want to enter detailed sentences, can use this type of sentence. These sentences do not require any information about Java primitive types or attribute access specifiers (public, private or protected). All attributes entered in this way are treated as private attributes. Each of them is an instance of a user-defined class. An example of this kind of sentence and the corresponding output can be seen in Figure 7. 6.

Figure 7. 6 Example of an Undetailed Attribute Declaration Sentence

 

Ø               For detailed programming:

 

Programmers who are familiar with object-oriented concepts and want to declare detailed classes can use this type of sentence. In this case, the user can determine the name and the access specifier of the attribute, so the attribute can be public, private or protected. Each attribute can either be an instance of a user-defined class or have a primitive type such as “int”. An example of this kind of sentence and the corresponding output can be seen in Figure 7. 7.

Figure 7. 7 Example of a Detailed Attribute Declaration Sentence
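The difference between the two kinds of attribute declaration can be summarized with the following Python sketch. It is an illustration under our own assumptions and not the output routine of TEG-NALAN: the names Attribute, undetailed and render_attribute are hypothetical, and deriving the attribute name from its type in the undetailed case is an assumption made only for this example.

```python
from dataclasses import dataclass

ACCESS_SPECIFIERS = ("public", "private", "protected")


@dataclass
class Attribute:
    name: str
    type_name: str
    access: str = "private"  # undetailed sentences carry no access specifier


def undetailed(type_name: str) -> Attribute:
    """Undetailed form: only a user-defined type is known.

    Assumption: the attribute is named after its type in lower camel case.
    """
    return Attribute(name=type_name[0].lower() + type_name[1:], type_name=type_name)


def render_attribute(attr: Attribute) -> str:
    """Render one Java field declaration, indented as a class member."""
    if attr.access not in ACCESS_SPECIFIERS:
        raise ValueError(f"unknown access specifier: {attr.access!r}")
    return f"    {attr.access} {attr.type_name} {attr.name};"


simple_field = render_attribute(undetailed("Engine"))
detailed_field = render_attribute(Attribute("speed", "int", "public"))
```

With these definitions, the undetailed call yields a private field of a user-defined type, while the detailed call lets the user fix the name, the access specifier and a primitive type.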

 

The HASA relation is used to define the attribute members of a class. The part of the ATN that detects attribute declaration sentences is shown in Figure 7. 8.