PhD Thesis Preliminary Report

 

Written by: Sadi Evren SEKER

Advisor: Assoc. Prof. Banu DIRI

 

Topic: Turkish Semantic Search Engine

 

Statement of Purpose: Our aim is to implement software that crawls Internet pages and indexes Turkish pages for intelligent search.

 

Introduction: Despite the difficulty of locating information on it, the Internet is the largest data source in the world. Unfortunately, most languages lack smart implementations to utilize this information. Almost 90% of the information on the Internet is in natural language, and extracting such information with computers becomes possible only through natural language processing studies. Conducting a language study in an environment like the Internet makes it necessary to define a formal way of categorizing Internet pages.

 

Analysis: Although there are more than 40 search engines available for Turkish, most of them lack natural language operations. Unfortunately, almost all of them are static search engines that can search only a static database filled in by an operator or by the site owners.

 

Possible steps:

  1. Web Crawler
    1. Network module to connect to the Internet
    2. Web site fetcher
    3. HTML parser
    4. Multi-threaded or multi-process link tracker
    5. An efficient time to live for each page (to revisit pages and refresh their information)
  2. Indexing
    1. Link tracker
    2. String operations to extract words
    3. NLP operations to determine the language of a page
    4. Morphological analysis to get the semantic representation of words
    5. Stop word removal
  3. Database
    1. Keeping information about links and semantic representations
    2. An efficient compression method
  4. Search module
    1. An interface for users to search web sites
    2. Morphological analysis to get the semantic representation of search words
    3. Compression and database search for the results
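As a concrete starting point for the crawler steps above, the sketch below extracts anchor targets from fetched HTML and checks a per-page time to live. The class and method names are illustrative assumptions, and the naive regular expression stands in for the proper HTML parser the plan calls for.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of the "HTML parser" and "time to live" steps.
public class LinkExtractor {

    private static final Pattern HREF = Pattern.compile(
            "<a\\s+[^>]*href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Returns the href values of all anchor tags found in the HTML.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Time-to-live check for step 1.5: revisit a page once its TTL expired.
    // All times are in milliseconds since the epoch.
    public static boolean shouldRevisit(long lastFetchMillis, long ttlMillis, long nowMillis) {
        return nowMillis - lastFetchMillis >= ttlMillis;
    }
}
```

A production crawler would replace the regex with a real HTML parser and feed the extracted links into the multi-threaded link tracker of step 1.4.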

 

Possible environment:

For the network modules, Java is a good alternative because of its unrivalled network classes. On the other hand, because of its predicate calculus abilities and built-in clause database, Prolog is the strongest NLP programming environment for this project. JPL may serve as the intersection of Java and Prolog, interchanging information between the network and NLP modules. Finally, the database implementation requires a great amount of memory and a reliable server to keep the whole database in memory.

 

Since Turkish is an agglutinative language, any word can be built from a root and suffixes, and word formation can be modeled with a finite state machine. Indexing all the words in a web page requires a huge amount of storage and is time consuming. To reduce this cost we can apply the Huffman algorithm for data compression; conveniently, the core of the compression, the conversion of letter groups to symbols, is already done while traversing the finite state machine. If we append the track of states visited in the machine to the root of the word, we obtain a unique morphological representation. [ambiguity may arise here]
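A toy version of this root-plus-suffixes idea can be sketched as follows. The tiny lexicon, the suffix list, and the `root+suffix` output format are illustrative assumptions, not the actual analyzer; a real system would encode full Turkish morphotactics in the finite state machine.

```java
import java.util.List;

// Toy suffix-stripping analyzer for an agglutinative word: try each known
// root, then recursively peel known suffixes off the remainder.
public class Morph {

    private static final List<String> ROOTS = List.of("ev", "kitap", "goz");
    private static final List<String> SUFFIXES = List.of("ler", "lar", "de", "da", "im");

    // Returns "root+suffix1+suffix2..." or null when no analysis is found.
    public static String analyze(String word) {
        for (String root : ROOTS) {
            if (word.startsWith(root)) {
                String tail = strip(word.substring(root.length()));
                if (tail != null) {
                    return root + tail;
                }
            }
        }
        return null;
    }

    // Peels one suffix at a time, backtracking over alternatives;
    // an empty remainder means the whole word was consumed.
    private static String strip(String rest) {
        if (rest.isEmpty()) {
            return "";
        }
        for (String s : SUFFIXES) {
            if (rest.startsWith(s)) {
                String tail = strip(rest.substring(s.length()));
                if (tail != null) {
                    return "+" + s + tail;
                }
            }
        }
        return null;
    }
}
```

For example, "evlerde" ("in the houses") analyzes as ev+ler+de, which is exactly the kind of track-of-states representation the paragraph above proposes to index.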

 

Language related studies:

  1. stop words
  2. time to live for a web page (the update frequency of the web pages)
  3. how to determine the language of a web page (look at the character set of the page) (fetch random words from the page and try to pass them through the morphological analyzer) (what if a user types without Turkish characters? what if a page contains quotations?)
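The sampling idea in item 3 can be sketched as a simple acceptance-ratio test: pass sampled words through the morphological analyzer and accept the page as Turkish when enough of them analyze. The 60% threshold and the predicate interface are assumptions for illustration.

```java
import java.util.List;
import java.util.function.Predicate;

// Sketch of language identification by sampling: count how many sampled
// words the morphological analyzer accepts and compare against a threshold.
public class LangGuess {

    // The 60% threshold is an arbitrary illustrative choice; quotations or
    // ASCII-only typing (the concerns raised above) would argue for tuning it.
    public static boolean looksTurkish(List<String> sampledWords, Predicate<String> analyzerAccepts) {
        if (sampledWords.isEmpty()) {
            return false;
        }
        long accepted = sampledWords.stream().filter(analyzerAccepts).count();
        return accepted * 10 >= (long) sampledWords.size() * 6;
    }
}
```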

 

Crawler related studies:

  1. tracking links
    1. finding duplicate web links
    2. finding circular links
    3.
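Duplicate and circular links can both be handled with a set of normalized URLs: a link that normalizes to an already-seen URL is skipped, which also breaks crawl cycles. The normalization rules below (drop the fragment, drop a trailing slash, lowercase) are simplifying assumptions.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of duplicate and circular link detection via a visited set.
public class LinkTracker {

    private final Set<String> visited = new HashSet<>();

    // Illustrative normalization; real rules would be more careful
    // (e.g. only lowercasing the host, resolving relative paths).
    public static String normalize(String url) {
        String u = url;
        int hash = u.indexOf('#');
        if (hash >= 0) {
            u = u.substring(0, hash);
        }
        if (u.endsWith("/")) {
            u = u.substring(0, u.length() - 1);
        }
        return u.toLowerCase();
    }

    // Returns true the first time a URL (up to normalization) is seen.
    public boolean shouldVisit(String url) {
        return visited.add(normalize(url));
    }
}
```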