Thai text retrieval system using inverted files

วิฑูรย์ กัลยาณวัฒน์

Please use this identifier to cite or link to this item: https://cuir.car.chula.ac.th/handle/123456789/12965

Title:	ระบบการค้นคืนข้อความภาษาไทยโดยใช้แฟ้มข้อมูลผกผัน
Other Titles:	Thai text retrieval system using inverted files
Authors:	วิฑูรย์ กัลยาณวัฒน์
Advisors:	สมชาย ประสิทธิ์จูตระกูล
Other author:	Chulalongkorn University. Graduate School
Advisor's Email:	Somchai.P@Chula.ac.th
Subjects:	Thai language; Information retrieval; Index files
Issue Date:	1997 (B.E. 2540)
Publisher:	Chulalongkorn University
Abstract:	Presents an indexing algorithm for a Thai text retrieval system that uses an inverted file structure, where the given documents may contain words that are not in the system's dictionary. This problem arises because Thai text is written without delimiters between words. Using the system dictionary, the proposed algorithm finds the longest dictionary words that appear in the text, then builds a graph representing the adjacency and overlap of those words. The shortest path in this graph represents the smallest set of words in the text whose selection minimizes the number of unknown substrings. These unknown substrings are matched against the syllables of the text, which are obtained with a rule-based syllable segmentation algorithm. The words on the shortest path of the graph, together with the syllables matched against the unknown substrings, form the set of keywords used to index the given text. Experimental results show that the number of keywords obtained is approximately 72% lower than the total number of words found in the text.
Other Abstract:	Presents an automatic indexing algorithm for an inverted-file-based Thai text retrieval system in which the given documents can contain words that are unknown to the system's dictionary. The problem arises from the fact that there is no explicit inter-word delimiter in Thai text. Using the system dictionary, the algorithm first finds a set of recognizable words that maximally match all the semi-infinite substrings of a given text. It then constructs an adjacent-overlapping graph whose shortest path represents a smallest list of known words that minimizes the unknown substrings of the text. The unknown substrings are matched with the set of syllables obtained from a rule-based syllable segmentation of the text. The words on the shortest path of the adjacent-overlapping graph and the matched syllables are then used as keywords for indexing the given text. Experimental results showed that the number of keywords obtained is approximately 72% less than the number obtained by using the matching-all-known-words technique.
Description:	Thesis (M.Eng.)--Chulalongkorn University, 1997 (B.E. 2540)
Degree Name:	Master of Engineering
Degree Level:	Master's degree
Degree Discipline:	Computer Engineering
URI:	http://cuir.car.chula.ac.th/handle/123456789/12965
ISBN:	9746376632
Type:	Thesis
Appears in Collections:	Grad - Theses
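
The indexing procedure summarized in the abstract can be illustrated with a small Python sketch. It is a simplified illustration under stated assumptions, not the thesis's implementation. The function names (`longest_matches`, `segment`, `build_inverted_index`), the toy English dictionary, and the sample documents are all hypothetical. The sketch replaces the adjacent-overlapping graph with a plain shortest-path search over character positions, so it does not model overlapping words, and it indexes unknown substrings as they are instead of matching them against rule-based syllables.

```python
def longest_matches(text, dictionary):
    """For each start position, length of the longest dictionary word
    beginning there (0 if no dictionary word starts at that position)."""
    max_len = max((len(w) for w in dictionary), default=0)
    result = []
    for i in range(len(text)):
        best = 0
        # Try the longest candidate first and stop at the first hit.
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in dictionary:
                best = length
                break
        result.append(best)
    return result


def segment(text, dictionary):
    """Split text into known words and unknown substrings.

    Assumption of this sketch: positions 0..len(text) are graph nodes.
    Taking the longest word at i is an edge costing (0 unknown, 1 word);
    skipping one character costs (1 unknown, 0 words). The cheapest path
    minimizes unknown characters first, then the number of words.
    """
    n = len(text)
    match = longest_matches(text, dictionary)
    cost = [(0, 0)] * (n + 1)   # best (unknown chars, words) for text[i:]
    step = [0] * (n + 1)        # chosen word length at i, 0 = skip a char
    for i in range(n - 1, -1, -1):
        best, choice = (cost[i + 1][0] + 1, cost[i + 1][1]), 0
        if match[i]:
            j = i + match[i]
            take = (cost[j][0], cost[j][1] + 1)
            if take < best:
                best, choice = take, match[i]
        cost[i], step[i] = best, choice

    # Walk the chosen path, grouping skipped characters into substrings.
    words, unknown = [], []
    i, gap_start = 0, None
    while i < n:
        if step[i]:
            if gap_start is not None:
                unknown.append(text[gap_start:i])
                gap_start = None
            words.append(text[i:i + step[i]])
            i += step[i]
        else:
            if gap_start is None:
                gap_start = i
            i += 1
    if gap_start is not None:
        unknown.append(text[gap_start:])
    return words, unknown


def build_inverted_index(documents, dictionary):
    """Map each keyword to the sorted list of document ids containing it."""
    index = {}
    for doc_id, text in documents.items():
        words, unknown = segment(text, dictionary)
        for keyword in set(words) | set(unknown):
            index.setdefault(keyword, set()).add(doc_id)
    return {keyword: sorted(ids) for keyword, ids in index.items()}


if __name__ == "__main__":
    toy_dictionary = {"the", "cat", "sat", "at", "he"}  # hypothetical
    print(segment("thecatxyzsat", toy_dictionary))
    # (['the', 'cat', 'sat'], ['xyz'])
```

In this toy run the shorter candidates "he" and "at" are never selected, because the path through "the", "cat", and "sat" leaves the same three unknown characters while using fewer words. Only the selected words and the unknown substring "xyz" would become index keywords, which is the kind of keyword reduction the abstract reports relative to matching all known words.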

Files in This Item:

File	Size	Format
Witoon_Ka_front.pdf	343.49 kB	Adobe PDF
Witoon_Ka_ch1.pdf	230.51 kB	Adobe PDF
Witoon_Ka_ch2.pdf	447.33 kB	Adobe PDF
Witoon_Ka_ch3.pdf	536.84 kB	Adobe PDF
Witoon_Ka_ch4.pdf	427.39 kB	Adobe PDF
Witoon_Ka_ch5.pdf	230.06 kB	Adobe PDF
Witoon_Ka_back.pdf	380.61 kB	Adobe PDF
