การออกแบบแฟ้มผกผันเพื่อการค้นคืนข้อความไทย

สมชาย ประสิทธิ์จูตระกูล

dc.contributor.author	สมชาย ประสิทธิ์จูตระกูล
dc.contributor.other	Chulalongkorn University. Department of Computer Engineering
dc.date.accessioned	2008-01-25T09:47:01Z
dc.date.available	2008-01-25T09:47:01Z
dc.date.issued	2541
dc.identifier.uri	http://cuir.car.chula.ac.th/handle/123456789/5608
dc.description.abstract	งานวิจัยนี้นำเสนอขั้นตอนวิธีการหาคำเพื่อจัดทำดัชนีสำหรับระบบการค้นคืนข้อความไทยที่ใช้โครงสร้างแฟ้มผกผัน โดยอาศัยพจนานุกรมช่วยในการแยกคำ และยังสามารถจัดการกับกรณีที่ข้อความที่ได้รับมีคำที่ไม่ปรากฏในพจนานุกรม อาทิเช่น คำทับศัพท์ หรือคำที่สะกดผิด เป็นต้น โดยอาศัยกฎการแบ่งพยางค์ข้อความไทย ขั้นตอนวิธีนี้จำลองปัญหาด้วยกราฟการต่อและซ้อนกันของคำ ซึ่งมีโหนดแทนคำและเส้นเชื่อมแทนการต่อหรือซ้อนกันของคำ โดยมีเส้นทางสั้นสุดจากซ้ายไปขวาในกราฟนี้ แทนรายการคำพื้นฐานที่ควรถูกจัดทำดัชนีสำหรับแฟ้มผกผัน เวลาการทำงานของการหาคำนี้เป็น O(n^2) โดยที่ n คือความยาวข้อความ ขั้นตอนวิธีนี้จะถูกใช้ทั้งในขั้นตอนการเตรียมเอกสารก่อนการทำดัชนี และการประมวลข้อคำถามก่อนการสืบค้น ผลการทดลองพบว่าจำนวนคำที่หาได้เพื่อทำดัชนีนั้นมีจำนวนประมาณ 30-50% ของจำนวนคำที่เป็นไปได้ทั้งหมดที่ปรากฏในข้อความทดสอบ นอกจากนี้งานวิจัยนี้ยังได้นำเสนอขั้นตอนวิธีในการเข้ารหัสคำทับศัพท์ เพื่อรองรับการค้นคืนคำทับศัพท์ข้ามภาษาจากอังกฤษมาไทย นั่นคือระบบสามารถค้นคืนเอกสารที่มีคำสำคัญภาษาอังกฤษ หรือคำทับศัพท์เป็นภาษาไทยของคำอังกฤษนั้น การเข้ารหัสนี้ปรับปรุงวิธีการเข้ารหัสเสียงและตารางการเข้ารหัสในระบบซาวน์เดกซ์ วิธีนี้ใช้เวลาการเข้ารหัสแปรเชิงเส้นตามความยาว จากผลที่ได้จากการทดลองพบว่าได้ค่าเรียกคืนและความแม่นยำมากกว่า 80% เมื่อจำกัดการพิจารณาเฉพาะคำที่รหัสเสียงมีความยาวเกิน 4	en
dc.description.abstractalternative	This work presents an algorithm for finding the words to index in a Thai-text retrieval system built on an inverted file structure. A dictionary is used during word separation. Using a set of Thai syllable separation rules, the algorithm can also handle text containing words that are not in the system dictionary, such as transliterated words and misspelled words. The algorithm models the problem as a word adjacency-overlapping graph in which vertices represent words and edges represent adjacency or overlap between words. A shortest path from the left-most vertex to the right-most vertex of the graph gives the list of basic words to be indexed in the inverted file. The running time is O(n^2), where n is the text length. The algorithm is used both in document preparation before indexing and in query processing before the actual search. Experimental results showed that the number of words obtained for indexing is approximately 30-50% of the total number of possible words appearing in the test text. In addition, this work presents an algorithm for encoding transliterated words to support cross-language retrieval from English to Thai: the system can retrieve documents containing either the English keyword or its Thai transliteration. The encoding modifies the phonetic encoding procedure and encoding table of Soundex, and its running time is linear in the word length. Experimental results showed recall and precision of more than 80% when only words whose phonetic codes are longer than four are considered.	en
dc.description.sponsorship	Government budget funding, fiscal year 2540 (B.E.)	en
dc.format.extent	5976432 bytes
dc.format.mimetype	application/pdf
dc.language.iso	th	es
dc.publisher	Chulalongkorn University	en
dc.rights	Chulalongkorn University	en
dc.subject	Index files	en
dc.subject	Information storage and retrieval systems	en
dc.title	การออกแบบแฟ้มผกผันเพื่อการค้นคืนข้อความไทย	en
dc.title.alternative	Design of inverted file for Thai-text retrieval	en
dc.type	Technical Report	es
dc.email.author	Somchai.P@Chula.ac.th
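
The shortest-path word-finding step described in the abstract can be illustrated with a minimal Python sketch. It is a simplification and not the report's algorithm: it builds the graph over character positions rather than over words, it has no overlap edges, and it replaces the Thai syllable separation rules with a single-character fallback for text the dictionary does not cover. The names `segment` and `UNKNOWN_COST` and the cost values are assumptions made for illustration only.

```python
from typing import List, Set

# Hypothetical penalty for a character not covered by any dictionary word.
# It stands in for the report's syllable-based handling of unknown words.
UNKNOWN_COST = 10


def segment(text: str, dictionary: Set[str]) -> List[str]:
    """Split text into tokens along a minimum-cost path.

    Positions 0..n are the graph nodes. A dictionary word spanning
    text[start:end] is an edge of cost 1; any single character is a
    fallback edge of cost UNKNOWN_COST. The loops examine O(n^2) spans.
    """
    n = len(text)
    inf = float("inf")
    best = [inf] * (n + 1)  # best[i]: minimum cost to cover text[:i]
    back = [0] * (n + 1)    # back[i]: start of the last token ending at i
    best[0] = 0

    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in dictionary:
                cost = 1
            elif end - start == 1:
                cost = UNKNOWN_COST
            else:
                continue  # multi-character span that is not a known word
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start

    # Walk the back-pointers from the right end to recover the tokens.
    tokens: List[str] = []
    pos = n
    while pos > 0:
        tokens.append(text[back[pos]:pos])
        pos = back[pos]
    tokens.reverse()
    return tokens
```

For example, `segment("แฟ้มผกผัน", {"แฟ้ม", "ผกผัน"})` yields the two dictionary words `แฟ้ม` and `ผกผัน`, while a character outside the dictionary comes back as its own single-character token. Because the same function can be applied to documents and to queries, both sides produce comparable index terms, which matches the dual use mentioned in the abstract.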