การตัดคำภาษาไทยโดยใช้คุณลักษณะ

ไพศาล เจริญพรสวัสดิ์

Please use this identifier to cite or link to this item: https://cuir.car.chula.ac.th/handle/123456789/11711

Title:	การตัดคำภาษาไทยโดยใช้คุณลักษณะ
Other Titles:	Feature-based Thai word segmentation
Authors:	ไพศาล เจริญพรสวัสดิ์
Advisors:	บุญเสริม กิจศิริกุล สุรพันธ์ เมฆนาวิน
Other author:	จุฬาลงกรณ์มหาวิทยาลัย. บัณฑิตวิทยาลัย
Advisor's Email:	boonserm@cp.eng.chula.ac.th, Boonserm.K@chula.ac.th ไม่มีข้อมูล
Subjects:	การแจงส่วนประโยค (ไวยากรณ์คอมพิวเตอร์) ภาษาไทย การตัดคำ
Issue Date:	2541
Publisher:	จุฬาลงกรณ์มหาวิทยาลัย
Abstract:	เนื่องจากลักษณะการเขียนของภาษาไทยนั้นไม่มีการใช้ตัวอักษรหรือสัญลักษณ์ที่นำมาใช้คั่นระหว่างคำ และงานต่างๆ ในด้านการประมวลผลภาษาธรรมชาตินั้นจำเป็นต้องทราบขอบเขตของคำก่อนถึงจะสามารถนำไปประมวลผลต่อไปได้ ดังเช่นการแปลภาษาไทย-อังกฤษ การสังเคราะห์เสียงภาษาไทย หรือการแก้ไขคำที่สะกดผิด เป็นต้น ทำให้การตัดคำนั้นถือได้ว่าเป็นปัญหาที่สำคัญปัญหาหนึ่งสำหรับงานด้านการประมวลผลภาษาธรรมชาติ ในการตัดคำนั้นประกอบไปด้วยปัญหาหลัก 2 ปัญหาคือ 1. ปัญหาความกำกวม 2. ปัญหาคำศัพท์ที่ไม่ปรากฏในพจนานุกรม สำหรับแนวคิดในการตัดคำนั้นมีอยู่หลายแนวคิด เช่นการตัดคำแบบเลือกคำยาวที่สุด การตัดคำโดยเลือกแบบเหมือนมากที่สุด และการตัดคำโดยโมเดลไตรแกรม อย่างไรก็ตามแนวคิดต่างๆ เหล่านั้นไม่สามารถให้ความถูกต้องที่สูงในการแก้ปัญหาการตัดคำ เพราะว่ามีการใช้เพียงวิทยาการศึกษาสำนึก สำหรับการตัดคำโดยแบบเลือกคำยาวที่สุดและการตัดคำโดยเลือกแบบที่เหมือนมากที่สุด และสำหรับการตัดคำโดยใช้โมเดลไตรแกรมนั้นมีการพิจารณาแค่คำบริบทก่อนหน้าแค่เพียง 2 คำเท่านั้น ส่วนความถูกต้องในการแก้ปัญหาความกำกวมนั้นมีความถูกต้องประมาณ 53% และ 73% สำหรับการตัดคำโดยเลือกแบบเหมือนมากที่สุดและการตัดคำโดยใช้โมเดลไตรแกรมตามลำดับ ในวิทยานิพนธ์นี้เสนอแนวคิดการนำคุณลักษณะโดยใช้การเรียนรู้ของเครื่อง 2 แบบ คือ ริปเปอร์และวินโนว์ในการแก้ปัญหาการตัดคำภาษาไทย โดยคุณลักษณะคือข้อมูลที่อยู่รอบๆ ซึ่งสามารถนำมาประยุกต์ใช้ในการแก้ปัญหาได้ สำหรับคุณลักษณะที่นำมาใช้ในการแก้ปัญหาการตัดคำทั้ง 2 ปัญหา คือคำบริบท และสิ่งที่เกิดร่วมกันโดยมีลำดับ ในการทดลองมีการนำคลังข้อความที่มีการกำหนดหน้าที่คำจำนวน 80% เข้ามาใช้ในการเรียนรู้และส่วนที่เหลือนำมาใช้ในการทดสอบ สำหรับการวัดประสิทธิภาพนั้นได้มีการแบ่งออกเป็น 2 ส่วนคือ 1. วัดค่าความถูกต้องของการแก้ปัญหาความกำกวม 2. วัดค่าความถูกต้องของการแก้ปัญหาคำศัพท์ที่ไม่ปรากฏในพจนานุกรม สำหรับความถูกต้องโดยการใช้ริปเปอร์และวินโนว์ในการแก้ปัญหาความกำกวมนั้นให้ความถูกต้องมากกว่า 85% และ 90% ตามลำดับ ส่วนความถูกต้องในการแก้ปัญหาคำศัพท์ที่ไม่ปรากฏในพจนานุกรมนั้นให้ความถูกต้องมากกว่า 70% และ 80% สำหรับริปเปอร์และวินโนว์ตามลำดับ จากผลการทดลองแสดงให้เห็นว่าการตัดคำโดยใช้คุณลักษณะให้ประสิทธิภาพในการแก้ปัญหาได้ดีกว่าการตัดคำโดยใช้ไตรแกรมโมเดลและการตัดคำโดยเลือกแบบเหมือนมากที่สุด และยังแสดงให้เห็นว่าวินโนว์สามารถดึงคุณลักษณะต่างๆจากคลังข้อความ เพื่อใช้ในการแก้ปัญหาการตัดคำได้ดีกว่าริปเปอร์
Other Abstract:	In a Thai text, a delimiter for indicating the word boundary is not explicitly used. Many tasks of Natural Language Processing (NLP) such as Thai-English machine translation, Thai speech synthesis and spelling correction require boundaries of words. Therefore, word segmentation is one of the main problems in NLP. There are two main problems in word segmentation. The first is the ambiguity problem and the second is the unknown word boundary problem. Many approaches such as longest matching, maximal matching and trigram model have been proposed. However, these approaches can not give high accuracy because longest matching and maximal matching use only heuristics and trigram model consider only two previous context words for solving the problems. The accuracy in solving ambiguity problem is about 53% and 73% for maximal matching and trigram model respectively. This thesis proposes to use a feature-based approach with two learning algorithms namely RIPPER and Winnow in solving the problems in Thai word segmentation. A feature can be anything that tests for specific information in the context around the word in question, such as context words and collocations. In the experiment we train the system by using RIPER and Winnow algorithm separately, on an 80% of part-of-speech tagged corpus and the rest is used for testing. We divided the evaluation into two parts. One is the accuracy in solving the ambiguity problem and the other is the accuracy in solving the unknown word boundary problem. The accuracy using RIPPER and Winnow in solving the ambiguity problem is more than 85% and 90% respectively. On the other hand, the accuracy in solving the unknown word boundary problem is more than 70% and 80% for RIPPER and Winnow respectively. The experiment results show the feature-based approach outperform trigram model and maximal matching, and Winnow is superior to RIPPER for extracting the features from the corpus.
Description:	วิทยานิพนธ์ (วศ.ม.)--จุฬาลงกรณ์มหาวิทยาลัย, 2541
Degree Name:	วิศวกรรมศาสตรมหาบัณฑิต
Degree Level:	ปริญญาโท
Degree Discipline:	วิศวกรรมคอมพิวเตอร์
URI:	http://cuir.car.chula.ac.th/handle/123456789/11711
ISBN:	9743323821
Type:	Thesis
Appears in Collections:	Grad - Theses

Files in This Item:

File	Size	Format
Paisarn_Ch_front.pdf	769.52 kB	Adobe PDF	View/Open
Paisarn_Ch_ch1.pdf	719.69 kB	Adobe PDF	View/Open
Paisarn_Ch_ch2.pdf	800.52 kB	Adobe PDF	View/Open
Paisarn_Ch_ch3.pdf	728.47 kB	Adobe PDF	View/Open
Paisarn_Ch_ch4.pdf	714.06 kB	Adobe PDF	View/Open
Paisarn_Ch_ch5.pdf	712.17 kB	Adobe PDF	View/Open
Paisarn_Ch_ch6.pdf	721.89 kB	Adobe PDF	View/Open
Paisarn_Ch_ch7.pdf	799.43 kB	Adobe PDF	View/Open
Paisarn_Ch_ch8.pdf	774.9 kB	Adobe PDF	View/Open
Paisarn_Ch_ch9.pdf	688.6 kB	Adobe PDF	View/Open
Paisarn_Ch_back.pdf	949.54 kB	Adobe PDF	View/Open

Show full item record