Feature-based Thai word segmentation

ไพศาล เจริญพรสวัสดิ์

Please use this identifier to cite or link to this item: https://cuir.car.chula.ac.th/handle/123456789/11711

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	บุญเสริม กิจศิริกุล	-
dc.contributor.advisor	สุรพันธ์ เมฆนาวิน	-
dc.contributor.author	ไพศาล เจริญพรสวัสดิ์	-
dc.contributor.other	Chulalongkorn University. Graduate School	-
dc.date.accessioned	2009-11-26T07:39:37Z	-
dc.date.available	2009-11-26T07:39:37Z	-
dc.date.issued	2541	-
dc.identifier.isbn	9743323821	-
dc.identifier.uri	http://cuir.car.chula.ac.th/handle/123456789/11711	-
dc.description	Thesis (M.Eng.)--Chulalongkorn University, 2541	en
dc.description.abstract	Because Thai writing does not use characters or symbols to separate words, and because many natural language processing tasks, such as Thai-English translation, Thai speech synthesis, and spelling correction, need word boundaries before further processing is possible, word segmentation is regarded as one of the important problems in natural language processing. Word segmentation involves two main problems: 1. the ambiguity problem and 2. the problem of words that do not appear in the dictionary. Several approaches to word segmentation exist, such as longest matching, maximal matching, and the trigram model. However, these approaches cannot achieve high accuracy, because longest matching and maximal matching rely only on heuristics, and the trigram model considers only the two preceding context words. In resolving ambiguity, accuracy is about 53% for maximal matching and 73% for the trigram model. This thesis proposes a feature-based approach using two machine learning algorithms, RIPPER and Winnow, to solve the Thai word segmentation problem. A feature is a piece of surrounding information that can be applied to solving the problem. The features used for both word segmentation problems are context words and collocations (ordered co-occurrences). In the experiments, 80% of a part-of-speech-tagged corpus is used for training and the remainder for testing. Evaluation is divided into two parts: 1. accuracy in resolving ambiguity and 2. accuracy in handling words that do not appear in the dictionary. In resolving ambiguity, RIPPER and Winnow achieve accuracy of more than 85% and 90%, respectively. For words that do not appear in the dictionary, accuracy is more than 70% for RIPPER and more than 80% for Winnow. The results show that feature-based word segmentation solves these problems better than the trigram model and maximal matching, and that Winnow extracts features from the corpus for word segmentation better than RIPPER.	en
dc.description.abstractalternative	Thai text does not use an explicit delimiter to mark word boundaries. Many Natural Language Processing (NLP) tasks, such as Thai-English machine translation, Thai speech synthesis, and spelling correction, require word boundaries. Word segmentation is therefore one of the main problems in NLP. It involves two main problems: the ambiguity problem and the unknown word boundary problem. Many approaches, such as longest matching, maximal matching, and the trigram model, have been proposed. However, these approaches cannot give high accuracy, because longest matching and maximal matching use only heuristics and the trigram model considers only the two previous context words. The accuracy in solving the ambiguity problem is about 53% for maximal matching and 73% for the trigram model. This thesis proposes a feature-based approach with two learning algorithms, RIPPER and Winnow, for solving the problems in Thai word segmentation. A feature can be anything that tests for specific information in the context around the word in question, such as context words and collocations. In the experiment, the system is trained with the RIPPER and Winnow algorithms separately on 80% of a part-of-speech-tagged corpus, and the rest is used for testing. The evaluation is divided into two parts: the accuracy in solving the ambiguity problem and the accuracy in solving the unknown word boundary problem. The accuracy of RIPPER and Winnow in solving the ambiguity problem is more than 85% and 90%, respectively. The accuracy in solving the unknown word boundary problem is more than 70% for RIPPER and more than 80% for Winnow. The experimental results show that the feature-based approach outperforms the trigram model and maximal matching, and that Winnow is superior to RIPPER at extracting features from the corpus.	en
dc.format.extent	787984 bytes	-
dc.format.extent	736962 bytes	-
dc.format.extent	819734 bytes	-
dc.format.extent	745954 bytes	-
dc.format.extent	731196 bytes	-
dc.format.extent	729258 bytes	-
dc.format.extent	739212 bytes	-
dc.format.extent	818621 bytes	-
dc.format.extent	793493 bytes	-
dc.format.extent	705123 bytes	-
dc.format.extent	972333 bytes	-
dc.format.mimetype	application/pdf	-
dc.format.mimetype	application/pdf	-
dc.format.mimetype	application/pdf	-
dc.format.mimetype	application/pdf	-
dc.format.mimetype	application/pdf	-
dc.format.mimetype	application/pdf	-
dc.format.mimetype	application/pdf	-
dc.format.mimetype	application/pdf	-
dc.format.mimetype	application/pdf	-
dc.format.mimetype	application/pdf	-
dc.format.mimetype	application/pdf	-
dc.language.iso	th	es
dc.publisher	Chulalongkorn University	en
dc.rights	Chulalongkorn University	en
dc.subject	Parsing (Computer grammar)	en
dc.subject	Thai language	en
dc.subject	Word segmentation	en
dc.title	Feature-based Thai word segmentation	en
dc.title.alternative	Feature-based Thai word segmentation	en
dc.type	Thesis	es
dc.degree.name	Master of Engineering	es
dc.degree.level	Master's degree	es
dc.degree.discipline	Computer Engineering	es
dc.degree.grantor	Chulalongkorn University	en
dc.email.advisor	boonserm@cp.eng.chula.ac.th, Boonserm.K@chula.ac.th	-
dc.email.advisor	No information available	-
Appears in Collections:	Grad - Theses
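
The abstract describes resolving segmentation ambiguity with Winnow over context-word and collocation features. The sketch below illustrates that general idea only and is not the thesis implementation. The multiplicative promote/demote variant of Winnow, the threshold and alpha values, the feature names (`ctx=`, `L1=`, `R1=`), and the four training contexts for the ambiguous string "ตากลม" ("ตา|กลม" versus "ตาก|ลม") are all assumptions made up for illustration.

```python
class Winnow:
    """Mistake-driven linear classifier over binary (present/absent) features."""

    def __init__(self, threshold=5.0, alpha=2.0):
        self.threshold = threshold  # assumed value, chosen for this toy data
        self.alpha = alpha          # promotion/demotion factor (assumed)
        self.weights = {}           # feature -> weight; unseen features count as 1.0

    def score(self, features):
        return sum(self.weights.get(f, 1.0) for f in features)

    def predict(self, features):
        return 1 if self.score(features) > self.threshold else 0

    def update(self, features, label):
        """Change weights only on a wrong prediction; return True on a mistake."""
        if self.predict(features) == label:
            return False
        # Promote active features on a missed positive, demote on a false positive.
        factor = self.alpha if label == 1 else 1.0 / self.alpha
        for f in features:
            self.weights[f] = self.weights.get(f, 1.0) * factor
        return True

    def fit(self, examples, max_epochs=20):
        for _ in range(max_epochs):
            mistakes = sum(self.update(feats, label) for feats, label in examples)
            if mistakes == 0:
                break


def extract_features(left, right):
    """Context words (unordered) plus adjacent-word collocations (ordered)."""
    feats = {f"ctx={w}" for w in left + right}
    if left:
        feats.add(f"L1={left[-1]}")   # word immediately to the left
    if right:
        feats.add(f"R1={right[0]}")   # word immediately to the right
    return feats


# Invented toy contexts around "ตากลม": label 1 = "ตา|กลม", label 0 = "ตาก|ลม".
TRAIN = [
    (extract_features(["เธอ", "มี"], ["โต"]), 1),
    (extract_features(["ออก", "ไป"], ["เย็น"]), 0),
    (extract_features(["เด็ก", "มี"], ["สวย"]), 1),
    (extract_features(["นั่ง"], ["เย็น"]), 0),
]

model = Winnow()
model.fit(TRAIN)

print(model.predict(extract_features(["แมว", "มี"], ["สวย"])))
print(model.predict(extract_features(["ไป"], ["เย็น"])))
```

Because weights change only on mistakes and only for features present in the misclassified example, features that never help stay at their initial weight. This is one reason Winnow is often used with large, sparse feature sets such as context words.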

Files in This Item:

File	Size	Format
Paisarn_Ch_front.pdf	769.52 kB	Adobe PDF
Paisarn_Ch_ch1.pdf	719.69 kB	Adobe PDF
Paisarn_Ch_ch2.pdf	800.52 kB	Adobe PDF
Paisarn_Ch_ch3.pdf	728.47 kB	Adobe PDF
Paisarn_Ch_ch4.pdf	714.06 kB	Adobe PDF
Paisarn_Ch_ch5.pdf	712.17 kB	Adobe PDF
Paisarn_Ch_ch6.pdf	721.89 kB	Adobe PDF
Paisarn_Ch_ch7.pdf	799.43 kB	Adobe PDF
Paisarn_Ch_ch8.pdf	774.9 kB	Adobe PDF
Paisarn_Ch_ch9.pdf	688.6 kB	Adobe PDF
Paisarn_Ch_back.pdf	949.54 kB	Adobe PDF
