Thai Clause Segmentation Using a Support Vector Machine Model

นลินี อินต๊ะซาว

Please use this identifier to cite or link to this item: https://cuir.car.chula.ac.th/handle/123456789/42818

Title:	การแยกอนุพากย์ภาษาไทยด้วยการใช้แบบจำลองซัพพอร์ตเวกเตอร์แมชชีน
Other Titles:	Thai Clause Segmentation Using a Support Vector Machine Model
Authors:	นลินี อินต๊ะซาว
Advisors:	วิโรจน์ อรุณมานะกุล
Other author:	Chulalongkorn University. Faculty of Arts
Advisor's Email:	awirote@chula.ac.th
Subjects:	Support vector machines; Thai language -- Sentences
Issue Date:	2013 (B.E. 2556)
Publisher:	Chulalongkorn University
Abstract:	This thesis identifies linguistic features for Thai clause segmentation with a support vector machine (SVM) model and compares how those features affect the performance of the segmentation system. The corpus is a 76,460-word collection of Thai academic writing containing 8,102 clauses. The SVM is trained with SMO, one of the functions in Weka, using a polynomial kernel. The system takes words as input and decides whether each word marks the beginning of a clause. The decision relies on the following linguistic features: the part of speech of the current word, the part of speech of the previous word, the part of speech of the following word, lists of clause connectives (discourse markers), the probability that a white space acts as a clause separator, and punctuation. The features are compared by defining several feature sets and testing each one. The feature set that performs best combines all of the features and achieves an F-measure of 81.17 percent. Raising the exponent of the polynomial kernel improves performance further: with D=4 the F-measure reaches 84.74 percent, although precision decreases by 6 percent.
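The word-level setup described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the thesis's implementation: the connective list, the punctuation set, the part-of-speech tag names, the way the white-space probability is supplied, and all function names are assumptions made for the example. The kernel is written in one common polynomial form, which may differ from the exact form Weka applies.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical lists for illustration only; the thesis's actual connective
# list and punctuation inventory are not given in this record.
CONNECTIVES = {"และ", "แต่", "เพราะ", "ถ้า"}
PUNCTUATION = {",", ".", ";", ":", "(", ")"}


def word_features(words: Sequence[str], tags: Sequence[str],
                  space_probs: Sequence[float], i: int) -> Dict[str, object]:
    """Build the features used to decide whether words[i] begins a clause.

    space_probs[i] is assumed to be a precomputed probability that a white
    space before words[i] separates clauses; how it is estimated is not
    stated in the abstract.
    """
    return {
        "pos": tags[i],
        "prev_pos": tags[i - 1] if i > 0 else "<s>",
        "next_pos": tags[i + 1] if i + 1 < len(tags) else "</s>",
        "is_connective": words[i] in CONNECTIVES,
        "space_prob": space_probs[i],
        "is_punct": words[i] in PUNCTUATION,
    }


def poly_kernel(x: Sequence[float], y: Sequence[float],
                degree: int = 1, coef0: float = 0.0) -> float:
    """Polynomial kernel (x . y + coef0) ** degree; degree plays the role of D."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree


def precision_recall_f(gold: List[bool], pred: List[bool]) -> Tuple[float, float, float]:
    """Precision, recall and F-measure over clause-start decisions."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    # Toy sentence with made-up tags and space probabilities.
    words = ["เขา", "มา", "เพราะ", "ฝน", "ตก"]
    tags = ["PRON", "VERB", "CONJ", "NOUN", "VERB"]
    space_probs = [0.0, 0.0, 0.8, 0.0, 0.0]
    print(word_features(words, tags, space_probs, 2))
    print(poly_kernel([1.0, 2.0], [3.0, 4.0], degree=2))
    print(precision_recall_f([True, False, True], [True, True, False]))
```

In a full pipeline the feature dictionaries would be encoded as numeric vectors before training. Reporting precision and recall next to the F-measure makes visible the trade-off the abstract notes at D=4, where F-measure rises while precision falls.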
Description:	Thesis (M.A.)--Chulalongkorn University, 2013
Degree Name:	Master of Arts
Degree Level:	Master's degree
Degree Discipline:	Linguistics
URI:	http://cuir.car.chula.ac.th/handle/123456789/42818
URI:	http://doi.org/10.14457/CU.the.2013.292
metadata.dc.identifier.DOI:	10.14457/CU.the.2013.292
Type:	Thesis
Appears in Collections:	Arts - Theses

Files in This Item:

File	Description	Size	Format
5380139422.pdf		2.84 MB	Adobe PDF