Increasing accuracy of Thai word segmentation using character function probabilistic models

เกศราภรณ์ ซื่อสัตย์พาณิชย์

Please use this identifier to cite or link to this item: https://cuir.car.chula.ac.th/handle/123456789/18715

Title:	การเพิ่มความถูกต้องของการแบ่งคำภาษาไทยโดยใช้แบบจำลองความน่าจะเป็นของหน้าที่อักขระ
Other Titles:	Increasing accuracy of Thai word segmentation using character function probabilistic models
Authors:	เกศราภรณ์ ซื่อสัตย์พาณิชย์
Advisors:	อติวงศ์ สุชาโต; โปรดปราน บุณยพุกกณะ
Other author:	Chulalongkorn University. Faculty of Engineering
Advisor's Email:	Atiwong.S@Chula.ac.th; Proadpran.P@Chula.ac.th
Subjects:	Parsing (Computer grammar); Character sets (Data processing)
Issue Date:	2009 (B.E. 2552)
Publisher:	Chulalongkorn University
Abstract:	This research presents Thai word segmentation using Conditional Random Fields (CRFs) with characters and character groups as features. The character groups are formed according to the function of each character, for example its placement when written. The feature templates in the CRFs use characters and character function groups, either individually or in N-grams, to identify word boundaries. The experiments cover two feature templates: 1. characters alone as features; 2. characters and character function groups as features. Accuracy is compared with word segmentation by a word-level trigram Markov model. The best F-measure obtained in this research is 95.53%, which is better than the F-measure of 90.98% obtained with the trigram Markov model. Analysis of the segmentation results shows that adding character functions as features, on top of characters alone, improves the segmentation results. Although this increases the number of features in the template, segmentation performance remains good. Using characters also makes the segmentation results more robust for words never seen in the training corpus, compared with the word-level trigram Markov model.
Other Abstract:	This work presents a Thai word segmentation approach in which Conditional Random Fields (CRFs) classify each character of the text string to be segmented into classes based on the character's position in the underlying word. Characters used in the Thai writing system are assigned the character functions proposed in this work. N-grams of these character functions are considered together with character N-grams in the feature templates of the CRF models so that the models can locate characters likely to indicate word boundaries. The proposed method yields a best F-measure of 95.53%, which is better than the 90.98% obtained with word trigrams. The results show that word segmentation using CRFs performs better than segmentation using word trigrams on every genre. Comparing the two types of feature templates, the templates that contain character function features perform slightly better than the templates relying only on character sequences. Although this observation is far from surprising given the greater number of features in the templates that use both character and character function sequences, the fact that including character functions helps word segmentation performance is still encouraging. It is also shown that character-level constraints make the results more robust when segmenting unseen words.
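
The feature templates described in the abstract could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the coarse classes returned by `char_function`, the two-class B/I (word-begin / word-inside) labelling, the window size, and every function name are hypothetical choices made here. The thesis defines its own character functions and position classes, which this record does not list. The sketch only shows how character N-grams and character-function N-grams can sit side by side as per-character features for a CRF.

```python
from typing import Dict, List


def char_function(ch: str) -> str:
    """Map a character to a coarse, HYPOTHETICAL function class.

    The grouping follows Unicode code points in the Thai block and is an
    assumption for illustration, not the grouping proposed in the thesis.
    """
    cp = ord(ch)
    if 0x0E01 <= cp <= 0x0E2E:
        return "CONS"          # consonants
    if 0x0E40 <= cp <= 0x0E44:
        return "LEAD_VOWEL"    # vowels written before the consonant
    if cp == 0x0E31 or 0x0E34 <= cp <= 0x0E37 or cp == 0x0E47:
        return "ABOVE"         # marks written above the consonant
    if 0x0E38 <= cp <= 0x0E3A:
        return "BELOW"         # marks written below the consonant
    if 0x0E48 <= cp <= 0x0E4B:
        return "TONE"          # tone marks
    if cp in (0x0E30, 0x0E32, 0x0E33, 0x0E45):
        return "FOLLOW_VOWEL"  # vowels written after the consonant
    if ch.isdigit():
        return "DIGIT"
    return "OTHER"


def boundary_labels(words: List[str]) -> List[str]:
    """Label each character B (first character of a word) or I (otherwise)."""
    labels: List[str] = []
    for word in words:
        if word:  # skip empty strings so they do not produce a stray "B"
            labels.extend(["B"] + ["I"] * (len(word) - 1))
    return labels


def char_features(text: str, i: int, window: int = 1) -> Dict[str, str]:
    """Character and character-function unigrams and bigrams around position i."""
    feats: Dict[str, str] = {}
    for off in range(-window, window + 1):
        j = i + off
        if 0 <= j < len(text):
            feats[f"c[{off}]"] = text[j]
            feats[f"f[{off}]"] = char_function(text[j])
    for off in range(-window, window):
        j = i + off
        if j >= 0 and j + 1 < len(text):
            feats[f"c[{off}]c[{off + 1}]"] = text[j:j + 2]
            feats[f"f[{off}]f[{off + 1}]"] = (
                char_function(text[j]) + "|" + char_function(text[j + 1])
            )
    return feats


def words_from_labels(text: str, labels: List[str]) -> List[str]:
    """Rebuild words from per-character B/I labels."""
    words: List[str] = []
    for ch, label in zip(text, labels):
        if label == "B" or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words
```

A CRF toolkit would be trained on the per-character feature dictionaries paired with the labels from `boundary_labels`, and `words_from_labels` would turn predicted labels back into words. Dropping the `f[...]` keys gives an analogue of the character-only template, so the two templates compared in the abstract differ only in whether those keys are present.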
Description:	Thesis (M.Sc.)--Chulalongkorn University, 2009 (B.E. 2552)
Degree Name:	Master of Science
Degree Level:	Master's degree
Degree Discipline:	Computer Science
URI:	http://cuir.car.chula.ac.th/handle/123456789/18715
URI:	http://doi.org/10.14457/CU.the.2009.316
metadata.dc.identifier.DOI:	10.14457/CU.the.2009.316
Type:	Thesis
Appears in Collections:	Eng - Theses

Files in This Item:

File	Description	Size	Format
kessaraporn_su.pdf		2.09 MB	Adobe PDF
