Thai named entity recognition: the application of conditional random fields models

นัชชา ถิระสาโรช

Please use this identifier to cite or link to this item: https://cuir.car.chula.ac.th/handle/123456789/20802

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	วิโรจน์ อรุณมานะกุล	-
dc.contributor.author	นัชชา ถิระสาโรช	-
dc.contributor.other	Chulalongkorn University. Faculty of Arts	-
dc.date.accessioned	2012-07-13T14:39:00Z	-
dc.date.available	2012-07-13T14:39:00Z	-
dc.date.issued	2553	-
dc.identifier.uri	http://cuir.car.chula.ac.th/handle/123456789/20802	-
dc.description	Thesis (M.A.)--Chulalongkorn University, 2553 B.E. (2010)	en
dc.description.abstract	This thesis aims to develop a Thai named entity recognition system using Conditional Random Fields (CRFs) and to compare the performance of a system that takes syllables as input with one that takes words as input. The research uses a news corpus of 367,673 words containing 16,179 named entities in total. The model used is CRF++ version 0.53. The word-based and syllable-based systems use the same features: named entity list features, abbreviation features, context word features, general word features, statistical features, and unigram and bigram features. The systems are trained by supervised learning, that is, the answers are provided in the training corpus. Five answer patterns are used: pattern 1 carries the least information about named entity boundaries and pattern 5 carries the most. Answer patterns that carry more information give both systems better performance than patterns that carry less. The test results show that the systems using word-segmented and syllable-segmented data do not differ in performance, with the same F-measure of 81.30%. Of all the features, the unigram and bigram features support the syllable-based system the most, and the named entity list features support the word-based system the most. After post-processing, the recall of both systems increases from 77.64% to 80.15% and 80.06% for word-segmented and syllable-segmented data respectively.	en
dc.description.abstractalternative	The main purpose of this study is to develop a Thai named entity recognition system using Conditional Random Fields (CRFs) and to compare the performance of a syllable-based system with that of a word-based system. The study uses a news corpus of 367,673 words containing 16,179 proper names. The CRF implementation used is CRF++ 0.53. Both the word-based and syllable-based systems use the same set of features: gazetteer lists, abbreviations, context clues, general words, statistics, and unigrams and bigrams. The CRFs are trained by supervised learning. Five answer patterns are given to the systems, the first carrying the least information about named entity boundaries and the last carrying the most. The results show that patterns containing more information tend to yield better performance than those containing less. The test results show that the word-based and syllable-based systems do not differ in performance: both achieve the same recognition rate (F-measure) of 81.30%. Of all the features used, unigrams and bigrams support the syllable-based system the most, while gazetteer lists support the word-based system the most. After post-processing, recall increases from 77.64% to 80.15% for the word-based system and to 80.06% for the syllable-based system.	en
dc.format.extent	1340066 bytes	-
dc.format.mimetype	application/pdf	-
dc.language.iso	th	es
dc.publisher	Chulalongkorn University	en
dc.relation.uri	http://doi.org/10.14457/CU.the.2010.2144	-
dc.rights	Chulalongkorn University	en
dc.subject	Character recognition (Computers)	-
dc.subject	Thai language -- Morphemes -- Computer models	-
dc.subject	Conditional random fields models	-
dc.title	การรู้จำชื่อเฉพาะภาษาไทย: การใช้แบบจำลองคอนดิชันนอลแรนดอมฟิลด์ส	en
dc.title.alternative	Thai named entity recognition: the application of conditional random fields models	en
dc.type	Thesis	es
dc.degree.name	Master of Arts	es
dc.degree.level	Master's Degree	es
dc.degree.discipline	Linguistics	es
dc.degree.grantor	Chulalongkorn University	en
dc.email.advisor	Wirote.A@Chula.ac.th	-
dc.identifier.DOI	10.14457/CU.the.2010.2144	-
Appears in Collections:	Arts - Theses

Files in This Item:

File	Description	Size	Format
nutcha_ti.pdf		1.31 MB	Adobe PDF
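
The abstract reports recall and F-measure figures for the two systems. As a reminder of how such figures are commonly computed, here is a minimal Python sketch. It assumes exact-match scoring over entity spans, which this record does not confirm as the thesis's procedure; the function name `prf` and the example spans are hypothetical and are not taken from the thesis data.

```python
def prf(gold, predicted):
    """Return (precision, recall, F-measure) for two collections of entity spans.

    A predicted span counts as correct only if it matches a gold span exactly
    (an assumption; the thesis record does not state its matching rule).
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    # F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical (start, end) token offsets of named entities in one document.
gold_spans = [(0, 2), (5, 6), (9, 12), (15, 16)]
predicted_spans = [(0, 2), (5, 7), (9, 12)]

precision, recall, f_measure = prf(gold_spans, predicted_spans)
print(f"P={precision:.4f} R={recall:.4f} F={f_measure:.4f}")
```

In this toy case two of the three predicted spans match a gold span exactly, so a boundary error such as `(5, 7)` versus `(5, 6)` lowers both precision and recall.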
