A robust system for core Thai natural language processing technologies

Can Udomcharoenchaikit

dc.contributor.advisor	Peerapon Vateekul
dc.contributor.advisor	Prachya Boonkwan
dc.contributor.author	Can Udomcharoenchaikit
dc.contributor.other	Chulalongkorn University. Faculty of Engineering
dc.date.accessioned	2021-09-22T23:25:48Z
dc.date.available	2021-09-22T23:25:48Z
dc.date.issued	2020
dc.identifier.uri	http://cuir.car.chula.ac.th/handle/123456789/77087
dc.description	Thesis (Ph.D.)--Chulalongkorn University, 2020
dc.description.abstract	As the amount of unstructured textual data grows, it becomes increasingly important to build intelligent systems that can process it. Natural Language Processing (NLP) is a technology that allows computers to use human language to perform tasks. Deep learning models have shown excellent results across fundamental NLP tasks such as word segmentation, part-of-speech tagging, and named-entity recognition. In many situations, however, these methods fail to perform well. For an NLP system to be robust, it must handle issues such as out-of-vocabulary words and spelling mistakes. The goal of this thesis is to develop NLP models that can handle malformed texts and are therefore more usable in real-world settings. It proposes multiple novel training strategies, architectures, and evaluation schemes that focus on robustness against malformed texts. It explores input data manipulation strategies that diversify training data, such as UNK masking and adversarial training, and it explores how sub-lexical information can improve the robustness of word embeddings. Furthermore, it examines similarity constraint techniques, such as triplet loss, which constrain the similarity between the original texts and their parallel perturbed counterparts. It also proposes alternative evaluation schemes that reveal the weaknesses of NLP systems by introducing typographical adversarial examples into the test sets. These adversarial evaluation schemes show that current deep learning models are not robust against misspelled inputs, and that the proposed training strategies and architectures improve performance on malformed texts.
dc.description.abstractalternative	As the amount of textual data grows, building intelligent systems that can process human language becomes increasingly important. Natural language processing is a technology that helps computers make use of human language to perform various tasks, and it is therefore increasingly necessary. Deep learning models have shown excellent results on fundamental natural language processing tasks such as word segmentation, part-of-speech tagging, and named-entity recognition. In some situations, however, these proposed methods do not perform as well as they should. To make natural language processing systems more stable, we should address problems that occur frequently and affect system performance, namely handling out-of-vocabulary words and misspelled words. The research goal of this thesis is to develop natural language processing systems that can handle misspelled text, so that the models work better in real-world use. This thesis proposes new machine learning models and evaluations that focus on increasing robustness to malformed text. It proposes new strategies and natural language processing systems to improve robustness to misspellings. It explores input data manipulation strategies that make the input data more diverse, such as UNK masking and adversarial training. It explores how units of language smaller than words can improve the robustness of word embeddings. In addition, it examines techniques for constraining the similarity between texts, such as using a triplet loss function to constrain the similarity between the original text and the misspelled text. It also proposes new evaluation schemes that expose the weaknesses of natural language processing systems by inserting adversarial examples derived from typing errors into the test sets. The adversarial evaluation schemes proposed in this thesis show that current deep learning models are not robust when they encounter misspelled data, and also show that our natural language processing strategies and architectures can improve performance on misspelled text.
dc.language.iso	en
dc.publisher	Chulalongkorn University
dc.relation.uri	http://doi.org/10.58837/CHULA.THE.2020.127
dc.rights	Chulalongkorn University
dc.subject.classification	Computer Science
dc.title	A robust system for core Thai natural language processing technologies
dc.title.alternative	ระบบแบบทนทานสำหรับเทคโนโลยีหลักในการประมวลผลภาษาธรรมชาติภาษาไทย
dc.type	Thesis
dc.degree.name	Doctor of Philosophy
dc.degree.level	Doctoral Degree
dc.degree.discipline	Computer Engineering
dc.degree.grantor	Chulalongkorn University
dc.identifier.DOI	10.58837/CHULA.THE.2020.127