Thai sentence segmentation using large language models

Narongkorn Panitsrisit

dc.contributor.advisor	Attapol Thamrongrattanarit
dc.contributor.author	Narongkorn Panitsrisit
dc.contributor.other	Chulalongkorn University. Faculty of Arts
dc.date.accessioned	2023-08-04T08:20:39Z
dc.date.available	2023-08-04T08:20:39Z
dc.date.issued	2022
dc.identifier.uri	https://cuir.car.chula.ac.th/handle/123456789/83303
dc.description	Independent Study (M.A.)--Chulalongkorn University, 2022
dc.description.abstract	Thai sentence segmentation has long been a topic of interest in the Thai NLP community. However, few studies have explored the use of transformer-based large language models for this task. We conduct three experiments on the LST20 corpus: (1) fine-tuning WangchanBERTa, a large language model pre-trained on Thai, across different classification tasks, (2) joint learning for clause and sentence segmentation, and (3) cross-lingual transfer using the multilingual model XLM-RoBERTa. Our findings show that WangchanBERTa outperforms the other models in Thai sentence segmentation, and that fine-tuning it with token and contextual information further improves its performance. However, cross-lingual transfer from English and Chinese to Thai is not effective for this task.
dc.description.abstractalternative	การตัดประโยคภาษาไทยเป็นเรื่องที่มีผู้สนใจอยู่มาก แต่การตัดประโยคโดยใช้แบบจำลองทางภาษาขนาดใหญ่ซึ่งใช้สถาปัตยกรรมทรานส์ฟอร์เมอร์ยังมีผู้ศึกษาไม่มากนัก ผู้วิจัยใช้คลังข้อมูล LST20 เพื่อทำการทดลองจำนวนสามการทดลองโดยประกอบไปด้วย (1) การปรับจูนการจำแนกคำในสถานการณ์ต่าง ๆ ด้วย WangchanBERTa ซึ่งเป็นแบบจำลองทางภาษาขนาดใหญ่ที่ฝึกฝนด้วยข้อมูลภาษาไทย (2) การใช้ Joint Learning สำหรับการตัดประโยคและอนุพากย์ และ (3) การถ่ายโอนข้ามภาษาโดยใช้ XLM-RoBERTa ซึ่งเป็นแบบจำลองหลากภาษา ผลการทดสอบพบว่า WangchanBERTa มีประสิทธิภาพดีกว่าแบบจำลองอื่นในการตัดประโยคภาษาไทย และเมื่อปรับจูนเพิ่มเติมด้วยข้อมูลคำและบริบทจะทำให้แบบจำลองดังกล่าวมีประสิทธิภาพดีขึ้น อย่างไรก็ตาม การถ่ายโอนข้ามภาษาจากภาษาอังกฤษและภาษาจีนไปยังภาษาไทยเป็นวิธีที่ไม่ได้ผลดีนักสำหรับการตัดประโยคภาษาไทย
dc.language.iso	en
dc.publisher	Chulalongkorn University
dc.relation.uri	http://doi.org/10.58837/CHULA.IS.2022.31
dc.rights	Chulalongkorn University
dc.subject.classification	Computer Science
dc.subject.classification	Arts and Humanities
dc.subject.classification	Information and communication
dc.title	Thai sentence segmentation using large language models
dc.title.alternative	การตัดประโยคภาษาไทยโดยใช้แบบจำลองทางภาษาขนาดใหญ่
dc.type	Independent Study
dc.degree.name	Master of Arts
dc.degree.level	Master's Degree
dc.degree.discipline	Linguistics
dc.degree.grantor	Chulalongkorn University
dc.identifier.DOI	10.58837/CHULA.IS.2022.31