Data augmentation for Thai natural language processing using different tokenization

Patawee Prakrankamanant

Please use this identifier to cite or link to this item: https://cuir.car.chula.ac.th/handle/123456789/80735

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Atiwong Suchato	-
dc.contributor.author	Patawee Prakrankamanant	-
dc.contributor.other	Chulalongkorn University. Faculty of Engineering	-
dc.date.accessioned	2022-11-02T06:41:05Z	-
dc.date.available	2022-11-02T06:41:05Z	-
dc.date.issued	2021	-
dc.identifier.uri	http://cuir.car.chula.ac.th/handle/123456789/80735	-
dc.description	Thesis (M.Eng.)--Chulalongkorn University, 2021	en_US
dc.description.abstract	Tokenization is one of the most important data pre-processing steps in text classification and one of the main factors contributing to model performance. However, obtaining good tokenizations is non-trivial when the input is noisy, and is especially problematic for languages without an explicit word delimiter, such as Thai. We therefore propose an alternative data augmentation method that improves robustness to poor tokenization by using multiple tokenizations. We evaluate our algorithms on several Thai text classification datasets. The results suggest that our augmentation scheme makes the model more robust to tokenization errors and combines well with other data augmentation schemes.	en_US
dc.description.abstractalternative	การทำให้เป็นโทเค็น (tokenization) เป็นหนึ่งในขั้นตอนการดำเนินการเบื้องต้น (pre-processing) ในระบบของแบบจำลองแบ่งประเภทข้อความ (text classification model) และเป็นส่วนหนึ่งที่ส่งผลต่อประสิทธิภาพของแบบจำลอง แต่อย่างไรก็ตามการทำให้เป็นโทเค็น ไม่ใช่ปัญหาทั่วไปสำหรับ noisy text หรือ ภาษาที่ไม่มีขอบเขตของคำ (word boundary) ที่ชัดเจนเช่น ภาษาไทย ในการศึกษานี้เราได้นำเสนอวิธีการเพิ่มข้อมูล (data augmentation) เพื่อเพิ่มความคงทน (robustness) และประสิทธิภาพโดยการใช้การทำให้เป็นโทเค็นหลากหลายรูปแบบ (multi-tokenization) เราวัดผลบนแบบจำลองแบ่งประเภทข้อความภาษาไทย จากผลการศึกษาพบว่าแบบจำลองที่ถูกเรียนรู้ด้วยการเพิ่มข้อมูลที่เรานำเสนอนั้น สามารถคงทนต่อการตัดคำที่ผิดพลาด และสามารถใช้ร่วมกับการเพิ่มข้อมูลแบบอื่นด้วย	en_US
dc.language.iso	en	en_US
dc.publisher	Chulalongkorn University	en_US
dc.relation.uri	http://doi.org/10.58837/CHULA.THE.2021.98	-
dc.rights	Chulalongkorn University	en_US
dc.subject	Natural language processing (Computer science)	-
dc.subject	Thai language -- Sentences	-
dc.subject	การประมวลผลภาษาธรรมชาติ (วิทยาการคอมพิวเตอร์)	-
dc.subject	ภาษาไทย -- ประโยค	-
dc.title	Data augmentation for Thai natural language processing using different tokenization	en_US
dc.title.alternative	การเพิ่มข้อมูลสำหรับระบบประมวลภาษาธรรมชาติภาษาไทยโดยใช้การแบ่งเป็นโทเค็นที่แตกต่างกัน	en_US
dc.type	Thesis	en_US
dc.degree.name	Master of Engineering	en_US
dc.degree.level	Master's Degree	en_US
dc.degree.discipline	Computer Engineering	en_US
dc.degree.grantor	Chulalongkorn University	en_US
dc.identifier.DOI	10.58837/CHULA.THE.2021.98	-
Appears in Collections:	Eng - Theses
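
The abstract describes augmenting training data by tokenizing each text in more than one way. The following is a minimal sketch of that general idea, not the thesis's implementation: the function names (char_tokenize, make_longest_match_tokenizer, augment_with_tokenizations), the toy vocabularies, and the example label are all hypothetical, and the simple dictionary-based tokenizers stand in for whatever real Thai tokenizers one would use in practice.

```python
from typing import Callable, List, Set, Tuple

# A tokenizer maps raw text to a list of tokens (hypothetical type alias).
Tokenizer = Callable[[str], List[str]]


def char_tokenize(text: str) -> List[str]:
    """Toy tokenizer: one token per non-whitespace character."""
    return [ch for ch in text if not ch.isspace()]


def make_longest_match_tokenizer(vocab: Set[str]) -> Tokenizer:
    """Build a toy greedy longest-match tokenizer over a given vocabulary.

    This is an illustrative stand-in for a real word segmenter.
    """
    max_len = max(len(word) for word in vocab)

    def tokenize(text: str) -> List[str]:
        tokens: List[str] = []
        i = 0
        while i < len(text):
            if text[i].isspace():
                i += 1
                continue
            # Try the longest candidate first; fall back to a single character.
            for size in range(min(max_len, len(text) - i), 0, -1):
                piece = text[i:i + size]
                if size == 1 or piece in vocab:
                    tokens.append(piece)
                    i += size
                    break
        return tokens

    return tokenize


def augment_with_tokenizations(
    dataset: List[Tuple[str, str]],
    tokenizers: List[Tokenizer],
) -> List[Tuple[List[str], str]]:
    """Emit one (tokens, label) example per distinct tokenization of each text."""
    augmented: List[Tuple[List[str], str]] = []
    for text, label in dataset:
        seen = set()
        for tokenizer in tokenizers:
            tokens = tuple(tokenizer(text))
            if tokens not in seen:  # skip tokenizations that coincide
                seen.add(tokens)
                augmented.append((list(tokens), label))
    return augmented


if __name__ == "__main__":
    # The string below can be segmented in two ways, depending on the vocabulary.
    tokenizers = [
        char_tokenize,
        make_longest_match_tokenizer({"ตา", "กลม"}),
        make_longest_match_tokenizer({"ตาก", "ลม"}),
    ]
    for tokens, label in augment_with_tokenizations([("ตากลม", "demo")], tokenizers):
        print(label, tokens)
```

In this sketch a single labeled text yields several training examples that share the label but differ in segmentation, so a downstream classifier sees the same content under different, possibly imperfect, tokenizations.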

Files in This Item:

File	Description	Size	Format
Eng_Patawee Pra_The_2021.pdf		32.71 MB	Adobe PDF
