Sarcasm Classification in Twitter Using Probability of Tweets

กษิดิ์เดช ทาแป้ง

Please use this identifier to cite or link to this item: https://cuir.car.chula.ac.th/handle/123456789/55546

Title:	การจำแนกข้อความส่อเสียดในทวิตเตอร์ด้วยการใช้ความน่าจะเป็นของทวีต
Other Titles:	SARCASM CLASSIFICATION IN TWITTER USING PROBABILITY OF TWEETS
Authors:	กษิดิ์เดช ทาแป้ง
Advisors:	จันทร์เจ้า มงคลนาวิน
Other author:	Chulalongkorn University. Faculty of Commerce and Accountancy
Advisor's Email:	Janjao.M@Chula.ac.th, janjao@cbs.chula.ac.th
Issue Date:	2016 (B.E. 2559)
Publisher:	Chulalongkorn University
Abstract:	Sarcasm is a problem in natural language processing because a sarcastic message inverts the sentiment polarity of the text, so that sentiment analysis of the message departs from its real meaning. This research proposes a method for separating sarcastic messages from normal messages by applying message probability. The study uses consumer opinions about one internet network provider posted on the social network Twitter. The data were collected through the Advance Search API from 25 January 2010 to 9 June 2016, totaling 4,027 messages. The data were then preprocessed by removing duplicate messages, URLs appearing within messages, hashtag symbols together with the hashtag text, mention symbols (@) together with the names of the mentioned persons, letters or digits appearing more than 3 times consecutively, and various special characters. The experiment was divided into two parts. The first part is machine processing: each message is segmented into words and converted into a bigram model, which is used to compute the probability of the message by Maximum Likelihood Estimation. In the second part, five people assess each message as sarcastic, normal, or impossible to identify, and the assessment scores are averaged into a mean probability score. The machine-computed message probability and the mean probability score from the human assessment are then checked for their degree of association using Pearson's correlation. The results show a P-value of 0.015, from which it is concluded that the machine-computed message probability is correlated in the same direction as the human classification of sarcastic messages.
Other Abstract:	Sarcasm is a recurring problem in natural language processing because it inverts the apparent sentiment of a phrase, for example from positive to negative. This study proposes an approach to classifying sarcastic phrases about one telecommunication service provider in Thailand, gathered from Twitter using the Advance Search API. The data consist of 4,027 phrases posted from 25 January 2010 to 9 June 2016. The phrases are preprocessed by removing duplicates, URLs, hashtags together with their content, mentions (@) together with the mentioned user names, and characters or numbers repeated more than 3 times consecutively. The experiment consists of two parts: phrase probability estimation by machine and by a group of five people. In the machine part, each phrase is segmented into words, which are converted into a bigram model; the phrase probability is then calculated from the bigram model using Maximum Likelihood Estimation. In the human part, each person rates each phrase as sarcastic, typical, or uncertain, and an average score is computed for each phrase. The relationship between the results of the two parts is measured with Pearson's correlation test. The test yields a P-value of 0.015, from which the study concludes that the two sets of results are correlated in the same direction.
Description:	Thesis (M.Sc.)--Chulalongkorn University, 2016
Degree Name:	Master of Science
Degree Level:	Master's Degree
Degree Discipline:	Information Technology in Business
URI:	http://cuir.car.chula.ac.th/handle/123456789/55546
URI:	http://doi.org/10.58837/CHULA.THE.2016.84
DOI:	10.58837/CHULA.THE.2016.84
Type:	Thesis
Appears in Collections:	Acctn - Theses
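
Illustrative sketch (not part of the thesis record): the pipeline summarized in the abstract, namely preprocessing, bigram probability by Maximum Likelihood Estimation, averaging of human ratings, and Pearson correlation, might look roughly like the Python below. Several details are assumptions, since the abstract does not specify them: whitespace tokenization stands in for word segmentation, long repeated runs are deleted rather than collapsed, only ASCII punctuation is treated as special characters, unseen bigrams get probability 0 (no smoothing), and the rating-to-score mapping in RATING_SCORE is hypothetical. All function names are invented for this example.

```python
import math
import re
import string
from collections import Counter

# Assumption: "special characters" means ASCII punctuation only.
PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")

# Hypothetical mapping; the abstract does not say how ratings become scores.
RATING_SCORE = {"sarcastic": 1.0, "normal": 0.0, "uncertain": 0.5}


def preprocess(tweet):
    """Remove URLs, hashtags, mentions, long repeated runs, and punctuation."""
    tweet = re.sub(r"https?://\S+", " ", tweet)   # URLs
    tweet = re.sub(r"[#@]\w+", " ", tweet)        # hashtags and mentions with their text
    tweet = re.sub(r"(\w)\1{3,}", " ", tweet)     # same letter/digit 4 or more times in a row
    tweet = PUNCT.sub(" ", tweet)                 # remaining ASCII punctuation
    return " ".join(tweet.lower().split())


def train_bigram(sentences):
    """Count bigrams and their left-hand contexts over tokenized sentences."""
    contexts, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + list(words) + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams


def sentence_probability(words, contexts, bigrams):
    """MLE estimate: product of count(prev, cur) / count(prev)."""
    tokens = ["<s>"] + list(words) + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if bigrams[(prev, cur)] == 0:
            return 0.0  # unseen bigram; pure MLE assigns zero probability
        prob *= bigrams[(prev, cur)] / contexts[prev]
    return prob


def mean_rating(ratings):
    """Average the numeric scores of one message's human ratings."""
    return sum(RATING_SCORE[r] for r in ratings) / len(ratings)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)
```

In use, each cleaned message would be split into words, scored with sentence_probability against counts from train_bigram, and the resulting scores compared with the per-message mean_rating values through pearson_r. This sketch returns only the correlation coefficient; a P-value such as the one reported in the abstract would need a significance test on top of it, for instance from a statistics library.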

Files in This Item:

File	Description	Size	Format
5781506626.pdf		3.5 MB	Adobe PDF
