Extraction of Trend Keywords and Stop Words from Thai Facebook Pages using Character n-Grams

ณัษฐพงษ์ อู่สิริมณีชัย

Please use this identifier to cite or link to this item: https://cuir.car.chula.ac.th/handle/123456789/63653

Title:	การสกัดคำสำคัญที่เป็นกระแสและคำหยุดจากเพจเฟซบุ๊กภาษาไทยโดยใช้เอ็นแกรมแบบตัวอักษร
Other Titles:	Extraction of Trend Keywords and Stop Words from Thai Facebook Pages using Character n-Grams
Authors:	ณัษฐพงษ์ อู่สิริมณีชัย
Advisors:	สุกรี สินธุภิญโญ
Other author:	Chulalongkorn University. Faculty of Engineering
Advisor's Email:	Sukree.S@Chula.ac.th
Issue Date:	2018 (B.E. 2561)
Publisher:	Chulalongkorn University
Abstract:	Social media can be used to analyze the behavior of people in society, and Facebook is the most popular social media platform among Thai people. If we can analyze people's behavior on Facebook, we can therefore understand the behavior of most Thai people. One common way to analyze behavior is through the trends that arise in society: how people engage with a trend, when it began, and so on. Trends can in turn be analyzed through the keywords related to them. However, current keyword extraction methods require Thai word segmentation tools, and these tools are trained on corpora that do not include the kind of sentences found on social media such as Facebook. As a result, the segmentation tools have trouble with non-standard words, which degrades keyword extraction performance. In addition, current keyword extraction methods support only fixed-length keywords. This thesis therefore develops a method for extracting trend keywords without a word segmentation tool, using character n-grams instead, which makes it possible to extract keywords of variable length. It also uses the characteristics of trends to build a stop word database and to retain only the words that are trends. Compared with the traditional TF-IDF and TF methods, the proposed method achieves an F1 score of 0.402, which is better than the TF-IDF method at 0.165 and the TF method at 0.183. The proposed method is particularly well suited to tasks that require variable-length keywords, such as finding trends on Facebook.
Other Abstract:	Social media can be used to analyze the behavior of people in society, and this behavior is often analyzed through the trends in society. Trend analysis can be done by analyzing the keywords related to the trends. However, the methods used to extract trend keywords require Thai word segmentation tools, which are trained on Thai corpora that do not include sentences found on social media. As a result, word segmentation tools have problems when segmenting non-standard words, which reduces the effectiveness of keyword extraction. In addition, existing keyword extraction methods support only fixed-length keywords. This thesis develops a method for extracting trend keywords that uses character n-grams instead of word segmentation, which makes it possible to extract keywords that are not fixed in length. In addition, the trend characteristics are used to create a stop word database and then to keep only the words that are trends. Compared with the traditional TF-IDF and TF methods, the proposed method achieves an F1 score of 0.402, which is better than the TF-IDF method with an F1 score of 0.165 and the TF method with an F1 score of 0.183. Finally, the method presented in this thesis is especially suitable for tasks that require non-fixed-length keywords, such as finding trends on social media such as Facebook.
Description:	Thesis (M.Eng.)--Chulalongkorn University, 2018
Degree Name:	Master of Engineering
Degree Level:	Master's Degree
Degree Discipline:	Computer Engineering
URI:	http://cuir.car.chula.ac.th/handle/123456789/63653
URI:	http://doi.org/10.58837/CHULA.THE.2018.1254
metadata.dc.identifier.DOI:	10.58837/CHULA.THE.2018.1254
Type:	Thesis
Appears in Collections:	Eng - Theses
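
Illustrative sketch (not taken from the thesis): the abstract mentions two generic building blocks, character n-grams of variable length and the F1 score used for evaluation. The Python below shows one minimal way each could be computed. The function names `char_ngrams` and `f1_score`, the length range `n_min`..`n_max`, and the set-based matching of keywords are assumptions made for illustration; the thesis's actual candidate generation, stop word construction, and trend filtering are not described in this record and are not reproduced here.

```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=6):
    """Count every character n-gram of length n_min..n_max in text.

    Assumption: n-grams are taken over Unicode code points, so Thai vowel
    and tone marks count as separate characters. No word segmentation is
    used, which is the property the abstract emphasizes.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the text.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def f1_score(predicted, gold):
    """F1 between a set of predicted keywords and a set of gold keywords.

    Assumption: a prediction counts as correct only on an exact string
    match with a gold keyword.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Because candidates of several lengths are counted together, frequent substrings of any length in the chosen range can surface as keyword candidates; deciding which of them are trends or stop words would require the time-based criteria that the thesis itself defines.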

Files in This Item:

File	Description	Size	Format
6070188521.pdf		3.73 MB	Adobe PDF
