The development of geoparsing and automated classification from Thai Twitter text data

ธุวชิต แฉล้มเขตต์

Please use this identifier to cite or link to this item: https://cuir.car.chula.ac.th/handle/123456789/83037

Title:	การพัฒนาตัวแบบแปลความหมายทางภูมิศาสตร์และจำแนกประเภทอัตโนมัติจากข้อมูลภาษาไทยบนทวิตเตอร์
Other Titles:	The development of geoparsing and automated classification from Thai Twitter text data
Authors:	ธุวชิต แฉล้มเขตต์
Advisors:	ชนินทร์ ทินนโชติ อรรถพล ธำรงรัตนฤทธิ์
Other author:	Chulalongkorn University. Faculty of Engineering
Issue Date:	2022 (B.E. 2565)
Publisher:	Chulalongkorn University
Abstract:	Twitter is an extremely fast source of news and information. The enormous volume of messages exchanged on it contains information about new places, both their names and text describing where they are. It is therefore an important data source for keeping the geospatial databases of information systems, such as navigation map systems, continuously up to date. This involves two key steps: toponym recognition, which finds and extracts place names in text, and geocoding, which estimates the geographic location of those places. At present, toponym recognition research and tools developed for other languages have seen only limited application to Thai data, and existing geocoding techniques do not yet give satisfactory positional accuracy. This research develops a geoparsing model for Thai. For toponym recognition, it applies machine learning techniques: a CRF model with additional geography-specific feature functions; the recurrent neural networks LSTM and GRU; and finally a transfer learning model, BERT. BERT gave an overall accuracy at the complete-phrase level (F1-Phrase) of 0.919. For geocoding, which locates the newly extracted place names, this research develops a new algorithm, named Topology words, that uses the spatial relationships with other place names of known location in the text as weights when estimating the position of the new place. The results show that the Topology words model performed best, with the lowest root mean square error of 0.947 km, a better accuracy than the existing techniques DBSCAN, K-means, K-medoids, and Agglomerative clustering.
Other Abstract:	Twitter is a fast-moving news source rich in geo-referenced information. Geoparsing transforms place names in text into geospatial data, which navigation systems and geospatial data retrieval systems can use to locate new places. No such tool is currently available for Thai-language data, so this study develops a geoparsing model for Thai. Geoparsing involves two crucial steps: toponym recognition and geocoding. For the first step, toponym recognition, the study applies machine learning techniques: a CRF model with additional geographic feature functions; the recurrent neural networks LSTM and GRU; and a transfer learning model, BERT, which achieved the highest phrase-level accuracy (F1-Phrase). The final step is geocoding. This research extends geocoding to estimating the location of a place that cannot be found in the existing database. An algorithm called "topology words" exploits the referencing relationships between locations mentioned in the text. Clustering models, namely DBSCAN, K-means, K-medoids, and Agglomerative clustering, are also used to select the group of place names from which the location is estimated. According to the findings, the topology words model performed best, with the lowest root mean square error of 0.947 km.
Description:	Thesis (D.Eng.)--Chulalongkorn University, 2022
Degree Name:	Doctor of Engineering
Degree Level:	Doctoral Degree
Degree Discipline:	Survey Engineering
URI:	https://cuir.car.chula.ac.th/handle/123456789/83037
URI:	http://doi.org/10.58837/CHULA.THE.2022.876
metadata.dc.identifier.DOI:	10.58837/CHULA.THE.2022.876
Type:	Thesis
Appears in Collections:	Eng - Theses

Files in This Item:

File	Description	Size	Format
6071457521.pdf		4.35 MB	Adobe PDF
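
The abstract describes estimating the position of a new place from other places of known location, using spatial relationships as weights, and scoring the result by root mean square error in kilometres. The sketch below illustrates only that general idea and is not the thesis's Topology words algorithm: the function names, the weighted-mean formulation, and the sample coordinates and weights are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def weighted_estimate(known_places):
    """Weighted mean of (lat, lon, weight) triples.

    Assumption: the points are close together, so averaging degrees
    directly is an acceptable approximation.
    """
    total = sum(w for _, _, w in known_places)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    lat = sum(la * w for la, _, w in known_places) / total
    lon = sum(lo * w for _, lo, w in known_places) / total
    return lat, lon


def rmse_km(estimates, truths):
    """Root mean square of the per-place distance errors, in km."""
    errors = [haversine_km(e, t) for e, t in zip(estimates, truths)]
    return math.sqrt(sum(d * d for d in errors) / len(errors))


# Hypothetical input: two known places mentioned alongside a new one,
# with made-up weights standing in for spatial-relationship strength.
known = [(13.7460, 100.5340, 3.0), (13.7300, 100.5200, 1.0)]
estimate = weighted_estimate(known)
```

A call such as `rmse_km([estimate], [(13.7440, 100.5300)])` would then give the error of that single estimate against a reference position, which here is also a made-up value.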
