The development of geoparsing and automated classification from Thai Twitter text data

ธุวชิต แฉล้มเขตต์

dc.contributor.advisor	ชนินทร์ ทินนโชติ
dc.contributor.advisor	อรรถพล ธำรงรัตนฤทธิ์
dc.contributor.author	ธุวชิต แฉล้มเขตต์
dc.contributor.other	Chulalongkorn University. Faculty of Engineering
dc.date.accessioned	2023-08-04T07:35:26Z
dc.date.available	2023-08-04T07:35:26Z
dc.date.issued	2565
dc.identifier.uri	https://cuir.car.chula.ac.th/handle/123456789/83037
dc.description	Thesis (D.Eng.)--Chulalongkorn University, 2565 B.E. (2022)
dc.description.abstract	Twitter is an extremely fast-moving source of news and information. The enormous volume of messages exchanged on it contains information about new places, both their names and text describing where they are. It is therefore an important data source for keeping the geospatial databases of information systems, such as navigation map systems, continuously up to date. Two key steps are involved: toponym extraction, which finds and extracts place names in text, and geocoding, which estimates the geographic location of those places. At present, research and tools for toponym extraction developed for other languages have seen only limited use on Thai data, and existing geocoding techniques still do not deliver adequate positional accuracy. This research develops a model for geoparsing Thai text. For toponym extraction, it applies machine learning techniques: a CRF model with additional geography-specific feature functions; recurrent neural networks, namely LSTM and GRU; and finally a transfer learning model, BERT. BERT achieved an overall score at the complete-phrase level (F1-Phrase) of 0.919. For geocoding, to locate the newly extracted place names, this research developed a new algorithm that takes the spatial relationships with other place names in the text whose locations are known and uses them as weights in estimating the location of the new place. The algorithm is named Topology words. The results show that the Topology words model performed best, with the lowest root mean square error (RMSE) of 0.947 kilometers, a better accuracy than the existing techniques DBSCAN, K-means, K-medoids, and Agglomerative clustering.
dc.description.abstractalternative	Twitter is a fast-moving news source rich in geo-referenced information. Geoparsing is the transformation of textual place names into geospatial data, which navigation systems and geospatial data retrieval systems rely on to locate new places. No such tool currently exists for Thai-language data, so this study develops a geoparsing model for Thai. Geoparsing involves two crucial steps: toponym recognition and geocoding. In the first step, toponym recognition, several machine learning techniques are applied: a CRF model with additional geographic feature functions; the recurrent neural networks LSTM and GRU; and the transfer learning model BERT, which achieved the highest overall accuracy at the complete-phrase level (F1-Phrase). The second step is geocoding. This research extends geocoding to estimating the location of a place that cannot be determined from the existing database. An algorithm called "topology words" uses the referencing relationships between locations in the text. Clustering models, including DBSCAN, K-means, K-medoids, and Agglomerative clustering, are also used to select the group of place names from which the location is estimated. According to the research findings, the topology words model performed best, with the lowest root mean square error of 0.94 km.
dc.language.iso	th
dc.publisher	Chulalongkorn University
dc.relation.uri	http://doi.org/10.58837/CHULA.THE.2022.876
dc.rights	Chulalongkorn University
dc.title	การพัฒนาตัวแบบแปลความหมายทางภูมิศาสตร์และจำแนกประเภทอัตโนมัติจากข้อมูลภาษาไทยบนทวิตเตอร์
dc.title.alternative	The development of geoparsing and automated classification from Thai Twitter text data
dc.type	Thesis
dc.degree.name	Doctor of Engineering
dc.degree.level	Doctoral Degree
dc.degree.discipline	Survey Engineering
dc.degree.grantor	Chulalongkorn University
dc.identifier.DOI	10.58837/CHULA.THE.2022.876
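
The abstract describes estimating the position of a new place by weighting other places in the same text whose locations are known, and reports the result as a root mean square error in kilometers. The sketch below illustrates that general idea only; it is not the thesis's Topology words algorithm. The function names, the way weights are supplied, and the use of a plain weighted mean of latitude and longitude (reasonable only over short distances) are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption for this sketch


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def weighted_location(anchors):
    """Estimate a location as the weighted mean of known places.

    anchors: list of ((lat, lon), weight) pairs. How the weights are derived
    from spatial relation words is not stated in the abstract, so the weights
    here are hypothetical inputs.
    """
    total = sum(weight for _, weight in anchors)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    lat = sum(point[0] * weight for point, weight in anchors) / total
    lon = sum(point[1] * weight for point, weight in anchors) / total
    return (lat, lon)


def rmse_km(predicted, actual):
    """Root mean square of the distance errors (km) over paired points."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [haversine_km(p, a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Two known places mentioned in a message, with hypothetical weights.
    known = [((13.7460, 100.5340), 3.0), ((13.7300, 100.5230), 1.0)]
    estimate = weighted_location(known)
    print(estimate)
    print(rmse_km([estimate], [(13.7420, 100.5310)]))
```

A higher weight pulls the estimate toward that known place, which is the role the abstract assigns to the spatial relationships found in the text.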