Automatic Thai web page categorization

อดุลย์ ตันธุวนิตย์

Please use this identifier to cite or link to this item: https://cuir.car.chula.ac.th/handle/123456789/10816

Title:	การแยกเวบเพจภาษาไทยให้เป็นหมวดหมู่แบบอัตโนมัติ
Other Titles:	Automatic Thai web page categorization
Authors:	อดุลย์ ตันธุวนิตย์
Advisors:	บุญเสริม กิจสิริกุล
Other author:	Chulalongkorn University. Faculty of Engineering
Advisor's Email:	boonserm@cp.eng.chula.ac.th, Boonserm.K@chula.ac.th
Subjects:	Web sites, HTML, Search engines
Issue Date:	2545 BE (2002 CE)
Publisher:	Chulalongkorn University
Abstract:	The number of documents and web pages on the Internet is growing rapidly, which makes finding a required document very difficult. If web pages are categorized in advance, searching for and accessing the required information becomes easier. This thesis studies methods for automatically categorizing Thai web pages, for use together with Thai web page search. The scope of the study is divided into three parts: (1) the importance of words inside HTML tags to document categorization, (2) reducing the number of words to make document categorization more efficient, and (3) categorization methods. The experimental results show that (1) giving words inside HTML tags more weight than other words in the document makes Thai web page categorization more accurate, (2) reducing the number of words slightly increases accuracy and reduces processing time, and (3) SVM (Support Vector Machines) performs better than the naive Bayes classifier.
Other Abstract:	The number of documents and Web pages on the Internet is increasing rapidly, which makes searching for required documents very difficult. If Web pages are organized into categories, users can search for and access them more easily. This thesis studies a method of automatic Thai Web page categorization for use with Thai search engines. The study is divided into three parts: (1) the significance of words in HTML tags for document categorization, (2) a method of reducing the number of words for efficient document categorization, and (3) the method of document categorization. The experimental results show that (1) if words in HTML tags are given higher significance than the other words in the documents, the categorization of Thai Web pages is more accurate, (2) reducing the number of words gives slightly higher accuracy and reduces processing time, and (3) an SVM performs better than Naive Bayes.
Description:	Thesis (M.Sc.)--Chulalongkorn University, 2545 BE (2002 CE)
Degree Name:	Master of Science
Degree Level:	Master's degree
Degree Discipline:	Computer Science
URI:	http://cuir.car.chula.ac.th/handle/123456789/10816
ISBN:	9741712286
Type:	Thesis
Appears in Collections:	Eng - Theses

Files in This Item:

File	Description	Size	Format
Adul.pdf		1.2 MB	Adobe PDF
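
The first finding in the abstract, that words inside HTML tags deserve more weight than other words, can be illustrated with a minimal sketch. This is not the thesis's implementation: the tag set, the weight values in `TAG_WEIGHTS`, and the names `WeightedTermCounter` and `weighted_terms` are all hypothetical, and whitespace tokenization stands in for the Thai word segmentation that real input would need.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights: the abstract does not say which tags or values were used.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "b": 1.5}


class WeightedTermCounter(HTMLParser):
    """Accumulate term counts, scaling each term by the weight of its enclosing tag."""

    def __init__(self, tag_weights):
        super().__init__()
        self.tag_weights = tag_weights
        self.open_tags = []  # stack of currently open weighted tags
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in self.tag_weights:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Remove the most recently opened matching weighted tag, if any.
        if tag in self.open_tags:
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]

    def handle_data(self, data):
        # Use the largest weight among enclosing tags; plain body text gets 1.0.
        weight = max((self.tag_weights[t] for t in self.open_tags), default=1.0)
        # Whitespace split is a placeholder: real Thai text needs a word segmenter.
        for term in data.lower().split():
            self.counts[term] += weight


def weighted_terms(html, tag_weights=TAG_WEIGHTS):
    """Return a dict mapping each term to its tag-weighted count."""
    parser = WeightedTermCounter(tag_weights)
    parser.feed(html)
    parser.close()
    return dict(parser.counts)
```

Calling `weighted_terms(page_html)` yields a term-to-weight mapping that could serve as the feature vector for a classifier such as an SVM or Naive Bayes, the two methods the abstract compares. Dropping low-weight terms from that mapping would be one simple way to approximate the word-reduction step.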