Large-Scale Hierarchical Multi-Label Text Classification Using Flat Approach with SVM Pruning Strategy

ณัฐชนน ผจงกิจพิพัฒน์

Please use this identifier to cite or link to this item: https://cuir.car.chula.ac.th/handle/123456789/52187

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	พีรพล เวทีกูล	en_US
dc.contributor.author	ณัฐชนน ผจงกิจพิพัฒน์	en_US
dc.contributor.other	Chulalongkorn University. Faculty of Engineering	en_US
dc.date.accessioned	2017-03-03T03:02:00Z	-
dc.date.available	2017-03-03T03:02:00Z	-
dc.date.issued	2559	en_US
dc.identifier.uri	http://cuir.car.chula.ac.th/handle/123456789/52187	-
dc.description	Thesis (M.Eng.)--Chulalongkorn University, 2559 B.E. (2016 C.E.)	en_US
dc.description.abstract	Hierarchical multi-label classification combines the characteristics of two problem types: each instance may belong to several classes, and these classes are related through a hierarchical structure. Real-world data are often complex in this way. Hierarchical multi-label text classification is currently a research topic of considerable interest, because a hierarchical structure describes the relationships among textual data well. The textual data we encounter every day are the data on websites. The rapidly growing number of websites means that sites such as web directories and Wikipedia need a system that classifies new web pages automatically as they enter the database. With data this massive, the problem is considered large-scale hierarchical multi-label classification. Many studies have proposed solutions to hierarchical multi-label classification, but those methods cannot process large-scale data, because processing may require very large storage space, may take too long, or may predict classes inaccurately. Some methods that can handle large-scale data do not make use of the hierarchical structure. This research therefore proposes a large-scale hierarchical multi-label text classification method that improves k-NN, which is a flat approach, and exploits the hierarchical structure by training SVM classifiers at the upper-level nodes of the hierarchy to help filter the answers so that they are more accurate. In addition, features that appear infrequently are removed to reduce the number of features, and important features of the test data are used to select training data in order to reduce the data that must be considered. The performance evaluation shows that the proposed method ranks 4th, with an LBMaF of 25.70%, when tested on the medium-sized Wikipedia dataset, and ranks 2nd, with an LBMaF of 23.48%, when tested on the large Wikipedia dataset.	en_US
dc.description.abstractalternative	Hierarchical multi-label classification combines two aspects: an instance may belong to more than one class, and these classes are organized into a hierarchical structure. Real-world data are often complex in this way. Hierarchical multi-label text classification is an increasingly popular research topic because a hierarchical structure describes the relationships among textual data well. The textual data we see every day are web pages. As the number of web pages has grown extremely large, websites such as web directories and Wikipedia need an automated system to classify new web pages entering their databases. This problem is therefore a large-scale hierarchical multi-label classification problem. Many studies have proposed methods for hierarchical multi-label classification, but these methods cannot process large-scale data: they may require a large amount of storage, take too long to run, or have low accuracy. Meanwhile, some methods that can process large-scale data do not use the hierarchical structure at all. This thesis proposes a large-scale hierarchical multi-label text classification method that improves the k-nearest neighbor method and uses the hierarchical structure by training SVM classifiers at the top level of the hierarchy in order to increase precision. Furthermore, we remove features that rarely appear in the training dataset to reduce the number of features, and use important features of the test data to select training data in order to reduce the amount of data considered. The evaluation shows that the proposed method ranks fourth on the Wiki-Medium dataset with 25.70% LBMaF and second on the Wiki-Large dataset with 23.48% LBMaF.	en_US
dc.language.iso	th	en_US
dc.publisher	Chulalongkorn University	en_US
dc.relation.uri	http://doi.org/10.58837/CHULA.THE.2016.979	-
dc.rights	Chulalongkorn University	en_US
dc.subject	Text processing	-
dc.subject	Text processing (Computer science)	-
dc.title	Large-Scale Hierarchical Multi-Label Text Classification Using Flat Approach with SVM Pruning Strategy	en_US
dc.title.alternative	LARGE-SCALE HIERARCHICAL MULTI-LABEL TEXT CLASSIFICATION USING FLAT APPROACH WITH SVM PRUNING STRATEGY	en_US
dc.type	Thesis	en_US
dc.degree.name	Master of Engineering	en_US
dc.degree.level	Master's degree	en_US
dc.degree.discipline	Computer Engineering	en_US
dc.degree.grantor	Chulalongkorn University	en_US
dc.email.advisor	Peerapon.V@chula.ac.th,peerapon.vateekul@gmail.com	en_US
dc.identifier.DOI	10.58837/CHULA.THE.2016.979	-
Appears in Collections:	Eng - Theses
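
The pipeline outlined in the abstract (dropping rare features, using the test document's important features to select training data, a flat k-NN vote, then pruning by top-level category) can be illustrated with a minimal sketch. This is not the thesis implementation. Every function name, parameter, threshold, and the toy data are hypothetical, and the SVM step is not implemented: the `allowed_top` argument simply stands in for whatever top-level categories a trained SVM would accept for the test document.

```python
import math
from collections import Counter

def prune_rare_features(docs, min_df=2):
    """Drop features that occur in fewer than min_df training documents."""
    df = Counter(f for doc in docs for f in doc)
    return [{f: w for f, w in doc.items() if df[f] >= min_df} for doc in docs]

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_candidates(test_doc, train_docs, n_top=2):
    """Keep indices of training docs that share one of the test doc's
    n_top heaviest features (an assumed notion of 'important')."""
    ranked = sorted(test_doc.items(), key=lambda fw: -fw[1])
    top = {f for f, _ in ranked[:n_top]}
    return [i for i, doc in enumerate(train_docs) if any(f in doc for f in top)]

def predict(test_doc, train_docs, train_labels, top_of, allowed_top,
            k=2, threshold=0.5):
    """Flat k-NN label vote, then pruning by top-level category.

    allowed_top is a placeholder for the output of the top-level SVM
    classifiers; here it is just a set supplied by the caller.
    """
    cand = select_candidates(test_doc, train_docs)
    nearest = sorted(cand, key=lambda i: -cosine(test_doc, train_docs[i]))[:k]
    votes = Counter(label for i in nearest for label in train_labels[i])
    n = max(len(nearest), 1)
    voted = {label for label, c in votes.items() if c / n >= threshold}
    # Pruning step: discard labels whose top-level ancestor was rejected.
    return {label for label in voted if top_of[label] in allowed_top}

# Toy data (hypothetical): sparse term weights and label sets.
train_docs = [
    {"goal": 2.0, "match": 1.0, "typo1": 1.0},
    {"goal": 1.0, "league": 2.0},
    {"vote": 2.0, "election": 1.0},
    {"vote": 1.0, "match": 1.0, "senate": 2.0},
]
train_labels = [{"football"}, {"football", "leagues"}, {"elections"}, {"elections"}]
top_of = {"football": "Sports", "leagues": "Sports", "elections": "Politics"}

pruned_docs = prune_rare_features(train_docs, min_df=2)
test_doc = {"goal": 1.0, "match": 1.0, "vote": 0.5}

unpruned = predict(test_doc, pruned_docs, train_labels, top_of,
                   allowed_top={"Sports", "Politics"})
pruned = predict(test_doc, pruned_docs, train_labels, top_of,
                 allowed_top={"Sports"})
```

In this toy run the flat k-NN vote alone proposes labels from two different branches, and restricting `allowed_top` removes the label whose top-level ancestor was rejected, which is the filtering role the abstract assigns to the SVM classifiers.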

Files in This Item:

File	Description	Size	Format
5670192221.pdf		2.93 MB	Adobe PDF
