A comparison of ensemble feature selection methods for binary classification datasets

กรชนก ชมเชย

Please use this identifier to cite or link to this item: https://cuir.car.chula.ac.th/handle/123456789/82729

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	ณัตติฤดี เจริญรักษ์	-
dc.contributor.author	กรชนก ชมเชย	-
dc.contributor.other	Chulalongkorn University. Faculty of Commerce and Accountancy	-
dc.date.accessioned	2023-08-04T06:41:29Z	-
dc.date.available	2023-08-04T06:41:29Z	-
dc.date.issued	2565	-
dc.identifier.uri	https://cuir.car.chula.ac.th/handle/123456789/82729	-
dc.description	Thesis (M.Sc.)--Chulalongkorn University, 2565 B.E. (2022 C.E.)	-
dc.description.abstract	This study compares single-feature selection with ensemble feature selection, the latter in two designs: combining the feature importance rankings and then selecting the important features at a specified threshold (Design CT: Combination followed by Thresholding), and selecting the important features at a specified threshold and then combining the resulting sets of important features (Design TC: Thresholding followed by Combination). The feature selection methods are drawn from the Filter, Wrapper, and Embedded categories. 10-fold cross-validation is used to compare the mean F1-score, representing predictive performance, and the standard deviation of the F1-score, representing prediction stability, on three datasets: the Parkinson's Disease dataset (number of predictors (P) = sample size (N)), the LSVT Voice Rehabilitation dataset (P>N), and the Colon Cancer dataset (P>>N), with XGBoost as the predictive model. Within this scope, the study finds that single-method feature selection with RFE performs well on the high-dimensional P>>N dataset at the 2.5%, 5%, and 10% thresholds, whereas ensemble feature selection gives predictive results that vary with the dimensionality of the dataset and the threshold used. For Design CT, combining the feature importance rankings by the median and the arithmetic mean at the log2(P) threshold gives better predictions than the other Design CT methods on the P>>N dataset, but on the P=N and P>N datasets each Design CT method improves predictive performance only slightly. For Design TC, combining the sets of important features by intersection and multi-intersection gives better results than union on the P>>N dataset at every threshold. Multi-intersection at the log2(P) threshold gives better results than the other selection methods on the P>>N dataset.	-
dc.description.abstractalternative	This study compares single-feature selection with two ensemble feature selection designs in terms of predictive performance and stability. In the first design, Combination followed by Thresholding (Design CT), the feature importance rankings are combined and the top-ranked features are then selected at a given threshold; in the second, Thresholding followed by Combination (Design TC), each method's important features are selected at a given threshold and the resulting feature sets are then combined. The underlying feature selection methods are drawn from the Filter, Wrapper, and Embedded categories. Performance (mean F1-score) and stability (standard deviation of the F1-score) are compared using 10-fold cross-validation on three datasets: Parkinson's Disease (P=N), LSVT Voice Rehabilitation (P>N), and Colon Cancer (P>>N), where P is the number of predictors and N is the sample size, with XGBoost as the predictive model for each dataset. The results can be summarized in three key findings. First, among single-feature selection methods, RFE performed well on the high-dimensional P>>N dataset at the 2.5%, 5%, and 10% thresholds. Second, Design CT using the median or the arithmetic mean for combination at the log2(P) threshold gave better results than the other Design CT methods on the P>>N dataset, whereas on the P=N and P>N datasets the Design CT methods yielded only small improvements in mean F1-score. Third, Design TC using multi-intersection or intersection for combination consistently gave better results than union on the P>>N dataset across all thresholds. Multi-intersection at the log2(P) threshold gave the best result on the P>>N dataset.	-
dc.language.iso	th	-
dc.publisher	Chulalongkorn University	-
dc.relation.uri	http://doi.org/10.58837/CHULA.THE.2022.953	-
dc.rights	Chulalongkorn University	-
dc.subject.classification	Computer Science	-
dc.subject.classification	Information and communication	-
dc.subject.classification	Statistics	-
dc.title	A comparison of ensemble feature selection methods for binary classification datasets	-
dc.title.alternative	A comparison of ensemble feature selection methods for binary classification datasets	-
dc.type	Thesis	-
dc.degree.name	Master of Science	-
dc.degree.level	Master's Degree	-
dc.degree.discipline	Statistics	-
dc.degree.grantor	Chulalongkorn University	-
dc.identifier.DOI	10.58837/CHULA.THE.2022.953	-
Appears in Collections:	Acctn - Theses

Files in This Item:

File	Description	Size	Format
6480386326.pdf		4.36 MB	Adobe PDF
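
The two ensemble designs named in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the toy rankings, the function names `design_ct` and `design_tc`, the tie-breaking by feature name, and the rounding of log2(P) down to an integer are all assumptions made for illustration. The abstract does not define "multi-intersection"; the sketch assumes it means keeping features selected by at least a given number of selectors, so that union and intersection become the two extreme cases.

```python
import math
from collections import Counter
from statistics import mean, median

def design_ct(rankings, k, combine=median):
    """Combination followed by Thresholding: aggregate each feature's rank
    across selectors, then keep the k features with the best aggregate."""
    features = rankings[0].keys()
    combined = {f: combine([r[f] for r in rankings]) for f in features}
    # Rank 1 is the most important; ties are broken by feature name (assumption).
    return sorted(features, key=lambda f: (combined[f], f))[:k]

def design_tc(rankings, k, min_votes):
    """Thresholding followed by Combination: take each selector's top-k set,
    then keep features that appear in at least min_votes of those sets."""
    votes = Counter()
    for r in rankings:
        top_k = sorted(r, key=lambda f: (r[f], f))[:k]
        votes.update(top_k)
    return sorted(f for f, v in votes.items() if v >= min_votes)

# Hypothetical rankings from three selectors (e.g. one filter, one wrapper,
# one embedded method) over P = 8 features; 1 = most important.
rankings = [
    {"f1": 1, "f2": 2, "f3": 3, "f4": 4, "f5": 5, "f6": 6, "f7": 7, "f8": 8},
    {"f1": 2, "f2": 1, "f3": 5, "f4": 3, "f5": 4, "f6": 8, "f7": 6, "f8": 7},
    {"f1": 1, "f2": 3, "f3": 2, "f4": 4, "f5": 6, "f6": 5, "f7": 8, "f8": 7},
]

P = len(rankings[0])
k = max(1, int(math.log2(P)))  # log2(P) threshold, rounded down (assumption)

ct_median = design_ct(rankings, k, combine=median)
ct_mean = design_ct(rankings, k, combine=mean)
tc_union = design_tc(rankings, k, min_votes=1)                     # union
tc_multi = design_tc(rankings, k, min_votes=2)                     # assumed multi-intersection
tc_intersection = design_tc(rankings, k, min_votes=len(rankings))  # intersection

print(ct_median, ct_mean, tc_union, tc_multi, tc_intersection)
```

On this toy input the union keeps the most features and the intersection the fewest, which shows why the choice of combination rule in Design TC changes the size of the selected feature set even at a fixed threshold.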
