การเปรียบเทียบวิธีการคัดเลือกตัวแปรแบบรวมกลุ่ม สำหรับข้อมูลที่มีลักษณะการจำแนกแบบไบนารี

กรชนก ชมเชย

DSpace Home
→
Faculty and Institute
→
Faculty of Commerce and Accountancy - Acctn
→
Acctn - Theses
→
View Item

dc.contributor.advisor	ณัตติฤดี เจริญรักษ์
dc.contributor.author	กรชนก ชมเชย
dc.contributor.other	จุฬาลงกรณ์มหาวิทยาลัย. คณะพาณิชยศาสตร์และการบัญชี
dc.date.accessioned	2023-08-04T06:41:29Z
dc.date.available	2023-08-04T06:41:29Z
dc.date.issued	2565
dc.identifier.uri	https://cuir.car.chula.ac.th/handle/123456789/82729
dc.description	วิทยานิพนธ์ (วท.ม.)--จุฬาลงกรณ์มหาวิทยาลัย, 2565
dc.description.abstract	งานศึกษานี้เปรียบเทียบวิธีการคัดเลือกตัวแปรแบบเดียว (Single-Feature Selection) และแบบรวมกลุ่ม (Ensemble Feature Selection) ซึ่งแบ่งเป็น 2 รูปแบบคือ รูปแบบการรวมลำดับความสำคัญของตัวแปรแล้วตามด้วยการเลือกจำนวนตัวแปรที่มีความสำคัญตามเกณฑ์ที่ระบุ (Design CT: Combination followed by Thresholding) และรูปแบบการการเลือกจำนวนตัวแปรที่มีความสำคัญตามเกณฑ์ที่ระบุแล้วตามด้วยการรวมเซตของตัวแปรที่มีความสำคัญดังกล่าว (Design TC: Thresholding followed by Combination) ผู้ศึกษาได้ใช้การคัดเลือกตัวแปรจากประเภท Filter Wrapper และ Embedded โดยใช้ 10-fold cross validation ในการเปรียบเทียบค่าเฉลี่ยของ F1-score แทนประสิทธิภาพการทำนายและค่าเบี่ยงเบนของ F1-score แทนค่าความเสถียรของการทำนาย ผ่านข้อมูล 3 ชุดได้แก่ Parkinson's Disease dataset (จำนวนตัวแปรต้น(P)=ขนาดข้อมูล(N)), LSVT Voice Rehabilitation dataset (P>N) และ Colon Cancer dataset (P>>N) ใช้ XGBoost เป็นตัวแบบทำนาย จากการศึกษาภายใต้ขอบเขตดังกล่าวพบว่า การคัดเลือกตัวแปรแบบวิธีเดียวด้วย RFE จะให้ผลดีในชุดข้อมูลที่มีมิติมาก P>>N ในเกณฑ์ 2.5% 5% และ 10% แต่การคัดเลือกแบบรวมกลุ่มจะให้ผลการทำนายที่ต่างกันภายใต้ลักษณะมิติของชุดข้อมูลและเกณฑ์ที่เลือกใช้ สำหรับการรวมลำดับความสำคัญของตัวแปรในรูปแบบ Design CT ด้วยค่ากลางและค่าเฉลี่ยเลขคณิตที่เกณฑ์ log2(P) จะให้ผลการทำนายดีกว่าวิธีอื่นใน Design CT ในชุดข้อมูล P>>N แต่สำหรับชุดข้อมูล P=N และ P>N ผลการทำนายจากแต่ละวิธีใน Design CT เพิ่มประสิทธิภาพการทำนายเล็กน้อย และสำหรับ Design TC การรวมเซตของตัวแปรต้นที่มีความสำคัญด้วยวิธีอินเตอร์เซกและมัลติอินเตอร์เซกจะให้ผลดีกว่าวิธียูเนียน สำหรับชุดข้อมูล P>>N ในทุกเกณฑ์ การรวมวิธีมัลติอินเตอร์เซกใน log2(P) ที่ให้ผลดีกว่าวิธีคัดเลือกแบบอื่น ๆ ในชุดข้อมูล P>>N
dc.description.abstractalternative	This research study compares single-feature selection and two ensemble feature selection methods to examine their predictive performance and stability. The first method, called Design Combination followed by Thresholding (Design CT), and the second, named Design Thresholding followed by Combination (Design TC), are selected from the Filter, Wrapper, and Embedded categories of feature selection methods. The study compares the performance (Average F1-score) and stability (Standard deviation F1-score) of these methods using 10-fold cross-validation with three datasets: the Parkinson's Disease (P=N), the LSVT Voice Rehabilitation (P>N), and the Colon Cancer (P>>N), with an XGBoost model used for each dataset. The results can be summarized in three key findings. Firstly, when using single-feature selection, RFE performed well in high-dimensional P>>N dataset at 2.5%, 5% and 10% thresholds. Secondly, the Design CT method, using median and arithmetic mean for combination at log2(P) threshold, demonstrated better results than others Design CT methods in P>>N dataset. However, the results of the Design CT method resulted in only small improvements in average F1-scores for P=N and P>N datasets. Thirdly, the Design TC method, employing multi-intersection and intersection methods for combination, consistently provided superior results compared to the union method for P>>N dataset across all thresholds. Multi-intersection at log2(P) threshold provided the best result.
dc.language.iso	th
dc.publisher	จุฬาลงกรณ์มหาวิทยาลัย
dc.relation.uri	http://doi.org/10.58837/CHULA.THE.2022.953
dc.rights	จุฬาลงกรณ์มหาวิทยาลัย
dc.subject.classification	Computer Science
dc.subject.classification	Information and communication
dc.subject.classification	Statistics
dc.title	การเปรียบเทียบวิธีการคัดเลือกตัวแปรแบบรวมกลุ่ม สำหรับข้อมูลที่มีลักษณะการจำแนกแบบไบนารี
dc.title.alternative	A comparison of ensemble feature selection methods for binary classification datasets
dc.type	Thesis
dc.degree.name	วิทยาศาสตรมหาบัณฑิต
dc.degree.level	ปริญญาโท
dc.degree.discipline	สถิติ
dc.degree.grantor	จุฬาลงกรณ์มหาวิทยาลัย
dc.identifier.DOI	10.58837/CHULA.THE.2022.953