Comparison of Clustering Algorithms for Mixtures of Gaussian Distribution

จิรวรรณ ไพบูลย์วรชาติ

Please use this identifier to cite or link to this item: https://cuir.car.chula.ac.th/handle/123456789/43948

Title:	Comparison of clustering methods for data from a mixture of normal distributions
Other Titles:	COMPARISON OF CLUSTERING ALGORITHMS FOR MIXTURES OF GAUSSIAN DISTRIBUTION
Authors:	จิรวรรณ ไพบูลย์วรชาติ
Advisors:	นัท กุลวานิช
Other author:	Chulalongkorn University. Faculty of Commerce and Accountancy
Advisor's Email:	nat.kulvanich@gmail.com
Subjects:	Normal distribution; Statistical analysis; Gaussian distribution
Issue Date:	2013 (B.E. 2556)
Publisher:	Chulalongkorn University
Abstract:	This research aims to compare the performance of four clustering methods: hierarchical clustering, k-means clustering, fuzzy c-means clustering, and EM-algorithm clustering. Data were simulated from mixtures of normal distributions in two cases: (1) non-spherical (elliptical) clusters and (2) spherical (isotropic) clusters. The simulated data contained 2, 3, or 4 overlapping clusters, with 2 or 3 variables and 50, 100, or 300 observations per cluster. The number of clusters requested from each method was set to 2, 3, and 4. Performance was compared using two measures: the Calinski and Harabasz index (Pseudo F) and the silhouette width. For the non-spherical case, each of the four methods performed well, depending on the situation. For the spherical case, EM-algorithm clustering performed well as the number of overlapping clusters and the average overlap rate increased.
Other Abstract:	The purpose of this research is to compare the efficiency of four clustering methods: hierarchical clustering, k-means clustering, fuzzy c-means clustering, and the expectation-maximization algorithm (EM clustering). Data are simulated from mixtures of Gaussian distributions in two cases, non-spherical and spherical. The simulated data have 2, 3, or 4 overlapping clusters and 2 or 3 variables, with 50, 100, or 300 observations per cluster. Clustering quality is assessed with two measures: the Calinski and Harabasz index (Pseudo F) and the silhouette width. When the data are simulated as non-spherical, all four methods are effective, depending on the situation. When the data are simulated as spherical, EM is the most effective method as the number of overlapping clusters and the average overlap increase.
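Illustrative sketch (not taken from the thesis): the two evaluation measures named in the abstract can be computed roughly as follows. The function names, cluster centres, standard deviation, and seed are assumptions chosen for illustration, and the sketch scores the true simulated labels rather than the output of any of the four clustering methods.

```python
import math
import random


def simulate_spherical_mixture(centers, sigma, n_per_cluster, seed=0):
    """Draw n_per_cluster points around each centre with isotropic Gaussian noise."""
    rng = random.Random(seed)
    points, labels = [], []
    for k, center in enumerate(centers):
        for _ in range(n_per_cluster):
            points.append([rng.gauss(c, sigma) for c in center])
            labels.append(k)
    return points, labels


def _centroid(rows):
    """Coordinate-wise mean of a list of points."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def _sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def calinski_harabasz(points, labels):
    """Pseudo F = [B / (k - 1)] / [W / (n - k)], with B and W the between- and
    within-cluster sums of squares. Needs at least two clusters."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    n, k = len(points), len(groups)
    overall = _centroid(points)
    between = within = 0.0
    for members in groups.values():
        c = _centroid(members)
        between += len(members) * _sq_dist(c, overall)
        within += sum(_sq_dist(p, c) for p in members)
    return (between / (k - 1)) / (within / (n - k))


def mean_silhouette(points, labels):
    """Average of s(i) = (b - a) / max(a, b), where a is the mean distance to
    the point's own cluster and b the mean distance to the nearest other one."""
    groups = {}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, []).append(i)
    total = 0.0
    for i, lab in enumerate(labels):
        own = groups[lab]
        if len(own) == 1:
            continue  # a singleton contributes 0 by convention
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(math.dist(points[i], points[j]) for j in other) / len(other)
            for m, other in groups.items()
            if m != lab
        )
        total += (b - a) / max(a, b)
    return total / len(points)


if __name__ == "__main__":
    # Two spherical clusters of 50 points each; the centres are hypothetical.
    pts, labs = simulate_spherical_mixture([(0.0, 0.0), (4.0, 0.0)], 1.0, 50)
    print("Pseudo F:", calinski_harabasz(pts, labs))
    print("Silhouette width:", mean_silhouette(pts, labs))
```

Higher values of either measure indicate tighter, better-separated clusters, so moving the assumed centres closer together should lower both scores.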
Description:	Thesis (M.Sc.)--Chulalongkorn University, 2013
Degree Name:	Master of Science
Degree Level:	Master's Degree
Degree Discipline:	Statistics
URI:	http://cuir.car.chula.ac.th/handle/123456789/43948
URI:	http://doi.org/10.14457/CU.the.2013.1398
metadata.dc.identifier.DOI:	10.14457/CU.the.2013.1398
Type:	Thesis
Appears in Collections:	Acctn - Theses

Files in This Item:

File	Description	Size	Format
5581518026.pdf		5.83 MB	Adobe PDF
