An unsupervised learning algorithm for robust clustering and estimating the feasible number of clusters

Ureerat Wattanachon

Please use this identifier to cite or link to this item: https://cuir.car.chula.ac.th/handle/123456789/13864

Title:	An unsupervised learning algorithm for robust clustering and estimating the feasible number of clusters
Other Titles:	ขั้นตอนวิธีการเรียนรู้แบบไม่มีผู้สอนสำหรับการเกาะกลุ่มข้อมูลอย่างทนทานและการประมาณจำนวนกลุ่มที่เป็นไปได้
Authors:	Ureerat Wattanachon
Advisors:	Chidchanok Lursinsap
Other author:	Chulalongkorn University. Faculty of Science
Advisor's Email:	lchidcha@chula.ac.th
Subjects:	Cluster analysis Learning -- Computer simulation
Issue Date:	2006
Publisher:	Chulalongkorn University
Abstract:	Data clustering is a discovery process that groups a set of data such that the data points in the same cluster are as similar as possible and the data points of different clusters are as dissimilar as possible. Existing clustering algorithms, such as single-link clustering, k-means, CURE, and CSM are designed to find clusters based on pre-defined parameters specified by users. These algorithms can breakdown if the choice of parameters is incorrect with respect to the data set being clustered. Most of these algorithms work very well for compact and hyperspherical clusters. In this dissertation, the new hybrid clustering algorithm called “Self-Partition and Self-Merging” (SPSM) is proposed. The SPSM algorithm partitions the input data set into several subclusters in the first phase, and then removes the noisy data and the noisy subclusters in the second phase. In the third phase, the dense subclusters are continuously merged to form the larger clusters based on the inter-distance and intra-distance criteria. From the experimental results, the SPSM algorithm is very efficient to handle the noisy data set. Moreover, the SPSM algorithm is able to cluster the data sets of arbitrary shapes and different density very efficiently, tolerate to noise, and provide better clustering results than the existing clustering algorithms. The computational complexity of the SPSM algorithm is O(N [superscript 2]), where N is the number of data points.
Other Abstract:	การเกาะกลุ่มข้อมูลเป็นวิธีการที่ใช้ในการแบ่งกลุ่มเซตของข้อมูล โดยที่ข้อมูลที่อยู่ในกลุ่มเดียวกันจะมีลักษณะที่คล้ายคลึงกันและข้อมูลที่อยู่ต่างกลุ่มกัน จะมีลักษณะที่แตกต่างกันที่สุดที่เป็นไปได้ ซึ่งขั้นตอนวิธีการเกาะกลุ่มข้อมูลที่มีอยู่ อาทิเช่น single-link, k-mean, CURE และ CSM ได้ออกแบบให้มีการแบ่งกลุ่มข้อมูลโดยขึ้นอยู่กับค่าพารามิเตอร์ที่ต้องกำหนดโดยผู้ใช้ ขั้นตอนวิธีการเหล่านี้ไม่สามารถแบ่งกลุ่มข้อมูลได้ถูกต้อง ถ้าค่าพารามิเตอร์ที่ต้องกำหนดไม่เหมาะสมกับลักษณะของข้อมูลที่จะทำการแบ่งกลุ่ม ขั้นตอนวิธีเหล่านี้ส่วนใหญ่สามารถแบ่งกลุ่มได้ดีกับข้อมูลที่มีลักษณะการเกาะตัวภายในกลุ่มที่แน่นและเป็นทรงกลม ดังนั้นวิทยานิพนธ์นี้จึงเป็นการนำเสนอขั้นตอนวิธีการเกาะกลุ่มแบบผสมผสานแบบใหม่ที่เรียกว่า “Self-Partition and Self-Merging” หรือ เอสพีเอสเอ็ม (SPSM) ซึ่งในเฟสแรก ขั้นตอนวิธีเอสพีเอสเอ็มจะทำการแบ่งกลุ่มเซตของข้อมูลออกเป็นกลุ่มย่อยเล็กๆ หลายกลุ่ม จากนั้นในเฟสสองจะทำการตัดข้อมูลและกลุ่มย่อยเล็กๆ ที่เป็นสัญญาณรบกวนออกไป ส่วนในเฟสสามกลุ่มย่อยเล็กๆ ที่มีลักษณะการเกาะตัวของข้อมูลที่แน่น จะทำการรวมกลุ่มเหล่านี้เข้าด้วยกันเรื่อยๆ ให้ได้กลุ่มที่ใหญ่ขึ้น โดยการรวมกลุ่มจะพิจารณาจากระยะห่างระหว่างกลุ่มและระยะห่างภายในกลุ่ม จากผลการทดลองพบว่า ขั้นตอนวิธีเอสพีเอสเอ็ม มีประสิทธิภาพในการจัดการกับข้อมูลที่มีสัญญาณรบกวน นอกจากนี้ ขั้นตอนวิธีเอสพีเอสเอ็มยังสามารถแบ่งกลุ่มเซตของข้อมูลที่มีลักษณะรูปร่างและความหนาแน่นของข้อมูลที่แตกต่างกัน รวมทั้งยังคงทนต่อข้อมูลที่มีสัญญาณรบกวน และให้ผลการเกาะกลุ่มที่ดีกว่าการเกาะกลุ่มจากขั้นตอนวิธีการเกาะกลุ่มอื่นๆ ส่วนความซับซ้อนของเวลาที่ใช้ของขั้นตอนวิธีเอสพีเอสเอ็ม คือ O(N [superscript 2]) เมื่อ N แทนจำนวนข้อมูลทั้งหมด
Description:	Thesis (Ph.D.)--Chulalongkorn University, 2006
Degree Name:	Doctor of Philosophy
Degree Level:	Doctoral Degree
Degree Discipline:	Computer Science
URI:	http://cuir.car.chula.ac.th/handle/123456789/13864
URI:	http://doi.org/10.14457/CU.the.2006.1788
ISBN:	9741434251
metadata.dc.identifier.DOI:	10.14457/CU.the.2006.1788
Type:	Thesis
Appears in Collections:	Sci - Theses

Files in This Item:

File	Description	Size	Format
Ureerat_Wa.pdf		10.44 MB	Adobe PDF	View/Open

Show full item record