Design and development of a Thai-OCR program

ชาญฤทธิ์ สันตินานาเลิศ

Please use this identifier to cite or link to this item: https://cuir.car.chula.ac.th/handle/123456789/11679

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	บุญเสริม กิจศิริกุล	-
dc.contributor.author	ชาญฤทธิ์ สันตินานาเลิศ	-
dc.contributor.other	Chulalongkorn University. Faculty of Engineering	-
dc.date.accessioned	2009-11-13T04:20:21Z	-
dc.date.available	2009-11-13T04:20:21Z	-
dc.date.issued	2542 (B.E.; 1999 C.E.)	-
dc.identifier.isbn	9743338721	-
dc.identifier.uri	http://cuir.car.chula.ac.th/handle/123456789/11679	-
dc.description	Thesis (M.Sc.)--Chulalongkorn University, 2542 B.E. (1999)	en
dc.description.abstract	วิทยานิพนธ์ฉบับนี้มีวัตถุประสงค์ เพื่อออกแบบและพัฒนาโปรแกรมโอซีอาร์ภาษาไทย เพื่อใช้ในการรู้จำตัวอักษรพิมพ์ในเอกสารภาษาไทยที่พิมพ์จากเครื่องคอมพิวเตอร์ด้วยแบบตัวอักษรมาตรฐานวิทยานิพนธ์ฉบับนี้นำเสนอวิธีการต่างๆ เพื่อใช้ในโปรแกรมโอซีอาร์ภาษาไทยคือ วิธีการประมวลผลภาพ, วิธีการตัดแยกตัวอักษร, วิธีการแยกลักษณะสำคัญของตัวอักษรแบบ เค-แอล ทรานส์ฟอร์ม, วิธีการแยกแยะตัวอักษรแบบแบคพรอพาเกชันนิวรอลเน็ตเวิร์ก และวิธีการแก้ไขคำที่สะกดผิดแบบไตรแกรมของประเภทของคำ ขั้นตอนในการทำงานของโปรแกรมโอซีอาร์ภาษาไทยที่พัฒนาขึ้นนี้ประกอบด้วย ขั้นตอนการนำเอกสารเข้าสู่โปรแกรม, ขั้นตอนการประมวลผลภาพ, ขั้นตอนการตัดแยกบรรทัด, ขั้นตอนการตัดแยกตัวอักษร, ขั้นตอนการรู้จำตัวอักษร, ขั้นตอนการแก้ไขผลลัพธ์ที่ได้จากขั้นตอนการรู้จำ, ขั้นตอนการสร้างบรรทัดและขั้นตอนการแก้ไขคำผิด ในวิทยานิพนธ์ฉบับนี้ ได้นำภาพตัวอักษรและภาพของเอกสารที่ได้จากการพิมพ์ด้วยเครื่องพิมพ์เลเซอร์ที่ความละเอียด 600 จุดต่อนิ้ว นำเอกสารมาอ่านผ่านเครื่องสแกนเนอร์ที่ความละเอียด 300 จุดต่อนิ้ว ซึ่งประกอบด้วยตัวอักษรแบบ AngsanaUPC, BrowalliaUPC, CordiaUPC, DilleniaUPC, EucrosiaUPC และ FreesiaUPC แต่ละแบบประกอบด้วยตัวอักษรขนาด 14, 16, 18, 20, 22, 24, 28 และ 36 จุด โดยในการเรียนรู้นั้นใช้ภาพของตัวอักษรจำนวน 8544 ตัวอักษร และในการทดสอบการรู้จำใช้ภาพของเอกสารจำนวน 48 เอกสาร ซึ่งประกอบด้วยตัวอักษรจำนวน 71832 ตัวอักษร ได้ผลการรู้จำซึ่งยังไม่ได้แก้ไขคำผิดมีความผิดพลาดเฉลี่ยร้อยละ 1.85 ผลการรู้จำหลังจากแก้ไขคำผิดที่ไม่เป็นคำแล้วมีความผิดพลาดเฉลี่ยร้อยละ 1.47 และผลการรู้จำหลังจากแก้ไขคำผิดที่ไม่เป็นคำและคำผิดที่เป็นคำแล้วมีความผิดพลาดเฉลี่ยร้อยละ 1.50	en
dc.description.abstractalternative	The objective of this thesis is to design and develop a Thai optical character recognition (Thai-OCR) program for recognizing printed characters in Thai documents printed from a computer in standard fonts. The thesis presents several methods for Thai-OCR: image pre-processing, character segmentation, the K-L transform for feature extraction, backpropagation neural networks for character classification, and part-of-speech trigrams (POS trigrams) for error correction. The developed Thai-OCR program works in eight steps: image acquisition, image processing, line segmentation, character segmentation, character recognition, character correction, text line reconstruction, and error correction. Character and document images were printed on a laser printer at 600 dots per inch and then scanned at 300 dots per inch. They comprise characters in 6 fonts (AngsanaUPC, BrowalliaUPC, CordiaUPC, DilleniaUPC, EucrosiaUPC, and FreesiaUPC), each in sizes of 14, 16, 18, 20, 22, 24, 28, and 36 points. Training used 8544 character images, and testing used 48 document images containing 71832 characters. The average recognition error rate is 1.85% without error correction, 1.47% with non-word error correction, and 1.50% with both non-word and real-word error correction.	en
dc.format.extent	811055 bytes	-
dc.format.extent	865115 bytes	-
dc.format.extent	902103 bytes	-
dc.format.extent	824616 bytes	-
dc.format.extent	789663 bytes	-
dc.format.extent	702356 bytes	-
dc.format.extent	2128537 bytes	-
dc.format.mimetype	application/pdf	-
dc.format.mimetype	application/pdf	-
dc.format.mimetype	application/pdf	-
dc.format.mimetype	application/pdf	-
dc.format.mimetype	application/pdf	-
dc.format.mimetype	application/pdf	-
dc.format.mimetype	application/pdf	-
dc.language.iso	th	es
dc.publisher	Chulalongkorn University	en
dc.rights	Chulalongkorn University	en
dc.subject	Thai language -- Alphabet	en
dc.subject	Image processing	en
dc.subject	Neural networks (Computer science)	en
dc.subject	Character recognition (Computers)	en
dc.subject	Backpropagation (Artificial intelligence)	en
dc.subject	Optical character recognition	en
dc.title	การออกแบบและพัฒนาโปรแกรมโอซีอาร์ภาษาไทย	en
dc.title.alternative	Design and development of a Thai-OCR program	en
dc.type	Thesis	es
dc.degree.name	Master of Science	es
dc.degree.level	Master's degree	es
dc.degree.discipline	Computer Science	es
dc.degree.grantor	Chulalongkorn University	en
dc.email.advisor	boonserm@cp.eng.chula.ac.th, Boonserm.K@chula.ac.th	-
Appears in Collections:	Eng - Theses
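
The figures quoted in the abstract can be cross-checked with a little arithmetic. The sketch below is illustrative only and is not taken from the thesis: the function names are hypothetical, the even split of training images across font and size combinations is an assumption, and the error counts are approximations obtained by applying the reported average rates to the total number of test characters.

```python
# Figures quoted in the abstract of this record.
FONTS = ["AngsanaUPC", "BrowalliaUPC", "CordiaUPC",
         "DilleniaUPC", "EucrosiaUPC", "FreesiaUPC"]
SIZES_PT = [14, 16, 18, 20, 22, 24, 28, 36]
TRAINING_CHARS = 8544
TEST_DOCS = 48
TEST_CHARS = 71832
ERROR_RATE_PCT = {
    "no correction": 1.85,
    "non-word correction": 1.47,
    "non-word + real-word correction": 1.50,
}


def font_size_combinations():
    """Number of (font, size) pairs covered by the data."""
    return len(FONTS) * len(SIZES_PT)


def training_chars_per_combination():
    """Training images per (font, size) pair.

    Assumption (not stated in the abstract): the training images are
    split evenly across all font and size combinations.
    """
    return TRAINING_CHARS / font_size_combinations()


def approx_error_count(rate_pct):
    """Approximate number of misrecognized test characters.

    Assumption: the reported average rate is applied to the pooled
    character total; a per-document average would give slightly
    different counts.
    """
    return round(TEST_CHARS * rate_pct / 100)


if __name__ == "__main__":
    print("font/size combinations:", font_size_combinations())
    print("training characters per combination:",
          training_chars_per_combination())
    for label, rate in ERROR_RATE_PCT.items():
        print(label, "->", approx_error_count(rate), "characters (approx.)")
```

Under these assumptions, the 6 fonts and 8 sizes give 48 combinations, which matches the 48 test documents, and the 8544 training images divide evenly into 178 per combination. The reported rates correspond to roughly 1,329, 1,056, and 1,077 misrecognized characters respectively, so adding real-word correction on top of non-word correction changes the outcome by only about 21 characters.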

Files in This Item:

File	Size	Format
Charnlit_Sa_front.pdf	792.05 kB	Adobe PDF
Charnlit_Sa_ch1.pdf	844.84 kB	Adobe PDF
Charnlit_Sa_ch2.pdf	880.96 kB	Adobe PDF
Charnlit_Sa_ch3.pdf	805.29 kB	Adobe PDF
Charnlit_Sa_ch4.pdf	771.16 kB	Adobe PDF
Charnlit_Sa_ch5.pdf	685.89 kB	Adobe PDF
Charnlit_Sa_back.pdf	2.08 MB	Adobe PDF