word-based language model
word segmentation algorithm
|Issue Date: ||2005|
|Abstract: ||OCR (Optical Character Recognition) is a convenient and effective text recognition technology that plays an increasingly important role in office automation, information retrieval, and digital libraries. Because the structure of characters and their images is complex and variable, the recognition rate for isolated characters is limited. To improve it, information beyond the image itself is needed to post-process the OCR results. Language models are widely used in OCR post-processing, especially for Chinese text. This thesis analyzes the language model and related algorithms used in the earlier system, discusses both character-based and word-based language models, and presents the advantages and disadvantages of each. Based on this analysis, a word-based language model is adopted in place of the character-based one, and a word segmentation method based on multiple sources of information is proposed. For the graph search, an N-best search algorithm replaces the Viterbi algorithm. The test data fall into two categories: one test set without segmentation errors, containing 15,000 handwritten Chinese addresses; and three test sets with segmentation errors, containing 58,269 handwritten Chinese addresses in total. With these improvements, the whole-address recognition rate on the set without segmentation errors rises from 83.73% to 96.84%, an 80.58% reduction in errors; on the sets with segmentation errors it rises from 28.56% to 74.15%, a 63.82% reduction in errors. These changes greatly improve the performance of the OCR system.|
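The error-reduction figures in the abstract follow from the reported recognition rates if "error rate" is read as 100% minus the whole-address recognition rate. The sketch below checks that arithmetic under this assumption; the function name `relative_error_reduction` is illustrative and does not come from the thesis.

```python
def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Return the percentage of errors eliminated.

    Both arguments are recognition rates in percent. The error rate is
    assumed to be 100 minus the recognition rate.
    """
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return 100.0 * (err_before - err_after) / err_before


# Test set without segmentation errors: 83.73% -> 96.84%
clean = relative_error_reduction(83.73, 96.84)
# Test sets with segmentation errors: 28.56% -> 74.15%
noisy = relative_error_reduction(28.56, 74.15)

print(f"{clean:.2f}")  # 80.58
print(f"{noisy:.2f}")  # 63.82
```

Both values match the reductions stated in the abstract, which suggests the reported percentages are relative reductions in the whole-address error rate.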
|Source URI: ||http://oaps.lib.tsinghua.edu.cn/handle/123456789/121|
|Source Fulltext: ||http://oaps.lib.tsinghua.edu.cn/bitstream/123456789/121/1/024%e9%be%99%e7%bf%802001011607-2003.pdf|
|Appears in Collections:||Outstanding Thesis of Undergraduate Students (2005)|