An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition
Abstract
Image-based sequence recognition has been a long-standing research topic in computer vision. In this paper, we investigate the problem of scene text recognition, which is among the most important and challenging tasks in image-based sequence recognition. A novel neural network architecture, which integrates feature extraction, sequence modeling and transcription into a unified framework, is proposed. Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties: (1) It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned. (2) It naturally handles sequences of arbitrary lengths, involving no character segmentation or horizontal scale normalization. (3) It is not confined to any predefined lexicon and achieves remarkable performance in both lexicon-free and lexicon-based scene text recognition tasks. (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios. The experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets, demonstrate the superiority of the proposed algorithm over the prior arts. Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which evidently verifies its generality.
1. Introduction
Recently, the community has seen a strong revival of neural networks, which is mainly stimulated by the great success of deep neural network models, specifically Deep Convolutional Neural Networks (DCNN), in various vision tasks. However, the majority of recent works related to deep neural networks have been devoted to the detection or classification of object categories [12, 25]. In this paper, we are concerned with a classic problem in computer vision: image-based sequence recognition. In the real world, a stable of visual objects, such as scene text, handwriting and musical scores, tend to occur in the form of sequences, not in isolation. Unlike general object recognition, recognizing such sequence-like objects often requires the system to predict a series of object labels, instead of a single label. Therefore, recognition of such objects can be naturally cast as a sequence recognition problem. Another unique property of sequence-like objects is that their lengths may vary drastically. For instance, English words can consist of either 2 characters such as "OK" or 15 characters such as "congratulations". Consequently, the most popular deep models like DCNN [25, 26] cannot be directly applied to sequence prediction, since DCNN models often operate on inputs and outputs with fixed dimensions, and thus are incapable of producing a variable-length label sequence.
Some attempts have been made to address this problem for specific sequence-like objects (e.g. scene text). For example, the algorithms in [35, 8] first detect individual characters and then recognize these detected characters with DCNN models, which are trained using labeled character images. Such methods often require training a strong character detector for accurately detecting and cropping each character out from the original word image. Some other approaches (such as [22]) treat scene text recognition as an image classification problem, and assign a class label to each English word (90k words in total). This results in a large trained model with a huge number of classes, which is difficult to generalize to other types of sequence-like objects, such as Chinese texts, musical scores, etc., because the number of basic combinations of such sequences can be greater than 1 million. In summary, current systems based on DCNN cannot be directly used for image-based sequence recognition.
Recurrent neural network (RNN) models, another important branch of the deep neural network family, were mainly designed for handling sequences. One of the advantages of RNN is that it does not need the position of each element in a sequence object image in either training or testing. However, a preprocessing step that converts an input object image into a sequence of image features is usually essential. For example, Graves et al. [16] extract a set of geometrical or image features from handwritten texts, while Su and Lu [33] convert word images into sequential HOG features. The preprocessing step is independent of the subsequent components in the pipeline, thus the existing systems based on RNN cannot be trained and optimized in an end-to-end fashion.
Several conventional scene text recognition methods that are not based on neural networks have also brought insightful ideas and novel representations into this field. For example, Almazán et al. [5] and Rodriguez-Serrano et al. [30] proposed to embed word images and text strings in a common vectorial subspace, converting word recognition into a retrieval problem. Yao et al. [36] and Gordo et al. [14] used mid-level features for scene text recognition. Though they achieved promising performance on standard benchmarks, these methods are generally outperformed by previous algorithms based on neural networks [8, 22], as well as the approach proposed in this paper.
The main contribution of this paper is a novel neural network model, whose network architecture is specifically designed for recognizing sequence-like objects in images. The proposed neural network model is named Convolutional Recurrent Neural Network (CRNN), since it is a combination of DCNN and RNN. For sequence-like objects, CRNN possesses several distinctive advantages over conventional neural network models: 1) It can be directly learned from sequence labels (for instance, words), requiring no detailed annotations (for instance, characters); 2) It has the same property as DCNN of learning informative representations directly from image data, requiring neither hand-crafted features nor preprocessing steps, including binarization/segmentation, component localization, etc.; 3) It has the same property as RNN, being able to produce a sequence of labels; 4) It is unconstrained by the lengths of sequence-like objects, requiring only height normalization in both training and testing phases; 5) It achieves better or highly competitive performance on scene texts (word recognition) than the prior arts [23, 8]; 6) It contains far fewer parameters than a standard DCNN model, consuming less storage space.
2. The Proposed Network Architecture
The network architecture of CRNN, as shown in Fig. 1, consists of three components, including the convolutional layers, the recurrent layers, and a transcription layer, from bottom to top.
At the bottom of CRNN, the convolutional layers automatically extract a feature sequence from each input image. On top of the convolutional network, a recurrent network is built for making predictions for each frame of the feature sequence output by the convolutional layers. The transcription layer at the top of CRNN is adopted to translate the per-frame predictions by the recurrent layers into a label sequence. Though CRNN is composed of different kinds of network architectures (e.g. CNN and RNN), it can be jointly trained with one loss function.
Figure 1. The network architecture. The architecture consists of three parts: 1) convolutional layers, which extract a feature sequence from the input image; 2) recurrent layers, which predict a label distribution for each frame; 3) transcription layer, which translates the per-frame predictions into the final label sequence.
2.1. Feature Sequence Extraction
In the CRNN model, the component of convolutional layers is constructed by taking the convolutional and max-pooling layers from a standard CNN model (fully-connected layers are removed). This component is used to extract a sequential feature representation from an input image. Before being fed into the network, all the images need to be scaled to the same height. Then a sequence of feature vectors is extracted from the feature maps produced by the component of convolutional layers, which is the input for the recurrent layers. Specifically, each feature vector of a feature sequence is generated from left to right on the feature maps by column. This means the i-th feature vector is the concatenation of the i-th columns of all the maps. The width of each column in our settings is fixed to a single pixel.
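To make this conversion concrete, below is a minimal NumPy sketch of the column-wise map-to-sequence operation, assuming the feature maps are stored as an array of shape (C, H, W); the function name and shapes are illustrative, not the paper's implementation:

```python
import numpy as np

def maps_to_sequence(feature_maps: np.ndarray) -> np.ndarray:
    """Convert convolutional feature maps of shape (C, H, W) into a
    feature sequence of shape (T, D) with T = W and D = C * H.

    The i-th feature vector is the concatenation of the i-th columns
    of all C maps, so the sequence runs left to right over the image.
    """
    c, h, w = feature_maps.shape
    # (C, H, W) -> (W, C, H) -> (W, C * H): one vector per one-pixel-wide column.
    return feature_maps.transpose(2, 0, 1).reshape(w, c * h)

# Example: 512 maps of height 1 and width 26 yield 26 frames of dimension 512.
sequence = maps_to_sequence(np.random.rand(512, 1, 26).astype(np.float32))
assert sequence.shape == (26, 512)
```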
As the layers of convolution, max-pooling, and element-wise activation function operate on local regions, they are translation invariant. Therefore, each column of the feature maps corresponds to a rectangular region of the original image (termed the receptive field), and such rectangular regions are in the same order as their corresponding columns on the feature maps from left to right. As illustrated in Fig. 2, each vector in the feature sequence is associated with a receptive field, and can be considered as the image descriptor for that region.
Figure 2. The receptive field. Each vector in the extracted feature sequence is associated with a receptive field on the input image, and can be considered as the feature vector of that field.
Being robust, rich and trainable, deep convolutional features have been widely adopted for different kinds of visual recognition tasks [25, 12]. Some previous approaches have employed CNN to learn a robust representation for sequence-like objects such as scene text [22]. However, these approaches usually extract a holistic representation of the whole image by CNN, then collect local deep features for recognizing each component of a sequence-like object. Since CNN requires the input images to be scaled to a fixed size in order to satisfy its fixed input dimension, it is not appropriate for sequence-like objects due to their large length variation. In CRNN, we convey deep features into sequential representations in order to be invariant to the length variation of sequence-like objects.
2.2. Sequence Labeling
A deep bidirectional Recurrent Neural Network is built on top of the convolutional layers, as the recurrent layers. The recurrent layers predict a label distribution $y_t$ for each frame $x_t$ in the feature sequence $\mathbf{x} = x_1, \dots, x_T$. The advantages of the recurrent layers are three-fold. Firstly, RNN has a strong capability of capturing contextual information within a sequence. Using contextual cues for image-based sequence recognition is more stable and helpful than treating each symbol independently. Taking scene text recognition as an example, wide characters may require several successive frames to describe fully (refer to Fig. 2). Besides, some ambiguous characters are easier to distinguish when observing their contexts; e.g. it is easier to recognize “il” by contrasting the character heights than by recognizing each of them separately. Secondly, RNN can back-propagate error differentials to its input, i.e. the convolutional layers, allowing us to jointly train the recurrent layers and the convolutional layers in a unified network. Thirdly, RNN is able to operate on sequences of arbitrary lengths, traversing from start to end.
Figure 3. (a) The structure of a basic LSTM unit. An LSTM consists of a cell module and three gates, namely the input gate, the output gate and the forget gate. (b) The structure of the deep bidirectional LSTM we use in our paper. Combining a forward (left-to-right) and a backward (right-to-left) LSTM results in a bidirectional LSTM. Stacking multiple bidirectional LSTMs results in a deep bidirectional LSTM.
A traditional RNN unit has a self-connected hidden layer between its input and output layers. Each time it receives a frame $x_t$ in the sequence, it updates its internal state $h_t$ with a non-linear function that takes both the current input $x_t$ and the past state $h_{t-1}$ as its inputs: $h_t = g(x_t, h_{t-1})$. Then the prediction $y_t$ is made based on $h_t$. In this way, past contexts $\{x_{t'}\}_{t' < t}$ are captured and utilized for prediction. The traditional RNN unit, however, suffers from the vanishing gradient problem [7], which limits the range of context it can store, and adds burden to the training process. Long Short-Term Memory [18, 11] (LSTM) is a type of RNN unit that is specially designed to address this problem. An LSTM (illustrated in Fig. 3) consists of a memory cell and three multiplicative gates, namely the input, output and forget gates. Conceptually, the memory cell stores the past contexts, and the input and output gates allow the cell to store contexts for a long period of time. Meanwhile, the memory in the cell can be cleared by the forget gate. The special design of LSTM allows it to capture long-range dependencies, which often occur in image-based sequences.
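For reference, a standard formulation of the LSTM update (the paper describes the gates only conceptually; the variant below, without peephole connections, is one common choice):

$$
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i), \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f), \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o), \\
\tilde{c}_t &= \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \\
h_t &= o_t \odot \tanh(c_t),
\end{aligned}
$$

where $\sigma$ is the logistic sigmoid, $\odot$ denotes element-wise multiplication, and $i_t$, $f_t$, $o_t$ are the input, forget and output gates.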
LSTM is directional; it only uses past contexts. However, in image-based sequences, contexts from both directions are useful and complementary to each other. Therefore, we follow [17] and combine two LSTMs, one forward and one backward, into a bidirectional LSTM. Furthermore, multiple bidirectional LSTMs can be stacked, resulting in a deep bidirectional LSTM as illustrated in Fig. 3.b. The deep structure allows a higher level of abstraction than a shallow one, and has achieved significant performance improvements in the task of speech recognition [17].
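As an illustration, here is a minimal PyTorch sketch of such a deep bidirectional LSTM with a per-frame projection onto the label set; the layer count, hidden size and class count are assumptions for the example, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class SequenceLabeler(nn.Module):
    """Stacked bidirectional LSTMs followed by a per-frame linear
    projection onto |L'| classes (the label set plus one CTC 'blank')."""

    def __init__(self, feat_dim: int = 512, hidden: int = 256, num_classes: int = 37):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=2, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, num_classes)  # forward + backward states

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, N, feat_dim) feature sequence from the convolutional layers.
        h, _ = self.rnn(x)    # (T, N, 2 * hidden)
        return self.proj(h)   # (T, N, num_classes): per-frame scores for y_t

# Example: 26 frames, batch of 4, 512-d features -> 26 label distributions.
y = SequenceLabeler()(torch.randn(26, 4, 512))
assert y.shape == (26, 4, 37)
```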
In the recurrent layers, error differentials are propagated in the opposite directions of the arrows shown in Fig. 3.b, i.e. Back-Propagation Through Time (BPTT). At the bottom of the recurrent layers, the sequence of propagated differentials is concatenated into maps, inverting the operation of converting feature maps into feature sequences, and fed back to the convolutional layers. In practice, we create a custom network layer, called "Map-to-Sequence", as the bridge between the convolutional layers and the recurrent layers.
2.3. Transcription
Transcription is the process of converting the per-frame predictions made by the RNN into a label sequence. Mathematically, transcription is to find the label sequence with the highest probability conditioned on the per-frame predictions. In practice, there exist two modes of transcription, namely lexicon-free and lexicon-based transcription. A lexicon is a set of label sequences to which a prediction is constrained, e.g. a spell-checking dictionary. In lexicon-free mode, predictions are made without any lexicon. In lexicon-based mode, predictions are made by choosing the label sequence in the lexicon that has the highest probability.
2.3.1 Probability of label sequence
We adopt the conditional probability defined in the Connectionist Temporal Classification (CTC) layer proposed by Graves et al. [15]. The probability is defined for a label sequence $l$ conditioned on the per-frame predictions $y = y_1, \dots, y_T$, and it ignores the position where each label in $l$ is located. Consequently, when we use the negative log-likelihood of this probability as the objective to train the network, we only need images and their corresponding label sequences, avoiding the labor of labeling the positions of individual characters.
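In modern frameworks this objective is available off the shelf; for instance, a sketch of training with the CTC negative log-likelihood via PyTorch's nn.CTCLoss might look as follows (the shapes — 26 frames, batch 4, 37 classes, length-8 targets — are illustrative, not the paper's setup):

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

# (T, N, |L'|) per-frame log-probabilities from the recurrent layers.
log_probs = torch.randn(26, 4, 37, requires_grad=True).log_softmax(2)
targets = torch.randint(1, 37, (4, 8))  # label sequences only, no positions
loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 26, dtype=torch.long),
           target_lengths=torch.full((4,), 8, dtype=torch.long))
loss.backward()  # gradients flow back into the recurrent and conv layers
```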
The formulation of the conditional probability is briefly described as follows. The input is a sequence $y = y_1, \dots, y_T$ where $T$ is the sequence length. Here, each $y_t \in \mathbb{R}^{|L'|}$ is a probability distribution over the set $L' = L \cup \{\text{blank}\}$, where $L$ contains all labels in the task (e.g. all English characters) and 'blank' is an extra no-label symbol. A sequence-to-sequence mapping function $\mathcal{B}$ is defined on sequences $\pi \in L'^T$. $\mathcal{B}$ maps $\pi$ onto $l$ by first removing the repeated labels, then removing the blanks. For example, $\mathcal{B}$ maps "--hh-e-l-ll-oo-" ('-' represents 'blank') onto "hello". The conditional probability is then defined as the sum of the probabilities of all $\pi$ that are mapped by $\mathcal{B}$ onto $l$:
$$p(l \mid y) = \sum_{\pi : \mathcal{B}(\pi) = l} p(\pi \mid y) \qquad (1)$$
where the probability of $\pi$ is defined as $p(\pi \mid y) = \prod_{t=1}^{T} y^t_{\pi_t}$, and $y^t_{\pi_t}$ is the probability of having label $\pi_t$ at time stamp $t$. Directly computing Eq. 1 would be computationally infeasible due to the exponentially large number of summation items. However, Eq. 1 can be efficiently computed using the forward-backward algorithm described in [15].
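To make the definitions concrete, the following self-contained Python sketch implements the mapping $\mathcal{B}$ and evaluates Eq. 1 by brute-force enumeration on a toy alphabet (exponential in $T$, so for illustration only; the forward-backward algorithm of [15] computes the same quantity efficiently):

```python
from itertools import product

def ctc_collapse(pi: str, blank: str = "-") -> str:
    """The mapping B: merge repeated labels, then drop blanks.
    E.g. "--hh-e-l-ll-oo-" -> "hello"."""
    out, prev = [], None
    for ch in pi:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

def p_label(label: str, y: list, alphabet: str) -> float:
    """Eq. 1 by brute force: sum p(pi|y) over all pi with B(pi) = label.
    y[t][i] is the probability of alphabet[i] at time stamp t."""
    total = 0.0
    for idx in product(range(len(alphabet)), repeat=len(y)):
        if ctc_collapse("".join(alphabet[i] for i in idx)) == label:
            p = 1.0
            for t, i in enumerate(idx):
                p *= y[t][i]
            total += p
    return total

# Toy example: alphabet {blank, h, e}, T = 3 frames, uniform predictions.
y = [[1 / 3] * 3] * 3
# Exactly 5 paths collapse to "he": "hhe", "hee", "h-e", "-he", "he-".
assert abs(p_label("he", y, "-he") - 5 / 27) < 1e-12
```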
2.3.2 Lexicon-free transcription
In this mode, the sequence $l^*$ that has the highest probability as defined in Eq. 1 is taken as the prediction. Since there exists no tractable algorithm to precisely find the solution, we use the strategy adopted in [15]: the sequence $l^*$ is approximately found by $l^* \approx \mathcal{B}(\arg\max_{\pi} p(\pi \mid y))$, i.e. taking the most probable label $\pi_t$ at each time stamp $t$ and mapping the resulting sequence onto $l^*$.
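A minimal sketch of this best-path decoding, assuming y is a (T, |L'|) NumPy array of per-frame probabilities with the blank at index 0:

```python
import numpy as np

def best_path_decode(y: np.ndarray, alphabet: str, blank: int = 0) -> str:
    """l* ~ B(argmax_pi p(pi|y)): take the most probable label at every
    time stamp, then collapse the path (merge repeats, drop blanks)."""
    path = y.argmax(axis=1)
    out, prev = [], blank
    for i in path:
        if i != prev and i != blank:
            out.append(alphabet[i])
        prev = i
    return "".join(out)

# Example with |L'| = 3 ('-' is the blank) and T = 5 frames.
probs = np.array([[0.1, 0.8, 0.1],   # h
                  [0.1, 0.7, 0.2],   # h (repeat, merged)
                  [0.8, 0.1, 0.1],   # -
                  [0.1, 0.1, 0.8],   # e
                  [0.7, 0.2, 0.1]])  # -
assert best_path_decode(probs, "-he") == "he"
```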
2.3.3 Lexicon-based transcription
In lexicon-based mode, each test sample is associated with a lexicon $\mathcal{D}$. Basically, the label sequence is recognized by choosing the sequence in the lexicon that has the highest conditional probability defined in Eq. 1, i.e. $l^* = \arg\max_{l \in \mathcal{D}} p(l \mid y)$. However, for large lexicons, e.g. the 50k-word Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Eq. 1 for all sequences in the lexicon and choose the one with the highest probability. To solve this problem, we observe that the label sequences predicted via lexicon-free transcription, described in 2.3.2, are often close to the ground-truth under the edit distance metric. This indicates that we can limit our search to the nearest-neighbor candidates $\mathcal{N}_{\delta}(l')$, where $\delta$ is the maximal edit distance and $l'$ is the sequence transcribed from $y$ in lexicon-free mode:
$$l^* = \arg\max_{l \in \mathcal{N}_{\delta}(l')} p(l \mid y) \qquad (2)$$

The candidates $\mathcal{N}_{\delta}(l')$ can be found efficiently with the BK-tree data structure [9], which is a metric tree specifically adapted to discrete metric spaces. The search time complexity of BK-tree is $O(\log |\mathcal{D}|)$, where $|\mathcal{D}|$ is the lexicon size. Therefore this scheme readily extends to very large lexicons. In our approach, a BK-tree is constructed offline for a lexicon. Then we perform fast online search with the tree, by finding sequences whose edit distance to the query sequence is less than or equal to $\delta$.
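The following self-contained Python sketch illustrates the idea with a Levenshtein edit distance, a small BK-tree, and a δ-bounded query; Eq. 2 would then be evaluated only on the returned candidates. The class layout and toy lexicon are illustrative assumptions, not the paper's code:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming, one row at a time."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

class BKTree:
    """Metric tree over the lexicon; built offline, queried online for all
    words within edit distance delta of the lexicon-free transcription l'."""

    def __init__(self, words):
        it = iter(words)
        self.root = [next(it), {}]  # [word, children keyed by distance]
        for w in it:
            node = self.root
            while True:
                d = edit_distance(w, node[0])
                if d == 0:
                    break
                if d in node[1]:
                    node = node[1][d]
                else:
                    node[1][d] = [w, {}]
                    break

    def query(self, target: str, delta: int):
        found, stack = [], [self.root]
        while stack:
            word, children = stack.pop()
            d = edit_distance(target, word)
            if d <= delta:
                found.append(word)
            # Triangle inequality: prune subtrees outside [d - delta, d + delta].
            stack += [c for k, c in children.items() if d - delta <= k <= d + delta]
        return found

# Offline: build the tree; online: retrieve N_delta(l') and rescore with Eq. 1.
tree = BKTree(["hello", "help", "hollow", "world"])
assert sorted(tree.query("helo", delta=1)) == ["hello", "help"]
```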
[The training details and experiments section is missing here; only table residue survives: scene-text lexicon settings ("50", "1k", "50k", "Full", "None") and OMR test sets ("Clean", "Synthesized", "Real-World"), evaluated by fragment accuracy / average edit distance.]
Tab. 4 summarizes the results. The CRNN outperforms the two commercial systems by a large margin. The Capella Scan and PhotoScore systems perform reasonably well on the Clean dataset, but their performances drop significantly on synthesized and real-world data. The main reason is that they rely on robust binarization to detect staff lines and notes, but the binarization step often fails on synthesized and real-world data due to bad lighting conditions, noise corruption and cluttered backgrounds. The CRNN, on the other hand, uses convolutional features that are highly robust to noise and distortions. Besides, the recurrent layers in CRNN can utilize contextual information in the score. Each note is recognized not only by itself, but also by the nearby notes. Consequently, some notes can be recognized by comparing them with the nearby notes, e.g. by contrasting their vertical positions.
The results have shown the generality of CRNN, in that it can be readily applied to other image-based sequence recognition problems, requiring minimal domain knowledge. Compared with Capella Scan and PhotoScore, our CRNN-based system is still preliminary and lacks many functionalities. But it provides a new scheme for OMR, and has shown promising capabilities in pitch recognition.
4. Conclusion
In this paper, we have presented a novel neural network architecture, called Convolutional Recurrent Neural Network (CRNN), which integrates the advantages of both Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). CRNN is able to take input images of varying dimensions and produce predictions of different lengths. It directly runs on coarse-level labels (e.g. words), requiring no detailed annotations for each individual element (e.g. characters) in the training phase. Moreover, as CRNN abandons the fully connected layers used in conventional neural networks, it results in a much more compact and efficient model. All these properties make CRNN an excellent approach for image-based sequence recognition.
The experiments on the scene text recognition benchmarks demonstrate that CRNN achieves superior or highly competitive performance compared with conventional methods as well as other CNN- and RNN-based algorithms. This confirms the advantages of the proposed algorithm. In addition, CRNN significantly outperforms other competitors on a benchmark for Optical Music Recognition (OMR), which verifies the generality of CRNN.
In fact, CRNN is a general framework, so it can be applied to other domains and problems (such as Chinese character recognition) that involve sequence prediction in images. Further speeding up CRNN and making it more practical in real-world applications is another direction worth exploring in the future.
Original paper: An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition (arXiv:1507.05717)