
Overview of Three Different Structures of Artificial Neural Networks for
Speech Recognition
Automatic speech recognition (ASR) is the translation of human speech into text by machines and plays an important role nowadays. In this research review we examine three different artificial neural network architectures that are used in the speech recognition field and we investigate their performance in different cases. We analyze state-of-the-art deep neural networks (DNNs), which have evolved into complex structures and achieve significant results on a variety of speech benchmarks.
Afterward, we explain convolutional neural networks (CNNs) and explore their potential in this field. Finally, we present recent research on highway deep neural networks (HDNNs), which seem to be more flexible for resource-constrained platforms. Overall, we critically compare these methods and show their strengths and limitations. Each method has its own benefits and applications, from which we try to draw some conclusions and suggest potential future directions.
I. Introduction
Machine Learning (ML) is a field of computer science that gives computers the ability to learn through different algorithms and techniques without being explicitly programmed. ASR is closely related to ML because it uses methodologies and procedures of ML [1, 2, 3]. ASR has
been around for decades but it was not until recently that
there was a tremendous development because of the advances
in both machine learning methods and computer
hardware. New ML techniques made speech recognition
accurate enough to be useful outside of carefully controlled
environments, so nowadays it can easily be deployed in many electronic devices (e.g. computers, smart-phones) and used in many applications, such as identifying and authenticating a user via his/her voice.
Speech is the most important mode of communication between human beings, which is why, from the early part of the previous century, efforts have been made to make machines do what only humans could perceive. Research has been conducted throughout the past five decades, driven mainly by the desire to automate tasks using machines [2]. Many approaches based on different theories, such as probabilistic modeling and reasoning, pattern recognition, and artificial neural networks, influenced researchers and helped to advance ASR.
The first significant advance in the history of ASR occurred in the middle of the 70's with the introduction of the expectation-maximization (EM) algorithm [4] for training hidden Markov models (HMMs). The EM technique made it possible to develop the first speech recognition systems using Gaussian mixture models (GMMs). Despite all the advantages of GMMs, they are not able to efficiently model data that lie on or near a nonlinear surface in the data space (e.g. a sphere). This problem could be solved by artificial neural networks, because they can capture these non-linearities in the data, but the computer hardware of that era did not allow building complex neural networks.
As a result, most early speech recognition systems were based on HMMs. Later, the neural network and hidden Markov model (NN/HMM) hybrid architecture [5] was used for ASR systems. After the 2000s, the improvement of computer hardware and the invention of new machine learning algorithms made the training of DNNs possible. DNNs with many hidden layers have been shown to achieve comparable, and sometimes much better, performance than GMMs on many different speech databases and in a range of applications [6]. After the huge success of DNNs, researchers tried other artificial neural architectures, such as recurrent neural networks with long short-term memory units (LSTM-RNNs) [7], deep belief networks, and CNNs, and it seems that each one of them has its benefits and weaknesses.
In this literature review we present three types of artificial neural networks (DNNs, CNNs, and HDNNs). We analyze each method, explain how it is trained, and discuss its advantages and disadvantages. We then compare these methods in the context of ASR, identifying where each one is more suitable and what its limitations are. Finally, we draw some conclusions from these comparisons and carefully suggest some probable future directions.
II. Methods
 A. Deep Neural Networks
DNNs are feed-forward artificial neural networks with
more than one layer of hidden units. Each hidden
layer has a number of units (or neurons) each of which
Informatics Research Review (s1736880)
takes all outputs of the lower layer as input and passes them through a linear transformation, followed by a non-linear activation function (e.g. the sigmoid function, the hyperbolic tangent, some kind of rectified linear unit (ReLU) [8, 9], or the exponential linear unit (ELU) [10]) for the final transformation of the initial inputs. For a multi-class classification problem, the posterior probability of each class can be estimated using an output softmax layer. For the training process of DNNs we usually use the back-propagation technique [11]. For large training sets, it is typically more convenient to compute derivatives on a mini-batch of the training set rather than the whole training set (this is called stochastic gradient descent). As cost function we often use cross-entropy (CE) to measure the discrepancy between the network output and the target output, but the choice of cost function depends on the case.
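The forward pass and cross-entropy objective just described can be sketched in a few lines of NumPy. This is only an illustrative toy: the layer sizes, ReLU activations, weight scales, and random mini-batch stand in for real acoustic features and are our assumptions, not details from the reviewed papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # One common choice of non-linear activation (ReLU)
    return np.maximum(0.0, x)

def softmax(z):
    # Numerically stable softmax over the class axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical sizes: 40-dim acoustic features, two hidden layers, 10 classes
sizes = [40, 128, 128, 10]
W = [rng.normal(0.0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    # Each hidden layer: linear transformation followed by a non-linearity;
    # the softmax output layer gives class posterior probabilities.
    h = x
    for Wi, bi in zip(W[:-1], b[:-1]):
        h = relu(h @ Wi + bi)
    return softmax(h @ W[-1] + b[-1])

def cross_entropy(p, y):
    # CE between network outputs p and integer target labels y
    return -np.mean(np.log(p[np.arange(len(y)), y] + 1e-12))

# One mini-batch (as used in stochastic gradient descent) of 32 random frames
x = rng.normal(size=(32, 40))
y = rng.integers(0, 10, size=32)
loss = cross_entropy(forward(x), y)  # roughly ln(10) for untrained weights
```

In practice the gradients of this loss with respect to W and b would be computed by back-propagation on each mini-batch, which is the stochastic gradient descent procedure mentioned above.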
The difficulty of optimizing DNNs with many hidden layers, along with the overfitting problem, forces us to use pretraining methods. One popular method is to use restricted Boltzmann machines (RBMs) [12]. A stack of RBMs can be used to construct a deep belief network (DBN) (not to be confused with a dynamic Bayesian network). The purpose of this is to add an initial stage of generative pretraining. Pretraining is very important for DNNs because it reduces overfitting and also reduces the time required for discriminative fine-tuning with back-propagation.
DNNs play a major role in the context of ASR. Many architectures have been used by different research groups in order to gain better and better accuracy in acoustic models. Some of these methodologies are presented in [6], which reports significant results and shows that DNNs in general achieve higher speech recognition accuracy than GMMs on a variety of speech recognition benchmarks, such as TIMIT and some other large-vocabulary environments. The main reason is that they can handle the non-linearities in the data and so can learn much better models compared to GMMs. However, we have to mention that they use many model parameters in order to achieve good enough speech accuracy, and this is sometimes a drawback. Furthermore, they are quite complex and need many computational resources. Finally, they have been criticized because they do not preserve a specific structure (different structures can be used until a significant speech accuracy is achieved), they are difficult to interpret (because they have no specific structure), and they possess limited adaptability (different approaches are used for different cases). Despite all of these disadvantages, they have remained the state of the art for speech recognition over the last few years and have given us the most reliable and consistent results overall.
 B. Convolutional Neural Networks
Convolutional neural networks (CNNs) can be regarded as DNNs with the main difference that, instead of using fully connected hidden layers (as in DNNs, where every unit is connected to all units of the adjacent layers), they use a special network structure consisting of convolution and pooling layers [13, 14, 15]. The basic rule is that the data have to be organized as a number of feature maps in order to be passed to each convolutional layer. One significant problem when we want to transform speech data into feature maps concerns frequency, because we are not able to use the conventional mel-frequency cepstral coefficient (MFCC) technique [16]. The reason is that this technique does not preserve the locality of the data (in the case of CNNs), although we want to preserve locality in both frequency and time. Hence, a solution is the use of mel-frequency spectral coefficients (MFSC features) [15].
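To make the MFCC/MFSC distinction concrete, the sketch below computes log mel-filterbank energies (one common form of MFSC features) for a single power-spectrum frame. The filter count, FFT size, and sampling rate are illustrative assumptions; the key point is that no DCT is applied, so each output coefficient stays local in frequency.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=40, n_fft=512, sr=16000):
    # Triangular filters spaced evenly on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):            # rising edge of the triangle
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):            # falling edge of the triangle
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfsc(power_frame, fb):
    # Log mel-filterbank energies: locality in frequency is preserved
    # because no DCT is applied (the DCT step is what makes MFCCs global).
    return np.log(fb @ power_frame + 1e-10)

fb = mel_filterbank()
frame = np.abs(np.fft.rfft(np.random.randn(512))) ** 2  # toy power spectrum
feats = mfsc(frame, fb)  # 40 mel-frequency spectral coefficients
```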
Our purpose with the MFSC technique is to form the input feature maps without losing the property of locality in our data. Then we can apply the convolution and pooling layers with their respective operations to generate the activations of the units in those layers. We should mention that each input feature map is connected to many feature maps and that the feature maps share their weights. Thus, we first use the convolution operation to construct the convolutional layers and afterwards apply the pooling layer in order to reduce the resolution of the feature maps. This process continues depending on how deep we want our network to be (more layers in this structure may or may not achieve higher speech accuracy). The whole process and the usage of convolution and pooling layers are described in [15]. Moreover, as with RBMs for DNNs, there is a respective procedure for CNNs, the CRBM [17], that allows us to pretrain the network in order to gain in speech accuracy and reduce the overfitting effect. In [15], the authors also examine the case of a CNN with limited weight sharing for ASR (the LWS model) and propose to pretrain it by modifying the CRBM model.
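As a toy illustration of the convolution and pooling operations described above (the filter sizes, counts, and the random "MFSC frame" below are our own assumptions, not the configuration used in [15]):

```python
import numpy as np

def conv1d_freq(fmap, kernels, bias):
    # fmap: (n_bands,) input feature map along frequency;
    # kernels: (n_out, width) weights shared across all band positions.
    n_out, width = kernels.shape
    n_pos = fmap.size - width + 1
    out = np.empty((n_out, n_pos))
    for j in range(n_out):
        for t in range(n_pos):
            out[j, t] = fmap[t:t + width] @ kernels[j] + bias[j]
    return np.maximum(out, 0.0)  # ReLU activations

def max_pool(act, size):
    # Non-overlapping max-pooling reduces the resolution of the feature
    # maps and gives some invariance to small frequency shifts.
    n_out, n_pos = act.shape
    n_pool = n_pos // size
    return act[:, :n_pool * size].reshape(n_out, n_pool, size).max(axis=2)

rng = np.random.default_rng(1)
bands = rng.normal(size=40)               # one frame of 40 MFSC bands
K = rng.normal(0, 0.1, size=(8, 5))       # 8 feature maps, width-5 filters
act = conv1d_freq(bands, K, np.zeros(8))  # -> (8, 36) convolutional layer
pooled = max_pool(act, 3)                 # -> (8, 12) pooling layer
```

Pooling over groups of three neighbouring frequency positions here is what gives the small shift invariance discussed below; in a real model the convolution would also extend over time and over multiple input feature maps.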
CNNs have three major properties: locality, weight sharing, and pooling. Each one of them has the potential to improve speech recognition performance. These properties can reduce the overfitting problem and add robustness against non-white noise. In addition, they can reduce the number of network weights to be learned. Both locality and weight sharing are significant factors for the pooling property, which is very helpful in handling the small frequency shifts that are common in speech signals. These shifts may occur from differences in vocal tract lengths among different speakers [15]. In general, CNNs seem to have relatively better performance in ASR, taking advantage of their special network structure.
 C. Highway Deep Neural Networks
HDNNs are depth-gated feed-forward neural networks [18]. They are distinguished from conventional DNNs for two main reasons: firstly, they use far fewer model parameters, and secondly, they use two types of gate functions to facilitate the information flow through the hidden layers.
HDNNs are multi-layer networks with many hidden layers. In each layer, the initial input or the output of the previous hidden layer is combined linearly with the parameters of the current layer, followed by a non-linear activation function (e.g. the sigmoid function). The output layer has its own parameters, and we usually use the softmax function as the output function in order to obtain the posterior probability of each class given the initial inputs. Afterwards, given the target outputs, the network is usually trained by gradient descent to minimize a loss function such as cross-entropy (the CE function). So we can see that the architecture and the process are the same as for the DNNs described in subsection A.
The difference from standard DNNs is that highway deep neural networks (HDNNs) were proposed to enable very deep networks to be trained by augmenting the hidden layers with gate functions [19]. This augmentation happens through the transform and carry gate functions: the former scales the original hidden activations and the latter scales the input before passing it directly to the next hidden layer.
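A single highway layer can be sketched as follows. Note that the cited papers use separate transform and carry gates; this sketch ties the carry gate to C = 1 - T, a common simplification, and all sizes and initializations are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, Wh, bh, Wt, bt):
    # H: the usual hidden transformation; T: transform gate; the carry
    # gate is tied here as C = 1 - T (halving the gate parameters).
    H = np.tanh(x @ Wh + bh)        # candidate hidden activation
    T = sigmoid(x @ Wt + bt)        # transform gate, values in (0, 1)
    return T * H + (1.0 - T) * x    # gated mix of transform and carry

rng = np.random.default_rng(2)
d = 64                              # hidden width (illustrative)
x = rng.normal(size=(16, d))        # a mini-batch of 16 hidden vectors
Wh = rng.normal(0, 0.1, (d, d))
Wt = rng.normal(0, 0.1, (d, d))
# A negative transform-gate bias starts the layer close to the identity
# mapping, which is what lets very deep stacks train at all.
y = highway_layer(x, Wh, np.zeros(d), Wt, -2.0 * np.ones(d))
```

Because the whole stack can be steered through the small gate parameters (Wt, bt here), adapting or compressing the network reduces to updating the gates, which is the property the following training methods exploit.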
Three main methods for training are presented in [18, 20, 21]: sequence training, the adaptation technique, and teacher-student training. Combining these methodologies with the two gates demonstrates what an important role the carry and transform gates play in training. The main reason is that the gates are responsible for controlling the flow of information among the hidden layers. They allow us to achieve speech recognition accuracy comparable to classic DNNs but with far fewer model parameters, because we can control the whole network through the parameters of the gate functions (which are much fewer than the parameters of the whole network). This outcome is crucial for platforms such as mobile devices (e.g. voice recognition on mobiles), where the available resources are limited.
 D. Comparison of the Methods
The methods we presented have their benefits and limitations. In general, DNNs behave very well and in many cases have considerably better performance than GMMs on a range of applications. The main reason is that they can handle the non-linearities in the data space much better. On the other hand, their biggest drawback compared with GMMs is that it is much harder to make good use of large cluster machines to train them on massive data [6].
As far as CNNs are concerned, they can handle frequency shifts, which are difficult to handle within other models such as GMMs and DNNs. Furthermore, it is also difficult to learn an operation like max-pooling in standard artificial neural networks. Moreover, CNNs can handle the temporal variability in the speech features as well [15]. On the other hand, the fine-tuning of the pooling size (careful selection of the pooling size) is very important, because otherwise we may cause phonetic confusion, especially at segment boundaries. Despite the fact that CNNs seem to achieve better accuracy than DNNs with fewer parameters, they are computationally more expensive because of the complexity of the convolution operation.
HDNNs are considered to be more compact than regular DNNs due to the fact that they can achieve similar recognition accuracy with many fewer model parameters. Furthermore, they are thinner than DNNs because, through the gate functions, we can control the behavior of the whole network using only the parameters of the gates (which are much fewer than the parameters of the whole network). Moreover, with HDNNs we can update the whole model by simply updating the gate functions using adaptation data, and in this way we can gain considerably in speech recognition accuracy. Although they are considered useful for resource-constrained platforms, their final model parameters are still quite numerous. So we cannot conclude much about their general performance, because they are a recent proposal and more research is needed to see their overall benefits and limitations. However, the main idea is to use them in order to achieve ASR accuracy comparable to DNNs while simultaneously reducing the model parameters.
III. Conclusions
Overall, we can say that DNNs are the state of the art today because they behave very well on a range of speech recognition benchmarks. However, other artificial neural network architectures, such as CNNs, have achieved comparable performance in the context of ASR. Besides that, research continues to be conducted in this field in order to find new methods, learning techniques, and architectures that will allow us to train our data sets more efficiently. This means fewer parameters, less computational power, less complex models, and more structured models. Ideally we would like to have a single general model that covers many cases, rather than many different models applied in different circumstances. On the other hand, this is probably difficult, so distinct methodologies and specific techniques for different cases may be our temporary, or even permanent, solution. In this direction, HDNNs or other architectures may be used to deal with specific cases.
Many future directions for research to advance ASR have been suggested in the last few years. Some probable suggestions are the use of unsupervised learning or reinforcement learning for acoustic models. Another potential direction is to search for new architectures or special structures in artificial neural networks, or to invent new learning techniques while at the same time improving our current algorithms.
