Flow Chart for Recognizing Speech/Voice from User Input

1-User Input:

The system takes the user's voice as input in the form of an analog acoustic signal.

2-Digitization:

The analog acoustic signal is digitized.

3-Phonetic Breakdown:

The digitized signal is broken into chunks of 'phonemes'.

4-Statistical Modeling:

A statistical model maps the phonemes to their phonetic representations.

5-Matching:

The phonetic representation is matched against a dictionary, and the system returns an n-best list of words along with a confidence score.

6-Grammar:

Words or phrases used in combination to constrain the input/output range of the voice application.

SPEECH RECOGNITION through RECURRENT NEURAL NETWORKS

Neural Networks (NNs), in combination with Hidden Markov Models (HMMs), have a long history in speech recognition. Since speech is an inherently dynamic process, it seems natural to consider Recurrent Neural Networks (RNNs) as an alternative model.

Instead of combining RNNs with HMMs, it is also possible to train RNNs "end to end" for speech recognition (SR). This approach exploits the larger state space and richer dynamics of RNNs compared to HMMs, and avoids the problem of using incorrect alignments as training targets.

RECURRENT NEURAL NETWORKS:

A recurrent neural network is one in which every layer represents another step in time (or another step in some sequence), and every time step gets one input and predicts one output. The network uses the same 'transition function' for every time step, and therefore learns to predict the output sequence from the input sequence for sequences of any length.

For a standard recurrent neural network, in order to make a prediction we iterate the following equations:

h_i = \sigma(W_{hh} h_{i-1} + W_{hx} x_i + b_h)

\hat{y}_i = W_{yh} h_i

Here h_i is the hidden layer at step i. Likewise, x_i is the input layer at time step i, ŷ_i is the output layer at time step i, and the W's are the weight matrices. This formulation of recurrent networks is equivalent to having a one-hidden-layer feed-forward network at every time step, where at every time step part of the input layer can also be considered to be h_{i-1}.
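
To make the recurrence concrete, here is a minimal NumPy sketch of the two equations above. The layer sizes, the tanh nonlinearity standing in for σ, and the random weights are assumptions made purely for the example.

import numpy as np

# Sizes and weights are invented for the example.
input_size, hidden_size, output_size = 8, 16, 4
rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_hx = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_yh = rng.normal(scale=0.1, size=(output_size, hidden_size))
b_h = np.zeros(hidden_size)

def rnn_forward(xs):
    # h_i = tanh(W_hh h_{i-1} + W_hx x_i + b_h); y_i = W_yh h_i
    h = np.zeros(hidden_size)
    ys = []
    for x in xs:                       # same transition function at every step
        h = np.tanh(W_hh @ h + W_hx @ x + b_h)
        ys.append(W_yh @ h)            # one output per time step
    return np.array(ys)

outputs = rnn_forward(rng.normal(size=(5, input_size)))   # a 5-step input sequence
print(outputs.shape)                                      # (5, 4): one output per step

Because the same weight matrices are reused at every step, the same code handles sequences of any length.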

Long Short-Term Memory (LSTM):

Rather than modifying the training algorithm, we can change the network architecture to make the network easier to train. Although standard RNNs are powerful in theory, they can be very difficult to train; techniques such as Hessian-Free Optimization (HFO) have been applied to them to improve trainability. One reason training these networks is difficult is that the errors computed during backpropagation are multiplied by one another once per time step. If the errors are large, they quickly grow very large due to the repeated multiplication; otherwise, they quickly shrink and vanish. An alternative architecture built with Long Short-Term Memory (LSTM) cells attempts to counteract this issue.
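
The article does not spell out the LSTM equations, so the sketch below uses one common gated formulation (forget, input and output gates acting on an additive cell state) as an assumption; the point is that the cell state is updated additively rather than by repeated multiplication, which is what counteracts the exploding/vanishing error problem described above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    # One time step of an LSTM cell (a common formulation, assumed here).
    z = W @ np.concatenate([x, h_prev]) + b         # all gate pre-activations at once
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)    # forget, input and output gates
    c = f * c_prev + i * np.tanh(g)                 # additive cell-state update
    h = o * np.tanh(c)                              # hidden state passed onward
    return h, c

# Tiny usage example with invented sizes and random weights.
rng = np.random.default_rng(0)
input_size, hidden_size = 8, 16
W = rng.normal(scale=0.1, size=(4 * hidden_size, input_size + hidden_size))
b = np.zeros(4 * hidden_size)
h = c = np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):
    h, c = lstm_step(x, h, c, W, b)
print(h.shape)   # (16,)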

Bidirectional RNNs:

"In speech recognition, specifically, the sound both before and after a given point provides information about the sound at that specific point in the sequence." In a standard RNN, the output at a given time t depends exclusively on the inputs x_0 through x_t. However, many sequences have information relevant to the output y_t both before and after time step t; while the standard formulation makes sense in some contexts, we need a modified architecture in order to utilize this information. There are several possible approaches:

Windowed Feed-Forward Network (WFFN):

Instead of using an RNN, use a standard feed-forward network (SFFN) on a window around the output. This has the benefit of being easier to train, but it limits applicability: we must commit to a window of exactly that size, and we cannot use information far away from the output, so the size of the window is limiting.

RNN (with delays):

Instead of predicting time step t after seeing inputs 0 to t, predict time step t after seeing inputs 0 to t + d, where d is some fixed delay. This is very close to a standard RNN, but it lets the network look a few steps into the future for contextual information.

Bidirectional RNN:

Add another set of hidden layers to your recurrent network, going backwards in time. These two hidden layers are entirely separate and do not interact with each other, except that both are used to compute the output. Given your weights, you run propagation forward in time from time 0 to the end to compute the forward hidden layers, and run it backwards in time from the end to time 0 to compute the backward hidden layers. Finally, at every time step you compute the output using the values of both hidden layers for that time step.
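
A minimal sketch of that forward pass, reusing the simple recurrence from the earlier section; the sizes, the tanh nonlinearity and the concatenation of the two hidden states at the output are assumptions for illustration.

import numpy as np

input_size, hidden_size, output_size = 8, 16, 4
rng = np.random.default_rng(0)

def make_direction():
    # One set of recurrent weights; invented sizes.
    return (rng.normal(scale=0.1, size=(hidden_size, hidden_size)),
            rng.normal(scale=0.1, size=(hidden_size, input_size)),
            np.zeros(hidden_size))

forward_weights, backward_weights = make_direction(), make_direction()
W_y = rng.normal(scale=0.1, size=(output_size, 2 * hidden_size))

def run_direction(xs, weights):
    W_hh, W_hx, b_h = weights
    h, hs = np.zeros(hidden_size), []
    for x in xs:
        h = np.tanh(W_hh @ h + W_hx @ x + b_h)
        hs.append(h)
    return hs

def brnn_forward(xs):
    h_fwd = run_direction(xs, forward_weights)                # propagate from time 0 to the end
    h_bwd = run_direction(xs[::-1], backward_weights)[::-1]   # propagate from the end back to 0
    # The two hidden layers never interact; only the output sees both.
    return np.array([W_y @ np.concatenate([f, b])
                     for f, b in zip(h_fwd, h_bwd)])

print(brnn_forward(rng.normal(size=(6, input_size))).shape)   # (6, 4)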

Two diagrams are very helpful for understanding bidirectional RNNs and how they differ from the other approaches. First, we can visualize which part of a sequence each type of network can utilize in order to predict a value at time t_c: the windowed approach (labelled MLP, for multi-layer perceptron) sees only a fixed window; normal RNNs (labelled RNN) utilize data right up to t_c; delayed RNNs (forward and backward) can use all of their history, with an additional window around t_c; finally, bidirectional RNNs (BRNNs) can use the whole sequence for their prediction. So we will focus on that approach. It can be trained just like a regular RNN; it only looks slightly different once unrolled in time.

Here, the BRNN is unrolled in time, showing only the time steps around time step t. The input (striped) feeds into both hidden states, and both of them feed into the output of the RNN (black). In the middle we have two hidden states (gray), one propagating forwards and one propagating backwards in time.

Training an Acoustic Model:

The first goal for speech recognition (SR) is to build a classifier which can convert a sequence of sounds into a sequence of phonemes. Suppose that we have an input sequence x (sound data) and a desired output sequence y (phonemes). Even if our output sequence is short (for example, a few spoken words, maybe 10 to 20 sounds), our input sequence is going to be much longer, since we will need to sample each sound repeatedly in order to distinguish them. Therefore, x and y are going to be of different lengths, which poses a problem for our standard RNN setup, in which we predict one output for one input. We have several choices for correcting this problem.

The first choice is to align the output sequence y with the input sequence: every element y_i of the output sequence is placed on some corresponding element x_i. The network is then trained to output y_i at time step i (with input x_i) and to output a 'blank' element at time steps for which there is no output. These sequences are said to be "aligned", since we have placed every output element y_i in its correct temporal location.
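
As a toy illustration of what such an aligned target looks like (the phonemes, the frame count and the frame positions are invented for the example):

# Each output phoneme is pinned to one input frame; every other frame gets
# the blank symbol '-'. The phonemes, frames and positions are invented.
frames = 10                                                    # length of the input sequence x
aligned = ["-"] * frames
for phoneme, frame_index in [("k", 1), ("ae", 4), ("t", 8)]:   # the word "cat"
    aligned[frame_index] = phoneme
print(aligned)   # ['-', 'k', '-', '-', 'ae', '-', '-', '-', 't', '-']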

Probability Distribution:

Given an input sequence x of length T, the network generates some output y that parameterizes a probability distribution over the space of all possible labels. Let L' be our alphabet L with an additional symbol representing a "blank". The output layer of our network is required to be a softmax layer, which assigns a probability to each element of L'. Let y^i_n be the probability assigned by the network to seeing n ∈ L' at time t = i.

The output generated by the network is known as a "path". The probability of a given path π given inputs x can then be written as the product of all its constituent elements:

P(\pi \mid x) = \prod_{t=1}^{T} y^t_{\pi_t}

where π_t is the t-th element of the path π.
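
A small sketch of this product, with made-up softmax outputs over L' = {blank, 'a', 'b'}; the probabilities are invented for the example, and index 0 plays the role of the blank.

import numpy as np

# Invented softmax outputs over L' = {blank, 'a', 'b'}; index 0 is the blank.
y = np.array([[0.1, 0.7, 0.2],    # t = 1
              [0.6, 0.3, 0.1],    # t = 2
              [0.2, 0.2, 0.6]])   # t = 3 (each row sums to 1)

def path_probability(path, y):
    # P(pi | x) = product over t of y^t_{pi_t}; path holds one index per time step.
    return float(np.prod([y[t, c] for t, c in enumerate(path)]))

print(path_probability([1, 0, 2], y))   # path ('a', blank, 'b'): 0.7 * 0.6 * 0.6 = 0.252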

If we traverse the path by removing all blanks and duplicate letters, we get some label. Note that we remove duplicate letters in addition to blanks; effectively, this means we only care about transitions from blanks to letters or from one letter to a different letter. Let label(π) be the label corresponding to a path π. Then the probability of seeing a particular label ℓ given the input sequence x can be written as the sum of the path probabilities over all the paths that give us that label:

P(\ell \mid x) = \sum_{\mathrm{label}(\pi) = \ell} P(\pi \mid x) = \sum_{\mathrm{label}(\pi) = \ell} \prod_{t=1}^{T} y^t_{\pi_t}
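
To make the sum concrete, the brute-force sketch below enumerates every possible path, collapses it (repeats first, then blanks), and sums the probabilities of the paths that share the target label. This is only feasible for tiny T and alphabets and is shown purely for illustration; the toy probabilities are the same invented ones as above.

from itertools import product

import numpy as np

y = np.array([[0.1, 0.7, 0.2],    # same invented outputs as above; index 0 is the blank
              [0.6, 0.3, 0.1],
              [0.2, 0.2, 0.6]])
symbols = ["-", "a", "b"]

def collapse(path):
    # label(pi): drop repeated symbols first, then drop the blanks.
    kept = [c for i, c in enumerate(path) if i == 0 or c != path[i - 1]]
    return "".join(symbols[c] for c in kept if c != 0)

def label_probability(target, y):
    total = 0.0
    for path in product(range(len(symbols)), repeat=y.shape[0]):
        if collapse(path) == target:
            total += np.prod([y[t, c] for t, c in enumerate(path)])
    return total

print(label_probability("ab", y))   # sum over every path that collapses to "ab"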

Output Decoding:

Given the probability distribution P(ℓ | x), we can compute a label ℓ for an input sequence x by taking the most probable label. Therefore, letting L^{≤T} be the set of sequences of length less than or equal to T with letters drawn from the alphabet L, we can express our desired classifier h(x) as follows:

h(x) = \arg\max_{\ell \in L^{\leq T}} P(\ell \mid x)

Best Path Decoding:

The first traditional decoding strategy is best path decoding, which assumes that the most likely path corresponds to the most likely label. This is not always true: suppose we have one path with probability 0.1 corresponding to label A, and 10 paths with probability 0.05 each corresponding to label B. Clearly label B is preferable overall, since it has an overall probability of 0.5, but best path decoding would pick label A, which has a higher probability than any single path for label B. Best path decoding is fairly simple to compute: simply take the most active output at every time step, concatenate them, and convert them to a label by erasing duplicates and blanks. Since at each step we choose the most active output, the resulting path is the most likely one.
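
A minimal sketch of best path decoding, using the same invented toy outputs as before; index 0 again plays the role of the blank.

import numpy as np

def best_path_decode(y, symbols):
    path = np.argmax(y, axis=1)     # the most active output at every time step
    kept = [c for i, c in enumerate(path) if i == 0 or c != path[i - 1]]
    return "".join(symbols[c] for c in kept if c != 0)   # erase duplicates and blanks

y = np.array([[0.1, 0.7, 0.2],
              [0.6, 0.3, 0.1],
              [0.2, 0.2, 0.6]])
print(best_path_decode(y, symbols=["-", "a", "b"]))   # 'ab'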

Prefix Search Decoding:

As an alternative to the naive best path decoding approach, we can perform a search within the label space, using heuristics to guide our search and to decide when to stop. One particular set of heuristics gives an algorithm called prefix search decoding, which is inspired by the forward-backward algorithm for hidden Markov models. The intuition behind prefix search decoding is that instead of searching among all labels, we look at prefixes of strings. We keep growing a prefix by appending its most probable extension, until it is more probable that the prefix ends the string (the label consists only of that prefix), at which point we stop.
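
The sketch below implements a closely related prefix-based decoder, CTC prefix beam search, rather than the exact stopping criterion of prefix search decoding: it tracks the most probable label prefixes frame by frame and returns the best one at the end. The blank index, the toy probabilities and the beam width are assumptions for the example.

from collections import defaultdict

import numpy as np

def prefix_beam_search(probs, alphabet, beam_width=8):
    # probs: (T, 1 + len(alphabet)) softmax outputs; column 0 is the blank.
    blank = 0
    # Each prefix (a tuple of symbol indices) maps to a pair of probabilities:
    # (paths ending in a blank, paths ending in the prefix's last symbol).
    beam = {(): (1.0, 0.0)}
    for t in range(probs.shape[0]):
        next_beam = defaultdict(lambda: (0.0, 0.0))
        for prefix, (p_b, p_nb) in beam.items():
            # Emit a blank: the prefix stays the same.
            nb_b, nb_nb = next_beam[prefix]
            next_beam[prefix] = (nb_b + (p_b + p_nb) * probs[t, blank], nb_nb)
            # Repeat the prefix's last symbol: the prefix also stays the same.
            if prefix:
                nb_b, nb_nb = next_beam[prefix]
                next_beam[prefix] = (nb_b, nb_nb + p_nb * probs[t, prefix[-1]])
            # Extend the prefix with a non-blank symbol.
            for c in range(1, probs.shape[1]):
                extended = prefix + (c,)
                nb_b, nb_nb = next_beam[extended]
                if prefix and c == prefix[-1]:
                    # A repeated symbol only extends the label after a blank.
                    next_beam[extended] = (nb_b, nb_nb + p_b * probs[t, c])
                else:
                    next_beam[extended] = (nb_b, nb_nb + (p_b + p_nb) * probs[t, c])
        # Keep only the most probable prefixes.
        beam = dict(sorted(next_beam.items(), key=lambda kv: sum(kv[1]),
                           reverse=True)[:beam_width])
    best = max(beam.items(), key=lambda kv: sum(kv[1]))[0]
    return "".join(alphabet[c - 1] for c in best)

# Tiny usage example with invented probabilities over {blank, 'a', 'b'}.
frame_probs = np.array([[0.1, 0.6, 0.3],
                        [0.4, 0.4, 0.2],
                        [0.2, 0.1, 0.7]])
print(prefix_beam_search(frame_probs, alphabet="ab"))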

Training a Linguistic Model:

The connectionist temporal classification model we described above does a good job as an acoustic model, i.e. it can be trained to predict the output phonemes based on the input sound data. However, it does not account for the fact that the output is actually human language, and not just a stream of phonemes. We can augment the acoustic model with a "linguistic" model, one which depends solely on the character stream, and not on the sound data. This second model is also implemented as an RNN, called an RNN transducer. Using the same structure as we described in the first section, we train the RNN to do one-step prediction. Specifically, if we have a data sequence d = (G1, G2, ..., Gk) of characters, we train our neural network to predict d from an input sequence (b, G1, G2, ..., G(k-1)), where b is a blank character.

Now we have two models: one RNN that does character-level prediction, and another that does sound-based prediction. If f_t is the output of the acoustic model at time t and g_u is the output of the character-based model at character u, we can combine these into a single function: h(t, u) = exp(f_t + g_u).
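
A small sketch of this combination, treating f_t and g_u as made-up unnormalized outputs (logits) and normalizing over the symbol set so that every (t, u) pair yields a distribution; the sizes are assumptions for the example.

import numpy as np

num_symbols, T, U = 5, 4, 3
rng = np.random.default_rng(0)
f = rng.normal(size=(T, num_symbols))   # acoustic model output at each time t (invented)
g = rng.normal(size=(U, num_symbols))   # linguistic model output at each position u (invented)

h = np.exp(f[:, None, :] + g[None, :, :])       # h(t, u) = exp(f_t + g_u)
p = h / h.sum(axis=-1, keepdims=True)           # normalize over the symbol set
print(p.shape)   # (4, 3, 5): one distribution for every (t, u) pair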

Minor Modifications:

Finally, we have all the components we need to create our final network. Our final network greatly resembles the RNN transducer network we discussed above, with one modification to the usual components. Note that the function h(t, u) = exp(f_t + g_u) effectively multiplies the softmax outputs of f_t and g_u. Instead, we feed the hidden layers that feed into f_t and g_u into another single-hidden-layer neural network, which computes h(t, u). This was found to decrease deletion errors during speech recognition.
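
A minimal sketch of that modification: the two models' hidden layers feed a small single-hidden-layer "joint" network whose softmax output plays the role of h(t, u). All sizes and the tanh/softmax choices are assumptions for the example.

import numpy as np

hidden_f, hidden_g, joint_size, num_symbols = 16, 16, 32, 5
rng = np.random.default_rng(0)
W_j = rng.normal(scale=0.1, size=(joint_size, hidden_f + hidden_g))
W_o = rng.normal(scale=0.1, size=(num_symbols, joint_size))

def joint(h_f, h_g):
    # h(t, u) from the acoustic hidden state at time t and the linguistic
    # hidden state at position u, via one small hidden layer and a softmax.
    z = np.tanh(W_j @ np.concatenate([h_f, h_g]))
    logits = W_o @ z
    return np.exp(logits) / np.exp(logits).sum()

print(joint(rng.normal(size=hidden_f), rng.normal(size=hidden_g)).sum())   # sums to 1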

Conclusion:

We used Long Short-Term Memory (LSTM) units in deep, multi-hidden-layer bidirectional recurrent neural networks (BRNNs) as our base design. We worked through two possible decoding algorithms for classic RNN networks, and derived the objective function, as well as the way in which we compute it, in order to train our networks. We looked at RNN transducers, an approach used to augment the acoustic model with a linguistic model, or any model that just models output-output relationships. Note that we skipped over a number of things related to decoding data from the RNN transducer network.