Abstract—In this work, the codon frequency ratio has been considered to specify gene
sequences. A back propagation neural network and a support vector machine have been
used to classify hypertension gene sequences. Each gene sequence has been
converted into a 19-dimensional feature vector, and then the Back Propagation
Neural Network and the Support Vector Machine have been trained using the feature
vectors of all gene sequences. Since frequency ratios do not depend on
length, sequences of various lengths are used here. The accuracy rates of the Support
Vector Machine and the Back Propagation Neural Network have been compared using
different numbers of samples in the training and testing phases. In
this experiment, the accuracy rate increased with the
size of the sample. In most cases, the accuracy of the Back Propagation Neural Network is
better than that of the Support Vector Machine.

 

Keywords—Back propagation neural
network; Support vector machine; Gene sequence; Codon frequency; Accuracy;
Sensitivity; Precision


 

I.          
INTRODUCTION

 

Analyzing and classifying data and
information related to biology and genetics through computer simulation has
become a new interest of bioinformatics. Various diseases are diagnosed from
phenotype or risk factors. Similarly, hypertension is generally predicted from a
set of risk factors such as weight, blood pressure, stress, smoking, alcohol
consumption, family history, and so on. But these are only outcomes of
protein synthesis, which is the result of the decoded genetic code. A gene is a
segment of DNA that codes for a protein, which in turn causes a trait (skin tone,
eye color, or anything else); a gene is a stretch of DNA [1]. A codon is a sequence of
three adjacent nucleotides constituting the genetic code that determines the
insertion of a specific amino acid into a polypeptide chain during protein
synthesis, or a signal to stop protein synthesis [2]. The uneven frequency of codons is called codon usage
bias, and 4 letters (A, G, C, T) are used to write the genetic code. So there are
64 codons, of which 61 encode amino acids and 3, named stop codons, stop the
process. The objectives of this research are to use codon bias frequency to
classify hypertension gene sequences, building a feature vector of 19 attributes, and
to compare the classification results of the two classifiers to find out the
better one.


II.        
RELATED
WORKS

 

In the field of biotechnology, artificial
intelligence has been playing a key role since the late 1990s. Various neural network
algorithms have been implemented for prediction and detection using classification
methods, and the combination of neural network and genetic
algorithms gives better results in various cases. A BPNN was proposed
for pre-diagnosis of hypertension from phenotype or risk factors [3]. A BPNN was also
implemented, along with bioinformatics techniques, to predict mutational diseases
such as breast cancer [4]. A Support Vector Machine (SVM) was used for
neuro-psychiatric diseases [5]; there, protein datasets available at NCBI, with
features such as primary structures, were used as training parameters, with ADHD (Attention
Deficit Hyperactivity Disorder), dementia, mood disorder, OCD (Obsessive
Compulsive Disorder) and schizophrenia as outputs. Data classification with
different kernels was performed by Durgesh K. Srivastava and Lekha Bhambhu for
various datasets, such as the Diabetes, Heart, Satellite and Shuttle datasets.

 

Through alignment after BLASTing, similarities or
matches among gene sequences can be found. Codon usage statistics were first
introduced for this purpose by Staden and McLachlan in 1982 [6]. Codon usage
bias varies from species to species and has a primary relation with the
functions of genes [7][8]. In fact, gene sequences with the same functionality
have different codon usage in different species [8]. Protein tertiary structure, gene
expression level and tRNA have very close relations with codon bias patterns
[9]-[13]. Though most analyses related to codon usage concern "deep-branching
species" such as viruses [14], bacteria [10][13][15], yeast [16][17], Caenorhabditis elegans
[18], and Arabidopsis thaliana [19][20], some have been done on mammals [21][22]. In
the National Centre for Biotechnology Information (NCBI), gene sequences
responsible for hypertension and many other diseases are available in FASTA format
[23].

 

III.      
METHODOLOGY

 

There are three steps in
this methodology. They are:

 

·        
Preparing user defined dataset from
available gene sequences

 

·        
Implementing BPNN and SVM classifier at
training phase

 

·        
Comparing the results of two classifiers
at testing phase.

[Fig. 1 flowchart: Gene Sequence from Database → Sort Codons → Prepare Dataset → SVM / BPNN → Take Decision]

Fig.1 Methodology of this work

 

A.  Preparation of Dataset

 

A number of human gene sequences responsible for
hypertension, and some which are not, are collected from NCBI [24]. After
that, the frequency ratios of the 64 different codons are calculated using the "R"
software. Only codons whose frequency in human cells is around 20% or more
(18 codons in total) are considered as features; together with the sequence
length, they form the 19 attributes of the dataset.
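As a sketch, the conversion of a sequence into such a feature vector could look like the following Python fragment. The 18-codon list shown is hypothetical (the paper selects the codons by their usage in human cells), and non-overlapping triplets starting from the first position are assumed:

```python
from collections import Counter

def codon_features(seq, selected_codons):
    """Return the frequency ratios of the selected codons plus the
    sequence length (18 ratios + length = 19 features)."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3              # drop a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    # frequency ratio = occurrences of the codon / total codons in the sequence
    ratios = [counts[c] / len(codons) for c in selected_codons]
    return ratios + [len(seq)]

# hypothetical selection of 18 frequent codons (illustrative only)
SELECTED = ["ATG", "GAA", "AAG", "CTG", "GAG", "AAA", "GAT", "GCC", "TTC",
            "ACC", "ATC", "GTG", "CAG", "GGC", "TAC", "AAC", "TGG", "TCC"]

vec = codon_features("ATGGAAAAGCTGTAA", SELECTED)
```

Because the ratios are normalized by the number of codons, the resulting features are independent of sequence length, which is why sequences of different lengths can share one dataset.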

B.  Back Propagation Neural Network

 

The working and learning method of this supervised
learning algorithm is similar to that of the human brain: the algorithm can learn from
error. A BPNN has three layers, named the input layer, the hidden layer and the output layer.
The hidden layer is connected with both the input and output layers through a number of
nodes called neurons.

Fig.2 Back Propagation Neural
Network

 

The working procedure can be described as below [25]:

·         Each input unit (xi, i = 1, 2, 3 …, n) receives the input signal xi and sends the signal to all units in the layer above it (the hidden units).

·         Each hidden unit (zj, j = 1, 2, 3 …, p) sums its weighted input signals,

z_inj = v0j + Σi xi vij,

applies its activation function to get the output signal value,

zj = f(z_inj),

and sends that signal to all units in the layer above it (the output units).

·         Each output unit (yk, k = 1, 2, 3 …, m) sums its weighted input signals,

y_ink = w0k + Σj zj wjk,

and applies its activation function to get the output signal,

yk = f(y_ink).

The feed-forward pass is the first part of the back-propagation algorithm; it sends the input signal up through the layers, and the next step is the error calculation.

·         Each output unit (yk, k = 1, 2, 3 …, m) receives a target pattern paired with the input training pattern and computes an error term,

δk = (tk − yk) f′(y_ink),

computes the weight correction (used to update the weight wjk),

Δwjk = α δk zj,

computes the bias correction (used to update the bias value w0k),

Δw0k = α δk,

and sends δk to the units in the layer below it.

·         Each hidden unit (zj, j = 1, 2, 3 …, p) sums its delta inputs (from the units in the layer above it),

δ_inj = Σk δk wjk,

multiplies by the derivative of its activation function to compute its error term,

δj = δ_inj f′(z_inj),

computes the weight correction (used to update vij),

Δvij = α δj xi,

and computes the bias correction (used to update v0j),

Δv0j = α δj,

where α is the learning rate.
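These update rules can be sketched in NumPy for a single training pair. This is a minimal illustration under assumed layer sizes, a sigmoid activation and an arbitrary learning rate, not the authors' implementation:

```python
import numpy as np

def f(u):
    # sigmoid activation, with f'(u) = f(u) * (1 - f(u))
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(0)
n, p, m = 19, 8, 1                 # illustrative layer sizes (19 features in)
V, v0 = rng.normal(0, 0.1, (n, p)), np.zeros(p)   # input-to-hidden weights, biases
W, w0 = rng.normal(0, 0.1, (p, m)), np.zeros(m)   # hidden-to-output weights, biases
alpha = 0.1                        # learning rate (arbitrary)

x = rng.random(n)                  # one input pattern
t = np.array([1.0])                # its target

# feed-forward pass
z_in = v0 + x @ V; z = f(z_in)     # hidden layer: z_inj = v0j + sum_i xi vij
y_in = w0 + z @ W; y = f(y_in)     # output layer: y_ink = w0k + sum_j zj wjk

# error terms (computed with the pre-update weights)
delta_k = (t - y) * y * (1 - y)    # output error: (tk - yk) f'(y_ink)
delta_in = W @ delta_k             # delta input to the hidden units
delta_j = delta_in * z * (1 - z)   # hidden error: delta_inj f'(z_inj)

# weight and bias corrections
W += alpha * np.outer(z, delta_k); w0 += alpha * delta_k
V += alpha * np.outer(x, delta_j); v0 += alpha * delta_j
```

Repeating the feed-forward and correction steps over the whole training set until the error is small enough constitutes the training phase.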

C.  Support Vector Machine

 

Support Vector Machine (SVM) is a supervised
classification method that separates the classes with an optimal hyperplane
maximizing the margin between them. The data points closest to the hyperplane
are known as support vectors. SVM maps the input vectors into a higher-dimensional
space, where a maximal separating hyperplane is constructed. Two parallel
hyperplanes are constructed on either side of the hyperplane that separates the
data, and the separating hyperplane maximizes the distance between these two
parallel hyperplanes [26].

 


Fig.3 Support Vector Machine

 

·         Kernel Selection – Training vectors xi are mapped into a higher-dimensional space by a function φ. SVM then finds a linear separating hyperplane with the maximal margin in this higher-dimensional space. C > 0 is the penalty parameter of the error term. Furthermore, the kernel function is K(xi, xj) ≡ φ(xi)T φ(xj). In this work, the RBF kernel has been used.

 

·         Model Selection – To select the model, some parameters (i.e. the cost parameter C and the kernel parameters γ and d) have been tuned.

 

·         Cross-validation – From the training dataset, some subsets have been selected at test time; one of the subsets is held out to test the training process and to increase accuracy. C, γ and d have been tuned through this procedure. This full process is called cross-validation.

 

·         LIBSVM – This library has been used to implement the SVM [27]. First it builds a model from the training dataset; then this trained model is used to classify the test dataset.

 

IV.    RESULT
ANALYSIS

 

In this experiment, the
performances of SVM and BPNN have been measured for sample sizes 60, 100 and
150. The True Positive Rate (TPR), False Negative Rate (FNR), False Positive
Rate (FPR) and True Negative Rate (TNR) have been calculated for the SVM and BPNN classifiers.
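These four quantities come from the confusion-matrix counts. Note that the four values in each row of Table I sum to 100%, which suggests each cell is reported as a share of all samples rather than as the usual per-class rates; both conventions are sketched here for comparison:

```python
def rates_per_class(tp, fn, fp, tn):
    """Conventional definitions: TPR/FNR over the actual positives,
    FPR/TNR over the actual negatives (as percentages)."""
    pos, neg = tp + fn, fp + tn
    return (100 * tp / pos, 100 * fn / pos,
            100 * fp / neg, 100 * tn / neg)

def rates_of_total(tp, fn, fp, tn):
    """Each confusion-matrix cell as a percentage of all samples
    (the form the four Table I values appear to take)."""
    total = tp + fn + fp + tn
    return tuple(100 * v / total for v in (tp, fn, fp, tn))

# example: the SVM row for 60 samples (15.00, 35.00, 35.00, 15.00)
# corresponds to counts tp=9, fn=21, fp=21, tn=9
row = rates_of_total(9, 21, 21, 9)
```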

 

TABLE I. VALUES OF TPR, TNR, FPR AND FNR FOR SVM AND BPNN

Sample |           SVM            |           BPNN
No.    | TPR   FNR   FPR   TNR    | TPR   FNR   FPR   TNR
60     | 15.00 35.00 35.00 15.00  | 75.00 16.67  8.33  0.00
100    | 75.00 15.00  5.00  5.00  | 70.00 20.00 10.00  0.00
150    | 38.33 11.67 11.67 38.33  | 90.00  3.33  3.33  3.33


Fig. 4 (a) TPR, TNR, FPR and FNR of SVM
and BPNN for 60 samples, (b) TPR, TNR, FPR and FNR of SVM and BPNN for 100
samples and (c) TPR, TNR, FPR and FNR of SVM and BPNN for 150 samples.

 

In the above figure, the true positive and true negative rates of the
BPNN are mostly better than those of the SVM, and the minimum error has been
found for the BPNN classifier.

 

(c)

 

Fig.5 (a) ROC curve for 60 samples, (b) ROC curve
for 100 samples and (c) ROC curve for 150 samples

 

For the ROC curve, the false positive rate runs along the X axis and the
true positive rate along the Y axis. As a result, the top-left corner has been
considered the "ideal point". Here, the "steepness" of the ROC curve reflects
maximizing the true positive rate while minimizing the false positive rate
(type I error). The performance of the BPNN has been better for sample sizes 60
and 150 (see Fig. 5). The accuracy, sensitivity and precision have been
calculated for the SVM and BPNN classifiers (see Table II).

 

TABLE II. ACCURACY, SENSITIVITY AND PRECISION VALUES FOR SVM AND BPNN

No. of  |            SVM                 |            BPNN
Samples | Accuracy Sensitivity Precision | Accuracy Sensitivity Precision
60      | 30%      30%         30%       | 75%      81.82%      90%
100     | 50%      50%         50%       | 80%      94.73%      94.73%
150     | 76.66%   83.3%       93.75%    | 93.33%   77.77%      83.33%


 

Fig. 6 Training Phase, (a) Accuracy,
Sensitivity and Precision of SVM and BPNN for 60 samples, (b) Accuracy,
Sensitivity and Precision of SVM and BPNN for 100 samples and (c) Accuracy,
Sensitivity and Precision of SVM and BPNN for 150 samples.

 

High rates of these three parameters indicate a
well-trained network (see Table II and Fig. 6). From Fig. 5 it can be seen that
the ROC curve of the BPNN rises more steeply toward the true positive rate. The
accuracy rate has increased with sample size for both BPNN and SVM, but for a
given number of samples the accuracy rate has always been higher for the BPNN.
For sensitivity and precision, the SVM has had lower rates than the BPNN for 60
and 100 samples; though its sensitivity and precision have exceeded the BPNN's
for 150 samples, its accuracy rate has remained smaller. From the above results
and discussion it can be concluded that, for the purpose of classifying gene
sequences using codon frequency ratios, the BPNN has worked better than the
SVM. But it should be noticed that the SVM has used two different datasets for
the training and testing phases, whereas the BPNN has had only one dataset for
both: the BPNN has chosen data randomly from the dataset for training,
validation and testing purposes. This has helped the BPNN to draw data from all
over the dataset, not only from a selected range of separate datasets. It was
expected that the SVM would be the better classifier for this work, but as the
data are so closely interrelated, the drawing of the hyperplane is not
efficient. The limited number of samples has also been a factor; for a dataset
of 1000 or more samples, the SVM might give a better result than the BPNN.

 

V.         
CONCLUSION

 

After various methods such as single nucleotide
polymorphism (SNP), mutation and amino acid composition have been used to define gene
sequences, codon frequency has been implemented here as a new approach. Though
codons have long been considered a vital attribute of the genes of a specific
species, codon bias frequency has not previously been used to describe genes
over a fully user-defined dataset. In this work, the codon bias frequency ratio
has been used because lengths may differ, but the codon ratios must be nearly
similar, as protein production and synthesis should be similar for a specific task.