Abstract—In this work, the codon frequency ratio has been considered to specify gene sequences. A back propagation neural network and a support vector machine have been used to classify hypertension gene sequences.
Each gene sequence has been converted into a 19-dimensional feature vector, and then the Back Propagation Neural Network and Support Vector Machine have been trained using the feature vectors of all gene sequences. Since frequency ratios do not depend on length, sequences of various lengths are used here. The accuracy rates of the Support Vector Machine and the Back Propagation Neural Network have been compared using different numbers of samples in the training and testing phases. In this experiment, the accuracy rate increased proportionally with the sample size. In most cases, the accuracy of the Back Propagation Neural Network is better than that of the Support Vector Machine.

Keywords—Back propagation neural network; Support vector machine; Gene sequence; Codon frequency; Accuracy; Sensitivity; Precision

I. INTRODUCTION

Analyzing and classifying various data and information related to biology and genetics through computer simulation has become a new interest of bioinformatics.
Various diseases are diagnosed from phenotype or risk factors. Similarly, hypertension is generally predicted from a set of risk factors such as weight, blood pressure, stress, smoking, alcohol consumption, family history and so on. But these are only outcomes of protein synthesis, which is the result of the decoded genetic code. A gene is a segment of DNA that codes for a protein, which is the cause of a trait (skin tone, eye color or anything else).
A gene is a stretch of DNA [1]. A codon is a sequence of three adjacent nucleotides constituting the genetic code that determines the insertion of a specific amino acid in a polypeptide chain during protein synthesis, or a signal to stop protein synthesis [2]. The frequency of codons is the codon usage bias, and 4 letters (A, G, C, T) are used to determine the genetic code. So there are 64 codons, of which 61 codons encode amino acids and 3 codons, named stop codons, stop the process. The objectives of this research are: using the codon bias frequency to classify hypertension gene sequences, making a feature vector of 19 attributes, and comparing the classification results of two classifiers to find out the better one.

II. RELATED WORKS

In the field of Biotechnology, Artificial Intelligence has been playing a key role since the late 90s. Various neural network algorithms have been implemented for prediction and detection using classification methods.
The combination of a neural network algorithm and a genetic algorithm gives better results in various cases. BPNN was proposed for pre-diagnosis of hypertension from phenotype or risk factors [3]. BPNN was implemented, along with bioinformatics techniques, to predict mutational diseases like breast cancer [4]. The Support Vector Machine (SVM) was used for neuro-psychiatric diseases [5].
There, features such as primary structures from the protein dataset available at NCBI were used as training parameters, with ADHD (Attention Deficit Hyperactivity Disorder), Dementia, Mood Disorder, OCD (Obsessive Compulsive Disorder) and Schizophrenia as outputs. Data classification was done by Durgesh K. Srivastava and Lekha Bhambhu for various datasets, such as the Diabetes, Heart, Satellite and Shuttle datasets, using kernels.
Through alignment after BLASTing, similarities or matchings of gene sequences can be found. Codon usage statistics were first introduced for this purpose by Staden and McLachlan in 1982 [6]. Codon usage bias varies from species to species and has a primary relation with the function of the gene [7][8]. In fact, gene sequences with the same functionality have different codon usage in different species [8]. Protein tertiary structure, gene expression level and tRNA have very close relations with the codon bias pattern [9]-[13]. Though most analyses related to codon usage are on "deep-branching species" such as viruses [14], bacteria [10][13][15], yeast [16][17], Caenorhabditis elegans [18], and Arabidopsis thaliana [19][20], some have been done on mammals [21][22]. In The National Centre for Biotechnology Information (NCBI), gene sequences responsible for hypertension and many other diseases are available in FASTA format [23].

III. METHODOLOGY

There are three steps in this methodology. They are:
· Preparing a user-defined dataset from available gene sequences
· Implementing the BPNN and SVM classifiers in the training phase
· Comparing the results of the two classifiers in the testing phase.

Fig. 1 Methodology of this work (gene sequences from the database are sorted into codons, the dataset is prepared, and each of the SVM and BPNN classifiers takes a decision)

A. Preparation of Dataset

A number of human gene sequences responsible for hypertension, along with some which are not, are collected from NCBI [24]. After that, codon frequency ratios for the 64 different codons are calculated using the "R" software. Only codons with a frequency of around 20% for human cells are considered as features and are added to the feature vector.
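A minimal sketch of this feature-extraction step, written in Python rather than the "R" software used in the paper. The 18 codons listed here are placeholders, since the paper does not enumerate which codons passed the frequency threshold; the vector is the sequence length followed by the 18 codon frequency ratios, giving the 19 attributes described below.

```python
# Sketch: build a 19-dimensional feature vector (sequence length plus the
# frequency ratios of 18 selected codons) from a raw gene sequence.
from collections import Counter

# Hypothetical list of 18 high-frequency codons kept as features; the actual
# codons selected in the paper are not listed, so these are placeholders.
SELECTED_CODONS = ["GAG", "AAG", "CTG", "GAA", "AAA", "GAT",
                   "GCC", "CAG", "GGC", "ATG", "GTG", "ACC",
                   "TTC", "AGC", "TCC", "CCC", "AGA", "GGA"]

def codon_ratios(sequence):
    """Count non-overlapping codons in reading frame 0 and return their ratios."""
    sequence = sequence.upper()
    codons = [sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3)]
    counts = Counter(codons)
    total = len(codons)
    return {c: counts[c] / total for c in SELECTED_CODONS}

def feature_vector(sequence):
    """Sequence length followed by the 18 selected codon frequency ratios."""
    ratios = codon_ratios(sequence)
    return [len(sequence)] + [ratios[c] for c in SELECTED_CODONS]

vec = feature_vector("ATGGAGAAGCTGGAA" * 4)
print(len(vec))  # 19
```

Because the features are ratios over the total codon count, sequences of different lengths map onto the same scale, which is the property the paper relies on.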
The lengths of the sequences and the codons with frequency above 20% (18 codons in total) are the features of the dataset.

B. Back Propagation Neural Network

The working and learning method of this supervised learning algorithm is similar to that of the human brain. This algorithm can learn from its errors. BPNN has three layers, named the input layer, the hidden layer and the output layer.
The hidden layer is connected with both the input and output layers through a number of nodes that are called neurons.

Fig. 2 Back Propagation Neural Network

The working procedure can be described as below [25]:
· Each input unit (xi, i = 1, 2, 3, …, n) receives an input signal xi and sends the signal to all units in the layer above it (the hidden units).
· Each hidden unit (zj, j = 1, 2, 3, …, p) sums its weighted input signals, z_inj = v0j + Σi xi vij, applies its activation function to get the output signal zj = f(z_inj), and sends that signal to all units in the layer above it (the output units).
· Each output unit (yk, k = 1, 2, 3, …, m) sums its weighted input signals, y_ink = w0k + Σj zj wjk, and applies its activation function to get the output signal yk = f(y_ink).
This feed-forward process, which sends the input signal up through the layers, is the first part of the back-propagation algorithm; the next step is the error calculation.
· Each output unit (yk, k = 1, 2, 3, …, m) receives a target pattern paired with the input training pattern, computes an error term δk = (tk − yk) f′(y_ink), computes the weight correction Δwjk = α δk zj (used to update the weight wjk), computes the bias correction Δw0k = α δk (used to update the bias value w0k), and sends δk to the units in the layer below it.
· Each hidden unit (zj, j = 1, 2, 3, …, p) sums its delta inputs from the units in the layer above it, δ_inj = Σk δk wjk, multiplies by the derivative of its activation function to compute its error term δj = δ_inj f′(z_inj), computes the weight correction Δvij = α δj xi (used to update vij), and computes its bias correction Δv0j = α δj (used to update v0j).

C. Support Vector Machine

The Support Vector Machine (SVM) is a supervised classification method that separates the classes with an optimal hyperplane maximizing the margin between the classes. The data points that are closest to the hyperplane are known as support vectors. SVM maps the input vector to a higher dimensional space where a maximal separating hyperplane is constructed.
Two parallel hyperplanes are constructed, one on each side of the hyperplane that separates the data. The separating hyperplane maximizes the distance between the two parallel hyperplanes [26].

Fig. 3 Support Vector Machine (the support vectors lie on the margins on either side of the separating hyperplane)

· Kernel Selection – Training vectors xi are mapped into a higher dimensional space by the function φ. Then SVM finds a linear separating hyperplane with the maximal margin in this higher dimensional space.
C > 0 is the penalty parameter of the error term. Furthermore, the kernel function is K(xi, xj) ≡ φ(xi)T φ(xj). In this work, the RBF kernel has been used.
· Model Selection – To select the model, some parameters (i.e. the cost parameter C and the kernel parameters γ, d) have been tuned.
· Cross-validation – From the training dataset, some subsets have been selected at test time. One of the subsets is used to test the training process to increase accuracy. C, γ and d have been calculated from this procedure for fine tuning. This full process is called cross-validation.
· LIBSVM – This library has been used to implement SVM [27]. At first it builds a model from the training dataset; then this trained network is used to classify the test dataset.
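The kernel selection, model selection and cross-validation steps above can be sketched with scikit-learn, whose SVC class is built on LIBSVM. The data and parameter grids here are illustrative stand-ins, not the paper's actual dataset or tuned values.

```python
# Sketch: RBF-kernel SVM with cross-validated tuning of C and gamma.
# scikit-learn's SVC wraps LIBSVM; the data are synthetic stand-ins for the
# 19-dimensional codon-frequency feature vectors.
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 19))            # 150 synthetic 19-D feature vectors
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # two synthetic classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# Grid search with 5-fold cross-validation over the penalty parameter C
# and the RBF kernel width gamma
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```

The `best_params_` found by the grid search play the role of the fine-tuned C and γ described above; the held-out test split mirrors the separate testing phase.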
IV. RESULT ANALYSIS

In this experiment, the performances of SVM and BPNN have been measured for sample sizes 60, 100 and 150. The True Positive Rate (TPR), False Negative Rate (FNR), False Positive Rate (FPR) and True Negative Rate (TNR) have been calculated for the SVM and BPNN classifiers.

TABLE I. VALUE OF TPR, TNR, FPR AND FNR FOR SVM AND BPNN

Sample No. |           SVM             |           BPNN
           |  TPR    FNR    FPR   TNR  |  TPR    FNR    FPR   TNR
60         | 15.00  35.00  35.00 15.00 | 75.00  16.67   8.33  0.00
100        | 75.00  15.00   5.00  5.00 | 70.00  20.00  10.00  0.00
150        | 38.33  11.67  11.67 38.33 | 90.00   3.33   3.33  3.33

Fig. 4 (a) TPR, TNR, FPR and FNR of SVM and BPNN for 60 samples, (b) TPR, TNR, FPR and FNR of SVM and BPNN for 100 samples and (c) TPR, TNR, FPR and FNR of SVM and BPNN for 150 samples.

In the above figure, the true positive rate and true negative rate of BPNN are better than those of SVM. The minimum error has been found for the BPNN classifier.
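The rates in Table I, and the accuracy, sensitivity and precision reported later, all follow from the confusion-matrix counts. A minimal sketch of the standard definitions (the paper's exact counting convention is not stated, so this is an assumption; the rates here are expressed as shares of the whole sample, which matches Table I's rows summing to 100):

```python
# Sketch: confusion-matrix rates and scores as used in the result analysis.
# tp, fn, fp, tn are raw counts; all functions return percentages.
def rates(tp, fn, fp, tn):
    total = tp + fn + fp + tn
    return {
        "TPR": 100 * tp / total,  # shares of the whole sample (rows of
        "FNR": 100 * fn / total,  # Table I sum to 100 under this reading)
        "FPR": 100 * fp / total,
        "TNR": 100 * tn / total,
    }

def scores(tp, fn, fp, tn):
    return {
        "accuracy":    100 * (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": 100 * tp / (tp + fn),  # recall of the positive class
        "precision":   100 * tp / (tp + fp),
    }

print(rates(9, 2, 2, 7))
print(scores(9, 2, 2, 7))
```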
Fig. 5 (a) ROC curve for 60 samples, (b) ROC curve for 100 samples and (c) ROC curve for 150 samples

For the ROC curves, the false positive rate is along the X axis and the true positive rate is along the Y axis. As a result, the top left corner has been considered the "ideal point". Here, the "steepness" of an ROC curve reflects maximizing the true positive rate while minimizing the false positive rate (type I error). The performance of BPNN has been better for sample sizes 60 and 150 (see fig. 5).
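An ROC curve like those in Fig. 5 is traced by sweeping a decision threshold over the classifier's scores and plotting FPR against TPR. A sketch with scikit-learn's `roc_curve` on illustrative values (not the paper's data):

```python
# Sketch: computing an ROC curve from classifier scores.
# y_true and y_score are illustrative values, not the paper's data.
from sklearn.metrics import auc, roc_curve

y_true  = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.3, 0.4, 0.8, 0.35, 0.6, 0.7, 0.9]

# X axis: false positive rate, Y axis: true positive rate
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(fpr, tpr)))
print(auc(fpr, tpr))  # area under the curve; 1.0 is the ideal classifier
```

A steeper curve (TPR rising quickly while FPR stays low) corresponds to a larger area under the curve, which is the sense in which the BPNN curves are described as sharper toward the true positive corner.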
The accuracy, sensitivity and precision have been calculated for the SVM and BPNN classifiers (see Table II).

TABLE II. ACCURACY, SENSITIVITY AND PRECISION VALUES FOR SVM AND BPNN

No. of  |              SVM               |              BPNN
Samples | Accuracy Sensitivity Precision | Accuracy Sensitivity Precision
60      |   30%       30%        30%     |   75%      81.82%      90%
100     |   50%       50%        50%     |   80%      94.73%      94.73%
150     |  76.66%     83.3%      93.75%  |  93.33%    77.77%      83.33%

Fig. 6 Training Phase: (a) Accuracy, Sensitivity and Precision of SVM and BPNN for 60 samples, (b) Accuracy, Sensitivity and Precision of SVM and BPNN for 100 samples and (c) Accuracy, Sensitivity and Precision of SVM and BPNN for 150 samples.

High rates of these three parameters represent a well-trained network (see Table II and fig.
6). From fig. 5 it can be seen that the ROC curve of BPNN is sharper toward the true positive corner. The accuracy rate has increased with the sample size for both BPNN and SVM. But for a given number of samples, the accuracy rate has always been higher for BPNN.
For sensitivity and precision, SVM has had lower rates than BPNN for 60 and 100 samples. Though SVM's sensitivity and precision have increased for 150 samples, its accuracy rate has remained smaller. From the above results and discussion it can be concluded that, for the purpose of classifying gene sequences using codon frequency ratios, BPNN has worked better than SVM.
But it should be noted that SVM has used two different datasets for the training and testing phases, whereas BPNN has had only one dataset for both. BPNN has randomly chosen data from the dataset for training, validation and testing purposes. As a result, this has helped BPNN to choose data from all over the dataset, not only from a selected range of separate datasets. It was expected that SVM would be the better classifier for this work. As the data are so closely interrelated, the drawing of the hyperplane is not efficient.
On the other hand, the limited number of samples has also been a factor. For a dataset including 1000 or more samples, SVM may have a chance to give a better result than BPNN.

V. CONCLUSION

After various methods like single nucleotide polymorphism (SNP) or mutation and amino acids have been used to define gene sequences, codon frequency has been implemented as a new approach here.
Though for many years codons have been considered a vital attribute of the genes of a specific species, codon bias frequency has not been used to describe genes from a full user-defined dataset. In this work, the codon bias frequency ratio has been used because lengths may differ, but the ratios of codons must be nearly similar, as protein production and synthesis should be similar for a specific task.