Xylanase
has a wide range of potential biotechnological applications. Recently, interest in xylanases has markedly increased due to their potential industrial uses, particularly in pulping and bleaching processes (Beg et al. 2001; Diaz et al. 2004; Oliveira et al. 2006). The thermo-alkalophilic conditions of xylanase-aided bleaching (60-80°C, pH 8-10), combined with the demand for a high level of activity, require a set of characteristics not usually found in native xylanases. An alternative for obtaining new thermostable enzymes is the modification of presently used xylanases to make them more stable under extreme conditions. During the last twenty years, rational site-directed mutagenesis (Moreau et al. 1994) and irrational directed evolution (Fenel et al. 2006) have become routine approaches for engineering xylanases toward this goal. In addition, the so-called 'semi-rational' approach, which uses computational techniques to guide mutagenesis, combines the benefits of both strategies (Chica et al. 2005; Hayes et al. 2002).

The support vector machine (SVM) is a promising classification and regression method developed by Vapnik (1998). SVM has two distinct features: first, it has high generalization ability; second, it requires only a small number of training samples. According to the literature, SVM has shown promising results on several biological problems and is becoming established as a standard tool in bioinformatics (Ward et al. 2003; Cai et al. 2004; Chen et al. 2006).

In the present investigation, SVM was used as a machine learning technique to establish a model for predicting the optimum temperature of xylanases in the G/11 family. During the process, the uniform-design method was applied to optimize the running parameters of the SVM. The aim was to establish a new quantitative structure-property relationship (QSPR) model and to confirm the possibility of predicting the optimum temperature of xylanases. The performance of the SVM was better than that of the back-propagation neural network (BPNN) and of previously reported models, and the model may be useful for computational virtual screening in the engineering of new, more thermostable xylanases.
To reduce redundancy, we downloaded the xylanase sequences from UniProt, because it contains records with full manual annotation, or computer-assisted and manually verified annotation, performed by biologists and based on published literature and sequence analysis (Bairoch et al. 2005). The optimum temperatures of the xylanases, obtained from Liu's work (Liu et al. 2006), are shown in Table 1. Altogether, 25 xylanase sequences and their corresponding optimum temperatures were obtained.
SVM can be applied to both classification and regression; here we used support vector regression (SVR). The basic idea of SVR is to map the input data $x$ into a higher-dimensional feature space via a nonlinear mapping $\Phi$ and then to perform linear regression in that space. Given a training set $\{(x_i, d_i)\}_{i=1}^{n}$ (where $x_i$ is the input vector, $d_i$ is the desired value, and $n$ is the total number of data patterns), SVM approximates the function using Equation 1:

$$f(x) = w \cdot \Phi(x) + b \qquad (1)$$

where $\Phi(x)$ denotes the nonlinear mapping of the input into the feature space, and $w$ and $b$ are coefficients estimated by minimizing the regularized risk function given by Equation 2:

$$R = C \frac{1}{n} \sum_{i=1}^{n} L_{\varepsilon}(d_i, y_i) + \frac{1}{2}\|w\|^2 \qquad (2)$$

The first term in Equation 2 is the empirical error (risk), measured by the $\varepsilon$-insensitive loss function given by Equation 3:

$$L_{\varepsilon}(d, y) = \begin{cases} |d - y| - \varepsilon, & |d - y| \geq \varepsilon \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

On the other hand, the second term in Equation 2 is the regularization term, and $C$ is the penalty constant determining the trade-off between the two. To obtain the estimates of $w$ and $b$, the minimization is transformed into its dual problem by introducing Lagrange multipliers. Finally, the regression function given by Equation 1 takes the following explicit form:

$$f(x) = \sum_{i=1}^{n} (a_i - a_i^{*})\, K(x, x_i) + b$$

where $a_i$ and $a_i^{*}$ are the introduced Lagrange multipliers, satisfying $a_i \cdot a_i^{*} = 0$, $a_i \geq 0$, $a_i^{*} \geq 0$, and $K$ is the kernel function corresponding to the inner product of the mapped vectors, $K(x, x_i) = \Phi(x) \cdot \Phi(x_i)$.
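The ε-insensitive loss of Equation 3 can be illustrated with a short sketch in plain Python (the function name `eps_insensitive_loss` is ours, not from the original study):

```python
def eps_insensitive_loss(d, y, eps):
    """epsilon-insensitive loss L_eps(d, y) from Equation 3:
    deviations smaller than eps cost nothing; larger deviations
    are penalized linearly by the amount they exceed eps."""
    dev = abs(d - y)
    return dev - eps if dev >= eps else 0.0

# A prediction inside the eps-tube around the target is free;
# outside the tube, the loss grows linearly with the deviation.
print(eps_insensitive_loss(50.0, 50.5, eps=1.0))  # inside the tube -> 0.0
print(eps_insensitive_loss(50.0, 53.0, eps=1.0))  # 3.0 - 1.0 -> 2.0
```

This tolerance band is what makes the SVR solution sparse: only samples on or outside the tube become support vectors with non-zero Lagrange multipliers.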
Linear and radial basis function (RBF, Gaussian) kernels are two commonly used kernels in SVR (Smola and Schölkopf, 1998) and are given by:

Linear kernel: $K(x_i, x_j) = x_i \cdot x_j$

RBF kernel: $K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right)$

where $\gamma$ is a constant, the parameter of the kernel; it controls the width of the Gaussian kernel (although it is not the width itself) and therefore controls the generalization ability of the SVM. The generalization performance of SVR depends on a good setting of the parameters $C$, $\varepsilon$, and the kernel parameters.
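The effect of γ on the RBF kernel can be sketched in a few lines of plain Python (the helper name `rbf_kernel` and the sample points are ours, for illustration only):

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF (Gaussian) kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x, z = [0.1, 0.2], [0.3, 0.1]
# Identical inputs always map to similarity 1, regardless of gamma.
print(rbf_kernel(x, x, gamma=1.0))
# A larger gamma narrows the Gaussian, so distinct points look less
# similar -- the model fits more locally and may generalize worse.
print(rbf_kernel(x, z, gamma=0.5) > rbf_kernel(x, z, gamma=5.0))
```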
Uniform design (UD) was first proposed by Fang (1980), based on theoretical accomplishments in number-theoretic methods. Generally speaking, UD is a form of 'space filling' design. Suppose that the experimental domain is the unit cube $C^s = [0,1]^s$. The mean of a response function $h(x)$ over the experimental domain, $E\,h(x)$, can be estimated by the sample mean

$$\bar{h}(P) = \frac{1}{n} \sum_{k=1}^{n} h(x_k)$$

where $P = \{x_1, \ldots, x_n\}$ is a set of $n$ experimental points over the domain. The famous Koksma-Hlawka inequality gives the upper error bound of the estimate of $E\,h(x)$:

$$\left| E\,h(x) - \bar{h}(P) \right| \leq D(P)\, V(h)$$

where $D(P)$ is the star discrepancy of $P$ and $V(h)$ is the total variation of $h$ in the sense of Hardy and Krause. A uniform design seeks the set of points $P$ with the smallest discrepancy, and hence the smallest error bound.

The overall performances of SVM and BPNN were evaluated in terms of the root-mean-square error (RMSE) and the mean absolute error (MAE), defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}, \qquad \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

where $y_i$ is the experimental optimum temperature and $\hat{y}_i$ is the predicted value of the $i$-th sample.
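The two error measures are straightforward to compute; a minimal sketch in plain Python (the temperature values below are hypothetical, not the study's data):

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root-mean-square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

# Hypothetical optimum temperatures (degrees C) and model predictions.
y_exp  = [50.0, 60.0, 75.0, 55.0]
y_pred = [52.0, 58.0, 75.0, 51.0]
print(mae(y_exp, y_pred))   # (2 + 2 + 0 + 4) / 4 -> 2.0
print(rmse(y_exp, y_pred))  # sqrt((4 + 4 + 0 + 16) / 4) ~ 2.449
```

Because RMSE squares the residuals before averaging, it penalizes large outlier errors more heavily than MAE does, which is why the two metrics are reported together.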
The performance and robustness of the models were evaluated by cross-validation. The jackknife test (leave-one-out, LOO) was used; it is deemed the most rigorous and objective method, with the least arbitrariness, as demonstrated by an incisive analysis in a recent review (Chou and Shen, 2007). We used 24 data points to train the models and tested them with the remaining one. This was repeated 25 times, each time leaving a different data point out of the training set and using it to validate the resulting models.
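The LOO split scheme described above can be sketched in plain Python (the generator name `loo_splits` is ours):

```python
def loo_splits(n):
    """Leave-one-out splits: for each sample i, train on the other
    n - 1 samples and validate on sample i alone."""
    for i in range(n):
        train = [j for j in range(n) if j != i]
        yield train, i

n_samples = 25  # size of the xylanase data set in this study
splits = list(loo_splits(n_samples))
print(len(splits))        # 25 folds, one per sample
print(len(splits[0][0]))  # each fold trains on 24 samples
```

Every sample is held out exactly once, so the 25 per-fold errors can be averaged directly into the LOO MAE and RMSE.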
To calculate the compositions of the 20 amino acids in the xylanases, BioEdit software (version 5.0.9) was used, and each xylanase in the data set was then characterized by a vector x whose 20 components are the fractions of the 20 amino acids in its sequence.
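The composition vector can be computed directly from a sequence; a plain-Python equivalent of what BioEdit reports (the toy fragment below is not a real xylanase sequence):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aa_composition(seq):
    """Return the 20-component feature vector x: the fraction of each
    standard amino acid in the sequence, in a fixed residue order."""
    seq = seq.upper()
    n = len(seq)
    return [seq.count(aa) / n for aa in AMINO_ACIDS]

# Toy fragment for illustration only.
x = aa_composition("MKTAYIAKQR")
print(len(x))                     # 20 components
print(x[AMINO_ACIDS.index("A")])  # 2 alanines in 10 residues -> 0.2
```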
Similar to other multivariate statistical models, the performance of SVM regression depends on the combination of several parameters: the penalty value $C$, the width $\varepsilon$ of the $\varepsilon$-insensitive loss function, and the kernel parameters. For the linear kernel, there are only two parameters, $C$ and $\varepsilon$. For the RBF kernel, there are three parameters, $C$, $\varepsilon$, and $\gamma$. From the results in Table 3, we can see that different combinations of the three parameters may result in different MAE and RMSE values. According to Table 2 and Table 3, one can observe that many different combinations of parameters resulted in the same LOO cross-validation and training errors, which means that SVMs are not very sensitive to their parameters. Meanwhile, the RBF kernel is superior to the linear kernel, in accordance with previous research on support vector regression tasks (Xue et al. 2004; Liu et al. 2005).
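A uniform-design parameter table can be generated with the good-lattice-point construction; the sketch below builds a small 7-run, 3-factor design and maps the level indices onto hypothetical (C, ε, γ) ranges (the actual levels used in the study are those of Table 2, and the function and range names here are our own):

```python
from math import gcd

def uniform_design(n, generators):
    """Good-lattice-point uniform design table: entry (k, i) is
    (k * h_i) mod n, with 0 mapped to n. Each generator h_i must be
    coprime to n so that every column is a permutation of 1..n."""
    assert all(gcd(h, n) == 1 for h in generators)
    return [[(k * h) % n or n for h in generators] for k in range(1, n + 1)]

# 7-run design in 3 factors; generators 1, 2, 3 are coprime to 7.
table = uniform_design(7, (1, 2, 3))

# Map level indices to hypothetical SVR parameter grids (C, eps, gamma);
# each run of the design picks one level of each factor.
C_levels   = [2.0 ** i for i in range(-1, 6)]     # 0.5 .. 32
eps_levels = [0.01 * 2 ** i for i in range(7)]    # 0.01 .. 0.64
g_levels   = [0.005 * 2 ** i for i in range(7)]   # 0.005 .. 0.32
trials = [(C_levels[a - 1], eps_levels[b - 1], g_levels[c - 1])
          for a, b, c in table]
for t in trials:
    print(t)
```

Only 7 SVR trainings then scatter evenly over the 3-parameter space, instead of the 7³ = 343 runs a full factorial grid would need; the best run's neighborhood can be refined afterwards.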
Recently, a few studies have shown that SVM yields better results than alternative machine learning techniques such as BPNN. In this study, we compared the performance of SVM and BPNN on the same data set. The architecture of the BPNN was also optimized by UD, and the results are shown in Table 4. During the process, the maximum number of iterations was set to 1000, and the learning rate was among the parameters optimized. According
to Table 4, one can observe that different combinations of parameters
resulted in different LOO cross-validation and training MAEs and RMSEs. This
means that BPNN may be more sensitive to its running parameters when compared
with SVMs, especially the RBF SVM. Moreover, the LOO cross-validation results of the BPNN differed widely when different training and validation splits were employed. The maximum and minimum LOO cross-validation MAEs were 38.89°C and 0.2°C, respectively, while the corresponding MAEs of the RBF SVM were 27.44°C and 0.03°C. The prediction errors of all 25 runs of BPNN and SVM are shown in Figure 2. For the linear SVM, 13 samples showed only small absolute differences from their experimental optimum temperatures.

To validate the prediction
models, we present two examples. Firstly, we cloned the xylanase gene of … Some important parameters (…) were determined.

Recently, two linear models relating both single-residue and dipeptide compositions to the optimum temperature of G/11 family xylanases were established based on stepwise regression (Liu et al. 2006). The training RMSEs of these models were 5.03°C and 1.91°C, respectively, and they calculated the maximal and minimal optimum temperatures of xylanase as 120.84°C and 10.83°C. From these results, we can conclude that the model established here is considerably more accurate. This indicates that the relationship between amino acid composition and xylanase optimum temperature is very complicated, and that satisfactory results may not be obtained with simple linear models, while SVM is a more powerful tool for predicting nonlinear relationships.

Using the crystal structure information of a xylanase, one can pinpoint the residues that may be suitable for mutation. Consequently, saturation mutagenesis (in which all 20 native amino acids are tested at each pinpointed position) can be applied to generate large virtual libraries of mutants. Our model for predicting xylanase optimum temperatures can then be used to pre-screen these virtual libraries: the optimal sequences are chosen based on their predicted optimum temperatures, and the selected mutants are then generated experimentally by mutagenesis and recombination. The model can therefore reduce the sequence space, while maintaining broad diversity, to a number easily amenable to experimental screening.

As analyzed above, SVM showed only a minor improvement over BPNN in our study, and the large variation in prediction error (from 0.03°C to 27.44°C) indicates that it should be used with some caution. At the same time, the MAE of the LOO cross-validation was 6.88°C and the mean absolute percent error was 12.8%; one can see that this is not yet good enough for directing xylanase engineering. Further improvement may be achieved by collecting larger data sets of higher quality.
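The virtual pre-screening workflow can be sketched as follows. The scoring function below is a deliberately crude stand-in labeled as such; in practice the trained SVM model would supply the predicted optimum temperature, and the wild-type sequence and positions are toy values:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def saturation_mutants(seq, positions):
    """Generate every single-point mutant at the pinpointed positions:
    19 substitutions per position (the wild-type residue is skipped)."""
    mutants = []
    for pos in positions:
        for aa in AMINO_ACIDS:
            if aa != seq[pos]:
                mutants.append(seq[:pos] + aa + seq[pos + 1:])
    return mutants

def predicted_topt(seq):
    """Stand-in scoring stub for illustration only; in practice the
    trained SVM QSPR model would predict the optimum temperature."""
    return 50.0 + 10.0 * seq.count("P") / len(seq)  # hypothetical

wild_type = "MKTAYIAKQR"  # toy sequence, not a real xylanase
library = saturation_mutants(wild_type, positions=[2, 5])
ranked = sorted(library, key=predicted_topt, reverse=True)
print(len(library))  # 19 mutants per position -> 38 candidates
print(ranked[0])     # best-scoring mutant under the stub model
```

Ranking the virtual library this way shrinks the candidate set before any mutant is constructed in the laboratory, which is exactly the screening economy the model is meant to provide.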
It should be possible to increase the number of data entries and to eliminate noisy entries using the updated databases. We believe that when the MAE of LOO cross-validation falls within 5°C, the model may be good enough for directing xylanase engineering, and our results are close to this objective.

BAIROCH,
A.; APWEILER, R.; WU, C.H.; BARKER, W.C.; BOECKMANN, B.; FERRO, S.; GASTEIGER, E.; HUANG, H.; LOPEZ, R.; MAGRANE, M.; MARTIN, M.J.; NATALE, D.A.; O'DONOVAN, C.; REDASCHI, N. and YEH, L.S. (2005). The universal protein resource (UniProt).

BEG, Q.A.; KAPOOR, M.; MAHAJAN, G. and HOONDAL, S. (2001). Microbial xylanases and their industrial applications: A review.

CAI, C.Z.; HAN, L.Y.; JI, Z.L. and CHEN, Y.Z. (2004). Enzyme family classification by support vector machines.

CHEN, C.; TIAN, Y.X.; ZOU, X.Y.; CAI, P.X. and MO, J.Y. (2006). Using pseudo-amino acid composition and support vector machine to predict protein structural class.

CHICA, R.A.; DOUCET, N. and PELLETIER, J.N. (2005). Semi-rational approaches to engineering enzyme activity: Combining the benefits of directed evolution and rational design.

CHOU, K.C. and SHEN, H.B. (2007). Recent progresses in protein subcellular location prediction.

CORTES, C. and VAPNIK, V. (1995). Support-vector networks.

DIAZ, M.; RODRIGUEZ, S.; FERNÁNDEZ-ABALOS, J.M.; RIVAS, J.D.L.; RUIZ-ARRIBAS, A.; SHNYROV, V.L. and SANTAMARÍA, R.I. (2004). Single mutations of residues outside the active center of the xylanase Xys1Δ from …

FANG, K.T. (1980). The uniform design: Application of number-theoretic methods in experimental design.

FANG, K.T. and YANG, Z.H. (2000). On uniform design of experiments with restricted mixtures and generation of uniform distribution on some domains.

FENEL, F.; ZITTING, A.J. and KANTELINEN, A. (2006). Increased alkali stability in …

FRANK, E.; HALL, M.; TRIGG, L.; HOLMES, G. and WITTEN, I.H. (2004). Data mining in bioinformatics using …

FU, X.P.; WANG, W.Y. and ZHANG, G.Y. (2012). Construction of an expression vector with elastin-like polypeptide tag to purify the xylanase with non-chromatographic method.

HAYES, R.J.; BENTZIEN, J.; ARY, M.L.; HWANG, M.Y.; JACINTO, J.M.; VIELMETTER, J.; KUNDU, A. and DAHIYAT, B.I. (2002). Combining computational and experimental screening for rapid optimization of protein properties.

LIANG, Y.Z.; FANG, K.T. and XU, Q.S. (2001). Uniform design and its applications in chemistry and chemical engineering.

LIU, H.X.; YAO, X.J.; XUE, C.X.; ZHANG, R.S.; LIU, M.C.; HU, Z.D. and FAN, B.T. (2005). Study of quantitative structure-mobility relationship of the peptides based on the structural descriptors and support vector machines.

LIU, L.; DONG, H.; WANG, S.; CHEN, H. and SHAO, W. (2006). Computational analysis of di-peptides correlated with the optimal temperature in G/11 xylanase.

MILDVAN, A.S. (2004). Inverse thinking about double mutants of enzymes.

MOREAU, A.; SHARECK, F.; KLUEPFEL, D. and MOROSOLI, R. (1994). Increase in catalytic activity and thermostability of the xylanase A of …

OLIVEIRA, L.A.; PORTO, A.L.F. and TAMBOURGI, E.B. (2006). Production of xylanase and protease by …

SMOLA, A.J. and SCHÖLKOPF, B. (1998). …

VAPNIK, V. (1998). …

WARD, J.J.; MCGUFFIN, L.J.; BUXTON, B.F. and JONES, D.T. (2003). Secondary structure prediction with support vector machines.

XUE, C.X.; ZHANG, R.S.; LIU, H.X.; LIU, M.C.; HU, Z.D. and FAN, B.T. (2004). Support vector machines-based quantitative structure-property relationship for the prediction of heat capacity.

YAO, X.J.; PANAYE, A.; DOUCET, J.P.; ZHANG, R.S.; CHEN, H.F.; LIU, M.C.; HU, Z.D. and FAN, B.T. (2004). Comparative study of QSAR/QSPR correlations using support vector machines, radial basis function neural networks, and multiple linear regressions.

ZHANG, L.; LIANG, Y.Z.; JIANG, J.H.; YU, R.Q. and FANG, K.T. (1998). Uniform design applied to nonlinear multivariate calibration by ANN.

Note: Electronic Journal of Biotechnology is not responsible if on-line references cited on manuscripts are not available any more after the date of publication.