Bioinformatics is an integrated discipline that applies information technology and computer science to biology and medicine, both as professional fields and as bodies of knowledge. It draws on information systems, artificial intelligence, databases, algorithms, soft computing, software engineering, image processing, modeling and simulation, data mining, signal processing, the theory of computation and information, system and control theory, discrete mathematics, statistics, and circuit theory. Machine learning, on the other hand, is a subfield of artificial intelligence concerned with techniques that allow computers to adapt their responses and initiate actions (Zhang et al., 2009). Its applications span search engines, natural language processing, bioinformatics, medical diagnosis, cheminformatics, stock market analysis, game playing, and computer vision.
Machine learning has developed out of necessity, because the volume of knowledge that demands sophisticated, technologically advanced handling has grown consistently. A prominent example is the revolution in genomics, which has produced large quantities of amino acid and nucleotide sequence data. Storing and using this essential information has made machine learning imperative and has led to the building of several sophisticated interfaces through which researchers can access the available databases. In general, the ability of computers to learn across this large number of applications has not only solved a great many technological problems but has also provided fertile ground for knowledge acquisition (Fasconi & Nato, 2003).
Machine Learning Approaches
The process of machine learning involves adopting approaches that serve separate objectives or functions. The two most widely applied learning scenarios each have a distinct criterion function (Zhang et al., 2009). They are commonly referred to as the supervised and unsupervised selection approaches, or criteria.
The supervised approach may also be referred to as discrimination, prediction, or classification. In this approach, algorithms are developed to assign data to classes that are defined a priori. The algorithm is constructed on a training dataset and then tested comprehensively on an independent dataset to examine its accuracy and efficiency. For regression and classification, support vector machines form a group of related supervised learning methods. In their simplest form they perform linear classification, which constructs a straight line that provides a distinct boundary between two classes in two dimensions (Zhang et al., 2009). In higher dimensions such boundaries are referred to as hyperplanes, and the dot product may be replaced by a kernel function so that a maximum-margin hyperplane can be fitted. A decision tree structure may also be applied, in which the leaves represent classifications and the branches represent the conjunctions of features that lead to those classifications. A decision tree can be converted efficiently into a set of production rules. The supervised approach also includes artificial neural networks (ANNs), groups of interconnected nodes that process information using a computational model. Information flowing through the network, whether external or internal, may change the ANN's structure. An ANN can be used to model the relationship between inputs and outputs. The multi-layer perceptron (MLP) and the radial basis function (RBF) network are the most widely used ANN models.
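The train-then-test workflow described above can be sketched with a very simple linear classifier. The example below is only an illustration under stated assumptions: it uses a basic perceptron rather than a maximum-margin support vector machine, and the function names and the small two-dimensional dataset are invented for this sketch.

```python
def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Fit weights w and bias b so that sign(w.x + b) matches labels (+1 or -1)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified: shift the boundary toward x
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Classify x by the side of the line w.x + b = 0 on which it falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


def accuracy(w, b, samples, labels):
    correct = sum(predict(w, b, x) == y for x, y in zip(samples, labels))
    return correct / len(samples)


# Hypothetical, linearly separable data: class -1 near (1, 1), class +1 near (4.5, 4.5)
train_x = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (4.0, 4.5), (5.0, 4.0), (4.5, 5.0)]
train_y = [-1, -1, -1, 1, 1, 1]

# Independent test set, never shown to the learner during training
test_x = [(1.2, 1.4), (4.8, 4.6)]
test_y = [-1, 1]

w, b = train_perceptron(train_x, train_y)
test_accuracy = accuracy(w, b, test_x, test_y)
```

The learned line separates the two classes, and the accuracy is measured only on points held out from training, which mirrors the separation between the training dataset and the independent test dataset.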
The unsupervised approach involves two distinct ways of designing selection criteria. They are distinguished by their performance metric: the classification-driven criterion and the fidelity-driven criterion. The fidelity-driven criterion depends on how much of the original information is retained or discarded after the feature dimension is reduced. The unsupervised approach operates on the basis of cluster analysis, in which a clustering method separates objects into a predetermined number of groups according to a pattern that optimizes a specific criterion function.
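Clustering into a predetermined number of groups can be illustrated with k-means, one common cluster analysis method. The sketch below is an assumption chosen for illustration, since the text does not name a specific algorithm; the criterion it reduces is the squared distance between each object and its cluster centre, and the data points and the naive initialization are hypothetical.

```python
def kmeans(points, k, iterations=10):
    """Partition points into k groups by reducing within-cluster squared distance."""
    centroids = list(points[:k])  # assumption: naive start from the first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        for i, p in enumerate(points):
            distances = [sum((a - c) ** 2 for a, c in zip(p, centre))
                         for centre in centroids]
            assignments[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids


# Hypothetical unlabeled data with two visibly separate groups
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 1.5), (8.5, 9.0)]
assignments, centroids = kmeans(points, k=2)
```

No class labels are supplied at any point. The grouping emerges only from the distances between objects, which is what distinguishes this from the supervised case.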
The term neural network can be understood in two senses. The first is the biological concept, which refers to the networks of neurons in the nervous system studied in neuroscience. The second describes networks of interconnected artificial neurons built on the principles of biological neurons. For classification tasks, the multi-layer perceptron (MLP) is an instrumental method. An MLP is a feed-forward neural network that has one or more layers between the input and output layers. Feed-forward means that data flows in one direction, from the input layer to the output layer. The network is trained with the backpropagation learning algorithm. MLPs are applied to a wide range of problems in pattern classification, prediction, recognition, and approximation. In particular, an MLP can solve problems that are not linearly separable, which a single-layer perceptron cannot handle.
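The one-directional flow of data through a hidden layer can be illustrated with the XOR function, a standard example of a problem that is not linearly separable. The sketch below is a minimal illustration rather than a trained model: the weights are chosen by hand, a step activation is assumed for readability, and all names are hypothetical.

```python
def step(z):
    """Threshold activation: the unit fires (1) when its net input is positive."""
    return 1 if z > 0 else 0


def layer(inputs, weights, biases):
    """One fully connected layer: every unit takes a weighted sum of all inputs."""
    return [step(sum(w * x for w, x in zip(unit_weights, inputs)) + bias)
            for unit_weights, bias in zip(weights, biases)]


# Hand-chosen (not learned) parameters, assumed for this illustration
HIDDEN_W = [[1.0, 1.0], [1.0, 1.0]]
HIDDEN_B = [-0.5, -1.5]  # hidden unit 0 behaves like OR, hidden unit 1 like AND
OUTPUT_W = [[1.0, -1.0]]
OUTPUT_B = [-0.5]        # output fires for "OR but not AND", which is XOR


def mlp_xor(x1, x2):
    """Feed-forward pass: input layer -> hidden layer -> output layer."""
    hidden = layer([x1, x2], HIDDEN_W, HIDDEN_B)
    return layer(hidden, OUTPUT_W, OUTPUT_B)[0]


xor_table = {(a, b): mlp_xor(a, b) for a in (0, 1) for b in (0, 1)}
```

No single straight line separates the inputs (0, 1) and (1, 0) from (0, 0) and (1, 1), so a network without a hidden layer cannot represent this mapping. The hidden layer re-encodes the inputs so that the output unit can.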
Backpropagation is used to train the multi-layer perceptron, which is a feed-forward network. Training may be carried out in either of two modes: sequential mode, also called online, per-pattern, or stochastic mode, and batch mode, also called offline or per-epoch mode. Sequential mode requires little storage for each weighted connection. Because patterns are presented in random order and the weights are updated after every pattern, the search through weight space is stochastic, which lowers the risk of becoming trapped in a local minimum. Sequential mode can also take advantage of any redundancy in the training set, and it is simple to implement. Batch mode offers a higher learning speed than sequential mode and is simple to parallelize. However, a multi-layer perceptron with non-linear activation functions presents a complicated error surface with no single minimum (Gurney, 2003).
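The difference between the two training modes lies in when the weights are updated. The sketch below illustrates only that timing difference, under a simplifying assumption: it trains a single linear neuron with the delta rule instead of running full backpropagation through a multi-layer network. The data, learning rate, and function names are hypothetical.

```python
import random


def train_sequential(data, lr=0.1, epochs=500, seed=0):
    """Sequential (online) mode: update the weights after every pattern."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    patterns = list(data)
    for _ in range(epochs):
        rng.shuffle(patterns)  # random presentation order makes the search stochastic
        for x, target in patterns:
            error = (w * x + b) - target
            w -= lr * error * x
            b -= lr * error
    return w, b


def train_batch(data, lr=0.1, epochs=500):
    """Batch mode: average the gradient over the epoch, then update once."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(((w * x + b) - t) * x for x, t in data) / n
        grad_b = sum((w * x + b) - t for x, t in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical noise-free patterns drawn from target = 2 * x + 1
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

w_seq, b_seq = train_sequential(data)
w_bat, b_bat = train_batch(data)
```

In the batch version the per-pattern gradient terms inside each sum are independent of one another, which is why that mode is straightforward to parallelize. The sequential version keeps no accumulated gradient between patterns.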
Random forest is an ensemble learning method that consists of a bagging of unpruned decision trees, with a randomized selection of features at every split. Decision trees are the individual learners that are combined, and they are popularly adopted in data exploration. One example of a decision tree is CART (classification and regression tree). The random forest integrates bagging with random feature selection to construct a collection of decision trees with controlled variation. Random forests have several advantages. The learning algorithm yields a highly accurate classifier, and it runs efficiently on very large databases. It can handle a large number of input variables without deleting any, and it provides estimates of which variables are important in the classification. It generates an internal unbiased estimate of the generalization error as forest building progresses. It offers an effective method for estimating missing data and maintains accuracy when a large proportion of the data is missing, and it provides effective methods for balancing error in data sets with unbalanced class populations. Prototypes can be computed that give information about the relationship between the variables and the classification. Proximities between pairs of cases can be computed, which is useful for clustering, locating outliers, and giving informative views of the data. Finally, random forests offer an experimental method for detecting interactions between variables.
On the other hand, random forests sometimes overfit on datasets that involve noisy classification or regression tasks. Their classifications are also difficult for humans to interpret. It is worth noting that a data set of 200 random forests has been developed so that an intuitive visualization of the model space could be produced.
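The main ingredients of a random forest (bootstrap sampling, unpruned trees, a random choice of features at each split, and an internal error estimate from the cases each tree did not see) can be sketched in a few dozen lines. The code below is an illustrative simplification and not a reference implementation: the Gini impurity split rule, the square-root rule for the number of features per split, the function names, and the tiny dataset are all assumptions made for this example.

```python
import random
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(rows, labels, n_features, rng):
    """Grow an unpruned tree; each split considers a random subset of features."""
    if len(set(labels)) == 1:
        return labels[0]  # pure leaf
    best = None
    for f in rng.sample(range(len(rows[0])), n_features):
        values = sorted(set(r[f] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold between neighbouring values
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:  # no usable split: fall back to the majority label
        return Counter(labels).most_common(1)[0][0]
    _, f, t = best
    left_i = [i for i, r in enumerate(rows) if r[f] <= t]
    right_i = [i for i, r in enumerate(rows) if r[f] > t]
    return (f, t,
            build_tree([rows[i] for i in left_i], [labels[i] for i in left_i], n_features, rng),
            build_tree([rows[i] for i in right_i], [labels[i] for i in right_i], n_features, rng))


def tree_predict(node, row):
    while isinstance(node, tuple):  # internal nodes are tuples, leaves are labels
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def random_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    n_features = max(1, int(len(rows[0]) ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample (bagging)
        tree = build_tree([rows[i] for i in idx], [labels[i] for i in idx], n_features, rng)
        forest.append((tree, set(idx)))  # remember which cases this tree saw
    return forest


def forest_predict(forest, row):
    votes = Counter(tree_predict(tree, row) for tree, _ in forest)
    return votes.most_common(1)[0][0]


def oob_error(forest, rows, labels):
    """Error estimate using, for each case, only the trees that never saw it."""
    wrong = counted = 0
    for i, (row, y) in enumerate(zip(rows, labels)):
        votes = Counter(tree_predict(t, row) for t, seen in forest if i not in seen)
        if votes:
            counted += 1
            wrong += votes.most_common(1)[0][0] != y
    return wrong / counted if counted else 0.0


# Hypothetical data: class "A" near (1.5, 1.5), class "B" near (8.5, 8.5)
rows = [(1.0, 1.2), (1.5, 1.8), (2.0, 1.1), (1.2, 2.0), (1.8, 1.5),
        (8.0, 8.5), (8.5, 9.0), (9.0, 8.2), (8.2, 8.8), (8.8, 8.1)]
labels = ["A"] * 5 + ["B"] * 5

forest = random_forest(rows, labels)
error_estimate = oob_error(forest, rows, labels)
```

Calling forest_predict(forest, (1.5, 1.5)) returns the majority vote of the individual trees. Because each vote comes from a different tree with its own bootstrap sample and feature choices, the forest as a whole is harder to read than a single decision tree, which is the interpretability limitation noted above.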