solution of the heterogeneous data integration problem is presented, together with an explanation of the criteria to be used in assessing its validity. The tools to be used are also indicated.
The proposed solution is to use semantic web technologies (the Semantic Data Integration Middleware (SIM) architecture) for the initial integration process (Cardoso, 2007) and then couple it with a broker architecture to improve integration and interoperability while solving the problem of multi-level impedance (Kashyap and Sheth, 2002).
For a detailed diagram, see the figure below.
Integration via the semantic web technologies
According to Barnett and Standing (2001), the rapid changes in business environments brought about by the adoption of internet-based technologies have created the need for improved business models, improved network systems and alliances, and creative marketing strategies. The strategy developed for integrating heterogeneous data must take into account both organization-specific data and general internet-based information. The aim is a semantic web that benefits individuals and organizations alike. In pursuit of competitive advantage, organizations employ business-mediated channels to create internal and external value. They do so by formulating technology-convergent strategies (through heterogeneous data integration) and by organizing resources around knowledge and the relationships within it, as pointed out by Rayport and Jaworski (2001). Internal and external value is thus created from the information available and from the way knowledge-related resources and their relationships are organized. This requires organizations to identify their various data assets, which may take the form of relational databases, plain text files, web pages, XML files, Electronic Data Interchange (EDI) documents and web services. The proposed solution for this project should be able to integrate information from heterogeneous, autonomous and distributed (HAD) data sources. As pointed out by Ouksel and Sheth (1999), three forms of heterogeneity must be addressed. The first is syntactic heterogeneity, in which the technologies supporting the data sources differ (such as databases and web pages).
To provide transactional data, it is important to use the Extensible Markup Language (XML), since it supports consistent and reliable XML streams and web services (XML, 2005). The second type of heterogeneity to be addressed is schematic heterogeneity, in which the data source schemas have different structures. Semantic heterogeneity is the last form of heterogeneity to be addressed by the proposed solution. XML is to be used to provide syntactic interoperability (Bussler, 2003). Its downside is that it lacks the semantics required by the current web environment (Shabo et al., 2006). The proposed solution should be capable of solving the semantic heterogeneity problem by enabling autonomous, heterogeneous and distributed systems to share and exchange information in a semantically meaningful manner, as pointed out by Sheth (1998). The solution is to employ the capabilities of the semantic web through the concept of a shared ontology. One of the main benefits of employing semantic web services is their ability to meet the organizational need to integrate data from semantically dissimilar sources. The fact that semantic web services have been deployed successfully in bioinformatics, digital libraries and other fields is a strong motivator for this project. The solution to data integration in this project entails the use of Semantic data Integration Middleware (SIM) and its subsequent integration with a broker architecture to improve integration and interoperability. This serves as a means of resolving multi-level impedance and achieving unified data integration.
Semantic data Integration Middleware (SIM)
This is a data integration technique based on a single query. It integrates information that resides in different data sources with dissimilar structures, formats, schemas and semantics. A data wrapper, or extractor, is used to transform the data into semantic knowledge. The middleware extractor is ontology-based and multi-sourced, as pointed out by Silva and Cardoso (2006). The SIM is made up of two main modules: 1) the Schematic Transformation module and 2) the Syntactic-to-Semantic Transformation module (Cardoso, 2007).
3.2 Schematic Transformation module
The Schematic Transformation module is responsible for integrating data that resides in different data sources with dissimilar formats, schemas and structures.
Syntactic-to-Semantic Transformation module
This module maps XML Schema documents to existing OWL ontologies. It is also responsible for automatically transforming XML instance documents into individual instances of the mapped ontology, as pointed out by Rodrigues et al. (2006). The module is critical for transforming XML-based syntactic data into semantic data by means of OWL.
3.2.1 The Semantic data Integration Middleware (SIM) architecture
The Semantic data Integration Middleware (SIM) architecture is important for integrating heterogeneous information because it addresses the lack of semantics inherent in XML data schemas and representation. Our choice of semantic data representation stems from the fact that it is the most current and most efficient form of data representation (Cardoso, 2007, p. 2). The SIM architecture is illustrated in the figure below.
Figure 1: The SIM architecture (Source: Cardoso, 2007).
The Semantic data Integration Middleware (SIM) architecture has four main layers: the data sources, the schematic transformation layer, the syntactic-to-semantic transformation layer and the ontology layer. The relationship between these layers is shown in the diagram above.
Sources of data (D)
The data sources dictate the scope of the information integration system. The diversity of the data sources provides an enhanced level of data visibility. The Semantic data Integration Middleware (SIM) architecture connects to unstructured sources (such as plain text and web pages), semi-structured sources (such as XML) and structured sources (such as relational databases). Other formats not mentioned here can also serve as data sources.
The schematic transformation
The schematic transformation of the data sources (D) to XML is executed by a module that integrates data from different sources with different structures, formats, database schemas and semantics. The module employs a multi-sourced data extractor to transform the available data into XML.
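As an illustration of this layer, the following is a minimal sketch of a multi-sourced extractor that brings records from a plain-text (CSV) source and a relational source into one XML structure. It is not the SIM extractor itself: the function names, the element names (customers, customer) and the sample data are assumptions made for the example.

```python
import csv
import io
import sqlite3
import xml.etree.ElementTree as ET


def rows_to_xml(root_tag, record_tag, rows):
    """Wrap an iterable of dict records in one uniform XML structure."""
    root = ET.Element(root_tag)
    for row in rows:
        record = ET.SubElement(root, record_tag)
        for field, value in row.items():
            ET.SubElement(record, field).text = str(value)
    return root


def extract_csv(text):
    """Extractor for a plain-text source: CSV text with a header row."""
    return list(csv.DictReader(io.StringIO(text)))


def extract_sqlite(conn, query):
    """Extractor for a structured source: rows returned by a SQL query."""
    conn.row_factory = sqlite3.Row
    return [dict(row) for row in conn.execute(query)]


# Hypothetical sample data standing in for two heterogeneous sources.
csv_source = "name,city\nAlice,Lisbon\nBob,Funchal\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT, city TEXT)")
conn.execute("INSERT INTO customer VALUES ('Carol', 'Porto')")

records = extract_csv(csv_source) + extract_sqlite(
    conn, "SELECT name, city FROM customer"
)
xml_root = rows_to_xml("customers", "customer", records)
xml_text = ET.tostring(xml_root, encoding="unicode")
print(xml_text)
```

Each source has its own small extractor, while the XML writer is shared. Under these assumptions, adding a further source format would only require a further extractor.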
The transformation from Syntactic to Semantic
This process is carried out by a module that employs the JXML2OWL framework to map XML schemas to existing OWL ontologies. The module transforms XML instance documents into individual instances of the mapped ontology.
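The idea behind this step can be sketched in a few lines of Python. The sketch does not use JXML2OWL or reproduce its mapping format. The mapping tables, the ex: prefix and the sample XML are hypothetical, and the output is a list of simple (subject, predicate, object) triples rather than a serialized OWL document.

```python
import xml.etree.ElementTree as ET

# Hypothetical mappings from XML element names to ontology terms.
# These are illustrative only and are not JXML2OWL's mapping format.
CLASS_MAPPING = {"customer": "ex:Customer"}
PROPERTY_MAPPING = {"name": "ex:hasName", "city": "ex:locatedIn"}

XML_INSTANCE = """
<customers>
  <customer><name>Alice</name><city>Lisbon</city></customer>
  <customer><name>Bob</name><city>Funchal</city></customer>
</customers>
"""


def xml_to_individuals(xml_text):
    """Turn mapped XML elements into (subject, predicate, object) triples."""
    triples = []
    root = ET.fromstring(xml_text)
    for tag, owl_class in CLASS_MAPPING.items():
        for index, element in enumerate(root.iter(tag), start=1):
            individual = f"ex:{tag}{index}"
            # Class mapping: each matching element becomes an individual.
            triples.append((individual, "rdf:type", owl_class))
            # Property mapping: mapped child elements become property values.
            for child in element:
                prop = PROPERTY_MAPPING.get(child.tag)
                if prop is not None and child.text:
                    triples.append((individual, prop, child.text.strip()))
    return triples


triples = xml_to_individuals(XML_INSTANCE)
for triple in triples:
    print(triple)
```

The two dictionaries play the role of the class and property mappings that a user would define between the XML schema and the ontology. Child elements without a mapping are ignored.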
The Ontologies (OWL)
The Semantic data Integration Middleware (SIM) architecture makes it possible to extract data from various sources with different data types (structured, semi-structured or unstructured) and then wrap the result in Web Ontology Language (OWL) format (OWL, 2004). This provides homogeneous access to otherwise heterogeneous data sources. OWL was adopted because it is the ontology language recommended by the World Wide Web Consortium (W3C).
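To make the notion of homogeneous access concrete, the sketch below shows two sources that name the same concepts differently being queried once, in terms of a shared vocabulary. The source names (crm, billing), field names, ex: terms and records are all hypothetical, and a real deployment would query an OWL knowledge base rather than Python dictionaries.

```python
# Hypothetical per-source mappings onto a shared vocabulary.
SOURCE_MAPPINGS = {
    "crm": {"client_name": "ex:hasName", "town": "ex:locatedIn"},
    "billing": {"customerName": "ex:hasName", "city": "ex:locatedIn"},
}

# Hypothetical records, each using its own source's field names.
SOURCE_RECORDS = {
    "crm": [{"client_name": "Alice", "town": "Lisbon"}],
    "billing": [
        {"customerName": "Bob", "city": "Lisbon"},
        {"customerName": "Carol", "city": "Porto"},
    ],
}


def to_shared_terms(source, record):
    """Rename a record's fields to the shared ontology terms."""
    mapping = SOURCE_MAPPINGS[source]
    return {
        mapping[field]: value
        for field, value in record.items()
        if field in mapping
    }


def query(prop, value):
    """Run one query, phrased in shared terms, over every source."""
    results = []
    for source, records in SOURCE_RECORDS.items():
        for record in records:
            unified = to_shared_terms(source, record)
            if unified.get(prop) == value:
                results.append(unified)
    return results


in_lisbon = query("ex:locatedIn", "Lisbon")
names = sorted(result["ex:hasName"] for result in in_lisbon)
print(names)  # prints ['Alice', 'Bob']
```

The caller never sees the source-specific field names. Only the mappings know that town and city denote the same property.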
The semantic model
NIST (1993) described a semantic data model as a conceptual data model in which semantic information is included. This implies that the model describes the meaning of its instances. The semantic model is therefore an abstraction that defines how the instance data (stored symbols) relate to the real world. To conceptualize a given domain in a machine-readable format, an ontology expressed in a language such as OWL is employed. The function of the ontology is to promote and facilitate system interoperability, enable intelligent processing and allow the reuse of available knowledge. The ontology therefore provides a shared understanding of a given domain.
The ontology schema defines both the data structure and the semantics. The extraction process can proceed without a schema. The ontology is important for creating the mapping between the schema and the data sources. The ontology also provides the specification of the query. As Rodrigues et al. (2006) pointed out, the framework employed is JXML2OWL, which has two subsystems: the JXML2OWL Mapper and the JXML2OWL API. The JXML2OWL API is a generic, reusable, open source library used to map XML schemas to OWL ontologies. The Mapper, on the other hand, is a Java-based application with a graphical user interface (GUI).
The documents that JXML2OWL can map to an OWL ontology are DTD, XML and XSD documents. The mapping process proceeds in a series of steps. The first step is the creation of a new mapping project and the loading of the XML schema and the OWL ontology. Should the XML schema be missing, JXML2OWL generates an appropriate schema. This step is followed by the creation of class mappings by the user. The mapping takes place between…