Integrating Heterogeneous Data Using Web Services

Last reviewed: February 8, 2011

IEEE-Computer Science -- Literature Review

Integration Approaches and Practices

The work of Ziegler and Dittrich (nd) reports that integration is "becoming more and more indispensable in order not to drown in data while starving for information." The goal of data integration is "to combine data from different sources by applying a global data model and by detecting and resolving schema and data conflicts so that a homogeneous, unified view can be provided." (Ziegler and Dittrich, nd) There are two reasons for data integration:

(1) Given a set of existing data sources, an integrated view is created to facilitate data access and reuse through a single data access point;

(2) Given a certain information need, data from different complementing sources is to be combined to gain a more comprehensive basis to satisfy the information need. (Ziegler and Dittrich, nd)

Foundations of the SIRUP Approach

Foundations of the SIRUP approach are stated to include the following principles:

(1) Semantic Perspectives -- "a user defined conceptual model of an application domain with explicit queryable semantics for all entities and relationships appearing in it."

(2) Bipartite Integration Process -- integration generally involves two primary roles: data providers and data users. Accordingly, the SIRUP approach is reported to have two distinct phases: (a) a data provision phase, in which administrators of local data sources explicitly declare the data, and its semantics, that they offer for integration; and (b) a Semantic Perspective modeling phase, in which users who know the application domain for which data is to be integrated define the desired Semantic Perspective.

(3) IConcepts -- An IConcept (short for Intermediate Concept) is a basic conceptual building block that acts as a linking element between data providers and the data users who need their data. Each IConcept has a queryable link to at least one concept of an ontology, which explicitly defines the semantics of the real-world concept it represents. Data sources provide attributes for the ontological concept represented by a particular IConcept; in this way they declare which attribute data they are able and willing to provide for that IConcept. Additional structural metadata is reported to be provided for each attribute. (Ziegler and Dittrich, nd, paraphrased) IConcepts thus give data providers a way to identify precisely the semantics and structure of the data they offer for integration. For data users, an IConcept is "an access point to retrieve data from different data sources referring to the same real-world concept." (Ziegler and Dittrich, nd) IConcepts are additionally reported to conceal technical and structural heterogeneity from data users and to help resolve semantic conflicts according to the user's own perception of the application domain.

(4) User Concepts -- a user-specific concept that is built through selection and combination of user specific copies of IConcepts.

(5) Semantic Multidatasource Language -- a declarative language is provided for provision of data in addition to specification of User Concepts and Semantic Perspectives. This language is reported to provide support for querying of explicit semantics and metadata assigned to User Concepts and IConcepts.

(6) Ex-ante View Definition -- in conventional approaches, users can specify views only on top of already existing schemas; this is referred to as 'ex-post view definition' because the view is "created after a schema is defined." (Ziegler and Dittrich, nd) The SIRUP approach is characterized, by contrast, as ex-ante view definition.

(7) Pragmatic Data Integration -- the SIRUP approach is described as pragmatic, in contrast to approaches that integrate data against one or more global ontologies and assume an ideal world in which data for all ontology concepts is available. (Ziegler and Dittrich, nd)
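
The relationship between data sources, IConcepts and User Concepts described in principles (3) and (4) can be pictured with a small data-structure sketch. It is only an illustration under stated assumptions: the class names, field names (such as datatype) and the example sources hr_db and crm_xml are hypothetical and are not taken from Ziegler and Dittrich, whose approach uses its own declarative language rather than Python.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Attribute:
    """One attribute that a data source offers for an IConcept."""
    name: str
    source: str    # data source that declares it can provide this attribute
    datatype: str  # stand-in for structural metadata (hypothetical field)


@dataclass
class IConcept:
    """Intermediate Concept linking provider attributes to an ontology concept."""
    name: str
    ontology_concept: str  # link to the ontology concept giving the semantics
    attributes: list[Attribute] = field(default_factory=list)

    def provide(self, attribute: Attribute) -> None:
        """Data provision phase: a source declares an attribute it offers."""
        self.attributes.append(attribute)

    def sources(self) -> set[str]:
        """All data sources reachable through this single access point."""
        return {a.source for a in self.attributes}


@dataclass
class UserConcept:
    """User-specific concept built from selected IConcept attributes."""
    name: str
    selected: list[Attribute] = field(default_factory=list)

    def select(self, iconcept: IConcept, attribute_names: set[str]) -> None:
        """Copy only the attributes this user wants from an IConcept."""
        self.selected.extend(
            a for a in iconcept.attributes if a.name in attribute_names
        )


# Two hypothetical sources declare what they offer for the same concept.
person = IConcept("Person", ontology_concept="onto:Person")
person.provide(Attribute("name", source="hr_db", datatype="string"))
person.provide(Attribute("email", source="crm_xml", datatype="string"))
person.provide(Attribute("salary", source="hr_db", datatype="decimal"))

# A data user builds a narrower, user-specific concept.
contact = UserConcept("Contact")
contact.select(person, {"name", "email"})
```

In this sketch the user never addresses hr_db or crm_xml directly; the IConcept is the only access point, which mirrors the role the paper assigns to it.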

Model Management

The work of Bernstein, Halevy, and Pottinger (nd) entitled "A Vision for Management of Complex Models" reports on the challenges met in constructing applications of database management systems (DBMSs) and on how these involve "the manipulation of models." A model is described as "a complex discrete structure that represents a design artifact, such as an XML DTD, web-site schema, interface definition, relational schema, database transformation script, workflow definition, semantic network, software configuration or complex document." (Bernstein, Halevy, and Pottinger, nd)

Using models involves managing the changes that take place in them and transforming data from one model to another, which is reported to require "an explicit representation of mappings between models." (Bernstein, Halevy, and Pottinger, nd) Bernstein, Halevy, and Pottinger believe that a DBMS could be made easier to use by "making 'model' and 'mapping' first-class objects with high-level operations that simplify their use…", an approach referred to as "model management." (Bernstein, Halevy, and Pottinger, nd)

Bernstein, Halevy, and Pottinger state that their paper makes two primary contributions:

(1) It argues that general-purpose model management functions are needed to reduce the amount of programming required to manipulate models; and (2) It proposes a data model that captures model management functions. (nd)

According to Bernstein, Halevy, and Pottinger (nd), the data model consists of "formal structures for representing models and mappings between models and of algebraic operations on those structures." Present-day model management applications, although functionally advanced by relational and OO DBMSs, "still include a lot of complex code for navigating graph-like structures. Producing, understanding, tuning, and maintaining navigational code is a serious drag on programmer productivity, making model management applications expensive to build." (Bernstein, Halevy, and Pottinger, nd)

Bernstein, Halevy, and Pottinger propose to raise the level of abstraction beyond that of current DBMSs by introducing "high-level operations on models and model mappings." (Bernstein, Halevy, and Pottinger, nd) Examples are "matching, merging, selection and composition," none of which is a particularly novel operation. (Bernstein, Halevy, and Pottinger, nd) The following examples of models and mappings are stated to illustrate the "pervasiveness and scope of model management." (Bernstein, Halevy, and Pottinger, nd) Those are stated as follows:

(1) mapping an XML schema of one application to that of another in order to guide the exchange of XML instances between the applications;

(2) mapping a web site's content to its page layout in order to drive the generation of web pages;

(3) mapping data sources into data warehouse tables in order to generate programs that transform production data and load it into a data warehouse;

(4) mapping the DB schema of one software release into that of the next release, to guide the migration of DBs;

(5) mapping source make files into target make files in order to drive the transformation of make scripts and thereby help port complex applications from one programming environment to another; and

(6) mapping the components of a complex application to the components of a system where it will be deployed in order to drive the generation of installation, upgrade, and de-installation programs. (Bernstein, Halevy, and Pottinger, nd)

Constructing generic functions for creating models and mappings enables them to be manipulated as single objects, which creates a better environment for the tasks just listed. The glue between the systems is reported to be provided by "simple adapters that:

(1) import or export a model in the model management system from or to a schema in the target platform; or (2) interpret a mapping in the model management system to transform instances of one target model to those of another." (Bernstein, Halevy, and Pottinger, nd) It is stated there are many challenges in identifying architectures that are sound for system coupling.

The leverage of building model management functionality is stated to be "highly generic…[and]…widely applicable." (Bernstein, Halevy, and Pottinger, nd) Model management applications are described as "metadata management" and it is stated that the primary effort in building such an application is "in manipulating descriptions of a thing of interest, rather than the thing itself." (Bernstein, Halevy, and Pottinger, nd) The question is posed as to whether keywords are actually data or if they are metadata and it is stated that model management "takes a different cut at the problem. It focuses attention on a particular kind of metadata, structure and mathematical semantics of descriptive information." (Bernstein, Halevy, and Pottinger, nd)

A primary goal of model management is stated to be support for managing change in models and for mapping data between models. It is therefore believed that model mappings must be manipulated as first-class citizens. Key elements underlying the approach of Bernstein, Halevy, and Pottinger (nd) to model mappings include:

(1) the need to manipulate model mappings much as models are manipulated;

(2) mapping consists of connections between instances of two models, which are often different types;

(3) there may be more than one mapping between a given pair of models;

(4) a mapping may relate a set of objects in one model to a set of objects in another via a language for building complex expressions;

(5) mappings must be able to nest because this enables the reuse of mappings: a mapping on a model M can be used as a component of a mapping on models that contain M. (Bernstein, Halevy, and Pottinger, nd)
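
The idea of treating models and mappings as first-class objects with high-level operations can be made concrete with a short sketch. The code below is a deliberately simplified assumption, not the data model proposed by Bernstein, Halevy, and Pottinger: a model is reduced to a named set of element names, a mapping to a set of element pairs, and composition to a plain relational join. All class, method and element names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Model:
    """A model reduced to a name and a set of element names."""
    name: str
    elements: frozenset[str]


@dataclass
class Mapping:
    """A first-class mapping: a set of (source element, target element) pairs."""
    source: Model
    target: Model
    pairs: set[tuple[str, str]] = field(default_factory=set)

    def relate(self, src: str, tgt: str) -> None:
        """Record that a source element corresponds to a target element."""
        if src not in self.source.elements or tgt not in self.target.elements:
            raise ValueError(f"unknown element in pair ({src}, {tgt})")
        self.pairs.add((src, tgt))

    def compose(self, other: "Mapping") -> "Mapping":
        """Chain self (A to B) with other (B to C) into a mapping A to C."""
        if self.target != other.source:
            raise ValueError("target of first mapping must be source of second")
        result = Mapping(self.source, other.target)
        for a, b in self.pairs:
            for b2, c in other.pairs:
                if b == b2:
                    result.pairs.add((a, c))
        return result


# Hypothetical schemas: two software releases and a data warehouse table.
release1 = Model("release1", frozenset({"cust_name", "cust_addr"}))
release2 = Model("release2", frozenset({"name", "address"}))
warehouse = Model("warehouse", frozenset({"customer_name", "customer_address"}))

upgrade = Mapping(release1, release2)
upgrade.relate("cust_name", "name")
upgrade.relate("cust_addr", "address")

load = Mapping(release2, warehouse)
load.relate("name", "customer_name")

# One high-level operation replaces hand-written navigational code.
direct = upgrade.compose(load)
```

Because each mapping is an ordinary object, several distinct mappings between the same pair of models can coexist, as key element (3) requires. Nesting and expression-valued mappings, key elements (4) and (5), are left out of this sketch.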

Databases to Dataspaces (Franklin, Halevy and Maier, 2005)

Franklin, Halevy and Maier (2005) write in the work entitled "From Databases to Dataspaces: A New Abstraction for Information Management" that a Database Management System (DBMS) is "a generic repository for the storage and querying of structured data." A DBMS offers a "suite of interrelated services and guarantees that enables developers to focus on the specific challenges of their applications, rather than on the recurring challenges involved in managing and accessing large amounts of data consistently and efficiently." (Franklin, Halevy and Maier, 2005) Today's data management scenarios, however, can rarely be fitted into a "conventional relational DBMS or into any other single data model or system." (Franklin, Halevy and Maier, 2005)

Figure 1

A Space of Data Management Solutions

Figure 1 categorizes existing data management solutions along two dimensions. Administrative proximity is reported to indicate "how close the various data sources are in terms of administrative control." (Franklin, Halevy and Maier, 2005) Near is reported to mean that the sources are "under the same or at least coordinated control" and Far is reported to indicate "a lower coordination tending towards none at all." (Franklin, Halevy and Maier, 2005)

The closer the administrative control over a group of data sources, the stronger the guarantees, such as permanence and consistency, that the data management system can provide. (Franklin, Halevy and Maier, 2005, paraphrased) Semantic integration is reported as a measure of "how closely the schemas of the various data sources have been matched." (Franklin, Halevy and Maier, 2005) A DBMS represents just one point solution within this space of data management solutions. 'Data integration systems' and 'data exchange systems' are stated to offer "many of the purported services of dataspace systems." (Franklin, Halevy and Maier, 2005) The distinction is stated to be that data integration systems "require semantic integration before any services can be provided." (Franklin, Halevy and Maier, 2005)

The goal of 'Personal Information Management' is reported to be to provide "easy access and manipulation of all of the information on a person's desktop, with possible extension to mobile devices, personal information on the Web, or even all the information accessed during a person's lifetime." (Franklin, Halevy and Maier, 2005) Scientific data management involves monitoring, observation and forecasting and can also be used in "running atmospheric and fluid-dynamics models that simulate past, current and near-future conditions." It is reported that "the computations require importing data and model outputs from other groups…" (Franklin, Halevy and Maier, 2005)

Dataspace Systems

Dataspaces are described as a "set of participants and relationships." (Franklin, Halevy and Maier, 2005) The participants in a dataspace are stated to be "individual data sources," which can be "relational databases, XML repositories, text databases, web services and software packages…" and which may be "stored or streamed…" (Franklin, Halevy and Maier, 2005) Some participants are stated to support "expressive query languages" while others are stated to be "opaque and offer only limited interfaces for posing queries such as structured files, web services or other software packages." (Franklin, Halevy and Maier, 2005)

The dataspace system should have the capacity to model any type of relationship between two or more participants, and dataspaces may also be nested within each other. Participants in a dataspace will not be able to provide the interfaces needed to support all functions of the Dataspace Support Platform (DSSP); data sources will therefore need to be extended in various ways. Figure 2 shows an example dataspace and the components of a dataspace system.

Figure 2

Example Dataspace and the Components of a Dataspace System

Source: Franklin, Halevy and Maier (2005)

The components of the dataspace system include catalog and browse. The catalog holds information about all the participants in the dataspace and the relationships among them; it accommodates a great many sources and supports various levels of information about their structure and capabilities. The DSSP should also support a "model-management environment that allows creating new relationships and manipulation of existing ones." (Franklin, Halevy and Maier, 2005)
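
A catalog of this kind can be sketched as a registry of participants plus a list of labeled relationships between them. The sketch below is an assumption made for illustration: Franklin, Halevy and Maier describe the catalog's role, not an interface, so the class names, the browse and related_to methods, and the example participants are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Participant:
    """A data source taking part in the dataspace."""
    name: str
    kind: str  # e.g. "relational", "xml", "web_service"
    query_interfaces: set[str] = field(default_factory=set)  # may be empty


@dataclass
class Catalog:
    """Registry of participants and of the relationships among them."""
    participants: dict[str, Participant] = field(default_factory=dict)
    relationships: list[tuple[str, str, str]] = field(default_factory=list)

    def register(self, participant: Participant) -> None:
        self.participants[participant.name] = participant

    def relate(self, a: str, b: str, label: str) -> None:
        """Record a labeled relationship between two known participants."""
        for name in (a, b):
            if name not in self.participants:
                raise KeyError(f"unknown participant: {name}")
        self.relationships.append((a, b, label))

    def browse(self, kind: Optional[str] = None) -> list[str]:
        """List participant names, optionally restricted to one kind."""
        return sorted(
            name for name, p in self.participants.items()
            if kind is None or p.kind == kind
        )

    def related_to(self, name: str) -> list[tuple[str, str]]:
        """Return (other participant, label) for every relationship of name."""
        found = []
        for a, b, label in self.relationships:
            if a == name:
                found.append((b, label))
            elif b == name:
                found.append((a, label))
        return found


catalog = Catalog()
catalog.register(Participant("orders_db", "relational", {"sql"}))
catalog.register(Participant("invoices", "xml", {"xquery"}))
catalog.register(Participant("shipping_api", "web_service"))  # opaque source
catalog.relate("invoices", "orders_db", "derived_from")
```

Allowing query_interfaces to be empty reflects the point that the catalog must hold varying levels of information, since some participants expose far less than others.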

Query Systems

Search and query should offer the following capabilities:

(1) query everything; and (2) structured query. (Franklin, Halevy and Maier, 2005)

Meta-data queries should be supported by the system, which includes:

(1) the source of an answer or how the answer was computed;

(2) timestamps on the data items participating in the answer's computation;

(3) specification of which dataspace data items may depend on a specific data item, and the ability to support hypothetical queries. (Franklin, Halevy and Maier, 2005)
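
The three kinds of meta-data query listed above can be illustrated by an answer object that keeps the data items it was computed from. This is a minimal sketch under stated assumptions: the DataItem and Answer classes, the total and what_if functions, and the sensor readings are hypothetical and are not described by Franklin, Halevy and Maier.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DataItem:
    """A single value held by some participant, with its timestamp."""
    source: str
    key: str
    value: float
    timestamp: datetime


@dataclass(frozen=True)
class Answer:
    """A query result that remembers how it was computed and from what."""
    value: float
    derivation: str
    inputs: tuple[DataItem, ...]

    def sources(self) -> set[str]:
        """Meta-data query (1): where did the answer come from?"""
        return {item.source for item in self.inputs}

    def oldest_input(self) -> datetime:
        """Meta-data query (2): timestamp of the stalest contributing item."""
        return min(item.timestamp for item in self.inputs)

    def depends_on(self, source: str, key: str) -> bool:
        """Meta-data query (3): does the answer depend on this data item?"""
        return any((i.source, i.key) == (source, key) for i in self.inputs)


def total(items: list[DataItem]) -> Answer:
    """Compute a sum while recording its inputs and derivation."""
    return Answer(sum(i.value for i in items), "sum of values", tuple(items))


def what_if(answer: Answer, changed: DataItem) -> Answer:
    """Hypothetical query: recompute the sum as if one input were different."""
    inputs = [
        changed if (i.source, i.key) == (changed.source, changed.key) else i
        for i in answer.inputs
    ]
    return total(inputs)


utc = timezone.utc
reading_a = DataItem("sensor_a", "t1", 10.0, datetime(2005, 1, 1, tzinfo=utc))
reading_b = DataItem("sensor_b", "t1", 5.0, datetime(2005, 6, 1, tzinfo=utc))

answer = total([reading_a, reading_b])
revised = what_if(
    answer, DataItem("sensor_a", "t1", 7.0, datetime(2005, 7, 1, tzinfo=utc))
)
```

The hypothetical query leaves the original answer untouched and returns a new one, so the two can be compared side by side.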

Finally, all of the search and query services must be supported in a way that can be applied in real-time streaming or modified data sources. A DSSP is stated to have a storage and indexing component for the following purposes:

(1) To create efficiently queryable associations between data objects in different participants,

(2) to improve accesses to data sources that have limited access patterns,

(3) to enable answering certain queries without accessing the actual data source, and (4) to support high availability and recovery. (Franklin, Halevy and Maier, 2005)
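
Purposes (1) and (3) in particular can be pictured with a small inverted index that spans several participants. The sketch is an assumption for illustration only: the source does not prescribe an index structure, and the KeywordIndex class, its whitespace tokenization and the example participants are hypothetical.

```python
from collections import defaultdict


class KeywordIndex:
    """Inverted index from keyword to (participant, object id) pairs."""

    def __init__(self) -> None:
        self._postings: dict[str, set[tuple[str, str]]] = defaultdict(set)

    def add(self, participant: str, object_id: str, text: str) -> None:
        """Index one data object held by a participant."""
        for token in text.lower().split():
            self._postings[token].add((participant, object_id))

    def search(self, query: str) -> set[tuple[str, str]]:
        """Return objects containing every query keyword, from any participant."""
        tokens = query.lower().split()
        if not tokens:
            return set()
        result = set(self._postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self._postings.get(token, set())
        return result


index = KeywordIndex()
index.add("mail", "msg-1", "Budget report for Q3")
index.add("files", "doc-7", "Q3 budget spreadsheet")
index.add("files", "doc-9", "Travel report")

# Answered from the index alone, without contacting either participant.
hits = index.search("budget q3")
```

A single lookup returns objects from both the mail and files participants, which is the kind of cross-participant association the storage and indexing component is meant to make cheap to query.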

In addition, the index should be adaptable to "heterogeneous environments." (Franklin, Halevy and Maier, 2005) The goal of the discovery component is also addressed, and it is stated that a DSSP "should be able to imbue such a participant with additional capabilities, such as a schema, a catalog, keyword search and update monitoring." (Franklin, Halevy and Maier, 2005) In addition, the source extension component "supports value-added information held by the DSSP, but not present in all of the initial participants." (Franklin, Halevy and Maier, 2005)

Data Integration Systems -- Collaborative Approach

The work of Doan and McCann (nd) entitled "Building Data Integration Systems: A Mass Collaboration Approach" reports that building data integration systems is primarily accomplished by hand in what is described as a "very labor intensive and error prone process." (nd) Doan and McCann additionally report "numerous research activities have been conducted on data integration, both in the AI and database communities." (nd)

A great deal of progress has been made in developing conceptual and algorithmic frameworks; in query optimization; in constructing semi-automatic tools for schema matching, wrapper construction, and object matching; and in fielding data integration systems on the Internet. (Doan and McCann, nd) Doan and McCann report that the basic idea in their work is "to have users contribute facts and rules in some specified language." (nd) Their work differs from others in several ways:
