Enterprise Information Integration (EII) represents a new category of
software that enables disparate data silos to be integrated into a single
virtual database for applications. This approach gives developers a powerful
tool for simplifying data integration and building flexible applications. If you
haven't heard of EII yet, you will soon, as the industry rallies around this
concept and more EII projects reach deployment.
This article explains why you should consider EII, describes its advantages
for data integration, and discusses the different approaches to implementing
EII. The article provides a framework for comparing EII products and choosing a
suitable product to begin simplifying data integration in your environment.
Data Integration Issues
The presence of multiple,
disparate data sources is a fact of life in enterprise IT environments. The
number and types of data sources have only grown over time and made data
integration a prominent aspect of IT investments.
To combat the inflexibility induced by information fragmentation, developers
have many tools for data integration (see Figure 1). There are adapters for
accessing data sources, transformation engines for reformatting data, and data
warehouses for aggregating data from multiple sources. To integrate data
sources, developers use these tools to program the integration requirements into
applications.
Although this approach to data integration works, it requires a programmatic
approach that has the following deficiencies:
- Low-level programming: The developer needs to
create and maintain the integration code. The tools provide only the building
blocks for integration. The developer must write the low-level code to implement
the integration requirements, which greatly increases development and
maintenance costs.
- Multiple data source APIs: Each data source has
its own API and data format. The developer must understand each data source in
sufficient detail to manage the data integration. This often requires multiple
specialists to implement and maintain the integration, which again drives up
complexity.
- Inconsistent integration framework: The developer
has to manage all the integration issues such as the relationships between data
sources and data formats. This leads to one-off solutions and inconsistencies
that are difficult to maintain.
- Tight-coupling: The programmatic approach creates
hard-coded dependencies between applications and data sources. This severely
hampers maintenance since updating a data source can potentially break many
applications. Even minor updates must be carefully considered and orchestrated.
- Limited reusability: The integration code tends to
be application centric and data source centric. This limits the reusability of
investments in data models and integration rules.
A Smarter Approach to Data Integration
A better solution for data integration
is EII. EII supports data integration by enabling multiple data silos to be
represented to applications as a single virtual database. Instead of integrating
data in the application code, the data integration function is pushed out of the
application layer into a new EII tier that sits between applications and data
sources. This new tier is dedicated to managing integration tasks such as
connecting to a data source, transforming data, and integrating data.
In this framework, the developer or analyst creates a logical data model in
the EII server that represents the business view of information (aka the data
integration requirements). The target data source's physical data models are
mapped into the logical data model to create a virtual database schema.
Applications interact with data sources through the EII server based on the
logical data model. The EII server automatically translates application requests
into queries against one or more data sources, integrates the data, and produces
results according to the logical view.
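The mapping idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular EII product works: the two in-memory "silos" (`CRM_ROWS`, `BILLING_RECORDS`), the `CUSTOMER_MAPPING` table, and the `query_customers` function are all hypothetical names invented for the example.

```python
# Two hypothetical data silos with different physical shapes.
CRM_ROWS = [  # stands in for rows from a relational CRM table
    {"cust_id": 1, "full_name": "Ada Lovelace"},
    {"cust_id": 2, "full_name": "Alan Turing"},
]
BILLING_RECORDS = {  # stands in for a billing service keyed by customer id
    1: {"balance_cents": 12500},
    2: {"balance_cents": 0},
}

# Declarative mapping: logical field -> (source, physical field, transform).
CUSTOMER_MAPPING = {
    "id": ("crm", "cust_id", int),
    "name": ("crm", "full_name", str),
    "balance": ("billing", "balance_cents", lambda cents: cents / 100),
}


def query_customers(mapping, min_balance=0.0):
    """Answer a request against the logical model by combining both silos."""
    results = []
    for crm_row in CRM_ROWS:
        # Gather the physical records that belong to this logical entity.
        physical = {
            "crm": crm_row,
            "billing": BILLING_RECORDS.get(crm_row["cust_id"], {}),
        }
        # Apply the mapping to build the logical view of the customer.
        logical = {}
        for field, (source, column, transform) in mapping.items():
            raw = physical[source].get(column)
            logical[field] = transform(raw) if raw is not None else None
        if logical["balance"] is not None and logical["balance"] >= min_balance:
            results.append(logical)
    return results
```

The application only sees the logical fields `id`, `name`, and `balance`. If the billing source changed its field names or units, only the mapping entry would need to change, not the calling code.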
EII's holistic approach to data integration addresses the drawbacks of the
toolkit approach (see Figure 2). Instead of providing tools for each task, EII
establishes a framework that automates the low-level details and exposes a
high-level, declarative interface for specifying data integration requirements.
Multiple data sources can be integrated without writing any application code.
The developer defines the desired logical data model and maps the data sources
into the logical model using a GUI tool.
This end-to-end approach generates several advantages:
- No programming: The declarative approach
eliminates the need to create and manage integration code. The EII server
automatically generates the low-level code to orchestrate the data integration.
- Unified API: Applications access data sources
through the EII server's API instead of through data source APIs. Developers do
not need to know how every data source works.
- Automation: The EII server manages the disparities
between data formats and APIs. The developer uses the GUI tool to create the new
data model and the EII server automatically generates the correct API calls and
data transformations.
- Loose coupling: Inserting an EII layer between
applications and data sources effectively decouples the usual dependencies
between them. The mapping between the logical data model and the data source
simply needs to be updated to reflect any changes to the data source. This level
of flexibility provides some interesting benefits:
  - An application can migrate to a different database of the same or different
    architecture.
  - A database schema can quickly evolve to reflect new business requirements.
  - Multiple versions of the logical data models can be created to support
    incremental system migration.
- Common data model: A common data model can be
established for the enterprise. It helps different groups understand and share
the available information. The investment in a common model compounds over time
to improve the strategic utilization of information assets.
- Reusability: The logical data model can be
aggressively reused in other projects. This reduces the need for additional data
integration and centralizes ongoing administration. Reusability changes the data
integration focus from code management to strategic information management.
Approaches to EII
How do you get started with EII?
Like all emerging markets, EII offers many different approaches to implementing
the solution. In general, the available products can be differentiated based on
the underlying logical data model, the data transformation framework, and the
query interface.
Logical Data Model
The logical data model is at
the heart of every EII solution. Physical data models from target data sources
are mapped into the logical data model. This model serves as the schema of the
virtual database. The types of logical models include relational, object, and
XML. Depending on the approach, the EII server appears to applications as
a relational database, an object database, or an XML database.
Transformation Framework
The transformation
framework dictates how data from multiple sources are transformed into the
logical data model. The approaches to transformation include SQL, XQuery, and
proprietary. A visual data mapping tool is typically provided to shield the
developers from low-level transformation code; however, if the mapping tool
doesn't accommodate specific integration requirements, the developer may need to
edit or write transformation code directly.
Query Interface
The query interface provides
access to data. It dictates how applications read, query, insert, update, and
delete the underlying data represented by the logical data model. It also
dictates the data format presented to applications. EII query interfaces include
SQL, XQuery, and proprietary.
Although the data model, transformation framework, and query interface can be
mixed and matched, in practice, specific transformation frameworks and specific
query interfaces work best with specific data models. Therefore, the general
approaches to EII can be grouped into three main categories: relational, object,
and XML (see Table 1).
Relational Approach
Products that take the
relational approach use the relational data model as their logical data model.
All data sources are represented as a series of tables, and SQL is used to
transform the data into the logical model. With this approach, the EII server is
a virtual relational database and applications use SQL to interact with the
integrated data.
MetaMatrix is an example of this approach.
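As a rough illustration of the relational approach, the sketch below uses Python's built-in `sqlite3` module. Two attached in-memory databases stand in for separate data sources, and a temporary view plays the role of the logical model. The database, table, and view names (`crm`, `billing`, `customer_balance`) are assumptions made for the example, and a real EII server would federate genuinely separate systems rather than SQLite files.

```python
import sqlite3

# One connection plays the role of the EII server; each attached
# in-memory database stands in for an independent data source.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS billing")

conn.execute("CREATE TABLE crm.customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE billing.accounts (customer_id INTEGER, balance REAL)")
conn.executemany("INSERT INTO crm.customers VALUES (?, ?)",
                 [(1, "Ada"), (2, "Alan")])
conn.executemany("INSERT INTO billing.accounts VALUES (?, ?)",
                 [(1, 125.0), (2, 0.0)])

# The logical model: SQL maps both physical schemas into one virtual table.
# SQLite only lets TEMP views reference tables in other attached databases.
conn.execute("""
    CREATE TEMP VIEW customer_balance AS
    SELECT c.id AS id, c.name AS name, b.balance AS balance
    FROM crm.customers AS c
    JOIN billing.accounts AS b ON b.customer_id = c.id
""")

# The application queries the logical view with ordinary SQL.
rows = conn.execute(
    "SELECT name, balance FROM customer_balance WHERE balance > 0 ORDER BY name"
).fetchall()
```

The application never names `crm.customers` or `billing.accounts`; it depends only on the view, which is the decoupling the relational approach aims for.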
Object Approach
Products that take the object approach
use objects as the logical data model. All data sources are represented as
objects, and automatically generated code is used to transform the data into the
logical model. With this approach, the EII server is a virtual object database
and applications use a proprietary interface to obtain data objects.
Journée is an example of this approach.
XML Approach
Products that take the XML approach
use XML as the logical data model. All data sources are represented as XML
document collections and XQuery is used to transform the data into the logical
model. With this approach, the EII server is a virtual XML database and
applications use XQuery to interact with the integrated data.
BEA Liquid Data for WebLogic, Ipedo, and Nimble are examples of this
approach.
Each approach to EII has its advantages and disadvantages. The primary
distinctions are based on the data modeling flexibility, the query flexibility,
and the result-processing requirements. These differences affect the suitability
of each approach to specific applications and developer predilections.
Data Modeling Flexibility
The distinction in modeling
flexibility deals with the expressiveness of the data model that can be created
from multiple data sources (see Table 2). This is largely dependent on the
underlying logical data model and the ability of the transformation framework to
mold data structures into the new logical model.
In this regard, the object and XML approaches have an advantage over the
relational approach because of their ability to support hierarchical data
relationships. Their logical data models can directly represent hierarchical
data, whereas the relational approach must decompose the data structure into
tables. In addition, the object and XML approaches can represent hierarchical
data sources like XML, whereas the relational approach requires the developer
to reconstruct the hierarchical data in application code.
This modeling advantage is important to applications that use data from
nonrelational data sources such as message queues, Web services, XML
documents, EJBs, and applications.
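To make the decomposition cost concrete, here is a small sketch using Python's standard `xml.etree.ElementTree` module. The order document and the `decompose` and `reconstruct` helpers are hypothetical; they only illustrate the extra steps a relational logical model implies for a hierarchical source.

```python
import xml.etree.ElementTree as ET

# A hypothetical hierarchical source: one order with nested line items.
ORDER_XML = """
<order id="42">
  <customer>Ada</customer>
  <item sku="A1" qty="2"/>
  <item sku="B7" qty="1"/>
</order>
"""


def decompose(order_xml):
    """Flatten the hierarchy into two relational-style row sets."""
    root = ET.fromstring(order_xml)
    order_id = int(root.get("id"))
    order_row = (order_id, root.findtext("customer"))
    # Each child becomes a row that carries the parent key as a foreign key.
    item_rows = [
        (order_id, item.get("sku"), int(item.get("qty")))
        for item in root.findall("item")
    ]
    return order_row, item_rows


def reconstruct(order_row, item_rows):
    """Rebuild the nested structure in application code from the flat rows."""
    order_id, customer = order_row
    return {
        "id": order_id,
        "customer": customer,
        "items": [
            {"sku": sku, "qty": qty}
            for parent_id, sku, qty in item_rows
            if parent_id == order_id
        ],
    }
```

With an object or XML logical model, the nested order could be exposed as is; with a relational model, something like `reconstruct` ends up in every application that needs the hierarchy back.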
Query Flexibility
Another key distinction among the
approaches is query flexibility. This deals with the level of filtering and data
processing that can be specified in a query and executed by the EII tier.
The SQL and XQuery query languages have an advantage over simply retrieving
objects by criteria. The language approach better supports projection,
aggregation, and joins, which enable fine-grained results to be produced by a
query. In contrast, query returns of object collections may generate unnecessary
data and require additional processing in application code to aggregate and join
data.
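The difference can be sketched as follows. The `Order` class, the sample data, and both functions are assumptions made for illustration: one function mimics retrieving objects by criteria and aggregating in application code, the other pushes the same filtering and aggregation into a single SQL query (run here against an in-memory SQLite database).

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Order:
    customer: str
    amount: float


ORDERS = [Order("Ada", 40.0), Order("Ada", 60.0), Order("Alan", 15.0)]


def totals_via_objects(min_amount):
    """Object style: fetch whole objects by criteria, then aggregate locally."""
    matched = [order for order in ORDERS if order.amount >= min_amount]
    totals = {}
    for order in matched:  # aggregation happens in application code
        totals[order.customer] = totals.get(order.customer, 0.0) + order.amount
    return totals


def totals_via_sql(min_amount):
    """Query-language style: selection, projection, and aggregation in one query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(order.customer, order.amount) for order in ORDERS],
    )
    rows = conn.execute(
        "SELECT customer, SUM(amount) FROM orders "
        "WHERE amount >= ? GROUP BY customer",
        (min_amount,),
    ).fetchall()
    conn.close()
    return dict(rows)
```

Both functions return the same totals, but in the object style every matching order crosses the wire before being summed, while the SQL version returns one row per customer.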
XQuery has a further advantage over SQL and the object approach with its enhanced data
manipulation facilities. It provides a functional programming language that can
express complex transformations against any data structure. It supports built-in
and external functions, conditional processing, scripting, and the ability to
transform results into any text or binary format.
The distinction in query flexibility is important because the more data
processing the EII server can perform, the less code the developer needs to
create and maintain. Query flexibility also has implications for performance. A
query that returns just the desired data minimizes network traffic and improves
the response time. To achieve the smallest result set, the query must maximize
the level of data selection, projection, aggregation, joins, and transformation
that can be expressed and processed in a single query.
Result Processing
The last primary distinction between
the three approaches is the result-processing model. The relational approach
generates tabular data, the object approach generates native program objects,
and the XML approach generates XML documents.
The object approach has an advantage over the relational and XML approaches
since a native object representation of data is the most convenient to work
with. The developer directly accesses the data by simply calling the specific
data object's methods. In the relational and XML approaches, the developer must
work with generic data structures such as JDBC ResultSets or DOM objects. These
require additional work over the object approach to read, update, create, and
delete data.
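A short Python sketch shows the three result shapes side by side. It is an analogy rather than a rendering of any product's API: a SQLite row tuple stands in for a JDBC ResultSet, an `ElementTree` element stands in for a DOM object, and the `Customer` class is a hypothetical native object.

```python
import sqlite3
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    balance: float

    def owes_money(self) -> bool:
        return self.balance > 0


def balance_from_row() -> float:
    """Tabular result: the caller must know column positions."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (name TEXT, balance REAL)")
    conn.execute("INSERT INTO customer VALUES ('Ada', 125.0)")
    row = conn.execute("SELECT name, balance FROM customer").fetchone()
    conn.close()
    return row[1]


def balance_from_xml() -> float:
    """XML result: the caller navigates the tree and converts text to a number."""
    doc = ET.fromstring(
        "<customer><name>Ada</name><balance>125.0</balance></customer>"
    )
    return float(doc.findtext("balance"))


def balance_from_object() -> float:
    """Object result: typed attributes and methods are available directly."""
    customer = Customer(name="Ada", balance=125.0)
    return customer.balance
```

All three return the same value; the difference is how much positional, navigational, or conversion knowledge the calling code has to carry.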
Selecting an Approach
The capabilities and limitations of
each approach affect its suitability to particular environments, data
processing requirements, and data source integration requirements. The best
approach truly depends on the unique set of requirements. The following are some
general guidelines for selecting an approach.
The relational approach works very well with established data programming
practices. Developers can get started quickly and leverage traditional
techniques and know-how. Although the relational data model is less flexible,
the majority of enterprise data resides in relational databases and many
nonrelational data sources can be coaxed into a relational format. The pain of
accommodating a few nonrelational data sources may be outweighed by the
approachability of relational development.
The object approach provides a flexible, logical data model for integrating
diverse data sources. For complex environments, this capability greatly
simplifies data integration and produces detailed views of enterprise data
assets. In addition, the object interface makes data programming more
convenient. Fans of object-to-relational mapping tools and object databases will
appreciate the object approach. Unfortunately, binding data to program objects
is also the weakness of the object approach. This makes query processing less
flexible and requires the application developer to process more data. The
resulting inefficiency makes the object approach unsuitable for ad hoc query
processing where the data binding can't be tuned for performance.
The XML approach is the most cutting-edge architecture for data integration.
Using XML as the logical data model and XQuery as the query interface provides a
flexible platform for EII. XML effectively models many enterprise data sources
and XQuery provides powerful data processing capabilities. Together these
minimize the integration code in applications. These advantages are the most
apparent in an environment with heterogeneous data sources and mixed application
architectures. Although the XML approach provides a great deal of flexibility,
the technology is less mature and requires a shift toward XML-centric
application development. For developers ready to experiment and blaze new
trails, the payoff will be worthwhile.
Other Considerations
The comparisons between different
approaches to EII provide a good starting point for exploring the market
offerings, but they don't tell the whole story. In addition to exhibiting many
different approaches to the solution, emerging markets also exhibit products
with different levels of feature coverage. This affects the capability and
performance of EII products beyond the architectural differences. Understanding
the following key features will help to further narrow the list of qualified
candidates.
Adapters
Also known as connectors, these
software components make a data source available to the EII server. This is the
first item on any checklist since there is no solution if your target data
sources are not supported. Most vendors get the bulk of their adapters from
third-party providers. The level of integration between the EII server and the
adapter can vary, so look for adapters that are well integrated into the
graphical data mapping tool and the query interface.
If your target data source is not supported, most vendors offer a development
kit for building custom adapters. The effort will vary but it won't be trivial,
so consider this option carefully.
Update
In addition to accessing data, the
ability to propagate updates to the underlying data sources is critical. A
bidirectional EII solution completes the virtual database illusion to keep the
application decoupled from data integration. Look for the ability to update
logical data entities made from multiple data sources and look for strong
transaction management capabilities.
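The following sketch hints at what updating a multi-source logical entity involves. It assumes two attached in-memory SQLite databases as stand-ins for separate sources and uses a single local transaction; the function names are hypothetical, and a real EII server spanning independent systems would need distributed transaction support instead.

```python
import sqlite3


def open_sources():
    """Create two hypothetical sources behind one connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS crm")
    conn.execute("ATTACH DATABASE ':memory:' AS billing")
    conn.execute("CREATE TABLE crm.customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE billing.accounts (customer_id INTEGER PRIMARY KEY, balance REAL)"
    )
    conn.execute("INSERT INTO crm.customers VALUES (1, 'Ada')")
    conn.execute("INSERT INTO billing.accounts VALUES (1, 125.0)")
    conn.commit()  # make the initial data permanent before any updates
    return conn


def update_customer(conn, cust_id, name, balance):
    """Split one logical update into per-source updates, all or nothing."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "UPDATE crm.customers SET name = ? WHERE id = ?", (name, cust_id)
        )
        if balance < 0:
            raise ValueError("balance must not be negative")
        conn.execute(
            "UPDATE billing.accounts SET balance = ? WHERE customer_id = ?",
            (balance, cust_id),
        )


def read_customer(conn, cust_id):
    """Read the logical entity back by joining both sources."""
    return conn.execute(
        "SELECT c.name, b.balance FROM crm.customers AS c "
        "JOIN billing.accounts AS b ON b.customer_id = c.id WHERE c.id = ?",
        (cust_id,),
    ).fetchone()
```

If the second half of the update fails, the name change in the first source is rolled back as well, so the logical entity never ends up half updated.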
Security
The logical data model in an EII system
will cross many different data sources, so a good security model is needed to
manage access to the integrated information. Look for fine-grained access
management and integration with enterprise directory servers like LDAP
directories.
Caching
Integrating data from multiple sources
can be computationally expensive. An EII server's query response time can be an
issue depending on the number of data sources, the number of joins, and the
level of transformation required to generate a result according to the logical
data model. Caching query results is a good strategy for reducing the processing
requirements and improving performance.
There are two kinds of caches. The more conventional is a read cache that
saves prior query results. The EII server scans the cache for past results
identical to the query before going to back-end data sources and building new
results. To avoid stale data, a read cache periodically clears the cached
results. Alternatively, the read cache can also be synchronized with the source,
which enables cached results to be selectively cleared or refreshed. This keeps
the cache hit rate high to reduce query response times.
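A read cache of this kind can be sketched as below. The `ReadCache` class and its parameters are hypothetical; the time-based expiry models the periodic clearing, and `invalidate` models selective clearing when a source reports a change.

```python
import time


class ReadCache:
    """Cache results keyed by the exact query, with time-based expiry."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch      # function that runs the query on the sources
        self._ttl = ttl_seconds  # how long a cached result stays valid
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # query -> (expires_at, result)

    def get(self, query):
        now = self._clock()
        entry = self._entries.get(query)
        if entry is not None and entry[0] > now:
            return entry[1]      # cache hit: skip the back-end sources
        result = self._fetch(query)
        self._entries[query] = (now + self._ttl, result)
        return result

    def invalidate(self, query):
        """Selectively clear one result, for example after a source update."""
        self._entries.pop(query, None)
```

Only an identical query string produces a hit, which is why this style suits predictable, repetitive queries.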
The other type of cache is a query cache. This approach caches all the data
in the logical data format and directly processes arbitrary queries against the
cached data. A query cache is like a data mart except with active
synchronization. This approach eliminates all the physical-to-logical data
mapping and all queries against the underlying data sources during a request.
For sophisticated logical data models, a query cache can greatly improve
performance. However, a query cache's storage and synchronization requirements
can be enormous. Careful data partitioning is critical to keep the cache size
manageable and filled with sufficiently current data.
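A query cache might look like the following sketch, which keeps the logical data in an in-memory SQLite table and answers arbitrary SQL against it. The `QueryCache` class, the single `customer` table, and the full-reload `refresh` are simplifying assumptions; a production cache would synchronize incrementally and partition the data.

```python
import sqlite3


class QueryCache:
    """Hold the logical data locally and run ad hoc queries against the copy."""

    def __init__(self, load_logical_rows):
        # load_logical_rows returns (id, name, balance) tuples that have
        # already been mapped from the physical sources into the logical model.
        self._load = load_logical_rows
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, balance REAL)"
        )
        self.refresh()

    def refresh(self):
        """Resynchronize with the sources (here: a naive full reload)."""
        with self._db:  # replace the contents atomically
            self._db.execute("DELETE FROM customer")
            self._db.executemany(
                "INSERT INTO customer VALUES (?, ?, ?)", self._load()
            )

    def query(self, sql, params=()):
        """Run any query against the cached data without touching the sources."""
        return self._db.execute(sql, params).fetchall()
```

Between refreshes the cache can serve stale data, which is the trade-off the storage and synchronization requirements are meant to control.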
A read cache and a query cache can be used in combination to better match
data usage requirements. A read cache is best for situations with predictable
and repetitive queries, while a query cache is best for environments with ad hoc
queries against relatively static data sources.
Summary
Enterprise information integration is still very
new but the value proposition is realizable. It can dramatically reduce the
complexity of working with multiple data sources and produce much more flexible
applications.
A starting point for evaluating the suitability of EII products is to
consider the three general approaches to EII: relational, object, and XML. Each
approach has its advantages and disadvantages, so the best solution depends on
the specific requirements. In general:
- The XML and object approaches provide greater data integration capabilities
but do not fit as well into traditional database application development
environments.
- The XML and relational approaches provide more flexible query facilities but
require more work than the object approach to process query results.
- XML with XQuery provides the most powerful modeling and querying facility
but the technology is still evolving.
Once the choices have been narrowed, consider the availability of adapters,
support for updates, the security model, and the caching model to make the next
cuts. This should identify a suitable product to begin leveraging EII.