
Index of Oracle Data Integrator articles


Based on an ELT architecture, Oracle Data Integrator is a comprehensive data integration platform that covers all data integration requirements: from high-volume, high-performance batch loads, to event-driven, trickle-feed integration processes, to SOA-enabled data services.


  • ODI: Restricting Visibility of Work Repositories


    Introduction

    When ODI developers connect to the ODI studio, they can edit the connection parameters. In particular, they can manually select the Work Repository that they will connect to. Repository access can be password protected, but for security reasons it is best to not even list repositories that users should not have access to. This post will detail the necessary steps to make sure that only authorized repositories are listed in the ODI login detail window.

    Restricting Visibility of Work Repositories in ODI

    The techniques described in this post will only be available to ODI users that have the required privileges to access the ODI Security navigator: Supervisors and Security Administrators. If you want to experiment with the instructions described in this post, make sure that you do indeed have these privileges.

    First, let’s make sure that we all agree on what we are trying to achieve here in restricting the list of repositories that is visible to the user.

    Objective

    If we can prevent ODI users from seeing repositories that they are not allowed to access, we can guarantee that they will not be able to connect to these repositories (independently of passwords that may have been set to further secure access to these, which is a separate topic).

What we are trying to achieve here is the following: when a user wants to connect to an ODI repository, the login screen will show up. If the user clicks either the green plus sign (to create a new login profile) or the pencil button (to edit the current login profile), it is possible to edit the connection details. This is represented by action (1) in figure 1.

    ODI login prompt and login parameters

    Figure 1: Editing the login profile

When the user now clicks on the magnifying glass next to the work repository entry, ODI connects to the Master Repository and will list all Work Repositories, as shown with action (2) in figure 1. Or more specifically, ODI lists all authorized Work Repositories. We will now see how to define what is authorized or not.

    Creating a new Non-Generic profile

    The first step if we want to limit which objects are visible to an ODI user is to define a non-generic profile. Non-generic profiles are typically prefixed with the letters NG to differentiate them from the generic ones: for instance, ODI ships out of the box with an NG Designer profile. What makes a profile generic is that users have the same privileges on all objects: if I have the privilege to edit interfaces, this means that I will have the right to edit any interface in ODI Studio. With a non-generic profile, the privileges that I will be granted will only be valid for objects that have been explicitly granted to me (or objects that I have created myself if I have the required privileges to do so).

    Since we want to restrict the user’s ability to view the complete list of Work Repositories at connection time, we will create a non-generic Connect profile, following these steps:

    • Select the ODI Security navigator. If you have previously closed the Security Navigator, you can restore it from the main menu View / ODI Security Navigator.

    Security Navigator Selection

    Figure 2: The Security Navigator

    • Right-click on the CONNECT profile and select Duplicate Selection: ODI creates a new profile called Copy of CONNECT.
    • Double-click on the new profile to rename it. The best practice is to prefix the profile with NG and to append the name of your company at the end of the name, so that you make it obvious to other security administrators that this profile is non-generic and does not come out of the box. In figure 3 below, you will see that I have named mine NG CONNECT XOF.
    • Expand the new profile to list all objects. At the bottom of the list of objects you will see an entry for Work Repositories.
    • Expand Work Repositories: you should now see the View method.

    NG CONNECT XOF Profile

    Figure 3: Expanded details of the new profile

    • Double-click the View method and deselect the Generic Privilege check-box.

    View Generic Privilege

    Figure 4: Deselect the generic privilege

    • Save the changes on the view method.

    Making the view method non-generic is what makes the entire profile a non-generic profile.

    Assigning the new privilege to a user

We now need to assign this new profile to our developers in lieu of the CONNECT profile. Drag and drop the new profile on each developer’s name and select Yes when you are prompted to grant this profile to the user. You can then add any other profile in order to complete the privileges for that user.

    User with List of Profiles

    Figure 5: User with NG CONNECT profile

    If we were to stop here, these users would not be able to see any Work Repository: we still need to assign them the authorized repositories.

    Assigning authorized repositories to the user

    Once the new privilege has been assigned to the developers, you have to drag and drop the work repository name from Topology onto the user name in the Security module. To perform this operation, we have to dissociate the Security navigator from the other navigators. Step (1) in figure 6 below shows how to grab the tab of the Security navigator to move that navigator outside of the panel.

    Dissociating Security Navigator

    Figure 6: Dissociating the Security navigator from other navigators

    Now that the Security navigator is separated from the other navigators, you can drag and drop any object from the other navigators on the user names. For this exercise, we want to drag and drop the authorized Work Repository onto the User name as illustrated in step (2) of Figure 6 above.

    If you want to move the Security navigator back with the other navigators once you are done assigning individual objects, grab the Security tab again, and move your mouse over the center of the navigator panel on your left. When you see a blue rectangle right in the middle of the panel as shown in step (3) of Figure 6, you can release the mouse and the Security navigator will be back where it was originally.

When you drag and drop the repository on the user, you are prompted to confirm that you want to grant this privilege to the user: select Yes. You then have to “activate” the privilege. There are two possible options in that activation window; make sure to select the bigger check-mark highlighted in Figure 7. If you roll your mouse over the activation icons, the correct one shows “allow all methods in all repositories”. Only this option will work; the alternative would require selecting a repository, which is precisely what we are restricting here.

    Activate Repository Privilege

    Figure 7: Activate the repository

You can now save the object definition and try to connect with the user whose repository selection has been limited. When this user tries to edit the connection to the repository in ODI Studio, only the allowed repository will be visible (WORKREP1116 in figure 8, which you can contrast with the privileges of SUPERVISOR in Figure 1).

    ODI Login with restricted repository list

    Figure 8: Restricted list of repositories

This approach has been tested and validated with ODI 11.1.1.6.0.

    Obviously, for a complete security profile you will still have to assign regular privileges to the different users in addition to this NG_CONNECT profile: DESIGNER, OPERATOR, TOPOLOGY ADMIN, etc.

    Conclusion

As we have seen in this step-by-step example, it is possible to configure users’ security to restrict their privileges so that they only see and access authorized Work Repositories, even if these repositories are defined in the same Master Repository: the Non-Generic option of the profiles and objects’ methods gives us this level of flexibility.

    For more ODI best practices, tips, tricks, and guidance that the A-Team members gain from real-world experiences working with customers and partners, visit Oracle A-Team Chronicles for ODI.

    How Many ODI Master Repositories Should We Have?


    Introduction

    A question that often comes up is how many Master Repositories should be created for a proper ODI architecture. The short answer is that this will depend on the network configuration. This article will elaborate on this very simplistic answer and will help define best practices for the architecture of ODI repositories in a corporate environment.

    How Many ODI Master Repositories Should We Have?

    Master and Work Repositories

    Before delving into the specifics of Master Repositories, it is important to have a good understanding of what Master Repositories are and how they differ from Work Repositories.

    Master Repositories are used to store what can be considered sensitive information:

- Information related to the connection to source and target systems: JDBC URLs, user names and passwords used to connect to the different systems, LDAP connectivity information, and schemas where data can be found. In fact, all the information that is managed with the Topology navigator is stored in the Master Repository;

- Information related to ODI internal security, if security is handled by ODI rather than by an external server for roles management (ODI 12c and later) and authentication management: ODI user names and passwords, user privileges and profiles, and profile definitions;

- Versions: when a new version of an object is created in the ODI Studio, it is saved in the Master Repository.

    Work Repositories are used to store the objects that result from the work of developers:

- Source and target metadata (models in the Designer navigator);

- Projects and all their child objects: folders, interfaces (ODI 11g and earlier releases) or mappings (ODI 12c and later), packages, procedures, variables, sequences, Knowledge Modules, and user functions;

- Scenarios, load plans and schedules;

- Logs resulting from code execution.

    There are also Work Repositories that are labeled “Execution” Work Repositories. These can be used in production environments to make sure that source code will not be modified hastily in a live environment. These Work Repositories only contain scenarios, load plans, schedules and the execution logs.

    Repositories relationships

    All Work Repositories are attached to a Master Repository. A Master Repository can group multiple Work Repositories, but a Work Repository is attached to one and only one Master Repository.

    When multiple Work Repositories are attached to the same Master Repository, they all share the same Topology and Security definitions. Versioned objects can be restored in any of the Work Repositories sharing the same Master Repository.

When several Work Repositories are attached to the same Master Repository, each repository typically matches an execution environment so that different versions of the objects can be used in parallel. The use of execution contexts allows for the execution of the objects in the proper environment.

    Figure 1 represents an environment where we would have 3 Work Repositories sharing the same Master Repository: development, test and production.


    Figure 1: three Work Repositories sharing a single Master Repository

In this example, release 1.0 of the scenarios can be running in production, while release 2.0 is developed and tested in separate Work Repositories. The versioned source code for release 1.0 of the scenarios is available in the Master Repository.

    Challenges encountered in a corporate environment

In a corporate environment, it is to be expected that the production environment is isolated from the rest of the information systems, in particular from the development and test environments. Oftentimes, firewalls will prevent data exchanges and communication between these environments.

    This will force us into having a separate Master Repository for the production environment. We have to expect that the architecture in a corporate environment will look more like the one represented in Figure 2 below.


    Figure 2: Three Work Repositories in a corporate environment with Firewall.

    In architectures of this type we can still take advantage of the notion of contexts as we move objects from the Development repository to the Test/QA repository (and take advantage of one single Master Repository inside the firewall). From then on, the promotion of objects to the production environment is limited to the components that have been validated in the Test/QA environment.

    The synchronization of the Topology objects is usually quite limited: logical schema names will have to match, and contexts will have to match. But the physical architecture will be specific to each environment.

The bulk of the objects that will be promoted to the production environment consists of Scenarios and Load Plans that have been successfully tested.

    One challenge with the setup we have so far is that the process of promoting objects to the production environment is not tested.

A more robust approach is to validate the process of promoting scenarios and topology to the production environment beforehand. To perform this validation, we do not have to operate behind the firewall, but we do need to replicate the environment. A pre-production environment (which looks like production, but is not production) allows us to perform this validation. This new approach is represented in Figure 3.


    Figure 3: Corporate environment with Firewall and pre-production repositories

    There will be customers who will want to have dedicated Master Repositories for each environment and each repository. This is absolutely a valid choice, but sharing Master Repositories will reduce the number of administrative tasks (such as ODI upgrades, topology updates, etc.) and allow for more flexibility in the evolution of the infrastructure.

    Expanding the infrastructure

    Now that we have a solid foundation for our infrastructure we can expand further.

Let’s look back at our original example: we had version 1.0 of the scenario in the production Repository and version 2.0 in the development Repository. We still need to be able to fix potential problems in the production environment, but for safety reasons we do not have the source code in that environment (this prevents over-zealous developers from introducing untested fixes directly in a production environment, with potentially disastrous effects). A common solution is to introduce a “Hotfix” repository, where the source code of the objects used in production can be restored and corrected as needed. Corrected objects can then be tested again before they are promoted to the production environment. This is another case for a repository that would be “like” production, without using the actual production repository. Here we can share the same Master Repository as the pre-production repository, as shown in figure 4.


    Figure 4: introducing a repository to fix issues identified in the production environment.

    Repositories infrastructure and objects promotion

    We can now superimpose objects movements over our architecture. Figure 5 represents objects movements as follows:

- Orange arrows represent the required synchronizations from a topology perspective. However, since the physical definition of servers will differ from environment to environment, only the names of the Logical Schemas and Contexts must be synchronized.

- Red arrows represent the movement of scenarios and load plans (execution components).

- Yellow arrows represent the movement of source objects in and out of source control (internal or external to ODI).

Note: if you look carefully, you will notice a small discrepancy in this picture: we represent the ability to restore source code in a test repository that is marked as an “Execution” Work Repository, and obviously you cannot import source code into such a repository. But some customers will want to have the source code available in their validation environment to allow for more intelligent testing. If you have that preference, then use a Development repository for your Test environment. If not, just remove the arrow that connects the test environment to source control.


    Figure 5: Complete environment with detailed objects movement.

    From this picture we can see the following movements:

- Scenarios and load plans (future production objects) are loaded from the development repository to the test repository for validation. Once validated, they are loaded to the pre-production repository so that the promotion process can be validated. Upon success, the objects are promoted to production.

- The source code for promoted objects is versioned as the scenarios and load plans are promoted to the test environment (obviously, intermediate versions can be created independently of the promotions). When objects are promoted to production, the matching source code can be restored in a hotfix environment to make sure that it is available in case issues are identified in the production environment.

- If fixes are performed in the hotfix environment, the corrected source code is versioned. At the same time, scenarios and load plans are promoted to the test environment. From there, they follow the same path as the objects promoted from the development environment: from Test to pre-production and ultimately to production.

    For more ODI best practices, tips, tricks, and guidance that the A-Team members gain from real-world experiences working with customers and partners, visit Oracle A-Team Chronicles for ODI.

    How to Understand and Diagnose ODI Connectivity Issues


    Introduction

Understanding connectivity issues in an ODI environment requires a good grasp of the ODI architecture, as well as a good understanding of where to look for information when such problems arise. We will review all these elements to help with diagnosis and resolution.

    How to Understand and Diagnose ODI Connectivity Issues

We will start with a quick review of the ODI architecture to make sure that all components are in place when we look for connection issues. Then we will look at connection issues that can be experienced on the agent or Studio side.

     ODI architecture reminder

    When considering ODI connectivity issues, we have to take into consideration all the elements represented in Figure 1:


    Figure 1: Components of the ODI architecture

    We will start with a quick review of these components:

    • ODI repository: this is the foundation of the ODI architecture. If the connection to the repository fails, nothing else will work. The ODI documentation has very detailed recommendations to make sure that the repository is always available, in particular here: http://docs.oracle.com/cd/E25054_01/core.1111/e10106/odi.htm#BGBGJGGA
    • ODI Agent: the agent is the orchestrator of all executions. As such, the agent will not work if it cannot connect to the repository. Whether you are using a standalone Agent or a JEE agent, the necessary JDBC drivers must be in place to connect to all sources and targets no matter what mechanisms are used to transfer data to and from these systems: these drivers will be used to send code (SQL code or scripts) that the different systems must execute. For ODI 11.1.1.x the drivers for the standalone agents will be under the /drivers directory under the agent installation folder. ODI also ships with DataDirect JDBC drivers. For ODI 11.1.1.x, these drivers are installed under %ODI_HOME%/odi_misc.
• ODI Studio: the Studio will not work if it cannot connect to the repository. In addition, some developers will connect to source and target components directly from the Studio, if only to reverse-engineer metadata or to view data from specific tables. For these operations, the necessary JDBC drivers must be in place for source and target systems. Custom drivers for the Studio must be installed under %APPDATA%\odi\oracledi\userlib
    • Source and target databases: no particular requirements as long as they are accessible to the Agent (and Studio if needed).
    • Firewalls: it is recommended to identify all firewalls that control traffic to and from the different elements described here (repository, agents, studio, sources and targets) in order to accelerate troubleshooting in case of connectivity issues.

     

    Steps to troubleshoot ODI connection issues

    Connectivity issues will manifest themselves either on the studio side or on the agent side. The following step by step instructions will help us make sure that connectivity is possible to and from all the different components.

    Preventing issues on the Studio side

    All the necessary connections for the Studio to operate properly are illustrated in Figure 2 below.

    Studio connections

    Figure 2: Required connections for ODI Studio

    To validate that all connections are working properly, run the following tests:

1. Test that you can connect to the repository: if you start the Studio and it refuses to connect, then either the database hosting the repository is down, or there is a network issue preventing you from connecting (a plain JDBC test, shown at the end of this section, can help isolate the cause).
2. Test the connection to the agent from the Topology navigator to make sure that the agent is up and running and that you can connect to that agent from the Studio, as shown in figure 3 below.


    Figure 3: Testing the agent from the Topology navigator

3. To check whether you can connect directly to the source and target databases, try to view data from the models or interfaces (warning: choose tables with a limited number of records or it will take a while for the rows to be displayed), as illustrated in figure 4.


    Figure 4: Right-click on a Datastore to view data
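If the Studio refuses to connect and you want to rule out ODI itself, a plain JDBC test from the same machine tells you whether the repository database is reachable at all. Below is a minimal sketch, assuming an Oracle repository with a hypothetical host, service name and schema owner; substitute the JDBC URL, user and password from your own login profile, and make sure the Oracle JDBC driver jar is on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class RepositoryConnectionTest {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details: replace with the values from the
        // Master Repository section of your ODI login profile.
        String url = "jdbc:oracle:thin:@//repo-host:1521/ORCL";
        String user = "ODI_MASTER";      // repository schema owner (assumption)
        String password = "change_me";

        try (Connection conn = DriverManager.getConnection(url, user, password);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("select sysdate from dual")) {
            rs.next();
            System.out.println("Repository database reachable, time is: " + rs.getString(1));
        }
        // If this fails, the problem is the database or the network,
        // not the ODI Studio configuration.
    }
}

If this simple test succeeds but the Studio still cannot connect, the issue is more likely in the login profile itself (wrong URL, driver, or credentials).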

    Identifying issues on the Agent side

    All the necessary connections for the Agent (JEE or Standalone) to operate properly are illustrated in Figure 5 below.

    Agent connections

    Figure 5: Required connections for ODI Agents

    To validate that all connections are working properly, run the following tests:

1. Test the connection to the agent from Topology (as we saw earlier in figure 3) to make sure that it is up and running and that you can connect to that agent from the Studio. If that test fails, make sure that the agent is up and running, that it was started on the port number defined in Topology, and that no firewall prevents the connection between the Studio and the agent.
2. Make sure that the agent can connect back to the repository. An easy test is to create a very basic package that only contains a “variable refresh” that runs a simple query against a database (for instance “select sysdate from dual” on Oracle). Generate a scenario from that package and run this scenario with the agent. Use the ODI Operator to view the ODI logs: as long as something appears for this execution (successful or not; all we are validating at this point is the connection to the repository), then the agent does indeed connect properly to the repository.
3. Can the agent connect to the source and target databases? You can test the connection to all physical servers from Topology and run that test through the agent: in the Topology navigator, open the definition of a physical data server, and click the Test Connection button as shown in figure 6.

    Test Database Connection

    Figure 6: Test database connection

    Then when prompted for the connection test, select the agent in the drop down as shown in figure 7.

    Test Database Connection with Agent

    Figure 7: Using an agent to test the database connection

    This will validate that the agent can connect to the database. You will have to repeat this test for every database in the environment.

    Additional tests to prevent further connectivity issues

    The most efficient data transfers will be performed with database utilities and direct connections from server to server, even if these operations are orchestrated by the ODI Agent. For these techniques to work, it is imperative that the source and target servers can see one another. It is recommended to try and ping the target servers from the source servers, and to ping the source servers from the target servers. This will guarantee that data can move from server to server in the most efficient manner.
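Note that ICMP ping is often blocked by the very firewalls that cause these problems, so a plain TCP connection attempt on the actual listener port is frequently a more meaningful test than ping. Below is a minimal sketch with hypothetical host names and ports; run it from each server in turn, since the direction of the connection matters when firewalls are involved.

import java.net.InetSocketAddress;
import java.net.Socket;

public class PortReachabilityTest {
    // Returns true if a TCP connection can be opened within the timeout.
    static boolean canReach(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical servers and ports: an Oracle listener and a SQL Server instance.
        System.out.println("target listener reachable: " + canReach("target-db-host", 1521, 3000));
        System.out.println("source listener reachable: " + canReach("source-db-host", 1433, 3000));
    }
}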

    What about flat files?

Connectivity to flat files can also present some challenges. In this case, typical connectivity issues all come down to privilege issues. Whether you are trying to access a file from the Studio or with an agent, the user that started the program (Studio or agent) must have sufficient privileges on the operating system to access the file. An easy way to validate this is to connect to the operating system using that same user name and to try to edit the file from a command line. If this operation fails, then ODI access to the file will be similarly restricted.
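The same check can be done programmatically, as long as it is run under the OS account that starts the agent or the Studio. A minimal sketch, assuming a hypothetical file path:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FileAccessTest {
    public static void main(String[] args) {
        // Hypothetical path: use the exact path declared in Topology for the file data server.
        Path file = Paths.get("/data/odi/incoming/customers.csv");

        System.out.println("exists:   " + Files.exists(file));
        System.out.println("readable: " + Files.isReadable(file));
        System.out.println("writable: " + Files.isWritable(file));
        // If any of these is false for the user that starts the agent or the Studio,
        // ODI will fail with a permission or "file not found" type of error.
    }
}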

     

    Understanding Connectivity Loss

Testing that you can connect from one element of the architecture to the next may not be enough to guarantee that all connectivity issues are at bay. Any of the connections we have validated above can be severed at any point in time: firewalls and databases can be configured to time out when a connection looks idle for a while, and these timeouts simply end up as a connection loss from an ETL perspective.

    Regardless of which connection is lost, understanding what information to look for is key to understanding what is happening.

    Agent losing connection to the repository

The most frustrating connection loss is when the connection between the agent and the repository is severed by a third-party component (firewall or database). If the agent loses the connection to the repository, there is no way for it to update the logs in that repository and indicate that there is a connection issue. SQL code that is being executed by the databases will continue to run. It typically looks like the agent “hangs”. The agent will try to re-connect to the repository, but if that connection is gone for good, you may eventually have to restart the agent. When the agent restarts, it will mark the stale sessions as bad (this restart process is described in detail in sections 7.2.1 and 7.2.2 of the ODI documentation available here: http://docs.oracle.com/cd/E25054_01/core.1111/e10106/odi.htm#BGBEGCBB).

One way to identify such connectivity problems is to monitor the agent stderr output. Note that pinging the agent (from ODI Topology with the “Agent Test” or with the ODI tool OdiPingAgent) can still be successful: all these tests validate is that the agent is up and running; they do not validate that the agent can connect to the repository. In our current use case, the agent is indeed up and running.
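If you want to script that basic liveness check outside of the Studio, an HTTP request to the agent URL gives the same (limited) information as the Agent Test button. A minimal sketch, assuming a JEE agent deployed under the default oraclediagent web application (adjust host, port and path for your own agent, standalone agents in particular):

import java.net.HttpURLConnection;
import java.net.URL;

public class AgentLivenessCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical agent URL: same host, port and web application declared in Topology.
        URL agentUrl = new URL("http://odi-agent-host:8001/oraclediagent/");
        HttpURLConnection conn = (HttpURLConnection) agentUrl.openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        conn.setRequestMethod("GET");

        int code = conn.getResponseCode();
        System.out.println("Agent HTTP response code: " + code);
        // A response only proves the agent process is up and listening;
        // as noted above, it says nothing about the agent's own connection
        // to the repository. Use the small test scenario for that.
        conn.disconnect();
    }
}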

Running a small scenario, similar to the one we described earlier when testing the agent’s connectivity to the repository, will be a better validation of the agent’s ability to connect to the repository and the databases.

    Agent losing connection to a database

    This type of problem is easier to diagnose since error messages will be properly reported in the ODI logs (the agent can report the errors to the repository in this case).

    The error messages will differ from one JDBC driver to the next, but can typically contain one of these errors:

    1. The connection does not exist.
    2. Connection reset by peer.
    3. (…) before connection was unexpectedly lost.

    These messages are usually issued by the JDBC drivers and then relayed as such by ODI. Here is an example of such a message from an AS400 JDBC driver:

ODI-1241: Oracle Data Integrator tool execution fails.
Caused By: java.sql.SQLException: The connection does not exist.
at com.ibm.as400.access.JDError.throwSQLException(JDError.java:382)
at com.ibm.as400.access.AS400JDBCConnection.checkOpen(AS400JDBCConnection.java:394)
at com.ibm.as400.access.AS400JDBCConnection.sendAndReceive(AS400JDBCConnection.java:2570)
at com.ibm.as400.access.AS400JDBCStatement.close(AS400JDBCStatement.java:434)
at com.ibm.as400.access.AS400JDBCPreparedStatement.close(AS400JDBCPreparedStatement.java:436)
at com.sunopsis.sql.SnpsQuery.close(SnpsQuery.java:386)
at com.sunopsis.dwg.tools.WaitForData.actionExecute(WaitForData.java:691)

    Note the com.ibm.as400.access.AS400JDBCConnection prefix to the error messages: this indicates that the error has been identified by an AS400 JDBC driver. And that driver reports that it just lost connection to the database itself…

    The most common reasons for network connection loss are:

• Firewalls that drop the connection: most firewalls sever connections after a certain amount of idle time (defaults of 10 and 30 minutes are observed quite regularly). Since the ODI agent will send SQL code to be executed by the database and wait for the completion of that code before issuing any other commands, it is very possible that the connection remains idle for some time.
• The same is true with database timeout parameters: many databases will sever what looks like an idle connection. In an ETL process, ODI may run some code on the source system, then on the target system, and then try to clean up some temporary data on the source system once everything has been committed to the target database (this will be the case for some CDC implementations, for instance). If the integration time on the target side is longer than the database timeout on the source side, the source connection will be long gone by the time ODI tries to do its housekeeping (the sketch below shows a quick way to detect such silent drops).
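When you suspect that a firewall or database timeout is silently dropping idle connections, JDBC itself offers a quick diagnostic: ask the driver whether a connection that looks open is still alive before reusing it. A minimal sketch using the standard Connection.isValid() call (JDBC 4), with a hypothetical URL and credentials:

import java.sql.Connection;
import java.sql.DriverManager;

public class IdleConnectionCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical source database connection.
        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//source-db-host:1521/ORCL", "scott", "tiger");

        // Simulate a long-running step on another system: the connection sits idle.
        Thread.sleep(60_000);

        // isValid() sends a lightweight check to the server (5 second timeout here).
        // If a firewall or a database idle timeout has severed the connection,
        // this returns false instead of failing later with "connection reset by peer".
        if (!conn.isValid(5)) {
            System.out.println("Connection was dropped while idle; reconnect before cleanup.");
        } else {
            System.out.println("Connection still alive.");
        }
        conn.close();
    }
}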

    Going further…

    If you are interested in learning more on this subject, Oracle support has a very comprehensive guide to troubleshoot connectivity errors. Look for note ID 850014.1 at http://support.oracle.com.

    For more ODI best practices, tips, tricks, and guidance that the A-Team members gain from real-world experiences working with customers and partners, visit Oracle A-Team Chronicles for ODI.

    Integrating ODI with OBIEE: How to invoke ODI scenarios & Load Plans from OBIEE


    Introduction

    In a data warehouse environment where the ETL/ELT tool is Oracle Data Integrator and the BI tool is OBIEE, it is common to find the need for BI users to execute ODI scenarios and ODI load plans within OBIEE via the BI EE Action Framework.  OBIEE administrators and BI users with special privileges may want to launch ad hoc jobs to process data for a specific BI task or business activity.  In most cases, these jobs are already implemented as ODI scenarios and ODI load plans, and they can be accessed and executed via ODI web services.  In OBIEE, you can invoke web services, or configure OBIEE Actions in conjunction with web services.  

    This article explains how you can invoke ODI web services from OBIEE to execute your ODI scenarios and ODI load plans.  This article will show you how to execute an ODI scenario from OBIEE using ODI Web Services.  The same steps can be followed to execute ODI load plans.

     

    Advantages of invoking ODI web services from OBIEE

     

    There are many benefits in exposing ODI web services to OBIEE.  Here are some advantages:

     

    • In a BI environment where the presentation tool is OBIEE, BI users are already familiar with the OBIEE interface.  No training is required for users to learn how to launch ODI scenarios and load plans.
    • ODI administrators can pass the power of executing scenarios and load plans to BI users that prefer to use the OBIEE Actions Framework.
    • ODI administrators can focus on more complex data integration activities, and BI users can execute their ad hoc jobs at any time.
    • For the BI user, the OBIEE Action is transparent; they don’t need to know that web services are being used to execute their ad hoc jobs.
    • OBIEE roles and user groups can be configured to secure the execution of specific OBIEE Actions that call ODI web services.
    • In OBIEE, the Oracle Credential Store can also be used to configure who and when an ODI web service can be executed from the BI Presentation Services.
    • The ODIInvoke and ODIInvokeCallBack web services can be used to manage many types of operations such as start, stop, restart, and check the status of a scenario or load plan.
    • In OBIEE, the parameters of the ODI scenarios and load plans can be customized with friendly labels and names (when creating an OBIEE Action), so BI users can easily understand what they need to specify when calling an ODI web service via an OBIEE Action.
• BI Administrators can create and configure OBIEE Actions and hide some of the ODI parameters from the BI user, so the user can only execute the action but cannot see the ODI security parameters, such as the ODI user and ODI password.

    How to Invoke ODI Scenarios & Load Plans from OBIEE?

    Configuring ODI Java EE Agent for OBIEE

     

This article assumes that you have already deployed and configured the ODI Agent as a Java EE application.  If you need instructions on how to deploy and configure the ODI Agent as a Java EE application, please see Deploying and Configuring the ODI Agent as a Java EE Application before you continue any further.

    The first step to integrate your ODI web services with OBIEE is to modify the ODI EAR (Enterprise ARchive) file that comes with the ODI Java EE agent deployment files.  

    The name of the EAR file is oraclediagent.ear, and it is typically located by default in the following location:

     

    <Oracle Middleware Home>\[ODI Domain Name]\setup\manual\oracledi-agent

     

Open the EAR file with a tool that can read EAR files (e.g., WinZip), as shown in Figure 1.

     


    Figure 1: ODI EAR file (oraclediagent.ear)

     

    At the root level of the ear file, locate a WAR (Web application ARchive) file called oraclediagent.war.  Select and open the WAR file as shown in Figure 2.

     


    Figure 2: ODI WAR file (oraclediagent.war)

     

At the root level of the WAR file, locate and select the WSIL (Web Services Inspection Language) file called OdiInvoke.wsil (see Figure 3).  Open the WSIL file with a text editor (e.g., Notepad).

     


    Figure 3: ODI WSIL file (OdiInvoke.wsil)

     

    Add a new web service called “ODI Web Services”, and specify the name of the service (ODIInvoke) and its WSDL (XML document that describes the service).   ODIInvoke is the ODI web service that can be used by other applications such as OBIEE to invoke ODI scenarios and ODI load plans.

    If the WSIL file has never been modified, replace the content of the WSIL file with the following XML code (adjust the service name, URL location, and port number based on your requirements):

     

<?xml version="1.0" encoding="UTF-8"?>
<!--Generated by Oracle BI Services-->
<inspection xmlns="http://schemas.xmlsoap.org/ws/2001/10/inspection/">
<service>
<abstract>ODI Web Services</abstract>
<name>ODI Web Services</name>
<description referencedNamespace="http://schemas.xmlsoap.org/wsdl/" location="http://localhost:8001/oraclediagent/OdiInvoke?WSDL"/>
</service>
</inspection>

Save your changes in all three files: the WSIL, the WAR, and the EAR.

     

    The next step is to redeploy your ODI Java EE agent.  In Weblogic, launch the Weblogic Console and login as an Administrator.  Locate your oraclediagent deployment, and proceed to update it with the new version of your oraclediagent.ear file as shown in Figure 4.

     


    Figure 4: oraclediagent.ear Re-deployment

     

    Select “Finish” in the Update Application Assistant screen.  If you successfully updated the oraclediagent with the new EAR file, you should see a confirmation message as shown in Figure 5.

     


    Figure 5: Weblogic confirmation

     

    The next step is to test your new WSIL.  Launch a browser, and type the URL of your WSIL (http://localhost:8001/oraclediagent/OdiInvoke.wsil).  If your WSIL was successfully configured, you should see the following XML code as shown in Figure 6:

     


     Figure 6: URL Test of ODI WSIL

     

    The next step is to validate and test your WSDL document.  Launch a browser, and type the URL of your WSDL (http://localhost:8001/oraclediagent/OdiInvoke?WSDL).   If your WSDL was successfully configured, you should see the following XML code as shown in Figure 7:

     


    Figure 7: URL Test of OdiInvoke WSDL
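If you want to go one step further than viewing the WSDL, you can post a SOAP request to the OdiInvoke endpoint yourself before involving OBIEE at all; this is essentially what the OBIEE Action will do behind the scenes. The sketch below is only illustrative: the element names and namespace of the request body vary between ODI versions, so copy the exact payload structure from your own WSDL (the credentials, work repository, scenario name, version and context shown here are hypothetical).

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class OdiInvokeStartScenTest {
    public static void main(String[] args) throws Exception {
        // Illustrative payload only: take the real element names and namespace from
        // http://localhost:8001/oraclediagent/OdiInvoke?WSDL for your ODI version.
        String soapRequest =
            "<soapenv:Envelope xmlns:soapenv=\"http://schemas.xmlsoap.org/soap/envelope/\""
          + " xmlns:odi=\"xmlns.oracle.com/odi/OdiInvoke/\">"
          + "<soapenv:Body>"
          + "<odi:invokeStartScenRequest>"
          + "<Credentials>"
          + "<OdiUser>SUPERVISOR</OdiUser>"              // hypothetical ODI user
          + "<OdiPassword>change_me</OdiPassword>"
          + "<WorkRepository>WORKREP</WorkRepository>"
          + "</Credentials>"
          + "<Request>"
          + "<ScenarioName>PAYROLL</ScenarioName>"
          + "<ScenarioVersion>001</ScenarioVersion>"
          + "<Context>GLOBAL</Context>"
          + "<SyncMode>1</SyncMode>"
          + "</Request>"
          + "</odi:invokeStartScenRequest>"
          + "</soapenv:Body>"
          + "</soapenv:Envelope>";

        URL endpoint = new URL("http://localhost:8001/oraclediagent/OdiInvoke");
        HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(soapRequest.getBytes(StandardCharsets.UTF_8));
        }
        // Print the raw SOAP response: a session number on success, a SOAP fault otherwise.
        InputStream body = conn.getResponseCode() < 400 ? conn.getInputStream() : conn.getErrorStream();
        try (Scanner in = new Scanner(body, "UTF-8")) {
            while (in.hasNextLine()) {
                System.out.println(in.nextLine());
            }
        }
    }
}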

     

    At this point, no additional configuration is needed in ODI.  The next steps are going to demonstrate how you configure OBIEE to call the ODI web services.

     

    Registering ODI Web Services in OBIEE

     

    The Action Framework of OBIEE 11g is a component of the Oracle BI EE architecture that allows us to invoke web services that are deployed in other application servers.  These services can be configured in OBIEE as Action Web Services, and they can be used within the Oracle BI Presentation Services.  For more information about OBIEE Actions Framework, please see Using Actions to Integrate Oracle BI EE with External Systems.

    Our discussion will be limited to what configuration is needed to enable and execute ODI web services from OBIEE.  However, we strongly recommend that you secure your OBIEE actions.  For more information, please see Overview of Action Security.

    The first step is to register the ODI Web Services in the OBIEE Action Framework configuration file called ActionFrameworkConfig.xml.  This configuration file is typically located by default in the following location:

     

    <Oracle Middleware Home>\user_projects\domains\bifoundation_domain\config\fmwconfig\biinstances\coreapplication

     

    In this file, we are going to add a new WSIL entry to specify the name and location of the ODI WSIL.  OBIEE will use the WSIL to locate the WSDL, and retrieve available services based on the WSDL document.  For more information on how to configure this file, please see Configuring the OBIEE Action Framework.  Make a backup of this file before making any modifications.

Modify the ActionFrameworkConfig.xml file by adding a new WSIL registry (under the <registries> section) for ODI Web Services as follows (change the name of the WSIL registry and the path of the WSIL based on your requirements):

     

    <registry>
    <id>ETLWS</id>
    <name>ETL Web Services</name>
    <content-type>webservices</content-type>
    <provider-class>oracle.bi.action.registry.wsil.WSILRegistry</provider-class>
    <description></description>
    <location>
    <path>http://localhost:8001/oraclediagent/OdiInvoke.wsil</path>
    </location>
    </registry>

    Save your changes.

    Restart your OBIEE server, and verify that the ActionFrameworkConfig.xml file has been loaded successfully in the OBIEE server as shown in Figure 8.

     


    Figure 8: OBIEE ActionFrameworkConfig.xml successfully loaded.

     

    Invoking ODI Web Services in OBIEE

Once your ODI web services have been configured with the OBIEE Action Framework, BI users can log in to OBIEE Presentation Services and access the ODI web services by creating a new OBIEE Action of type “Invoke A Web Service”, as shown in Figure 9.

     


    Figure 9: Creating an OBIEE Action

     

    Create a new OBIEE Action of Type “Invoke A Web Service”.  The new ODI web services will appear in the “Select Web Service Operation” screen of OBIEE as shown in Figure 10.

     


    Figure 10: ODI Web Services in OBIEE

     

    Select an ODI operation such as InvokeStartScen, and customize the web service screen as shown in Figure 11.

     


    Figure 11: Defining and customizing parameters for ODI InvokeStartScen

     

    In my example above, I created an Action to run an ODI scenario called “PAYROLL”.  This scenario is an ODI package that BI users like to invoke once a month.  This Action will help us automate this task, so BI users can invoke this scenario when they are ready to process their monthly payroll.

    In the “Edit Action” screen above, I modified the following parameters:

    • ODI Prompts:  I replaced “ODI User” with “Payroll User”, “Scenario Name” with “Payroll Job Name”, and “Value” with “Enter Payroll Month”
    • ODI Values:  I left User and Password blank, so the user will enter these values when he or she runs the Action.  Also, I provided a syntax for parameter “Enter Payroll Month”:  “YYYY-MM”.
    • Other attributes:  I defined some of the values as Fixed or Hidden because  BI users should not worry about ODI Repository Names, Scenario Versions, Context, etc.
    • Options:  I selected “Options”  to customize “Dialog Title”, “Action Help”, and the “Execute Button Text” as shown in Figure 12.


    Figure 12: Customizing Options for an OBIEE Action

     

    Save your new OBIEE Action.  

     

    Now that my OBIEE Action has been fully configured, I decided to create a dashboard in OBIEE that allows users to execute the Action (the ODI scenario called “PAYROLL”), and check the status of the scenario, all in one screen.  To do this, I had to bring two ODI tables, SNP_SESSION and SNP_USER, into OBIEE.  I modified the OBIEE RPD to model these two tables as shown in Figure 13.

     


    Figure 13: OBIEE RPD with ODI tables

     

    For more information on how to integrate OBIEE with ODI metadata to build report-to-source data lineage, please see Oracle Business Intelligence Enterprise Edition Data Lineage For ODI 11g.

     

    Finally, I created an OBIEE dashboard that includes 4 main areas:  (1) an option to execute the Action, (2) when the action is invoked, it will prompt the user for necessary parameters, (3) an option to filter the ODI user(s), and (4) a table that shows the status of the ODI executions.  Figure 14 illustrates my final dashboard.

     


    Figure 14: OBIEE Dashboard that invokes ODI scenarios

     

    In ODI, I can see the executions of the OBIEE Actions in the ODI Operator as shown in Figure 15:

     


Figure 15: ODI Operator

     

    Conclusion

ODI web services are a great mechanism to execute ODI scenarios and load plans from other enterprise applications such as OBIEE.  Now you can configure ODI and OBIEE to invoke the most complex ODI scenario with the click of a button!

    For more ODI best practices, tips, tricks, and guidance that the A-Team members gain from real-world experiences working with customers and partners, visit Oracle A-Team Chronicles for ODI.

     

    Correlating SQL statement in DB Sessions with ODI Sessions in ODI 11.1.1.7


    Jayant Mahto from the ODI Product Management team has published an excellent blog explaining how to correlate database sessions with ODI sessions. You can find his post here: https://blogs.oracle.com/biapps/entry/correlating_sql_statement_in_db

    For more ODI best practices, tips, tricks, and guidance that the A-Team members gain from real-world experiences working with customers and partners, visit Oracle A-Team Chronicles for ODI.

    Using an External Database for your XML Schema, Part I: Understanding the ODI XML JDBC Driver


    Introduction

Whether you are processing XML files or XML messages on JMS, ODI provides a JDBC driver that will translate the XML structure into a relational schema. By default, this schema is created in an in-memory database. While this approach will work extremely well for small XML files or messages, it can become challenging for larger volumes. In this two-part article, we first review how the driver works, then detail the benefits of using an actual database instead of the default in-memory storage for the creation of the relational schema.

    Using an External Database for your XML Schema, Part I: Understanding the ODI XML JDBC Driver

    It is important to understand how the driver works before we look into the specifics of particular configurations. This is our immediate focus.

    How the driver works

    If you want to know everything there is to know about the XML JDBC driver, the Oracle® Fusion Middleware Connectivity and Knowledge Modules Guide for Oracle Data Integrator has a very extensive appendix that lists all commands, parameters and options supported by the driver: http://docs.oracle.com/cd/E28280_01/integrate.1111/e12644/appendix_xml_driver.htm#CHDICHDB .

    To allow us to cover both JMS XML messages and XML files, we will talk generically about XML structures when the explanation applies to both files and messages. In the case where the behavior differs, we will explicitly mention XML files and JMS XML messages. Note that when we define the JDBC URL to connect to a JMS XML message, we can directly use the parameters of the XML JDBC driver in the URL: ODI will know which parameters to use specifically for JMS or for XML.

    For now, all we need is to understand the mechanisms behind the driver. If you read the ODI documentation, you will see that elements get converted to tables and attributes get converted to columns. By default, the database schema used to host these tables resides in the memory space of the agent processing the XML structure.

    In addition, the driver will automatically add new columns to handle the relationship between elements: primary keys and foreign keys are added to retain the XML hierarchy in the form of a parent-child relationship. For instance, consider the following basic structure in XML:

<GEOGRAPHY_DIM>
     <country COUNTRY_ID="6" COUNTRY_NAME="Australia">
          <region REGION_ID="72" REGION_NAME="Queensland">
               <city CITY_ID="63" CITY_NAME="Brisbane" POPULATION="505179" />
          </region>
     </country>
</GEOGRAPHY_DIM>

    This will be translated to 4 tables: GEOGRAPHY_DIM, COUNTRY, REGION, CITY.

    Note: we should actually look at the DTD or XSD for an accurate definition of the XML structures. The file is provided here for illustration purposes only.

    The COUNTRY table derived from the original XML file country elements has two columns: COUNTRY_ID, COUNTRY_NAME

    The JDBC driver adds columns to link the countries to their parent element (GEOGRAPHY_DIM) and to allow for the regions to be associated with the proper country.  The new columns that are created automatically by the driver are: GEOGRAPHY_DIMFK, COUNTRYPK

    The driver has also added “orders” columns to keep track of the relative position of the elements in the original file, should that information be needed later on.

    XML Schema in database

    Figure 1: relational representation of an XML file.
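Once the driver has populated this schema, the hierarchy can be queried with plain SQL through the same JDBC connection. The snippet below is a plain JDBC sketch, assuming the ODI 11g XML driver class name and a hypothetical URL, and assuming that the driver-generated key columns follow the pattern described above (GEOGRAPHY_DIMFK/COUNTRYPK for countries, and the analogous COUNTRYFK/REGIONPK and REGIONFK for regions and cities); check the actual column names in your reverse-engineered model.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class XmlDriverQueryTest {
    public static void main(String[] args) throws Exception {
        // ODI 11g XML JDBC driver (assumption: verify the class name and URL syntax
        // against the XML driver appendix of the Connectivity and KM guide).
        Class.forName("com.sunopsis.jdbc.driver.xml.SnpsXmlDriver");
        String url = "jdbc:snps:xml?f=/data/xml/geography.xml&s=GEO"; // hypothetical file and schema

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             // Join the generated tables through the driver-added key columns.
             ResultSet rs = stmt.executeQuery(
                 "select c.COUNTRY_NAME, r.REGION_NAME, ci.CITY_NAME, ci.POPULATION "
               + "from COUNTRY c, REGION r, CITY ci "
               + "where r.COUNTRYFK = c.COUNTRYPK "
               + "  and ci.REGIONFK = r.REGIONPK")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " / " + rs.getString(2)
                        + " / " + rs.getString(3) + " (" + rs.getInt(4) + ")");
            }
        }
    }
}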

    Loading the data

    The default behavior is that when you first connect to the data, the driver will immediately load the data in the database schema. If you are using the agent’s out-of-the-box configuration, this means that the data is loaded in the in-memory database. Subsequent accesses to the data read directly from the database schema and do not need to connect back to the original XML structure. The connection to XML is made when:

- You test an XML connection from Topology

- You reverse-engineer an XML model

- You view the data from the Studio

- A task needs to connect to XML (when executing a select statement, for instance)

    One direct benefit is that generic Knowledge Modules can be used to read data from XML files (LKM SQL to Oracle for instance): all select statements are executed against the database schema where the driver has loaded the data. There is nothing specific to the driver itself anymore.

    There are several ways you can control and alter the default behavior. The first one is by adding parameters to the XML JDBC URL used in the definition of the server in Topology. The second is by issuing specific commands to the XML JDBC driver.

    XML JDBC Driver Properties

    All properties for the XML JDBC driver are available in the Oracle® Fusion Middleware Connectivity and Knowledge Modules Guide for Oracle Data Integrator available here: http://docs.oracle.com/cd/E28280_01/integrate.1111/e12644/appendix_xml_driver.htm#CHDECBHH. In particular, the properties “load_data_on_connect” and “drop_on_disc” can impact when the data is loaded and removed from the database schema. The following table is an extract from the documentation:

Property: load_data_on_connect (or ldoc). Mandatory: No. Values: boolean (true | false). Default: true.
Behavior: automatically load the data in the schema when performing the JDBC connection. If set to false, a SYNCHRONIZE statement is required after the connection to load the data. This option is useful to test the connection or browse metadata without loading all the data.

Property: drop_on_disc (or dod). Mandatory: No. Values: boolean (true | false). Default: false.
Behavior: automatically drop the schema when closing the JDBC connection. If true and the schema is stored in the built-in engine, it is always dropped. If true and the data is on an external database, only the current reference to the schema in memory is dropped, but the tables remain in the external database. This means that if you try to connect to this schema again, it will reuse the tables in the external database rather than starting from scratch (as it would when the data is loaded in memory).

     

     

    XML driver commands

    All commands for the XML JDBC driver are available in the Oracle® Fusion Middleware Connectivity and Knowledge Modules Guide for Oracle Data Integrator available here: http://docs.oracle.com/cd/E28280_01/integrate.1111/e12644/appendix_xml_driver.htm#CHDFDFJF

As long as the technology in your Knowledge Modules or ODI procedures is set to “XML”, you can use these commands directly. In particular, the SYNCHRONIZE command allows you to update data in the database from the file, or to overwrite the file with the content of the tables in the schema. For instance, to load all data from the file and overwrite the content of the database schema, you would use the command:

    SYNCHRONIZE [SCHEMA <schema_name>] FROM FILE

    Conversely, to overwrite the file with the content of the database schema you would use the command:

    SYNCHRONIZE [SCHEMA <schema_name>] FROM DATABASE

     

    The consequences for data movement

    There are many benefits to the processing of XML files in an in-memory engine: separate agents can connect to the same JMS queue and work in parallel to process more messages; loading small messages is extremely efficient; no additional footprint is required for the transformation of the canonical data into a relational format.

    Figure 2 below illustrates the data movement when using the memory engine with the XML JDBC driver.


    Figure 2: data movement with the in-memory engine

    When large volumes of data have to be processed for each message, it can be more efficient to use an actual database for the tables created by the XML JDBC driver. This will be true for three main reasons:

- Memory limitations: if all the processing of the XML file happens in memory, large volumes of data suddenly clog the memory space of the agent and can potentially require memory swaps that are costly in terms of performance.

- Insert-select: when the data is in an in-memory database, you still have to move the data to the target database. If you can stage the data directly in the target server, then you will be able to leverage set-based operations, one of the main reasons why ELT is so much more efficient than ETL.

- Staging tables: if you ever have to perform heterogeneous join operations (i.e., joining with data that resides outside of the XML structure), ODI will create staging tables in the target server. This amounts to staging twice: once in memory, once in the target server. If the tables are created directly in the target database, there is no need for additional staging.

    Figure 3 illustrates the data movement when using an external database to store XML data for the XML JDBC driver:

    External Database Schema

    Figure 3: data movement with external database

    Part II of this blog focuses on how to best take advantage of this external storage.

    For more ODI best practices, tips, tricks, and guidance that the A-Team members gain from real-world experiences working with customers and partners, visit Oracle A-Team Chronicles for ODI.

    Using an External Database for your XML Schema, Part II: Optimizing the Use of the External Schema


    Introduction

Whether you are processing XML files or XML messages on JMS, ODI provides a JDBC driver that will translate the XML structure into a relational schema. By default, this schema is created in an in-memory database. While this approach will work extremely well for small XML files or messages, it can become challenging for larger volumes. In this two-part article, we first review how the driver works, then detail the benefits of using an actual database instead of the default in-memory storage for the creation of the relational schema.

    Using an External Database for your XML Schema, Part II:  Optimizing the Use of the External Schema

    When dealing with a large XML payload (file or JMS message) the most efficient way to process the data will be to leverage the XML JDBC driver to load the data into the target database and then use ODI to process the data directly from that database, bypassing the XML JDBC driver altogether. In other words, once the data is loaded by the driver we completely ignore the fact that the data comes from XML: it is just another schema of the target database.

    To implement this we will take a slightly different approach for plain XML Files vs. XML messages over JMS as the constraints are a little bit different: messages on JMS will arrive at a more or less continuous pace, while files tend to be more static. We also have to confirm to JMS that we have processed the data (so as to remove the message from the queue) whereas there is no such process with XML files.

    As we go over the steps to use for an optimal leverage of the external schema, you may want to have the following references at hand:

- The Oracle® Fusion Middleware Connectivity and Knowledge Modules Guide for Oracle Data Integrator contains all the necessary details on parameters and options. This can be found in Section B.3.3 Using an External Database to Store the Data, located here: http://docs.oracle.com/cd/E28280_01/integrate.1111/e12644/appendix_xml_driver.htm#CHDICHDB

- Chapter 9 of the Oracle Data Integrator 11g Cookbook from Packt Publishing contains step-by-step instructions on how to set up the XML JDBC driver with an external database.

    Using an external database to store the XML schema

What we mean by “external database” is a database that is not the agent’s in-memory database. For ODI 11.1.1.x that database can be Oracle, Microsoft SQL Server or MySQL (and of course HSQLDB, which is used for the in-memory database).

    There are two ways to indicate to the driver that an external database will be used:

    -         Specific properties can be added to the JDBC URL to specify where to connect. These properties are dp_driver, dp_url, dp_user, dp_password and dp_schema (plus dp_catalog if you are connecting to Microsoft SQL server).

    -         A single property, db_props, can be used to point to a properties file that contains all additional parameters (driver name, URL, user name, password, schema, and catalog for SQL Server). This file must be located in the CLASSPATH of your agent (typically with the JDBC drivers that you are using for ODI), NOT with the XML file.

    Note that in both cases the password must first be encrypted. You can use the encode.bat or encode.sh scripts that are installed with the standalone agent to perform this operation.

    In addition to the mandatory parameters described above, additional parameters can be specified to alter the behavior of the driver. The XML Driver documentation describes these in detail here: http://docs.oracle.com/cd/E28280_01/integrate.1111/e12644/appendix_xml_driver.htm#CHDFIJEH
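
    To make this more concrete, here is a hedged illustration of the URL approach. The f, d and s parameters identify the XML file, its XSD and the schema name; the file paths, schema name, connection details and encoded password below are placeholders, and the exact URL syntax of the XML JDBC driver should be verified against the documentation referenced above (the URL is a single line in practice, wrapped here for readability):

        jdbc:snps:xml?f=/data/geo.xml&d=/data/geo.xsd&s=GEO
            &dp_driver=oracle.jdbc.OracleDriver
            &dp_url=jdbc:oracle:thin:@dbhost:1521/ORCL
            &dp_user=XML_STG
            &dp_password=<password encoded with encode.sh>
            &dp_schema=XML_STG

    With the db_props approach, the same driver, URL, user, encoded password and schema values would instead be listed in the properties file referenced by db_props.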

    One benefit of using an external database is that we can use the target database to store the XML schema. But if we let the XML JDBC driver handle the data, ODI will still think that there are two connections to the data: XML and the target database. The construct is the same as if the data were still outside of the database.

    XML Schema read

    Figure 1: processing data from an XML connection into a database

    If we use the data directly from a database connection (instead of using an XML connection), then all SQL operations can be done within the same database. From an ODI perspective, this means that there is no need for an LKM anymore: your IKM can do it all. This saves us the creation of at least one staging table – something that can improve performance quite a bit!

    Read from database schema

    Figure 2: loading XML using a database connection

    This is why we restrict the use of the driver to only load the data in the database. After that we reverse engineer and process the tables and their data as a normal schema of the database. Figure 3 shows a view of Topology where such a schema is declared under an Oracle server connection.

    XML schema in database

    Figure 3: XML data is stored in a database schema

    A simple trick for JMS XML connections is to use actual XML and XSD files to perform the first connection, rather than trying to connect to the queue and retrieve an XML message.

    The sole purpose of that first connection is for the XML JDBC driver to create the tables in the target database schema. Once the tables are created, we can reverse engineer the tables from the database as shown on figure 4.

    XML_DB_Model

    Figure 4: Tables from an XML file reverse-engineered under an Oracle model.

    You will note that in the database, the names of the tables are prefixed with the name of the XML schema defined in the connection parameters of the XML file (GEO in the example shown in figure 4).

    All interfaces will be built using these tables as sources. The only element we need to add for the execution of these interfaces is an ODI procedure that loads the data from the XML structure into the target database.

    This operation differs somewhat depending on whether we are reading from JMS or reading an XML file, so we will now cover the two separately.

    Plain XML File

    We now have a JDBC URL that redirects the XML schema to our target database. As long as we do not set the parameter load_data_on_connect to false, a simple Test of the connectivity to the file from Topology will create the schema and load the data immediately into the database. But this manual operation needs to be automated for packages and scenarios that will process the data from this schema. For this, we create an ODI procedure with two simple steps. In both steps, make sure that the technology is set to XML and that the schema points to the XML schema we are working with. Figure 5 illustrates these settings.

    Step1XML

    Figure 5: XML step in procedure

    The Command on Target entries for these steps are:

    TRUNCATE SCHEMA

    SYNCHRONIZE FROM FILE

    Use this procedure in your packages before all interfaces that will read XML data, as shown on figure 6 below. Remember, since the model that represents the XML data points to the target database schema, the source tables in these interfaces come directly from that target database.

    XML File Procedure in Package

    Figure 6: using a procedure to load XML data into a database schema.

    JMS XML

    For XML data extracted from JMS, there are two things that we need to do. First, as we did for the XML file, we need to load the data before we run the interfaces.  Then, once we have successfully processed all the data that we need, we must “commit the read” in JMS so that the message gets removed from the queue.

    As we did for the XML file, for each step in these procedures we will have to set the technology to JMS-XML and select the proper logical schema name.

    All we have to do to load data from JMS is to look at the ODI LKMs that usually perform that operation for us. The LKM JMS XML to SQL has a step called Load JMS to XML. The code that we are interested in will be under the tab Command on Source. As you can see on figure 7, this step uses many of the options of the KM: you can either create the same options in your procedure and then copy the code from this step as-is, or you can decide on predefined values and replace all references to odiRef.getUserExit with the values that you have selected.

    JMS KM Step

    Figure 7: Extracting data from JMS and loading into the XML schema

    Note that in the case of the procedure, you copy this code to the Command on Target side of the step. This procedure will have a single step, and it will have to be called in your packages before any interface that needs access to the data.

    We will need a second procedure that will be executed after the interfaces to commit the reads on JMS. Again, the LKM JMS XML to SQL gives us the solution with the step Commit JMS Read. We have to look under the Command on Source tab to find the code (quite simple this time: Commit). We can copy this under the Command on Target tab of the single step of our second procedure:

    JMS KM Commit Step

    Figure 8: commit JMS.

    Remember to set the technology to JMS-XML and to select the proper logical schema name for each step in these procedures.

    Now use both procedures in the package: the first one to read, the second to commit the read after all interfaces have been executed, as shown on figure 9 below.

    JMS Package

    Figure 9: Read and commit JMS – then process data from database model

    An additional benefit on the JMS XML side is that if you later decide to process more of the data from the same message with additional interfaces, you will not have to worry about setting the commit option correctly in the last of these interfaces, as you would normally have to do with the standard LKM JMS XML to SQL: now that you have externalized the command, you can process all the data you want between the load and commit procedures.

    For more ODI best practices, tips, tricks, and guidance that the A-Team members gain from real-world experiences working with customers and partners, visit Oracle A-Team Chronicles for ODI.


    Implementing Early Arriving Facts in ODI, Part II: Implementation Steps


    Introduction

    This article includes step-by-step instructions on how to implement an early arriving fact using database functions, ODI lookups, ODI user functions, and ODI custom knowledge modules.  In this article, I focus on creating a reusable solution in ODI that can be utilized for other early arriving facts in your data warehouse.

    This is Part II of a two-part article.  Part I gives you an overview of the proof-of-concept (POC) I created in ODI to implement an early arriving fact.  To read Part I, go to “Implementing Early Arriving Facts in ODI, Part I: Proof of Concept Overview“.  To download a copy of the POC I created for this article, please go to “ODI Repository Sample for Early Arriving Facts”.

    Implementing Early Arriving Facts in ODI, Part II: Implementation Steps

    Implementation Steps

    My POC includes the following implementation steps:

    • Creating database functions to handle late arriving dimensions – I will demonstrate how to implement these functions using PL/SQL.  I will keep the required code to a minimum and let ODI handle the rest of my ETL process flow.  I am proposing the development of 2 functions:  (1) a customer function, and (2) a product function.  We don’t need a function to handle the status dimension, since we are not expecting late arriving records for this dimension.
    •  Implementing warehouse dimensions as ODI lookups – ODI lookups are one of the best features of ODI 11g.  I am going to demonstrate how to implement all of my dimensions as ODI lookups for my Early Arriving Orders Fact Interface.   For the Customer Slowly Changing Dimension, I will add an additional condition to ensure that my Customer lookup filters data by the active row of each customer.
    • Creating ODI user functions to manage early arriving facts – Another great feature of ODI.  I plan to encapsulate the database functions into ODI user functions, so I can add additional business rules and re-use them in any other early arriving fact table.  This section will give you a clear overview of the benefits of using ODI user functions.
    • Mapping Warehouse Keys of the Fact Table with ODI user functions – In this section, I am going to illustrate how to map warehouse keys of the fact table with ODI user functions in our Early Arriving Orders Fact Interface.
    • Modifying “IKM Oracle Incremental Update” – This KM will be modified to make the “sub-select inline view” step optional.  This step is not required by the Orders Fact interface.
    • Testing – Finally, we will test our POC, and validate our test cases.

    Step #1: Creating database functions to handle Late Arriving Dimensions

    Our first step is to create a series of database functions to handle late arriving dimensions.   Since my database technology is Oracle, I will implement these functions using PL/SQL.  My goal is to write the minimum amount of PL/SQL code required to effectively manage the late arriving dimensions.  The rest of the logic will be handled by ODI.  As I explained in part I of this article, there are 2 late arriving dimensions: customer dimension (type 2), and product dimension (type 1).

    The logic to handle the missing dimension record is the same for both.  Each function will take one parameter:  the natural key of the missing dimension member.  Each function will return one value:  the warehouse key, or surrogate key, of the new dimension record.  The functions will only be invoked if an early arriving fact record is detected.

    Figure 1 illustrates the PL/SQL package definition for our 2 late arriving dimension functions:

    Package Late Dims - Definition

    Figure 1: PL/SQL Package Definition

     

    Figure 2 illustrates the actual implementation of the PL/SQL function that I developed to handle the customer late arriving dimension.  I am using the default table (W_DEFAULT_VALUES) to populate the values, but a list of hard-coded values can be used too.

    Customer Function

    Figure 2: PL/SQL function LATE_DIMS.D_CUSTOMER_DIM

     

    The handler for SQL exception -00001 (unique constraint violated) is needed in case another interface, running at the same time, inserts the same late arriving dimension record.
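
    For reference, here is a minimal sketch of what such a function can look like. The sequence name is an assumption, the placeholder row is kept deliberately simple (the POC pulls its default attribute values from W_DEFAULT_VALUES), and the actual code shipped with the POC may differ:

        FUNCTION D_CUSTOMER_DIM (p_natural_key IN VARCHAR2) RETURN NUMBER IS
          v_row_wid  W_CUSTOMER_D.ROW_WID%TYPE;
        BEGIN
          -- Insert a placeholder row for the late arriving customer.
          INSERT INTO W_CUSTOMER_D (ROW_WID, CUST_ID, CURRENT_FLAG)
          VALUES (W_CUSTOMER_D_SEQ.NEXTVAL, p_natural_key, 1)
          RETURNING ROW_WID INTO v_row_wid;
          RETURN v_row_wid;
        EXCEPTION
          -- ORA-00001: another session inserted the same customer concurrently;
          -- fall back to reading the surrogate key of its active row.
          WHEN DUP_VAL_ON_INDEX THEN
            SELECT ROW_WID INTO v_row_wid
            FROM   W_CUSTOMER_D
            WHERE  CUST_ID = p_natural_key
            AND    CURRENT_FLAG = 1;
            RETURN v_row_wid;
        END D_CUSTOMER_DIM;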

    Step #2: Implementing Warehouse Dimensions as ODI Lookups

    ODI lookups are a great feature in ODI 11g that can be used to implement warehouse dimensions in interfaces where the target datastore is a warehouse fact table.

    For more information on how to define Source Datastores and Lookups in ODI 11g, please see:  “Defining Source Datastores and Lookups in ODI”.

    In our discussion, we are going to model our 3 dimensions (Customer, Status, and Product) as source lookups for the Orders Fact interface (Warehouse.W_ORDERS_F).  The Source Datastore Area of our Orders Fact interface will be modeled as follows:

    • One Driving Table:  ORDERS.  This is a staging table, and it is the main source of the Orders Fact table.  It is called the “driving” table.
    • Three Lookup Tables:  Status, Product, and Customer dimensions.

    Figure 3 shows how to add the driving table called “ORDERS” (the staging table) into the source area of our Orders Fact Interface:

    1. Locate your driving table in the “Models” section of ODI, and drag it into the source area of your interface.
    2. In the Target area of your interface, map column ORDER_NUM with ORDERS.ORDER_NUM.
    3. In the Source area of your interface, select the Lookup option to add your first lookup table.

    Fact Interface - Adding A Lookup - Step 1

    Figure 3: Adding the Driving Table

     

    The first lookup we would like to create is the Status Dimension.  Figure 4 shows how to add the Status Dimension as an ODI lookup.

    1. Locate and select the Status Dimension in the Lookup Table area.
    2. Select Next.

     Fact Interface - Adding A Lookup - Step 2

    Figure 4: Selecting the Status Dimension as Lookup

     

    Proceed to create the Lookup condition as shown in Figure 5:

    1. Select ORDER_STATUS from the source table (the driving table).
    2. Select STATUS_CD from the Lookup table.
    3. Select Join to create the lookup condition.
    4. Set Staging as your “Execute On” option.
    5. Select “SQL left-outer join in the from clause” as your Lookup Type.
    6. Select Finish.

     Fact Interface - Adding A Lookup - Step 3

    Figure 5: Lookup Condition
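
    With the “SQL left-outer join in the from clause” lookup type, ODI folds the lookup directly into the FROM clause of the SQL it generates. Conceptually, the generated join has the following shape (a sketch only; the aliases and exact syntax produced by ODI will differ):

        FROM MY_STG_AREA.ORDERS ORDERS
        LEFT OUTER JOIN MY_WAREHOUSE.W_STATUS_D W_STATUS_D
          ON (ORDERS.ORDER_STATUS = W_STATUS_D.STATUS_CD)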

     

    Figure 6 shows the Status Dimension Lookup in the source area of the Orders Fact interface.

    Fact Interface - Adding A Lookup - Step 4

    Figure 6: Status Dimension Lookup

     

    Repeat the same steps for the Product Dimension.

    When creating a Lookup condition for our Customer dimension, an additional filter must be configured.  Customer is a slowly changing dimension (Type 2), which means we must add an additional condition to filter by the active record of the customer.  In our Customer dimension, the active record of a customer is the row where CURRENT_FLAG is equal to 1.  Figure 7 shows how to add this filter:

    1. In the Lookup Wizard screen, manually type the following condition in the Lookup Condition box:

    AND W_CUSTOMER_D.CURRENT_FLAG = 1

    2. The entire lookup condition must read as follows:

    ORDERS.CUST_ID=W_CUSTOMER_D.CUST_ID AND W_CUSTOMER_D.CURRENT_FLAG = 1

    Customer Lookup - Current Flag Filter

    Figure 7: Adding the Current Flag Filter

     

    Figure 8 shows how the lookup tables and the driving table have been modeled in the Orders Fact interface (Warehouse.W_ORDERS_F):

    Lookups & Driving Table

    Figure 8: Lookup and Driving table for interface “Warehouse.W_ORDERS_F”

     

    Step #3: Creating ODI user functions to manage early arriving facts

    Now that we have modeled our dimensions as ODI lookups, and implemented database functions to correctly handle late arriving dimensions, the next step is to map the dimension foreign keys of our early arriving fact table.  Also, we need to find an efficient way to invoke the late arriving dimension functions when we encounter early arriving fact records.

    In this section, I am going to illustrate how you can build and use ODI user functions to correctly map your fact keys and invoke the late arriving dimension functions when necessary.  We can then use these ODI user functions in other interfaces that need to handle other early arriving facts.

    The first step is to construct a database expression that correctly handles the early arriving fact record.  I like to use a SQL CASE statement for this.  As an example, let’s construct a SQL CASE statement that invokes the customer late arriving dimension function when a fact record arrives early and the customer does not exist in the customer dimension:

    case

         when W_CUSTOMER_D.ROW_WID is null then LATE_DIMS.D_CUSTOMER_DIM(ORDERS.CUST_ID)

         else W_CUSTOMER_D.ROW_WID

    end

    ROW_WID (the customer surrogate key) is NULL when the customer does not exist in the customer dimension: a case of an early arriving fact record.  In this case, the late arriving customer function is invoked to insert the missing customer.  The function then returns the new customer surrogate key.  If the customer already exists in the customer dimension, then the existing customer surrogate key is used.

    Now, let’s proceed to implement the above SQL CASE statement as an ODI user function:

    1. Select User Functions from the ODI project, right-click, and select “New User Function”.
    2. Enter the function Name, and enter a new Group called “Late Arriving Dimensions” (you can call it anything).
    3. In the Syntax box, enter the name of the function, and the parameters that the function will accept. In our example, we are using 2 parameters: SurrogateKey, and NaturalKey. Follow the syntax as shown on Figure 9.

    LATE_CUSTOMER($(SurrogateKey),$(NaturalKey))

    User Function - Customer

    Figure 9: Creating a new ODI user function

     

    Let’s now implement the actual function as illustrated in Figure 10:

    1. Select the “Implementations” tab
    2. Select the “+” sign to create a new implementation
    3. Enter the SQL CASE statement as follows:

    CASE

         WHEN $(SurrogateKey) IS NULL THEN LATE_DIMS.D_CUSTOMER_DIM($(NaturalKey))

         ELSE $(SurrogateKey)

    END

    4. Select “Oracle” as the “Linked Technology”, or select any other technology where SQL CASE statements are supported.
    5. Select OK, and save your new ODI user function.

    User Function - Customer - Implementation

    Figure 10: Implementation Syntax of function “LATE_CUSTOMER”

     

    Repeat the previous steps again to create the ODI user function that will be used to map the Product Foreign Key of the Early Arriving Fact table.  Figure 11 illustrates the implementation of this function:

    User Function - Product - Implementation

    Figure 11: Implementation of function “LATE_PRODUCT”

     

    Step #4:  Mapping Warehouse Keys of the Fact Table with ODI user functions

    Now, we can use our new ODI user functions to map both the customer foreign key, and the product foreign key of our early arriving fact table.

    Figure 12 shows the mapping section of our early arriving fact interface (Warehouse.W_ORDERS_F).  Let’s proceed to map these columns:

    1. Select “CUST_WID” column mapping.
    2. Select the Expression Editor. We can directly type the expression in the Mapping Property box, but it is easier to construct the expression with the ODI Expression Editor.

    Mapping For Customer

    Figure 12: Mapping the early arriving fact table

     

    Figure 13 shows how to construct your expression with the Expression Editor window:

    1. Locate the Project User Functions created in our previous section.
    2. Double-click on the user function called “LATE_CUSTOMER”. The Expression editor box should be filled with the actual function syntax. Replace the parameter names of the function with the actual column names as follows:

    SYNTAX: LATE_CUSTOMER($(SurrogateKey),$(NaturalKey))

    EXPRESSION: LATE_CUSTOMER(W_CUSTOMER_D.ROW_WID, ORDERS.CUST_ID)

    3. Select Apply.

     Mapping For Customer - Expression

    Figure 13: Mapping the Customer Surrogate Key

     

    Repeat the steps to map PRODUCT_WID.  Figure 14 shows the mapping of PRODUCT_WID:

    Calling Product Function Within Mapping

    Figure 14: Mapping the Product Surrogate Key

     

    The last fact key we need to map is the STATUS_WID.  Since the Status dimension is not a late arriving dimension, and we don’t expect to see any unknown statuses from early arriving facts, we map W_ORDERS_F.STATUS_WID with W_STATUS_D.ROW_WID.  Figure 15 shows this mapping:

    Setting Status Surrogate Key

    Figure 15: Mapping the Status Surrogate Key

     

    Finally, let’s map the rest of our early arriving fact columns as follows:

    1. REVENUE_AMT is our measure, and it should be mapped to: ROUND(ORDERS.PRICE * ORDERS.QUANTITY,2)

    2. CREATED_DT, and UPDATED_DT should be mapped to current date and time: SYSDATE.

    Figure 16 shows the full mapping of our early arriving fact interface:

    Full Mapping

    Figure 16: Full Mapping of the Early Arriving Fact Interface

     

    Our mappings are now complete.  Let’s run the Early Arriving Fact interface and verify how ODI evaluates the ODI user functions.  Figure 17 shows how the “Insert flow into I$ table” step (from the IKM Oracle Incremental Update) looks when running the Orders Fact interface.  You can see how ODI substitutes the user functions (LATE_CUSTOMER and LATE_PRODUCT) with the actual code: the CASE statements:

    Insert Flow Step Log

    Figure 17: ODI user function substitution, IKM step “Insert of the I$ table”

     

    Step #5: Modifying “IKM Oracle Incremental Update”

    This POC requires the modification of the IKM Oracle Incremental Update.  Here is why we need to modify the KM:

      • In our Orders Fact interface, columns CUST_WID and PRODUCT_WID have been mapped to the ODI user functions LATE_CUSTOMER and LATE_PRODUCT respectively.  These ODI user functions invoke the PL/SQL package called “LATE_DIMS”.  This package executes an INSERT DML operation when a warehouse key is not found in the Customer or Product dimension.
      • In the Oracle database, a function invoked from a query (a SELECT statement) is not allowed to perform DML (INSERT, UPDATE, or DELETE) operations.  The “sub-select inline view” step of the IKM Oracle Incremental Update will fail because of this restriction.  This step builds a sub-select statement for another interface that wants to use the current interface as a source.  In our Orders Fact interface, we don’t need this step (see the short illustration after this list).
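
    To make the restriction concrete: the user function calls are legal inside the serial INSERT ... SELECT that loads the I$ table, because the function only modifies tables other than the one being inserted into, but calling the same function from a standalone query fails. For example (a hedged illustration; the natural key value is arbitrary):

        SELECT LATE_DIMS.D_CUSTOMER_DIM('1234') FROM DUAL;
        -- ORA-14551: cannot perform a DML operation inside a query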

    To properly manage this restriction, we are going to create a new KM option in the IKM Oracle Incremental Update that will allow us to execute the “sub-select inline view” step only if the option is set to “True”.  By default, the option will be set to “True”, but for the Orders Fact interface, the option will be set to “false”.

    Create a new option in the KM called “INLINE_VIEW” of type “Check Box” with a default value of “True” as illustrated on Figure 18.
    Adding the Inline_View Option in the IKM

    Figure 18: INLINE_VIEW option for IKM Oracle Incremental Update

     

    Save your new KM Option.  Select the “Details” tab of the KM, and open the “Sub-Select Inline view” step as shown in Figure 19.

    Modifying the KM

    Figure 19: Modifying Step “sub-select inline view”

     

    At the bottom of the step, expand “Options” as illustrated in Figure 20.  Unselect the “Always Execute” option and check only the “INLINE_VIEW” option.  This means that the step will only be executed if the INLINE_VIEW option is set to “True”, which is its default value.  The idea is that the default behavior of the KM does not change, but for our Orders Fact interface we are going to set this option to “false”.

    Modifying the KM Inline_View Step - Options

    Figure 20: Configuring the Execution of the INLINE_VIEW option

     

    Finally, notice that I renamed the KM by appending “(Inline View Option)” to its name.  Save all your changes.

    Close and reopen the Orders Fact interface (Warehouse.W_ORDERS_F).  Go to the Flow tab of the interface, select the new INLINE_VIEW option, and set it to “false” as shown in Figure 21:

    Changing the KM Inline_View Option to False

    Figure 21: Setting to “false” the INLINE_VIEW option

    Save your changes, and proceed to test your interface.

    Testing Your Orders Fact Interface

    In order to test your early arriving fact, you must download and install the “ODI Repository Sample for Early Arriving Facts” file.  Here are the steps to test:

    1. There are 3 records in the MY_STG_AREA.ORDERS staging table.

    2. Order “12-234-2344” and “12-230-1111” are early arriving facts:

    - Customer #1234 does not exist in the customer dimension

    - Product #A201DE does not exist in the product dimension.

    3. Figure 22 illustrates the content of the MY_STG_AREA.ORDERS staging table:

    Early Arriving Fact Records

    Figure 22: Content of Orders Staging Table

     

    4. Run the “Warehouse.W_ORDERS_F” interface, and verify the following:

    a. The MY_WAREHOUSE.W_ORDERS_F table should contain 3 new records:

    Fact with 3 new records

    Figure 23: Orders Fact records

     

    b. The MY_WAREHOUSE.W_CUSTOMER_D table should have a new record:  the late arriving customer (CUST_ID #1234):

    late arriving customer

    Figure 24: Late Arriving Customer record

     

    c. The MY_WAREHOUSE.W_PRODUCT_D table should have a new record: the late arriving product (product #A201DE):

    late arriving product

    Figure 25: Late Arriving Product record

     

    5. Table MY_STG_AREA.CUSTOMER contains the late arriving customer #1234 with its full attributes.  Also, table MY_STG_AREA.PRODUCT contains the late arriving product #A201DE with its full attributes.

    6. Run both “Warehouse.W_CUSTOMER_D (Type 2)” and “Warehouse.W_PRODUCT_D (Type 1)” interfaces, and verify the following:

    a. The MY_WAREHOUSE.W_CUSTOMER_D should have a new record for customer #1234 with its full attributes (ROW_WID=134).  The old record (ROW_WID=132) has been historized.

    new customer record

    Figure 26: New customer record

     

    b. The MY_WAREHOUSE.W_PRODUCT_D has been updated.  Product #A201DE has been updated with the correct product line, and product description.

    updated product record

    Figure 27: Product record update

     

    7. Finally, run the “Warehouse.W_ORDERS_F” interface again, and verify that for customer #1234, the customer warehouse ID (CUST_WID) in the fact table has been updated with the latest warehouse ID.  In my example the latest customer warehouse ID is 134:

    new fact surrogate key

    Figure 28: Fact table with updated customer warehouse IDs

     

    Conclusion

    An early arriving fact is a common predicament in a data warehouse.  In this article we learned how to address this issue using ODI.  Our solution included a combination of features in both ODI and the database engine.

    For more ODI best practices, tips, tricks, and guidance that the A-Team members gain from real-world experiences working with customers and partners, visit Oracle A-Team Chronicles for ODI.

     

    Implementing Early Arriving Facts in ODI, Part I: Proof of Concept Overview


    Introduction

    This article illustrates how to implement early arriving facts in Oracle Data Integrator (ODI).  This article has two parts.  Part I gives you an overview of a proof-of-concept created in ODI to implement an early arriving fact.  Part II includes step-by-step instructions on how to implement an early arriving fact using Oracle database functions, ODI lookups, ODI user functions, and ODI custom knowledge modules.  Also, Part II focuses on creating a reusable solution in ODI to effectively manage any type of early arriving fact or late arriving dimension.

    If you wish to skip Part I of this article and would like to read about Part II, go to “Implementing Early Arriving Facts in ODI, Part II: Implementation Steps“.

    Implementing Early Arriving Facts in ODI, Part I: Proof of Concept Overview

     

    What is an Early Arriving Fact?

    Ralph Kimball, the leading visionary in the data warehouse industry, defines an early arriving fact as follows:

    “An early arriving fact takes place when the activity measurement arrives at the data warehouse without its full context. In other words, the statuses of the dimensions attached to the activity measurement are ambiguous or unknown for some period of time.”

    In Extract-Transform-Load (ETL) terminology, an early arriving fact occurs when the ETL process performs a lookup of a surrogate key in a dimension table using the natural key of a fact record, and no value is returned because the dimension record doesn’t exist yet.

    An early arriving fact is also known as a late arriving dimension because the dimensional member will arrive after the activity measurement.

    In this article, both terms are being used: “early arriving fact” and “late arriving dimension.”  But, they both refer to the same data warehouse event.

    For example, the term “early arriving fact” will be used in this article in sections where the topic of discussion is a warehouse fact table.  However, the term “late arriving dimension” will be used instead to describe activities that relate to a warehouse dimension table.

    Proof of Concept Overview

    The best way to learn how to implement an early arriving fact in ODI is to develop a proof-of-concept, also known as a POC.

    The POC for this article includes two main components:  ODI repository and two database schemas.  The database schemas include sample data, which can be used for testing purposes.  If you would like to download a copy of this POC, please go to “ODI Repository Sample for Early Arriving Facts”.

    The following sections describe all the components of this POC.

     

    The Data Warehouse

    The data warehouse environment for this POC is divided into two main areas:  the data mart and the staging area.  The data mart is a star schema with 3 dimensions and 1 fact table.  The staging area is the source data area for dimensions and fact tables.

     

    Warehouse ETL Process Flow

    Figure 1: Warehouse ETL Process Flow

     

    The Data Mart

    The data mart for this POC includes three dimensions:  customer, product, and status.  There is only one fact:  orders.

    The customer dimension, defined in the sample database schema as W_CUSTOMER_D, is a Type-2 dimension.  Type 2 dimensions are also known as slowly changing dimensions: they store and manage both current and historical data over time.  This dimension is also a late arriving dimension:  a delay in adding a new customer or updating an attribute of an existing customer may occur in this dimension.  Also, it is possible that facts containing new customers may arrive in the data warehouse before the new customers are added into this dimension.

    The product dimension, defined in the sample database schema as W_PRODUCT_D, is a Type-1 dimension.  No history is being maintained about products; new products get added and existing data gets updated with changes from the source system.  This dimension is a late arriving dimension, too.  When new products are added in the source system, facts containing the new products may arrive in the data warehouse before adding the new products into this dimension.

    The status dimension, defined in the sample database schema as W_STATUS_D, is a Type-1 dimension.  This dimension is always current with data from the source system.  No delays are expected between source system changes and the content of this dimension, and no early arriving facts are expected in regards to this dimension.

    The orders fact contains measured activity about customer orders.   This fact is considered an early arriving fact; its activity measurement may include customers and products that have not yet been added into the customer and product dimensions respectively.
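
    To make these structures concrete, here is a minimal DDL sketch of the customer dimension and the orders fact. The column names follow the POC, but the datatypes and lengths are illustrative, and descriptive attributes, effective dates, and constraints are omitted:

        CREATE TABLE W_CUSTOMER_D (
          ROW_WID      NUMBER,        -- surrogate (warehouse) key
          CUST_ID      VARCHAR2(30),  -- natural key from the source
          CURRENT_FLAG NUMBER(1)      -- 1 = active row (Type 2 dimension)
        );

        CREATE TABLE W_ORDERS_F (
          ORDER_NUM    VARCHAR2(30),  -- order identifier from the ORDERS staging table
          CUST_WID     NUMBER,        -- references W_CUSTOMER_D.ROW_WID
          PRODUCT_WID  NUMBER,        -- references W_PRODUCT_D.ROW_WID
          STATUS_WID   NUMBER,        -- references W_STATUS_D.ROW_WID
          REVENUE_AMT  NUMBER,        -- measure: ROUND(PRICE * QUANTITY, 2)
          CREATED_DT   DATE,
          UPDATED_DT   DATE
        );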

     

    The Staging Area

    The staging area includes 4 staging tables:  customer, product, status, and orders.

    Customer is the source table for the Customer Dimension.  Product is the source table for the Product Dimension, and status is the source table for the Status Dimension.  The Orders table is the source for the Orders Fact.  The Orders table has the early arriving facts.

    The ODI Repository

    The ODI Repository consists of the following metadata:

    Dimension Interfaces

    • Warehouse.W_CUSTOMER_D (Type 2)
    • Warehouse.W_PRODUCT_D (Type 1)
    • Warehouse.W_STATUS_D (Type 1)

    Fact Interface

    • Warehouse.W_ORDERS_F

    Part II of this article will discuss in greater detail how to build the Warehouse.W_ORDERS_F interface.  Figure 2 shows the interfaces used for this ODI project:

     

    Project Overview

    Figure 2: ODI Interfaces

     

    ODI Variables

    This ODI project includes a series of ODI variables to manage default values such as dimension flags, dimension default values, and default date formats.  Most of the variables are refreshed by selecting the corresponding default value from a table called W_DEFAULT_VALUES.

     

    ODI User Functions

    There are 2 ODI user functions for this project:  LATE_CUSTOMER and LATE_PRODUCT.  These user functions will be used in the mapping of the fact interface.   Part II of this article dedicates a section to showing how to create these functions and build expressions with them in the fact interface.

     

    ODI Knowledge Modules

    Two ODI Integration Knowledge Modules (IKM) are used for this project:

    • IKM Slowly Changing Dimension – This KM will be used only by the Customer Dimension.
    • IKM Oracle Incremental Update (Inline View Option) – This IKM will be used by all other interfaces.  It is a modified version of the “IKM Oracle Incremental Update” that comes with ODI.  It has been modified to make the execution of the “sub-select inline view” step optional.  Part II of this article dedicates a section to why this step should be modified and provides a step-by-step illustration of how to modify it.

     

    Figure 3 shows the ODI variables, user functions, and knowledge modules used for this project.

     

    Variables & KMs & User Functions

    Figure 3: ODI variables, functions, and knowledge modules

     

    Models

    There are three ODI models:  Dimensions, Facts, and Staging.  Dimensions and Facts are both using the same logical schema called “Warehouse”.   Model Staging uses a logical schema called “Staging”.

     

    Figure 4 shows the organization of the ODI Models.

     

    My ODI Warehouse Model

    Figure 4: ODI Models

     

    ODI Topology

     

    The ODI topology contains one physical architecture: Oracle.  There is only one data server:  Warehouse.  The Warehouse and Staging areas are both in the same physical database, so only one ODI data server is required.

    Topology Configuration

    Figure 5: ODI Topology Manager

     

    Conclusion

    An early arriving fact is a common predicament in a data warehouse.  In this article, we covered the main components of a proof-of-concept (POC) created in ODI to address this issue.  I strongly recommend downloading a copy of this POC, which can be found at this location:  “ODI Repository Sample for Early Arriving Facts”.  Part II of this article demonstrates how to implement some of the components of this POC.  It is a great way to understand how to use ODI to address this issue and create a reusable solution for any type of early arriving fact and late arriving dimension.  If you would like to read more about this subject, please go to “Implementing Early Arriving Facts in ODI, Part II: Implementation Steps“.

     

    For more ODI best practices, tips, tricks, and guidance that the A-Team members gain from real-world experiences working with customers and partners, visit Oracle A-Team Chronicles for ODI.

    Exceptions Handling and Notifications in ODI


    Introduction

    ODI processes are typically at the mercy of the entire IT infrastructure’s health. If source or target systems become unavailable, or if network incidents occur, ODI processes will be impacted. When this is the case, it is important to make sure that ODI can notify the proper individuals. This post will review the techniques that are available in ODI to guarantee that such notifications occur reliably.

    Exceptions Handling and Notifications in ODI

     

    When you create an ODI package, one of the ODI Tools available to you for notifications is OdiSendMail:

     

    OdiSendMail

    Figure 1: OdiSendMail

     

    Specify the SMTP server, the to and from parameters, and the subject and message for your email, and you can alert operators, administrators, or developers when errors occur in your packages.
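
    As a hedged illustration, a call to this tool has roughly the following shape (the host and addresses are placeholders, the message body is the free text that follows the parameters, and the exact parameter list should be checked against the OdiSendMail entry in the ODI Tools reference):

        OdiSendMail -MAILHOST=smtp.example.com -FROM=odi@example.com
                    -TO=operators@example.com "-SUBJECT=ODI package failure"
        The package ended in error. Please check ODI Operator for details.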

    One challenge though is that if you want an email to be sent for any possible error in your package, each and every step of the package must be linked to this tool in case of error. This can quickly become overwhelming, and there is no guarantee that a developer will not forget a step along the way.

    Error Processing in Packages

    Figure 2: Sending an email for every step that ends up in error

    Another caveat with sending a notification email from the package itself is that the final status of the package will be successful as long as sending the email is successful. If you want the package to fail, you now have to raise an exception after sending the email to force the package to end in an error state.

    If we look back at the original objective, all we really want is to send a notification no matter what error occurs. So ideally we want the package to actually fail. Only after it has failed should we send the notification email. This is exactly what we can perform with the Load Plan Exceptions.

     

    Load Plans and Exceptions

    If we handle the exceptions at the load plan level, there is no need to send the notification emails from the original package. What we must do now is create two separate packages: one to process the data, and another one to handle the exceptions.

     

    Package without notifications

    Figure 3: package without notification emails

     

    Notification Package

    Figure 4: package dedicated to notification emails

     

    There are two main aspects to a load plan:

    • Steps: a set of scenarios that must be orchestrated, whether executions are serialized or parallelized
    • Exceptions: a series of scenarios that can be used in case one of the steps fails to execute properly.

     

    LP Steps

    Figure 5: Load plans steps

    For instance in Figure 6 below, we have created an exception called Email Notification. This exception will run a dedicated scenario whose sole purpose is to send a notification email. The scenario here is generated from the package represented earlier in figure 4.

     

    Exception Scenario

    Figure 6: Load Plans Exception

    When you edit the properties of the steps of the load plans, you can choose whether they will trigger the execution of an exception scenario or not. For each step, including the root step, you can define what the behavior will be in case of error:

    • Choose the exception scenario to run (or leave blank)
    • Run the exception scenario (if one is selected) and stop the execution of the load plan at this point
    • Run the exception scenario (if one is selected) and proceed with the rest of the load plan.

     

    Select Exceptions for Steps

    Figure 7: selection of an exception scenario

    If you select an exception at the root level, this will be the default exception for the entire load plan unless you decide to override this default with a different selection in the properties of individual branches or steps of the load plan.

    This will guarantee that no matter what fails in the scenarios used in the load plan, the notification email is always sent. This includes corner cases where the scenarios would not even start, for instance in case of errors in Topology definitions such as missing physical schema definitions.

    In figure 8 below we can see steps of the load plan where the scenario that was executed fails, raising an exception:

     

    Failed Execution with exception handling notification

    Figure 8: scenario step raising an exception

    We can also look at the scenario sessions to see the original scenario failure and the execution of the notification scenario:

     

    Sessions view of executions

    Figure 9: scenario sessions for failed scenario and notification scenario.

     

    Conclusion

    Leveraging exception handling in Load Plans is a very straightforward and efficient way to ensure that no matter what fails in your ODI processes, notifications are sent reliably and consistently.

    For more ODI best practices, tips, tricks, and guidance that the A-Team members gain from real-world experiences working with customers and partners, visit Oracle A-Team Chronicles for ODI.

    Using ODI with a Development Topology that Doesn’t Match Production Topology


    Introduction

    The cost of reproducing a production environment for the purpose of developing and testing data integration solutions can be prohibitive. A common approach is to downscale the development environment by grouping databases and applications. The purpose of this post is to explain how to define the ODI Topology when the physical implementation differs between the development and production environments.

    Using ODI with a Development Topology that Doesn’t Match Production Topology

    Let’s consider a setup where the production environment uses three distinct database installations to host data that needs to be aggregated. By reducing the volume of data, it is possible to group all three databases in one single installation that developers can leverage as they write their data integration code. This simplifies the infrastructure maintenance and greatly reduces the overall cost. We have represented such a setup in figure 1 below.

    development doesn't match production

    Figure 1: Simplified architecture to reduce the cost of development

    The challenge with such a setup will be to make sure that the code designed in the development environment will run properly in the production environment. As a point of reference, an ODI Topology that would match the above infrastructure is represented in figure 2.

    ODI view of different development / Production

    Figure 2: Different Topology layout for development and production environments

     

    ODI Code Generation

     

    The SQL code that ODI generates is optimized for the infrastructure where the code runs. If we look at the Development environment represented above, best practices recommend that we create a dedicated login for the database, use that single login to access data from the two source schemas, and use the same login to write to the target schema.

    From an ODI Topology perspective, this means that there is one single data server under which we define 3 physical schemas.

    Based on the above Topology declaration, ODI optimizes the data movement when you create your mappings. It makes sure that data flows as quickly as possible from the source schemas to the target schemas. From that perspective, ODI does not use any LKM for the source schemas: all data is already in the database, there is no need to stage the data in a C$ table (or C$ view or C$ external table).

    Sources and target in the same database

    Figure 3: Sources and targets schemas in the same database

     

    Conversely, if the data happens to be physically located on separate servers, then ODI automatically introduces LKMs to bring the data into the target server.

     

    Source and Target in Separate Databases

    Figure 4: Separate databases for two sources and target schema
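
    As a rough sketch of the difference (the schema, table, and work-table names below are hypothetical, and real ODI-generated code contains additional KM-specific steps), the two layouts lead to code of the following shape:

        -- Sources and target in the same database (figure 3): a single
        -- INSERT ... SELECT, no LKM and no C$ staging tables.
        INSERT INTO TGT.SALES_SUMMARY (CUSTOMER_ID, TOTAL_AMOUNT)
        SELECT O.CUSTOMER_ID, SUM(O.AMOUNT)
        FROM   SRC1.ORDERS O
        JOIN   SRC2.CUSTOMERS C ON (C.CUSTOMER_ID = O.CUSTOMER_ID)
        GROUP BY O.CUSTOMER_ID;

        -- Sources on separate servers (figure 4): an LKM first stages each
        -- remote source into a C$ work table on the target server (how the
        -- data gets there depends on the LKM chosen), then the IKM joins
        -- the staged data locally.
        INSERT INTO TGT.SALES_SUMMARY (CUSTOMER_ID, TOTAL_AMOUNT)
        SELECT O.CUSTOMER_ID, SUM(O.AMOUNT)
        FROM   TGT.C$_0ORDERS O
        JOIN   TGT.C$_1CUSTOMERS C ON (C.CUSTOMER_ID = O.CUSTOMER_ID)
        GROUP BY O.CUSTOMER_ID;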

    The challenge with topology discrepancies

    If the development environment matches the architecture shown on figure 3, and the production environment matches the architecture represented on figure 4, the scenarios generated in the development environment cannot run in the production environment as is. But the last thing we want to do is to redesign the developed code as we promote it from environment to environment.

    With the challenge now clearly stated, let’s see what ODI provides to solve this conundrum.

    Option 1: Optimization contexts

    When you are building mappings, you can ask ODI to generate code based on a specific execution environment, as long as all environments are defined in the same Master repository. In this case, ODI Contexts will represent the different environments.

    If we look at the use case provided here as an example, we can ask ODI to generate the code explicitly for the production environment even though the development environment is much simpler. In other words, we can force the use of LKMs to better represent the reality of production even if they are not needed to process the data in the development environment.

    If you click on the Physical tab of your mappings in ODI 12c, you can see an option called Optimization Context (in earlier versions of ODI, this same option is available in the Overview tab of Interfaces). If there is a discrepancy between the different environments, setting this option properly guarantees that the code always matches the layout of the environment of your choice.

     

    Optimization Context Selection

    Figure 5: Optimization context selection

    One challenge with this approach though is that you will have to remember to select the proper optimization context for every single mapping that is created, unless of course you use the production context as your default context for Designer in the User Preferences: select the Tools menu, then Preferences… Under ODI/System you can set the Default Context for Designer.

    Set Default Context

    Figure 6: ODI preferences to set the default context for Designer

     If you are using an older release of ODI, you will find this parameter under the menu ODI/User Parameter.

     

    Option 2: Design the development Topology to match the production Topology in ODI

     

    Another approach would be to ignore the specifics of the simplified Development environment and design everything so that it matches the production environment.

    The ODI best practice of declaring all schemas in the same database under a single data server in Topology has only one purpose: making sure that ODI generates the most optimized code for the environment. But if the production environment does not have all schemas in the same database, then we can create different data servers in the development environment so that ODI believes that there are different databases.

    In a more extreme example, we have combined all our source and target tables in the same database schema for our development environment. This does not prevent us from creating 3 separate data server and schema definitions in Topology. Then we can have 3 separate models, each containing only the relevant source or target tables, in order to match the production environment:

    Development Simulates Production

    Figure 7: Comparing real data organization vs. simulated data organization

    If we use a Topology organization that always matches the production environment, then we never have to worry about setting the optimization contexts in any of the mappings.

    Conclusion

    It is possible for ODI to always generate code that matches your production environment infrastructure even if it differs from your development environment. Just make sure that you are aware of these discrepancies as you lay out your Topology environment so that you can select the approach that best fits the specifics of your projects.

    For more ODI best practices, tips, tricks, and guidance that the A-Team members gain from real-world experiences working with customers and partners, visit Oracle A-Team Chronicles for ODI.

    Using Oracle Data Pump in Oracle Data Integrator (ODI)


    Introduction

    This article presents a modified version of the ODI load knowledge module called LKM Oracle to Oracle (datapump).   The ODI load knowledge module (LKM) presented in this article has been enhanced with additional options and steps to take advantage of the best features of Oracle Data Pump and Oracle External Tables.  Some of the enhancements include data compression, server-to-server file transport, thread control, and the use of Oracle Optimizer Hints.  This article shows how to configure and use this knowledge module.

    If your source and target data stores are both Oracle, Data Pump is the fastest way to extract and load data between Oracle data stores.

    If you would like to download a copy of this knowledge module, please go to “Oracle Data Integrator Knowledge Modules – Downloads.” Search for “datapump”.  The knowledge module is called “LKM Oracle to Oracle Datapump Plus”.

     

    Using Oracle Data Pump in Oracle Data Integrator (ODI)

    Oracle Data Pump is a component of the Oracle database technology that allows users to export and import data faster than other traditional Oracle utilities.  In addition to basic export and import functionality, Oracle Data Pump supports the use of Oracle External Tables, and keeps data in a binary format, further improving processing efficiency.

    Oracle External Tables is another feature of the Oracle database technology that allows users to define external files as database tables.  This feature empowers users with the ability to query files using SQL syntax as if the files were tables of a database schema.

    The combination of using Oracle Data Pump with External Tables is a great use-case for extracting, loading, and transforming (EL-T) data with ODI.

    Oracle Data Pump works only between Oracle databases.

    If you would like to learn more about using Oracle Data Pump with Oracle External Tables, please refer to the following documents for additional information:  “Using Oracle External Tables”, “Overview of Oracle Data Pump”, and “The Oracle Data Pump Access Driver.” The scope of this article focuses on how to use and configure an ODI knowledge module that takes advantage of Oracle Data Pump and External Tables to efficiently load data between Oracle data servers.
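
    Conceptually, this combination lets a load knowledge module unload and load data with plain SQL, along the lines of the sketch below. All object, directory, and file names are illustrative, and the actual steps generated by the LKM described in this article differ in their details:

        -- On the source: unload the extracted data into Data Pump files
        -- through an external table (one dump file per worker).
        CREATE TABLE STG.C$_ORDERS_DP
          ORGANIZATION EXTERNAL (
            TYPE ORACLE_DATAPUMP
            DEFAULT DIRECTORY ODI_DP_EXP_DIR
            LOCATION ('orders_01.dmp', 'orders_02.dmp', 'orders_03.dmp'))
          PARALLEL 3
          AS SELECT ORDER_ID, CUSTOMER_ID, AMOUNT FROM STG.ORDERS;

        -- On the target: expose the same dump files as an external table
        -- and load them with a direct-path insert.
        CREATE TABLE WH.C$_ORDERS_DP (
            ORDER_ID    NUMBER,
            CUSTOMER_ID NUMBER,
            AMOUNT      NUMBER)
          ORGANIZATION EXTERNAL (
            TYPE ORACLE_DATAPUMP
            DEFAULT DIRECTORY ODI_DP_IMP_DIR
            LOCATION ('orders_01.dmp', 'orders_02.dmp', 'orders_03.dmp'))
          PARALLEL 3;

        INSERT /*+ APPEND */ INTO WH.ORDERS_COPY
        SELECT ORDER_ID, CUSTOMER_ID, AMOUNT FROM WH.C$_ORDERS_DP;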

     

    Overview: LKM Oracle to Oracle Datapump Plus

    The ODI version of the LKM Oracle to Oracle (datapump) provides basic, standard functionality for exporting and importing data between Oracle data stores using Data Pump and External Tables.

    The load knowledge module presented in this article, LKM Oracle to Oracle Datapump Plus, has been customized with additional options and features such as data compression, number of Data Pump workers, degree of parallelism, file-transfer, optimizer hints, and suffix names for all temporary objects.

    Figure 1 shows a list of features that have been added to the LKM Oracle to Oracle Datapump Plus.

     

    Figure-01

    Figure 1: LKM Oracle to Oracle Datapump Features

    The following sections explain in detail how to use and take advantage of these new LKM features.

    Load Knowledge Module (LKM) Options

     

    Figure 2 shows a list of available options for the LKM Oracle to Oracle Datapump Plus.

     

    Figure-02

    Figure 2: LKM Oracle to Oracle Datapump Plus – Options

     

    The default values for the above LKM options have been configured as follows:

    • The Data Pump active workers (parallel threads) for both export and import operations have been set to the same value: 3 workers.
    • The Data Pump directories for both export and import operations have been set to the same location: user home.
    • Data will be compressed before it is written to the Data Pump file set.
    • Data Pump files will not be transferred.
    • Data Pump files will be deleted when the upload operation is complete.
    • Log files will be deleted when the upload operation is complete.
    • The suffix name has no value; this feature is optional.
    • The optimizer hint option has no value; this feature is optional.

    All these options can be configured, changed, and customized by the user.  The following sections describe how to configure and customize these LKM options.  

     

    NUM_OF_ACTIVE_WORKERS_EXPORT

     

    • Oracle Data Pump can export data in parallel.  Each parallel process (the Data Pump worker) can create its own Data Pump file.   Therefore, multiple Data Pump files can be created in parallel.
    • This LKM has been enhanced to allow the user to specify the desired number of Data Pump workers to be used during the export operation.
    • A data pump file will be created for each Data Pump worker, so that each worker has an exclusive lock on each pump file, and data can be written in parallel.
    • Figure 3 shows the default value for this option:  3 Data Pump workers.

      Figure 3:  Knowledge Module Option – Number of Data Pump workers on Export

      Figure 3: Knowledge Module Option – Number of Data Pump workers on Export

    Note:  Increasing the number of Data Pump workers does not necessarily mean that a Data Pump process will export data faster.  It is important to test and find the optimum number of Data Pump workers based on database resources, and other Data Pump jobs running at the same time.

     

     

    NUM_OF_ACTIVE_WORKERS_IMPORT

     

    • Oracle Data Pump can also upload data in parallel.  Oracle recommends having a number of threads (Data Pump workers) that is greater than or equal to the number of Data Pump files, so that multiple threads can read from the files and write to the target table at the same time.
    • On import, Data Pump workers can read multiple Data Pump files or even chunks of the same Data Pump file concurrently. Therefore, data can be uploaded into the target table in parallel even if there is only one Data Pump file.
    • Figure 4 shows the default value for this option:  3 Data Pump workers.

      Figure 4:  Knowledge Module Option – Number of Data Pump workers on Import

      Figure 4: Knowledge Module Option – Number of Data Pump workers on Import

    Note:  Increasing the number of Data Pump workers does not necessarily mean that a Data Pump process will upload data faster.  It is important to test and find the optimum number of Data Pump workers based on database resources, and other Data Pump jobs running at the same time.

     

     

    DATA_PUMP_EXPORT_DIR

     

    • This option allows the user to specify the physical path or location where Data Pump files will be exported.  This can be a hard-coded value or the name of an ODI variable that contains the actual path or location.
    • The value specified in this option should be a physical path in the source data server, or a shared network drive where both the source and target data servers can access the datapump files.
    • Figure 5 shows an example on how to define an ODI variable with this option.  The variable contains the actual physical path or location to be used when exporting Data Pump files.
    • By default, this option is set to the home directory of the user running the ODI agent.

      Figure 5:  Knowledge Module Option – Data Pump Export Directory

      Figure 5: Knowledge Module Option – Data Pump Export Directory
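
    Behind the scenes, Data Pump and the ORACLE_DATAPUMP external tables read and write these locations through Oracle directory objects. As a hedged sketch (the object name, path, and grantee are illustrative), directory objects equivalent to the following must exist, or be created at runtime, for the paths specified in these options:

        CREATE OR REPLACE DIRECTORY ODI_DP_EXP_DIR AS '/home/odi/datapump';
        GRANT READ, WRITE ON DIRECTORY ODI_DP_EXP_DIR TO odi_staging;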

    DATA_PUMP_IMPORT_DIR

     

    • There are cases when the export and import directories cannot be the same physical location, and an additional directory must be defined in the target data store.
    • This option allows the user to specify the physical path or location to be used for Data Pump import operations.  This can be a hard-coded value or the name of an ODI variable that contains the actual path or location.
    • Figure 6 shows an example of using an ODI variable with this option.  The variable contains the actual physical path or location to be used when importing Data Pump files.
    • The value specified in this option should be a physical path in the target data server, or a shared network drive to access the source files from the target data server.
    • By default, this option is set to the home directory of the user running the ODI agent.

      Figure 6:  Knowledge Module Option – Data Pump Import Directory

      Figure 6: Knowledge Module Option – Data Pump Import Directory

     

    COMPRESS_DATA

     

    • This option allows the user to compress data before it is written to the Data Pump file set.  On export, Data Pump compression is an inline operation, allowing smaller Data Pump files to be created on disk.  This translates to a significant savings in disk space.
    • Data Pump compression is fully inline on the import side as well.  There is no need to uncompress Data Pump files before importing them into the target data store.
    • This feature is only available in versions of Oracle 11g and higher.
    • Figure 7 shows the compress data option.
    • By default, this option is set to false.

      Figure 7: Knowledge Module Option – Data Compression

     

    TRANSFER_FILES

     

    • This option allows the user to copy the Data Pump files from the source data server to the target data server.
    • If a shared network drive is used for both export and import operations, then this option is not required.
    • However, this option should be used in environments where the target data server has no access to the source Data Pump files, and the files must be transferred to a location in the target server for the import operation.
    • Also, this option is useful when it is too slow to import the Data Pump files from a remote location such as a shared network drive, and a local directory in the target data server provides better performance.
    • The option uses the Oracle database file transfer package called dbms_file_transfer.  For additional information on this Oracle package, please refer to “Oracle DBMS File Transfer Package.”
    • A database link must be configured in the target data server in order to transfer files between data servers.  See section Configuring your Database Link for File-Transfer Operations for details on how to configure your database link.
    • Figure 8 shows the file transfer option.  By default, this option is set to false.

      Figure 8: Knowledge Module Option – Transfer Files

     

    DELETE_DATA_FILES

     

    • This option allows the user to delete or keep the Data Pump files created by the knowledge module during the export and import operations.
    • After the Data Pump files have been exported and imported successfully, the knowledge module deletes the Data Pump files created in the source data server.  If the Data Pump files have been transferred to a target data server, the knowledge module deletes the target Data Pump files as well.
    • Figure 9 shows this option.  By default, this option is set to true.

      Figure 9: Knowledge Module Option – Delete Data Files

     

    DELETE_LOG_FILES

     

    • This option allows the user to delete or keep the log files created by the knowledge module during the export and import operations.
    • After the Data Pump files have been exported and imported successfully, the knowledge module deletes the log files created for both operations.
    • Figure 10 shows this option.  By default, this option is set to true.

      Figure 10: Knowledge Module Option – Delete Log Files

     

    SUFFIX_NAME

     

    • This option allows the user to define a suffix name in all temporary objects created by the knowledge module at runtime.  This can be a hard-coded value or an ODI variable that contains the actual suffix name.
    • This option provides another degree of parallelism: multiple executions of the same ODI mapping can run in parallel when unique suffix names are used.  An ODI variable can be used to store a suffix name.  The variable can be refreshed with a unique value before launching another instance of the same mapping.   Each execution of the same ODI mapping will have its own set of temporary object names.
    • Figure 11 shows an example of how to specify an ODI variable with this option.
    • By default, this option has no value.  It is optional, but it can be very useful.

      Figure 11: Knowledge Module Option – Suffix Name

     

    OPTIMIZER_HINT

     

    • Oracle Optimizer Hints offer a mechanism to instruct the Oracle Optimizer to choose a certain query execution plan based on criteria specified by the user.  This feature of the Oracle database is a great way to optimize the execution of your database queries.
    • This KM option allows the user to add an Oracle Optimizer Hint in the SQL statement that defines the external table in the source data server.
    • The value for this option can be a hard-coded Oracle Optimizer Hint or an ODI variable that contains the hint.
    • The syntax for an Oracle Optimizer Hint is as follows:

                                                    /*+ hint [text] [hint[text]]… */

    • Some of the most common Oracle Optimizer Hints such as APPEND, STAR_TRANSFORMATION, and ALL_ROWS are reusable hints: the hints can be defined inside the knowledge module and used by many ODI mappings.
    • Other Oracle Optimizer Hints such as INDEX, FULL, and PARALLEL depend on the tables used by each ODI mapping, and the user may choose to define the Oracle Optimizer Hint in the mapping itself.
    • Figure 12 shows an example of how to define an Oracle Optimizer Hint in an ODI mapping:
    1. Open the ODI mapping, and select the Physical tab.
    2. In the Staging Area of the Target Group box, select the Loading KM property.
    3. Expand the Loading Knowledge Module tree, and ensure that the LKM Oracle to Oracle Datapump Plus knowledge module has been selected.
    4. Select the Optimizer Hint option and enter the Oracle Optimizer Hint.
    5. Save your mapping changes.

      Figure 12: Customizing Knowledge Module Options in Mappings

    • For additional information on how to use Oracle Optimizer Hints, please refer to “Oracle Database Performance Tuning Guide, Using Optimizer Hints.”
    • By default, this option has no value.  It is optional, but it can be really useful in cases where hints are needed to speed up the execution of the SQL statement that defines the source external table.
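    For instance, a PARALLEL hint entered in this option would end up inside the SELECT statement that feeds the source external table, along the lines of the sketch below (the ORDERS table name and the degree of parallelism are hypothetical values chosen only for illustration):

              -- Hypothetical example: force a parallel full scan of the source table
              select /*+ PARALLEL(ORDERS, 8) */ *
              from   ORDERS;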

     

    Configuring your Environment to Work with Oracle Data Pump

    The following sections describe the steps required to enable the LKM Oracle to Oracle Datapump Plus in your environment.  There are 3 areas where configuration is required: the source data server, the target data server, and ODI Studio.

     

    Server Configuration

     

    • Create a physical directory in the source data server to manage your Data Pump export operations.  Example:

                   /usr/oracle/odi/datapump/export

    • If you plan to copy or transfer your datapump files from your source data server into your target data server, create another physical directory in the target data server.  Example:

                   /usr/oracle/odi/datapump/import

    • If you plan to use a shared network drive for both operations, create one single physical directory in the shared data server.  Example:

                   /usr/oracle/odi/datapump/shared

     

    Data Server Privileges

     

    • In the ODI Topology Navigator, identify which user is configured to connect to the source data server.  For instance, Figure 13 shows the ODI physical data server connection for a source database.  The user configured to connect to this source data server is ODI_USER.

      Figure 13: ODI Topology Manager – Physical Source Data Server Connection

    • Using a database tool such as SQL Developer, grant the following system privileges to the user configured to connect to the source data server.

               grant create any directory to ODI_USER;

               grant drop any directory to ODI_USER;

     

    • Follow the same steps for your target data server, and grant the same system privileges to the user configured to connect to the target data server:

               grant create any directory to <target_db_user>;

               grant drop any directory to <target_db_user>;
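    To double-check that the grants are in place, you can run a quick query such as the one below while connected as the granted user (this is just a convenience check, not a step required by the knowledge module):

              -- List the directory-related system privileges visible to the current session
              select privilege
              from   session_privs
              where  privilege in ('CREATE ANY DIRECTORY', 'DROP ANY DIRECTORY');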

     

    Configuring Your Database Link for File-Transfer Operations

    If your environment requires copying or transferring your datapump files from the source data server to the target data server, follow the instructions in this section. 

    If you configured a shared network drive for both export and import operations, you don’t need to set up a database link.

     

    • The LKM Oracle to Oracle Datapump Plus uses the Oracle dbms_file_transfer package to perform the transfer of Data Pump files from the source data server to the target data server.  Grant the “execute” system privilege on this package to the user or schema of the target data server:

              grant execute on dbms_file_transfer to <target_db_user>;

    • A database link is required in order to copy the Data Pump files from the source data server to the target data server.  Grant the “create database link” system privilege to the user or schema of the target data server:

              grant create database link to <target_db_user>;

    • Log in as the user or schema of the target data server, and create the database link:

    Example:  The following database link will connect to a source data server with a user called ODI_USER.  The source data server is called “OLTP_ORDERS”.

     

              create database link "ORDERS_DB" connect to "ODI_USER"
              identified by odi using 'OLTP_ORDERS';
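    Once the link exists, a quick sanity check such as the following (using the example link name above) confirms that it resolves and authenticates correctly:

              -- Should return the current date and time from the remote (source) database
              select sysdate from dual@"ORDERS_DB";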

     

    • Ensure that the connect identifier used by your new database link (OLTP_ORDERS in this example) is also defined in the tnsnames.ora file of your target data server.  Example:

            OLTP_ORDERS =
              (DESCRIPTION =
                (ADDRESS = (PROTOCOL = TCP)(HOST = orders1-svr)(PORT = 1521))
                (CONNECT_DATA =
                  (SERVER = DEDICATED)
                  (SERVICE_NAME = orders.us.example.com)
                )
              )

    • In the ODI Topology Navigator, open your physical source data server configuration, and add the name of the database link in the Instance / dblink (Data Server) text box as shown in Figure 14:

      Figure 14: ODI Topology Manager – Database Link Configuration

     

    How does LKM Oracle to Oracle Datapump Plus work?

     

    Figure 15 shows the main steps of the LKM Oracle to Oracle Datapump Plus.  There are 5 main steps:

     

    1. On the source data server, the knowledge module creates an Oracle directory with the physical location where the Data Pump files will be exported.  Using the knowledge module options, an external table is defined in the source data server.
    2. The export operation starts when the external table definition is executed in the source data server. The Oracle Data Pump utility starts writing the files on disk in this step.  If the compress option is enabled, the data will be compressed before it is written on disk.  One log file will be created on disk to log any errors or data that cannot be exported.  The number of datapump workers specified in the knowledge module option will be used to write each file in parallel.
    Figure 15: LKM Oracle to Oracle Datapump Plus – Main Steps

    3. Once the Data Pump files have been written on disk, if the file-transfer option is enabled, the files will be copied from the source data server to the target data server.  When the file-transfer operation is complete, the user will have a duplicate copy of each file in both places: the source and the target data servers.  If the file-transfer option is disabled, the knowledge module will skip the file-transfer step.  Therefore, it is expected that the target data server has access to the source Data Pump files or a shared network drive has been configured by the user to perform both operations:  the export and import of Data Pump files.
    4. On the target data server, an Oracle directory will be created to specify the location of the Data Pump files to be imported.  Another external table will be created on the target data server.  One log file will be created on disk to log any errors or data that cannot be imported.
    5. The datapump files will be loaded into an integration table by an ODI integration knowledge module (IKM).  The import operation will start when a select-statement is issued against the external table in the target data server.  The integration table can be an ODI I$ table or any other table specified by an ODI integration knowledge module.  Finally, if the user has chosen to delete the temporary objects, all temporary objects created by the knowledge module will be deleted.
    Understanding the Code Generated by the Knowledge Module

    It is recommended to get familiar with the steps that the LKM Oracle to Oracle Datapump Plus executes at runtime.  The ODI Simulation feature is a great tool to review the LKM steps.

     

    • Figure 16 shows two steps or tasks of the LKM Oracle to Oracle Datapump Plus: “Create Oracle directory on SOURCE,” and “Create Oracle directory on TARGET.”
    • In this example, an ODI variable will be used to create the Oracle directory names with a suffix.  Also, ODI variables will be used to specify the physical location of the export and the import directories.
    • At runtime, ODI evaluates these variables and replaces them with their actual values.
    Figure 16: LKM Oracle to Oracle Datapump Plus – Oracle Directory Name Creation

     

    • Figure 17 shows an example of how ODI generates the code for the external table on the source data server.  This example highlights where in the code the KM options are being used.
    • For instance, every temporary object contains a suffix name: MAR2013.  The compress data option is enabled; data will be compressed before it is written to the Data Pump file set.
    • A total of 3 Data Pump files will be created during the export operation.  Three active Data Pump workers will write the Data Pump files in parallel.  An optimizer hint will be used when selecting data from the source table.
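    For readers who cannot view the figure, the generated statement follows the general shape sketched below.  The table, directory, and file names are placeholders chosen for this example; the exact code produced by the knowledge module will differ in its details:

              -- Simplified sketch of the source external table created by the LKM
              create table C$_ORDERS_MAR2013
              organization external (
                type oracle_datapump
                default directory ODI_DP_EXPORT_DIR_MAR2013
                access parameters (compression enabled)
                location ('ORDERS_MAR2013_1.dmp',
                          'ORDERS_MAR2013_2.dmp',
                          'ORDERS_MAR2013_3.dmp')
              )
              parallel 3
              as
              select /*+ PARALLEL(ORDERS, 3) */ * from ORDERS;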

     

    Figure 17: LKM Oracle to Oracle Datapump Plus – Code Sample for External Table on Source

     

    • Figure 18 shows a sample of the code that ODI will generate to transfer one Data Pump file from the source data server to the target data server.
    • This example highlights where in the code the database link name will be used.  This database link name comes from the “Instance / dblink (Data Server)” text box of the source data server that has been defined in the Physical Architecture of the ODI Topology Navigator.
    • Also, it highlights the actual names of the Oracle source and target directories.
    • This block of code will be duplicated for each Data Pump file that needs to be transferred.
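    For reference, a transfer of this kind, initiated from the target side through the database link, looks roughly like the sketch below; the directory objects and the file name are illustrative placeholders, and one such block is generated per file:

              begin
                -- Pull one Data Pump file from the source server through the database link
                dbms_file_transfer.get_file(
                  source_directory_object      => 'ODI_DP_EXPORT_DIR_MAR2013',
                  source_file_name             => 'ORDERS_MAR2013_1.dmp',
                  source_database              => 'ORDERS_DB',
                  destination_directory_object => 'ODI_DP_IMPORT_DIR_MAR2013',
                  destination_file_name        => 'ORDERS_MAR2013_1.dmp');
              end;
              /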

     

    Figure 18: LKM Oracle to Oracle Datapump Plus – Code Sample for File Transfer

     

    Conclusion

     

    Oracle Data Pump is the fastest way to extract and load data between Oracle data servers.  If you would like to download a copy of this knowledge module, please go to “Oracle Data Integrator Knowledge Modules – Downloads.” Search for “datapump”.  The knowledge module is called “LKM Oracle to Oracle Datapump Plus”.

    For more ODI best practices, tips, tricks, and guidance that the A-Team members gain from real-world experiences working with customers and partners, visit Oracle A-Team Chronicles for ODI.

     

    Understanding the ODI JKMs and how they work with Oracle GoldenGate


    Introduction

    The best option for ODI Changed Data Capture is to leverage Oracle GoldenGate. To understand how to best leverage the out-of-the-box integration between ODI and GoldenGate, we will review how ODI handles CDC with an in depth explanation of the JKMs principles, then expand this explanation to the specifics of the ODI-GoldenGate integration.

    Understanding the ODI JKMs and how they work with Oracle GoldenGate

    ODI is an ELT product. As such, it does not have its own transformation engine: data processing is done by leveraging the environment where the data is extracted from or loaded to (whether that environment is a database, an XML file, a JMS message or a Hadoop cluster). When it comes to the detection of changes in a source system, it is only natural that ODI on its own would not have proprietary mechanisms for that detection. ODI once again leverages other existing components, and creates an infrastructure to identify and process the changes that these components are detecting. We are reviewing here the details of this infrastructure, with an emphasis on how this infrastructure is leveraged when combining ODI with GoldenGate for the detection and delivery of the changes.

    1. Understanding ODI JKMs

    All the code generated by ODI is defined by a family of templates called Knowledge Modules. The Journalizing Knowledge Modules (JKM) are the KMs used when Changed Data Capture (CDC) is required for the data integration projects.

    There are two journalizing modes for the JKMs: simple mode, and consistent set mode. Before going into the specifics for each mode, let’s review the common ground.

    1.1 Infrastructure and key concepts

    Rather than storing a copy of the entirety of the records that are changed, ODI will only require that the Primary Key of the changed records be stored in its infrastructure.  If no primary key is available in the source system, any combination of columns that uniquely identifies the records can be used (in that case a primary key is defined in ODI, without any need to create a matching key in the database).

    To store these Primary Keys the JKM will create a table named after the source table with a J$ prefix. ODI also creates views that join this J$ table with the original table so that one simple select statement can extract all the columns of the changed records. ODI automatically purges the content of the J$ table when the records have been processed.

    ODI also maintains a list of subscribers to the changes the same way messaging systems work: each target application that requires a copy of the changes can be assigned a subscriber name. When consuming the changes, applications filter the changes by each providing their own subscriber name. In that case, when performing a purge, only the changes processed by all subscribers are removed from the J$ table (in other words, as long as at least one subscriber has not consumed a changed record, that record remains in the J$ table). Different subscribers can safely consume changes at their own pace with no risk of missing changes when they are ready for their own integration cycle. Figures 1 to 3 illustrate the consumption by two subscribers with different integration cycles and show how purges are handled.

    For the purpose of our illustration we assume that we have two subscribers: GL_INTEGRATION and DWH_INTEGRATION. GL_INTEGRATION consumes the changes every hour, 15 minutes past the hour. DWH_INTEGRATION consumes the changes once a day at 8:00pm. First, the GL_INTEGRATION processes all available changes at 12:15 PM:

    ODI_JKM_Subscriber1

    Figure 1: Changes processed by the first subscriber.

    As more changes appear in the J$ table, GL_INTEGRATION continues to process the new changes at 1:15 PM.

    ODI_JKM_Subscriber2

    Figure 2: More changes processed by the first subscriber.

    At the end of the day, the subscriber DWH_INTEGRATION consumes all the changes that have occurred during the day.

    ODI_JKM_Subscriber3

    Figure 3: Changes processed by the second subscriber followed by a purge of the consumed records.

    In the above example, if changes had occurred before 8:00pm, but after GL_INTEGRATION last processed the changes (i.e. 7:15pm) then these changes would not be purged until GL_INTEGRATION has processed them all (i.e. 8:15pm)

    To process the changes, ODI applies a logical lock on the records that it is about to process. Then the records are processed and the unlock step defines if the records have to be purged or not, based on the other subscribers’ consumption of the changes.

    Two views are created: the JV$ view and the JV$D view.

    The JV$ view is used in the mappings where you select the option Journalized data only. Figure 4 shows where to find this option in the Physical tab of the mappings:

    ODI_CDC_JournalisedDataOnly

    Figure 4: Extracting only the changes from the source table.

    The code generated by ODI uses this view instead of the original source table when this option is selected. The JV$ view joins the J$ table with the source table on the primary key. A filter in the Logical tab of the mappings allows the developers to select the subscriber for which the changes are consumed, as illustrated in figure 5 below:

     

    ODI_CDC_Subscriber

    Figure 5: Selecting the subscriber name in the mapping options to consume changes.

     The subscriber name does not have to be hard-coded: you can use an ODI variable to store this name and use the variable in the filter.

     The JV$D view is used to show the list of changes available in the J$ table when you select the menu Journal Data from the CDC menu under the models and datastores. Figure 6 shows how to access this menu:

     

    CDCDataView

    Figure 6: viewing the changes from the graphical interface.

     

    1.2 Simple CDC

    Simple CDC, as the name indicates, is a simple implementation of the infrastructure described above. This infrastructure works fine if you have:

    • One single subscriber
    • No dependencies between records (Parent-child).

    Because of these limitations though, the most recent and most efficient JKMs provided out of the box with ODI are all Consistent set JKMs. One important caveat with simple CDC JKMs is that they create one entry per subscriber in the J$ table for every single changed row. If you have two subscribers, each change generates two records in the J$ table. Having three subscribers means three entries in the J$ table for each change. You can immediately see that this implementation works for basic cases, but it is very limited when you want to expand your infrastructure.

    When using Simple CDC JKMs, the lock, unlock and purge operations are performed in the IKM: each IKM has the necessary steps for these operations, and these steps are only executed if:

    • Journalizing is selected in the interface, as described above in figure 4;
    • The JKM used for journalizing in the model that contains the source table is a Simple CDC JKM

    1.3 Consistent set CDC

    Consistent Set CDC addresses the two limitations of simple CDC:

    • Dependencies between parent and child records
    • Handling of more than one subscriber.

     

    1.3.1 Parent-Child relationship

    There are two conflicting requirements when processing parent and child records:

    • Parent records must be processed first, or child records cannot be inserted (they would be referencing invalid foreign keys).
    • ODI needs to mark the records that are about to be processed (This is the logical lock mentioned earlier), and then process them. But as we are processing the parent records, changes to additional parent and children can be written to the CDC tables. The challenge is that by the time we lock the children records in order to process them, the parent records for the last arrived changes have not been processed yet.  Figure 7 below illustrates this: if ODI starts processing the changes in the Orders table at 12:00:00, and then starts processing the changes in the Order Lines table at 12:00:02, the parent record for order lines 4 and 5 is missing in the target environment: order # 3 had not arrived yet when the Orders changes were processed.

    ODI_JKM_ConsistentCDC

    Figure 7: Parent and children records arriving during the processing of changes

    When you define the parameters for consistent set CDC, you have to define the parent-child relationship between the tables. To do so, you have to edit the Model  that contains these tables and select the Journalized Tables tab. You can either use the Reorganize button to have ODI compute the dependencies for you based on the foreign keys available in the model, or you can manually set the order. Parent tables should be at the top, children tables (the ones referencing the parents) should be at the bottom.

    In Figure 8 we see a Diagram that was created under the model that hosts the journalized tables to represent the relationships between the tables. To reproduce this, create a Diagram under your Model, then drag and drop the selected tables in that diagram: the foreign keys will automatically be represented as arrows by the ODI Studio.

    CDC_Tables_Relationships

    Figure 8: ODI Diagram that represents the parent-child relationship in a set of tables.

    In the illustration shown in figure 9 we would have to move PRODUCT_RATINGS down the list because of its reference to the SUPPLIERS table.

    ODI_JKM_ConsistentOrder

    Figure 9: Ordering CDC tables from parent to child

    Once the tables are ordered, the Consistent Set JKMs can “lock” the records based on this order: children records first, then the parent records. From then on it is safe to process the parent records, followed by the children records knowing that none of these children are missing their references. If more parent and children records are delivered to the J$ tables while the data is being processed (as was the case in Figure 7 with order #3 and the matching order lines), the new records are not locked and are ignored until the next iteration of the integration process. This next iteration can be anywhere from a few seconds later to hours later, depending on latency requirements.

    Another improvement over simple CDC is that consistent set CDC does not duplicate the records in the J$ table when multiple subscribers are registered. Instead, ODI maintains window_ids that are used to identify when the records have been inserted in the infrastructure. Then it is only a matter of knowing which window_ids have been processed by which subscriber.

    When the records of a set (parents and children records) are about to be processed, children records and parent records are logically locked. The KMs do the following operations:

    • Make sure that all records have a window_id, then identify the highest available window_ids (this is the Extend Window operation)
    • Define the array of window_ids to be processed by the subscribers (this is the Lock Subscriber operation).

    These operations are performed in the packages before processing the interfaces where CDC data is processed as shown in Figure 10. After the data has been processed, the subscribers must be unlocked and the J$ table can be purged of the consumed records.

    ODI_CDC_ConsistentSetPackage

    Figure 10: Example of a package for consistent set CDC.

    To add these operations to a package, drag and drop the Model that contains the CDC set, then set the Type to Journalizing Model and select the appropriate options in the consumption section, as shown in Figure 10. You can drag and drop the model twice to separate the two operations.  Note that in this case, the Extend Window operation must be done before the Lock Subscriber operation.

    We will now look at how the Extend Window and Lock Subscriber package steps work:

    Extend window

    Either the window_id column of the J$ table is updated by the detection mechanism (as is the case with GoldenGate JKMs) or it is not (as is the case with trigger based JKMs). In all cases, the SNP_CDC_SET table is first updated with the new computed window_id for the CDC Set that is being processed. The window_id is computed from the checkpoint table for GoldenGate JKMs or is based on an increment of the last used value (found in the SNP_CDC_SET table) for other JKMs.

    For non GoldenGate JKMs, all records of the J$ table that do not have a window_id yet (the value would be null) are updated with this new window_id value so that the records can be processed: these are records that were written to the J$ table after the last processing of changes and were never assigned a window_id.

    Again, GoldenGate writes this window_id as it inserts records into the J$ table.
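    For a trigger-based JKM, that update is conceptually similar to the statement below, where the table name is hypothetical and :new_window_id represents the value just computed and stored in SNP_CDC_SET (a sketch, not the exact KM code):

              -- Assign the newly computed window_id to rows that do not have one yet
              update J$ORDERS
              set    WINDOW_ID = :new_window_id
              where  WINDOW_ID is null;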

    Lock subscriber

    For all JKMs, the subscribers have to be locked: their processing windows are set to range between the last processed window_id (which is the “minimum” window_id) and the newly computed window_id (which is the “maximum” window_id).

    Unlock and purge

    After processing, unlocking the subscribers only amounts to overwriting the last processed window_id with the newly computed window_id (this way the next time we want to process changes, the “minimum window_id” is the one we had computed as the maximum window_id for the completed run). The “purge” step makes sure that the records that have been processed by all subscribers are removed from the J$ tables (all records with a window_id less than or equal to the lowest processed window_id across all subscribers).
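    In SQL terms, a simplified view of these two steps could look like the sketch below (the J$ table name is hypothetical; these are not the exact statements of any given JKM):

              -- Unlock: the processed maximum becomes the new starting point for this subscriber
              update SNP_CDC_SUBS
              set    MIN_WINDOW_ID  = MAX_WINDOW_ID
              where  CDC_SET_NAME   = :cdc_set_name
              and    CDC_SUBSCRIBER = :subscriber_name;

              -- Purge: remove changes that every subscriber of the set has already consumed
              delete from J$ORDERS
              where  WINDOW_ID <= (select min(MIN_WINDOW_ID)
                                   from   SNP_CDC_SUBS
                                   where  CDC_SET_NAME = :cdc_set_name);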

    Now that we understand the mechanics of ODI CDC, we can look into the details of the infrastructure.

    2. Details of the Infrastructure

    2.1 Simple CDC

    Simple CDC requires only 2 tables and 2 views. The first table is for the list of subscribers and the tables they each monitor, the second table (J$) lists the changes for each source table that is monitored (there is one such table for each monitored source table). The views are used to either see the changes from the ODI Studio, or to consume the changes in the mappings.

    The subscribers table is created in the Work Schema of the Default Schema for the data server. To identify the Default Schema, look under the data server definition in the physical architecture of the Topology Navigator: the Default Schema is marked with a checkmark. If you edit the schema, the Default checkbox is selected. As such, there will be a single, shared subscribers table for all the schemas on that server.

    The J$ table and the two views are created for each table that is journalized. These are created in the Work Schema associated to the Physical Schema where the source table is located.

    2.1.1 The Subscribers Table

    When you register a subscriber to consume changes for a given table, the subscriber name and the name of the table are added to the SNP_CDC_SUBS table.

    SNP_CDC_SUBS

    JRN_TNAME (PK) Name of the table monitored for changes
    JRN_SUBSCRIBER (PK) Name of the subscriber
    JRN_REFDATE Last update of the record
    JRN_ROW_COUNT Number of rows in the journalizing tables for this subscription
    JRN_DATA_CMD Placeholder for the SQL query to retrieve the changed data
    JRN_COUNT_CMD SQL query to update the JRN_ROW_COUNT column

     

    2.1.2 The J$ tables

    For simple CDC, the J$ tables contain the PK of the changed records in the table that is monitored for changes, along with the name of the subscriber for whom that change is recorded.

    The JRN_CONSUMED column is used to logically lock the records: when the records are inserted in the J$ table, the value is set to 0. Remember, for Simple CDC the lock/unlock operations are performed by the IKMs. When the IKMs lock the records, the value is changed to 1. The “unlock” process only purges records with a value equal to 1.

    The JRN_FLAG column indicates the type of change detected in the source system. Deleted records are marked with a ‘D’. Inserts and updates are marked with an ‘I’: the IKM differentiates between inserts and updates based on the content of the target table: there could have been more than one change in the source system between two ODI integration cycles, for instance a new record can be inserted and then updated before the new cycle gets started. In that case, even though the last event in the source system is an update, the operation that is needed on the target side is an insert with the latest values found in the source system.

    J$<SOURCE_TABLE_NAME>

    JRN_SUBSCRIBER Name of the subscriber who subscribed to the table changes
    JRN_CONSUMED Boolean flag. Set to 0 when the records are inserted, incremented to 1 when the records are marked for consumption (or “locked”)
    JRN_FLAG Type of operation in the source table (D=deleted, I= inserted or updated)
    JRN_DATE Date and time of the change
    PK_x Column x of the primary key (each column of the primary key of the source table is represented as a separate column in the J$ table)
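    Put together, the J$ table for a hypothetical ORDERS source table with a single-column primary key would look roughly like this (a sketch, not the exact DDL generated by the JKM):

              create table J$ORDERS (
                JRN_SUBSCRIBER  varchar2(400) not null,          -- subscriber the change is recorded for
                JRN_CONSUMED    char(1) default '0' not null,    -- 0 = available, 1 = locked for consumption
                JRN_FLAG        char(1) default 'I' not null,    -- I = insert/update, D = delete
                JRN_DATE        date not null,                   -- date and time of the change
                ORDER_ID        number not null                  -- primary key column of the source table
              );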

     

    This shows that ODI does not replicate the transactions; it does an integration of the data as they are at the time the integration process runs. Oracle GoldenGate replicates the transactions as they occur on the source system.

    2.1.3 The JV$ View

    The JV$ view is the view that is used in the mappings where you select the option Journalized data only. Records from the J$ table are filtered so that only the following records are returned:

    • Only Locked records: JRN_CONSUMED = '1'
    • If the same PK appears multiple times, only the last entry for that PK (based on the JRN_DATE) is taken into account. Again the logic here is that we want to replicate values as they are currently in the source database. We are not interested in the history of intermediate values that could have existed.

    An additional filter is added in the mappings at design time so that only the records for the selected subscriber are consumed from the J$ table, as we saw in figure 5.
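    Conceptually, the query encapsulated by the JV$ view looks something like the hedged sketch below, assuming a hypothetical source table ORDERS with primary key ORDER_ID and two more columns:

              select jrn.JRN_FLAG, jrn.JRN_SUBSCRIBER, jrn.JRN_DATE,
                     jrn.ORDER_ID, src.CUSTOMER_ID, src.ORDER_DATE
              from   J$ORDERS jrn
              left outer join ORDERS src
                     on (src.ORDER_ID = jrn.ORDER_ID)            -- outer join so deleted rows still surface
              where  jrn.JRN_CONSUMED = '1'                      -- only locked records
              and    jrn.JRN_DATE = (select max(j2.JRN_DATE)     -- only the latest change per PK
                                     from   J$ORDERS j2
                                     where  j2.ORDER_ID = jrn.ORDER_ID
                                     and    j2.JRN_SUBSCRIBER = jrn.JRN_SUBSCRIBER);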

    2.1.4 The JV$D view

    Similarly to the JV$ view, the JV$D view joins the J$ table with the source table on the primary key.  This view shows all changed records, locked or not, but applies the same filter on the JRN_DATE column so that only the last entry is taken into account when the same record has been modified multiple times since the last consumption cycle. It lists the changes for all subscribers.

    2.2 Consistent Set CDC

    The infrastructure for consistent set CDC is richer to accommodate more complex situations.

    Once again, all infrastructure tables (SNP_xxx) are created in the Work Schema of the Default Schema for the data server. As such they are shared resources for all schemas defined under the data server. The J$ table and the associated views are created in the Work Schema associated to the Physical Schema where the source table is located.

    Now let’s look at the different components of this infrastructure.

    2.2.1 The CDC set table

    This table keeps track of the latest (and highest by the same token) window_ids used for a given CDC set. It is updated during the Extend Window step of the packages.

    SNP_CDC_SET

    CDC_SET_NAME (PK) Name of the CDC Set
    CUR_WINDOW_ID Last window_id that has been used for this CDC set
    CUR_WINDOW_ID_DEL Last Window_id used to compute delete consistency
    CUR_WINDOW_ID_INS Last Window_id used to compute insert/update consistency
    RETRIEVE_DATA Command to execute in order to retrieve the journal data (used by the SnpsRetrieveJournalData API)
    REFRESH_ROW_COUNT Command to execute in order to refresh the row count (used by the SnpsRefreshJournalData API)

     

    2.2.2 The subscribers table

    This table lists the subscribers, and for each subscriber it references the data set to which the subscriber subscribed, along with the minimum and maximum window_id for this combination of subscriber and CDC set.

    SNP_CDC_SUBS

    CDC_SET_NAME (PK) Name of the CDC Set
    CDC_SUBSCRIBER (PK) Name of the subscriber who subscribed to the CDC Set
    CDC_REFDATE Last update of the record
    MIN_WINDOW_ID Window_ids under this one should be ignored
    MAX_WINDOW_ID Maximum Window_id used by this subscription
    MAX_WINDOW_ID_DEL Maximum Window_id to take into consideration when looking at consistency for deletes
    MAX_WINDOW_ID_INS Maximum Window_id to take into consideration when looking at consistency for inserts / updates
    CDC_ROW_COUNT Number of rows in the journalizing tables for this subscription

     

    After the Extend Window step has updated the SNP_CDC_SET table for the current CDC set, the Lock Subscriber step in the packages updates the maximum window_ids of the SNP_CDC_SUBS table with the same values for the current subscriber.

    Only the changes from the J$ table that have a window_id between the minimum and maximum window_id recorded in the SNP_CDC_SUBS table are processed. Once these changes have been processed and committed, the maximum window_id is used to overwrite the minimum window_id (this is done in the Unlock Subscriber step of the package). This guarantees that the infrastructure is ready for the next integration cycle, starting where we left off.
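    The consumption filter therefore boils down to a window_id range check, along the lines of the sketch below (the J$ table name is hypothetical, and the exact boundary handling is implemented by the JKM and the generated views):

              -- Only rows that fall inside the window locked for this subscriber are consumed
              select j.*
              from   J$ORDERS j
              where  j.WINDOW_ID >  :min_window_id
              and    j.WINDOW_ID <= :max_window_id;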

    2.2.3 The table listing the content of a CDC set

    This table lists the tables that are journalized in a given data set.

    SNP_CDC_SET_TABLE

    CDC_SET_NAME Name of the CDC Set
    FULL_TABLE_NAME (PK) Full name of the journalized table. For instance SUPPLIERS.PRODUCT_RATINGS
    FULL_DATA_VIEW Name of the data view. For instance ODI_TMP.JV$DPRODUCT_RATINGS
    RETRIEVE_DATA Command to execute in order to retrieve the journal data (used by the SnpsRetrieveJournalData API)
    REFRESH_ROW_COUNT Command to execute in order to refresh the row count (used by the SnpsRefreshJournalData API)

     

    2.2.4 The table listing the infrastructure objects

    This table lists all the CDC infrastructure components associated to a journalized table.

    SNP_CDC_OBJECTS

    FULL_TABLE_NAME (PK) Name of the Journalized table
    CDC_OBJECT_TYPE (PK) Source table, view or data view, OGG component
    FULL_OBJECT_NAME Name of the object whose type is CDC_OBJECT_TYPE
    DB_OBJECT_TYPE TABLE/VIEW/TRIGGER/OGG EXTRACT/OGG REPLICAT/OGG INIT EXTRACT

    This table is leveraged to make sure that ODI does not attempt to recreate an object that has already been created (see section 4.1 Only creating the J$ tables and views if they do not exist).

    2.2.5 The J$ tables

    For consistent CDC, the J$ tables contain the PK of the changed records along with a window_id that is updated to make sure that these records are processed in the appropriate order. Depending on the JKMs, the window_id can be updated by the mechanism used to detect the changes (as is the case for the GoldenGate JKMs) or during the Extend Window step of the package (in which case it is an increment of the last used value).

    J$<SOURCE_TABLE_NAME>

    WINDOW_ID Batch order defining when to process the data
    PK_x Column x of the primary key (each column of the primary key of the source table is represented as a separate column in the J$ table)

     

    2.2.6 The Views

    The JV$ view is the view that is used in the mappings where you select the option Journalized data only. Records are filtered so that only the following records are returned:

    • Records where the window_id is between the minimum and maximum window_id for the subscribers;
    • If the same PK appears multiple times, only the last entry for that PK is taken into account. The logic here is that we want to replicate values as they currently are in the source database: we are not interested in the history of intermediate values that could have existed.

    A filter created in the mappings allows the developers to select the subscriber for which the changes are consumed, as we saw in figure 5.

    The JV$D view uses the same approach to remove duplicate entries, but it shows all entries available to all subscribers, including the ones that have not been assigned a window_id yet.

    3. Focus on ODI JKMs for GoldenGate

    The main benefit for ODI to leverage Oracle GoldenGate is that GoldenGate is the least intrusive product on the market to continuously extract data from a source system and replicate the transactions with the best possible performance.

    3.1 Why GoldenGate?

    We would not be doing justice to GoldenGate by trying to explain its many benefits in just a few lines. The GoldenGate documentation contains a very good introduction to the technology available here: http://docs.oracle.com/goldengate/1212/gg-winux/GWUAD/wu_about_gg.htm . If you are interested in best practices, real life recommendations and in depth understanding of GoldenGate, our experts concentrate their work here: http://www.ateam-oracle.com/di-ogg/.

    The important elements of integrating with GoldenGate from an ODI perspective are the following:

    • Low impact on the source system: GoldenGate reads the changes directly from the database logs, and as such does not require any additional database activity.
    • Decoupled architecture:  GoldenGate continuously replicates the transactions occurring on the source system, making these changes available to ODI as needed without the need for ODI to run a (potentially costly) SQL query against the source system when the ODI integration cycle starts. If part of the infrastructure is down, the impact on the other elements is minimal: the GoldenGate Capture process is independent from its delivery process, and the ODI processes are independent from the GoldenGate processes. This also allows for real time capture on the source system and scheduled delivery in the target system.
    • Performance of the end-to-end solution: even though the large majority of ODI customers run their ODI processes as batch jobs, some customers are reducing the processing windows continuously. Using GoldenGate for CDC allows for unique end-to-end performance, with customers achieving under-10 seconds end-to-end latency across heterogeneous systems: this includes GoldenGate detection of the changes, replication of the changes, transformations by ODI and commit in the target system.
    • Heterogeneous capabilities: both ODI and GoldenGate can operate on many databases available on the market, allowing for more flexibility in the data integration infrastructure.

    3.2 Integration between ODI and OGG

    The main components of the integration between ODI and GoldenGate are the following:

    • The ODI JKMs generate the necessary files for GoldenGate to replicate the data and update the ODI J$ tables (oby and prm files for the capture, pump and apply processes), including the window_id
    • These files instruct GoldenGate to write the PK of the changed records and to update the window_id for that change. The window_id is computed by concatenating the sequence number and the RBA from the GoldenGate checkpoint file with this expression:

    WINDOW_ID = @STRCAT(@GETENV("RECORD", "FILESEQNO"), @STRNUM(@GETENV("RECORD", "FILERBA"), RIGHTZERO, 10))

    • If you are using OGG Online JKMs, ODI can issue the commands using the GoldenGate JAgent and execute these commands directly. If not, ODI generates a readme file along with the oby and prm file. This file provides all the necessary instructions to configure and start the GoldenGate replication using the generated files.
    • If you already have a GoldenGate replication in place, you can read the prm files generated by ODI to see what needs to be changed in your configuration so that you update the J$ tables (or read the next section for an explanation of how this works).

    3.3 How does GoldenGate update the J$ tables?

    ODI creates a prm file for the apply process that contains basic replication instructions.

    ODI writes two maps in that prm file. The first one instructs GoldenGate to copy the data from the source table into the staging tables.

    map <Source_table_name>, TARGET <Target_table_name>, KEYCOLS (PK1, PK2, …, PKn);

    The second one makes sure that the J$ table is updated at the same time as the staging table. GoldenGate in this case has two targets when it replicates the changes.

    map <Source_table_name>, target <J$_Table_name>, KEYCOLS (PK1, PK2,…,PKn, WINDOW_ID),  INSERTALLRECORDS, OVERRIDEDUPS,
    COLMAP (
    PK1 = PK1,
    PK2 = PK2,
    ...
    PKn=PKn,
    WINDOW_ID = @STRCAT(@GETENV("RECORD", "FILESEQNO"), @STRNUM(@GETENV("RECORD", "FILERBA"), RIGHTZERO, 10))
    );

    If you already have GoldenGate in place to replicate data from the source tables into a staging area, you may not be interested in using the files generated by ODI. You have already configured and fine tuned your environment, you do not want to override your configuration. All you need to do in that case is to add the additional maps for GoldenGate to update the ODI J$ tables.

    3.4 Evolution of the GoldenGate JKMs between ODI 11g and ODI 12c

    There is a deeper integration between ODI and GoldenGate in the 12c release of ODI than what was available with the 11g release. One immediate consequence is that the JKMs for GoldenGate have evolved to take advantage of features that are now available:

    • In ODI 11g the source table for an initial load was different from the source table used with GoldenGate for CDC: the GoldenGate replicat table had to be used explicitly as a source table in CDC configurations. With the 12c implementation of the GoldenGate JKMs, the same original source table is used in the mappings for both initial loads and incremental loads using GoldenGate. For CDC, the GoldenGate source becomes the source table in the mappings for CDC. The GoldenGate replicat is considered as a staging table and as such is not represented in the ODI mappings anymore. David Allan has a very good pictorial representation of the new paradigm available here: https://blogs.oracle.com/dataintegration/resource/odi_12c/odi_12c_ogg_configuration.jpg.
    • The new JKMs allow for online or offline use of GoldenGate: in online mode, ODI  communicates directly with the GoldenGate JAgent to distribute the configuration parameters. The offline mode is similar to what was available in ODI 11g.

    4. Elements to look for in the ODI JKM if you want to go further

    To illustrate JKM internal workings, we are looking here at the code of some of the Knowledge Modules delivered with ODI 12.1.2.0.

    4.1 Only creating the J$ tables and views if they do not exist

    Traditionally in ODI KMs, tables and views can be created with the option to Ignore Errors so that the code does not fail if the infrastructure is already in place. This approach does not work well in the case of JKMs where we do want to know that the creation of a J$ table (or view) fails, but we will continuously add tables and views to the environment. What we want is to ignore the tables that have already been created, and only create the ones that are needed.

    If you edit the JKM Oracle to Oracle Consistent (OGG) and look at the task Create J$ Table you can see that there is code in the Source command section as well as for the Target command section. The target command creates the table, as you would expect. The source command only returns a result set if the J$ table we are about to create is not referenced in the SNP_CDC_OBJECTS table. If there is no result set from the source command, the target command is not executed by ODI: the standard behavior in KM and procedures tasks is that the target command is executed once for each element of the result set returned from the source command (if there is a source command). Zero elements in the result set mean no execution.
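    A minimal sketch of this pattern for the Create J$ Table task could look like the query below used as the source command: it returns a single row only when the journal table is not yet registered, so that the target command (the actual CREATE TABLE) runs at most once. The journalized table name is a hypothetical example:

              -- Source command sketch: one row returned means the J$ table still has to be created
              select 'CREATE_IT'
              from   dual
              where  not exists (select 1
                                 from   SNP_CDC_OBJECTS
                                 where  FULL_TABLE_NAME = 'SUPPLIERS.ORDERS');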

    4.2 Operating on Parents and Children tables in a CDC set

    Some operations require that ODI processes the parent table first; others require that the child tables are processed first. If you edit the JKM Oracle Consistent and look at the task Extend Consistency Window (inserts), under the Consumption tab, you can see that the window_ids are applied in Descending order, as shown in Figure 11. The reverse operation further down in the KM, Cleanup journalized tables, is done in Ascending order.

    CDC_KM_TablesOrder

    Figure 11: Repeating the code for all tables of the CDC set in the appropriate order

    Note that since GoldenGate updates the window_ids directly for ODI, the matching step does not exist in the GoldenGate JKMs. But the same technique of processing tables of the set in the appropriate order is leveraged when creating or dropping the infrastructure (look at the Create J$ and Drop J$ tasks for instance in the GoldenGate JKMs).

    Conclusion

    As you can see the ODI CDC infrastructure provides a large amount of flexibility and covers the most complex integration requirements for CDC. The out-of-the-box integration with Oracle GoldenGate helps developers combine both products very quickly without the need for experts to intervene. But if you need to alter the way the two products interact with one another, JKMs are the key to the solution you are dreaming about.

    For more ODI best practices, tips, tricks, and guidance that the A-Team members gain from real-world experiences working with customers and partners, visit Oracle A-Team Chronicles for ODI. For Oracle GoldenGate, visit “Oracle A-Team Chronicles for GoldenGate.”

    Understanding Where to Install the ODI Standalone Agent


    Introduction

     

    With no middle tier required in the ODI architecture, the most common question from people who are new to ODI is where to install the standalone agent: source, target or middle tier systems? We look here at different possible architectures and provide the best choices accordingly.

     Understanding Where to Install the ODI Standalone Agent

    Source, Target or Middle tier?

    Source systems can be dispersed throughout the information system: connection from one source system to another one is never guaranteed. This usually makes source systems a less than ideal location for the ODI agent. Dedicated systems can work, but if they are independent of the database servers involved in the ETL processes, then the infrastructure is dependent on physical resources that are not tightly coupled with the ETL processes. This means that there are more components to provision, monitor and maintain over time. A simple answer then is that installing the agent on the target systems makes sense, in particular in a data warehousing environment, where most of the staging of data will already occur on this target system.

    But in the end, this simple answer is a convenience, not an end-all be-all. So rather than accepting it as an absolute truth, we will look into how the agent works and from there provide a more nuanced answer to this question.

    For the purpose of this discussion we are considering the Standalone version of the agent only – the JEE version of the agent runs on top of a WebLogic Server, which pretty much defines where to install the agent… but keep in mind that in the same environment standalone and JEE agents can be combined. Also note that the JEE agent architecture addresses the limitation described above when the agent is installed neither on the source nor on the target systems. For one, WebLogic would not be dedicated solely to the ODI agent. In addition, the WebLogic cluster will provide the necessary High Availability infrastructure to guarantee that the agents and their schedules are always available.

    In addition, ODI 12c introduces the notion of Colocated agents: these can be viewed as Standalone agents that can be managed and monitored from WebLogic Server and Enterprise Manager. Naturally the recommendations for the location of the Standalone agent will also apply to the Colocated agents. For more on the different types of ODI agents, see ODI agents: Standalone, JEE and Colocated.

    First we will look into connectivity requirements. Then we will look into how the agent interacts with the environment: flat files, scripts, utilities, firewalls. And finally we will illustrate the different cases with real life examples.

    Understanding Agent Connectivity Requirements

    The agent has to perform a number of operations for a scenario to run:

    • Connect to the repository (always)
    • Connect to the source and target systems (always, the minimum requirement will be to connect to the databases to send DDL and DML for execution by the systems)
    • Access the data via JDBC (if needed – depends on the Knowledge Modules that are used in the mappings of the scenario).

    Connection to the repository

    The agent will connect to the repository to perform the following tasks:

    • Retrieve from the repository the code of the scenarios that are executed
    • Complete the code generation based on the context that was selected for execution
    • Write the generated code in the operator tables
    • After the code has been executed by the databases, update the operator tables with runtime statistics and, if necessary, error messages returned by the databases or operating system.

    ODI 12c dramatically improves the communication between the agents and the repository with blueprint caching and optimized log generation. Nonetheless the same connection requirements remain.

    To perform all these operations, the agent will use JDBC to connect to the repository. The parameters for the agent to connect are defined when the agent is installed. For a 12c standalone agent, these parameters are managed with the config.sh script (or config.cmd on Windows) available in ORACLE_HOME/oracle_common/common/bin. For earlier releases of ODI these parameters are maintained manually by editing the file odiparams.sh (or odiparams.bat on Windows) found in the ORACLE_HOME/oracledi/agent/bin directory.

    What does this mean for the location of the agent?

    Since the agent uses JDBC to connect to the repository, the agent does not have to be on the same machine as the repository. The amount of data exchanged with the repository is limited to logs generation and updates. In general this amounts to relatively little traffic, except in the case of near real time environments where the agent can be communicating extensively with the repository. In all cases, it is highly recommended that the agent be on the same LAN as the repository. To limit latency between the agent and the repository, make sure that the agent is physically close to the repository (and not a few miles away). Beyond that, the agent can be installed on pretty much any system that can physically connect to the proper database ports to access the repository.

    Connection to sources and targets

    Before sending code to the source and target systems for execution, the agent must first establish a connection to these systems. The agent will use JDBC to connect to all source and target databases at the beginning of the execution of a session. These connections will be used by the agent to send the DDL (create table, drop table, create index, etc.) and DML (insert into… select…from… where…) that will be executed by the databases.

    What does this mean for the location of the agent?

    As long as the agent is sending DDLs and DMLs to the source and target systems, once again it does not have to be physically installed on any of these systems. However, the location of the agent must be strategically selected so that it can connect to all databases, sources and targets. From a network perspective, it is common for the target system to be able to view all sources, but it is not rare for sources to be segregated from one another: different sub-networks or firewalls getting in the way are quite common. If there is no guarantee that an agent installed on a source system can connect to all sources (and targets), then it makes more sense to install it on one of the target systems. The DDL and DML activity described above requires a limited amount of physical resources (CPU, memory), so the impact of the agent on the system on which it is installed is quite negligible.

    Conclusion: from an orchestration perspective, the agent could be anywhere in the LAN, but it is usually more practical to install it on the target server.

    Data transfer using JDBC

    ODI processes can use multiple techniques to extract from source systems and load data into target systems: JDBC is one of these techniques. If the mappings executed by the agent use JDBC to move data from source to target, then the agent itself establishes this connection: as a result the data will physically flow through the agent.

    JDBC is never the most efficient way to transfer large volumes of data. Database utilities will always provide better performance and require fewer resources. It is always recommended to review the Knowledge Module selections made in your mappings and interfaces to make sure that only the most efficient KMs are used when transferring large volumes of data. Defaulting every data transfer to JDBC is never a good practice. In addition to being inefficient in terms of performance, JDBC will use the memory space allocated to the agent. The more JDBC processes you have running in parallel, the more memory is used by the agent. Before increasing the memory allocation for the agent, always double check which Knowledge Modules are being used, and how much data is processed with these Knowledge Modules. Replacing JDBC KMs with native utilities KMs will both address memory requirements and improve performance.

    What does this mean for the location of the agent?

    This is a case where we have to be more careful with the agent location. In all previous cases, the agent could have been installed pretty much anywhere as the performance impact was negligible. Now if data physically moves through the agent, placing the agent on either the source server or the target server will in effect limit the number of network hops required for the data to move from source to target.

Let’s take the example where the agent runs on a Windows server, with a source on a mainframe and a target on Linux. Data will have to go over the network from the mainframe to the Windows server, and then from the Windows server to the Linux box. In data integration architectures, the network is always a limiting factor for performance. Placing the agent on either the source or the target server will help limit the adverse impact of the network.


    Figure 1: JDBC access with remote ODI agent

    Figure 2: JDBC access with ODI agent on target

     

     

    Other considerations: accessing files, scripts, utilities

Part of the integration process often requires accessing resources that are local to a system: flat files that are not accessible remotely, local scripts and utilities. A very good example is leveraging database bulk-loading utilities for files located on a file server. Installing the agent on the file server along with the loading utilities allows ODI to bulk load the files directly from the server. An alternative is to share (or mount) the directories where the files and utilities are installed so that the agent can view them remotely. Keep in mind, though, that mounted drives and shared network directories tend to degrade performance.

    What does this mean for the location of the agent?

It is actually quite common to have the ODI agent installed on a file server (along with the database loading utilities) so that it can have local access to the files. This is easier than trying to share directories across the network (and more efficient), in particular when dealing with disparate operating systems.

Another consideration to keep in mind at this point is that there is no limit to the number of ODI agents in any given environment: some jobs can be assigned to specific agents because they need access to resources that would not be visible to other agents. This is a very common infrastructure, where a central agent receives the job execution requests (maybe on the target server, or a JEE agent on WebLogic Server). Some of the scenarios can then leverage satellite agents in charge of very specific tasks.


    Figure 3: ODI agent loading flat files

     

    Beyond databases: Big Data

In a Hadoop environment, execution requests are submitted to the NameNode. The NameNode is then in charge of distributing the execution across all DataNodes that are deployed and operational. It would be totally counter-productive for the ODI agent to try to bypass the NameNode. From that perspective, the best location for the ODI agent is on the NameNode.

The Oracle Big Data Appliance ships with the ODI agent pre-packaged so that the environment is immediately ready to use.

    Firewall considerations

One element that seems pretty obvious is that no matter where the agents are located, it is important to make sure that the firewalls will let the agents access the necessary resources. More challenging can be the timeouts that some firewalls (or even servers, in the case of iSeries) enforce. For instance, it is a common configuration for firewalls to drop connections that have been inactive for more than 30 minutes. If a large batch operation is being executed by the database, the agent has no reason to overload the network or the repository with unnecessary activity while it is waiting for the operation to complete… but as a result the firewall could disconnect the agent from the repository or from the databases. The typical error in that case would appear as “connection reset by peer”. When experiencing such behavior, think about reviewing firewall configurations with your security administrators.

    Real life examples

    We will now look into some real life examples, and define where the agent would best be located for each scenario.

    Loading data into Exadata with external tables

We are looking here at the case where files have to be loaded into Exadata. An important point from an ODI perspective is that we first want to look into what makes the most sense for the database itself – then we will make sure that ODI can deliver.

The best option for Exadata in terms of performance will be to land the files on DBFS and take advantage of InfiniBand for the data loads. From a database perspective, loading the files as external tables will give us by far the best possible performance.

    Considerations for the agent

The key point here is that external tables can be created through DDL commands. As long as the files are on DBFS, they are visible to the database… (they would have to be for us to use external tables anyhow). Since the agent will connect to Exadata via JDBC, it can issue the DDL no matter where it is installed, on a remote server or on the Exadata appliance.
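As a minimal illustration of this DDL-only pattern, the sketch below declares an external table over a file that already sits on DBFS. The SQL is standard Oracle external-table DDL; the directory object, file name, columns, and connection details are assumptions for the example, and python-oracledb simply stands in for the agent's JDBC connection.

```python
# Declaring an external table is a single DDL round trip: no data moves over
# this connection, so it does not matter where the issuing process runs.
# Directory object, file, columns, and credentials are placeholders.
import oracledb

ddl = """
CREATE TABLE orders_ext (
  order_id   NUMBER,
  order_date DATE,
  amount     NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY dbfs_stage_dir      -- directory object pointing to DBFS
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
  )
  LOCATION ('orders_20140731.csv')
)
REJECT LIMIT UNLIMITED
"""

with oracledb.connect(user="odi_stage", password="change_me", dsn="exadata-scan/DWPDB") as conn:
    conn.cursor().execute(ddl)
```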

     


    Figure 4: Remote ODI agent driving File Load
    via external tables

     

    Loading data with JDBC

    There will be cases where volume mandates the use of bulk loads. Other cases will be fine using JDBC connectivity (in particular if volume is limited). Uli Bethke has a very good discussion on this subject here (http://www.business-intelligence-quotient.com/?tag=array-fetch-size-odi), even though his focus at the time was not to define whether to use JDBC or not.

    One key benefit of JDBC is that it is the simplest possible setup: as long as the JDBC driver is in place and physical access to the resource is possible (file or database) data can be extracted and loaded. For a database, this means that no firewall prevents access to the database ports. For a file, this means that the agent has physical access to the files.

    Considerations for the agent

The most common mistake for file access is to start the agent with a username that does not have the necessary privileges to see the files – whether the files are local to the agent or accessed through a shared directory on the network (mounted on Unix, shared on Windows).

Other than that, as we have already seen earlier, locate the agent so as to limit the number of network hops from source to target (and not from source to middle tier to target). So the preference for database-to-database integration is usually to install the agent on the target server. For file-to-database integration, have the agent and database loading utilities on the file server. If file and database sources are combined, then it is possible to either have a single agent on the file server, or to have two separate agents, thus optimizing the data flows.

Revisiting external tables on Exadata with file detection

    Let’s revisit our initial case with flat files on Exadata. Let’s now assume that ODI must detect that the files have arrived, and that this detection triggers the load of the file.

    Considerations for the agent

    In that case, the agent itself will have to see the files. This means that either the agent will be on the same system as the files (we said earlier that the files would be on Exadata) or the files will have to be shared on the network so that they are visible on the machine on which the agent is installed. Installing the agent on Exadata is so simple that it is more often than not the preferred choice.


    Figure 5: ODI agent on Exadata detecting new files
    and driving loads via external tables

     

     

    Conclusion

    The optimal location for ODI standalone agents will greatly depend on the activities that the agent has to perform, but there are two locations that always work best:

• Target database;
• File server.

    Keep in mind that an environment is not limited to a single agent – and more agents will enhance the flexibility of the infrastructure. A good starting point for the first agent will be to position it on the target system. Additional agents can be added as needed, based on the specifics of the integration requirements.

     

    For more ODI best practices, tips, tricks, and guidance that the A-Team members gain from real-world experiences working with customers and partners, visit Oracle A-Team Chronicles for ODI.


    ODI Agents: Standalone, JEE and Colocated


     Introduction

     With its version 12, Oracle Data Integrator offers three flavors of agents: JEE, Colocated and Standalone. The purpose of this post is to compare and contrast the benefits of each one of these types of agents and to help developers and architects select the best option(s) for their implementation.

     ODI Agents: Standalone, JEE and Colocated

    In this comparison between the three agent flavors, we only refer to the 12c version of ODI unless otherwise noted.

Agent flavors and installation types

    To provide some context, let us review the different agents available in ODI 11g and ODI 12c, as well as the type of installation required for each agent flavor.

    Agents Install Compare

    Low footprint: the standalone agent

    Historically, this is the original ODI agent, and it is still available today. The number one benefit of this agent is its very light footprint: it does not require an application server since it is a standalone Java program. This makes it ideal for installations on busy and already active systems, including database servers. It provides all the required features to orchestrate data integration processes, including exposing these processes as web services. The lightweight nature of the agent brings many benefits:

    • In an ELT architecture, the agent can reside on the same server as the target database, hence reducing the number of network hops when transferring data through JDBC (see Understanding Where to Install the ODI Standalone Agent for more details on this);
    • For integrations that require access to flat files, scripts and utilities that are not accessible from the network, installing the agent on the file server or the machine that hosts scripts and utilities makes for a very flexible infrastructure.
• The lifecycle and availability of a Standalone agent can be managed by Node Manager, which monitors and restarts the agent.

    Enterprise deployments:  the JEE agent

There are a number of limitations with the standalone agent when we are looking at an enterprise deployment of ODI. These limitations are addressed with the JEE agent (introduced in version 11g of ODI) that can be installed on WebLogic Server (WLS). The following benefits of the JEE agent are inherited directly from WebLogic Server:

     

    • High availability: since the execution of the scenarios depends entirely on the agents, it is important that the agent be up and running continuously. This is even more important if the agent is used as a scheduler: if the agent is down when a scenario is scheduled to start, the execution of said scenario will be skipped. Since the JEE agent runs on WebLogic Server, agents can now be deployed on several nodes of the cluster. The schedules can be stored in a Coherence cache to guarantee that even if a node is down, the schedules are always executed. A very good description of the mechanisms involved is available in the Fusion Middleware documentation here: Oracle Fusion Middleware High Availability Guide. Another very good introduction to the concepts from an ODI perspective is available from Sachin Thatte here: https://blogs.oracle.com/dataintegration/entry/setup_of_odi_11g_agents_for_hi
    • Configurable connection pooling: The Standalone agent also uses a UCP (Universal Connection Pool), but its pool parameters cannot be configured. This configuration is possible with WLS for the JEE agent. Creating a DBMS connection is a slow operation. With WLS connection pools, connections are already established and available to the ODI JEE agent. The agents are more efficient since they do not have to keep connecting and disconnecting from the databases. In addition, DBMSs run faster with dedicated connections than if they have to handle incoming connection attempts at run time. Connection pooling also makes it easier to manage and optimize the number of concurrent connections to a system. A very good overview of the mechanisms and benefits of connection pooling is available here: http://docs.oracle.com/cd/E16655_01/java.121/e17659/intro.htm#JJUCP8109
    • Management, Monitoring, Alerting: Oracle Enterprise Manager Cloud Control (OEM) can be used to manage the ODI JEE agents:
    • Agents discovery
    • Configuration management (components version, connection parameters)
    • Monitoring of agents’ health and performance
    • Alerting
    • Trending

Obviously the JEE agent would not be installed on a database server, and since it runs on top of an application server it is constrained to the resources accessible to that application server. To access files and utilities that would not be accessible from the application server, the JEE agent can be combined with Standalone agents that provide local access to these remote resources.

    Lightweight and centrally manageable: the Colocated agent

As specified here (http://docs.oracle.com/middleware/1212/wls/WLDCW/intro.htm): “A WebLogic domain is the basic administrative unit of WebLogic Server. It consists of one or more WebLogic Server instances, and logically related resources and services that are managed collectively as one unit.” For implementations that require neither high availability for the agents nor connection pooling, but where there is an interest in all other benefits offered by the JEE agent, the Colocated agent is the solution. The only component that is required to run on the machine that hosts the Colocated agent is the WLS Node Manager. Thanks to the Node Manager, the Colocated agent is part of the WLS domain, and as such it benefits from all the features available to elements of the domain, except for those that would require the presence of a Managed Server on the machine. In other words, the Colocated agent is a Standalone agent that can be centrally managed and monitored.

    Summary

The following rules of thumb can be used to decide when to use the different agents. Keep in mind that you can mix and match the different types of agents within the same environment.

    Standalone agent

    • Light install and runtime footprint
    • Requirement to access local files or utilities
    • Optimized network traffic for JDBC connections

    Java EE agent

    • High Availability Requirement
    • Configurable connection pooling requirement
• Colocation with other Fusion Middleware components such as SOA or BI
    • Management, monitoring and alerting requirement

    Colocated agent

    • Light runtime footprint (The Enterprise Install lays out WLS and the Java Required Files. See Java Required Files Custom WLST Commands for more details on JRF)
    • Need to access local files or utilities
    • Optimized network traffic for JDBC connections
    • Management, monitoring and alerting requirement

The following chart summarizes the features of the different agent types:

    Agents_Compare

    (*): Only if the node manager is installed and running

    (#): An Administration Server must be installed in the domain, but can reside on a different machine.

    (x): This feature is a function of Oracle Enterprise Manager, not of the agent itself.

(&): Learn about Fusion Middleware Control

    References

If you are not familiar with the WLS components leveraged by ODI (a lot of the content above refers to the Oracle WebLogic Server documentation), here are some references that will help you better understand the architecture and benefits of WLS:

• All WLS domains must have an Administration Server, which is a central point for managing the domain and provides access to the WLS administration tools. All changes to configuration and deployment of the applications are done through the Administration Server. A very good introduction is available here.

More WebLogic Server references

    ODI agent references

    Special thanks to Sachin Thatte, Julien Testut, Benjamin Perez-Goytia and Sandrine Riley for their help on this subject.

    For more ODI best practices, tips, tricks, and guidance that the A-Team members gain from real-world experiences working with customers and partners, visit Oracle A-Team Chronicles for ODI.

    Importing Data from SQL databases into Hadoop with Sqoop and Oracle Data Integrator (ODI)


    Introduction

     

This article illustrates how to import data from relational databases into the Hadoop Distributed File System (HDFS) with Sqoop and Oracle Data Integrator (ODI).  This article features two ODI knowledge modules designed to work with Sqoop: IKM SQL to HDFS File (Sqoop) and IKM SQL to HDFS Hive (Sqoop).

If your source data store is a relational database such as Oracle, Teradata, or MySQL, the Sqoop tool provides a fast and efficient way to extract and load data into your Hadoop Distributed File System.

    If you would like to download a copy of these knowledge modules, go to “Oracle Data Integrator Knowledge Modules – Downloads.” Search for “sqoop”.

    The examples discussed in this article have been developed with ODI 11.1.1 and ODI 12.1.3.  A copy of these ODI repositories with examples can be found at “ODI Repository Sample SQL to HDFS Using Sqoop.”

     

    Importing Data from SQL databases into Hadoop with Sqoop and Oracle Data Integrator (ODI)

     

    Sqoop is a tool designed to transfer data between relational databases and Hadoop clusters.

    Using Sqoop, data can be imported from a relational database into a Hadoop Distributed File System (HDFS).  In Hadoop, data can be transformed with Hadoop MapReduce – a software framework capable of processing and transforming enormous amounts of data in parallel.  Once the data is transformed in the Hadoop cluster, it can then be exported with Sqoop into another relational database.

    This article illustrates how to import data from a relational database into a Hadoop cluster using Sqoop and Oracle Data Integrator (ODI).

    Sqoop supports three types of import operations:  import into HDFS files, import into Hive tables, and import into HBase tables.

    This article focuses on how to import data from relational databases into HDFS files and Hive tables.

     

    ODI 12.1.3 Application Adapters for Hadoop


    ODI 12.1.3 offers a new set of knowledge modules called the Oracle Data Integrator Application Adapters for Hadoop.  These application adapters include the following main capabilities:

     

    • Loading data from files and SQL databases into HDFS, HBase, and Hive tables using Sqoop.

    • Performing additional validation and transformation of data within the Hadoop cluster using Apache Hive and the Hive Query Language (HiveQL).

    • Loading processed data from Hadoop into SQL databases or HBase using Sqoop.

    Figure 1 illustrates the ODI 12.1.3 Application Adapters for Hadoop.  It is recommended to get familiar with this new set of knowledge modules.  They offer additional capabilities and features that are not in the scope of this article.

     


    Figure 1: Applications Adapters for Hadoop

     

To learn more about the ODI 12.1.3 Application Adapters for Hadoop, go to “Application Adapters Guide for Oracle Data Integrator 12.1.3.”

     

    About the Knowledge Modules in this Article

     

The knowledge modules discussed in this article have been developed as a tool to help ODI users get started with using the Sqoop tool with Oracle Data Integrator.

    These knowledge modules offer basic capabilities and options with the Sqoop tool.  They are a great resource for learning how to integrate data from relational databases into Hadoop using Sqoop and Oracle Data Integrator.

    Two ODI knowledge modules are featured in this article:

    • IKM SQL to HDFS File (Sqoop) – imports data from a relational database into a HDFS directory.

    • IKM SQL to HDFS Hive (Sqoop) – imports data from a relational database into a Hive table.

    Figure 2 shows these two knowledge modules in an ODI project called Movie.   Both knowledge modules are of type Integration.

    The Movie project contains ODI objects – mappings, packages, scenarios, and variables – to import data from an Oracle database into a HDFS cluster.  The Sqoop knowledge modules are used to perform this import operation.  The Oracle database contains data about movies such as movie titles, movie genres, and movie casts.

    This ODI project is discussed throughout this article, and it is used to demonstrate how to import data from relational databases into Hadoop with Sqoop and Oracle Data Integrator.

     


    Figure 2: Sqoop Knowledge Modules

     

    These two knowledge modules are compatible with Oracle Data Integrator 11g and Oracle Data Integrator 12c.  Both knowledge modules were developed using Sqoop version 1.4.3-cdh4.6.0.

    For additional documentation on Sqoop commands and features, go to “Sqoop User Guide”.

     

    IKM SQL to HDFS File (Sqoop)

     

    The IKM SQL to HDFS File (Sqoop) is designed to import data from a relational database into a HDFS directory of a Hadoop cluster.

    This knowledge module (KM) is recommended for users that would like to import raw data into a HDFS directory.  The raw data can then be used in MapReduce programs or as a source to create Hive tables for further analysis.

    Also, this knowledge module is recommended for users that would like to import data into a HDFS directory where the HDFS files are the source of a Hive external table.

    Importing raw data into a HDFS directory offers the following benefits: 

    • Raw data files can be used by both MapReduce programs, and Hive external tables. 

    • If the Hive external table is deleted, the raw data is retained, and it can still be used by other Hive tables or MapReduce programs.
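To illustrate the external-table use case, here is a hedged sketch of the Hive DDL one might declare over a directory loaded by IKM SQL to HDFS File (Sqoop). The path, columns, and delimiter are assumptions chosen for the example, and the DDL is submitted through the standard hive command-line interface; dropping such an external table leaves the raw HDFS files in place.

```python
# Declare a Hive external table over delimited files already landed in HDFS.
# The HDFS location, columns, and delimiter are illustrative placeholders.
import subprocess

hive_ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS tv_shows (
  movie_id    INT,
  title       STRING,
  movie_year  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/odi/movie_files/TV_SHOWS'
"""

# Run the DDL with the hive CLI; any tool that speaks HiveQL would do.
subprocess.run(["hive", "-e", hive_ddl], check=True)
```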

    If users would like to import data from a relational database directly into a Hive table, the IKM SQL to HDFS Hive (Sqoop) should be used instead.

     

    IKM SQL to HDFS Hive (Sqoop)

     

    The main function of the Sqoop tool is to upload data into files in the HDFS cluster.  However, Sqoop can also import data from a relational database into a Hive data warehouse.

    The IKM SQL to HDFS Hive (Sqoop) is designed to import data from a relational database into a Hive table.

    Importing data into a Hive data warehouse offers the following benefits:

• Large datasets can be manipulated with the Hive query language, HiveQL.  HiveQL allows users familiar with SQL to manipulate and retrieve data in a structured format (see the short example below).

• HiveQL supports custom scalar functions, aggregations, and table functions.

• Hive can scale out by dynamically adding more hardware to the Hadoop cluster; thus, massive amounts of data can be stored in a Hive data warehouse.

• Hive offers extensibility with the MapReduce framework.  MapReduce programmers can plug in their custom mappers and reducers with HiveQL to perform complex transformations that may not be supported by the built-in features of the HiveQL language.

    Hive is an open source technology under the Apache Software Foundation.  To learn more about Hive, go to “The Apache Hive Project”.
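As a quick, hedged illustration of the point about SQL-style access, the snippet below runs a simple HiveQL aggregation over a movie table such as the one loaded by IKM SQL to HDFS Hive (Sqoop). The database, table, and column names are assumptions, and the query is submitted through the hive command-line interface.

```python
# Aggregate a Hive table with plain HiveQL; names are placeholders only.
import subprocess

hql = """
SELECT movie_year, COUNT(*) AS nb_titles
FROM   moviedemo.movies
GROUP BY movie_year
ORDER BY movie_year
"""

subprocess.run(["hive", "-e", hql], check=True)
```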

     

    Knowledge Modules Tasks

     

    Figure 3 shows a list of tasks for both knowledge modules:  IKM SQL to HDFS File (Sqoop) and IKM SQL to HDFS Hive (Sqoop).  Both knowledge modules have the same number of tasks and task names, but the code generated by each task is different.  There are five knowledge module tasks:

• Generate Sqoop script – This step creates a shell script, the Sqoop script, in the Unix/Linux file system environment. The shell script contains the actual Sqoop command with all the necessary parameters to perform the Sqoop import operation. The output generated by the Sqoop import operation is redirected to a log file. (A simple sketch of these steps follows Figure 3.)

• Add execute to Sqoop script – This step changes the mode of the shell script to Execute (x), so it can be executed in the Unix/Linux environment. The file-mode change is done for all users (a+) in the Unix/Linux environment.

• Execute Sqoop script – This step executes the Sqoop script, and it performs the actual Sqoop import operation. If the return code of the Sqoop import operation is not equal to 0, an error is raised, and the step fails. If the return code of the Sqoop import operation is equal to 0, the import operation is successful.

• Remove Sqoop script – This step removes the Sqoop script created by the knowledge module.

• Remove Log file – This step removes the log file created by the knowledge module.

     


    Figure 3: Sqoop Knowledge Modules Tasks
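The sketch below mirrors those five tasks outside of ODI, purely as an illustration of what the generated script and its execution look like. It is not the knowledge module's actual code: the Sqoop options shown (--connect, --query with $CONDITIONS, --target-dir, --split-by, --num-mappers, --append, --password-file) are standard Sqoop import flags, and every host, path, credential, and table name is a placeholder.

```python
# Illustrative equivalent of the five KM tasks: generate the Sqoop shell script,
# make it executable, run it, then remove the script and its log file.
# All connection details, paths, and names are placeholders.
import os
import stat
import subprocess

script = "/tmp/sqoop_tv_shows.sh"
log = "/tmp/sqoop_tv_shows.log"

sqoop_cmd = (
    "sqoop import "
    "--connect jdbc:oracle:thin:@//dbhost:1521/MOVIEDB "
    "--username movie_demo --password-file /user/odi/.sqoop_pwd "
    "--query 'SELECT movie_id, title, movie_year FROM movies "
    "WHERE movie_year >= 2010 AND $CONDITIONS' "
    "--target-dir /user/odi/movie_files/TV_SHOWS "
    "--split-by movie_year --num-mappers 2 --append "
    f"> {log} 2>&1"
)

# Task 1: generate the Sqoop script
with open(script, "w") as f:
    f.write("#!/bin/sh\n" + sqoop_cmd + "\n")

# Task 2: add execute permission for all users (a+x)
os.chmod(script, os.stat(script).st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)

# Task 3: execute the script; a non-zero return code makes the step fail
subprocess.run([script], check=True)

# Tasks 4 and 5: remove the temporary script and log file
os.remove(script)
os.remove(log)
```

With --num-mappers greater than 1, a --split-by column is required, which matches the behavior of the knowledge module options described in the next section.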

     

    Knowledge Modules Options

     

Figures 4 and 5 illustrate the knowledge module options for the IKM SQL to HDFS File (Sqoop) and the IKM SQL to HDFS Hive (Sqoop), respectively.  By default, the knowledge module options have been configured as follows:

     

    • Both knowledge modules perform the Sqoop import operation in APPEND mode; existing data in the HDFS directory is preserved.

    • Both knowledge modules perform the import operation with 1 Sqoop mapper (parallel process).  This value can be increased, but a splitting column is required.

    • Temporary objects will be deleted once the Sqoop operation is completed successfully.

    • The Hive knowledge module, IKM SQL to HDFS Hive (Sqoop), includes additional options to control the target Hive table.  By default, the target Hive table will not be created, and its structure and content will not be replaced.

     


    Figure 4: IKM SQL to HDFS File (Sqoop) – Options

     


    Figure 5: IKM SQL to HDFS Hive (Sqoop) – Options

     

    • Table 1 below shows a list of options for both knowledge modules; each option is fully documented in this table.

    • Most of these options use ODI variables; thus, increasing the power and flexibility of the knowledge modules.  Examples of how to use ODI variables with these options will be illustrated in the following sections of this article.

    • Use the column called Knowledge Module, in Table 1 below, to determine which knowledge module uses the option.

     


    Table 1: Knowledge Modules Options

     

     

    Benefits of Using Sqoop with Oracle Data Integrator (ODI)

     

    Using Sqoop with Oracle Data Integrator offers great benefits.  Sqoop imports can be designed with both ODI 11g and ODI 12c.

The following sections outline some of the benefits of using ODI with Sqoop.  Most of the discussion in the following sections focuses on the benefits of using the new features of ODI 12c.

     

    Using ODI 12c Flow-Based Editor to Design Sqoop Imports

    The new flow-based editor in ODI 12c offers the following benefits when designing Sqoop imports:

    • ODI 12c components such as filters, joins, and splitters can be used when designing Sqoop imports with ODI mappings.

• ODI 12c reusable mappings can be used when designing Sqoop imports that make use of the same set of relational tables. ODI reusable mappings reduce the amount of effort it takes to design a mapping because reusable units of work can be incorporated in other mappings.

    • ODI 12c multi-target datastores can be designed with Sqoop imports. Multiple targets such as HDFS directories and Hive tables can all be designed in the same ODI mapping, increasing efficiency and enabling parallelism when loading data with the Sqoop tool.

    • Figure 6 shows an ODI 12c mapping, SQL to HDFS File (Sqoop), with two instances of the same reusable mapping, MY_TV_SHOWS and MY_DRAMAS, to import data from a set of relational tables into two separate target HDFS directories: TV_SHOWS, and DRAMAS.

     


Figure 6: ODI 12c Flow-Based Editor with Sqoop

     

    • In this example, a filter component is used to filter data by movie genre.  Figure 7 shows the filter condition for the target HDFS directory called TV_SHOWS.

     


    Figure 7: ODI 12c Filter Component

     

    • Figure 8 shows the ODI reusable mapping, MY_MOVIES, used by the ODI 12c mapping, SQL to File (Sqoop).  This reusable mapping contains the source relational tables used by Sqoop to load data into the HDFS directories.  

    • This reusable mapping uses other ODI components such as datasets, joins, filters, and distinct-sets.  The filter component filters data by movie year.

     


    Figure 8: ODI 12c Reusable Mapping for Sqoop



    Creating ODI 12c Deployment Specifications for Sqoop Imports

     

    Oracle Data Integrator 12c offers the capability of designing multiple physical implementations of the same ODI mapping.  This is known as ODI Deployment Specifications.  For additional information on how to create deployment specifications with ODI 12c, go to Creating Deployment Specifications with Oracle Data Integrator 12c.

     

    • Figure 9, section A, illustrates an ODI mapping with two deployment specifications: the Initial Import and the Incremental Import.

    • These two deployment specifications have been configured to use the IKM SQL to HDFS File (Sqoop) in both target datastores: TV_SHOWS and DRAMAS. Section B highlights these two target datastores.

     


    Figure 9: ODI 12c Deployment Specifications with Sqoop

     

• Figure 10 shows the Initial Import deployment specification for the target datastore called TV_SHOWS. This deployment specification imports data in OVERWRITE mode; thus, the existing data in the target HDFS directory is purged prior to adding the new dataset.

    • Two Sqoop mappers (parallel processes) are used with this deployment specification, since a significant amount of data is imported during the initial load.

    • A splitting column called MOVIE_YEAR is used to split the workload among the Sqoop mappers. The splitting column comes from the ODI distinct component called OUT_TV_SHOWS. Figure 9 above shows this component. OUT_TV_SHOWS is the same distinct component called MOVIES in the reusable mapping.

    • An ODI variable called VAR_MOVIE_YEAR is used as the suffix name for the temporary object names and the HDFS directory name.


    Figure 10: Deployment Specification for Sqoop Initial Imports

     

    • Figure 11 shows the Incremental Import deployment specification for the same target datastore, TV_SHOWS. In this example, the import operation is performed in APPEND mode; thus, the existing data in the target HDFS directory is preserved.

    • Since small datasets are imported during incremental loads, one Sqoop mapper is used with this deployment specification. As a result, a splitting column is not required.

    • The ODI variable called VAR_MOVIE_YEAR is also used as the suffix name for the temporary object names and the HDFS directory name.

     


    Figure 11: Deployment Specification for Sqoop Incremental Imports

     

     

    Using ODI 12c In-Session Parallelism with Sqoop

    Oracle Data Integrator 12c introduces a new feature that allows parts of an ODI mapping to execute in parallel.  This feature is configured in the physical deployment specification of the ODI mapping.  Using this feature with Sqoop adds an additional level of parallelism when importing data from relational databases into Hadoop clusters.

    • Figure 12 shows two execution strategies to import data into two HDFS directories: the Serial Execution and the Parallel Execution.

    • The Serial Execution strategy contains one execution unit, Movie Files, for both HDFS directories: TV_SHOWS and DRAMAS. The import operation of these two HDFS directories is done in serial mode because there is only one execution unit for both HDFS directories.

    • The Parallel Execution strategy contains two execution units, one for each HDFS directory. Multiple execution units within the same execution group run in parallel; thus, the import operation of these two HDFS directories is done in parallel.

• The parallel execution can be designed by selecting a datastore such as TV_SHOWS, and dragging the datastore outside of its execution unit. This action will automatically create another execution unit for the selected datastore in the same execution group, as shown in Figure 12.

     


    Figure 12: ODI 12c In-Session Parallelism with Sqoop

     

     

    • Figure 13 shows the Initial Import deployment specification for both HDFS directories, TV_SHOWS and DRAMAS. In this example, two sets of Sqoop imports run in parallel, reducing the amount of time it takes to load data from the relational database tables into the HDFS directories.

    • Notice that in Figure 13, the reusable mapping is being used in two separate execution units in the same execution group called SQL to HDFS Files – Initial Import.

     

     


    Figure 13: Two Parallel Imports with Sqoop and ODI

     

     

    • Figure 14 shows the session log for the Initial Import deployment specification.  Both execution units, Dramas and TV Shows, were executed in parallel.

     


    Figure 14: ODI 12c Session Log Parallel Execution with Sqoop

     

     

    Using ODI Packages with Sqoop Mappings

     

ODI packages can be used to implement additional parallelism when loading data from a relational database into a HDFS cluster.  For instance, an ODI package can be designed to launch multiple executions of the same ODI mapping in parallel.

     

• Figure 15 illustrates the design of an ODI package called PKG_SQL_TO_HDFS_FILES – INITIAL. The package flow loops 3 times and launches 3 executions of the same ODI scenario, SQL to HDFS Files, in parallel. This scenario is the compiled object of the ODI mapping that imports data from the Oracle relational database into the HDFS directories.

• In this ODI package, the SQL to HDFS Files scenario has been configured to run in Asynchronous (parallel) mode. In ODI, packages can execute scenarios in asynchronous mode, which means the package does not wait for the execution of the scenario. Instead, the package launches the scenario and the next step in the package is executed. The scenario then runs in parallel with other steps in the package (see the command-line sketch after Figure 15).

    • The ODI package shown in Figure 15 uses a variable called VAR_MOVIE_YEAR to filter and load data by movie year. This variable is also used by the IKM SQL to HDFS File (Sqoop) to create unique temporary object names, so multiple instances of the same scenario can run in parallel. The variable has been added into the knowledge module option called SUFFIX_NAME.

    Figure 15: ODI Package with Sqoop Scenario
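The same launch-and-continue idea can be illustrated from the command line, which may help clarify the asynchronous behavior the package relies on. This is only a hedged sketch: the scenario name, version, context, variable, and install path are assumptions based on the article's example, and startscen.sh is the standard script used to start an ODI scenario.

```python
# Launch three executions of the same scenario without waiting on any of them,
# each with a different VAR_MOVIE_YEAR value. Names and paths are placeholders.
import subprocess

ODI_BIN = "/u01/odi/agent/bin"  # assumed standalone agent install path

procs = []
for year in ("2010", "2011", "2012"):
    procs.append(subprocess.Popen(
        [f"{ODI_BIN}/startscen.sh",
         "SQL_TO_HDFS_FILES", "001", "GLOBAL",
         f"-GLOBAL.VAR_MOVIE_YEAR={year}"]))

# Optionally wait for all child sessions, much like watching them in the Operator.
for p in procs:
    p.wait()
```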

     

    • Figure 16 illustrates the ODI Operator with 3 scenarios running in parallel.  These scenarios are child sessions of the ODI package PKG_SQL_TO_HDFS_FILES – INITIAL.  Notice that the actual package has completed its execution, but all scenarios are in Running  status.  The session name of each scenario includes the actual value of the VAR_MOVIE_YEAR variable, the movie year.

     


    Figure 16: ODI Operator with Scenarios Running in Parallel

     

    In this section, three levels of parallelism have been discussed in great detail:

    • Level 1 – An ODI package can be used to launch multiple executions of the same ODI mapping in parallel.

• Level 2 – ODI 12c In-Session parallelism can be used when designing a Sqoop mapping with more than one HDFS directory or Hive table.

    • Level 3 – Sqoop mappers can be used to import data in parallel.

    Figure 17 shows these three levels of parallelism.  In this example, one ODI package launches three executions of the same ODI scenario in parallel.  Each ODI scenario loads data into each HDFS directory in parallel.  For each HDFS directory, two Sqoop mappers are used to load data in parallel.

     


    Figure 17: Levels of Parallelism with Sqoop and ODI

     

     

    Configuring Oracle Data Integrator for Sqoop Imports

     

    ODI Topology Configuration

     

    The Sqoop knowledge modules use two ODI technologies:  File and Hive.  The IKM SQL to HDFS File (Sqoop) uses the File technology and the IKM SQL to HDFS Hive uses the Hive technology.

    These two technologies must be configured in the ODI Topology Navigator before using the Sqoop knowledge modules.

    The following steps are a guideline on how to configure the ODI Topology for the File and Hive technologies.  For additional information on how to configure the ODI Topology, go to “Setting up the Oracle Data Integrator (ODI) Topology”.

     

The steps to configure Sqoop with the ODI Topology are as follows:

     

    • Create the necessary contexts based on your environment.

    • Using the File and Hive technologies, create the physical data servers.

    •  For each physical data server, File and Hive, create the physical schemas corresponding to the schemas containing the data to be integrated with ODI.

    • Using the File and Hive technologies, create the logical schemas and associate them with the corresponding File and Hive physical schemas and contexts.

• Create the physical and logical agents and associate them with the contexts.  The ODI agent must be located on the big data appliance where the Sqoop tool is installed, so it can execute the Sqoop scripts.  The physical agent must have access to both the Sqoop tool and the source relational database.

    • Figure 18 shows an example of how to configure the physical schema for the File technology.  The Directory Schema in Figure 18, movie_files, represents the HDFS directory used by the Sqoop knowledge module to store the HDFS data files.

    • The Directory Work Schema in Figure 18, /tmp, is the UNIX directory used by the Sqoop knowledge module to create temporary objects at runtime.  This UNIX directory is typically located in the same machine where the ODI agent is installed, so the agent can create the temporary objects at runtime.

     


    Figure 18: ODI Physical Schema for HDFS Files

     

• Figure 19 shows an example of how to configure the physical schema for the Hive technology.  The Directory Schema in Figure 19, moviedemo, is the name of the Hive database where the target Hive table is located.

    • The Directory Work Schema in Figure 19, moviework, is a HDFS directory used by the Sqoop knowledge module to import the source data files into a temporary location.  The source data files are then moved from the HDFS directory into the Hive warehouse directory.

     


    Figure 19: ODI Physical Schema for Hive Tables

     

    • Use the ODI Models section of the ODI Designer Navigator to create and configure your ODI models and datastores.  For more information on how to create Data Models and Datastores in ODI, see “Creating and Using Data Models and Datastores in Oracle Data Integrator”.

    • The following section describes how to configure your ODI mappings and interfaces with the Sqoop Knowledge modules.

     

    ODI 12c Mapping Configuration

     

    There are 4 basic configuration steps in order to enable the Sqoop knowledge modules in an ODI 12c mapping:

     

    • Create the desired number of physical deployment specifications for your ODI mapping.  Two deployment specifications are recommended:  the initial import, and the incremental import. 

• Select a deployment specification and identify the ODI access point of the target execution unit.  Figure 20 shows an example.  In this example, the ODI access point of execution unit TV_SHOWS is a filter called FILTER_AP.

     


    Figure 20: ODI Access Point Configuration

     

     

    • Using the Properties window of the ODI access point, expand the loading knowledge module option and select LKM SQL Multi-Connect.GLOBAL.  Figure 21 shows the loading knowledge module for the ODI access point called FILTER_AP.

     


    Figure 21: Load Knowledge Module Configuration

     

    • On the target execution unit, select the target datastore.  Using the Properties window of the target datastore, expand the integration knowledge module option and select the Sqoop integration knowledge module.

    • Figure 22 shows the Properties window of the target datastore called TV_SHOWS.   The integration knowledge module for this datastore is IKM SQL to HDFS File (Sqoop).

     


    Figure 22: Integration Knowledge Module Configuration

     

    ODI 11g Interface Configuration

     

    There are 6 basic configuration steps in order to enable the Sqoop knowledge modules in an ODI 11g interface:

     

    • Create your ODI interface and ensure that a target table has been already added in the interface.

    • Select the Overview tab of the ODI interface, and locate the Definition menu option.  Under the Definition menu option, check the option called “Staging Area Different from Target”.  Once this option is checked, a list of SQL technology servers will be available for selection.

    • Select the desired SQL technology server.  Figure 23 shows an example.  In this example, the SQL technology is Oracle, and the logical schema is called Movie Demo.  This is the logical schema of the source tables (the relational database) in the ODI interface.

     


    Figure 23: ODI 11g Interface – Definition

     

    • Select the Flow tab of the ODI interface to see the flow diagram.  Figure 24 shows an example.  Select the Staging Area box, and ensure that no loading knowledge module can be selected in this area.

    • Select the Target Area box as shown in Figure 24.

     


    Figure 24: ODI 11g Interface – Flow Tab

     

    • Using the Target Area Property Inspector, select the Sqoop integration knowledge module as shown in Figure 25.  Proceed to customize the knowledge module options based on your Sqoop import strategy.

     


    Figure 25: ODI 11g Interface – Target Area Property Inspector

     

    Conclusion

     

If your source data store is a relational database such as Oracle, Teradata, or MySQL, the Sqoop tool provides a fast and efficient way to extract and load data into your Hadoop Distributed File System.

    If you would like to download a copy of these knowledge modules, go to “Oracle Data Integrator Knowledge Modules – Downloads.” Search for “sqoop”.

    The examples discussed in this article have been developed with ODI 11.1.1 and ODI 12.1.3.  A copy of these ODI repositories with examples can be found at “ODI Repository Sample – SQL to HDFS Using Sqoop.”

     

    For more ODI best practices, tips, tricks, and guidance that the A-Team members gain from real-world experiences working with customers and partners, visit Oracle A-Team Chronicles for ODI.

     

    ODI 11g and 12c Repository Structures Available on MOS


    We always recommend that customers use the ODI SDK to access the information stored in the ODI repository: the repository structure evolves from version to version, and the SDK shelters developers from these structural changes.

    This said, many developers still prefer to write their own SQL queries to read from the repositories directly, even if this may mean a rewrite of these queries with each repository upgrade.

To help customers that prefer the SQL route, our support team has put together documents that describe the repository structure for ODI versions 11.1.1.7, 12.1.2 and 12.1.3.

To find these documents, log in to http://support.oracle.com and look for Doc ID 1903225.1: Oracle Data Integrator 11g and 12c Repository Description.

     

    For ODI best practices, tips, tricks, and guidance that the A-Team members gain from real-world experiences working with customers and partners, visit Oracle A-Team Chronicles for ODI.

    Using Oracle Partition Exchange with Oracle Data Integrator (ODI)


    Introduction

     

    This article presents a data integration recipe on how to use Oracle Partition Exchange with Oracle Data Integrator (ODI) to upload data very fast into partitioned tables of a large Oracle data warehouse.

    This article features a new ODI knowledge module (KM), the IKM Oracle Partition Exchange Load, which uses the Oracle Partition Exchange technology to upload data, by partition, into a partitioned table of an Oracle database.

    This KM supports three partitioning strategies:  range, hash, and list.  Single-level partitioning and composite partitioning (sub-partitions) are supported by this KM as well.

    Additionally, the KM includes options to support table maintenance operations that may be required before and after a partition exchange operation.  Some of these KM options include flow control, disabling and enabling table constraints, rebuilding local and global indexes, and gathering table statistics.

    If you would like to download a free copy of this KM, go to “Oracle Data Integrator Knowledge Modules – Downloads,” and search for “PEL“.  The KM is fully compatible with both ODI 11g and ODI 12c.  Additionally, an ODI 12.1.3 repository with examples can be found at “ODI Repository Sample for Oracle Partition Exchange.”

For a detailed description of how to use the KM options, and how to configure ODI with Oracle Partition Exchange, go to “Configuring Oracle Data Integrator (ODI) with Oracle Partition Exchange.”

    If your Oracle data warehouse has partitioned tables, Oracle Partition Exchange offers a fast method for uploading data into your large partitioned tables.

     

    Using Oracle Partition Exchange with Oracle Data Integrator (ODI)

     

    Oracle Partitioning is a feature of the Oracle database that allows the decomposition of very large database objects – such as tables and indexes – into smaller and more manageable pieces called partitions.

    When tables and indexes are partitioned, users and applications can query and manage data by partition or sub-partition; thus, the database can return and process data much faster than if the entire table or index is scanned.

    Oracle Partition Exchange is the ability to swap the data segment of a non-partitioned table with a partition of a partitioned table, or vice versa.   The benefit of Oracle Partition Exchange is that the exchange process is a data definition language (DDL) operation with no actual data movement; thus, the exchange operation is immediate – it only takes seconds.
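The exchange itself boils down to a single DDL statement: the data segment of a staging (flow) table swaps with the named partition, and no rows are copied. Here is a minimal sketch; the table, partition, and connection names are illustrative assumptions, and python-oracledb is used only to show the statement being issued.

```python
# Swap the data segment of a staging table with one partition of the target.
# This is data dictionary manipulation only, so it completes in seconds.
# Table, partition, and connection details below are placeholders.
import oracledb

exchange_ddl = """
ALTER TABLE orders_fact
  EXCHANGE PARTITION jul2014
  WITH TABLE i$_orders_fact
  INCLUDING INDEXES
  WITH VALIDATION
"""

with oracledb.connect(user="dwh", password="change_me", dsn="dwh-host/DWPDB") as conn:
    conn.cursor().execute(exchange_ddl)
```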

    On very large databases (VLDBs) such as Oracle data warehouses, Oracle Partition Exchange can facilitate high-speed uploads of new and incremental data into partitioned objects such as facts, cubes, and large dimension tables.

    This article presents a new KM called the IKM Oracle Partition Exchange Load.  This new KM allows users to create and design ODI mappings that can take advantage of the Oracle Partition Exchange technology.

    ODI mappings can now exchange or swap the content of a source dataset with a partition of a partitioned table.  The IKM Oracle Partition Exchange Load replaces traditional data upload techniques – such as insert, update, and merge operations – used in other ODI Integration KMs (IKMs) that have been designed for the Oracle technology.

    This article focuses on how to use the IKM Oracle Partition Exchange Load with Oracle Data Integrator to accelerate the speed of your data uploads for your very large Oracle partitioned tables.

    For more information on the benefits of using Oracle Partitioning on very large databases, see Oracle Database VLDB and Partitioning Guide and Using Oracle Partitioning in a Data Warehouse Environment.

    For more information on how to use Oracle Partition Exchange, see section Exchanging Partitions of the Oracle Database and Partitioning Guide.

     

    Benefits of Using Oracle Partition Exchange

     

    Oracle Partition Exchange is a great database tool that can be used in data integration activities to upload data very fast into partitioned tables of an Oracle database.

    Table 1 summarizes some of the most important benefits of using Oracle Partition Exchange for data integration activities:

     


    Table 1: Benefits of Using Oracle Partition Exchange

     

     When is Oracle Partition Exchange a Suitable Option?

     

    Table 2 shows examples where Oracle Partition Exchange can be a suitable option for an ELT data integration environment.

     


    Table 2: When is Oracle Partition Exchange a Suitable Option

     

    IKM Oracle Partition Exchange Load: Overview

     

    The IKM presented in this article, the IKM Oracle Partition Exchange Load, uses the Oracle Partition Exchange technology to upload data into a partition of a partitioned table.

    In an ODI mapping, the partitioned table is the target datastore of a data flow.  The source data for the partitioned table comes from the ODI flow table, also known as the ODI integration (I$) table.  The source data for the ODI flow table comes from the source-dataset defined in the ODI mapping.  The ODI flow table holds the transformed data for the partitioned table.

In most Oracle-based IKMs, the data in the ODI flow table is added into the target table by issuing additional database operations such as insert, update, and merge statements.  These additional database operations require more table space, memory, and database resources, thus increasing the overall time it takes to run the ELT data integration activities.

    The knowledge module presented in this article exchanges the partition of a partitioned table with the ODI flow table; thus, eliminating the need for additional database operations during the seeding of the partitioned table.

    Figure 1 shows how the IKM Oracle Partition Exchange Load performs the partition exchange operation for a table called Orders Fact. This fact table has been partitioned by month.

     


    Figure 1: IKM Oracle Partition Exchange Load – Sample Load

     

    In the above example, a staging table called Orders is joined with three warehouse dimensions – Customer, Product, and Status – and a filter is used to select data for the month of July 2014.  At a minimum, the knowledge module performs the following three tasks:

     

    • Joins the source datastores, applies the data filter, and transforms the source data.
    • Loads the transformed data into the ODI flow table.
    • Exchanges partition JUL2014 with the ODI flow table.

    Some of the knowledge module tasks – such as disabling and enabling constraints, rebuilding indexes, and collecting statistics – are optional.  These optional tasks can be controlled by the user via the knowledge module options.  The next section of this article presents an overview of the KM tasks.

     

    IKM Oracle Partition Exchange Load:  Tasks

     

    Figure 2 shows a list of KM tasks.  All tasks are based on the Oracle technology.

     


    Figure 2: IKM Oracle Partition Exchange Load Tasks

     

    Some of the knowledge module tasks above – such as Inserting data into flow table – are mandatory, and they are always executed by the knowledge module. Other knowledge module tasks – such as Flow control – are optional, and they are controlled by the knowledge module options.

    Although mandatory tasks are always executed by the knowledge module, they may not have any effect on the overall execution of the data flow.  For instance, if the partitioned table does not have a primary key, the Create PK on flow table task will not add a primary key to the flow table.

    Only one task, Drop flow table, is configured to ignore errors.  All other tasks will stop running if errors are detected during the execution of the KM.

     

     

    IKM Oracle Partition Exchange Load:  Execution Flow

     

Figure 3 shows the execution flow of the IKM Oracle Partition Exchange Load (a short SQL-level sketch of a few of these steps appears after the figure).  A summary of the KM execution flow follows:

     

    • The KM enables parallel DML (data manipulation language) for the current database session.  Thus, the Oracle database will attempt to execute all DML operations of the KM in parallel.
    • If the user chooses to enable incremental statistics on the partitioned table, the KM modifies the statistics preference of the partitioned table to incremental.  Also, if the user chooses to publish statistics on the partitioned table, the publish preference of the partitioned table is set to true.
    • If the user chooses to disable the constraints of the partitioned table, the constraints are disabled before creating the flow table.  If the flow table exists, the knowledge module drops it.
    • The flow table is created using the same table structure as the target partitioned table.  If the partitioned table has sub-partitions, the flow table is created with partitions that resemble the sub-partitions.
    • Data is inserted into the flow table by joining the source datastores, applying the data filters, and transforming the source data.
    • If the user chooses to activate flow control, the KM invokes the check knowledge module (CKM) for the Oracle technology (CKM Oracle) to validate the data in the flow table.  Invalid rows will be copied into error tables and removed from the flow table.
    • Once data is validated by the CKM, the constraints of the partitioned table are added into the flow table.  This includes primary keys, foreign keys, unique keys, and check constraints.  If the partitioned table has local indexes, they are added into the flow table as well.
    • If the user chooses to lock the partition of the partitioned table, the partition is locked before the exchange operation.  The partition exchange operation is performed by swapping a specified partition with the flow table.
    • If the user chooses to rebuild local or global indexes on the partitioned table, unusable indexes are rebuilt after the partition exchange operation.  If the user chooses to enable the constraints of the partitioned table, the constraints are enabled after the rebuild of the indexes.
    • If the user chooses to gather table statistics, statistics are gathered for the partitioned table.  Incremental statistics are gathered if the user enabled incremental statistics on the partitioned table.
    • Finally, the flow table is dropped, and the parallel DML session is disabled.

     


    Figure 3: IKM Oracle Partition Exchange Load Execution Flow
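To make a few of these steps more tangible, here they are expressed as plain SQL statements issued over a regular connection. This is not the KM's generated code, only a hedged sketch of a subset of the flow (session setup, flow table creation, index rebuild, incremental statistics); the partition exchange statement itself was shown earlier, and every object name below is a placeholder.

```python
# A subset of the Figure 3 flow expressed as SQL; object names are placeholders.
import oracledb

steps = [
    # Allow the session's DML to run in parallel
    "ALTER SESSION ENABLE PARALLEL DML",
    # Create the flow table with the same structure as the partitioned target
    "CREATE TABLE i$_w_orders_f AS SELECT * FROM w_orders_f WHERE 1 = 0",
    # ... data load, flow control, and the EXCHANGE PARTITION step go here ...
    # Rebuild an index left UNUSABLE by the exchange
    "ALTER INDEX w_orders_f_pk REBUILD",
    # Gather incremental statistics on the partitioned table
    """BEGIN
         DBMS_STATS.SET_TABLE_PREFS('DWH', 'W_ORDERS_F', 'INCREMENTAL', 'TRUE');
         DBMS_STATS.GATHER_TABLE_STATS('DWH', 'W_ORDERS_F');
       END;""",
]

with oracledb.connect(user="dwh", password="change_me", dsn="dwh-host/DWPDB") as conn:
    cur = conn.cursor()
    for stmt in steps:
        cur.execute(stmt)
```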

     

     

    IKM Oracle Partition Exchange Load:  Options

     

    Figure 4 shows a list of the KM options with default values.

     

Figure 4: IKM Oracle Partition Exchange Load Options

     

The default values for the KM options are as follows:

     

    • The Partition Name option uses the ODI getPop method to get the partition name from the Partition/Sub-Partition option – a property of the datastore in the ODI mapping.
• The Degree of Parallelism option has been set to PARALLEL; thus, the Oracle database will determine the optimal degree of parallelism to be used when loading data into the flow table.
    • The Select Optimizer Hint option has no value, but the user can specify an optimizer hint to speed up the query that selects data from the source datastores.
    • The Flow Control option has been set to true; thus, the data in the flow table will be validated before the partition exchange operation.
    • The Lock Partition option has been set to false.  The partition will not be locked before the exchange operation.
    • The Partition Exchange Options have been configured to perform the exchange operation with validation; thus, the data will be validated against the constraints of the partitioned table.
    • The Delete Temporary Objects option has been set to true.  All temporary objects will be deleted once the KM completes its execution successfully.
    • All other KM options have been set to false.

     

    All these options can be configured, changed, and customized by the user.  Some of these options can be configured with ODI variables as well.

     

    Loading Partitioned Tables with Oracle Partition Exchange and ODI

     

Using Oracle Partition Exchange to perform the data upload operation of large partitioned tables offers tremendous performance benefits.  Since the partition exchange operation only involves data dictionary updates, Oracle Partition Exchange is the fastest method for uploading data into large partitioned tables.

    Oracle Partition Exchange can be used to orchestrate both the initial and the incremental data upload operations of partitioned tables in a very large data warehouse.

    The following sections discuss how to use Oracle Partition Exchange with ODI to perform both the initial and the incremental data upload operations of large partitioned tables.

     

    Initial Load of a Partitioned Table with Oracle Partition Exchange and ODI

    Oracle Partition Exchange is a great tool for performing the initial data upload operation of a large partitioned table.  For instance, the initial upload operation of a large partitioned table can be orchestrated in parallel, and multiple partitions can be loaded and exchanged asynchronously.

    This section illustrates how to design and orchestrate the initial data upload operation of a large partitioned table with ODI and the IKM Oracle Partition Exchange Load.

     

    Initial Load of a Partitioned Table: The Logical Design

     

    The ODI flow-based mapping editor offers great features that can be used to design the initial data upload operation of a partitioned table.  ODI components such as filters, joins, lookups, and datasets can be used to upload data into partitioned tables.

    Figure 5 illustrates the logical design of an ODI mapping that uploads data by partition into a warehouse fact partitioned table called W_ORDERS_F.  This mapping is designed to perform the initial data upload operation for this partitioned table (W_ORDERS_F).

     

Figure 5: ODI Mapping – Initial Load Design

     

    The ODI mapping, above, includes components such as datasets, lookups, and filters to select data from a group of source tables.  The source table called ORDERS is joined with three warehouse dimensions – W_CUSTOMER_D, W_PRODUCT_D, and W_STATUS_D – and it has a filter to select data by month.

The customer dimension, W_CUSTOMER_D, is a Type-2 dimension, and it has a filter to select the current record of a given customer.  The target datastore, W_ORDERS_F, is the partitioned table, which has been partitioned by month.


    Initial Load of a Partitioned Table: The Physical Design

     

    ODI 12c offers the capability to design multiple implementations of the same ODI mapping – this is known as ODI Physical Deployment Specifications.

    Figure 6 illustrates an ODI mapping with three physical deployment specifications:  First Partition Load, Next Partition Load, and Last Partition Load.  These three deployment specifications use the IKM Oracle Partition Exchange Load to perform the initial upload operation of the partitioned table (W_ORDERS_F).

     

Figure 6: ODI Mapping – Three Physical Deployment Specifications

     

    Figure 7 shows the knowledge module options for the first deployment specification, First Partition Load.

     

Figure 7: ODI Deployment Specification – First Partition Load

     

    The first deployment specification (First Partition Load), Figure 7 above, performs initial table maintenance operations such as disabling table constraints, enabling incremental statistics, and enabling publish statistics on the partitioned table.

    Flow control is activated to validate the source dataset against the constraints of the partitioned table.  The data validation is done by the Oracle check knowledge module (CKM Oracle).  If invalid records are found in the flow table, they are copied into an error table and removed from the flow table before the exchange operation.  Hence, the exchange operation can be done without validation, since data is guaranteed to meet the constraints of the partitioned table.

    This deployment specification loads the first partition and rebuilds the local indexes.

    Figure 8 shows the knowledge module options for the next deployment specification, Next Partition Load.

     

     

Figure 8: ODI Deployment Specification – Next Partition Load

     

    The next deployment specification (Next Partition Load), Figure 8 above, performs partition exchange operations for all subsequent partitions, except the last partition of the partitioned table.

    Flow control is also used in this deployment specification.  Local indexes are rebuilt after every partition exchange load.  Figure 9 shows the knowledge module options for the last deployment specification, Last Partition Load.

     

Figure 9: ODI Deployment Specification – Last Partition Load

     

    The last deployment specification (Last Partition Load), Figure 9 above, loads the last partition of the partitioned table.  Flow control is also used in this deployment specification.  Additional table maintenance operations are performed with this deployment specification:   table constraints are enabled, global indexes are rebuilt, and incremental statistics are gathered.

    All three deployment specifications use an ODI variable called #A_TEAM.PARTITION_MONTH.  This ODI variable will be refreshed in an ODI package at runtime.

    For additional information on how to design deployment specifications with ODI 12c, go to Creating Deployment Specifications with ODI 12c.

     

     

    Initial Load of a Partitioned Table: The ELT Orchestration

     

The IKM Oracle Partition Exchange Load can be used in conjunction with ODI packages and ODI load plans to orchestrate the entire data upload operation of a partitioned table.  For instance, an ODI package can be designed to upload multiple partitions of the same partitioned table in parallel.

    Figure 10 shows the design flow of an ODI package called PARTITION_EXCHANGE_LOAD.  This package uses ODI scenarios and ODI variables to orchestrate the initial data upload operation for a partitioned table called W_ORDERS_F.

     

     

Figure 10: ODI Package – Partition Exchange Load

     

    The package above invokes three ODI scenarios: Load First Partition, Load Next Partition, and Load Last Partition.  All three ODI scenarios have been generated from the same mapping, using the appropriate deployment specification each time.

The package counts the total number of partitions found in the partitioned table.  For each partition found, the package refreshes an ODI variable with the partition name and proceeds to upload that partition.  The Load Next Partition scenario runs in asynchronous mode; thus, it performs the partition uploads in parallel.
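For illustration only, the partition list that drives such a package is typically read from the Oracle data dictionary; a query of the following shape (the schema name is hypothetical) returns the partitions of W_ORDERS_F in positional order and can be used to refresh a partition-iterating ODI variable.

-- Hypothetical refresh query for a partition-iterating ODI variable
SELECT partition_name
FROM   all_tab_partitions
WHERE  table_owner = 'DWH'
AND    table_name  = 'W_ORDERS_F'
ORDER  BY partition_position;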

    Figure 11 shows the execution logs for this package.

     

Figure 11: Partition Exchange Load Package – Execution Log

     

    In the example above, the first partition load (JAN2014, at the bottom of the list) was performed by the scenario called Load First Partition.  This scenario executed the mapping with the physical deployment specification called First Partition Load.

The next set of partition uploads (FEB2014 through NOV2014) was performed by the scenario called Load Next Partition.  This scenario executed the mapping with the physical deployment specification called Next Partition Load.  This scenario executed in asynchronous mode; thus, all partitions were loaded in parallel.

    The last partition load (DEC2014) was performed by the scenario called Load Last Partition.  This scenario executed the mapping with the physical deployment specification called Last Partition Load.

    All three scenarios used the IKM Oracle Partition Exchange Load to upload data into the partitioned table.

    Alternatively, ODI load plans can be used to orchestrate partition uploads as well.  ODI load plans offer additional features such as exception handling, parallelism, and restartability.  For additional information on how to use ODI load plans, go to “Oracle Fusion Middleware Developer’s Guide for Oracle Data Integrator – Using Load Plans.”

    Incremental Load of a Partitioned Table with Oracle Partition Exchange and ODI

     

    Oracle Partition Exchange can be used to perform incremental data uploads for partitioned tables.  For instance, Oracle Partition Exchange can be used to perform daily data uploads for tables that have been partitioned by month.

    In ODI, components such as SET and LOOKUP can be used to merge daily datasets with the existing data of a partition, and Oracle Partition Exchange can be used to replace the content of the partition with the merged dataset.

    This section illustrates how to design and orchestrate the incremental data upload operation of a large partitioned table with ODI and the IKM Oracle Partition Exchange Load.

     

    Incremental Load of a Partitioned Table: The Logical Design

     

    The ODI flow-based mapping editor offers additional components such as SET and LOOKUP to design incremental data uploads for partitioned tables.  For instance, if a table is partitioned by month, and data is uploaded once a day into the partitioned table, a SET component can be used to perform the union operation of an incremental dataset with the existing data of a partition.

    Figure 12 illustrates the logical design of an ODI mapping that uploads incremental data by partition into the partitioned table called W_ORDERS_F.

    This mapping uses the SET component to perform a union of the incremental dataset with the existing data of a given partition.  The partitioned table is used as an input source for the SET component (W_ORDERS_F_AS_OF_TODAY).

    W_ORDERS_F_AS_OF_TODAY has been configured to select data for a given partition.   The partition name is set in the Partition/Sub-Partition property of the datastore, and an ODI variable (#A_TEAM.PARTITION_MONTH) is used to select the partition name dynamically.

     

Figure 12: ODI Mapping – Incremental Load Design

     

Figure 13 illustrates how the SET component has been configured.  This component uses two input connector points: DATASET and W_ORDERS_F_AS_OF_TODAY.

     

     

Figure 13: ODI Mapping – Set Component Attributes & Operators

     

    The SET component (above) uses the UNION operator to merge the incremental dataset (DATASET) with the existing data of the partition (W_ORDERS_F_AS_OF_TODAY).   Thus, the flow table is populated with both datasets.
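Conceptually, the flow-table load generated for such an incremental run has the shape below.  The object names are hypothetical, and the actual statement generated by the KM is more detailed.

-- Conceptual shape of the incremental flow-table load (names are hypothetical)
INSERT /*+ APPEND */ INTO flow_w_orders_f
SELECT o.*                                   -- incremental (daily) dataset
FROM   orders_stage o
UNION
SELECT f.*                                   -- existing rows of the partition being rebuilt
FROM   w_orders_f PARTITION (jan2014) f;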

     

    Incremental Load of a Partitioned Table: The Physical Design

     

    Figure 14 shows the knowledge module options for the incremental data upload operation.

    In this example, the Flow Control option is used to validate the new dataset against the constraints of the partitioned table.  The data validation is performed by the Oracle check knowledge module (CKM Oracle).  If invalid records are found in the flow table, they are copied into an error table and removed from the flow table before the exchange operation.

    Local indexes are included during the partition exchange operation.  This is done by using the INCLUDING INDEXES value in the Partition Exchange Options.  Thus, there is no need to rebuild local indexes after the exchange operation.

The exchange operation is performed with validation.  This is done by using the WITH VALIDATION value in the Partition Exchange Options.  Since data is already validated by the check knowledge module prior to the exchange operation, the user could alternatively choose to perform the exchange operation without validation.

    The KM does not disable or enable the table constraints for the partitioned table during the partition exchange operation.  However, in the database, the table constraints for the partitioned table are enabled.

    Global indexes are updated – in parallel – during the partition exchange operation.  This is done by using the UPDATE INDEXES PARALLEL value in the Partition Exchange Options.  Thus, global indexes will stay usable during the exchange operation.

    Table statistics will be gathered after the partition exchange operation.  The statistic and publish preferences of the partitioned table were enabled during the initial upload operation.  Thus, incremental statistics will be gathered.

     

Figure 14: ODI Deployment Specification – Incremental Partition Load

     

    The examples discussed in the above sections have been developed with ODI 12c.  If you would like to download a copy of these examples, go to “ODI Repository Sample for Oracle Partition Exchange.”  Additional examples are also available in this ODI repository.

     

For a detailed description of how to use the KM options and how to configure ODI with Oracle Partition Exchange, go to “Configuring Oracle Data Integrator (ODI) with Oracle Partition Exchange.”

     

     

    Combining Oracle Partition Exchange with Oracle Data Pump

    If the source data of your partitioned table is located in another Oracle data server, consider using Oracle Data Pump to extract and load data into your partitioned target table.  Oracle Data Pump is the fastest way to extract and load data between Oracle data servers.

The combination of using Oracle Data Pump with Oracle Partition Exchange offers tremendous performance benefits.  If you would like to learn more about using Oracle Data Pump with ODI, please visit “Using Oracle Data Pump with Oracle Data Integrator (ODI).”

    Conclusion

     

If your Oracle data warehouse has partitioned tables, Oracle Partition Exchange offers a fast method for uploading data into your large partitioned tables.

If you would like to download a free copy of the KM discussed in this article, go to “Oracle Data Integrator Knowledge Modules – Downloads,” and search for “PEL”.  The KM is fully compatible with both ODI 11g and ODI 12c.  Additionally, an ODI 12.1.3 repository with examples can be found at “ODI Repository Sample for Oracle Partition Exchange.”

For more Oracle Data Integrator best practices, tips, tricks, and guidance that the A-Team members gain from real-world experiences working with customers and partners, visit Oracle A-Team Chronicles for Oracle Data Integrator (ODI).

     

    Related Articles

    Configuring Oracle Data Integrator (ODI) with Oracle Partition Exchange

    Using Oracle Data Pump with Oracle Data Integrator (ODI)

     

    Configuring Oracle Data Integrator (ODI) with Oracle Partition Exchange


    Introduction

     

    This article describes the steps required to enable Oracle Partition Exchange in Oracle Data Integrator (ODI).

    The article features a new ODI knowledge module (KM), the IKM Oracle Partition Exchange Load, which uses the Oracle Partition Exchange technology to upload data, by partition, into a partitioned table of an Oracle database.

    This KM supports three partitioning strategies:  range, hash, and list.  Single-level partitioning and composite partitioning (sub-partitions) are supported by this KM as well.

    Additionally, the KM includes options to support table maintenance operations that may be required before and after a partition exchange operation.  Some of these KM options include flow control, disabling and enabling table constraints, rebuilding local and global indexes, and gathering table statistics.

If you would like to download a free copy of this KM, go to “Oracle Data Integrator Knowledge Modules – Downloads,” and search for “PEL”.  The KM is fully compatible with both ODI 11g and ODI 12c.  Additionally, an ODI 12.1.3 repository with examples can be found at “ODI Repository Sample for Oracle Partition Exchange.”

For an overview of the benefits of using Oracle Partition Exchange with ODI and how to use the IKM Oracle Partition Exchange Load to perform data uploads for large partitioned tables, go to “Using Oracle Partition Exchange with Oracle Data Integrator (ODI).”

     

    Main Article: Configuring Oracle Data Integrator (ODI) with Oracle Partition Exchange

     

    In order to configure ODI with Oracle Partition Exchange, the user must perform three configuration tasks:

    • Configure the IKM Oracle Partition Exchange Load options
    • Configure the database privileges required for Oracle Partition Exchange
    • Add the table partitions of the partitioned table into Oracle Data Integrator

    The following sections describe how to perform these three configuration tasks.

     

    Configuring the IKM Oracle Partition Exchange Load Options

     

    The IKM Oracle Partition Exchange Load offers various options to efficiently manage your partition exchange upload operations.  Figure 1 shows a list of the KM options with default values.

     

Figure 1: IKM Oracle Partition Exchange Load Options

The default values for the KM options are as follows:

    • The Partition Name option uses the ODI getPop method to get the partition name from the Partition/Sub-Partition option – a property of the datastore in the ODI mapping.
• The Degree of Parallelism option has been set to PARALLEL; thus, the Oracle database will determine the optimal degree of parallelism to be used when loading data into the flow table.
    • The Select Optimizer Hint option has no value, but the user can specify an optimizer hint to speed up the query that selects data from the source datastores.
    • The Flow Control option has been set to true; thus, the data in the flow table will be validated before the partition exchange operation.
    • The Lock Partition option has been set to false.  The partition will not be locked before the exchange operation.
    • The Partition Exchange Options have been configured to perform the exchange operation with validation; thus, the data will be validated against the constraints of the partitioned table.
    • The Delete Temporary Objects option has been set to true.  All temporary objects will be deleted once the KM completes its execution successfully.
    • All other KM options have been set to false.

    All these options can be configured, changed, and customized by the user.  Some of these options can be configured with ODI variables as well.  The following sections describe in detail how to configure these KM options.

     

    Partition Name

     

    This option allows the user to specify the partition name that the knowledge module will use to upload data into a partitioned table.  The user can specify the partition name in three different ways:

     

    • Use the default value for this option and specify the partition name in the logical diagram of the ODI mapping.
    • Use an ODI variable that contains the partition name, so that the name can be set dynamically.
    • Type in the actual partition name (hard-coded value).

    Using the Default Value

  • Figure 2 shows the default value for this option: <%=odiRef.getPop("PARTITION_NAME")%>.
  • The default value uses the ODI getPop method to get the partition name from a parameter value called PARTITION_NAME.
      • PARTITION_NAME is the value found in the Partition/Sub-Partition option of the ODI datastore.
      • The Partition/Sub-Partition option can be found in the Logical Diagram of the ODI mapping, under the General properties of the datastore (the partitioned table).
      • If the user chooses to use the default value for this option, the database-defined partitions must be imported into the ODI Model, and the partition name must be selected in the Partition/Sub-Partition option of the datastore.
      • For detailed instructions on how to import the database-defined partitions into your ODI Models and how to set the Partition/Sub-Partition option for an ODI datastore, see section “Adding Your Table Partitions in Oracle Data Integrator”.

     

Figure 2: Knowledge Module Option – Partition Name Default Value

     

    Using an ODI Variable

    • If the user chooses to use an ODI variable for this option, the variable name must be prefixed with the ODI project code.
    • Figure 3 shows an example of how to specify an ODI variable for this option.
    • There is no need to import the database-defined partitions if an ODI variable is used with this option.

     

Figure 3: Knowledge Module Option – Partition Name Variable

     

    Using a Hard-coded Value

    • If the user chooses to specify the actual partition name, only the specified partition can be used to upload data into the partitioned table.
    • Figure 4 shows an example of a partition name used for this option: JUL2014.

     

Figure 4: Knowledge Module Option – Partition Name

     

    Degree of Parallelism

    This option allows the user to specify the degree of parallelism or the number of parallel threads that the Oracle database will use when inserting data into the ODI flow table.

    The value for this option can be hard-coded or an ODI variable containing the degree of parallelism can be used instead.

Figure 5 shows the default value for this option: PARALLEL.  When using the default value for this option, the degree of parallelism is determined by the Oracle database as follows:

     

    Degree of Parallelism = Number of CPUs available on all participating database instances * PARALLEL_THREADS_PER_CPU initialization parameter

     

Figure 5: Knowledge Module Option – Degree of Parallelism

     

Optionally, the user can specify a number of parallel threads that the Oracle database will use when inserting data into the ODI flow table.  The syntax to specify the number of parallel threads is as follows:

     

    PARALLEL <number_of_threads>

     

    The <number_of_threads> is an integer value.  Each parallel thread may use one or two parallel execution servers.  Usually, the Oracle database calculates the optimum degree of parallelism, so it is not necessary to specify the number of threads.
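As an example, assuming a hypothetical flow table called flow_w_orders_f and a hypothetical source called orders_stage, forcing a degree of parallelism of 8 for the flow-table load would produce statements of this shape (a sketch only, not the KM's generated code):

-- Illustrative only: explicit degree of parallelism for the flow-table insert
ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ APPEND PARALLEL(flow_w_orders_f, 8) */ INTO flow_w_orders_f
SELECT /*+ PARALLEL(o, 8) */ o.*
FROM   orders_stage o;
COMMIT;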

For additional information on how to use the degree of parallelism in an Oracle database, go to “Database VLDB and Partitioning Guide, Degree of Parallelism.”

     

    Select Optimizer Hint

     

    Oracle Optimizer Hints offer a mechanism to instruct the Oracle Optimizer to choose a certain query execution plan based on a criteria specified by the user.  Oracle Optimizer Hints can offer great performance benefits when they are used in conjunction with ODI knowledge modules.

    The KM uses an INSERT AS SELECT (IAS) statement to populate the ODI flow table.  This KM option allows the user to specify an Optimizer hint after the SELECT keyword of the IAS statement.

By default, this option has no value.  The user can specify the actual Optimizer hint or use an ODI variable containing the hint.  The syntax for defining an Oracle Optimizer Hint is as follows:

     

    /*+ hint [text] [hint[text]]… */

     

    For example, Figure 6 shows an access path hint that instructs the Oracle Optimizer to use an index (ORDER_DT_IDX) when selecting data from a table called ORDER_DETAIL.
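Assuming hypothetical flow-table, column, and filter names, the hint from Figure 6 would appear in the generated INSERT AS SELECT statement roughly as follows:

-- Sketch only: the hint is injected after the SELECT keyword of the IAS statement
INSERT /*+ APPEND */ INTO flow_w_orders_f
SELECT /*+ INDEX(ORDER_DETAIL ORDER_DT_IDX) */ *
FROM   order_detail
WHERE  order_dt >= DATE '2014-01-01';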

     

Figure 6: Knowledge Module Option – Select Optimizer Hint

     

    When changes are made to the database, Optimizer hints can become obsolete or even have a negative impact on the execution of the IAS statement; thus, it is recommended to store the Optimizer hints in ODI variables.  A database configuration table can be used to maintain the Optimizer hints, and the ODI variables can source the hint information from this configuration table.

    Figure 7 shows an example of how to specify an ODI variable with this option.

     

Figure 7: Knowledge Module Option – Select Optimizer Hint – ODI Variable

     

    For additional information on how to use Oracle Optimizer Hints, refer to “Oracle Database Performance Tuning Guide, Using Optimizer Hints.”

     

    Flow Control

     

    If set to True, this option invokes the check knowledge module (CKM) for the Oracle technology (CKM Oracle) to validate the data stored in the ODI flow table before the partition exchange operation.

    The data stored in the ODI flow table is validated against the constraints of the partitioned table.  If invalid records are found in the flow table, they are copied into an error table and removed from the flow table before the partition exchange operation.

    Set this option to true if the data in the flow table is not guaranteed to meet the constraints of the partitioned table, and invalid records must be removed from the flow table before the exchange operation.

    If the data stored in the flow table does not meet the constraints of the partitioned table and this KM option is not set to true, the partition exchange operation will fail.

    Set this option to false if the data stored in the flow table is guaranteed to meet the constraints of the partitioned table.
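As a simplified illustration of what flow control does before the exchange, checking a single NOT NULL rule amounts to statements of the following shape.  The CKM Oracle generates its own, more general code, and the error-table name and columns shown here are hypothetical.

-- Illustration only: isolate and remove rows violating a NOT NULL rule
INSERT INTO err_w_orders_f (order_id, customer_wid, err_mess)
SELECT order_id, customer_wid, 'CUSTOMER_WID is null'
FROM   flow_w_orders_f
WHERE  customer_wid IS NULL;

DELETE FROM flow_w_orders_f
WHERE  customer_wid IS NULL;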

    Figure 8 shows the default value for this option.  By default, this option is set to true.

     

Figure 8: Knowledge Module Option – Flow Control

     

Table constraints such as check constraints, not null constraints, primary keys, alternate keys, and unique keys must be reverse-engineered in ODI prior to using the check knowledge module.

    For additional information on how to use check knowledge modules and reverse-engineering knowledge modules (RKM), go to “Oracle Fusion Middleware Knowledge Module Developer’s Guide for Oracle Data Integrator.”

     

    Lock Partition

     

    If set to True, this option locks the partition of the partitioned table before the partition exchange operation.

    The knowledge module performs the locking of the partitioned table at the partition level only.  If the table is composite-partitioned, then the database locks all the sub-partitions of the specified partition.

    The knowledge module instructs the Oracle database to lock the partition in share mode.  Hence, users can query the locked partition.

    When the partition is locked, no other updates can be performed against the locked partition.  The locked partition will remain locked until the partition exchange operation is complete.
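In the database, such a partition-level share lock is obtained with a statement of the following shape; the partition name is hypothetical:

-- Share-mode lock on a single partition; readers are still allowed
LOCK TABLE w_orders_f PARTITION (jan2014) IN SHARE MODE;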

    Figure 9 shows the default value for this option.  By default, this option is set to false.

     

Figure 9: Knowledge Module Option – Lock Partition

     

    Partition Exchange Options

     

    During the partition exchange operation, additional options can be specified to control the behavior of the exchange operation.  For instance, data from the ODI flow table can be validated during the exchange operation, or local indexes in the ODI flow table can be exchanged.  Other operations such as updating global indexes during the exchange operation are supported as well.

    Table 1 shows a list of options that can be specified as part of the partition exchange operation.

     

Table 1: Partition Exchange Options

     

    Figure 10 shows the default value for this option.  By default, data will be exchanged with validation.  It is recommended to review each of these values, and choose the one that works best for your ELT environment.  An ODI variable can be used with this KM option.
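As an example, combining several of these options in a single exchange statement (object names are hypothetical) looks like this:

-- Hypothetical example combining exchange options from Table 1
ALTER TABLE w_orders_f
  EXCHANGE PARTITION jan2014 WITH TABLE flow_w_orders_f
  INCLUDING INDEXES
  WITH VALIDATION
  UPDATE GLOBAL INDEXES;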

     

Figure 10: Knowledge Module Option – Partition Exchange Options

     

    Set Incremental Statistics

     

    When data is added into a partitioned table, two types of statistics should be gathered:  partition-level statistics, and global statistics.  For large tables, the task of gathering global statistics is a resource-intensive and time-consuming operation, since a full table scan is required.

    Starting with Oracle Database 11g, a new feature was introduced to improve the performance of gathering global statistics on large partitioned tables: Incremental Statistics.  The Incremental Statistics feature gathers separate statistics for each partition.  Then, it updates the global statistics by scanning only those partitions that have been modified.  Global statistics are generated by aggregating the partition-level statistics, thus eliminating the need for performing a full table scan on the partitioned table.

    Figure 11 shows an example of how Incremental Statistics are gathered for a partitioned table called Orders Fact.  In this example, seven partition-level statistics are gathered first.  Then, global statistics are generated by aggregating the seven partition-level statistics.

     

Figure 11: Oracle Incremental Statistics

     

If this KM option is set to true, the knowledge module will change the statistics preference for the partitioned table to incremental; thus, global statistics for the partitioned table will be gathered incrementally, as illustrated in Figure 11, above.

    If this KM option is set to false, and the statistics preference for the partitioned table is not set to incremental, a full table scan may be performed by the database to maintain the global statistics.

    If the statistics preference for the partitioned table is already set to incremental (in the database), and this KM option is set to false, the statistics preference for the partitioned table is not modified; thus, global statistics for the partitioned table will be gathered incrementally.

    Figure 12 shows the default value for this option.  By default, this option is set to false.

     

Figure 12: Knowledge Module Option – Set Incremental Statistics

     

    To check the statistics preference of a partitioned table, type the following SQL statement:

     

select dbms_stats.get_prefs('INCREMENTAL', '<SCHEMA_NAME>', '<PARTITIONED_TABLE_NAME>') from dual;

     

    For additional information on Incremental Statistics, go to Gathering Incremental Statistics on Partitioned Objects. For additional information on table preferences, go to Setting Table Preferences in Oracle.

Additional statistics preferences can be defined for a schema or a database, and global statistics preferences are available as well.  For additional information on how to set statistics preferences, see the Oracle SET*PREFS procedures at Oracle DBMS_STATS Sub-Programs.
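For reference, the incremental preference can also be set manually with DBMS_STATS.SET_TABLE_PREFS; the schema and table names below are placeholders:

BEGIN
  DBMS_STATS.SET_TABLE_PREFS('<SCHEMA_NAME>', '<PARTITIONED_TABLE_NAME>', 'INCREMENTAL', 'TRUE');
END;
/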

    Set Publish Statistics

     

    If set to true, this option modifies the publish preference for the partitioned table to true.  The publish preference for a table is used by the Oracle database to determine if newly gathered statistics can be published immediately into the dictionary tables.

    Starting with Oracle Database 11g, Release 1, users have the ability to gather statistics and delay their publication.  This new table preference allows users to test the new statistics before publishing them.  Set this KM option to true if you wish to publish newly gathered statistics immediately.

    The Oracle database also requires this option to be set to true if you wish to gather incremental statistics for the partitioned table.

    If the publish preference for the partitioned table is already set to true (in the database), and this KM option is set to false, the publish preference for the partitioned table is not modified; thus, newly gathered statistics will be published immediately into the database dictionary tables.
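The publish preference can be checked and set manually in the same way as the incremental preference; the schema and table names below are placeholders:

-- Check the current publish preference
SELECT DBMS_STATS.GET_PREFS('PUBLISH', '<SCHEMA_NAME>', '<PARTITIONED_TABLE_NAME>') FROM DUAL;

-- Set the publish preference to true
BEGIN
  DBMS_STATS.SET_TABLE_PREFS('<SCHEMA_NAME>', '<PARTITIONED_TABLE_NAME>', 'PUBLISH', 'TRUE');
END;
/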

    Figure 13 shows the default value for this option.  By default, this option is set to false.

     

Figure 13: Knowledge Module Option – Set Publish Statistics

     

    Disable Constraints before Exchange

     

    If set to true, this option disables the integrity constraints of the partitioned table before the partition exchange operation.

    If this option is set to false, and the integrity constraints are enabled in the database, Oracle performs the partition exchange operation with validation to maintain the integrity of the constraints; thus, increasing the time it takes to perform the exchange operation.

    If the user is confident that the data to be exchanged belongs to the partition and the data does not violate the integrity constraints of the partitioned table, then it is recommended to set this option to true, and perform the exchange operation without validation.

    Figure 14 shows the default value for this option.  By default, this option is set to false.

     

Figure 14: Knowledge Module Option – Disable Constraints before Exchange

     

    If the user plans to perform an initial upload for the partitioned table and multiple partitions will be loaded in parallel, it is recommended to disable the integrity constraints for the entire initial upload operation.  The integrity constraints can be re-enabled once all partitions have been loaded successfully.

    If an incremental load for the partitioned table is performed and a single partition is loaded, the integrity constraints can be disabled and enabled, respectively, before and after the exchange operation.

    See the KM option called Enable Constraints after Exchange for details on how to enable integrity constraints after the exchange operation.

    The knowledge module disables only those integrity constraints with a current status of enabled.

    The integrity constraints of the partitioned table are disabled before the creation of the ODI flow table.  Disabled integrity constraints are not added into the ODI flow table in order to ensure a successful exchange operation.
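For a single constraint, the disable and (later) enable operations issued against the partitioned table have the following shape; the constraint name is hypothetical:

-- Before the load / exchange
ALTER TABLE w_orders_f DISABLE CONSTRAINT w_orders_f_customer_fk;

-- After the load / exchange (see the Enable Constraints after Exchange option)
ALTER TABLE w_orders_f ENABLE CONSTRAINT w_orders_f_customer_fk;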

Additional database privileges may be required when disabling integrity constraints for the partitioned table.  See section “Configuring Your Database Privileges” for more information.

     

    Rebuild Local Indexes

     

    When a partition exchange operation is performed on a partitioned table, the local index of the exchanged partition becomes unusable, and the index must be rebuilt.  If set to true, this option rebuilds the unusable local indexes of the partition that has been exchanged.

    If the partition that has been exchanged is single-level (a partition without sub-partitions), the knowledge module rebuilds the unusable indexes of the partition.

    If the partition that has been exchanged is composite (a partition with sub-partitions), the knowledge module only rebuilds the unusable sub-partition indexes of the partition.

    Alternatively, the user can specify the INCLUDING INDEXES clause in the Partition Exchange Options of this knowledge module to automatically include the local indexes during the partition exchange operation. This will prevent local indexes from becoming unusable.
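One way to rebuild the unusable local index partitions left behind by an exchange is the statement below; the partition name is hypothetical, and the KM may instead issue an equivalent rebuild per index:

-- Rebuild all unusable local index partitions of the exchanged partition
ALTER TABLE w_orders_f MODIFY PARTITION jan2014 REBUILD UNUSABLE LOCAL INDEXES;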

    Figure 15 shows the default value for this option.  By default, this option is set to false.

     

Figure 15: Knowledge Module Option – Rebuild Local Indexes

     

    If the local index is already in an unusable state, the index must be rebuilt with this KM option.  The UPDATE INDEXES clause does not update an index that is already in an unusable state.

This KM option rebuilds unusable local indexes sequentially, one at a time.  Table 2 shows a list of factors to consider when selecting an option to maintain local indexes:

     

Table 2: Considerations When Maintaining Local Indexes

     

    Rebuild Global Indexes

     

As with local indexes, when a partition exchange operation is performed on a partitioned table, the global indexes of the partitioned table become unusable, and each index must be rebuilt in its entirety.  If set to true, this option rebuilds the unusable global indexes of the partitioned table.

    Alternatively, the user can specify the update [global] indexes clause in the Partition Exchange Options to automatically update the global indexes during the exchange operation.  This will prevent the global indexes from becoming unusable.

    If the global index is already in an unusable state, the index must be rebuilt with this KM option.  The update [global] indexes clause does not update an index that is already in an unusable state.
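Rebuilding an unusable global index manually is done with ALTER INDEX; the index name below is hypothetical, and a degree of parallelism can be added to speed up the rebuild:

-- Rebuild an unusable global index
ALTER INDEX w_orders_f_g1_ix REBUILD;

-- Or rebuild it in parallel
ALTER INDEX w_orders_f_g1_ix REBUILD PARALLEL 4;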

    Figure 16 shows the default value for this option.  By default, this option is set to false.

     

Figure 16: Knowledge Module Option – Rebuild Global Indexes

     

This KM option rebuilds unusable global indexes sequentially, one at a time.  The update [global] indexes clause of the Partition Exchange Options, however, can update global indexes in parallel.  Table 3 shows a list of factors to consider when selecting an option to maintain global indexes:

     

Table 3: Considerations When Maintaining Global Indexes

     

    Note 1 -  Considerations When Maintaining Global Indexes

     

    Enable Constraints after Exchange

     

    If set to true, this option enables the integrity constraints of the partitioned table after the successful execution of the partition exchange operation.

    The knowledge module enables only those integrity constraints with a current status of disabled.

    If an initial load for the partitioned table is performed and multiple partitions are loaded in parallel, the user can disable the integrity constraints before the first partition exchange operation and enable them after the last partition exchange operation.

    If an incremental load for the partitioned table is performed, and a single partition is loaded, the user can disable and enable the integrity constraints, respectively, before and after the single exchange operation.

    Figure 17 shows the default value for this option.  By default, this option is set to false.

     

Figure 17: Knowledge Module Option – Enable Constraints after Exchange

     

     

    Global indexes must be in a usable status before enabling integrity constraints for the partitioned table.

Additional database privileges may be required when enabling constraints for the partitioned table.  See section “Configuring Your Database Privileges” for more information.

     

     

    Gather Table Statistics

     

    If set to true, this option instructs the Oracle Database to gather statistics on the exchanged partition of the partitioned table.

    Incremental statistics will be gathered if the following two conditions are met:

      • The KM option called Set Incremental Statistics has been set to true; thus, the statistics preference for the partitioned table has been set to incremental in the database.
      • The KM option called Set Publish Statistics has been set to true; thus, the publish preference for the partitioned table has been set to true in the database.

    If the above two conditions are not met, then a full table scan will be performed to maintain global statistics for the partitioned table.

    Global indexes must be in a usable state before gathering table statistics on the partitioned table.

    Figure 18 shows the default value for this option.  By default, this option is set to false.

     

Figure 18: Knowledge Module Option – Gather Table Statistics

     

Two additional settings needed to gather incremental statistics have already been configured in the knowledge module:

      • When invoking the GATHER_TABLE_STATS procedure, the ESTIMATE_PERCENT parameter is set to AUTO_SAMPLE_SIZE.
      • When invoking the GATHER_TABLE_STATS procedure, the GRANULARITY parameter is set to ALL; thus, table statistics will be gathered at all three levels:  sub-partition (if composite-partition), partition, and table level.
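A manual call with the same settings would look like the following; the schema and table names are placeholders, and the KM's actual call may pass additional parameters:

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => '<SCHEMA_NAME>',
    tabname          => '<PARTITIONED_TABLE_NAME>',
    estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,
    granularity      => 'ALL');
END;
/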

    For additional information on gathering table statistics, go to Gathering Table Statistics on Partitioned Tables.

     

    Note 2 -  Knowledge Module Option – Gather Table Statistics

     

     

    Delete Temporary Objects

     

    This option allows the user to delete the temporary objects created by the knowledge module during the partition exchange operation.

    The knowledge module creates only one temporary object, the ODI flow table.  If this option is set to true, the KM drops the flow table once the partition exchange operation is complete.  Users can set this option to false to keep the flow table and troubleshoot failures during the partition exchange operation.

    Figure 19 shows the default value for this option.  By default, this option is set to true.

     

Figure 19: Knowledge Module Option – Delete Temporary Objects

     

    Configuring your Database Privileges for Oracle Partition Exchange

     

Depending on how the ODI Topology is configured, additional database privileges may be required when using the IKM Oracle Partition Exchange Load.

    A detailed list of privileges is shown in Table 4.  These privileges should be set on the target data server.

     

Table 4: Database Privileges for the Partition Exchange Load

     

    Adding your Table Partitions in Oracle Data Integrator

     

    If you plan to use the default value for the KM option called Partition Name, use this section to import your database-defined partitions into your ODI Models.  Otherwise, skip this section.

Follow these five steps to import the database-defined partitions of a table into the ODI repository, so that the partitions can be selected from the Partition/Sub-Partition option of the ODI datastore:

    • In ODI Studio, open the ODI Model that contains the partitioned table (datastore), and select the Reverse Engineer tab.
    • Select the Customized option of the Reverse Engineer tab.
    • Enter the name of the partitioned table in the Mask textbox, so only metadata of the partitioned table will be imported.
    • Select the Knowledge Module called RKM Oracle. If the knowledge module has not been imported yet into your ODI Project, follow these instructions to import the RKM Oracle: Importing Knowledge Modules in ODI.  By default, the RKM Oracle is located in the following directory:

    <ODI_HOME>/sdk/xml-reference/

    • Save your ODI Model changes, and select Reverse Engineer.  ODI will launch a script to import the database-defined partitions into the ODI model.  Verify that the script completes successfully by reviewing the logs in the ODI Operator.

    Figure 20 shows an example of how to import the database-defined partitions of a table called W_ORDERS_F.

     

Figure 20: Importing Database-Defined Partitions into ODI

     

    Note 3 -  Importing Database-Defined Partitions into ODI

     

    Once the database-defined partitions have been imported into your ODI repository, you can view the partitions by opening the ODI datastore and selecting the Partitions tab.

    Figure 21 shows an example of the database-defined partitions for the datastore called W_ORDERS_F.  There are 12 Composite Range-List partitions defined in the database and each partition has 3 sub-partitions.

     

Figure 21: Partitions List for an ODI Datastore

     

Alternatively, partitions and sub-partitions can be added manually into an ODI datastore by selecting the Add Partition or Add Sub-Partition option, respectively.

    ODI variables can be added manually in the Partitions list as well; thus, the partition name can be assigned dynamically by refreshing the ODI variable at runtime.  Figure 21, above, shows two ODI variables that have been added into the Partitions list of the W_ORDERS_F datastore: #A_TEAM.PARTITION_NAME and #A_TEAM.SUB_PARTITION_NAME.

    Once the database-defined partitions have been added into the ODI models, they can be used in ODI mappings to select data from a partition, upload data into a partition, and exchange partitions with the IKM Oracle Partition Exchange Load.

    Figure 22 shows how to assign a partition name to a datastore in an ODI mapping.  In this example, the ODI variable called #A_TEAM.PARTITION_NAME has been used as the partition name for the W_ORDERS_F datastore.

    To assign a partition name or an ODI variable to a datastore in an ODI mapping, follow these steps:

     

      • In the Logical Diagram of your ODI mapping, select the datastore and open the Properties window.  Expand the General option.
      • Locate the Partition/Sub-Partition section.
      • Open the List Box of the Partition/Sub-Partition section, and select the desired partition name or ODI variable.

     

Figure 22: Setting the Partition Name in an ODI Mapping

     

    For each datastore in your ODI mapping, you can choose different partition names or ODI variables.

    Follow these recommendations when choosing a hard-coded partition name or an ODI variable to specify the partition of the datastore in your mapping:

      • Choose a partition name if you plan to use your mapping to load data for a single partition.
      • Choose an ODI variable if you plan to use the same mapping to upload data for more than one partition.  This will allow you to assign partition names dynamically.

     

    Conclusion

     

If your Oracle data warehouse has partitioned tables, Oracle Partition Exchange offers a fast method for uploading data into your large partitioned tables.

If you would like to download a free copy of the KM discussed in this article, go to “Oracle Data Integrator Knowledge Modules – Downloads,” and search for “PEL”.  The KM is fully compatible with both ODI 11g and ODI 12c.  Additionally, an ODI 12.1.3 repository with examples can be found at “ODI Repository Sample for Oracle Partition Exchange.”

For an overview of the benefits of using Oracle Partition Exchange with ODI and how to use the IKM Oracle Partition Exchange Load to perform data uploads for large partitioned tables, go to “Using Oracle Partition Exchange with Oracle Data Integrator (ODI).”

For more Oracle Data Integrator best practices, tips, tricks, and guidance that the A-Team members gain from real-world experiences working with customers and partners, visit Oracle A-Team Chronicles for Oracle Data Integrator (ODI).

     

    Related Articles

    Using Oracle Partition Exchange with Oracle Data Integrator (ODI)
