Wednesday, June 11, 2014

Data Integration Platform Accelerator - Talend ETL Tool

Business Challenge

Data integration and collaboration between enterprise applications is a key factor for any organization. The most important aspect is to ease the seamless integration of data and proper migration of different data sources and effective collaboration of multiple Enterprise Applications. The core problem is to address the integration needs and challenges of Enterprise Applications which are built on PostGres database.

The following are the key problems that need to be addressed:
  1. Effective integration between multiple Enterprise Applications
  2. Collaborative seamless data integration between various Enterprise Applications
  3. Provide common source for various BI reports
  4. Proper data migration of different kinds of data sources
  5. Replication across multiple systems

Solution Approach

To address the problem the proposed solution approaches are:
  1. Trigger Based Approach - Talend CDC
  2. Using xmin – Postgres CDC
  3. Trigger & xmin - Slony-I replication
  4. Using WAL & Trigger - Streaming Replication : PostGres 9.0 onwards

1. Trigger Based Approach - Talend CDC:

The approach using Trigger base – Talend CDE:
  • Creates change tables for each source table which needs to be watched
  • Triggers store the primary key of the changed record
  • Triggers store metadata of the transaction in the change tables
    • PK (Primary key of the transaction)
    • Change Type (Update / Insert / Delete)
    • Changed By
    • Changed Time
  • Talend CDC component extracts the changes from the change tables
    • Follows a publish / subscribe based model
    • Provides a view to extract from the source table based on the PK
    • Left-join with the source table
    • Provides the latest updates
    • Maintains its own control tables

2. Using xmin – PostGres CDE:

The approach using xmin – PostGres CDE:

  • Maintains configuration on which tables need to be watched
  • Extracts the xmin for the tables to watch and stores in control tables
  • Creates a generic view to extract the changes that happened after the last extract based on xmin

Advantages:

  • Non-Intrusive approach over the source schema
Disadvantages:
  • Indexes are not available over xmin
  • Wrap-around problem
  • Cannot track deletes
Metrics:
  • On a simulated environment time to extract on table with more than 500K rows on a 4GB RAM 2 Core VM takes around 4-5 seconds
    • This would grow as the number of rows increases

3. Trigger & xmin – Slony I  Replication:

The approach using Trigger &xmin – Sloney I Replication:

  • Master / Slave replication
  • Based on triggers on Master or Origin node
  • Publisher / Subscribe model
    • Stores transactions in sl_log_1 / sl_log_2 tables
      • Stores TableId
      • Stores TransactionId
      • Stores Audit info
  • Creates Sync Events for subscribers
  • Stores sync events in sl_event table
    • Slony daemons on subscribers pull the events
    • Get the transactionid for that event from the log tables
    • Extract and load the data in the slave schema
  • Have Talend CDC point to the slave and extract the changes based on the trigger based approach

Advantages:

  • Although xmin is used internally to detect and  replicate the changes, the wraparound issues are handled by slony
  • Replication Set allows to replicate only certain tables not the whole schema
Disadvantages:
  • Pulling changes/modified data over wire may cause delay
  • Changes to the source schema
Metrics:
  • Accounting the existing stats by setting the SLON_DATA_FETECH_SIZE is between 100 to 400 the slony replication has good response times
  • As the triggers would be on the slave it would be faster for talend CDC extraction as they are based on PK’s



4. Using WAL & Trigger - Streaming Replication : PostGres 9.0 onwards

Limitations with this approach:

  • WAL-based replication requires that all databases use identical versions , running on identical architectures
  • WAL-based replication duplicates the model
  • Do not have the ability to have specific updates on Target Schema
  • Synchronous Replication based on 2-Phase commits


Proposed Solution (Talend Tool)

Based on all the above approaches, Trigger based approach seems to be the best option with using Talend CDC. The main advantage with this approach is the efficient retrieval you receive from source based on indexed columns.

Overview
Talend is an Open Source Integration Software Company that provides open source middleware solutions that enable organizations to gain more value from their applications, systems and databases. Shattering the traditional proprietary model, Talend democratizes the integration market by providing enterprise-grade open source technologies that cover both the data integration and application integration needs of organizations of all sizes.

Talend's unified integration platform addresses projects such as data integration, ETL, data quality, master data management and application integration. With their proven performance, user friendly, extensibility and robustness, Talend's solutions are the most widely used and deployed integration solutions around the world. Talend is a tool for people who are already making a Java program and want to save an abundant amount of time with a tool that generates code for them. The simple DW Architecture is below.

Talend Features:

  1. Collaborative Data Integration - Talend’s data integration products provide powerful andflexible integration, so that firms can stop worrying about how databases and applications are talking to each other. Thus providing them with the ability to maximize the value of using their data.
  2. Transform and Integrate Data between Systems - Talend’s data integration products provide an extensible, highly-performant, open source set of tools to access, transform and integrate data from any business system in real time or batch to meet both operational and analytical data integration needs. With over 450 connectors, it has the ability to integrate almost any data source. The broad range of use cases addressed include: massive scale integration (big data/ NoSQL), ETL for business intelligence and data warehousing, data synchronization, data migration, data sharing, and data services.
  3. A Comprehensive Solution - Talend provides a Business Modeler, a visual tool for designing business logic for an application; a Job Designer, a visual tool for functional diagramming, delineating data development and flow sequencing using components and connectors; and a Metadata Manager, for storing and managing all project metadata, including contextual data such as database connection details and file paths.
  4. Broad Connectivity to All Systems - Talend connects natively to databases, packaged applications (ERP, CRM, etc.), SaaS and Cloud applications, mainframes, files, Web services, data warehouses, data marts, and OLAP applications. It offers built-in advanced components for ETL including string manipulators, Slowly Changing Dimensions, automatic lookup handling and bulk loading. Direct integration is provided with data quality, data matching, MDM and related functions. Talend connects to popular cloud apps including Salesforce.com and SugarCRM.
  5. Teamwork and Collaboration - The shared repository consolidates all project information and enterprise metadata in a centralized repository shared by all stakeholders: business users, job developers, and IT operations staff. Developers can easily version jobs with the ability to roll-back to a prior version.
  6. Advanced Management and Monitoring - Talend includes powerful testing, debugging, management and tuning features with real-time tracking of data execution statistics and an advanced trace mode. The product incorporates tools for managing the simplest jobs to the most complex ones, from single jobs to thousands of jobs. Processes can be deployed across enterprise and grid systems as data services using the export tool.
  7. It uses a code-generating approach. Uses a GUI, but within Eclipse RC.
  8. It generates java or Perl code which can run on any server.
  9. It can schedule tasks (also with using schedulers’ like cron).
  10. It has data quality features: from its own GUI, writing more customized SQL queries and Java.



Figure 2: Typical enterprise data integration model

Advantages of Using Talend Tool

  • Talend is used in an application to retrieve and transform the data across multiple systems at enterprise level. 
  • With Talend Centralize data integration, enrichment and distribution is made easy
  • Capable of generating multiple BI Reports using unified transactional data model
  • Ability to configure the Talend Jobs to pick data from a certain period of time.
  • All such data flows are registered with an approval process for new ones, with special emphasis on extra-company flows
  • Metadata-driven architecture to ensure that critical data elements and transformations are documented, controlled (e.g., sensitive data) and kept in sync with implementation
  • Talend Slony replication is used to replicate data across multiple enterprise systems
  • Talend CDC is used to create Triggers and views on the source database to pull the changed data and insert / update the Target Database.

Value Addition using Talend Open Source Tool

  • Talend Open Source Tool enabled development for centralized location of data for various enterprise applications
  • With Talend Open Source we could achieve easy transactional data transformations across systems
  • Simplified tool to develop transactional data jobs
  • Insulates clients from changes in transactional systems
  • Reduces modelling and data quality effort downstream, e.g., at DW/BI for conforming dimensions, de-duping and resolving inconsistent reports
  • Provides good turn around on data retrievals
  • Provides source data for operational reporting
  • Talend CDC enabled Rapid Application development 
  • Slony enabled smooth replication of data