Every now and then I come across a blog entry that reminds me there are people out there who know a lot more about my niche than I do! This is fortunate as this week it has helped me understand ELT tools.
ETL versus ELT and ETLT
The world of data integration has it's own Coke versus Pepsi challenge - it's called ETL versus ELT. Not as exciting as Aliens versus Predator, not as compelling as Ali versus Frasier and not as sexy as Erin Brockovich versus whatever company that was ... but an important battle if you are in charge of a multi million dollar IT budget.
ETL (Extract, Transform and Load) is the coca cola in the challenge with Informatica and DataStage the champions in terms of license fees and market share. It is made up of software that transforms and migrates data on most platforms with or without source and target databases. Business Objects, SAS, Microsoft SSIS, Ab Initio and Cognos are also in the ETL corner.
ELT (Extract, Load and Transform) is the challenger and is now largely driven by RDBMS vendor Oracle with Oracle Warehouse Builder and Sunopsis. It consists of software that transforms and migrates data in a database engine, often by generating SQL statements and procedures and moving data between tables.
ELT technology was constrained by what the database was capable and since many came from RDBMS vendors they tended to be suitable for just one database platform. Eg. Oracle Warehouse Builder and Microsoft DTS. They were also lacking functionality as the vendor was more concerned with building a database rather than an ELT tool. Sunopsis was an exception as an ELT tool not owned by an RDBMS vendor (until the Oracle acquired them ).
Informatica has recently moved into the the ETLT (Extract, Transform, Load and Transform) area with database pushdown optimization. This is standard ETL delivering to a target database and some extra sexy moves done moving it into more tables. Microsoft SSIS also has good ETLT capabilities within the SQL Server database.
Pros of each
I haven't had a lot of experience with ELT products but fortunately Dan Lindstedt from the B-Eye-Network blogs has been talking about this topic for years now and his recent entry ETL, ELT - Challenges and Metadata has a great comparison. Here are his pros of each list below, visit his blog for further discussion and the Cons of each tool:
Pros:
* ETL can balance the workload / share the workload with the RDBMS
* ETL can perform more complex operations in single data flow diagrams (data maps)
* ETL can scale with separate hardware.
* ETL can handle Partitioning and parallelism independent of the data model, database layout, and source data model architecture.
* ETL can process data in-stream, as it transfers from source to target
* ETL does not require co-location of data sets in order to do it's work.
* ETL captures huge amounts of metadata lineage today.
* ETL can run on SMP or MPP hardware
I would add to this data quality. The ETL tools have a head start over ELT in terms of data quality integration with Informatica and DataStage integrating closely. The row-by-row processing method of ETL works well with third party products such as data quality or business rule engines.
And the Pros of ELT.
Pros:
* ELT leverages RDBMS engine hardware for scalability
* ELT keeps all data in the RDBMS all the time
* ELT is parallelized according to the data set, and disk I/O is usually optimized at the engine level for faster throughput.
* ELT Scales as long as the hardware and RDBMS engine can continue to scale.
* ETL can achieve 3x to 4x the throughput rates on the appropriately tuned MPP RDBMS platform.
I'm not sure whether that final point refers to data that is being processed from one RDBMS table to another on the same server or whether it also applies to cross database migrations. If you are moving between database platforms ELT may need to transfer the data and save it before it commences transformation whereas ETL can transfer and transform in the one stream.
Another Pro of ELT is that once the database is on the target platform it no longer places stress on the network, all further transformation is done on the RDBMS server. These Pros of ELT is why ETL tools have an ELT capability, either the simple interface of DataStage in running user-defined SQL and procedures or the advanced capabilities of Informatica to generate SQL transformation code.
Informatica Pushdown ELT
In 2006 Informatica made an each way bet in the challenge by putting ELT style push down optimization into the PowerCenter ETL tool. You can see a very good list of 8.1.1 and Pushdown optimization best practices at the ETL/ELT forum.
Sunopsis ELT
You can see Sunopsis ELT in action at Mark Rittmans Sunopsis Data Conductor : Creating an Oracle Project and in Moving Global Electronics Data using Sunopsis he evaluates the Oracle SQL generated.
DataStage ELT
In DataStage ELT is available through running a SQL statement from a database stage or by calling a stored procedure. It can generate SQL statements but not SQL procedural code so it is not as flexible or advanced as PowerCenter or Sunopsis. IBM may add this functionality within the next 18 months. They already have a DB2 ELT tool in the DB2 DataWarehouse Edition in the form of a SQL language generator for populating a Data Warehouse.
The Wrap
I am not convinced advanced ELT is good an ETL tool as it increases upgraditis.
Upgraditis afflicts ETL tools when they struggle to keep up with all the upgrades in the products the integrate with. Good ETL tools have native access to most common database engines and ERP systems. They constantly need to upgrade and certify these items against new versions of these products. When Informatica added pushdown optimization they bought into a whole new area of upgraditis and will have to keep the pushdown compatible with new versions of all the databases it supports.
相关推荐
《Kettle解决方案:使用PDI构建开源ETL解决方案》是一份深入探讨数据集成技术的文档,主要关注开源工具PDI(Pentaho Data Integration)在ETL(提取、转换、加载)过程中的应用。ETL是数据仓库和大数据处理的核心...
《Pentaho Kettle解决方案:使用PDI构建开源ETL解决方案》介绍的PDI(Kettle)是一种开源的 ETL 解决方案,书中介绍了如何使用PDI来实现数据的剖析、清洗、校验、抽取、转换、加载等各类常见的ETL类工作。 除了ODS/DW...
解决方案:使用PDI构建开源ETL解决方案-460页.pdf
《Pentaho Kettle解决方案:使用PDI构建开源ETL解决方案》主要介绍如何使用开源ETL工具来完成数据整合工作。 《Pentaho Kettle解决方案:使用PDI构建开源ETL解决方案》介绍的PDI(Kettle)是一种开源的 ETL 解决方案...
《Pentaho Kettle解决方案:使用PDI构建开源ETL解决方案》介绍的PDI(Kettle)是一种开源的 ETL 解决方案,书中介绍了如何使用PDI来实现数据的剖析、清洗、校验、抽取、转换、加载等各类常见的ETL类工作。 除了ODS/DW...
《Pentaho Kettle解决方案:使用PDI构建开源ETL解决方案》介绍的PDI(Kettle)是一种开源的 ETL 解决方案,书中介绍了如何使用PDI来实现数据的剖析、清洗、校验、抽取、转换、加载等各类常见的ETL类工作。 除了ODS/DW...
标题中的“解决方案:使用PDI构建开源ETL解决方案源码示例”暗示了这是一个关于如何利用Pentaho Data Integration(PDI),也被称为Kettle,来构建数据提取、转换和加载(ETL)流程的实践指南。这个压缩包可能包含了...
解决方案:使用PDI构建开源ETL解决方案.pdf 完整中文版,带目录,460页
开始使用 Hadoop 和 NoSQL 以及免费的开源 ETL 和 ELT 软件,在任何地方进行大数据集成和转换。 只需拖放和配置预构建的组件,生成本机代码,然后部署到 Hadoop,即可轻松地将 EDW 卸载和摄取、加载和卸载数据到本地...
《Pentaho Kettle解决方案:使用PDI构建开源ETL解决方案》主要介绍如何使用开源ETL工具来完成数据整合工作。 《Pentaho Kettle解决方案:使用PDI构建开源ETL解决方案》介绍的PDI(Kettle)是一种开源的 ETL 解决方案,...
《Pentaho Kettle解决方案:使用PDI构建开源ETL解决方案》主要介绍如何使用开源ETL工具来完成数据整合工作。 《Pentaho Kettle解决方案:使用PDI构建开源ETL解决方案》介绍的PDI(Kettle)是一种开源的 ETL 解决方案,...
《Pentaho Kettle解决方案:使用PDI构建开源ETL解决方案》主要介绍如何使用开源ETL工具来完成数据整合工作。 《Pentaho Kettle解决方案:使用PDI构建开源ETL解决方案》介绍的PDI(Kettle)是一种开源的 ETL 解决方案,...
《Pentaho Kettle解决方案:使用PDI构建开源ETL解决方案》主要介绍如何使用开源ETL工具来完成数据整合工作。 《Pentaho Kettle解决方案:使用PDI构建开源ETL解决方案》介绍的PDI(Kettle)是一种开源的 ETL 解决方案,...
请下载《解决方案:使用PDI构建开源ETL解决方案(part2)》共同使用。
Snowflake:Snowflake数据加载与ETL流程.docx
《Pentaho Kettle解决方案:使用PDI构建开源ETL解决方案》介绍的PDI(Kettle)是一种开源的 ETL 解决方案,书中介绍了如何使用PDI来实现数据的剖析、清洗、校验、抽取、转换、加载等各类常见的ETL类工作。 除了ODS/DW...
4. 如何使用Mysql的Sakila演示数据库构建一个完整的端到端的ETL解决方案 5. 34中ETL子系统,如何使用Kettle实现这34种子系统 6. Kettle如何完成数据抽取/清洗和确认,处理维度表,加载事实表,操作OLAP立方体 .....
解决方案:使用PDI构建开源ETL解决方案-460页.pdf