`

Slowly Changing Dimensions Type 1 and Type 2(原创)

 
阅读更多

Slowly Changing Dimensions

The "Slowly Changing Dimension" problem is a common one particular to data warehousing. In a nutshell, this applies to cases where the attribute for a record varies over time. We give an example below:

Christina is a customer with ABC Inc. She first lived in Chicago, Illinois. So, the original entry in the customer lookup table has the following record:

Customer Key Customer_ID Name Birthday State
1001 999 Christina 1990-01-01 Illinois

At a later date, she moved to Los Angeles, California on January, 2003. How should ABC Inc. now modify its customer table to reflect this change? This is the "Slowly Changing Dimension" problem.

There are in general three ways to solve this type of problem, and they are categorized as follows:

Type 1: The new record replaces the original record. No trace of the old record exists.

Type 2: A new record is added into the customer dimension table. Therefore, the customer is treated essentially as two people.

Type 3: The original record is modified to reflect the change.

Type 1 Slowly Changing Dimension

When the source of a dimension value changes, and it is not necessary to preserve its history in the star schema, a type 1 response is employed. The dimension is simply overwritten with the new value. This technique is commonly employed in situations where a source data element is being changed to correct an error.
By overwriting the corresponding dimension in the star schema, the type 1 change obliterates the history of the data element. The star carries no hint that the column ever contained a different value. While this is generally the desired effect, it can also lead to confusion. If there were any associated facts before the change occurred, their historic context is retroactively altered.
When a record is updated in a dimension table, the context for existing facts is restated. This effect
can give rise to confusion. Steps can be taken to minimize the confusion caused by type 1 changes. Systems analysts responsible for supporting the data warehouse users must be aware of this phenomenon so they are prepared to address confusion. Developers of reports can place the query execution date within the report footer or cover page, signaling to readers the date as of which the report was current. Any reports that are pre-run and stored for users can be automatically updated on a regular basis so users do not inintentionally access “stale” data.

In our example, recall we originally have the following table:

Customer Key Customer_ID Name Birthday State
1001 999 Christina 1990-01-01 Illinois

Developers found Operating system record Christina's birthday incorrectly at the begining, and they correct the birtday value to 1990-02-01, the new information replaces the new record, and we have the following table:

Customer Key Customer_ID Name Birthday State
1001 999 Christina 1990-02-01 Illinois

Advantages:

- This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.

Disadvantages:

- All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Christina lived in Illinois before.

Preexisting Facts Have a New Context
When a record is updated in a dimension table, the context for existing facts is restated. This effect
can give rise to confusion.

Usage:

About 50% of the time.

When to use Type 1:

Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of historical changes.

Type 2 Slowly Changing Dimension

In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. The new record gets its own surrogate key.

In our example, recall we have corrected the birthday value, and Christina moved from Illinois to California this time, the following table:

Customer Key Customer_ID Name Birthday State
1001 999 Christina 1990-02-01 Illinois

The Order fact table shows as below

Customer Key day_key product_key quantity_ordered
1001 2322 10119 5

After Christina moved from Illinois to California, we add the new information as a new row into the table:

Dimension table

Customer Key Customer_ID Name Birthday State
1001 999 Christina 1990-02-01 Illinois
1005 999 Christina 1990-02-01 California

Fact table

Customer Key day_key product_key quantity_ordered
1001 2322 10119 5
1005 4377 20111 1

Historic Context of Facts Is Preserved
By creating multiple versions of the dimension, a type 2 response avoids restating the context of previously existing facts. Old facts can remain associated with the old row; new facts can be associated with the new row. This has the desired effect of preserving past history, while allowing new activity to be associated with the new value.

History of Dimension Is Partially Maintained
A type 2 change results in multiple dimension rows for a given natural key. While this serves to preserve the historic context of facts, it can trigger new forms of confusion. Users may be confused by the presence of duplicate values in the dimension tables. Designers may be lulled by a false sense that they are preserving dimensional history.
Type 2 changes can confuse end users because they cause duplicate values to appear in dimension tables. For example, after Christina's change of address, there are two rows in the dimension table for her customer key. If someone were to query the dimension table to get the name associated with Christina, both rows would be returned. This side effect can be avoided by issuing browse queries that select distinct values. A flag may also be added to indicate the current row for a given natural key value.

Although the type 2 change preserves the historic context of facts, it does not preserve history in the dimension. It is easy to see that a given natural key has taken on multiple representations in the dimension, but we do not know when each of these representations was correct. This information is only provided by way of a fact.
For example, after the change to Christina’s address has occurred, the dimension table in Figure 3-8 shows that there have been two versions of Christina, but it cannot tell us what Christina looked like on any given date. Where was she living on January 1, 2008? The dimension table does not carry this information. If there is an order for January 1, 2008, we are in luck, because the orders fact table will refer to the version of Sue that was correct at the time of the order. If there is not an order on that date, we are unable to determine what Christina looked like at that point in time.
It may be clear to you that this problem is easily rectified by adding a date stamp to each version of Christina. This technique allows the dimension to preserve both the history of facts and the history of dimensions. Another possibility is to build an additional fact table that associates versions of Sue with various dates.

Advantages:

- This allows us to accurately keep all historical information.

Disadvantages:

- This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern.

- This necessarily complicates the ETL process.

- It's not easy to confirm which version is correct in dimension table without timestamp information.

Usage:

About 50% of the time.

When to use Type 2:

Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes.

In addition to the type 1 and type 2 techniques introduced in this chapter, additional responses to source data changes are possible. Options include the type 3 response, hybrid responses, and time-stamped variations. Though less common, these techniques meet additional analytic challenges that I plan to elobrate it by other post.

 

参考至:《Star Schema The Complete Reference》

                http://www.1keydata.com/datawarehousing/slowly-changing-dimensions.html

                http://www.1keydata.com/datawarehousing/slowly-changing-dimensions-type-1.html

                http://www.1keydata.com/datawarehousing/slowly-changing-dimensions-type-2.html

本文原创,转载请注明出处,作者

如有错误,欢迎指正

邮箱:czmcj@163.com

0
1
分享到:
评论

相关推荐

    Informatica designer

    1. **启动向导**:通过菜单“Mappings” -> “Wizards” -> “Slowly Changing Dimensions”启动向导。 2. **选择处理类型**:根据业务需求选择合适的处理类型。 3. **指定源表与目标表**:指定要处理的源表和目标表...

    微软数据库考试70-768: Exam Ref 70-768 Developing SQL Data Models

    Course OutlineModule 1: Introduction to Business Intelligence and Data ModelingThis module introduces key BI concepts and the Microsoft BI product suite.Lessons Introduction to Business Intelligence...

    Informatica缓慢变化维探究.pdf

    在数据仓库领域,**缓慢变化维**(Slowly Changing Dimensions, SCD)是一种重要的数据处理策略,它旨在跟踪和记录随着时间推移维度数据的变化情况。不同于在线交易处理(OLTP)系统,数据仓库系统不仅需要提供实时...

    小班英语优质课教案《Slowly and Fast》润新教育.txt

    小班英语优质课教案《Slowly and Fast》润新教育.txt

    informatica-scd-type1

    SCD(Slowly Changing Dimension)是处理数据仓库中维度表变化的一种策略。Type 1 SCD是最基础的类型,它涉及对历史数据进行覆盖更新,适用于那些不保留历史记录或只需要最新数据的场景。 【描述】:Informatica的...

    Talend组件帮助文档手册

    **组件描述:** `tDB2SCD` 组件用于处理 IBM DB2 数据库中的缓慢变化维度(Slowly Changing Dimensions,简称 SCD)。该组件可以检测到数据中的变化,并根据配置的规则更新或插入新的记录。 - **属性:** - **输入...

    POCS算法.zip_POCS_POCS算法_POCS算法原理_pocs算法代码_slowly2bc

    标题中的“POCS算法.zip_POCS_POCS算法_POCS算法原理_pocs算法代码_slowly2bc”表明这个压缩包文件主要包含关于POCS(Projection onto Convex Sets,投影到凸集)算法的信息,特别是它的原理和实现代码。POCS算法是...

    Exploring C++ 11

    or any other C-based language, you’ll be taught everything you need to know in a logical progression of small lessons that you can work through as quickly or as slowly as you need. C++ can be a ...

    Informatica 开发规范

    - Slowly Changing Dimensions 方式处理缓慢变化的维度,通过表达式模块 EXP_SOURCE 和 EXP_TARGET 实现。 - 转换模块的命名应简明,例如 LKP_表名。 - 转换模块设置:删除无用字段,正确设置 Input 和 Output ...

    HEF4016英文资料

    Its Schmitt trigger action provides enhanced noise immunity and the ability to "square up" slowly changing waveforms. Understanding its DC and AC characteristics is crucial for optimizing its ...

    quicksort_slowly5vx_Vc_

    在"quicksort_slowly5vx_Vc_"的标题中,"slowly5vx"可能指的是在某种特定情况下快速排序的性能下降,例如数据已经部分有序或者最坏情况(所有元素都相同)。这通常会导致快速排序的时间复杂度从平均情况下的O(n log ...

    dtm_dtm_slowly3i8_

    "dtm_dtm_slowly3i8_"可能是一个项目名或文件命名规范,其中"dtm"再次强调了这是与数字地形模型相关的项目,"slowly3i8"可能是对这个模型创建或更新过程的一个描述,可能意味着这个过程是逐步的、分阶段的,或者需要...

    《The Pleasures of Pi, e and Other Interesting Numbers》作者: Adrian, Y. E.

    and (II) A Feast for the Mind (slowly getting deeper for the more adventurous). The author covers beautiful infinite series beginning with those that a young child can understand to one that even ...

    多表查询-相关子查询-SCD-答案[文].pdf

    SCD主要分为三种类型:SCD Type 1、SCD Type 2 和 SCD Type 3,分别对应于不同类型的维度数据变更策略。在本例中,我们主要关注SQL查询,因此不再详细介绍SCD的具体实现,但请注意,SCD在处理历史数据变化时的应用。...

    Black Hat PytHon Python Programming for Hackers and Pentesters

    To start, I lay down some networking fundamentals in Chapter 2 and slowly work our way through raw sockets in Chapter 3 and using Scapy in Chapter 4 for some more interesting network tooling....

    国赛板球控制.rar_duerut_slowly1rx_板球圆环控制_板球控制_板球系统控制程序

    标题中的“国赛板球控制.rar_duerut_slowly1rx_板球圆环控制_板球控制_板球系统控制程序”暗示了一个项目或竞赛,可能是关于板球控制系统的设计和实现,涉及到STM32F103微控制器以及摄像头对红色小球的识别技术。...

Global site tag (gtag.js) - Google Analytics