`

Type 3 Slowly Change Dimension(原创)

 
阅读更多

Type 3 Changes
In most dimensional schemas, the bulk of changes to source data generate type 1 and type 2 changes. Occasionally, neither technique satisfies. A third type of change response is called into play when there is a need to analyze all facts, those recorded before and after the change, with either the old value or the new value. Neither type 1 nor type 2 does the job here. Solving this problem by storing natural keys in the fact table renders queries difficult to write and opens the door to serious double-counting issues. The preferred approach is to include two attributes for the changed data element, one to carry the current value and one to carry the prior value. Both are updated when a change occurs. Known as a type 3 solution, this allows either to be used to study the entire history.
Study All Facts with Old or New Dimension Values
During the course of planning the appropriate slow change response for each column in a dimension table, you will occasionally identify an attribute with two seemingly contradictory requirements:

  • The ability to analyze all facts, recorded before and after the change occurred, using the new value.
  • The ability to analyze all facts, before and after the change occurred, using the old value.

In this situation, neither a type 1 nor a type 2 response is sufficient to keep everyone satisfied. This is best understood in the context of an example.
For example, there were just two regions: East and West. As the company grew, these groupings became insufficient. What began as the East region was broken down into two regions: Northeast and Southeast.When changes like this occur, management begins using the new designations immediately.
Management also needs to hang onto the old designations, at least for a little while, in order to compare this year’s performance to what was reported for the previous year. The requirements for managing changes to a customer’s region are enumerated as follows:

  • The ability to use the new regions to analyze all facts, whether recorded before or after the changeover
  • The ability to use the old regions to analyze all facts, whether recorded before or after the changeover

The first requirement allows managers to move forward with the new groupings immediately. All order history can be grouped into the new Northeast and Southeast categories. The managers can do “this year versus last” comparisons as if the new groupings had always been in place. The second requirement allows the company to move forward using the old groupings. All orders can be grouped under the old value (East) as if it had never changed. This permits the results for the current year to be reported externally in a format consistent with the way in which forecasts were provided.

Previous Techniques Not Suitable
Neither a type 1 nor a type 2 response is able to address both these requirements. On its own, the first requirement can be satisfied with a type 1 change, overwriting the old value with the new value. A customer in the region “East,” for example, is updated to show “Northeast.” The new region value is, therefore, associated with all facts in any associated fact table—both those recorded before the changeover and after the changeover. Unfortunately, the type 1 response does not meet the second requirement, which calls for the old value to be available. A type 1 change wipes away the previous value of the attribute, as if it never existed. There is no record that the customer whose region has been updated to “Northeast” was ever classified as being in the region “East.”

A type 2 response fares worse; it is not able to meet either requirement. A type 2 response associates each fact with the version that was current at the time it was recorded. Hence, the old region value (East) is associated with facts recorded prior to the change, and the new values (Northeast, Southeast) are associated with facts recorded after the change. These new values cannot be used to analyze past history, conflicting with the first requirement. The old values cannot be applied to facts recorded after the change, conflicting with the second requirement.
The Type 3 Solution
When requirements call for using the old or new values of a changed attribute to study all facts, a type 3 response is used. A pair of attributes is modeled, one attribute for the current value and one for the previous value. When a change occurs, both columns are updated; no rows are added. Users can employ either column in their analysis as desired. The process is repeatable, and the solution can be extended if multiple versions of the changed attribute are required.
A Pair of Attributes, Both to Be Updated
When a dimension attribute is singled out for type 3 treatment, it is modeled as a pair of columns. One column is designated to hold the current value; the other is designated to hold the old value. In the case of customer regions, they might be called region_current and region_previous, as depicted below figure.

When a change to the source value occurs, both attributes are updated. The new value is placed in the column representing the current value; this is a standard type 1 response. The old value is moved to the column representing the previous value. No additional rows are created. Since each fact is linked to a single dimension row containing both values, it is possible to create reports that use either value to group all data.

The Type 3 Change in Action
The mechanics of a type 3 change are illustrated below Figure. In this example, customers in the region called “East” are redesignated as belonging to regions called “Northeast” or “Southeast.” The top half of the diagram shows the state of affairs before the change. Several rows from the customer dimension table are shown. Notice that region_current


contains each customer’s region—”East” or “West.” There have been no changes yet, so the region_previous is initialized to contain the same values. In the fact table, two rows have been recorded for the customer identified by customer_key 1011. Tracing this back to the customer dimension, you can see that these orders were placed by Robert Davis, who is in the region “East.”
Now suppose that the business decides to change the regional groupings. The region that was formerly known as East will be subdivided into Northeast and Southeast. For each customer who was in the East region, a type 3 change occurs. The region_current is updated with the new values. The result can be seen in the above Figure. Robert Davis now has a region_current of “Northeast.”The two orders that Robert Davis placed before the change are still in the fact table. In addition, Robert has placed two additional orders since the change occurred. All four rows are associated with the same row in the dimension table, identified by surrogate key 1011.
This solution allows you to employ either version of the region categorization in queries or reports, simply by using the appropriate column. When you want to study orders using the new scheme, you group order_dollars by region_current. This causes all the orders from Robert Davis to be rolled up under the designation of “Northeast,” whether they were recorded before or after the change. When you want to study orders using the old scheme, you group order_dollars by region_previous. Robert’s orders now appear under the grouping for “East,” regardless of when they occurred.
Tip A type 3 response associates the old and new values of the changed attribute with all transactions.
Two dimension attributes are required. One will hold the current value; the other will hold the previous
value. When the source changes, both are updated
Notice that the region_current attribute provides a standard type 1 response to the changed data. It always contains the most recent value; it is always overwritten when the value changes. For this reason, the region_current column can be designated as a type 1 attribute in design documentation if you prefer. You may even prefer to think of the change as a type 1/3 change to region. Designating the attributes that produce a type 3 response as a pair, however, may be more useful in communicating to ETL developers how changes should be processed.
It is important to recognize that a type 3 change does not preserve the historic context of facts. Each time a type 3 change occurs, the history is restated. There is no way to know, for example, what region Robert Davis was a part of when a particular order was placed. This requires a type 2 approach. Type 2 and type 3 responses can be combined, in a hybrid response known as type 1/2/3.

Type 3 changes are comparatively rare, but they are occasionally warranted. Redistricting and geographic realignments are the most common examples. Finance and accounting groups often designate new names or classifications for accounts on an annual basis. Such cases have the lucky distinction of occurring at predictable intervals, although this is not always the case.

Repeatable Process

A type 3 change is not a one-time-only process. It can be repeated once, twice, or as many times as needed. Robert Davis’s region, for example, may subsequently be changed again. Perhaps as a result of redrawing the region boundaries, he is redesignated as being in the Southeast. When this occurs, what was formerly in region_current (Northeast) is moved to region_previous. The new value (Southeast) is placed in region_current.
A type 3 change preserves only one old version of the changed attribute. Once a second change occurs, the fact that Robert Davis was once in the East region is lost. In some cases, it may be useful to maintain multiple values. This can be done by designating three, four, or however many versions of the region column are required. Each time a change occurs, the values are all shifted among these columns.

With multiple versions, the columns can be difficult to name. If the changes occur on a predictable basis, they might be named region_2009, region_2008, region_2007, and so forth. This approach has two distinct disadvantages. First, it requires the DBA to periodically change the names of columns. For example, if requirements call for the current version and the two previous versions, it is necessary at the end of 2009 year to drop the 2007 column and add one for 2010. Second, encoding the year into the column names means new queries and reports will need to be designed each year to refer to the new column names. An alternative that avoids these drawbacks is to name the columns region_current, region_last_year, and region_two_years_ago.
Special Scheduling Not Necessary
The type 3 change is often misunderstood as requiring special advanced scheduling. Although developers may take advantage of the predictability of a type 3 change and handle it with a special process, this is not strictly required. As with any other slow change processing, it is possible to design an ETL process that looks for changes to the attribute in question and applies them if and when they are detected. All that is necessary is to anticipate the possibility of the type 3 change and engineer the ETL process accordingly. There may be good reasons to handle a type 3 change via a special process.
• In most cases, type 3 changes occur en masse. A change to the region designations, for example, may affect every customer in the dimension table.
• The information that drives a type 3 change may come from a separate source or involve special rules. Region designations, for example, may be driven by a spreadsheet that maps to Zip codes.
• A type 3 change with multiple versions may involve renaming database columns and developing new reports. Its implementation must be coordinated with these activities.
• The change may occur on a predicable basis, in which case it is not necessary to monitor for it as part of standard processing.

For any of these reasons, it may be useful to handle the type 3 change through a separate scheduled process. This is not mandatory, but it may be convenient for processing purposes.
Documenting the Type 3 Change
When it comes to documenting a type 3 change, designers have varying preferences. In figure above, each attribute is designated as a surrogate key, as a natural key, or according to its slow change characteristic. The type 3 attribute pair is highlighted and labeled as type 3.
As noted, you may prefer to designate the current member of the attribute pair as a type 1 attribute, since it behaves in exactly the same way as any other type 1 attribute. Modeling tools often impose limitations on how you can document the slow change behavior associated with an attribute. Some dimensionally aware modeling tools allow you to note slow change characteristics, but most tools will not let you highlight a pair of type 3 attributes. Naming conventions and comments may be your only route to associating them.
Some modeling tools do not include the concept of a type 3 change, limiting your options to type 1 and type 2. Using these tools, you may be forced to label both members of the attribute pair as type 1. This will require some careful annotation to distinguish each attribute’s purpose. You may convey the behavior of the attribute containing the previous value by noting its data source is the attribute that holds the current value.

 

参考至:《Star Schema The Complete Reference》

本文原创,转载请注明出处、作者

如有错误,欢迎指正

邮箱:czmcj@163.com

0
0
分享到:
评论

相关推荐

    Line Item Dimension

    此外,为了优化性能,可能还会涉及到维度表的粒度控制、延迟整合策略以及维度变化处理(如 Slowly Changing Dimension,SCD)。 工具方面,除了ETL工具,可能还需要使用数据库管理系统(如Oracle、MySQL、SQL ...

    dtm_dtm_slowly3i8_

    "dtm_dtm_slowly3i8_"可能是一个项目名或文件命名规范,其中"dtm"再次强调了这是与数字地形模型相关的项目,"slowly3i8"可能是对这个模型创建或更新过程的一个描述,可能意味着这个过程是逐步的、分阶段的,或者需要...

    quicksort_slowly5vx_Vc_

    在"quicksort_slowly5vx_Vc_"的标题中,"slowly5vx"可能指的是在某种特定情况下快速排序的性能下降,例如数据已经部分有序或者最坏情况(所有元素都相同)。这通常会导致快速排序的时间复杂度从平均情况下的O(n log ...

    Informatica缓慢变化维探究.pdf

    3. **记录最新记录及上一次历史(Type 3 Dimension)** - **定义**:这种方法会在新记录中追加新值,同时保留旧值,通常是在字段末尾添加日期戳或版本号。 - **应用场景**:适用于需要保留特定历史版本但又不想...

    informatica-scd-type1

    SCD(Slowly Changing Dimension)是处理数据仓库中维度表变化的一种策略。Type 1 SCD是最基础的类型,它涉及对历史数据进行覆盖更新,适用于那些不保留历史记录或只需要最新数据的场景。 【描述】:Informatica的...

    POCS算法.zip_POCS_POCS算法_POCS算法原理_pocs算法代码_slowly2bc

    标题中的“POCS算法.zip_POCS_POCS算法_POCS算法原理_pocs算法代码_slowly2bc”表明这个压缩包文件主要包含关于POCS(Projection onto Convex Sets,投影到凸集)算法的信息,特别是它的原理和实现代码。POCS算法是...

    小班英语优质课教案《Slowly and Fast》润新教育.txt

    小班英语优质课教案《Slowly and Fast》润新教育.txt

    【高考领航】最新2014-2015学年高中英语 Unit7 Lesson3 The Sea World课时训练 北师大版必修3

    2. "I have piles of papers typed but I type so slowly." 错误,应改为 "I have piles of papers to type but I type so slowly." 3. "People have always been curious at how living things on the earth ...

    Liver-Tumors-master.zip_CT图像分割_Cancer Segmentation_ct_slowly82u_

    本项目"Liver-Tumors-master.zip"专注于这个领域,提供了一个用于肝癌图像分割的算法,其特点是“ct_slowly82u”,暗示可能是一个对CT图像处理速度优化的方法。 CT图像分割是指通过计算技术和图像处理技术,将CT...

    国赛板球控制.rar_duerut_slowly1rx_板球圆环控制_板球控制_板球系统控制程序

    标题中的“国赛板球控制.rar_duerut_slowly1rx_板球圆环控制_板球控制_板球系统控制程序”暗示了一个项目或竞赛,可能是关于板球控制系统的设计和实现,涉及到STM32F103微控制器以及摄像头对红色小球的识别技术。...

    ETL架构中的38个子系统1

    10. 缓慢变化维处理系统(Slowly Changing Dimension Processor, SCD): 针对维度表中随着时间变化的属性,SCD处理器支持三种常见处理策略:类型1(直接覆盖)、类型2(创建新行)和类型3(添加新列),以反映历史...

    多表查询-相关子查询-SCD-答案[文].pdf

    在本文中,我们将深入探讨相关子查询和SCD(Slowly Changing Dimension,慢速变化维度)的概念,并结合给出的数据进行实例解析。 首先,SCD是数据仓库领域中的一个术语,用于处理随着时间变化的数据。SCD主要分为三...

    datastage面试300题

    10. **使用LOOKUP阶段实现SCD(Slowly Changing Dimension)**:LOOKUP阶段可以处理维度表的变化,如类型2的SCD,记录历史数据。 11. **DataStage常见错误及其解决**:理解错误代码和日志,针对性地解决数据质量、...

    eclipse-mars免安装版

    Eclipse Mars免安装版是一款非常实用的Java集成开发环境(IDE),它无需经过繁琐的安装过程,用户只需下载解压后即可直接启动进行编程工作。Eclipse Mars版本是Eclipse IDE的一个重要里程碑,发布于2015年,带来了...

    C语言经典例题100道.zip_C语言_c语言例子_c语言例题_slowly1l4_经典例题

    3. **运算符与表达式**:了解算术、关系、逻辑、赋值等运算符的用法。 4. **流程控制**:条件语句(if-else)、循环语句(for, while, do-while)的应用。 5. **数组**:一维、二维数组的声明、初始化和操作。 6. **...

    go-clone:深度克隆任何Go数据

    go-clone:深度克隆任何Go数据 包clone提供了对任何Go数据进行深度克隆的功能。 它还提供了一个包装器,以保护指针免受任何意外的突变。 Clone / Slowly可以克隆任何结构的未导出字段。...node3 := & Lis

    数仓相关的pdf文档,维度建模

    7. **延迟维(Slowly Changing Dimension, SCD)**:处理维度表中的历史变化问题,有类型1、2、3等多种处理方式,确保数据的历史准确性。 8. **层次维度(Hierarchical Dimension)**:允许用户通过层级结构浏览...

    数据仓库维度建模指南

    - **缓慢变化维度**(Slowly Changing Dimension, SCD):维度属性随时间变化,需要处理历史数据的变更。 - **快速变化维度**:属性变化频繁,需要特殊处理以保持数据一致性。 - **大维**(Huge Dimension):...

Global site tag (gtag.js) - Google Analytics