`

Type 2 Slowly Change Dimension with Timestamp(原创)

 
阅读更多

Time-Stamped Dimensions
If there is any uncertainty about requirements for historic data, the most common response to changes in source data is the type 2 slowly changing dimension.It is the safe choice because it preserves the association of historic dimension values with facts that have been recorded in fact tables. No information is discarded.
The type 2 response has one glaring shortcoming: it cannot tell you what the dimension looked like at any point in time. This is a particular concern if you have a dimensional data warehouse architecture or stand-alone data mart. In these architectures, the dimensional model doubles as the integrated repository of granular data. In a Corporate Information Factory, maintaining a full history in the dimensional data mart is a lesser concern since there is also an enterprise data warehouse repository to hold it. Although not configured for direct access, at least the information is not thrown away.
An additional fact table can come to the rescue and be used to track the history of changes to the dimension. This fact table has the odd characteristic of having exactly the same number of rows as the dimension table, but it does the job. Many designers instinctively gravitate to a more flexible alternative: supplementing the type 2 response with time stamps. The time-stamped dimension permits three forms of point-in-time analysis within the dimension table itself:

  • Easily order a chronological history of changes
  • Quickly select dimension rows that were in effect for a particular date
  • Easily identify the dimension rows currently in effect

The time-stamped dimension has an unusual property. Joined to a fact table, it behaves like any other dimension table. Used on its own, it also exhibits some of the characteristics of a fact table. The time-stamped approach can also be tremendously useful to ETL developers charged with loading data into fact tables.

Point-in-Time Status of a Dimension

Often, one or more dimension tables in the data warehouse represent closely watched entities. The history of attribute values is significant and is often monitored irrespective of any associated transactions. Documents, contracts, customers, and even employees may be subjected to this deep scrutiny. When it is necessary to support point-in-time analysis within a dimension table, type 2 changes alone will not do the job.

Type 2 Not Sufficient

A type 2 slowly changing dimension preserves the history of values of an attribute and allows each fact to be associated with the correct version. Although this preserves the history of facts, it is not sufficient to provide for point-in-time analysis. What version was current on a particular date? Unless a fact exists for the date in question, it is impossible to know. This is best understood via an example. The star in below has a policy dimension. The table includes numerous attributes describing significant characteristics of a health insurance policy.

 

The history of these attributes is carefully followed by the business, so they are designated as type 2. The policy dimension table is associated with a fact table that tracks policy payments that have been made by the policy holder.

The slow changes that accumulate for one particular policy are illustrated in the lower part of the figure. Policy number 40111 is held by someone named Hal Smith and apparently has been active for quite some time. You can see that the policy has undergone several changes. Initially, Hal was single and his policy covered himself alone. Later, he married, but coverage was not added for his spouse. Subsequently, his spouse did become covered, and still later coverage was added for a child. When coverage was added for a second child, you can see that Hal also increased his deductible.

The insurance company needs to be able to understand what each policy looked like at any given point in time. For example, users might want to know how many policy holders were married versus how many were single on a particular date, or what the total number of covered parties was at the close of a fiscal period. Policy payments are completely irrelevant to this analysis.

Unfortunately, the design in Figure above is not able to answer these questions. Although the dimension table records all the changes to policies, it does not associate them with specific time periods. Was Hal married on November 1, 2005? The dimension table tells us that at different times he has been single and married, but not when. Unless there happens to be a row recorded for the policy in payment_facts for November 1, 2005, there is no way to know what the policy looked like on that date.

Tracking Change History Through a Fact Table

Point-in-time analysis of a closely watched dimension can be supported by creating a fact table expressly for the purpose. A row is recorded in the fact table each time the dimension changes. Each row in this fact table contains a foreign key identifying the new row in the dimension table, and one identifying the date it became effective. An additional foreign key can be maintained to indicate the date on which the row expired, which will help produce point-in-time analysis. If each change occurs for a particular reason, this may be captured via an additional dimension. An example appears in Figure below. 

The fact table in the figure, policy_change_facts, logs changes to the policy dimension. Its grain is one row for each policy change. Each row contains a policy_key representing the changed policy. A transaction type dimension contains reference information indicating the reason for the change—a new policy, a policy change, or policy cancellation. Two keys refer to the day dimension. Day_key_effective indicates the day on which the policy change went into effect. Day_key_expired will be used to indicate when it was superseded by a new version.

You may notice that this fact table does not contain any facts. That is okay; it is still useful. For example, it can be used to count the number of policies in effect on a particular date with married versus single policy holders. Fact tables like this are known as factless fact tables.

The dates associated with each row merit some additional scrutiny. As noted, the fact table includes a day_key_effective, indicating when the change went into effect, as well as a day_key_expired, indicating when the change was superseded by a new version. It is useful to avoid overlapping dates, in order to avoid any confusion when building queries. For each row in the fact table, the effective and expiration dates are inclusive. If a policy changes today, today’s date is the effective date for the new version of the policy. Yesterday’s date is the expiration date for the previous version. The pair of dates can be used to determine the policy’s status at any point in time. Add one filter looking for an effective date that is before or equal to the date in question, and another for an expiration date that is greater than or equal to the date in question.

The current version of a policy has not expired, and it has no expiration date. Instead of recording a NULL key, it refers to a special row in the day dimension. This row typically contains the largest date value supported by the relational database management system (RDBMS), such as 12/31/9999. This technique avoids the need to test for null values when building queries. When the row does expire, the date will be replaced with the actual expiration date.

It is also necessary to look a bit more closely at the grain of the fact table, described earlier as “one row for each policy change.” It is important to document what exactly a “policy change” is. If Hal Smith’s family size and number of children change on the same day, is that one change or two? In the case of policy changes, it is likely to represent a set of changes that are logged in a source system as part of a single transaction. In other cases, it may be necessary to log each change individually. This may require the ability to store multiple changes for a single day. To support this, you can add a time dimension and a pair of time_keys to supplement the day_keys. The time_keys refer to the time of day the record became effective (time_key_effective) and the time of day it expired (time_key_expired.) 

The policy_change_facts star effectively captures the history of changes to the policy dimension, so that people can identify what policies looked like at any particular point in time. This information can now be accessed, even for days when there are no payments. You may have noticed something peculiar about this fact table. Since it contains a row for each change to a policy, it has the same number of rows as the policy dimension itself. This is not necessarily a bad thing, but it suggests that there may be a more effective way to gather the same information.

A Time-Stamped Dimension in Action

A time-stamped version of the policy dimension table is illustrated in below Figure. The first block of attributes contains the usual dimension columns: a surrogate key, the natural key, and a set of dimension attributes. Four additional columns have been added: transaction_type, effective_date, expiration_date, and most_recent_version.

To the right of the table diagram, a grid illustrates the rows recorded in this table for policy 40111, which belongs to our old friend Hal Smith. The first row shows that Hal’s policy went into effect on February 14, 2005. At the time, he was single. The policy remained in this state through February 11, 2006. On the next day, February 12, a policy change caused the next row to become effective. This row updated the policy to show that Hal married, with his family size increasing from one family member to two. The row shows that there was only one covered party, which means that Hal’s spouse was not covered by the policy. Additional changes to the policy are reflected in the subsequent rows. Study the table closely, and you will see that the effective_date and expiration_date for each changed row line up closely; there are no gaps.

The last row in the illustration shows the current status of Hal Smith’s policy. Since it has not expired, the expiration_date is set to 12/31/9999. This date has been specifically designated for use when the policy has not expired. When we are querying the table, this value allows us to avoid the additional SQL syntax that would be necessitated if a NULL had been used.

To filter a query for policy versions that were in effect on 12/31/2006, for example, we only need to add the following to the query predicate:

WHERE

12/31/2006 >= effective_date AND

12/31/2006 <= expiration_date

The most_recent_version flag is added as a convenience, allowing browse queries to filter a dimension table for current records only. In keeping with guidelines from Chapter 3, this column contains descriptive text rather than Boolean values or “Y” and “N.” To filter the dimension for the current policy versions, a simple constraint is added:

WHERE most_recent_version = "Current"

It is also quite simple to single out a single policy and produce an ordered list of changes:

WHERE policy_number = 40111

ORDER_BY effective_date

TIP:A time-stamped dimension adds non-overlapping effective and expiration dates for each row, allowing each type 2 change to be localized to a particular point in time. The effective date can be used to order a transaction history; the effective and expiration dates allow filtering for a specific point in time; a most_recent_version column can simplify filtering for current status.

 

This technique can be elaborated upon for situations where multiple discrete changes must be captured during a single day. While it represents additional work to maintain a time-stamped dimension, it pays dividends when it comes to maintaining associated fact tables.

Multiple Changes on a Single Day

The policy example only allows for one version of a policy on a given day. If it is necessary to track changes at a finer grain, this can be achieved by adding columns to capture the time of day at which the record became effective and expired: effective_time and expiration_time. Depending on the business case, these columns may represent hours (24 possible values), minutes (1,440 values), or seconds (8,640 possible values). Together with the effective_date and expiration_date, they allow for multiple versions within a given day.

As with the day columns in the prior example, the time columns must allow for no overlap and leave no gaps. For policy changes recorded at the minute level, for example, if one row becomes effective at 12:48 PM, the prior row must expire at 12:47 PM. With the time columns in place, it is possible to have multiple versions of a policy on a given day. This means that qualifying a query for a particular date may pick up multiple versions of the same policy. This is not desirable if, for example, you are counting the number of covered parties across all policies on a particular date. You must now supplement the date qualifications with time qualifications, as in:

WHERE

12/31/2006 >= effective_date AND

12/31/2006 <= expiration_date AND

24:00 >= effective_time AND

24:00 <= expiration_time

You can add a last_change_of_day flag to simplify this kind of query. It will be set to “Final” for the last change to a given policy on a day. End-of-day status for a particular date can now be captured by using this SQL:

WHERE

12/31/2006 >= effective_date AND

12/31/2006 <= expiration_date AND

last_change_of_day = "Final"

Time-Stamped Dimensions and the ETL Process

Construction of a time-stamped dimension places an extra burden on the developers of the process that loads and maintains the dimension table. When a type 2 change occurs, it will be necessary to identify the prior row to update its expiration_date and most_recent_version columns. Extract, transform, load (ETL) developers may also be charged with grouping a set of changes that occur on a single day into a single time-stamped dimension record. This additional work is not trivial, but ETL developers will find benefits in other areas.

When loading transactions into a fact table, such as above policy_payment_facts, the ETL process must determine what foreign key to use for each dimension. If a payment is recorded for policy 40111, for example, which policy_key is used in the fact table? Without time stamps, it would be necessary to compare the policy characteristics in the source system with the type 2 attributes in the dimension table. If a payment comes in for policy 40111, with a family size of four and a deductible amount of 500, for example, then surrogate key 14922 should be used. 

When dimension rows are time-stamped, this work is greatly simplified. To choose the correct surrogate key value, the ETL process need only know the natural key and the date of the transaction. This date can be compared to the effective_date and expiration_date columns in the dimension table to identify the correct key value for use in the fact table. This characteristic of a time-stamped dimension is particularly valuable when fact tables are implemented incrementally. Suppose, for example, that a year after the policy payment star is implemented, another star is added to track claims against policies. Like the policy payments star, policy_claim_facts will involve the policy dimension table. If each row in the policy dimension has been time-stamped, it will be easy to load the last year’s worth of claims into the new fact table. For each historic claim, the date of the transaction is used to single out the appropriate row in the policy table. Without these time stamps, it would be necessary for the source system to supply not only the date of each claim but also a complete picture of the policy, including the policy_number and each of the type 2 attributes. This set of values would be used to identify the associated row in the policy table. 

It is not necessary to time-stamp every dimension. The additional work may not be necessary if point-in-time analysis is not required. Dimensions that contain reference data, such as the transaction_type table in above figure, do not require time stamps. For core dimensions that conform across multiple subject areas, however, time-stamping is a useful enhancement.

Dimension and Fact

As with any other dimension table, the time-stamped dimension can be joined to a fact table to analyze facts. From above figure, for example, can be joined to a policy_payment_facts table to study payments in the usual way. Need to know what were the total January payments on policies covering two or more people? Join the policy dimension to the fact table and constrain on covered_parties. Want to compare the payment habits of single versus married policy holders? Group the query results by marital_status.

When analyzing the dimension table without a fact table, something interesting happens. Some of the dimension attributes take on the characteristics of facts, exhibiting the property of additivity. For example, the following query captures the number of covered parties by state on December 31, 2006:

SELECT

state,

sum(covered_parties)

FROM

policy

WHERE

12/31/2006 >= effective_date AND

12/31/2006 <= expiration_date

GROUP BY

state 

In this query, the dimension column covered_parties is behaving as a fact. Some people like to call it a hybrid attribute since it can be used either as a dimension or as a fact. This is a common phenomenon in time-stamped dimensions.

 

参考至:《Star Schema The Complete Reference》

本文原创,转载请注明出处、作者

如有错误,欢迎指正

邮箱:czmcj@163.com

0
0
分享到:
评论

相关推荐

    Oracle Timestamp with Time zone & java

    Oracle的Timestamp with Time Zone类型与Java的交互是数据库编程中一个重要的知识点,特别是在处理跨越时区的数据时。本文将深入探讨这两个概念以及它们在实际应用中的互动。 Oracle的Timestamp with Time Zone类型...

    MySQL 5.6 中的 TIMESTAMP 和 explicit_defaults_for_timestamp 参数

    在安装MySQL 5.6时,你看到的警告提示`TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details)`意味着在旧版本...

    oracle中TIMESTAMP与DATE比较

    在Oracle数据库中,`TIMESTAMP`与`DATE`两种数据类型是用于存储日期和时间信息的关键组成部分,但它们之间存在显著的区别,特别是在处理时间和精度方面。本文将深入探讨这两种数据类型的特点,以及如何在实际应用中...

    oracle --timestamp

    根据提供的标题、描述、标签及部分内容,我们可以了解到这段文本主要涉及Oracle数据库中处理时间戳(`TIMESTAMP`)的相关操作。接下来将详细解释这些内容所包含的关键知识点。 ### 关键知识点解析 #### 1. `...

    【源码阅读】 protobuf 中的 timestamp 包

    文章目录Timestamptimestamp.go如何使用 Timestamp path: google/protobuf/timestamp.proto 在 timestamppb 中 Timestamp 包含两个字段 seconds 表示秒 nanos 表示纳秒 message Timestamp { int64 seconds = 1; ...

    MySQL 5.6 中TIMESTAMP with implicit DEFAULT value is deprecated错误

    在MySQL 5.6中,当你遇到"TIMESTAMP with implicit DEFAULT value is deprecated"这个警告时,这表明你在数据库的表定义或数据插入语句中使用了时间戳(TIMESTAMP)字段,并且该字段没有显式地设定默认值。这种做法在...

    python timestamp和datetime之间转换详解

    def timestamp2string(timestamp): try: d = datetime.fromtimestamp(timestamp) str1 = d.strftime("%Y-%m-%d %H:%M:%S.%f") # 返回去掉不必要的微秒后缀的字符串 return str1[:-3] except Exception as e: ...

    ICMP timestamp请求响应漏洞 修复 Traceroute探测漏洞 修复 linux 7

    ICMP timestamp请求响应漏洞 修复 Traceroute探测漏洞 修复 使用firewall-cmd打开关闭防火墙与端口 linux 7 ICMP timestamp请求响应漏洞 修复 Traceroute探测漏洞 修复 使用firewall-cmd打开关闭防火墙与端口 linux ...

    MySQL错误TIMESTAMP column with CURRENT_TIMESTAMP的解决方法

    there can be only one TIMESTAMP column with CURRENT_TIMESTAMP in DEFAULT or ON UPDATE clause." 这意味着在MySQL 5.5及更早版本中,表中只能有一个TIMESTAMP列具有默认值CURRENT_TIMESTAMP或ON UPDATE CURRENT...

    timestamp2.data

    timestamp2.data

    有关java中的Date,String,Timestamp之间的转化问题

    Java 中的 Date、String 和 Timestamp 之间的转换问题 Java 中的日期和时间处理是编程中非常重要的一方面,Date、String 和 Timestamp 是三种常用的日期和时间类型,本文将详细介绍它们之间的转换问题。 一、获取...

    使用TimeStamp控制并发问题示例

    2. 用户界面:在Web页面中,你需要展示Timestamp值给用户,并在更新操作中隐藏此字段,防止用户直接编辑。例如,上述代码片段展示了如何在ASP.NET网页中创建一个只读的Timestamp输入框。 ```html ...

    API接口设计之token、timestamp、sign

    使用Spring Security框架,我们可以配置OAuth2或JWT的安全认证。对于`Timestamp`和`Sign`的验证,可以创建一个验证服务,该服务接收请求,提取相关参数,然后执行签名验证。 总的来说,`Token`、`Timestamp`和`Sign...

    关于Hinbernate中TimeStamp类型字段处理的小例子

    2. **Hibernate中的Timestamp映射**: - Hibernate允许开发者通过配置映射文件(如.hbm.xml)或注解方式将Java对象的属性映射到数据库表的列上。对于Timestamp类型,我们需要明确指定其映射关系。 - 使用XML配置,...

    Timestamp与Date互转.docx

    2. Date转Timestamp 相反,如果我们需要将Date类型转换为Timestamp类型,该如何实现呢?在这种情况下,我们可以使用SimpleDateFormat类来格式化Date类型,然后使用valueOf()方法将格式化后的字符串转换为Timestamp...

    TimeStamp(用java实现时间戳)

    TimeStamp(用java实现时间戳)

    C#更新SQLServer中TimeStamp字段(时间戳)的方法

    在C#编程中,SQL Server的时间戳(TimeStamp)字段是一个特殊的数据类型,它与我们通常理解的日期时间无关,而是用来记录数据行的版本或更改信息。本文将深入探讨如何在C#中读取和更新SQL Server中的Timestamp字段。...

    timestamp_v1.zip_stonetnd_timer_timestamp

    Timestamp程序会创建用户指定数量的线程,每一个线程都将执行一个简单的for循环。每次循环都会读取时钟计数器的值并将其保存到一个数组中。每个线程将这些值保存到各自的数组中。 通过观察timestamp的值,程序可以...

Global site tag (gtag.js) - Google Analytics