Keys and history
In a star schema, each dimension table is given a surrogate key. This column is a unique identifier, created exclusively for the data warehouse. Surrogate keys are assigned and maintained as part of the process that loads the star schema. The surrogate key has no intrinsic meaning; it is typically an integer. Surrogate keys are sometimes referred to as warehouse keys. The surrogate key is the primary key of the dimension table.
Figure 1-5 A simple star schema for the orders process
In this article, surrogate keys will be easily identifiable by the suffix “_key” in the column name. In Figure 1-5, for example, the surrogate key for the customer dimension is called customer_key, the surrogate key for the salesperson dimension is called salesperson_key, and so forth. Illustrations in this book will always list the surrogate key for a dimension table as its first attribute.
Dimension tables also contain key columns that uniquely identify something in an operational system. Examples in Figure 1-5 include customer_id, sku, and salesperson_id. In the operational systems, these columns identify specific customers, products, and salespeople, respectively. These key columns are referred to as natural keys.
The separation of surrogate keys and natural keys allows the data warehouse to track changes, even if the originating operational system does not. For example, suppose that customer ABC Wholesalers is identified by customer_id 10711 in an operational system.
If the customer changes its headquarters location, the operational system may simply overwrite the address for customer_id 10711. For analytic purposes, however, it may be useful to track the history of ABC Wholesalers. Since the star schema does not rely on customer_id to identify a unique row in the customer dimension, it is possible to store multiple versions of ABC Wholesalers, even though both have the same customer_id—10711. The two versions can be distinguished by different surrogate key values. While it would also be possible to support change tracking by supplementing a natural key with a sequence number, the surrogate key allows fact and dimension tables to be joined based on a single column.
Surrogate Keys and Natural Keys
In the star schema in Figure, surrogate keys for each dimension table are labeled “SK.” Separate and distinct from surrogate keys, one or more natural keys will also be present in most dimension tables.
The natural keys are identifiers carried over from source systems. They may not uniquely identify a row in the data warehouse, but they do identify a corresponding entity in the source system. In above Figure, natural key columns are designated “NK.”
Unlike surrogate key values, the values in natural key columns may have meaning to users of the data warehouse. Even when they do not carry significant meaning, their presence is necessary for the ETL routines that load fact tables.
Sometimes, the natural key for a dimension table consists of more than one column.This occurs when the source system uses a multi-part key to identify the entity. For example,a purchasing contract may be identified by a type code and sequence number. When more than one system can be the source for a dimension, the natural key may be composed of the identifier from the source system and an additional identifier that indicates which source it came from. For example, a bank may have more than one system for deposit accounts after acquiring another bank. The natural key for a customer dimension might,therefore, consist of an identifier used in a source system, in conjunction with a column that indicates the system from which the identifier came.
The use of surrogate keys as unique identifiers allows the data warehouse to respond to changes in source data in whatever manner best fits analytic requirements. Because the dimension table does not rely on the natural key to identify unique rows, it can maintain history even if the source system does not. For example, an order entry system might contain a record for customer_id 404777, which includes the customer’s address. If the system overwrites the address when a customer moves, it is not tracking history. Were the
customer dimension table to use the same customer_id to identify unique rows, it would be able to store only one row for customer_id 404777. It would be unable to maintain the history of the address. By using a surrogate key, it becomes possible to maintain two versions of customer_id 404777. This technique is known as a type 2 slow change.
A possible alternative to the use of a surrogate key is to supplement the natural key with a sequence number. For example, the primary key of the customer dimension table might consist of the customer_id and a version_number column that contains a sequence number. Like the use of surrogate keys, this technique permits the data warehouse to track history independently of the source system, allowing the table to store multiple versions of a customer. This approach provides no value in simplifying the schema design or load
process, however, which must still identify and maintain version history. More importantly,this technique requires multi-part foreign keys to be maintained in the fact table. If customers are identified by customer_id and sequence_no, this pair of columns must be present in order_facts as well. This multi-column foreign key complicates the join process, makes SQL more difficult to read, and in some cases may befuddle efforts to optimize the RDBMS for star join query execution.
Another theoretical alternative to the use of a surrogate key is to supplement a natural key with time stamps. While time stamps may be useful, a multi-part foreign key would still be required in fact tables, potentially leading to the same difficulties as the sequence number. In addition, a designer may be tempted to eliminate multi-column joins by storing only the natural key in the fact table; however, this severely complicates queries and risks error. For example, assume that customer_id 404777 has moved, and therefore has two
rows in the dimension table. Each fact table row contains only the natural key 404777. To identify which version of the customer corresponds to each fact table row, it is necessary to compare order date with the time stamps in the dimension table. This process can be onerous, particularly if one is constructing a report that aggregates a large number of orders. It also becomes difficult for database administrators to tune the system, preventing them, for example, from declaring a foreign key in the fact table and potentially leading to
poor query optimization. Worse, if the date qualification is omitted, facts associated with customer_id 404777 will be double-counted.
It is not common practice to use version numbers or time stamps as part of a unique identifier. Surrogate keys simplify the schema design and allow for clean, single-column joins. Time stamps are frequently included in dimension tables, but not as part of the unique identifier.
参考至:《Star Schema The Complete Reference》
如有错误,欢迎指正
邮箱:czmcj@163.com
相关推荐
the world's leading data warehousing authority, whose previous books have sold more than 150,000 copies Delivers real-world solutions for the most time- and labor-intensive portion of data ...
使用 SAP Data Warehouse Cloud 进行数据获取、准备和建模 本文档主要讲解了如何使用 SAP Data Warehouse Cloud 进行数据获取、准备和建模。SAP Data Warehouse Cloud 是一个云端的数据仓库解决方案,旨在帮助用户...
Ralph Kimball,作为全球数据仓库权威,与Joe Caserta共同编写的《The Data Warehouse ETL Toolkit》一书,为数据仓储的ETL流程提供了实际的解决方案。本书聚焦于数据仓储中最耗时、最耗力的部分——数据预处理,即...
### DATA WAREHOUSE心得 #### 一、为什么需要推动Data Warehouse? 推动Data Warehouse的主要原因在于解决企业内部数据管理和分析方面存在的各种挑战。以下是从标题和描述中提炼出的知识点: 1. **数据可信性问题...
### Azure SQL 数据仓库知识点详解 #### 一、Azure SQL 数据仓库简介 **Azure SQL 数据仓库** 是微软提供的一项关系型数据仓库服务,属于完全由微软管理的平台即服务(PaaS)。该服务支持从小到大的各种数据存储...
数据仓库(Data Warehouse)入门教程。
2. **加载CSV数据**:将"datawarehouse"中的CSV文件加载到Hive表中,可以使用`LOAD DATA LOCAL INPATH 'path_to_csv_file' INTO TABLE table_name;`命令,其中'path_to_csv_file'是本地文件系统上的CSV文件路径。 ...
《Oracle Data Warehouse Management》是一本深入探讨Oracle数据仓库管理的专业著作。在当今大数据时代,数据仓库作为企业决策支持系统的核心,对于企业的运营和发展至关重要。Oracle作为业界领先的数据管理解决方案...
### 数据仓库工具包(The Data Warehouse Toolkit)第二版——维度建模完全指南 #### 书籍概览 《数据仓库工具包》(The Data Warehouse Toolkit)是业界公认的一本关于数据仓库设计与构建的经典著作,由数据仓库...
The first edition of Ralph Kimball's The Data Warehouse Toolkit introduced the industry to dimensional modeling, and now his books are considered the most authoritative guides in this space....
data warehouse 1
数据仓库(Data Warehouse) 面向主题的、集成的、相对稳定的、反映历史变化的数据集合; 作为一种“数据存储”的体系结构,其支持结构化、启发式、标准化查询、分析报告和决策支持。
- **自顶向下的开发方法**:首先构建企业数据仓库(Enterprise Data Warehouse, EDW),然后根据各个业务领域的信息需求创建相应的数据集市(Data Mart)。 - **自底向上的开发方法**:先建立独立的数据集市,并计划在...
根据提供的文件信息,“The Data Warehouse ETL Toolkit”是一本关于数据仓库中提取、清洗、转换和交付数据(ETL)过程的专业书籍。本书由Ralph Kimball和Joe Caserta撰写,并由Wiley Publishing, Inc.出版。以下是...
数据仓库工具箱的最新版 数据仓库领域大师级作品 英文版 [目前国内没出中文版] 2013年出版
首先,数据仓库(Data Warehouse)的概念是由数据仓库之父W.H.Inmon提出,他定义数据仓库为面向主题的、集成的、时变的、非易失的数据集合,用于支持管理决策过程。数据仓库的设计遵循从上至下的方法,它与操作...
根据提供的文件信息,“Datawarehouse ETL Toolkit”是Ralph Kimball的经典著作之一,该书主要聚焦于数据仓库的提取(Extract)、转换(Transform)和加载(Load)过程,即ETL流程的技术细节与实践方法。以下是对这...
Ralph Kimball. The Data Warehouse Toolkit.. The Complete Guide to Dimensional Modelling