135-2011: Best Solutions for Tuning Performance of ETL Jobs in SAS® Data Integration Studio
ABSTRACT
SAS® Data Integration Studio is a powerful tool for building and maintaining data warehouses and data marts. The performance of the extract, transform, and load (ETL) jobs is critical when building data warehouses and data marts. This paper discusses the time-consuming data transformations involved in ETL processes in SAS Data Integration Studio. The performance of each data transformation is benchmarked and compared, and the best solutions for speeding up ETL job performance are discussed in detail. The paper also suggests best practices for designing efficient ETL jobs.

INTRODUCTION
When data warehouses and data marts are built, a significant number of ETL (Extract, Transform, Load) processes must be implemented. The time required to build and update data warehouses and data marts depends heavily on the performance of these ETL processes. SAS Data Integration (DI) Studio is a tool designed to simplify the ETL process, and it offers several features that benefit ETL development. First, process design in DI Studio is an intuitive, point-and-click activity that makes it easy to build logical process workflows and to quickly identify input and output data flows. Second, the tool provides a powerful yet easy-to-use transformation language that supports collaboration and the reuse of processes and ETL jobs; transformations can run on any platform against any data source, and SAS programmers can also create their own custom transformations that incorporate specific business logic. Third, it includes an embedded data quality feature that allows data quality processing to be embedded within ETL jobs.

Because data volume and complexity have increased tremendously in recent years, ETL job performance is critical to building data warehouses and data marts. This paper focuses on improving the efficiency of the EXTRACT, TRANSFORM, and LOAD processes of an ETL job on large volumes of data; here, a large-volume table is one with at least 10 million records. The transformations and techniques available in DI Studio for each process are addressed and their performance compared. The paper also provides best practices for ETL processes.

EXTRACT
In the EXTRACT process, data is extracted from the source systems to build the target table. The job in this process creates either a view or a physical work table for each data set that will be used as a source for the next job. Data type conversions can also be considered at this stage to ensure that the structure of the extract table mimics that of the final target table. Note that if subsequent jobs fail, the staging file generated in this step can be used to avoid re-running the job that created it. Several different transformations can be used to extract data:
• EXTRACT TRANSFORMATION
• SORT TRANSFORMATION
• USER DEFINED SET TRANSFORMATION

EXTRACT TRANSFORMATION
The SAS code behind the EXTRACT transformation is a SELECT statement in PROC SQL. The target table can be sorted or unsorted, depending on whether an ORDER BY clause is used, as shown in the sketch below.
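The following code is a minimal sketch of the kind of PROC SQL step that the EXTRACT transformation generates; the library, table, and column names (srclib.SALES_DETAIL, work.SALES_EXTRACT, customer_id, order_date, sale_amount) and the WHERE filter are hypothetical placeholders, not taken from the paper's benchmark tables.

proc sql;
   /* build the staging (extract) table from the source system */
   create table work.SALES_EXTRACT as
   select customer_id,
          order_date,
          sale_amount
   from srclib.SALES_DETAIL
   where order_date >= '01JAN2010'd   /* extract only the rows needed downstream */
   order by customer_id, order_date;  /* drop the ORDER BY to leave the target unsorted */
quit;

Whether the ORDER BY clause appears determines whether the extract produces a sorted or an unsorted target table, which can affect how much work downstream sort and load steps need to do.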