Monday, February 21, 2011

-- By RK...

Designing the Mapping/Workflow Efficiently
(Performence Wise....)
  1. I would always suggest you to think twice before using an Update Strategy, though it adds me certain level of flexibility in the mapping. If you have a straight-through mapping which takes data from source and directly inserts all the records into the target, you wouldn’t need an update strategy.
  2. Use a pre-SQL delete statement if you wish to delete specific rows from target before loading into the target. Use truncate option in Session properties, if you wish to clean the table before loading. I would avoid a separate pipe-line in the mapping that roans before the load with update-strategy transformation.
  3. You have 3 sources and 3 targets with one-on-one mapping. If the load is independent according to business requirement, I would create 3 different mappings and 3 different session instances and they all run in parallel in my workflow after my “Start” task. I’ve observed that the workflow runtime comes down between 30-60% of serial processing.
  4. PowerCenter is built to work of high volumes of data. So let the server be completely busy. Induce parallelism as far as possible into the mapping/workflow.
  5. Needless to say, if any transformation waits for complete source data to be arrived in before it can proceed further in the mapping, use a sorter transformation to speed up the process. For eg: an Aggregator transformation, a Joiner transformation etc will perform well if sorted data is given to them and that a sorter transformation at the start will make the efficient.
  6. Push as much filtering as possible into the SQL override of the source-qualifier/lookup to make the database handle most of the filtering as databases are experts in doing that stuff than using a filter transformation.
  7. As I always say, the above task is similar to manual push-down optimization.
Try to push as much processing as possible into the database source or target based on the necessity. Few instances where we can do so.
Using analytics
Instead of using aggregator transformation or a combination of sorter/expression/filter for occassions when they are demanded, we can make use of the anlytic functions of the database. For eg: While reading your source data from the table, you may need to identify the last record from the source. For this requirement, the first option that comes into mind is something like this:
-One source pipeline picking up the data records and the second pipeline counting the number of source records and eventually you equate that number to the current record number. When they are equal, it implies that the current record is the last record from the source.
-People resort to a count(*) and GROUP BY from SQL or an aggregator/sorter/expression/filter combination for the same. Instead of this, the analytic functions provide you greater flexibility. The same task can be acheived with the below SQL.

SELECT col1,col2,col2,LEAD(col1) OVER (ORDER BY input_sort_criteria) next_record
FROM table_name
For the last record with the given "input_sort_criteria", the value will be NULL.

Avoid Stored Procedures
I have seen many mappings with unnecessary usage of Stored Procedures. Stored Procedures "are" performance killers. If run for every record in the source, stored procedures consume much of the workflow run-time. A database connection has to be established each time for a record and then the execution takes place in the database. As you can see, this beahviour leads to worst performing interfaces.
A general guideline is that when you can do it in PowerCenter using transformations, do not use store procedures.
        The two sections in this article seem to be conflicting with each other. In the first section, urged you to push as much processing as possible to the database and in the second section,  suggesting you to avoid database.
         When you connect to database using a Lookup or Source qualifier or pre/post-SQL procedures, you are really connecting to the database only ONCE for all the source records. When you connect to the database using a stored procedure transformation that is called for each record, then you are connecting to the database for as many times as the number of records in the source.
Ex: When you have a vehicle (a loading/unloading truck) at your disposal, would you take 100 bricks from one place to the other place at a time or each brick at a time in the vehicle?

No comments:

Post a Comment