Monday, February 21, 2011

 

-- By RK

Database Indexing for the Lookup SQL - A myth

I have seen people suggest an index to improve the performance of any SQL. This suggestion is often incorrect, and it is especially so when it comes to indexing the condition port columns of a Lookup SQL.

Before explaining why it is incorrect, let me detail how a Lookup works. To explain with an example, we will take the usual HR schema EMP table, with EMPNO, ENAME and SALARY as its columns.

Let us say there is a lookup in an ETL mapping that checks for a particular EMPNO and returns ENAME and SALARY from the Lookup. The output ports for the Lookup are "ENAME" and "SALARY", and the condition port is "EMPNO". Imagine that you are facing performance problems with this Lookup and one of the suggestions is to index the condition port.

As suggested (incorrectly), you create an index on the EMPNO column of the underlying database table. In practice, the SQL the lookup executes is going to be this:

SELECT EMPNO, ENAME, SALARY
FROM EMP
ORDER BY EMPNO, ENAME, SALARY;

The data returned by this query is stored in the Lookup cache, and each record from the source is then looked up against this cache. So the check against the condition port column is done in the Informatica Lookup cache and "not in the database", and any index created in the database has no effect on it.

You may be wondering whether we can replicate the same indexing in the Lookup cache. You don't have to worry about it: PowerCenter creates an "index" cache and a "data" cache for the Lookup. In this case, the condition port data - "EMPNO" - is indexed and hashed in the "index" cache, and the remaining columns along with EMPNO are kept in the "data" cache.

I hope now you understand why indexing condition port columns doesn't increase performance.

Having said that, I want to take you to a different kind of lookup, one where you have disabled caching. In this kind of Lookup there is no cache; every time a row is sent into the lookup, the SQL is executed against the database. In this scenario, a database index "may" help. But if the performance of the lookup is a problem, then the "cache-less" lookup itself may be the problem.

I would go for a cache-less lookup only when my source has fewer records than my lookup table. In that case ONLY does indexing the condition ports help; everywhere else, it is mere chance that makes the database pick up the index.
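To make the contrast concrete, here is roughly the per-row query a cache-less lookup would fire for the EMP example above (a sketch; the exact statement PowerCenter generates may differ):

SELECT ENAME, SALARY
FROM EMP
WHERE EMPNO = <current source row's EMPNO>
-- with one such query per source row, a database index on EMPNO can genuinely help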

I'm sure you will have many questions regarding Lookups after reading this blog article. I'm ready to answer, fire away.
-- By RK...


Designing the Mapping/Workflow Efficiently
(Performance Wise...)
  1. I would always suggest you think twice before using an Update Strategy, even though it adds a certain level of flexibility to the mapping. If you have a straight-through mapping that takes data from the source and directly inserts all the records into the target, you wouldn’t need an update strategy.
  2. Use a pre-SQL delete statement if you wish to delete specific rows from the target before loading into it (a small sketch follows this list). Use the truncate option in the session properties if you wish to clean the table before loading. I would avoid a separate pipeline in the mapping that runs before the load with an Update Strategy transformation.
  3. Say you have 3 sources and 3 targets with one-to-one mappings. If the loads are independent according to the business requirement, I would create 3 different mappings and 3 different session instances and run them all in parallel in my workflow after the “Start” task. I’ve observed that the workflow runtime comes down to 30-60% of the serial runtime.
  4. PowerCenter is built to work on high volumes of data, so let the server be completely busy. Introduce parallelism wherever possible into the mapping/workflow.
  5. Needless to say, if any transformation waits for the complete source data to arrive before it can proceed further in the mapping, use a Sorter transformation to speed up the process. For example, an Aggregator or a Joiner transformation performs well when given sorted data, and a Sorter transformation placed before it makes it efficient.
  6. Push as much filtering as possible into the SQL override of the Source Qualifier/Lookup so that the database handles most of the filtering; databases are better at that job than a Filter transformation.
  7. As I always say, the above task is similar to manual push-down optimization.
Try to push as much processing as possible into the database, on the source or target side, as the need demands. A few instances where we can do so follow.
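Before those, here is the pre-SQL delete sketch promised in point 2; the table, column and parameter names are purely illustrative:

DELETE FROM TGT_EMPLOYEE
WHERE LOAD_DATE = TO_DATE('$$LOAD_DATE', 'MM/DD/YYYY');
-- for a full clean-up, prefer the session-level truncate target table option instead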
Using analytics
Instead of using an Aggregator transformation, or a combination of Sorter/Expression/Filter when the occasion demands it, we can make use of the analytic functions of the database. For example, while reading your source data from a table, you may need to identify the last record from the source. For this requirement, the first option that comes to mind is something like this:
- One source pipeline picks up the data records while a second pipeline counts the number of source records, and you eventually compare that count with the current record number. When they are equal, the current record is the last record from the source.
- People resort to a count(*) with GROUP BY in SQL, or an aggregator/sorter/expression/filter combination, for the same purpose. Instead, the analytic functions give you greater flexibility. The same task can be achieved with the SQL below.

SYNTAX
SELECT col1, col2, col3, LEAD(col1) OVER (ORDER BY input_sort_criteria) next_record
FROM table_name
Ex:
SELECT EMPNO,ENAME,DEPTNO, LEAD(ENAME) OVER (ORDER BY EMPNO) NEXT_RECORD FROM EMP
For the last record with the given "input_sort_criteria", the value will be NULL.
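Once LEAD() is in place, spotting the last record is just a NULL check. A hedged variant on the EMP example that flags it directly in the SQL:

SELECT EMPNO, ENAME, DEPTNO,
       CASE WHEN LEAD(EMPNO) OVER (ORDER BY EMPNO) IS NULL THEN 1 ELSE 0 END AS IS_LAST_RECORD
FROM EMP
-- IS_LAST_RECORD is 1 only for the last row of the chosen sort order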

Avoid Stored Procedures
I have seen many mappings with unnecessary usage of Stored Procedures. Stored Procedures "are" performance killers. If run for every record in the source, stored procedures consume much of the workflow run-time: a database connection has to be established for each record and then the execution takes place in the database. As you can see, this behaviour leads to very poorly performing interfaces.
A general guideline is that when you can do it in PowerCenter using transformations, do not use stored procedures.
Conclusion:
The two sections in this article may seem to conflict with each other. In the first section, I urged you to push as much processing as possible to the database, and in the second section, I suggested you avoid the database.
When you connect to the database using a Lookup, a Source Qualifier or pre/post-SQL, you connect to the database only ONCE for all the source records. When you connect to the database using a Stored Procedure transformation that is called for each record, you connect to the database as many times as there are records in the source.
Ex: When you have a vehicle (a loading/unloading truck) at your disposal, would you move 100 bricks from one place to another in one trip, or take one brick per trip?

Lookup - Performance Techniques

-- By RK


Lookup is an important and useful transformation when used effectively. What is a Lookup transformation? It is not just another transformation that fetches data to check against the source data; used improperly, it is a transformation that makes your flow run for ages.
Below are different scenarios where you can face problems with Lookups, and how to tackle them.

Unwanted columns:
By default, when you create a lookup on a table, PowerCenter gives you all the columns in the table. Be sure to delete the unwanted columns from the lookup, as they bloat the lookup cache. You only need the columns that are used in the lookup condition and the ones that have to be returned from the lookup.
SQL query:
We will start from the database. Find the execution plan of the SQL override and see if you can add indexes or hints to the query to make it fetch data faster. You may have to take the help of a database developer to accomplish this if you are not an SQL person yourself.
Size of the source versus size of lookup:
Let us say you have 10 rows in the source and one of the columns has to be checked against a big table (1 million rows). PowerCenter builds the cache for the lookup table and then checks the 10 source rows against the cache. It takes more time to build a cache of 1 million rows than to go to the database 10 times and look up against the table directly. Use an uncached lookup instead of building a static cache when the number of source rows is far smaller than that of the lookup table.
Conditional call of lookup:
Instead of going for a connected lookup with filters for a conditional lookup call, go for an unconnected lookup. Does the single-column return limitation bother you here? Go ahead and change the SQL override to concatenate the required columns into one big column, and break it back into individual columns at the calling side.
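A hedged sketch of such an override on the EMP example, with '~' as an illustrative delimiter (the alias and delimiter are my own choices, not a rule):

SELECT EMPNO,
       ENAME || '~' || TO_CHAR(SALARY) AS LKP_RETURN
FROM EMP
-- at the calling side, split LKP_RETURN back into ENAME and SALARY,
-- e.g. with SUBSTR/INSTR on the '~' delimiter in an Expression transformation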
JOIN instead of Lookup:
In the same context as above, if the Lookup transformation comes after the Source Qualifier and there is no active transformation in between, you can as well use the SQL override of the Source Qualifier and join to the lookup table using a traditional database join, provided both tables are in the same database and schema.
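For illustration, assuming EMP also carries DEPTNO and the lookup table is DEPT with a DNAME column (both assumptions for this sketch), the Source Qualifier override could carry the join itself:

SELECT E.EMPNO, E.ENAME, E.SALARY, D.DNAME
FROM EMP E, DEPT D
WHERE E.DEPTNO = D.DEPTNO (+)
-- the Oracle outer join keeps source rows with no match, mimicking a lookup that returns NULL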
Increase cache:
If none of the above seems to be working, then the problem is certainly with the cache. The cache that you assigned for the lookup is not sufficient to hold the data or index of the lookup. Whatever data doesn't fit into the cache spills into the cache files designated in $PMCacheDir. When PowerCenter doesn't find the data you are looking up in the cache, it swaps data from the file into the cache and keeps doing so until it finds the data. This is quite expensive for obvious reasons, being an I/O operation. Increase the cache so that the whole data resides in memory.
What if your data is huge and your whole system memory is less than that? Don't promise PowerCenter an amount of cache that cannot be allotted at runtime. If you promise 10 MB and, during the run, the system on which the flow is running can only assign 5 MB, PowerCenter fails the session with an error.
Cachefile file-system:
In many cases, if the cache directory is on a different file-system than that of the hosting server, the piling up of cache files may take time and result in latency. So, with the help of your system administrator, look into this aspect as well.
Useful cache utilities:
If the same lookup SQL is being used in some other lookup, then go for a shared cache or reuse the lookup. Also, if you have a table that doesn't get data updated or inserted very often, use a persistent cache so that the cache is built only once and reused across session runs.

Error Logging in PowerCenter

--By RK

In order to capture any Informatica PowerCenter errors into a flat file or database during runtime, Informatica Corporation suggests row-level error logging. The major disadvantage with this is that the performance of the workflow is affected because of row-level processing as opposed to block processing.

In order to overcome this, a simple approach can be followed that works for all workflows. It is based on the fact that $Session_Name.ErrorMsg stores a NULL value if the session runs fine, and otherwise stores the latest error message from the session run.

1) Create two workflow variables - one for Error message $$ERROR_MESSAGE and the other $$SESSION_NAME to store the failed session name.

2) Create an assignment task in the workflow and create links to it from each of the sessions. Please note that the flow should be TOWARDS the assignment task from the sessions.



3) Modify the link expression for all these links to $Session_Name.PrevTaskStatus = FAILED.

4) In the assignment task, assign $Session_Name.ErrorMsg to the workflow variable $$ERROR_MESSAGE and assign the session name to $$SESSION_NAME.



5) You need a bit of nested iifs to achieve this.

For variable $$ERROR_MESSAGE, the expression contains
:udf.if_null_or_blank($Session_Name_1.ErrorMsg,
    :udf.if_null_or_blank($Session_Name_2.ErrorMsg,
        :udf.if_null_or_blank($Session_Name_3.ErrorMsg,
            :udf.if_null_or_blank($Session_Name_4.ErrorMsg ,
                :udf.if_null_or_blank($Session_Name_5.ErrorMsg ,
                    :udf.if_null_or_blank($Session_Name_6.ErrorMsg ,
                        :udf.if_null_or_blank($Session_Name_7.ErrorMsg ,
                            :udf.if_null_or_blank($Session_Name_8.ErrorMsg ,
                                :udf.if_null_or_blank($Session_Name_9.ErrorMsg ,
                                    :udf.if_null_or_blank($Session_Name_10.ErrorMsg ,
                                        :udf.if_null_or_blank($Session_Name_11.ErrorMsg ,
                                            :udf.if_null_or_blank($Session_Name_12.ErrorMsg ,
                                                 'A Fatal Error occurred' 
                                 ,$Session_Name_12.ErrorMsg
                                 )
                              ,$Session_Name_11.ErrorMsg
                              )
                           ,$Session_Name_10.ErrorMsg
                           )
                        ,$Session_Name_9.ErrorMsg
                        )
                     ,$Session_Name_8.ErrorMsg
                     )
                  ,$Session_Name_7.ErrorMsg
                  )
               ,$Session_Name_6.ErrorMsg
               )
            ,$Session_Name_5.ErrorMsg
            )
         ,$Session_Name_4.ErrorMsg
         )
      ,$Session_Name_3.ErrorMsg
      )
   ,$Session_Name_2.ErrorMsg
   )
,$Session_Name_1.ErrorMsg
)




:udf.if_null_or_blank(input_String_Argument, output_if_NULL_Blank, output_if_not_NULL_Blank) is a user-defined function with expression contents

iif(isnull(input_String_Argument) or length(ltrim(rtrim(input_String_Argument))) = 0,
    output_if_NULL_Blank, output_if_not_NULL_Blank)

In the same way, create the expression for $$SESSION_NAME. It should be the same expression as for $$ERROR_MESSAGE, but in the else part of the iif(), the session names should be specified instead of ErrorMsg, as in the sketch below.
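As an illustration, here is a trimmed version for just three sessions; extend the nesting to all the sessions exactly as in the $$ERROR_MESSAGE expression, and note that the innermost literal is only a placeholder default of my own:

:udf.if_null_or_blank($Session_Name_1.ErrorMsg,
    :udf.if_null_or_blank($Session_Name_2.ErrorMsg,
        :udf.if_null_or_blank($Session_Name_3.ErrorMsg,
            'No failed session'
        ,'Session_Name_3')
    ,'Session_Name_2')
,'Session_Name_1')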

6) From this assignment task, take a link to a session that stores the contents of these workflow variables into a database table or a flat file. Let us call this session's mapping the LOG mapping.

You may question the scope of these workflow variables inside the LOG mapping. If you can use the workflow variables in the Source Qualifier SQL override, then you can get the data from them, like this:
select '$$ERROR_MESSAGE', '$$SESSION_NAME'
from dual




Take the 2 ports out of the Source Qualifier into an Expression transformation and then continue loading into a relational table target or a flat file target.

7) It is IMPORTANT to set the General tab property of the Assignment task, "Treat Input Links as", to "OR". This makes sure that if at least one session fails, the assignment task is triggered and the error is logged.

If you implement the Error Logging this way, you will be able to catch all kinds of Informatica errors.

Informatica PowerCenter Repository Tables



I am sure every PowerCenter developer has either an intention or a necessity to know about the Informatica metadata tables, where the information is stored, and so on. For starters, all the objects that we create in Informatica PowerCenter - be they sources, targets, mappings, workflows, sessions, expressions, anything related to PowerCenter - get stored in a set of database tables (call them metadata tables, OPB tables or repository tables).

* I want to know all the sessions in my folder that are calling some shell script/command in the Post-Session command task.
* I want to know how many mappings have transformations that contain "STOCK_CODE" defined as a port.
* I want to know all unused ports in my repository of 100 folders.

In repositories with a large number of sessions, workflows or mappings, it gets difficult to answer these questions with the PowerCenter client tools. After all, all of this data is stored in some form in the metadata tables. So if you know the data model of these repository tables, you will be in a better position to answer such questions.

Before we proceed further, let me stress something very important. Data in the repository/metadata/OPB tables is very sensitive, and modifications such as inserts or updates are to be made using the PowerCenter tools ONLY. DO NOT DIRECTLY USE UPDATE OR INSERT COMMANDS AGAINST THESE TABLES.

Please also note that there is no official documentation from Informatica Corporation on how these tables work. The details here are purely based on my own assumptions, research and experience. I will not be responsible for any damage caused if you use any statement other than SELECT based on the details in this blog article. That is my disclaimer. Let us move on to the contents now.

There are around a couple of hundred OPB tables in the 7.x version of PowerCenter, but in 8.x this number crosses 400. I am going to talk about a few important tables in this article; this is not a small topic to cover in one article, so I shall write a few more to cover other important tables like OPB_TDS, OPB_SESSLOG etc.
We shall start with OPB_SUBJECT now.

OPB_SUBJECT - PowerCenter folders table

This table stores the name of each PowerCenter repository folder.

Usage: Join any of the repository tables that have SUBJECT_ID as column with that of SUBJ_ID in this table to know the folder name.

OPB_MAPPING - Mappings table

This table stores the name and ID of each mapping and its corresponding folder.

Usage: Join any of the repository tables that have MAPPING_ID as column with that of MAPPING_ID in this table to know the mapping name.
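For example, a hedged query that lists every mapping with its folder (SUBJ_NAME and MAPPING_NAME are assumed column names, in line with the queries later in this article):

SELECT s.SUBJ_NAME AS FOLDER_NAME, m.MAPPING_NAME
FROM OPB_MAPPING m, OPB_SUBJECT s
WHERE m.SUBJECT_ID = s.SUBJ_ID
ORDER BY s.SUBJ_NAME, m.MAPPING_NAME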

OPB_TASK - Tasks table like sessions, workflow etc

This table stores the name and ID of each task like session, workflow and its corresponding folder.

Usage: Join any of the repository tables that have TASK_ID as a column with the TASK_ID/SESSION_ID in this table to know the task name. Observe that both sessions and workflows are stored as tasks in the repository: TASK_TYPE for a session is 68 and for a workflow it is 71.
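A hedged example that lists all sessions and workflows per folder (TASK_NAME, SUBJECT_ID and SUBJ_NAME are assumed column names, in line with the queries later in this article):

SELECT s.SUBJ_NAME AS FOLDER_NAME, t.TASK_NAME, t.TASK_TYPE
FROM OPB_TASK t, OPB_SUBJECT s
WHERE t.SUBJECT_ID = s.SUBJ_ID
AND t.TASK_TYPE IN (68, 71)   -- 68 = session, 71 = workflow
ORDER BY s.SUBJ_NAME, t.TASK_TYPE, t.TASK_NAME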

OPB_SESSION - Session & Mapping linkage table

This table stores the linkage between a session and its corresponding mapping. As mentioned in the earlier paragraph, you can join the SESSION_ID in this table with the TASK_ID of the OPB_TASK table.
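Putting the last few tables together, a hedged sketch that resolves each session to its mapping name (TASK_NAME and MAPPING_NAME are assumed column names):

SELECT t.TASK_NAME AS SESSION_NAME, m.MAPPING_NAME
FROM OPB_SESSION o, OPB_TASK t, OPB_MAPPING m
WHERE o.SESSION_ID = t.TASK_ID
AND o.MAPPING_ID = m.MAPPING_ID
AND t.TASK_TYPE = 68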

OPB_TASK_ATTR - Task attributes tables

This is the table that stores the attribute values (like Session log name etc) for tasks.

Usage: Join the ATTR_ID of this table to the ATTR_ID of the OPB_ATTR table to find what each attribute in this table means. You can read more about the OPB_ATTR table in the paragraphs below.

OPB_WIDGET - Transformations table

This table stores the names and IDs of all the transformations with their folder details.

Usage: Join the WIDGET_ID of this table with the WIDGET_ID of any of the other tables to know the transformation name and the folder details. Use this table in conjunction with OPB_WIDGET_ATTR or OPB_WIDGET_EXPR to learn more about each transformation.

OPB_WIDGET_FIELD - Transformation ports table

This table stores the names and IDs of all the transformation fields for each of the transformations.

Usage: Take the FIELD_ID from this table and match it against the FIELD_ID of tables like OPB_WIDGET_DEP to get the corresponding information.
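For instance, a hedged query listing the ports of every transformation (WIDGET_NAME, FIELD_NAME and the WIDGET_ID join column are assumptions on my part):

SELECT w.WIDGET_NAME AS TRANSFORMATION_NAME, f.FIELD_NAME AS PORT_NAME
FROM OPB_WIDGET w, OPB_WIDGET_FIELD f
WHERE f.WIDGET_ID = w.WIDGET_ID
ORDER BY w.WIDGET_NAME, f.FIELD_NAME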

OPB_WIDGET_ATTR - Transformation properties table

This table stores all the properties details about each of the transformations.

Usage: Join the ATTR_ID of this table to the ATTR_ID of the OPB_ATTR table to find what each attribute of the transformation means.

OPB_EXPRESSION - Expressions table

This table stores the details of the expressions used anywhere in PowerCenter.

Usage: Use this table in conjunction with OPB_WIDGET/OPB_WIDGET_INST and OPB_WIDGET_EXPR to get the expressions in the Expression transformation for a particular mapping or a set of mappings.

OPB_ATTR - Attributes

This table has the list of attributes and their default values, if any. You can get the ATTR_ID from this table and look it up against any of the tables where the attribute value is stored. Also make a note of the ATTR_TYPE and OBJECT_TYPE_ID before you pick up the ATTR_ID: the same ATTR_ID can appear in the table with a different ATTR_TYPE or OBJECT_TYPE_ID.

OPB_COMPONENT - Session Component

This table stores component details like Post-Session-Success-Email and the commands in Post-Session/Pre-Session tasks.

Usage: Match the TASK_ID with the SESSION_ID in the OPB_SESSION table to get to the session. To get the shell or batch command defined for the session, join this table with the OPB_TASK_VAL_LIST table on TASK_ID.

OPB_CFG_ATTR - Session Configuration Attributes

This table stores the attribute values for Session Object configuration like "Save Session log by", Session log path etc.

Wednesday, February 16, 2011

Using Incremental Aggregation



 
Overview

I have seen several posts related to Incremental Aggregation in the Informatica-I group of www.ittoolbox.com. Below is the step-by-step process to apply incremental aggregation.

Incremental aggregation is used when you need to apply captured changes in the source to aggregate calculations in a session. If the source changes only incrementally and you can capture changes, you can configure the session to process only those changes. This allows the Informatica Server to update your target incrementally, rather than forcing it to process the entire source and recalculate the same calculations each time you run the session.

When you run an incremental aggregation session for the first time, the Integration Service (IS) processes the entire source and creates a data file and an index file in the cache directory specified. The Integration Service names the index file PMAGG*.idx* and the data file PMAGG*.dat*.

Note:
1.  If you are using percentile or median functions, do not use Incremental Aggregation because IS uses system memory to process these functions in addition to the cache memory configured in the session properties.
2.  Don’t move the aggregate files without correcting the configured path or directory for the files in the session properties.
3. Don’t change the configured path or directory for the aggregate files without moving the files to the new location.
4. Don’t delete cache files or change the number of partitions.

Step-by-step Process

Source Table
empno   sal     deptno  hire_dt
1       2000    10      Nov 29 2010
2       5000    10      Nov 29 2010
3       8000    10      Nov 29 2010
4       2000    20      Nov 29 2010
5       10000   20      Nov 29 2010
6       12000   30      Nov 29 2010
7       22000   30      Nov 29 2010
8       31000   30      Nov 29 2010
9       1000    10      Nov 30 2010
10      1000    10      Nov 30 2010
11      1000    30      Nov 30 2010

Below is the flow of mapping




The data will be pulled from the source on the basis of hiredate. We will create a parameter to pass hiredate as shown below.


Filter transformation

The Filter transformation will be used to filter the data on the basis of hire date, as shown below.
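One possible filter condition, assuming $$hiredt is declared as a string mapping parameter holding the date in MM/DD/YYYY format (the port name HIRE_DT mirrors the source column):

TRUNC(HIRE_DT) = TO_DATE($$hiredt, 'MM/DD/YYYY')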


Aggregator Transformation

The Aggregator transformation will aggregate the salary department-wise (group by deptno with SUM of sal).


Update Strategy

Update Strategy transformation is used to flag rows for update.


Once you are done with the mapping design, the next step is to create a session and a workflow.

Create a session and set the properties for Target as shown below.


Check the session properties as shown below.


Set the value of $$hiredt parameter to 11/29/2010 and run the session. You will get the below rows in target.

Deptno  Sal
10      15000
20      12000
30      65000

After the first run, go to the session properties and uncheck the option “Reinitialize aggregate cache” as shown below.


Set the value of $$hiredt parameter to 11/30/2010 and run the session again. Now the value in target table will be as shown below.

Deptno  Sal
10      17000
20      12000
30      66000

Common Error:

Below is a common error people get while using Incremental Aggregation.

TE_7051 : Aggregate Error: Index file timestamp is earlier than the timestamp of AGG transformation or Mapping
TE_7029 : Aggregate Information: Creating New Index and Data Files

Resolution:
This error occurs when the mapping associated with the incremental aggregation session is modified. Make sure the associated mapping is not modified.

PowerCenter Repository Queries


This article contains some of the repository queries I have worked with.

Note: All the queries given below are for a repository on an Oracle database. Please check the syntax before running them on another database.

1. Get Mapping Level Lookup SQL Override, SQL Query of Lookup Procedure and Source Qualifier Transformations

SELECT DISTINCT
d.subject_area AS Folder, d.mapping_name, a.widget_type_name AS Transformation_Type,
a.instance_name as Transformation_Name, b.attr_name, b.attr_value, c.session_name 
FROM
REP_WIDGET_INST a, REP_WIDGET_ATTR b, REP_LOAD_SESSIONS c, REP_ALL_MAPPINGS d
WHERE b.widget_id = a.widget_id
AND b.widget_type = a.widget_type
AND b.widget_type in (3, 11)
AND c.mapping_id = a.mapping_id
AND d.mapping_id = a.mapping_id
AND b.attr_id= 1
AND b.attr_datatype=2 and b.attr_type=3
ORDER BY d.subject_area, d.mapping_name

2. Get list of tables used in Lookup transformation

SELECT
b.mapping_name,a.instance_name, c.attr_name, c.attr_value,
a.widget_type_name,b.parent_subject_area
FROM
REP_WIDGET_INST a INNER JOIN REP_ALL_MAPPINGS b ON a.mapping_id = b.mapping_id INNER JOIN
REP_WIDGET_ATTR c ON a.widget_id = c.widget_id
WHERE
(c.attr_description LIKE 'Lookup source table')

3. Get list of all sources along with mappings using the source

SELECT
b.source_database_name,b.source_name,a.mapping_name
FROM
REP_SRC_MAPPING a,REP_ALL_SOURCES b
WHERE a.source_id(+) =   b.source_id
AND   a.source_name(+) = b.parent_source_name
AND   a.subject_id(+) =  b.subject_id

4. Get Lookup connection information

SELECT
d.subject_area AS Folder, a.mapping_name AS Mapping, b.instance_name AS Transformation, c.attr_value as Connection
FROM
REP_ALL_MAPPINGS a, REP_WIDGET_INST b, REP_WIDGET_ATTR c, REP_SUBJECT d
WHERE d.SUBJECT_ID = a.SUBJECT_ID
AND a.MAPPING_ID = b.MAPPING_ID
AND b.WIDGET_TYPE = 11
AND b.WIDGET_ID = c.WIDGET_ID
AND c.ATTR_ID = 6

5. Get Stored Procedure connection information

SELECT
d.subject_area AS Folder, a.mapping_name AS Mapping, b.instance_name AS Transformation, c.attr_value as Connection
FROM
REP_ALL_MAPPINGS a, REP_WIDGET_INST b, REP_WIDGET_ATTR c, REP_SUBJECT d
WHERE d.SUBJECT_ID = a.SUBJECT_ID
AND a.MAPPING_ID = b.MAPPING_ID
AND b.WIDGET_TYPE = 11
AND b.WIDGET_ID = c.WIDGET_ID
AND c.ATTR_ID = 1

6. Get number of sessions executed on a specific day

SELECT
server_name AS Integration_Server,count(*) AS Total_Jobs
FROM
REP_TASK_INST_RUN
WHERE TO_CHAR(start_time,'mm/dd/yyyy') = '12/06/2010'
GROUP BY server_name

7. Get run details for a worklet in a specific workflow for a particular time period

SELECT DISTINCT
b.workflow_name,a.task_name, b.start_time,b.end_time
FROM
REP_ALL_TASKS a, REP_TASK_INST_RUN b
WHERE (a.task_id = b.task_id AND a.version_number = b.task_version_number )
AND (b.workflow_name = 'workflow_name' AND
b.start_time > TO_DATE('11/08/2010 06:00:00','MM/DD/YYYY HH24:MI:SS')
AND a.task_name like 'workletname%' )
ORDER BY
b.start_time


8. Get list of all the emails with attachment used as part of email task or post-session commands

SELECT DISTINCT
d.subj_name AS Folder_Name, c.workflow_name AS Workflow_Name,a.task_name AS Task_Name, b.attr_value AS Value
FROM
OPB_TASK a, OPB_TASK_ATTR b, REP_TASK_INST_RUN c, OPB_SUBJECT d
WHERE a.task_id = b.task_id
AND a.task_type = b.task_type
AND c.subject_id = a.subject_id
AND a.subject_id = d.subj_id
AND a.task_type = 65
AND b.attr_id in (2,3)
AND(b.attr_value LIKE '%\%a%' ESCAPE '\' OR b.attr_value LIKE '%\%g%' ESCAPE '\')

9. Get all the post session commands used inside each session

SELECT
a.task_name AS Task,c.subj_name AS Folder, b.PM_value AS Command
FROM
OPB_TASK a,OPB_TASK_VAL_LIST b, OPB_SUBJECT c
WHERE a.task_type=58 AND a.task_name='post_session_command'
AND b.task_id=a.task_id AND b.subject_id=c.subj_id

10. Get all the post session success commands used inside each session

SELECT
a.task_name AS Task,c.subj_name AS Folder, b.PM_value AS Command
FROM
OPB_TASK a,OPB_TASK_VAL_LIST b, OPB_SUBJECT c
WHERE a.task_type=58 AND a.task_name='post_session_success_command'
AND b.task_id=a.task_id AND b.subject_id=c.subj_id

11. Get all the post session failure commands used inside each session

SELECT
a.task_name AS Task,c.subj_name AS Folder, b.PM_value AS Command
FROM
OPB_TASK a,OPB_TASK_VAL_LIST b, OPB_SUBJECT c
WHERE a.task_type=58 AND a.task_name='post_session_failure_command'
AND b.task_id=a.task_id AND b.subject_id=c.subj_id



12. Get DTM buffer size, buffer block size and line sequential buffer length for each session

SELECT
a.SUBJECT_AREA AS Folder_Name, a.task_name AS Session_Name, b.ATTR_VALUE AS DTM_BUFFER_SIZE, c.ATTR_VALUE AS Buffer_Block_Size,
d.ATTR_VALUE AS Line_Sequential_Buffer_Length
FROM
REP_ALL_TASKS a, REP_TASK_ATTR b, REP_SESS_CONFIG_PARM c, REP_SESS_CONFIG_PARM d
WHERE a.TASK_ID = b.TASK_ID
AND a.TASK_ID = c.SESSION_ID
AND a.TASK_ID = d.SESSION_ID
AND b.ATTR_ID = 101
AND c.ATTR_ID = 5
AND d.ATTR_ID = 6
ORDER BY 1, 2


13. Get All Version Comments (For versioning enabled repositories)

SELECT
b.subject_area AS Folder_Name, b.subject_id, a.comments, a.object_name, a.version_number, a.object_type, a.user_id, a.saved_from
FROM
rep_version_props a, rep_subject b
WHERE b.SUBJECT_ID = a.subject_id
AND a.comments is not null;

14. List all workflows and their associated Integration Service

SELECT
workflow_name,server_name
FROM
REP_WORKFLOWS

Handling duplicate rows with Flat file as source






The two methods below can be used to get rid of duplicate rows when you are using a flat file as the source.
1)      Using Aggregator Transformation
2)      Using Rank Transformation
We will take the below file as source for this example.


Using Aggregator Transformation

Design a mapping as shown below:


 Sorter Transformation

Sorter is used to sort the rows according to roll.


Aggregator Transformation

The Aggregator will group the rows together based on roll so that there will be only one row per roll sent to the target. The Aggregator sends the last row in each group to the target.



Output

You will get only one row per roll in the target file as shown below.


Using Rank Transformation

Design a mapping as shown below:


Rank Transformation

Set the properties of the Rank transformation as shown below. This groups the rows by Roll and ranks rows according to Name. The Top option returns the rows with the greatest values, and setting Number of Ranks to 1 returns the single row with the highest value.


Output

You will get only one row per roll in the target as shown below.
 

Tuesday, February 15, 2011



How to load session statistics into a Database Table

The solution below will help you load session statistics into a database table, which can be used for audit purposes. In real life, developers generally don't have access to the metadata tables, so this approach offers a way to get the session statistics for auditing.

Solution:

Create a Database table to store the session statistics.
Note: The following syntax is for Sybase DB. Please change according to your DB.
create table Infa_Audit
(
workflow_name varchar(50),
start_time datetime,
end_time datetime,
success_rows numeric,
failed_rows numeric
)
Create two sessions e.g. Session1 and Session2.  

Session1:
The Session1 will be your actual session for which you want to load the statistics.

Session2:
Session2 will be used to load the statistics of Session1 into the database table. For this, create a mapping and define the below mapping variables.


The flow of Mapping should look like this


Here, the source will be a dummy source, and inside the Expression transformation all the mapping variables are assigned to output ports. The audit table will be the target.

The workflow will look as shown below.

Create workflow variables as shown below


Assign the values to workflow variables in Assignment Task as shown below


In the Pre-session variable assignment tab of Session2, assign the workflow variables to the mapping variables as shown below:
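In text form, the two assignment steps could look roughly like this. All variable names are illustrative, Session1 is assumed to be named s_Session1, and the statistics come from the built-in StartTime, EndTime, TgtSuccessRows and TgtFailedRows session variables:

Assignment task (workflow variable = expression):
$$wf_workflow_name = 'wf_load_audit_demo'   -- illustrative literal workflow name
$$wf_start_time    = $s_Session1.StartTime
$$wf_end_time      = $s_Session1.EndTime
$$wf_success_rows  = $s_Session1.TgtSuccessRows
$$wf_failed_rows   = $s_Session1.TgtFailedRows

Pre-session variable assignment on Session2 (mapping variable = workflow variable):
$$m_workflow_name  = $$wf_workflow_name
$$m_start_time     = $$wf_start_time
$$m_end_time       = $$wf_end_time
$$m_success_rows   = $$wf_success_rows
$$m_failed_rows    = $$wf_failed_rows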


Execute the workflow.

Sunday, February 6, 2011

How to get the last 50 records out of millions of records using Informatica


Use a Rank transformation and choose the Bottom option with Number of Ranks set to 50, so that it loads only the last 50 records.