Data Flow
Performance:
I do
consider tuning at ETL( Extract, Transform and Load) sections.
Extraction
Level:-
Network
Tuning:
1. Change the network packet size in the
connection manager
·
Higher
values typically yield fast through put
·
By
Default, it will be ZERO, you can change MAX value as: 32767
2. Try to experiment with shared memory
vs TCP/IP
3. Enable JUMBO frames on network. To do
this please consult your network specialists
4. 4. If you have OLEDB Command
statements, then try to create another connection manager with
low package size.( Not go with 32767 max value )
SQL
Queries:
1. Consider using NOLOCK hint on table
in your source query.
·
Removes
locking overhead
·
Improves
the speed of large tables scan
·
Risky side
effects
·
Understand
before using hints in the source query
2. SELECT query with only selected
columns
3. Do require conversions in the source
query, don't go with Data Conversion T/F later.
LookUp:-
1. Changes SELECT statement to only use
the columns you need. So that, it optimizes memory usage.
2. Consider adding NOLOCK hint.
3. In SSIS 2008, use shared lookup
cache. Create cache transform and cache connection manager.
4. Use SQL query for reference table.
5. Use WHERE condition in SQL query in
LookUp T/F.
6. When your input row count is large
then go for using partial cache or full cache lookup to improve performance.
7. No cache lookup is a row based
operation.
8. A partial cache builds the cache as
the lookup T/F is executing and also comes with high transaction impact. A
partial cache approach is viable solution if you have large number of input
rows.
9. However, best option is use
"Full Cache" of lookup and filter the reference table.
10. Apply an 80/20 rule and load one
lookup T/F with 20 percent of the most
common matching records, which will allow 80 percent matches in
the full cache.
Transform
Level:-
Different
transformations:
1. Row based(synchronous)
·
Logically
works row by row
·
Data
conversion,Derived column examples
·
Buffer
reused
2. Partial blocking(Asynchronous)
·
Works with
groups of rows
·
Merge,Merge
Join,Union All,LookUp examples
·
Data
copied to new buffers
3. Blocking( Asynchronous)
·
Need all
input rows before producing any output rows
·
Aggregate,Sort,Pivot&UnPivot
examples
·
Data
copied to new buffers
Data
Types:
1. Make data types as narrow as possible
so you will allocate less memory for your transformation.
2. Do not perform excessive casting of
data types
· It will
degrade performance
· Cast
source types at the database using CAST/CONVERT functions where ever possible
Push Up or
Down where ever possible:
1. Data flow transformations can often
be optimized
· Sort: Push
to source queries when possible, instead of using SORT transformation for
sorting cross database joins.
· Use
IsSorted and SortKeyPosition option to sort data fro OLEDB source data instead
of going with SORT transformation.
· Go with
MERGE sql statement instead of SCD and LookUp T/F's for SCD's data handling.
·
Use GROUP
BY in sql source query instead of Aggregation T/F.
·
Use INSERT
INTO statement instead of a data flow task on a single instance
·
Find about
DELTA load VS RELOAD data
Loading
Level:-
1. Use SQL Server destination
·
only when
package and server are on the same server
·
Error
handling weaker than OLEDB Destination
2. Commit size = 0, then fast
3. Drop some indexes based on load
growth %
4. Load data in to partitons tables
5. Truncate table instead of DELETE
statement.
Other options to consider:
1. BLOB Temp Storage Path: Binary Large Object.
2. Buffer Temp Storage Path: change the Buffer Temp Storage Path and BLOB
Temp Storage Path to drive locations that have been optimized and ensure they
are not using the C:/ system drive.
3. Default Buffer Max Rows - 10,000
means no single buffer can have more than 10,000 rows.
4. Default Buffer Size - 10,48,5760
Specify number of bytes that a single buffer cannot exceed.
5. If single row is 1200 bytes the max
rows (10,000) times row width (1200)
10,000 * 1200 ==> 12000000
which is greater than default buffer size
6. Engine Threads - 20
7.
Run in
Optimized Mode - True - Ignores unused source columns, destination columns and
any T/F's.
No comments:
Post a Comment