Labels

Apache Hadoop (3) ASP.NET (2) AWS S3 (2) Batch Script (3) BigQuery (21) BlobStorage (1) C# (3) Cloudera (1) Command (2) Data Model (3) Data Science (1) Django (1) Docker (1) ETL (7) Google Cloud (5) GPG (2) Hadoop (2) Hive (3) Luigi (1) MDX (21) Mongo (3) MYSQL (3) Pandas (1) Pentaho Data Integration (5) PentahoAdmin (13) Polybase (1) Postgres (1) PPS 2007 (2) Python (13) R Program (1) Redshift (3) SQL 2016 (2) SQL Error Fix (18) SQL Performance (1) SQL2012 (7) SQOOP (1) SSAS (20) SSH (1) SSIS (42) SSRS (17) T-SQL (75) Talend (3) Vagrant (1) Virtual Machine (2) WinSCP (1)

Monday, November 14, 2011

Use of Database lookup in Pentaho Kettle

Database Lookup is used to map the columns of source and destination and get the surrogate key of lookup tables.

For example,

  1. My source table (dbo.Employee) contains employee details including DepartmentID
  2. My destination table (DimEmployee) including a column DepartmentKey.
  3. DepartmentKey is the SurrogateKey in the table dbo.DimDepartment with the following columns: (DepartmentKey, DepartmentID, DepartmentName,...).
  4. Now if you want to load the department key into Dimemployee table, we need to map the DepartmentID of the source with DepartmentID of the DimDepartment table (lookup table) to get the DepartmentKey as output.

This can be achieved in Pentaho, create a transformation as shown below:




















Wednesday, November 9, 2011

Incremental Data load in Pentaho using Insert/Update control

Incremental data loading can be achieved in pentaho using Inser/Update controls. The following steps helps one to understand the control:

1. Create a new transformation, containing source and target data connections.
2. Add a "Table Input" configure the connection properties and input columns. In below example two columns EmployeeID and EmployeeName are selected from Employee table.
3. Add a Insert/Update control join by hop.
4. Configure the insert update control as shown below (Steps 1, 2, & 3).
5. My target table is DimEmployee, and look up column in EmployeeID (Steps 4 &5).
6. Click on Get Update fields (Step 5). Map Input and Output columns.
7. Set update Y or N for columns (Y for allow updates, N for fixed attribute).
8. Click OK.


Tuesday, November 8, 2011

Effective Merge Statements in SQL

-- Step 1: Create and Populate Customers table.

-- Note: Population uses row constructors, which are new in SQL Server 2008

--

CREATE TABLE dbo.Customers (

customer_id int IDENTITY PRIMARY KEY,

customer_name varchar(30),

policy_type int CHECK(policy_type > 0 AND policy_type < 4),

deductable money,

number_of_claims int

)

 

INSERT Customers VALUES

('Brian Smith', 1, 1000, 0),

('David Johnson', 1, 5000, 0),

('Lisa Miller', 2, 2000, 1),

('Nancy Anderson', 2, 1000, 1)

 

--

-- Step 2: Create and Populate QuarterlyUpdate table.

-- Note: Population uses row constructors, which are new in SQL Server 2008

-- Note: Not every Customer must appear in the QuarterlyUpdate table.

--



CREATE TABLE QuarterlyUpdate (

customer_id int,

customer_name varchar(30),

policy_type int CHECK(policy_type > 0 AND policy_type < 4),

deductable money,

new_claims int

)



INSERT QuarterlyUpdate VALUES

(1, 'Brian Burke', 1, 1000, 0),

(2, 'David Johnson', 1, 5000, 2),

(5, 'Michelle Fredette', 2, 2000, 0), -- New Customer

(6, 'Phil Gibbins', 2, 2000, 0) -- New Customer

 

SELECT * FROM Customers

SELECT * FROM QuarterlyUpdate

--

-- Step 3: MERGE

-- Note: No action is taken when row appears in Customer but not QuarterlyUpdate

-- Note: Uses increment operator, new in SQL Server 2008 T-SQL

--

MERGE Customers C

USING QuarterlyUpdate Q

ON C.customer_id = Q.customer_id

WHEN MATCHED THEN

UPDATE SET C.customer_name = Q.customer_name,

C.policy_type = Q.policy_type,

C.deductable = Q.deductable,

C.number_of_claims += new_claims

WHEN NOT MATCHED THEN

INSERT VALUES(customer_name, policy_type, deductable, new_claims)

OUTPUT $action, INSERTED.customer_id,

INSERTED.customer_name AS [New Name],

INSERTED.policy_type AS [New Policy],

INSERTED.deductable AS [New Deductable],

INSERTED.number_of_claims AS [New Number Of Claims],

DELETED.customer_id AS [Original Customer Id],

DELETED.customer_name AS [Original Name],

DELETED.policy_type AS [Original Policy],

DELETED.deductable AS [Original Deductable],

DELETED.number_of_claims AS [Original Number Of Claims];

DELETE QuarterlyUpdate;

--

-- Step 5: Next quarter, a new business rule is introduced.

-- A row for each current customer is required in QuarterlyUpdate

-- If no row is present, customer will be deleted

--

INSERT QuarterlyUpdate VALUES

(1, 'Brian Burke', 1, 1000, 0),

(2, 'David Johnson', 1, 5000, 2),

(3, 'Lisa Miller', 2, 2000, 1),

(5, 'Michelle Fredette', 2, 2000, 1),

(6, 'Phil Gibbins', 2, 2000, 0),

(7, 'Sam Raymond', 3, 10000, 0)

 

--

-- Step 3: MERGE

-- Note: When row appears in Customer but not QuerterlyUpdate, Customer is deleted

-- Note: Uses increment operator, new in SQL Server 2008 T-SQL

-- Note: Uses WHEN NOT MATCHED BY SOURCE - keyword SOURCE is required

-- and Uses WHEN NOT MATCHED BY TARGET - keyword TARGET is optional

--

MERGE Customers C

USING QuarterlyUpdate Q

ON C.customer_id = Q.customer_id

WHEN MATCHED THEN

UPDATE SET C.customer_name = Q.customer_name,

C.policy_type = Q.policy_type,

C.deductable = Q.deductable,

C.number_of_claims += new_claims

WHEN NOT MATCHED BY SOURCE THEN

DELETE

WHEN NOT MATCHED BY TARGET THEN

INSERT VALUES(customer_name, policy_type, deductable, new_claims)

OUTPUT $action, INSERTED.customer_id,

INSERTED.customer_name AS [New Name],

INSERTED.policy_type AS [New Policy],

INSERTED.deductable AS [New Deductable],

INSERTED.number_of_claims AS [New Number Of Claims],

DELETED.customer_id AS [Original Customer Id],

DELETED.customer_name AS [Original Name],

DELETED.policy_type AS [Original Policy],

DELETED.deductable AS [Original Deductable],

DELETED.number_of_claims AS [Original Number Of Claims];

SELECT * FROM Customers

SELECT * FROM QuarterlyUpdate