Archive

Posts Tagged ‘TSQL – OPENROWSET’

TSQL – OPENROWSET

July 30, 2011 2 comments

I have got a challenge last week from a business analyst to identify duplicate records in Excel files

Scenario of Issue:

Business analyst team need to provide the business data to load into data warehouse but over time they maintained business data in Excel 2003  and they have more than 500 files.
The spread sheets have a lot of duplicate records and not worth to load into Data Warehouse so we need a solution to find duplicate records in a simple way.
All Excel sheets related to one table and having row number for each record.

Solution Requirements:

1. Solution needs to be simple to understand
2. Must be reusable without any technical support

Solution:

Initially, I am not surprised technically to solve the issue but here interesting is to develop a simple solution with no maintenance.
 
Step 1: To simulate the scenario I have created a sample source data and highlighted the duplicate records only on ProductName so the solution needs to report Duplicate status on ProductName columns

Sample source data

Sample source data

Step 2: Created a table in SQL DB, same as the definition of Excel source.

CREATE TABLE [dbo].[Product](
	[ProductID] [int] NOT NULL,
	[ProductName] [VARCHAR](50) NOT NULL,
	[DeliveryUnits] [int] NOT NULL,
 CONSTRAINT [PK_Product_ProductID_ProductName] PRIMARY KEY CLUSTERED
(
	[ProductID] ASC,
	[ProductName] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

I plan to give a solution using simple SQL Query to identify duplicates so I used SQL OPENROWSET to get data from Excel.

Step 3:

SELECT * FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0'
,'Excel 8.0;Database=F:\TSQL - OPENROWSET\Product1.xls;HDR=YES'
,'SELECT * FROM [Product$]')

Query Executed with Error message as:
Msg 7308, Level 16, State 1, Line 3
OLE DB provider ‘Microsoft.Jet.OLEDB.4.0’ cannot be used for distributed queries because the provider is configured to run in single-threaded apartment mode

Enabling ad hoc property on SQL Server

USE [master]
GO
sp_configure 'show advanced options'
GO

USE [master]
GO
sp_configure 'show advanced options', 1
GO

USE [master]
GO
sp_configure 'ad hoc distributed queries', 1
GO

Message information:
Configuration option ‘Ad Hoc Distributed Queries’ changed from 0 to 1

Step 4: Source Query to get data from source and presenting the duplicate description on duplicate rows

DELETE FROM dbo.Product

INSERT INTO dbo.Product (ProductID, ProductName,DeliveryUnits)
SELECT * FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0'
,'Excel 8.0;Database=F:\TSQL - OPENROWSET\Product1.xls;HDR=YES'
,'SELECT * FROM [Product$]')
;WITH CTE AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY ProductName ORDER BY ProductID ASC)
           AS CheckDuplicate
          ,ProductID, ProductName,DeliveryUnits FROM dbo.Product )

SELECT CASE WHEN (CheckDuplicate > 1) THEN 'DUPLICATE'
            ELSE 'No Duplicate'  END AS CHECK_DUPLICATE
      ,ProductID, ProductName,DeliveryUnits
FROM CTE
ORDER BY ProductID

DELETE FROM dbo.Product
GO
Result set

Result set

 

Points to Consider:

1. There are many ways to solve this issue, using SSIS and Excel 2010 conditional format etc.
2. I used OPENROWSET to give a simple query to business user with no maintenance work except changing the variable source file name.
3. I have used ROW_NUMBER() function so identify the unique Product Name values.
4. If Excel x64bit as Source , the provider ‘Microsoft.Jet.OLEDB.4.0’ may not work.

Advertisements
Categories: TSQL Tags: