Dynamicallly writing those pesky Update Statements

Whenever you are loading a data warehouse, often one of the most time consuming tasks is writing the update statements that you need in order to properly handle changing records coming from your system. Further adding to development time, columns are then split into type 1 or type 2 attributes, and in my current shop, processes for deletes out of the source system as well. After you write a few of these, that loving feeling you got when you started starts to dissipate, and you start to feel bored with the same statements. So with that in mind, I wrote the below procedure that takes a table name and an update_type variable of U, V, or D and then generates the appropriate update, type 2 version, or delete statement you need. The procedure makes 2 assumptions, one in that you have a separate etl_working database to handle set based updates, and the second that your table primary keys contain of column named “version_key” as a part of the naming convention for your tables primary key. If this isn’t your naming convention, feel free to change it.

UPDATE: 01/21/2015: found a bug in the version code, fixed. Sorry about that!

/*
Author: Christopher Schmidt
Date: 2015-01-19
Description: taking the table name and an update type of U, V, or D as inputs, generates an update statement against the target table. This proc makes an assumption that
there is a separate database titled “etl_working”, that consists of a mirror copy of the target table to handle updates in a set based manner. This can be adjusted in the
code below on lines 55,58,65, and 67.

It also assumes that your primary_key of the target table contains the words
“version_key”. if you do not use this naming convention, you can update them in the where clauses of the column selects on lines 34,40,46, and 52.

As always, with any code you find on the internet, test and use in development.

*/

CREATE PROCEDURE [dbo].[update_generator](
@tbl varchar(200)
,@update_type char(1)
)
AS

DECLARE @sqlu varchar(5000)
DECLARE @sqld varchar(5000)
DECLARE @sqlv varchar(5000)
DECLARE @cols varchar(4000)
DECLARE @colsi varchar(4000)
DECLARE @colsv varchar(4000)
DECLARE @keycol varchar(200)
DECLARE @keycold varchar(200)

set @cols = stuff(( select ‘,’+quotename(C.name) + ‘=Src.’ + quotename(c.name)
from sys.columns as C
where (C.object_id = object_id(@tbl)
AND c.name NOT LIKE ‘%version_key’)
for xml path(”)), 1, 1, ”)

SET @colsi = stuff(( select ‘,’+quotename(C.name)
from sys.columns as C
where (C.object_id = object_id(@tbl)
AND c.name NOT LIKE ‘%version_key’)
for xml path(”)), 1, 1, ”)

SET @keycol = stuff(( select ‘,’+quotename(C.name) + ‘=Src.’ + quotename(c.name)
from sys.columns as C
where (C.object_id = object_id(@tbl)
AND c.name LIKE ‘%version_key’)
for xml path(”)), 1, 1, ”)

SET @keycold = stuff(( select ‘,src.’+quotename(C.name) + ‘= dw.’ + quotename(c.name)
from sys.columns as C
where (C.object_id = object_id(@tbl)
AND c.name LIKE ‘%version_key’)
for xml path(”)), 1, 1, ”)

SET @sqlu = N’UPDATE [‘ + @tbl + ‘] SET ‘ + @cols + ‘ FROM change.[‘ + @tbl + ‘] [src] JOIN dbo.[‘ + @tbl + ‘] [dst] ON dst.’ + @keycol + ‘ WHERE Update_Type = ”U”’

SET @sqld = N’UPDATE [‘ + @tbl + ‘] SET dmproc_batch_key=src.dmproc_batch_key, source_revision_action = ”D”, version_end_dtmx = GETDATE(), version_current_ind = 0
FROM change.[‘ + @tbl + ‘] [src] JOIN dbo.[‘ + @tbl + ‘] [dst] ON dst.’ + @keycol + ‘ WHERE Update_Type = ”D”’

SET @sqlv = N’UPDATE ‘ + @tbl + ‘ SET version_current_ind = 0
,version_latest_ind = 0
,dmproc_batch_key=src.dmproc_batch_key
,version_end_dtmx=GETDATE()
,source_revision_end_dtmx = GETDATE()
,source_revision_action = ”V” FROM change.[‘ + @tbl + ‘] [src] JOIN dbo.[‘ + @tbl + ‘] [dst] ON dst.’ + @keycol + ‘ WHERE Update_Type = ”V”
GO
INSERT INTO dbo.[‘ + @tbl + ‘] ( ‘ + @colsi + ‘) SELECT ‘ + @colsi + ‘FROM change.[‘ + @tbl + ‘] WHERE Update_Type = ”V”’

SELECT
CASE
WHEN @update_type = ‘U’ THEN @sqlu
WHEN @update_type = ‘D’ THEN @sqld
WHEN @update_type = ‘V’ THEN @sqlv
END [sql]

 

 

Hope it saves you some time!

Data Warehouse/Mart Loading: An alternate loading approach (HASHBYTES)

When loading dimensions into a warehouse, one of the more time consuming pieces of development is the comparison between the source data, and the destination dimension. Additionally, comparing the source against the destination with all of those OR statements is time consuming on the engine, too. As ETL developers, we can make this process much simpler and easier to develop. By streamlining this process, not only can we make our development time quicker, but it’s less error prone & more efficient too. This “secret sauce” is called the hashbytes function in SQL, and it works by combining all of the data points in our source together, and generating a hash that represents it, making for a simpler comparison. As we generate this same hash against the dimension as well, we can then compare 1 column in the source to 1 column in the destination. To demonstrate, let’s take the product dimension in the Adventure Works data warehouse and explore it in a little more detail:

This dimension consists of 36 columns, coming from a couple of tables in the Adventure Works OLTP database, starting with Production.Product and joining some additional tables in. For simplicity sake, I am just going to use the Production.Product table. Let’s assume that after assessment we decide that the best ETL pattern for our development consists of a design using a lookup to the existing dimension, with a conditional split for changes, and then our destinations:

image

 (side note: in an actual production environment, you probably wouldn’t use the OLE DB command, and would have it replaced with a staging table and update statement)

Looking at the conditional split in a little more detail,

image

 

We can see two things: one, that we had to output every column from our dimension, appending something like _DW to the column for comparison, and two, that condition looks like it was a &!#$ to write! And regrettably, this is a true statement. Enough columns, one error, and you can spend hours trying to get that one statement working just right. Instead of writing that horrendous Changed statement, fortunately we have an alternative, the HASHBYTES function.

The HASHBYTES function was first introduced in SQL Server 2005, and can accept different hashes based on how we want to output it. There are 7 different outputs, and the function accepts two arguments, the first being that hash type you want, and the second being the columns you want to generate the hash on. A simple hash would look like:

HASHBYTES(‘’SHA1’, Product.ProductName) AS [Hash]

In a real world scenario, we need to handle null values and ensure that our data types are consistent along every column. Generating a hash against a null is going to result in… NULL, and different data types will cause different results to render. To start, create a hash on the source data consisting of the data you want to compare. For this demo, let’s just create a hash on the first three columns, but you can expand this out as much as you need. A simple hash on the Product source table would look like this:

CAST(HASHBYTES(‘SHA1’, ISNULL(CAST([Class] AS VARCHAR), ‘UNKNOWN’)
            + ‘|’ + ISNULL(CAST([Listprice] AS VARCHAR), ‘UNKNOWN’)
            + ‘|’ + ISNULL(CAST([ProductLine] AS VARCHAR), ‘UNKNOWN’)
            ) AS BIGINT) AS [SourceHash]

We need to cast the whole hash as a bigint in order for our comparison to work later in SSIS. Now that we have this, we can input it our source select statement. In Visual Studio, open up the source, and we’ll replace our original source with the original columns plus our hash.

image 

Then in the Lookup component, add the same function, but make sure that the column names point to your dimension table (or whatever you’re looking up to). Once you’ve got your statement right, add it as an additional output along with your surrogate key:

image

Remember up above in our conditional split for all of the changed records? Change it so that it is simply a [SourceHash] != [DestinationHash]. Our new conditional split looks like this:

image

And our complete data flow looks exactly the same as before! This is a great way to keep the speed of development up. Note that you could also do this even if you have both type 1 and type 2 attributes in your dimension, you would just generate one hash consisting of only the type 1 attributes and then another column consisting of only the type 2 attributes.

Below is a table of the algorithms and the different outputs:

image 

If you want more info on hashing from a security perspective in SQL Server, Don Pinto has a good article on it here.

Julian Gregorian Date Bridge

Recently I was tasked with pulling dates out of a JDEdwards source system, which stores dates in a Julian style format. This is pretty typical of the mainframe era when they were initially built, and we began discussing how to pull the data out and converting it for consumption in a date table in the data warehouse. As the data was being staged prior to loading into the warehouse, it made sense to convert the dates into actual readable gregorian dates as they were being brought into the staging environment. Initially it was discussed to create a function that you could call for any date that you wanted to convert, but that approach had difficulties in that you would have to call the function for every date that you wanted. A colleague suggested a date bridge table, and I came up with a script that populates it. Below is my script:

CREATE TABLE dbo.JulianGregorianBridge
(JulianDate int NOT NULL,
GregorianDate int NOT NULL
)

DECLARE @StartDate DATE
,@EndDate DATE

SET @StartDate = ‘1/1/1949′
SET @EndDate = ’12/31/2050′

WHILE @StartDate <= @EndDate

BEGIN

INSERT INTO dbo.JulianGregorianBridge(
JulianDate
,GregorianDate)
SELECT
(Year(@StartDate) -1900) * 1000 + datediff(dd, ’12/31/’ + cast(year(@StartDate) -1 as varchar(4)), @StartDate) AS ‘JulianDate’
,YEAR(@StartDate) *10000 + MONTH(@StartDate) *100 + DAY(@StartDate) AS [DateKey]

SET @StartDate = DATEADD(d, 1, @StartDate)
END
GO

Straightforward, simple, and best of all, efficient. The alternative design approach using a function was more challenging as every function call would have caused row by row evaluation of the date to convert it, with this pseudo-date dimension table you can do a lookup between the date you want to convert and the Julian evaluation to return the smart key integer, or go from smart key to Julian with high performance.

A view to a…warehouse

As I sit here in Boston, I can’t help but find myself pondering views. At its core, what is a view, and specifically in regards to business intelligence, why is it such good practice to create views on our data warehouse tables? So as any sane person would, I immediately pulled out my computer and starting looking up best practices and asking the all important question: “Why?”

Immediately, I stumbled across a Microsoft whitepaper: http://msdn.microsoft.com/en-us/library/aa902663%28v=SQL.80%29.aspx

 

And in reading it, 1 particular section stuck out to me more than anything else:

“Using views is good system design for a number of reasons:

  • Using views improves flexibility by providing a buffer between the physical database and downstream systems, like user access and cube population.
  • It increases the likelihood that you can make significant physical changes to the system while minimizing disruption to other applications. “

In particular, one thing I’ve really begun to preach is the first bullet point in particular – “Providing a buffer between the physical database and downstream systems.”

It used to be that you would build your data warehouse, build your SSAS cube on top of it, and build your SSRS reports off of either a) your SSAS cubes, b) your data warehouse, or c) both. However, with the recent advent of Power Pivot, Power View, and the BISM tabular model, times are changing. Before if you ignored this practice, it wasn’t that bad, because more likely than not, it was just a field in a DSV in Analysis Services you had to change. But now, with all of the additions to the family, that is A LOT more stuff you have to update if you ignore it. Nowadays, ignoring and physically tying to the table creates dependencies on an almost infinite amount of systems, which could have you transforming from a BI developer into a DBA to watch all of those tables and make sure they perform. Views are  a tool in the BI developers arsenal to give systems, and business users, insight into what’s going on while still keeping the structure in relative obscurity.

The other item I really like about is the “significant physical changes” piece. I’ve seen time and time again where business try to create something themselves, it doesn’t work, so they call someone, but their are already reports and cubes in production that reference the data warehouse. So it’s not an option to just “blow it away” and start over again. Views give us that functionality to support existing processes while we rebuild the inside.

Adventures in Time: The Date Dimension

In Business Intelligence, I think one of the hardest things for people to get a handle on is our good friend the date dimension. It  should always be the very first dimension you create in a warehouse or mart, and it’s either going to make the rest of your tables and cubes much easier to report off of or much harder. Sorry Microsoft, but the 20 columns or so you give us in Adventure Works just doesn’t really cover all of the different angles, as we all know that even though Sally in accounting likes to see dates by fiscal month and quarter, John in sales like to see dates by calendar week, month and year. When building a date dimension, there are several very distinct things we need to consider when creating it:

  1.  What are the different ways business users will want to look at it? Long month names, short month names, weekdays, weekends, holidays, etc. ?
  2. How can I structure my table so it’s efficient for all of my BI applications to look at it?
  3. Can this actually be easy to maintain if I have to build a humongous date table to answer all of the questions above?

Let’s start with Question 1. To really get an answer for this question, the best person to ask is the person who writes the reports. What do they typically see from the business? Who asks for what?

Question 2 is a bit longer than the answer to Question 1. The answer is yes, we can. By building all of our columns into our table, we can make it extremely easy and efficient for us to report off of them from SQL, Analysis Services, SSRS, and SharePoint. SSRS and SharePoint wind up deriving off of Analysis Services, so that’s a little bit of duplication there. 🙂 But we COULD go back straight to the date dimension if we wanted to.

And question 3 is yes again. The trick is to just build up your shortcuts up front to make Analysis Services, etc. quick when they process, but to do it in the table directly up front and not want and do some funky concatenation in AS.

A friend of mine gave me a date dimension script a while ago that generated the actual table and the initial population of it. I added some extra columns to it, wrapped it in a proc, and put it below. There are quite a few different pieces to it, I’ll outline them below:

The proc at the end that gets created sets the current day flag equal to today’s date. If you put that in a job and execute at midnight every night, your current day flag will update every day.

In the proc Generate Date, it’s set to check and make sure that the default member exists while it’s running. That way if it’s missed when the table is initially built, or is truncated for some reason, it will always be added. I generally use a default date dimension member as 19000101 (01/01/1900), but if you use something else feel free to modify it.

The function came from Tim Cullen over at MSSQLTips (http://www.mssqltips.com/sqlservertip/1537/tsql-function-to-determine-holidays-in-sql-server/) to calculate Easter. Thank you Tim, you have no idea how much time/frustration you saved me.

The Set Holidays proc is wrapped so it’s run as a part of the Generate Date dimension, so as new dates are added the holiday’s are automatically populated.

Thanks all!

 

IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[Common].[DimDate]’) AND type in (N’U’))
DROP TABLE [Common].[DimDate]
GO

IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[Common].[GenerateDate]’) AND type in (N’P’, N’PC’))
DROP PROCEDURE [Common].[GenerateDate]
GO

IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[Common].[SetCurrentDay]’) AND type in (N’P’, N’PC’))
DROP PROCEDURE [Common].[SetCurrentDay]
GO

IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[Common].[SetHolidays]’) AND type in (N’P’, N’PC’))
DROP PROCEDURE [Common].[SetHolidays]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [Common].[DimDate](
[DateKey] [int] NOT NULL,
[FullDate] [datetime] NOT NULL,
[DisplayDate] [char](10) NOT NULL,
[DayShortName] [char](3) NULL,
[DayName] [varchar](9) NULL,
[CalendarYear] [smallint] NOT NULL,
[CalendarQuarterKey] [int] NOT NULL,
[CalendarQuarterNumber] [int] NOT NULL,
[CalendarQuarterShortName] [varchar](10) NULL,
[CalendarQuarterName] [char](9) NULL,
[CalendarQuarterYearName] [char](7) NULL,
[CalendarMonthKey] [int] NOT NULL,
[CalendarMonthNumber] [tinyint] NULL,
[CalendarMonthShortName] [char](3) NULL,
[CalendarMonthName] [varchar](9) NULL,
[CalendarMonthYearName] [char](8) NULL,
[CalendarWeekKey] [int] NOT NULL,
[CalendarWeekNumber] [char](2) NULL,
[CalendarWeekShortName] [char](3) NULL,
[CalendarWeekName] [varchar](10) NULL,
[CalendarWeekInMonth] [tinyint] NULL,
[CalendarWeekYearName] [varchar](12) NULL,
[CalendarDayOfYear] [smallint] NOT NULL,
[CalendarDayOfQuarter] [smallint] NOT NULL,
[CalendarDayOfMonth] [tinyint] NOT NULL,
[CalendarDayOfWeek] [tinyint] NOT NULL,
[IsWeekDay] [char](1) NULL,
[IsCurrentDay] [char](1) NULL,
[IsHoliday] [char](1) NULL,
[HolidayName] [varchar](25) NULL,
PRIMARY KEY CLUSTERED
(
[DateKey] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO

CREATE FUNCTION [dbo].[fn_EasterSundayByYear]
(@Year char(4))
RETURNS smalldatetime
AS
BEGIN
http://aa.usno.navy.mil/faq/docs/easter.php
http://www.mssqltips.com/sqlservertip/1537/tsql-function-to-determine-holidays-in-sql-server/
declare
@c int
, @n int
, @k int
, @i int
, @j int
, @l int
, @m int
, @d int
, @Easter datetime

set @c = (@Year / 100)
set @n = @Year – 19 * (@Year / 19)
set @k = (@c – 17) / 25
set @i = @c – @c / 4 – ( @c – @k) / 3 + 19 * @n + 15
set @i = @i – 30 * ( @i / 30 )
set @i = @i – (@i / 28) * (1 – (@i / 28) * (29 / (@i + 1)) * ((21 – @n) / 11))
set @j = @Year + @Year / 4 + @i + 2 – @c + @c / 4
set @j = @j – 7 * (@j / 7)
set @l = @i – @j
set @m = 3 + (@l + 40) / 44
set @d = @l + 28 – 31 * ( @m / 4 )

set @Easter = (select right(‘0’ + convert(varchar(2),@m),2) + ‘/’ + right(‘0’ + convert(varchar(2),@d),2) + ‘/’ + convert(char(4),@Year))

return @Easter
END
GO

CREATE PROCEDURE Common.SetHolidays
AS
UPDATE Common.DimDate
SET IsHoliday = ‘Y’,
HolidayName = ‘Thanksgiving’
WHERE CalendarMonthNumber = 11
AND CalendarWeekInMonth = 4
AND CalendarDayOfWeek = 5
UPDATE Common.DimDate
SET IsHoliday = ‘Y’,
HolidayName = ‘Independence Day’
WHERE CalendarMonthNumber = 7
AND CalendarDayOfMonth = 4
UPDATE Common.DimDate
SET IsHoliday = ‘Y’,
HolidayName = ‘Christmas Day’
WHERE CalendarMonthNumber = 12
AND CalendarDayOfMonth = 25
UPDATE Common.DimDate
SET IsHoliday = ‘Y’,
HolidayName = ‘Christmas Eve’
WHERE CalendarMonthNumber = 12
AND CalendarDayOfMonth = 24
UPDATE Common.DimDate
SET IsHoliday = ‘Y’,
HolidayName = ‘New Year”s Day’
WHERE CalendarMonthNumber = 1
AND CalendarDayOfMonth = 1
UPDATE Common.DimDate
SET IsHoliday = ‘Y’,
HolidayName = ‘Valentine”s Day’
WHERE CalendarMonthNumber = 2
AND CalendarDayOfMonth = 14
UPDATE Common.DimDate
SET IsHoliday = ‘Y’,
HolidayName = ‘St. Patrick”s Day’
WHERE CalendarMonthNumber = 3
AND CalendarDayOfMonth = 17
UPDATE Common.DimDate
SET IsHoliday = ‘Y’,
HolidayName = ‘Memorial Day’
WHERE DateKey In (SELECT MAX(DateKey)
FROM Common.DimDate
WHERE CalendarMonthNumber = 5
AND CalendarDayOfWeek = 2)
UPDATE Common.DimDate
SET IsHoliday = ‘Y’,
HolidayName = ‘Labor Day’
WHERE DateKey In (SELECT MIN(DateKey)
FROM Common.DimDate
WHERE CalendarMonthNumber = 9
AND CalendarDayOfWeek = 2)
UPDATE Common.DimDate
SET IsHoliday = ‘Y’,
HolidayName = ‘Easter’
WHERE DateKey In (SELECT DISTINCT
cast(convert(char(8), dbo.fn_EasterSundayByYear(CalendarYear), 112) as int)
FROM Common.DimDate)
UPDATE Common.DimDate
SET IsHoliday = ‘N’
WHERE IsHoliday IS NULL

/****** Object:  StoredProcedure [dimensions].[GenerateDate]    Script Date: 07/18/2012 08:51:20 ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE PROCEDURE [Common].[GenerateDate]
@StartDate DATE
,@EndDate DATE
AS
–SET @StartDate = ‘1/1/2004′
–SET @EndDate = ’12/31/2013′
SET NOCOUNT ON;

IF NOT EXISTS(SELECT DateKey FROM [Common].[DimDate] WHERE DateKey = 19000101)
BEGIN
INSERT INTO [Common].[DimDate]
([DateKey]
,[FullDate]
,[DisplayDate]
,[DayShortName]
,[DayName]
,[CalendarYear]
,[CalendarQuarterKey]
,[CalendarQuarterNumber]
,[CalendarQuarterShortName]
,[CalendarQuarterName]
,[CalendarQuarterYearName]
,[CalendarMonthKey]
,[CalendarMonthNumber]
,[CalendarMonthShortName]
,[CalendarMonthName]
,[CalendarMonthYearName]
,[CalendarWeekKey]
,[CalendarWeekNumber]
,[CalendarWeekShortName]
,[CalendarWeekName]
,[CalendarWeekInMonth]
,[CalendarWeekYearName]
,[CalendarDayOfYear]
,[CalendarDayOfQuarter]
,[CalendarDayOfMonth]
,[CalendarDayOfWeek]
,[IsWeekDay]
,[IsCurrentDay]
,[IsHoliday]
,[HolidayName])
VALUES(
19000101 –[DateKey]
,’1900-01-01′–[FullDate]
,’01/01/1900′–[DisplayDate]
,’Mon’–[DayShortName]
,’Monday’–[DayName]
,1900–[CalendarYear]
,190001–[CalendarQuarterKey]
,1–[CalendarQuarterNumber]
,’Q1′–[CalendarQuarterShortName]
,’Quarter 1′–[CalendarQuarterName]
,’Q1 1900′–[CalendarQuarterYearName]
,190001–[CalendarMonthKey]
,1–[CalendarMonthNumber]
,’Jan’–[CalendarMonthShortName]
,’January’–[CalendarMonthName]
,’Jan 1900′–[CalendarMonthYearName]
,190001–[CalendarWeekKey]
,’01’–[CalendarWeekNumber]
,’W1′–[CalendarWeekShortName]
,’Week 1′–[CalendarWeekName]
, 1 –[CalendarWeekInMonth]
,’W1 1900′–[CalendarWeekYearName]
,1–[CalendarDayOfYear]
,1–[CalendarDayOfQuarter]
,1–[CalendarDayOfMonth]
,2–[CalendarDayOfWeek]
,’Y’–[IsWeekDay]
,’N’–[IsCurrentDay]
,’Y’–[IsHoliday]
,’New Year”s Day’) –[HolidayName]
END
WHILE @StartDate <= @EndDate

BEGIN
IF NOT EXISTS(SELECT DateKey FROM [Common].[DimDate] WHERE DateKey=CONVERT(VARCHAR,@StartDate,112))

BEGIN

DECLARE @Year CHAR(4)
DECLARE @Quarter CHAR(2)
DECLARE @Month CHAR(2)
DECLARE @Week CHAR(2)
DECLARE @Day CHAR(2)

SET @Year = CAST(DATEPART(YEAR, @StartDate) AS VARCHAR)
SET @Quarter = RIGHT(‘0’ + CAST(DATEPART(QUARTER, @StartDate) AS VARCHAR), 2)
SET @Month = RIGHT(‘0’ + CAST(DATEPART(MONTH, @StartDate) AS VARCHAR), 2)
SET @Week = RIGHT(‘0’ + CAST(DATEPART(WEEK, @StartDate) AS VARCHAR), 2)
SET @Day = RIGHT(‘0’ + CAST(DATEPART(DAY, @StartDate) AS VARCHAR), 2)

INSERT INTO [Common].[DimDate](
DateKey
,FullDate
,DisplayDate
,DayShortName
,DayName
,CalendarYear
,CalendarQuarterKey
,CalendarQuarterNumber
,CalendarQuarterShortName
,CalendarQuarterName
,CalendarQuarterYearName
,CalendarMonthKey
,CalendarMonthNumber
,CalendarMonthShortName
,CalendarMonthName
,CalendarMonthYearName
,CalendarWeekKey
,CalendarWeekNumber
,CalendarWeekShortName
,CalendarWeekName
,CalendarWeekInMonth
,CalendarWeekYearName
,CalendarDayOfYear
,CalendarDayOfQuarter
,CalendarDayOfMonth
,CalendarDayOfWeek
,IsWeekDay
,IsCurrentDay
)
VALUES(
@Year + @Month + @Day–DateKey
,@StartDate –FullDate
,@Month + ‘/’ + @Day + ‘/’ + @Year –DisplayDate
,LEFT(DATENAME(WEEKDAY, @StartDate), 3) –DayShortName
,DATENAME(WEEKDAY, @StartDate) –DayName
,@Year –CalendarYear
,@Year + @Quarter –CalendarQuarterKey
,@Quarter –CalendarQuarterNumber
,’Q’ + CAST(CAST(@Quarter AS TINYINT) AS VARCHAR) –CalendarQuarterShortName
,’Quarter ‘ + CAST(CAST(@Quarter AS TINYINT) AS VARCHAR) –CalendarQuarterName
,’Q’ + CAST(CAST(@Quarter AS TINYINT) AS VARCHAR) + ‘ ‘ + @Year –CalendarQuarterYearName
,@Year + @Month –CalendarMonthKey
,@Month –CalendarMonthNumber
,LEFT(DATENAME(MONTH, @StartDate), 3) –CalendarMonthShortName
,DATENAME(MONTH, @StartDate) –CalendarMonthName
,LEFT(DATENAME(MONTH, @StartDate), 3) + ‘ ‘ + @Year –CalendarMonthYearName
,@Year + @Week –CalendarWeekKey
,@Week –CalendarWeekNumber
,’W’ + CAST(CAST(@Week AS TINYINT) AS VARCHAR) –CalendarWeekShortName
,’Week ‘ + CAST(CAST(@Week AS TINYINT) AS VARCHAR) –CalendarWeekName
,DATEDIFF(WEEK, DATEADD(MONTH, DATEDIFF(MONTH, 0, @StartDate), 0), @StartDate) + 1 –CalendarWeekInMonth
,’W’ + CAST(CAST(@Week AS TINYINT) AS VARCHAR) + ‘ ‘ + @Year –CalendarWeekYearName
,DATEPART(DAYOFYEAR, @StartDate) –CalendarDayOfYear
,DATEDIFF(DAY, DATEADD(QUARTER, DATEDIFF(QUARTER, 0, @StartDate), 0), @StartDate) + 1 –CalendarDayOfQuarter
,DATEPART(DAY, @StartDate) –CalendarDayOfMonth
,DATEPART(WEEKDAY, @StartDate) –CalendarDayOfWeek
,CASE WHEN DATEPART(WEEKDAY, @StartDate) IN (1,7) THEN ‘N’ ELSE ‘Y’ END –WeekDay
,’N’ –CurrentDay
)
END

SET @StartDate = DATEADD(d, 1, @StartDate)

EXEC Common.SetHolidays

END

GO

CREATE procedure [Common].[SetCurrentDay]
AS

UPDATE Common.DimDate
SET IsCurrentDay = ‘N’
WHERE IsCurrentDay = ‘Y’

UPDATE Common.DimDate
SET IsCurrentDay = ‘Y’
WHERE DateKey = ((Year(GETDATE())*10000)+(Month(GETDATE())*100)+(Day(GETDATE())))

GO

Business Intelligence Interview Questions

Hello all,

So in major news for me I’ve recently accepted a senior consulting position at Pragmatic Works. I’m really excited, and can’t wait to get started as I start this new journey. As a part of this transition, my current employer has requested that I come up with a list of interview questions for them to ask my successor. I would like to take a moment to post these questions here, as I believe they cover a wide width and breadth of knowledge over different levels of SSIS and SSAS. Additionally, as I believe in giving credit where credit is due, some of these questions I pulled from other blogs, and some I pulled the answer straight out of BOL. The sources for these are all referenced at the bottom of this post.

For ease of reading, I’m going to break these up into sections:

SSIS

Beginner

1) In SSIS, what is the difference between the Control Flow and the Data Flow?

a. The control flow is used to control the flow of components in a set logical order and the data flow is used to control what the data is used

2) What is a buffer?

a. A buffer is simply storing the data in memory. As SSIS runs, it stores the data in memory (buffers) and performs the requested transformations as data moves through the data flow pipeline. This is also what makes SSIS so fast at what it does. By alleviating the need to store the data physically, SSIS can take huge amounts of data and move it through very quickly, as it’s all been cached

3) What’s a precedence constraint?

a. SSIS allows you to place as many as tasks you want to be placed in control flow. You can connect all these tasks using connectors called Precedence Constraints. Precedence Constraints allow you to define the logical sequence of tasks in the order they should be executed. You can also specify a condition to be evaluated before the next task in the flow is executed.

Intermediate

1) What are the 3 different cache modes in SSIS?

a. Full

b. Partial

c. None

2) What are variables and what is variable scope?

a. A variable is used to store values. There are basically two types of variables, System Variable (like ErrorCode, ErrorDescription, PackageName etc) whose values you can use but cannot change and User Variable which you create, assign values and read as needed. A variable can hold a value of the data type you have chosen when you defined the variable.

b. Variables can have a different scope depending on where it was defined. For example you can have package level variables which are accessible to all the tasks in the package and there could also be container level variables which are accessible only to those tasks that are within the container.

Advanced

1) What are the 3 different cache modes in SSIS?

a. Full

b. Partial

c. None

2) What are the different SSIS log providers?

a. There are several places where you can log execution data generated by an SSIS event log:

i. SSIS log provider for Text files

ii. SSIS log provider for Windows Event Log

iii. SSIS log provider for XML files

iv. SSIS log provider for SQL Profiler

v. SSIS log provider for SQL Server, which writes the data to the msdb..sysdtslog90 or msdb..sysssislog table depending on the SQL Server version.

3) How many different configuration properties are there? What are they? What is the Microsoft recommended best practice for configuration in SSIS 2008?

a. There are 5 different ways to configure an SSIS package. They are:

i. XML Configuration File

ii. Environment Variable

iii. Registry Entry

iv. Parent Package

v. SQL Server

b. The Microsoft recommended best practice is to use a parent environment variable and point to a SQL server table for the remainder of the connection strings. This is the most dynamic/flexible solution that you can use in BIDS (SSIS) 2008.

SSAS

Beginner

1) In SSAS, what is the Data Source View?

a. A data source view (DSV) is an abstraction of a relational data source that becomes the basis of the cubes and dimensions you create in a multidimensional project

b. Analysis Services 2008 stores this data in a Unified Dimensional Model (UDM)

i. At a minimum, when the UDM is constructed merely as a thin layer over the data source, the advantages to the end user are a simpler, more readily understood model of the data, isolation from heterogeneous backend data

sources, and improved performance for summary type queries. In some scenarios a simple UDM like this is constructed totally automatically. With greater investment in the construction of the UDM, additional benefits accrue from the richness of metadata that the model can provide.

2) What is a friendly name?

a. A friendly name is an alias for a column/table in the Analysis Services Data Source View (UDM)

3) In a cube, what are the different kinds of actions that are available?

a. Action, Reporting Action, Drillthrough Action

i. A reporting action can be used to link to an SSRS report

ii. A drillthrough action can be used to define the drillthrough that a user sees when viewing the data

iii. A regular action can be used in 5 different ways.

1. A dataset returns a Multidimensional Expressions (MDX) statement, representing a multidimensional dataset, to be run and displayed by the client application.

2. A proprietary action returns a proprietary string that can be interpreted by client applications associated with the Application setting for this action.

3. A Rowset action returns a Multidimensional Expressions (MDX) statement, representing a tabular rowset, to be run and displayed by the client application.

4. A statement action returns a command string to be run by the client application.

5. A URL action returns a Uniform Resource Location (URL) string to be opened by the client application, typically with an Internet browser.

Intermediate

1) What are attribute relationships?

a. Attributes within a dimension are always related either directly or indirectly to the key attribute. When you define a dimension based on a star schema, which is where all dimension attributes are derived from the same relational table, an attribute relationship is automatically defined between the key attribute and each non-key attribute of the dimension.

2) What is the difference between an aggregation and a partition?

a. A partition limits the cube space and a an aggregation improves performance

Advanced

1) In MDX, what is the ParallelPeriod function and what does it do?

a. The ParallelPeriod functions returns a member from a prior period in the same relative position as a specified member. (For example, June 2011 in the month of June 2012)

2) What is the Microsoft recommended partition size?

a. 20 Million members, or 250 MB (no, that’s not a typo J) There are some exceptions:

i. If you’re building a 1TB cube, the first rule takes precedence and you should exceed the 250 MB suggestion to keep the partition count down. In this case, sizing partitions between 500MB to 1 GB is reasonable.

ii. If records come in at the end of each day and history does not change you can get the quickest processing with daily partitions for recent data and monthly or annual partitions for historical data.

iii. If you have to reprocess a good portion of your cube on some regular basis, the size of partitions can be driven by how many can be processed in parallel keeping CPU and/or I/O usage as high as possible.

3) Is it possible to expose an analysis services cube through an application? How?

a. Part 1: Yes

b. Part 2: Use IIS and have Analysis Services use an HTTP pump to show the data out OR you can write custom .NET assemblies.

T-SQL

1) What is a Common Table Expression? How is it different from a subquery?

a. In general, a CTE can be used recursively, a sub-query cannot.

2) What is De-normalization?

a. De-normalization is the process of attempting to optimize the performance of a database by adding redundant data. It is sometimes necessary because current DBMSs implement the relational model poorly. A true relational DBMS would allow for a fully normalized database at the logical level, while providing physical storage of data that is tuned for high performance. De-normalization is a technique to move from higher to lower normal forms of database modeling in order to speed up database access.

Sources:

http://www.mssqltips.com/sqlservertip/2485/ssis-interview-questions–part-1/

http://msdn.microsoft.com/en-us/library/ms345143%28v=SQL.90%29.aspx

http://blog.sqlauthority.com/2007/04/20/sql-server-interview-questions-part-6/