Refreshing Data in Test From Prod

Before my Pop and I purchased our current boat last year, the only boat I had access to was a 12-foot fiberglass boat with a trolling motor and oars. Fiberglass boats are rather heavy, and the trolling motor would only push you around for a little while before it became necessary to row with the oars. I’d spend hours rowing around our 750-acre lake trying to get to the good spots so I could catch fish. Often the wind would be blowing and I’d spend far more time rowing than fishing. It was a lot of work, but I did it happily because I love the sport.

As I gained experience, I realized it was wise to save the trolling motor for when I really needed it. The battery was a finite supply of power, and once I used it all up I was stuck rowing for the rest of the day.

I have an interesting situation at my current workplace where we have test data that needs to be refreshed periodically, either in our test environment or in our production environment. Refreshing test data often means taking a backup of a production database and restoring it over a test version, but in this case I actually need to delete the old test data from every table and insert new test data based on production data. Each table has a key field with an identifier defining which set of data it belongs to – Production, Test in Production, or Test Environment. My process needs to take the Production data and replace the Test in Production or Test Environment data with it. The real challenge is that the massive number of deletes and inserts causes the transaction log to overflow.
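
To make that concrete, picture a hypothetical order header table cwi.OEORDH whose identifier column OHCOMP follows the two-characters-plus-COMP naming convention the script below looks for. Production rows and Test rows live side by side in the same table, distinguished only by that value:

SELECT OHCOMP, OHORDER, OHCUST
FROM cwi.OEORDH
WHERE OHCOMP IN ('820', '125')  --'820' = Production rows, '125' = Test rows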

When I arrived at the workplace there was an existing process, created by the software provider, that was able to complete the job, but I observed it was not doing so very efficiently. It had a cursor that looped through each table, built four dynamic statements for each, and then executed those statements. The first statement deleted the Test data from the table. The second statement captured the Production data in a temporary staging table. The third statement updated the temporary staging table so the identifier indicated it was Test data. The fourth statement inserted the Prod data from the temporary staging table as Test data. The cursor would then move on to the next table. The process could run on its own, but it had to be babysat so the transaction log would not blow up.
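
Here is roughly what that per-table logic boiled down to, written out against the same hypothetical table (the real process built these four statements dynamically inside the cursor):

--1) Delete the existing Test data
DELETE FROM cwi.OEORDH WHERE OHCOMP = '125';

--2) Capture the Production data in a temporary staging table
SELECT OHCOMP, OHORDER, OHCUST INTO #Staging FROM cwi.OEORDH WHERE OHCOMP = '820';

--3) Update the staging table so the identifier indicates Test data
UPDATE #Staging SET OHCOMP = '125';

--4) Insert the staged rows back as Test data
INSERT INTO cwi.OEORDH (OHCOMP, OHORDER, OHCUST) SELECT OHCOMP, OHORDER, OHCUST FROM #Staging;

DROP TABLE #Staging;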

The first thing I noticed was that each table incurred an extra set of reads and writes, plus an update (which is itself a delete and a write). There is no reason to put the Prod data into a temp table, update that table, and then read from it to put the data back into the original table. I rewrote the process so it would grab the Prod data and insert it into the table as Test data, substituting the Test identifier, in a single step. This cut out the temporary staging table and the extra update.
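
Using the same hypothetical table, the rewrite keeps the initial delete but replaces the staging steps with a single INSERT ... SELECT that stamps the Test identifier on the fly:

INSERT INTO cwi.OEORDH (OHCOMP, OHORDER, OHCUST)
SELECT '125' AS OHCOMP, OHORDER, OHCUST
FROM cwi.OEORDH
WHERE OHCOMP = '820';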

I’m also not a fan of cursors. SQL is a set-based language, and while looping through the tables is a legitimate reason for using a cursor, I decided to change the process so it would print out the list of DELETE and INSERT statements. I would then take the output, paste it into a new SSMS window, and run it. Doing this gave me greater control and flexibility over which statements are run when, and it also made it easier to debug problems.

Since arriving at this workplace we have moved the Test environment onto its own server. That server has a rather small drive for transaction logs, which is usually fine since we don’t run much on the Test environment, but it was not sufficient for this process of reloading test data. I got to a specific table that was so large I couldn’t delete all the data in one statement; I needed to do it in chunks, as sketched below. I was able to tell which table was causing problems right away because I was running the delete statements in blocks of 50 tables at a time. If I had been using the original cursor method, I would have had a much harder time determining which table was the culprit.
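
For that oversized table, the fix was to delete in batches instead of one huge statement. A minimal sketch of the chunking, again against the hypothetical table and with a batch size chosen arbitrarily:

WHILE 1 = 1
BEGIN
   DELETE TOP (100000)
   FROM cwi.OEORDH
   WHERE OHCOMP = '125';

   IF @@ROWCOUNT = 0
      BREAK;  --nothing left to delete
END

Each batch commits on its own, so if the Test database is in SIMPLE recovery the log space from committed batches can be reused instead of one giant transaction filling the drive (under FULL recovery you would also need log backups between batches).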

One other solution I tried was to change the DELETE statements so they only deleted rows that were different, and the INSERT statements so they only inserted rows that were missing. I figured this would significantly reduce the number of writes to the transaction log. However, it slowed the overall processing down far too much to be effective.
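
Roughly, that attempt looked something like this (same made-up table, matching on the order number as the key; the real predicates were more involved, since rows that were different had to be caught as well as rows that were missing):

--Only delete Test rows that no longer have a matching Prod row
DELETE t
FROM cwi.OEORDH t
WHERE t.OHCOMP = '125'
  AND NOT EXISTS (SELECT 1 FROM cwi.OEORDH p WHERE p.OHCOMP = '820' AND p.OHORDER = t.OHORDER);

--Only insert Prod rows that are missing from Test
INSERT INTO cwi.OEORDH (OHCOMP, OHORDER, OHCUST)
SELECT '125', p.OHORDER, p.OHCUST
FROM cwi.OEORDH p
WHERE p.OHCOMP = '820'
  AND NOT EXISTS (SELECT 1 FROM cwi.OEORDH t WHERE t.OHCOMP = '125' AND t.OHORDER = p.OHORDER);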

Here is the SQL I used to build my INSERT and DELETE statements:

--This script was created by Gugg to more efficiently copy data, for example to refresh
--test data from Prod.  Before running, change your SSMS output to text and make sure the
--maximum number of characters displayed in each column is set to 8192, the max allowed.
--It will also help to turn off headers and change the output to tab delimited.

--Company Copy
USE DBName

DECLARE @ToCo VARCHAR(3) = '125' --Test Data Identifier
DECLARE @FCo VARCHAR(3) = '820' --Production Data Identifier
DECLARE @ToDB VARCHAR(25) = 'DatabaseName_Test' --Destination Database Name
DECLARE @FromDB VARCHAR(25) = 'DatabaseName' --Source Database Name
DECLARE @ToServer VARCHAR(25) = 'TestServer' --Destination Server
DECLARE @FromServer VARCHAR(25) = 'TestServer' --Source Server; if different, the servers must be linked

--Don't need rowcount
SET NOCOUNT ON

--Make sure ToCo is not 820 - we never want to overwrite production data
IF @ToCo = '820'
BEGIN
   RAISERROR ('Production data 820 should NOT be overwritten',18,1)
   RETURN --severity 18 alone does not stop the batch, so exit explicitly
END

--List of all tables and fields with a company column on it.
DECLARE @TablesAndFields TABLE (TID INT IDENTITY(1,1), tablename VARCHAR(255), compcolumnname VARCHAR(255), allcolumnnames VARCHAR(8000))

INSERT INTO @TablesAndFields
   (tablename, compcolumnname, allcolumnnames)
SELECT DISTINCT columns.TABLE_NAME, columns.COLUMN_NAME, dbo.fnTableColumnsImproved(columns.TABLE_NAME, columns.TABLE_SCHEMA)
FROM information_schema.COLUMNS columns
INNER JOIN sys.objects tables ON columns.TABLE_NAME = tables.name
WHERE columns.TABLE_SCHEMA = 'cwi'  --Only grab from the TPS schema; must match the cwi schema used in the generated statements
AND columns.column_name <> 'id' --Don't need identity columns
AND columns.COLUMN_NAME LIKE '__COMP' --This is the Identifier naming convention
AND tables.[type] = ( 'U' ) --User tables only 
AND tables.schema_id = 5 --Redundant, only grab from TPS schema
AND tables.[name] NOT IN ( 'SCUSER', 'OEZPCS', 'PRPROC', 'SCUCLS', 'OEITXX', 'DWTRAN' ) --Exclude certain tables

DECLARE @r INT;

SET @r = 1;

--Loop until there are no more tables to go through
WHILE @r >0
BEGIN
   --Set up delete statements
   SELECT TOP 50 'DELETE ToCo FROM ' + @ToServer + '.' + @ToDB + '.cwi.' + tablename
          + ' ToCo WHERE ToCo.' + compcolumnname + ' = ' + @ToCo + ';' AS SQLDeleteStatement
   FROM @TablesAndFields
   ORDER BY tablename;

   --Set up insert statements
   SELECT TOP 50 'INSERT INTO ' + @ToServer + '.' + @ToDB + '.cwi.' + tablename + ' (' + allcolumnnames + ')'
          + ' SELECT ' + REPLACE(allcolumnnames, compcolumnname, @ToCo + ' AS ' + compcolumnname)
          + ' FROM ' + @FromServer + '.' + @FromDB + '.cwi.' + tablename + ' ToCo'
          + ' WHERE ' + compcolumnname + ' = ' + @FCo + ';' AS SQLInsertStatement
   FROM @TablesAndFields
   ORDER BY tablename;

   --delete out tables that have been processed
   ;WITH cte AS (SELECT TOP 50 * FROM @TablesAndFields ORDER BY tablename)
   DELETE FROM cte;

   SET @r = @@ROWCOUNT

END

GO

Here is the function used to build the comma-separated column list:

CREATE FUNCTION [dbo].[fnTableColumnsImproved]
   (@tablename VARCHAR(255),@schemaname VARCHAR(255) )

RETURNS VARCHAR(5000)
AS
BEGIN

--Concatenate the table's column names into a comma-separated list, skipping the ID identity column
DECLARE @output VARCHAR(5000)
    SELECT @output = COALESCE(@output + ', ', '') + column_name
    FROM information_schema.COLUMNS
    WHERE TABLE_NAME = @tablename
   AND TABLE_SCHEMA = @schemaname
   AND COLUMN_NAME <> 'ID'

    RETURN @output
END

GO
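
To show what the output looks like: for the hypothetical cwi.OEORDH table with columns OHCOMP, OHORDER, and OHCUST, the script prints pairs of statements along these lines, which I paste into a new SSMS window and run in blocks:

DELETE ToCo FROM TestServer.DatabaseName_Test.cwi.OEORDH ToCo WHERE ToCo.OHCOMP = 125;

INSERT INTO TestServer.DatabaseName_Test.cwi.OEORDH (OHCOMP, OHORDER, OHCUST) SELECT 125 AS OHCOMP, OHORDER, OHCUST FROM TestServer.DatabaseName.cwi.OEORDH ToCo WHERE OHCOMP = 820;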

Using this script I was able to significantly improve a rather cumbersome process that has to be run two to four times per year.