Typically with a table in SQL Server if you want to enable compression, you’d have to duplicate the table and have downtime while you get the 2 tables in sync and swap them over. Using computed columns, you can compress a table with no downtime, and without requiring any extra database or disk space.
So you have a table that is along the lines of this:
CREATE TABLE WebRequest ( WebRequestID INT IDENTITY(1,1) NOT NULL, RequestType INT NOT NULL, RequestDate DATETIME NOT NULL, RequestXML VARCHAR(MAX) NOT NULL )
I’m using a table that’s auditing requests to a web service, but this could also be something storing free text notes in your application – basically anything with a LOB data type (the VARCHAR(MAX) RequestXML column in my example).
The data in it looks like this:
Your table has got to the point where it has hundreds of millions of rows in it, and it’s 800GB in size. You want to compress the table rather than have to pay for more disk space, but copying all the data into a new table and compressing it as you do so means you need alot of extra database space (my 800GB table would have needed an extra 180GB of space available), and getting the original and the new table in sync means downtime for the application while you stop new rows going into the original table.
You could add a new column to the table and insert the compressed data into it to, then remove the data from your uncompressed column, and while this means you don’t need the extra disk space, your application then needs to retrieve both the compressed and uncompressed columns each time and check which column it should use.
Here’s how you can compress this table without using any extra space, and with no downtime.
Prepare your application.
First, look at how you are going to compress the data.
SQL Server 2016 onwards, you can use the native COMPRESS() and DECOMPRESS() functions.
For versions of SQL Server prior to 2016, you can implement a CLR function as a wrapper for something like gzip, to mimic the 2016 functions (search for GZipStream SQL CLR for examples).
Next, you need to make sure all references to the table in your application use stored procedures, and that all columns used are specified (no using SELECT * !)
For example, to retrieve a specific row from my table, I have the SP:
CREATE PROCEDURE R_WebRequest @WebRequestID INT AS BEGIN SELECT RequestType, RequestDate, RequestXML FROM WebRequest WHERE WebRequestID = @WebRequestID END
And to insert into the table:
CREATE PROCEDURE I_WebRequest @RequestType INT, @RequestDate DATETIME, @RequestXML VARCHAR(MAX) AS BEGIN INSERT INTO WebRequest ( RequestType, RequestDate, RequestXML ) VALUES (@RequestType, @RequestDate, @RequestXML) END
You also need to make your LOB column allow NULL values, if it doesn’t already:
ALTER TABLE dbo.WebRequest ALTER COLUMN RequestXML VARCHAR(MAX) NULL
Changing a column to allow NULL values is instant, regardless of the size of the table.
Add columns to your table
We need a column to store our compressed data in:
ALTER TABLE dbo.WebRequest ADD RequestXMLCompressed VARBINARY(MAX) NULL
Again, this is instant to add to a table, as it’s adding a NULL value rather than a default data value. (Note that if you have a long running statement accessing the table, that may block you from adding the column. You should be fine with quick running functions, even if they’re very frequent).
Now we add the computed column.
Computed columns are supported in all versions of SQL Server from 2008 onwards (and if you’re still using something earlier than 2008 you really need to upgrade!) and are added to a table in the same way as a regular column, but instead of specifying a data type, you provide a SQL statement to execute.
From Microsoft’s article:
“A computed column is a virtual column that is not physically stored in the table but is computed from an expression that uses other columns in the same table”
You can set a “persisted” flag on computed columns, that will store the calculated value in the database – don’t enable it for this task, as doing so will store the uncompressed as well as the compressed data!
ALTER TABLE dbo.WebRequest ADD RequestXMLRead AS CASE WHEN RequestXMLCompressed IS NULL THEN RequestXML ELSE CONVERT(VARCHAR(MAX), DECOMPRESS(RequestXMLCompressed)) END
We’re adding a column with a CASE statement in it, that will return our existing RequestXML column if our new RequestXMLCompressed column doesn’t have any data in it, or it will return the decompressed RequestXMLCompressed column if it does have data in it.
Note the CONVERT(VARCHAR(MAX) around the DECOMPRESS() statement. SQL Server doesn’t know what data type you want your decompressed column to be in, so you need to convert it appropriately here (to the same data type as your uncompressed column).
Our table now looks like this:
You can see the new RequestXMLCompressed column with nothing in it yet, and the new computed column RequestXMLRead, which looks identical to the RequestXML column, but isn’t taking up any extra space in the database.
Update Stored Procedures
We can now update our stored procedures to use the new RequestXMLRead column.
Important: you must update all stored procedures that read from the table first, before any inserts or updates, or you’ll fail to read the data for your new or updated rows.
My read stored procedure now looks like this:
ALTER PROCEDURE R_WebRequest @WebRequestID INT AS BEGIN SELECT RequestType, RequestDate, RequestXMLRead FROM WebRequest WHERE WebRequestID = @WebRequestID END
The only change here is that instead of returning the RequestXML column, we’re returning the RequestXMLRead column.
The insert stored procedure needs to be changed to this:
ALTER PROCEDURE I_WebRequest @RequestType INT, @RequestDate DATETIME, @RequestXML VARCHAR(MAX) AS BEGIN INSERT INTO WebRequest ( RequestType, RequestDate, RequestXMLCompressed ) VALUES (@RequestType, @RequestDate, COMPRESS(@RequestXML)) END
We’ve changed the column specified in the INSERT to be RequestXMLCompressed, ignoring the RequestXML column, and we’ve wrapped the passed in @RequestXML in the COMPRESS() function.
Now all values going into the table will be compressed, and we’ll start saving disk space over what we would have used previously.
If we call the insert stored procedure and add a new row to the table:
EXEC I_WebRequest 2, '05 Sep 2018 19:30', '<root><upper>people</upper><bet>2028355427</bet><border><riding>139679720.9040308</riding><cloud>382980347</cloud></border><atom>-14478772.229665756</atom><rapidly>instrument</rapidly><pond>pink</pond></root>'
The table now looks like this:
You can see our new row at the bottom – the RequestXML column is NULL, the data is stored in the RequestXMLCompressed column, but the RequestXMLRead column contains the readable XML that we inserted.
An added bonus to using this approach is that the table is still human readable without needing to remember to decompress the data first.
Update the existing data
As all the new data going into the table is now compressed, we’ve slowed the rate at which the table is growing, and we can start to compress the old data to recover space in the database.
Before you start, run this SQL to get the current size of the table, so you can compare it to when it’s all compressed:
EXEC sp_spaceused 'WebRequest'
This is just a test table so it’s tiny, but we can still see the impact after the data has all been compressed.
We need to take the existing RequestXML data and compress it into the RequestXMLCompressed column. When this is done, we need to set the RequestXML column to be NULL to recover the space used by it.
Run something along these lines to update your existing data:
DECLARE @IDToUpdate INT WHILE EXISTS (SELECT 1 FROM WebRequest WHERE RequestXML IS NOT NULL) BEGIN SET @IDToUpdate = (SELECT TOP 1 WebRequestID FROM WebRequest WHERE RequestXML IS NOT NULL) UPDATE wr SET wr.RequestXMLCompressed = COMPRESS(wr.RequestXML) FROM WebRequest wr WHERE wr.WebRequestID = @IDToUpdate UPDATE wr SET wr.RequestXML = NULL FROM WebRequest wr WHERE wr.WebRequestID = @IDToUpdate END
The data in our table now looks like this:
All the rows have data in the RequestXMLCompressed column, and nothing in the RequestXML column, but the RequestXMLRead column returns the same readable XML.
If we check the size of the table again, we can see the data size has halved (16KB to 8KB):
Note that this isn’t a great data saving because my XML data is pretty small. For tables with large amounts of LOB data, you can see the table reduce significantly – an 800GB table I’ve performed this on was 180GB after the compression.
You may need to wait for SQL Server to hit the now unused pages in the database before the sp_spaceused command returns a smaller value – this is due to how SQL Server cleans up deleted pages in the database. You can force this on the table by rebuilding the clustered index, as long as it is safe to do so in your system.
This update SQL can be left running for as long as it takes to complete – even if you only run it over night when the system is quiet and it takes 3 weeks to run, all the time it’s compressing data and making your database smaller and your application is retrieving the data with no extra code required.