Querying 'top sequences' on SQL Server

by Carl Johansen

(This article originally appeared on the SQL Server Worldwide Users Group site, on 22-Jun-06)

Summary

This article describes a technique for analysing sets of rows that are to be considered as ordered sequences.  It uses the example of sequences of page hits by users of a web site, but the method could be used in many other situations.

Recently I worked on a usage reporting system for a website.  I captured the visits and requests from the site and stored them in SQL Server and I produced the usual reports like Most Requested Pages, Top Referrers, etc.  Then the users asked, "can you show us the most popular paths that people take when they enter the site?"  In other words, what are the most popular sequences of requests?  An example of a sequence of length three would be [home.aspx » news.aspx » contactus.aspx].  Sequences are ordered, so that sequence is not the same as [home.aspx » contactus.aspx » news.aspx].  This is an interesting problem and I would like to show you a simplified version of my solution, which does not use any fancy tricks like user-defined aggregates or cursors and which works on SQL Server 2000 or later.

First, let's define a simple data model:

A visit consists of a single row in tblVisit, associated with a set of requests in tblRequest (ordered by sintSeqNum) to represent the requests made by a single user on a single occasion.  The request rows contain pointers to the rows in tblObject, which contain the names of the pages in the website.  In other words, tblRequest makes a many-to-many join between tblVisit and tblObject.

Suppose we have these rows in tblObject:

intObjectID vchrURL
1001 home.aspx
1002 news.aspx
1003 contactus.aspx

Now if the user visits these pages in the order home.aspx » news.aspx » contactus.aspx, then we will have the following rows in tblRequest:

intRequestID intVisitID sintSeqNum intObjectID
5678 1234 1 1001
5679 1234 2 1002
5680 1234 3 1003

(The intRequestID value is basically irrelevant here.  The intVisitID value of 1234 simply identifies that all these rows belong to the same visit.)

These rows taken together constitute a sequence (of length three).  In order to generate the Top Sequences report, we need a way to uniquely identify each such sequence and to treat it as a single unit.  This is a little tricky, because SQL is oriented towards treating rows independently of each other.  A promising idea is somehow to generate a hash value for each sequence.  It doesn't matter what the value is, as long as it is different for different sequences (including two sequences that differ only in the ordering of their requests).  If we had such a value for each of the sequences that met the report's criteria (such as time frame), we could simply group by it and find the most common sequences.

If we are going to create a hash for our sequence, then the obvious ingredient is intObjectID.  The first idea that sprang to mind was to sum the intObjectID values.  For the above sequence we would get a hash of 1001 + 1002 + 1003 = 3006.  Such a calculation could easily be done for all the sequences in the report with a single aggregate query grouped by intVisitID, but of course it is nowhere near unique; the same requests in a different order would produce the same hash.  So what about multiplying each object ID by its corresponding sequence order number?  Then our hash would be (1001 * 1) + (1002 * 2) + (1003 * 3) = 6014.  That's better, but it's not hard to imagine two very different sequences being given the same hash if we're unlucky with the combinations of object IDs and sequence numbers.

This line of reasoning leads to the realisation that the best possible hash would be made simply by laying the object IDs out from left to right (or right to left - it doesn't matter which).  It's easier at this point to visualise them in hex (thinking in 16-bit for the moment):

1001 = 0x03E9
1002 = 0x03EA
1003 = 0x03EB

Hash for 1001 » 1002 » 1003 = 0x03E903EA03EB ( = 4299327931371 in decimal)

Is it unique?  Yes.  Can SQL Server handle the number?  Yes, in a 64-bit bigint.  Is it doable in a single SUM?  Yes (with a couple of tricks).  So is this it?  Not quite.  The example above will work, but a real-world situation is going to hit the buffers.  First, we might want to look at sequences of longer than length three.  Second, our object ID values are 32-bit ints (we could have tens of thousands of pages in the web site).  So to hash a sequence of length six, we would need 6 * 32 = 192 bits!  To fit the hash into 63 bits, we are going to need a couple of rules:

  1. We will not report on sequences of length greater than about six.
  2. For a sequence of length n, we will put only the [63 \ n] least significant bits of each object ID into the hash (we use only 63 bits because bigints are signed values and we use only positive values in our hashes).

Suppose we are reporting on sequences of length five.  Then we take only [ 63 \ 5 ] = 12 bits from each object ID and concatenate them, giving a 60-bit hash.  If the object IDs were random numbers then it wouldn't matter which 12 bits we chose, as long as we were consistent.  I suggest taking the 12 least significant bits because of the nature of the data (since intObjectID is an identity, the values in tblObject will be 1, 2, 3, ...).  Of course, something has to give, and now there is theoretical risk of non-uniqueness in our hash.  The 12 least significant digits are repeated every 4096 rows.  So the following sequences will all have the same hash value (x[i] is the ID of the ith object in the sequence):

  • x[0] » x[1] » x[2] » x[3] » x[4]
  • x[0] » x[1] » x[2] » x[3] » ( x[4] + 4096 )
  • x[0] » x[1] » ( x[2] + 8192 ) » x[3] » x[4]
  • etc...

How much of a problem this is in reality will depend on your circumstances, but I think in most cases it is not worth worrying about.  Remember that our hash is up to 63 bits long - enough to lay down about 18,000 hash values upon every square metre of the surface of the earth.  The theoretical risk of collisions is vanishingly small.

Let's call the bit block size b = [63 \ n].  We can extract the b least significant bits from a number x with a simple mask: x & (2^b - 1).  So the formula for our hash value of the object IDs x[i] for a sequence of i = 0 to n-1 (treating all object IDs as bigints) is:

SUM( ( x[i]  & (2^b - 1 ) ) << (b * i) )

It's a darn shame that there's no left-shift operator in SQL Server.  We'll have to fake it with multiplications, which will mean an unfortunate performance hit  (for example, to calculate x << 12 we have to compute x * 4096).  But the bottom line is that we have a method that can give us an (almost) unique hash value for every sequence in the report with a single aggregate query.  At last, it's time for some SQL.

First, some setup stuff.  We compute the size of the block of bits that we will take from each object ID, and the bit mask that will extract those bits.  We also set up a table of powers of two that will help us perform the left shifts.

DECLARE @tintPathLength tinyint
DECLARE @blockSize tinyint
DECLARE @blockMask bigint
DECLARE @i int

SELECT @tintPathLength = 5   -- You can use any path length from 1 to 63, but I recommend not going over 6.

SELECT @blockSize = 63 / @tintPathLength

-- Block mask is @blockSize 1 bits - ie (2 ^ @blockSize) - 1.
-- The following technique lets us compute 2^63 - 1 without overflow.
SELECT @blockMask = power(cast(2 as bigint), @blockSize - 1)
SELECT @blockMask = @blockMask + (@blockMask - 1)

DECLARE @tblShiftPowers TABLE ( intNumBlockShifts int, bintPower bigint )

-- Build a table of powers of two, which is used to position the bit blocks in our hashes
--  (unfortunately, there's no Left Shift operator in SQL Server)
SELECT @i = 0
WHILE (@i < @tintPathLength)
BEGIN
INSERT INTO @tblShiftPowers ( intNumBlockShifts, bintPower )
VALUES ( @i, power(cast(2 as bigint),@blockSize * @i) )
SELECT @i = @i + 1
END

Now we need to extract the sequences on which we wish to report.  This part will probably need to be customised for different situations.  The goal is to fill the temporary table @tblVisitSequence with sequences.  Each sequence is identified by an intVisitID and the items in the sequence must be numbered 0, 1, 2, ... in column sintSeqNum.  The example below is very simple; it just extracts the first tintPathLength rows in each visit.  A more advanced query could extract all the sequences that began with a particular page, or take sequences only from those visits that fell within a given time frame.

DECLARE @tblVisitSequence TABLE ( intVisitID int, sintSeqNum smallint, intObjectID int )

INSERT INTO @tblVisitSequence (intVisitID, sintSeqNum, intObjectID )
SELECT R.intVisitID,
(R.sintSeqNum - 1) AS SeqNum,
R.intObjectID
FROM tblRequest R
INNER JOIN tblVisit V
ON R.intVisitID = V.intVisitID
WHERE R.sintSeqNum BETWEEN 1 AND @tintPathLength

Here's an example of tblVisitSequence after running the above statement with tintPathLength = 3:

intVisitID sintSeqNum intObjectID
1234 0 1001
1234 1 1002
1234 2 1003
1456 0 1004
1456 1 1001
1456 2 1009
1770 0 1002
1770 1 1003
1770 2 1001

Now we are ready to fill our temporary table of hashes.  The query contains two aggregates: one to compute the hash for each sequence and another to count the occurrences of each hash:

DECLARE @tblPathHash TABLE ( intExampleVisitID int, intNumOccurs int )

INSERT INTO @tblPathHash ( intExampleVisitID, intNumOccurs )
SELECT MIN(intVisitID) AS ExampleVisitID,
COUNT(*) AS NumPathOccurrences
FROM (
-- Get the path hash for each visit
SELECT intVisitID,
SUM(bintPower * (intObjectID & @blockMask)) AS PathHash
FROM @tblVisitSequence VS
INNER JOIN @tblShiftPowers
ON intNumBlockShifts = VS.sintSeqNum
GROUP BY intVisitID ) AS VistPathHash
GROUP BY VistPathHash.PathHash
ORDER BY NumPathOccurrences DESC

This will fill @tblPathHash with the number of occurrences of each sequence.  Each sequence is represented by a visit ID, which is simply the visit ID of one of the visits that contains the sequence.  You could easily adapt the query to represent the sequences in some other way, depending on what you wanted to report.

I have put it all together in a bare-bones script that should give you the basic idea.  The script creates the three main tables and populates them with a small amount of data.  However, you won't see very meaningful results until your database contains a large number of sequences.

Conclusion

This article has described a low-tech method for dealing with ordered sequences of rows - something that is not really natural in SQL.  Even though SQL Server 2005 offers user-defined aggregates that could provide a more flexible alternative, I still think that the approach I have described offers a useful combination of good performance, low memory footprint and minimal coding.  Purists might appreciate the absence of cursors.

©2009 Carl Johansen