Querying 'top sequences' on SQL Server
by Carl Johansen
(This article originally appeared on the SQL Server Worldwide Users Group site, on 22-Jun-06)
Summary
This article describes a technique for analysing sets of rows that are to be considered
as ordered sequences. It uses the example of sequences of page hits by users
of a web site, but the method could be used in many other situations.
Recently I worked on a usage reporting system for a website. I captured the
visits and requests from the site and stored them in SQL Server and I produced the
usual reports like Most Requested Pages, Top Referrers, etc. Then the users
asked, "can you show us the most popular paths that people take when they
enter the site?" In other words, what are the most popular sequences
of requests? An example of a sequence of length three would be [home.aspx
» news.aspx » contactus.aspx]. Sequences are ordered, so that sequence is
not the same as [home.aspx » contactus.aspx » news.aspx]. This is an interesting
problem and I would like to show you a simplified version of my solution, which
does not use any fancy tricks like user-defined aggregates or cursors and which
works on SQL Server 2000 or later.
First, let's define a simple data model:
A visit consists of a single row in tblVisit, associated with a set of requests
in tblRequest (ordered by sintSeqNum) to represent the requests made by a single
user on a single occasion. The request rows contain pointers to the rows in
tblObject, which contain the names of the pages in the website. In other words,
tblRequest makes a many-to-many join between tblVisit and tblObject.
Suppose we have these rows in tblObject:
|
intObjectID
|
vchrURL
|
|
1001
|
home.aspx
|
|
1002
|
news.aspx
|
|
1003
|
contactus.aspx
|
Now if the user visits these pages in the order home.aspx » news.aspx » contactus.aspx,
then we will have the following rows in tblRequest:
|
intRequestID
|
intVisitID
|
sintSeqNum
|
intObjectID
|
|
5678
|
1234
|
1
|
1001
|
|
5679
|
1234
|
2
|
1002
|
|
5680
|
1234
|
3
|
1003
|
(The intRequestID value is basically irrelevant here. The intVisitID value
of 1234 simply identifies that all these rows belong to the same visit.)
These rows taken together constitute a sequence (of length three). In order
to generate the Top Sequences report, we need a way to uniquely identify each such
sequence and to treat it as a single unit. This is a little tricky, because
SQL is oriented towards treating rows independently of each other. A promising
idea is somehow to generate a hash value for each sequence. It doesn't matter
what the value is, as long as it is different for different sequences (including
two sequences that differ only in the ordering of their requests). If we had
such a value for each of the sequences that met the report's criteria (such as time
frame), we could simply group by it and find the most common sequences.
If we are going to create a hash for our sequence, then the obvious ingredient is
intObjectID. The first idea that sprang to mind was to sum the intObjectID
values. For the above sequence we would get a hash of 1001 + 1002 + 1003 =
3006. Such a calculation could easily be done for all the sequences in the
report with a single aggregate query grouped by intVisitID, but of course it is
nowhere near unique; the same requests in a different order would produce the same
hash. So what about multiplying each object ID by its corresponding sequence
order number? Then our hash would be (1001 * 1) + (1002 * 2) + (1003 * 3)
= 6014. That's better, but it's not hard to imagine two very different sequences
being given the same hash if we're unlucky with the combinations of object IDs and
sequence numbers.
This line of reasoning leads to the realisation that the best possible hash would
be made simply by laying the object IDs out from left to right (or right to left
- it doesn't matter which). It's easier at this point to visualise them in
hex (thinking in 16-bit for the moment):
1001 = 0x03E9
1002 = 0x03EA
1003 = 0x03EB
Hash for 1001 » 1002 » 1003 = 0x03E903EA03EB ( = 4299327931371 in decimal)
Is it unique? Yes. Can SQL Server handle the number? Yes, in a
64-bit bigint. Is it doable in a single SUM? Yes (with a couple of tricks).
So is this it? Not quite. The example above will work, but a real-world
situation is going to hit the buffers. First, we might want to look at sequences
of longer than length three. Second, our object ID values are 32-bit ints
(we could have tens of thousands of pages in the web site). So to hash a sequence
of length six, we would need 6 * 32 = 192 bits! To fit the hash into 63 bits,
we are going to need a couple of rules:
- We will not report on sequences of length greater than about six.
- For a sequence of length n, we will put only the [63 \ n] least significant
bits of each object ID into the hash (we use only 63 bits because bigints are signed
values and we use only positive values in our hashes).
Suppose we are reporting on sequences of length five. Then we take only [
63 \ 5 ] = 12 bits from each object ID and concatenate them, giving a 60-bit hash.
If the object IDs were random numbers then it wouldn't matter which 12 bits we chose,
as long as we were consistent. I suggest taking the 12 least significant bits
because of the nature of the data (since intObjectID is an identity, the values
in tblObject will be 1, 2, 3, ...). Of course, something has to give, and
now there is theoretical risk of non-uniqueness in our hash. The 12 least
significant digits are repeated every 4096 rows. So the following sequences
will all have the same hash value (x[i] is the ID of the ith object in the sequence):
- x[0] » x[1] » x[2] » x[3] » x[4]
- x[0] » x[1] » x[2] » x[3] » ( x[4] + 4096 )
- x[0] » x[1] » ( x[2] + 8192 ) » x[3] » x[4]
- etc...
How much of a problem this is in reality will depend on your circumstances, but
I think in most cases it is not worth worrying about. Remember that our hash
is up to 63 bits long - enough to lay down about 18,000 hash values upon every square
metre of the surface of the earth. The theoretical risk of collisions is vanishingly
small.
Let's call the bit block size b = [63 \ n]. We can extract the b
least significant bits from a number x with a simple mask: x & (2^b - 1).
So the formula for our hash value of the object IDs x[i] for a sequence of i = 0
to n-1 (treating all object IDs as bigints) is:
SUM( ( x[i] & (2^b - 1 ) ) << (b * i) )
It's a darn shame that there's no left-shift operator in SQL Server. We'll
have to fake it with multiplications, which will mean an unfortunate performance
hit (for example, to calculate x << 12 we have to compute x * 4096).
But the bottom line is that we have a method that can give us an (almost) unique
hash value for every sequence in the report with a single aggregate query.
At last, it's time for some SQL.
First, some setup stuff. We compute the size of the block of bits that we
will take from each object ID, and the bit mask that will extract those bits.
We also set up a table of powers of two that will help us perform the left shifts.
DECLARE @tintPathLength tinyint
DECLARE @blockSize tinyint
DECLARE @blockMask bigint
DECLARE @i int
SELECT @tintPathLength = 5 -- You can use any path length from 1 to 63, but
I recommend not going over 6.
SELECT @blockSize = 63 / @tintPathLength
-- Block mask is @blockSize 1 bits - ie (2 ^ @blockSize) - 1.
-- The following technique lets us compute 2^63 - 1 without overflow.
SELECT @blockMask = power(cast(2 as bigint), @blockSize - 1)
SELECT @blockMask = @blockMask + (@blockMask - 1)
DECLARE @tblShiftPowers TABLE ( intNumBlockShifts int, bintPower bigint )
-- Build a table of powers of two, which is used to position the bit blocks in our
hashes
-- (unfortunately, there's no Left Shift operator in SQL Server)
SELECT @i = 0
WHILE (@i < @tintPathLength)
BEGIN
INSERT INTO @tblShiftPowers ( intNumBlockShifts, bintPower )
VALUES ( @i, power(cast(2 as bigint),@blockSize * @i) )
SELECT @i = @i + 1
END
Now we need to extract the sequences on which we wish to report. This part
will probably need to be customised for different situations. The goal is
to fill the temporary table @tblVisitSequence with sequences. Each sequence
is identified by an intVisitID and the items in the sequence must be numbered
0, 1, 2, ... in column sintSeqNum. The example below is very simple; it just
extracts the first tintPathLength rows in each visit. A more advanced
query could extract all the sequences that began with a particular page, or take
sequences only from those visits that fell within a given time frame.
DECLARE @tblVisitSequence TABLE ( intVisitID int, sintSeqNum smallint, intObjectID
int )
INSERT INTO @tblVisitSequence (intVisitID, sintSeqNum, intObjectID )
SELECT R.intVisitID,
(R.sintSeqNum - 1) AS SeqNum,
R.intObjectID
FROM tblRequest R
INNER JOIN tblVisit V
ON R.intVisitID = V.intVisitID
WHERE R.sintSeqNum BETWEEN 1 AND @tintPathLength
Here's an example of tblVisitSequence after running the above statement with tintPathLength
= 3:
|
intVisitID
|
sintSeqNum
|
intObjectID
|
|
1234
|
0
|
1001
|
|
1234
|
1
|
1002
|
|
1234
|
2
|
1003
|
|
1456
|
0
|
1004
|
|
1456
|
1
|
1001
|
|
1456
|
2
|
1009
|
|
1770
|
0
|
1002
|
|
1770
|
1
|
1003
|
|
1770
|
2
|
1001
|
Now we are ready to fill our temporary table of hashes. The query contains
two aggregates: one to compute the hash for each sequence and another to count the
occurrences of each hash:
DECLARE @tblPathHash TABLE ( intExampleVisitID int, intNumOccurs int )
INSERT INTO @tblPathHash ( intExampleVisitID, intNumOccurs )
SELECT MIN(intVisitID) AS ExampleVisitID,
COUNT(*) AS NumPathOccurrences
FROM (
-- Get the path hash for each visit
SELECT intVisitID,
SUM(bintPower * (intObjectID & @blockMask)) AS PathHash
FROM @tblVisitSequence VS
INNER JOIN @tblShiftPowers
ON intNumBlockShifts = VS.sintSeqNum
GROUP BY intVisitID ) AS VistPathHash
GROUP BY VistPathHash.PathHash
ORDER BY NumPathOccurrences DESC
This will fill @tblPathHash with the number of occurrences of each sequence.
Each sequence is represented by a visit ID, which is simply the visit ID of one
of the visits that contains the sequence. You could easily adapt the query
to represent the sequences in some other way, depending on what you wanted to report.
I have put it all together in a bare-bones script that should
give you the basic idea. The script creates the three main tables and populates
them with a small amount of data. However, you won't see very meaningful results
until your database contains a large number of sequences.
Conclusion
This article has described a low-tech method for dealing with ordered sequences
of rows - something that is not really natural in SQL. Even though SQL Server
2005 offers user-defined aggregates that could provide a more flexible alternative,
I still think that the approach I have described offers a useful combination of
good performance, low memory footprint and minimal coding. Purists might appreciate
the absence of cursors.