Alan Dix - research topics - randomness
Normal SQL databases do not give any support for random sampling, so we have to resort to 'tricks' to get a sample. There are two main ways. In each we use some field from the data, often computed, that is deemed to be 'random'. We then either:
SELECT * FROM data ORDER BY the_field LIMIT 100;
SELECT * FROM data WHERE MOD(the_field,1000) < 10;
If the database has 10,000 entries then (b) will obtain approximately 100 records, whereas (a) will get exactly 100 records. However, if we look at the time to compute, (b) is a lot quicker as it is simply a filter and is O(N) (~10,000 record accesses in this example). In contrast (a) is normally O(NlogN) as the database has to be sorted before the sample (of size n) is extracted (~300,000 record accesses). A very clever sort implementation may be able to extract the top n in O(Nlogn) (~90,000), but this would still take longer than (b). This is one reason for suggesting some latitude in the number in the sample! The accuracy of a calculation on a subset of the data may not be very dependent on the sample size anyway [Piatetsky-Shapiro 1984].
Only if the sort field has been indexed, either (a) or (b) be able to obtain the sample in O(n) time.
The field to use for selection varies on the data.
One special choice is the appearance order in the database, in which case (a) and (b) both reduce to:
SELECT * FROM data LIMIT 100;
That is choose the first n. Clearly this is only acceptable if the original order in which the fields were added is in some way random relative to the attributes of interest.
As already noted some fields may be a natural candidate, such as the middle letters of a street name or digits of a telephone number.
Alternatively a random field may be created using the SQL '
RAND()' function. However, this has to be done with care. For example, the following does not work as expected in some databases (e.g. MySQL).
SELECT *, ra AS RAND() FROM data ORDER BY ra LIMIT 100;
Either the query optimiser notices that the expression '
RAND()' appears to be constant and so optimises it away! Or the expression is recalculated every time the record is read. At best this will simply mean that the sort is random as desired but the value of '
ra' returned in the query is arbitrary and not related to the value used in the sort. However, at worst the changing value of the sort field could make the database sort misbehave and possibly not terminate at all!
A safer alternative is to build a separate table consisting of the original database key and a calculated random value.
INSERT INTO data_rnd ( key, ra ) SELECT key, ra AS RAND() FROM data;
Selections can then use this table joined to the original data table. This has the advantage that records for filtering or sorting are drawn from this table which is still O(N) large, but consists of smaller records. For very large datasets this table could itself be a sample of the data set from which you sub-sample. This table in turn should be indexed on the random field '
ra' to enable O(n) sampling.
The use of a special field, whether generated or pre-existing means that we can be sure that samples overlap, so a sample of 200 will always includes the sample of 100. If we want independent samples we need to work harder. In case (b) we can use non-overlapping ranges:
SELECT *, m AS MOD(the_field,1000) FROM data WHERE 10 <= m and m < 20;
Or overlap the ranges by just the right amount to have the correct statistical properties.
Another way is to hash the field used for selection with a fixed key, for example:
SELECT *, FROM data WHERE MOD(MOD(the_field*key,large_prime),1000) < 10;
Different choices of key and large_prime give independent sample sets, each with the required overlap properties.
The problem with doing this is that the nature of a hashing destroys any value in indexing
the_field, meaning we are back to O(N) or O(NlogN) sampling. However, an index table could be created for multiple sample families.
Sampling only those records satisfying a query requires some care. Case (a) is easy and the relevant
WHERE clause is just added. Case (b) is more difficult. For example, if there are 10,000 records, and we believe the query would return 20% of these, but we want a sample of 100, then the select statement should be:
SELECT * FROM data WHERE query AND MOD(the_field,1000) < 50;
The figure of 50 out of 1000 (5%) is chosen so that 5% of the 20% that match the query gives us 1% of the database, 100 records.
Weighted sampling based on attribute values is more complex still
SELECT * FROM data WHERE MOD(the_field,1000) < weight*lim;
The fixed value
lim needs to be chosen to get the right number of samples. Depending on how much is known a priori about the distribution of the attribute a smaller sub-sample may need to be selected first to determine the right size. In the case of the Astral Visualiser or other forms of 'zoom', the data collected for the overview can act as this sub-sample.