When storing massive amounts of data in data lake or data warehouse solutions with each data entry attached to an ObjectID, the timestamp data embedded in the ObjectID comes in handy for certain time range queries without unnecessary joins. Here are a few example use cases.
- Summarize 12 month trailing purchase records by user creation/signup time given that user ID’s are ObjectID’s
- Summarize 12 month trailing activities by user session given that session ID’s are ObjectID’s
In this article, we will look into parsing and using the timestamp part of the ObjectID in Hive and Presto queries. Before we go further, let’s first look into the anatomy of the 12-byte ObjectID and go into the queries.
Anatomy of the ObjectID
Let’s break down the Mongo ObjectID “5fee6600f75eabce97882944” as an illustration. Depending on the MongoDB version, the anatomy is different. We will look at both but the meaning of the first 4 bytes did not change.
Exclusively From MongoDB 3.2
5fee6600 | f75eabce97 | 882944
(1) (2) (3)
- The first 4 bytes (1) is the creation time in seconds since the Unix epoch
- The next 5 bytes (2) is the random value
- The last 3 bytes (3) is the incrementing counter
Inclusively To MongoDB 3.2
5fee6600 | f75eab | ce97 | 882944
(1) (2) (3) (4)
- The first 4 bytes (1) and the last 3 bytes (4) are the same as the later versions
- The second 3 bytes (2) is the machine identifier
- The next 2 bytes (3) is the process id
Extracting the Creation Time of the ObjectID’s
The following operations in their respective platform will result in an epoch time.
Hive
CAST(
CONV(
SUBSTRING(your_object_id, 0, 8),
16, 10
)
AS BIGINT)
Presto
FROM_BASE(
SUBSTR(your_object_id, 1, 8), -- index starts from 1
16
)