Extract Time From Mongo ObjectID For Time Range Queries

When storing massive amounts of data in data lake or data warehouse solutions with each data entry attached to an ObjectID, the timestamp data embedded in the ObjectID comes in handy for certain time range queries without unnecessary joins. Here are a few example use cases.

  • Summarize 12 month trailing purchase records by user creation/signup time given that user ID’s are ObjectID’s
  • Summarize 12 month trailing activities by user session given that session ID’s are ObjectID’s

In this article, we will look into parsing and using the timestamp part of the ObjectID in Hive and Presto queries. Before we go further, let’s first look into the anatomy of the 12-byte ObjectID and go into the queries.

Anatomy of the ObjectID

Exclusively From MongoDB 3.2

5fee6600 | f75eabce97 | 882944
(1) (2) (3)
  1. The first 4 bytes (1) is the creation time in seconds since the Unix epoch
  2. The next 5 bytes (2) is the random value
  3. The last 3 bytes (3) is the incrementing counter

Inclusively To MongoDB 3.2

5fee6600 | f75eab | ce97 | 882944
(1) (2) (3) (4)
  1. The first 4 bytes (1) and the last 3 bytes (4) are the same as the later versions
  2. The second 3 bytes (2) is the machine identifier
  3. The next 2 bytes (3) is the process id

Extracting the Creation Time of the ObjectID’s

Hive

CAST(
CONV(
SUBSTRING(your_object_id, 0, 8),
16, 10
)
AS BIGINT)

Presto

FROM_BASE(
SUBSTR(your_object_id, 1, 8), -- index starts from 1
16
)

References

TODO

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store