Point In Time Metadata
a lesson learned as a Full Stack Product Analyst, unlocking performance and efficiency
Analysts often face criticism for ignoring cost-efficiency, opting for more compute power instead of better solutions. Although there are valid reasons for this, it’s time to rethink how we handle large datasets, especially with JSON data stored inside engines designed for tabular data.
This is a lesson I learned as a Full Stack Product Analyst at Firefly.ai, where I served as Product Analyst, Data Product Manager, and Data Engineer.
Faster results mean more focus and better analysis.
Here are two examples showcasing how to improve performance and cut costs without sacrificing data accuracy.
APPLICATION EVENT LOGS:
Optimizing Event Log Processing
Scenario: You’re working with raw JSON data containing event logs. The goal is to transform this into a daily log of active users without vendor lock-in, ensuring cost-effective performance.¹
Challenges:
Large datasets (e.g., 10M+ rows).
Avoiding redundant re-processing of rows.
Avoiding vendor lock-in.
Solution:
Add a timestamp column to the raw table (MiRRor). Every new record automatically includes its creation timestamp.
Create a Point In Time (PIT) Metadata table (DIMensional) to store key processing details like the last processed date.
Benefits:
No need to re-compute MAX(DATE) from the FACTual table for every query.
Process only new rows added after the last run.
Significantly faster processing and reduced compute costs.
How the procedure works:
Use PIT Metadata to check the last processed date.
Compare it with the timestamp of new records.
Parse only the unprocessed rows into a structured table for analysis.
To illustrate the process flow further, and to let anyone test and compare the efficiency for themselves, follow the steps below; a sketch of the incremental parsing procedure itself follows the loading script.
CREATE DATABASE PIT_METADATA;
USE PIT_METADATA;
CREATE TABLE PIT_METADATA.dbo.application_event_logs (event_logs NVARCHAR(MAX));
--loading 5 million of the 10,104,419 total event-log rows
---LOAD CSV
BULK INSERT PIT_METADATA.dbo.application_event_logs
FROM 'C:\Users\userName\Desktop\Documents\novsto\first_batch_(5m_event_logs).csv'
WITH (
FORMAT='CSV',
FIRSTROW=1
);
ALTER TABLE PIT_METADATA.dbo.application_event_logs
ADD TS_MS DATETIME2 NULL;
-- Add a DEFAULT constraint to insert the current system datetime for new rows
ALTER TABLE PIT_METADATA.dbo.application_event_logs
ADD CONSTRAINT DF_TS_MS_DEFAULT
DEFAULT (SYSDATETIME()) FOR TS_MS;
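Note that the 5 million rows loaded before the ALTER keep a NULL TS_MS; only rows inserted after the constraint exists pick up the default. A quick sanity check:
-- rows with NULL TS_MS were loaded before the timestamp column existed
SELECT COUNT(*) AS rows_without_ts
FROM PIT_METADATA.dbo.application_event_logs
WHERE TS_MS IS NULL;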
CREATE TABLE PIT_METADATA.dbo.application_event_logs_staging (event_logs NVARCHAR(MAX));
--loading the remaining ~5.1M rows (of the 10,104,419 total)
BULK INSERT PIT_METADATA.dbo.application_event_logs_staging
FROM 'C:\Users\userName\Desktop\Documents\novsto\second_batch_(5_1m_event_logs).csv'
WITH (
FORMAT = 'CSV',
FIRSTROW = 1,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
);
INSERT INTO PIT_METADATA.dbo.application_event_logs (event_logs)
SELECT event_logs
FROM PIT_METADATA.dbo.application_event_logs_staging;
DROP TABLE PIT_METADATA.dbo.application_event_logs_staging;
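With the raw data loaded and timestamped, here is a minimal sketch of the incremental procedure described in the three steps above. It is not code from the original post: the JSON field names (user_id, event_time), the target table daily_active_users, and the procedure name PARSE_EVENT_LOGS are assumptions for illustration, and the metadata table reuses the PIT_METADATA layout shown in the second example further below.
CREATE TABLE PIT_METADATA.dbo.PIT_METADATA (
UDF_NAME NVARCHAR(1000),
TABLE_NAME NVARCHAR(1000),
VARIABLE_NAME NVARCHAR(1000),
STRING_VALUE NVARCHAR(MAX),
TIMESTAMP_VALUE DATETIME2(3),
DATE_VALUE DATE,
VALUE_LAST_UPDATED_AT DATETIME2(3)
);
-- structured output table (hypothetical): one row per user per day
CREATE TABLE PIT_METADATA.dbo.daily_active_users (
activity_date DATE,
user_id NVARCHAR(100),
events INT
);
-- seed the pointer once, before the first incremental run
INSERT INTO PIT_METADATA.dbo.PIT_METADATA (UDF_NAME, TABLE_NAME, VARIABLE_NAME, TIMESTAMP_VALUE, VALUE_LAST_UPDATED_AT)
VALUES ('PARSE_EVENT_LOGS', 'application_event_logs', 'LAST_PROCESSED_TS', '1900-01-01', SYSDATETIME());
GO
CREATE OR ALTER PROCEDURE PARSE_EVENT_LOGS AS
BEGIN
DECLARE @LAST_PROCESSED DATETIME2(3);
DECLARE @RUN_TS DATETIME2(3) = SYSDATETIME();
-- step1: fetch the last processed timestamp from PIT Metadata (no MAX(DATE) scan of the big table)
SELECT @LAST_PROCESSED = TIMESTAMP_VALUE
FROM PIT_METADATA.dbo.PIT_METADATA
WHERE UDF_NAME = 'PARSE_EVENT_LOGS'
AND VARIABLE_NAME = 'LAST_PROCESSED_TS';
-- step2: parse only rows newer than that timestamp
-- (rows loaded before TS_MS existed have NULL and are picked up on the very first run)
INSERT INTO PIT_METADATA.dbo.daily_active_users (activity_date, user_id, events)
SELECT CAST(j.event_time AS DATE) AS activity_date,
j.user_id,
COUNT(*) AS events
FROM PIT_METADATA.dbo.application_event_logs AS l
CROSS APPLY OPENJSON(l.event_logs)
WITH (user_id NVARCHAR(100) '$.user_id',
event_time DATETIME2 '$.event_time') AS j
WHERE l.TS_MS > @LAST_PROCESSED
OR (l.TS_MS IS NULL AND @LAST_PROCESSED = '1900-01-01')
GROUP BY CAST(j.event_time AS DATE), j.user_id;
-- step3: move the pointer forward so the next run skips everything processed so far
UPDATE PIT_METADATA.dbo.PIT_METADATA
SET TIMESTAMP_VALUE = @RUN_TS,
VALUE_LAST_UPDATED_AT = @RUN_TS
WHERE UDF_NAME = 'PARSE_EVENT_LOGS'
AND VARIABLE_NAME = 'LAST_PROCESSED_TS';
END;
GO
EXEC PARSE_EVENT_LOGS;
Each subsequent run now touches only the rows added since the previous run, which is where the time and compute savings come from.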
DYNAMIC COHORT POSTING ANALYSIS
Scenario: For an e-newsletter platform (e.g., Substack), you want to track monthly posting activity rates. The focus is on how user cohorts evolve over time, comparing their posting habits month-to-month.
Challenges:
Dynamically handling growing data as new months are added.
Calculating accurate rates with no manual updates.
Solution:
Group users by cohorts (e.g., month of account creation).
Use a dynamic query to calculate monthly posting rates:
Current Month Posts ÷ Previous Month Posts
Handle nulls and zero values to avoid errors.
Save intermediate results in the PIT Metadata table for reuse.
(Not covered in this post: persisting the rate calculations for closed months.)
Benefits:
No need to rebuild the entire dataset every time.
Scales efficiently with growing data.
Provides clear insights into user engagement trends.
*Of course, further formatting such as color-coding makes it easier for you and your stakeholders to grasp trends faster (e.g., see the top image of this post).
CREATE DATABASE eletters;
USE eletters;
CREATE TABLE posts (
publicationID NVARCHAR(35) NOT NULL,
postID NVARCHAR(10) NOT NULL,
createdAt DATETIME NOT NULL,
post_title NVARCHAR(110),
post_displayed_datetime DATETIME NOT NULL,
views INT NOT NULL,
new_free_subscribers INT NOT NULL,
new_paid_subscribers INT NOT NULL,
likes INT NOT NULL,
shares INT NOT NULL,
comment_counter INT NOT NULL
);
CREATE TABLE publications (
publicationID NVARCHAR(35) NOT NULL,
publication_createdDate DATE NOT NULL,
publication_name NVARCHAR(50) NOT NULL,
publication_deletedDate DATE
);
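If you want to follow along without real data, a handful of hypothetical rows is enough to exercise everything below (all names and numbers here are invented for illustration):
-- hypothetical sample rows, purely so the procedures below have something to work on
INSERT INTO publications (publicationID, publication_createdDate, publication_name, publication_deletedDate)
VALUES ('pub_001', '2021-03-15', 'Morning Notes', NULL),
('pub_002', '2021-04-02', 'Weekly Digest', NULL);
INSERT INTO posts (publicationID, postID, createdAt, post_title, post_displayed_datetime, views, new_free_subscribers, new_paid_subscribers, likes, shares, comment_counter)
VALUES ('pub_001', 'p001', '2021-04-05 08:00', 'April kickoff', '2021-04-05 08:00', 120, 3, 0, 10, 2, 1),
('pub_001', 'p002', '2021-05-03 08:00', 'May update', '2021-05-03 08:00', 150, 4, 1, 12, 3, 2),
('pub_002', 'p003', '2021-05-10 09:00', 'First issue', '2021-05-10 09:00', 80, 2, 0, 5, 1, 0),
('pub_002', 'p004', '2021-06-01 09:00', 'Second issue', '2021-06-01 09:00', 95, 2, 0, 6, 1, 0);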
The code below generates all of the output shown above, recalculating everything from start to finish on every run. This is the classic approach an analyst would usually take.
CREATE OR ALTER PROCEDURE DYNAMIC_RATES AS
BEGIN
-- step0: declaring variables
DECLARE @PIVOT_POST_DATE VARCHAR(MAX);
DECLARE @NULL_TERMINATOR VARCHAR(MAX);
DECLARE @QUERY NVARCHAR(MAX);
-- step1: creating distinct YYYY-MM posting date groups
-- step2: saving the posting dates into two kinds of string variables:
------2.1: PIVOT POST DATE is for the PIVOT function
------2.2: NULL TERMINATOR is to eliminate nulls in the columns (see distinction in output: "0.00" vs. "0")
WITH distinct_posting_dates AS (
SELECT DISTINCT FORMAT(createdAt, 'yyyy-MM') AS POST_DATE
FROM posts
)
SELECT @PIVOT_POST_DATE = STRING_AGG('[' + POST_DATE + ']', ',') WITHIN GROUP (ORDER BY POST_DATE ASC),
@NULL_TERMINATOR = STRING_AGG(', COALESCE([' + POST_DATE + '], ''0'') AS [' + POST_DATE + ']', '') WITHIN GROUP (ORDER BY POST_DATE ASC)
FROM distinct_posting_dates;
-- step3: building the dynamic query.
-------- note how the rates are calculated:
-------- ensuring fraction ( * 1.00),
-------- NULLIF to eliminate division by zero errors
-------- COALESCE to handle post-NULLification
-------- rounding, then casting to a string so LEFT can truncate the fraction length
SET @QUERY = '
WITH posts_posted AS (
SELECT publicationID,
FORMAT(createdAt, ''yyyy-MM'') AS post_date,
COUNT(postID) AS posts_amount
FROM posts
GROUP BY publicationID, FORMAT(createdAt, ''yyyy-MM'')
)
SELECT COHORT ' + @NULL_TERMINATOR + '
FROM (
SELECT
FORMAT(publication_CreatedDate, ''yyyy-MM'') AS COHORT,
post_date AS POST_DATE,
LEFT(
CAST(
ROUND(
COALESCE(
SUM(posts_amount) * 1.00
/ NULLIF(LAG(SUM(posts_amount)) OVER (
PARTITION BY FORMAT(publication_CreatedDate, ''yyyy-MM'')
ORDER BY post_date -- previous month within the same cohort
), 0), 0
),
2
) AS VARCHAR(20)
), 4
) AS POSTING_RATE
FROM publications
JOIN posts_posted
ON publications.publicationID = posts_posted.publicationID
GROUP BY FORMAT(publication_CreatedDate, ''yyyy-MM''), post_date
) AS sourceTable
PIVOT (
MAX(POSTING_RATE)
FOR POST_DATE IN (' + @PIVOT_POST_DATE + ')
) AS pivoted_table
ORDER BY COHORT ASC;
';
-- step4: execute the dynamic query
EXEC sp_executesql @QUERY;
END;
EXEC DYNAMIC_RATES;
This next version shows how the PIT_METADATA table can be leveraged in a different way than in the first example. See the code below:
--create the table
CREATE TABLE eletters.dbo.PIT_METADATA (
UDF_NAME NVARCHAR(1000),
TABLE_NAME NVARCHAR(1000),
VARIABLE_NAME NVARCHAR(1000),
STRING_VALUE NVARCHAR(MAX),
TIMESTAMP_VALUE DATETIME2(3),
DATE_VALUE DATE,
VALUE_LAST_UPDATED_AT DATETIME2(3)
)
;
--create temporary table
CREATE TABLE #temp_metadata (
PIVOT_POST_DATE NVARCHAR(MAX),
NULL_TERMINATOR NVARCHAR(MAX)
);
--populate the table
WITH distinct_posting_dates AS (
SELECT DISTINCT FORMAT(createdAt, 'yyyy-MM') AS POST_DATE
FROM posts
WHERE createdAt < '2021-06-01' --- cut-off so we can run again later and append the next month
)
-- Generate strings for both PIVOT_POST_DATE and NULL_TERMINATOR
INSERT INTO #temp_metadata (PIVOT_POST_DATE, NULL_TERMINATOR)
SELECT
STRING_AGG('[' + POST_DATE + ']', ',') WITHIN GROUP (ORDER BY POST_DATE ASC) AS PIVOT_POST_DATE,
STRING_AGG(', COALESCE([' + POST_DATE + '], ''0'') AS [' + POST_DATE + ']', '') WITHIN GROUP (ORDER BY POST_DATE ASC) AS NULL_TERMINATOR
FROM distinct_posting_dates;
---insert values into PIT metadata
INSERT INTO eletters.dbo.PIT_METADATA (UDF_NAME, VARIABLE_NAME, STRING_VALUE, VALUE_LAST_UPDATED_AT)
SELECT 'DYNAMIC_RATES', 'PIVOT_POST_DATE', PIVOT_POST_DATE, '2021-05-30 10:10:10.111'
--use SYSDATETIME() here in production instead of the hardcoded timestamp
FROM #temp_metadata;
INSERT INTO eletters.dbo.PIT_METADATA (UDF_NAME, VARIABLE_NAME, STRING_VALUE, VALUE_LAST_UPDATED_AT)
SELECT 'DYNAMIC_RATES', 'NULL_TERMINATOR', NULL_TERMINATOR, '2021-05-30 10:10:10.111'
--use SYSDATETIME() here in production instead of the hardcoded timestamp
FROM #temp_metadata;
--drop the temporary table
DROP TABLE #temp_metadata;
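Before creating the recurring procedure, you can sanity-check what was just stored:
-- inspect the two metadata rows the recurring procedure will read
SELECT UDF_NAME, VARIABLE_NAME, STRING_VALUE, VALUE_LAST_UPDATED_AT
FROM eletters.dbo.PIT_METADATA
WHERE UDF_NAME = 'DYNAMIC_RATES';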
CREATE OR ALTER PROCEDURE RECURRING_DYNAMIC_RATES AS
BEGIN
--declaring variables
DECLARE @PIVOT_POST_DATE NVARCHAR(MAX);
DECLARE @NULL_TERMINATOR NVARCHAR(MAX);
DECLARE @QUERY NVARCHAR(MAX);
DECLARE @LAST_UPDATED_AT DATETIME2(3);
--in production the current date should come from SYSDATETIME(), not hardcoded as it is here
DECLARE @CURRENT_DATE DATETIME2(3) = CAST('2021-06-06 06:06:06.006' AS DATETIME2);
--fetch metadata from PIT_METADATA table instead of re-computing it
SELECT
@PIVOT_POST_DATE = STRING_VALUE,
@LAST_UPDATED_AT = VALUE_LAST_UPDATED_AT
FROM eletters.dbo.PIT_METADATA
WHERE UDF_NAME = 'DYNAMIC_RATES'
AND VARIABLE_NAME = 'PIVOT_POST_DATE';
SELECT
@NULL_TERMINATOR = STRING_VALUE
FROM eletters.dbo.PIT_METADATA
WHERE UDF_NAME = 'DYNAMIC_RATES'
AND VARIABLE_NAME = 'NULL_TERMINATOR';
--check if metadata is outdated
--(i.e. checks if need to update the strings because a new month started since last run)
IF FORMAT(@CURRENT_DATE, 'yyyy-MM') > FORMAT(@LAST_UPDATED_AT, 'yyyy-MM')
BEGIN
-- append only the current month (assumes the procedure runs at least once per calendar month)
------PIVOT POST DATE is for the PIVOT function
------NULL TERMINATOR is to eliminate nulls in the columns (see distinction in output: "0.00" vs. "0")
SELECT @PIVOT_POST_DATE = @PIVOT_POST_DATE + ',[' + FORMAT(@CURRENT_DATE, 'yyyy-MM') + ']',
@NULL_TERMINATOR = @NULL_TERMINATOR + ', COALESCE([' + FORMAT(@CURRENT_DATE, 'yyyy-MM') + '], ''0'') AS [' + FORMAT(@CURRENT_DATE, 'yyyy-MM') + ']'
;
--update the metadata in PIT_METADATA table
UPDATE eletters.dbo.PIT_METADATA
SET STRING_VALUE = @PIVOT_POST_DATE,
VALUE_LAST_UPDATED_AT = @CURRENT_DATE
WHERE UDF_NAME = 'DYNAMIC_RATES'
AND VARIABLE_NAME = 'PIVOT_POST_DATE';
UPDATE eletters.dbo.PIT_METADATA
SET STRING_VALUE = @NULL_TERMINATOR,
VALUE_LAST_UPDATED_AT = @CURRENT_DATE
WHERE UDF_NAME = 'DYNAMIC_RATES'
AND VARIABLE_NAME = 'NULL_TERMINATOR';
END
-- the dynamic query:
-------- note how the rates are calculated:
-------- ensuring fraction ( * 1.00),
-------- NULLIF to eliminate division by zero errors
-------- COALESCE to handle post-NULLification
-------- rounding, then casting to a string so LEFT can truncate the fraction length
SET @QUERY = '
WITH posts_posted AS (
SELECT publicationID,
FORMAT(createdAt, ''yyyy-MM'') AS post_date,
COUNT(postID) AS posts_amount
FROM posts
GROUP BY publicationID, FORMAT(createdAt, ''yyyy-MM'')
)
SELECT COHORT ' + @NULL_TERMINATOR + '
FROM (
SELECT
FORMAT(publication_CreatedDate, ''yyyy-MM'') AS COHORT,
post_date AS POST_DATE,
LEFT(
CAST(
ROUND(
COALESCE(
SUM(posts_amount) * 1.00
/ NULLIF(LAG(SUM(posts_amount)) OVER (
PARTITION BY FORMAT(publication_CreatedDate, ''yyyy-MM'')
ORDER BY post_date -- previous month within the same cohort
), 0), 0
),
2
) AS VARCHAR(20)
), 4
) AS POSTING_RATE
FROM publications
JOIN posts_posted
ON publications.publicationID = posts_posted.publicationID
GROUP BY FORMAT(publication_CreatedDate, ''yyyy-MM''), post_date
) AS sourceTable
PIVOT (
MAX(POSTING_RATE)
FOR POST_DATE IN (' + @PIVOT_POST_DATE + ')
) AS pivoted_table
ORDER BY COHORT ASC;
';
-- execute the dynamic query
EXEC sp_executesql @QUERY;
END;
EXEC RECURRING_DYNAMIC_RATES;
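To compare the two approaches on your own data, you can wrap each call in SQL Server's timing statistics; the exact numbers depend on your machine, but the recurring version avoids re-scanning posts just to rebuild the metadata strings:
SET STATISTICS TIME ON; -- prints CPU and elapsed time per statement
EXEC DYNAMIC_RATES; -- rebuilds the PIVOT / COALESCE strings from scratch every run
EXEC RECURRING_DYNAMIC_RATES; -- reads them from PIT_METADATA and only appends a new month
SET STATISTICS TIME OFF;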
The persistence of the computed rates themselves isn’t addressed in this particular code example (though it’s definitely worth doing); that’s something you’d want to coordinate with your manager and data engineers.
Key Takeaways:
Efficiency Matters: Streamlining processes saves time and costs, improving overall performance.
Leverage Metadata: PIT tables store reusable values, reducing redundant computations.
Think Dynamically: Design solutions that adapt to growing data without manual intervention.
By applying these techniques, you’ll not only save costs but also make your workflows more scalable and future-proof. Start small, test often, and refine your approach. Efficiency is the key to unlocking better insights.
That said, coordinate such efforts with your manager and the data engineering team.
¹ Based on my experience in Snowflake with Okta’s Auth0 data.