e-letters marketplace case study
data analysis case study : 7 tables, much data. Improve your analytical, business, product, & critical thinking
The objective of this case study is to provide an as-real-as-possible example for data analysts to hone their analytical thinking, business acumen, product sense, and critical reasoning skills.
Lots of data; 7 tables. Enjoy!
background:
I host my portfolio on Substack, a platform that acts as a marketplace for electronic newsletters. It’s a space where real money can be made for those willing to put in the effort, with exposure being the key to success. So there are clear incentives for some users to cheat the system.
This idea struck me as I browsed Substack, curious about how they decided which publications to show/offer me. It became apparent that three factors played a role: keywords and categorical relevance, publication age, and subscriber count.
I confess—I have a bit of a criminal mind. This sparked the idea for a realistic analysis case study for data analysts working at a similar marketplace. Rest assured, all the data here is fictional and purely for practice—no accusations, no harm, just fun and an opportunity to learn.
2 options:
In here, you’ll get the chance to analyze data. You can either:
Dive right in with 3 focused tasks to solve the case directly.
Opt for a more gradual approach, with thought-provoking questions to build your business, product, and critical thinking skills, along with analysis and visualization tasks to explore and practice further.
The choice is yours.
the data
Table Name: USERS
General idea: table with all users and when they created their user.
Granularity: one record per userID.Table Name: USERPUBLICATIONS
General idea: table connecting between users and their publications.
Granularity: each user may have more than one publication.Table Name: PUBLICATIONS
General idea: table with all publications, when they were created, deletion datetime if applicable.
Granularity: one record per publicationID.Table Name: PUBLICATION_CATEGORIES
General idea: label category assigned to describe each publication.
Granularity: one record per publicationID.Table Name: POSTS
General idea: one record of postID per publicationID, alongside other data related to the post.
Granularity: one record for every postID, per publicationID.Table Name: EMAILED_POSTS_STATS
General idea: Information concerning the emailed post which was published and then sent to subscribers.
Granularity: one record per postID as each postID can be sent/emailed only once.Table Name: SUBSCRIBERS
General idea: dimensional table which records per combination of subscriber and publicationID, details about the subscription.
Granularity: one record per registration of an email to a / per publicationID, noting time and type of subscriber.
The entire folder. All data/documents available in that folder.
CREATE DATABASE eletters;
USE DATABASE eletters;
THE SCENARIO
Option #1: dive right in
Imagine it’s the start of your weekly planning session with your manager. You are given the following tasks to tackle:
assist the Head of Marketing & the Product team
The Head of Marketing wants to launch a blog series about creating successful publications. This is in response to user requests. They want to appear on the company’s website/application among the top-20 recommended publications (details of how the company ranks them is described below).
Thus, the first focus will be on subscriber growth:
Identify what characterizes some publications which attracted and grew their subscriber base more successfully compared to their peers.
Provide insights & examples that the Head of Marketing and the company’s Content Writer can share.
Not unrelated, you are requested, in parallel, to provide improvements to the ranking logic for the top-20 recommended publications. The Product team awaits your ideas.
Review the current methodology for determining the top-20 publications.
Suggest modifications to the ranking system that will better align with the company’s goals.
Briefly explain the rationale behind your recommendations.
the company’s current ranking methodology for top-20 publications
Here’s how the company currently determines the Top-20 Publications which it displays on its homepage both globally and per category:
Eligibility Criteria:
Publications must not be deleted.
Publications must be at least over 90 days old.
Publications must have at least one new post in the last 120 days.
Ranking Logic:
Publications are ranked primarily by their total number of subscribers as of the current date (feel free to assume the current date is 2021-07-01).
In case of a tie, the earliest-created user (publication-owner) is ranked higher.
the CEO’s concerns for marketplace-integrity
The CEO is concerned about fraudulent activity within the marketplace, which damages its reputation and revenue. Your task is to:
Spot the cheater(s). Look for patterns that suggest abuse or cheating.1
Briefly explain what/how you identified and why it constitutes cheating.
ADDITIONAL PRACTICE: ANALYSIS, VISUALIZATIONS, BUSINESS + PRODUCT + CRITICAL THINKING
Option #2: Warm Up / Additional Practice
Lots of data, lots of tables, lots of fun. You have here all you need for plenty of practice.
The key to make the most out of it is not by just technically practicing SQL/Python etc. (though 100% important too), but to always examine each through the perspective of e.g. business / product angle. Think about the unique contribution, if any, to better understanding of the business, the product, of those engaged with it, and then, the actionable recommendations you may provide stakeholders as a data analyst.
Moreover, recall you may open a Substack account yourself to better understand the product if you feel that you got questions you don’t have answers for.
Here are some ideas:
Carefully review the Publications table. Identify the naming policy defect and craft the message how you’d report it.
Create a dashboard which provides visibility into:
The total amount of users.
The total amount of publications.
The distribution of publications per publication category.
How many subscribers there are & paid/free types.
The distribution of subscribers & types per publication category.
How many email-posts were sent.
Calculate and populate the open_rate (emailed_posts_stats table).
Create a dashboard enabling to select a specific publication and compare its stats with the average of its peers (i.e. overall average and average of the publication category of the selected publication).
Who are the top 5% of users with the most subscribers? Free/Paid subscribers?
Are there publication categories which show more engagement (e.g. likes, comments, shares) by their readership compared to others?
Create a View for the current top-20 publications ranking system: global_rank, category_rank, publication_name, publicationID, publication_category (precise details of the company’s current methodology is in Option #1 above).
Ask yourself how the company generates revenue. Find out how Substack does it. For example, do the publishers pay it to host their e-newsletters (is it pay per X hosting-time, per post, et cetera), does it make money by collecting a fee off payments of paid subscribers or through online advertisements? And so on.
Conduct various cohorts tracking metrics & create visualizations / dashboards:
e.g. see tracking publication cohorts and their posting engagement:
track categories’ posting/subscriber(free/paid) etc. trends over time.
track the engagement metric rates (likes, shares, comments) for posts cohorts. Add filtering per selected publication category to the dashboard.
Track the views:subscribers ratio for publications’ posts. Does it help you deduce any useful insights about the users’ ability to convert viewers into subscribers?
Review the Subscribers, Posts, Users, etc. tables. Explain what you would improve and why (It’s key to explain why it matters). Think tracking processes, changes.
Re-read the description of the Posts, Emailed_Posts_Stats, and Subscribers tables, paying close attention to the fields
posts.new_free/paid_subscribers
andemailed_posts_stats.recipients_number
. Identify any gaps in how the data is being tracked.Use SQL to validate your hypothesis about the gap and craft a concise report to share with the data team. In your report:
Clearly state the identified gap.
Propose an engineering solution to address it.
Suggest a data recovery plan for cases where past data cannot be retrieved. Include your SQL code for the recovery process and specify the table where the recovered data should be stored.
side-note: Remember- data and software systems are inherently flexible. Basically anything that has been created, modified, or tracked can be adjusted after the fact. For instance:
A user or publication can be created or deleted.
Timestamps, such as "post_displayed_datetime," can theoretically be manipulated by users. e.g., to create the appearance of an older, "aged" publication for resale. Not a spoiler in this case ;) Though I thought about it ;)
Examine the email domains and their respective types. Evaluate if there are insights worth reporting.
Evaluate correlations between different metrics. Remember that correlation does not necessarily mean causation.
… Feel free to drop further interesting ideas in the comments / drop me a message via LinkedIn!
Once you find it, you can verify it easily in the raw data. It’ll eliminate any doubt you might have. You’ll know where to find that ‘golden evidence’. But worse case scenario, feel free to drop a comment or message via LinkedIn.