Building an SEO Data Pipeline

Data by Malcolm · 10 min read · Feb 1, 2022

In 2020, I left my data-ops role at Atlassian to join an SEO growth firm, Organic Growth Marketing. This decision took some friends, colleagues, and observers by surprise — “you’re leaving an innovative public company to go work for an agency?” But as with most things, there was more to it than meets the eye.

The Chief OG, Nigel, and I shared a vision for how SEO and content analytics could be modernized with best practices. OGM had a great reputation with Tier-1 clients, but the team recognized that the industry’s status quo left much to be desired.

SEO data is notoriously scattered, siloed, and long-tail. Not to mention, spreadsheets and UI-based tools like Google Search Console don’t allow for repeatable and scalable analysis. In other words, time-to-insight was slow and the range of questions the tools could answer was limited.

“The first thing I had to learn in SEO was how to export data from 2–3+ sources, try to bring it together using lookups in a spreadsheet, spend hours figuring out why my formulas didn’t work, and do half-baked analysis. Fortunately, our team doesn’t need to go through this brutal process anymore” — Nigel Stevens (CEO @ OGM)

Adopting Best Practices: Data Pipelines

Google Analytics, Google Search Console, SEMRush, Ahrefs… the list goes on. Each tool has its own data with its own blindspots. Sure, you could export the data manually into spreadsheets, but try and see how long you can keep that up without errors.

God forbid you have a large site with tens of thousands of pages (resulting in millions of rows). These are the exact reasons why top-performing analytics teams have adopted data pipelines as an alternative to the old status quo. A data pipeline allows you to take “raw” data and (continuously) transform it via scripts into polished data that’s ready for insights and reporting.

Here’s how data pipelines add value to SEO:

  1. Un-silo Data From Tools: automatically extract data from various tools and data sources via APIs. Centralize the data in a common warehouse, and apply logic to stitch the different sources together. Blending data enables you to answer questions that weren’t possible before, like: “was the dip in conversions due to a decrease in organic traffic from fewer searches?” Anecdotally, this step is the most painstaking because you are integrating various APIs and blending already-aggregated data (a minimal blending sketch follows the screenshot below).
  2. Filter Unwanted Data: depending on the client we work with, some may have unwanted or messy data. At this stage, we filter out irrelevant data before it hits the next step of our pipeline. In SaaS, consider all the in-product (app) pages that skew the organic-traffic picture. We either want to exclude or label these product pages to separate the noise from the pages that marketing can directly control.
  3. Enrichment and Segmentation: now comes the fun part. Why limit yourself to the metrics and dimensions provided by the original data? Here are a few of the custom metrics and dimensions we created:
  • “branded vs. non-branded” queries, clicks, and impressions
  • page “categories” (e.g., Landing Pages, Content, Brand)
  • page “cohorts” (either based on publish date or optimization date)
  • page and query “topics” (using data science, you can intelligently aggregate long-tail pages and queries into their broader topics)
Screenshot of a unified view between Google Analytics (sessions) and Google Search Console (clicks) segmented by page “category”. Btw, all the demo data is randomly generated.
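To make step 1 concrete, here is a minimal sketch of blending the two sources at the (date, page) grain with pandas. The table shapes, URLs, and column names are illustrative assumptions, not our production schema.

```python
import pandas as pd

# Illustrative daily extracts; in practice these land in the warehouse via the APIs
gsc = pd.DataFrame({
    "date": ["2022-01-31", "2022-01-31"],
    "page": ["https://example.com/blog/seo-tips", "https://example.com/pricing"],
    "clicks": [120, 45],
    "impressions": [4000, 900],
})
ga = pd.DataFrame({
    "date": ["2022-01-31", "2022-01-31"],
    "landing_page": ["https://example.com/blog/seo-tips", "https://example.com/pricing"],
    "sessions": [140, 60],
    "conversions": [3, 5],
})

# Stitch the sources together at the (date, page) grain for a unified view
unified = gsc.merge(
    ga.rename(columns={"landing_page": "page"}),
    on=["date", "page"],
    how="outer",
)
print(unified)
```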

Having access to additional metrics and dimensions will unlock a firehose of new insights.
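Here is a minimal sketch of what the branded/non-branded and page-category segmentation can look like in Python. The brand terms, regex patterns, and column names are placeholders; our production logic lives in SQL inside the pipeline.

```python
import pandas as pd

# Placeholder brand terms and category patterns (not our real config)
BRAND_PATTERN = r"acme|acme crm"
CATEGORY_PATTERNS = {
    "Content": r"/blog/|/guides|/resources",
    "Landing Pages": r"/features/|/pricing",
    "Brand": r"/about|/careers",
}

def segment(df: pd.DataFrame) -> pd.DataFrame:
    """Add 'branded' and 'category' dimensions to raw (page, query) rows."""
    df = df.copy()
    # Branded vs. non-branded queries
    df["branded"] = df["query"].str.contains(BRAND_PATTERN, case=False, regex=True)
    # Page categories from URL-path regexes; anything unmatched stays "Other"
    df["category"] = "Other"
    for category, pattern in CATEGORY_PATTERNS.items():
        df.loc[df["page"].str.contains(pattern, regex=True), "category"] = category
    return df

rows = pd.DataFrame({
    "page": ["https://example.com/blog/seo-tips", "https://example.com/pricing"],
    "query": ["acme pricing", "keyword research tools"],
    "clicks": [12, 40],
})
print(segment(rows))
```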

  4. Rule Automation: think of offensive and defensive tactics you could automatically surface as anomalies or opportunities: easily discover emerging queries that are beginning to rank, content that is decaying in traffic, and optimization opportunities based on SERP and on-page data. You could also offload some of this logic to the dashboarding layer instead of the pipeline itself.
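As one illustration of such a rule, the sketch below flags pages whose recent clicks have fallen well below their earlier trailing average. The thresholds and column names are assumptions, and a production version would also need to account for seasonality, as discussed later.

```python
import pandas as pd

def flag_decaying_pages(daily: pd.DataFrame,
                        recent_days: int = 28,
                        drop_threshold: float = 0.4) -> pd.DataFrame:
    """Flag pages whose average daily clicks over the last `recent_days`
    dropped more than `drop_threshold` below their earlier average.
    Expects columns: date (datetime), page, clicks (illustrative schema)."""
    daily = daily.sort_values("date")
    cutoff = daily["date"].max() - pd.Timedelta(days=recent_days)
    baseline = daily[daily["date"] <= cutoff].groupby("page")["clicks"].mean()
    recent = daily[daily["date"] > cutoff].groupby("page")["clicks"].mean()
    baseline = baseline[baseline > 0]                      # avoid divide-by-zero
    decay = ((baseline - recent) / baseline).dropna()
    flagged = decay[decay > drop_threshold].rename("decay_pct").reset_index()
    return flagged.sort_values("decay_pct", ascending=False)
```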

Dashboarding

The data pipeline is useless unless your marketers can consume the data. Your data team will have to decide on a preferred visualization tool and what dashboards to create. Below are a few dashboards from our playbook that we leverage for client reporting:

  • Detailed Comparison: put simply, we use this dashboard to compare performance from one period to another. This is typically the starting place for drilling down to hypothesize a narrative. One of the most useful views has been bucketing pages and queries based on their performance and dynamically hyperlinking segments for further investigation. The other critical metric is contribution percentage, so we can quickly see how much a given segment has driven the overall growth of a certain metric (e.g., query Z has contributed 30% to the overall growth in clicks MoM); a minimal sketch of this calculation follows the list below.
Screenshot from our “detailed comparison” dashboard that enables us to see hotspots of growth
  • Cohort: since we track when each page is published and every time a page is optimized, we can aggregate publications and optimizations to get the big-picture ROI of our efforts. SEO analysis is rarely segmented this way, so the answer to the question — how did our efforts contribute to the bottom line? — is often a mystery. This dashboard is the secret to every SEO Director’s pay re-negotiation.
This graph helps us understand how content grows over time. It really illustrates the prolonged results you can get from good content and how SEO efforts are often compounding.
  • Content Decay: automatically catch pages that are unnaturally declining in traffic (taking seasonality into account). For a given page, in addition to traffic, we plot the optimization dates so that we can identify whether an optimization hurt traffic and needs to be reverted.
Example of a page where an optimization wound up hurting: traffic is down 60% from its peak.
  • Opportunities: we use this dashboard to identify quick wins and low-hanging fruit. Our NSA-like approach uses all the data we have about a given page to assign an opportunity score for each of its keywords. This is where having the SERP and on-page data pays dividends. We then sort the opportunities by their scores and provide as much contextual info as possible for optimization.
Screenshot from the opportunities dashboard for spotting quick optimization wins
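As referenced in the Detailed Comparison item above, here is a rough sketch of the contribution-percentage math: each segment’s share of the total period-over-period change in a metric. The schema and numbers are made up for illustration.

```python
import pandas as pd

def contribution_pct(df: pd.DataFrame, segment_col: str = "query") -> pd.DataFrame:
    """Share of the total period-over-period change in clicks driven by each segment.
    Expects one row per segment with 'clicks_previous' and 'clicks_current' columns."""
    df = df.copy()
    df["delta"] = df["clicks_current"] - df["clicks_previous"]
    df["contribution_pct"] = df["delta"] / df["delta"].sum() * 100
    return df[[segment_col, "delta", "contribution_pct"]].sort_values(
        "contribution_pct", ascending=False)

mom = pd.DataFrame({
    "query": ["query z", "query y", "query x"],
    "clicks_previous": [100, 400, 500],
    "clicks_current": [160, 410, 530],
})
print(contribution_pct(mom))  # query z drives the largest share of MoM click growth
```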

So far, we’ve covered a high-level strategy for turning SEO data into automated insights. Now let’s address implementation options. We’ll break down implementation components into data sources, tooling/infrastructure, and team.

Data Sources

The secret to success is the richness of the data. In SEO, you need to consider all the levers that can be managed (e.g., on-page and off-page) and the acquisition strategy for that data. Some dimensions that come to mind when evaluating data sources include accuracy, breadth, depth, velocity, and cost. For each of the data silos, we’ve identified our preferred data sources:

  • Rankings and Impressions: this is the most contentious data source in the industry, but we prefer Google Search Console as our source of truth for keyword and page rankings, impressions, and clicks. Google is known to obfuscate its data, but we can live with minor obfuscation given that the data comes straight from the source (a minimal API pull is sketched after this list).
  • Traffic and Conversions: you can’t judge performance on rankings alone; integrating organic traffic and conversions is instrumental to proving ROI. This data source depends entirely on what website tracking tool you already use. Most of our clients rely on Google Analytics. As a bonus, you can build forecasting models to set seasonally-adjusted growth targets (down to the page level).
  • SERPs: this can include, but is not limited to, snippets, SERP types, and competition. SERP data allows us to spot trends and mine optimization opportunities (e.g., “which keywords are we close to winning a featured snippet for?”). Typically, this is raw data scraped from Google SERPs. Instead of scraping it ourselves, we use DataForSEO’s API as a scalable proxy. We also use DataForSEO to estimate keyword search volume.
  • On-Page: imagine having every header, title, link, and paragraph of your site indexed and scanned for optimization opportunities. To enable this, you must crawl and recrawl your site frequently. DataForSEO offers an affordable solution for this, or you can build your own spider microservice as we chose to. As a bonus, you can catalog your content over time to develop an automated version history. Then combine that page-level history with traffic that deviates from forecast to enable automated attribution for each content optimization — more ammunition for SEO Directors to prove their worth.
  • Off-Page Data: last, and perhaps the least of the efforts, are backlinks and authority. This data is useful for tracking individual earned links and aggregate performance. There are many offerings out there, each with its own pricing and claimed accuracy, but we tend to lean on Ahrefs.
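To give a flavor of the extraction side, here is a minimal sketch of a Google Search Console Search Analytics pull at the (page, query) grain for a single day, using google-api-python-client. The property URL and credentials file are placeholders.

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

SITE_URL = "https://www.example.com/"  # placeholder GSC property
creds = service_account.Credentials.from_service_account_file(
    "service-account.json",            # placeholder credentials file
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
gsc = build("searchconsole", "v1", credentials=creds)

# One day of (page, query) rows: clicks, impressions, CTR, and position
response = gsc.searchanalytics().query(
    siteUrl=SITE_URL,
    body={
        "startDate": "2022-01-31",
        "endDate": "2022-01-31",
        "dimensions": ["page", "query"],
        "rowLimit": 25000,  # API max per request; paginate with startRow for more
    },
).execute()

for row in response.get("rows", []):
    page, query = row["keys"]
    print(page, query, row["clicks"], row["impressions"], row["position"])
```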

Infrastructure & Tooling

The tools you choose to power your SEO data pipeline largely depend on your existing data architecture. Here are the various components of our infrastructure:

  • Data Warehouse: you need an analytics-oriented data warehouse to store and query your data. We like BigQuery because it’s serverless and minimizes the networking costs of transporting data from Google APIs.
  • Pipeline Orchestration: to orchestrate our pipeline of table transformations, we use dbt; however, some organizations prefer Airflow.
  • Dashboarding: you likely already have a dashboarding tool like Tableau, but our team relies on a SQL-to-visualization tool like Redash for fast iterations.
  • Pipeline Logic: it’s up to your data analysts, scientists, and engineers, but we write all of our data models and transformation logic in SQL unless a task requires Python.
  • Crawling and AI: gathering all that on-page data is going to require regularly crawling your own site, which likely means Python. Additionally, if you want to do anything data-science-related like topical analysis of keywords and pages, you’ll also need Python.
  • Manual Enrichment: some things cannot be automated. Our team manually labels page “categories” using regex patterns (e.g., /blog/|/guides|/resources to group all “Content” pages). We keep these “config files” as Google Sheets and then ingest them into BigQuery to be stitched into the data pipeline.
  • (JSON) Object Storage: for all the non-tabular data, like the HTML and JSON returned by scraping and SERP APIs, we use an object-storage solution like Google Cloud Storage or S3 to store and stage the data before parsing it into the data pipeline.
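For that staging step, a minimal sketch with the google-cloud-storage client might look like the following; the bucket name and object layout are placeholders.

```python
import json
from google.cloud import storage

def stage_serp_payload(payload: dict, keyword: str, snapshot_date: str) -> str:
    """Stage a raw SERP API response in GCS before it is parsed into the pipeline."""
    client = storage.Client()
    bucket = client.bucket("raw-serp-data")                # placeholder bucket name
    blob_name = f"serp/{snapshot_date}/{keyword}.json"     # placeholder object layout
    blob = bucket.blob(blob_name)
    blob.upload_from_string(json.dumps(payload), content_type="application/json")
    return f"gs://{bucket.name}/{blob_name}"
```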

Team

Call them what you will, but you will need an A-team to execute and creatively navigate the difficult data challenges that arise. Here’s the general makeup of our unit:

  • Data Engineer: an expert in Python, APIs, and data pipelines.
  • Data Analyst: an expert in data modeling, SQL, and dashboarding.
  • SEO Manager: someone who knows what questions to ask of the data.

Anecdotal Experience

“We used to spend many hours researching trends, only to come up with half-baked hypotheses we weren’t super confident in. Now, in 5–10 minutes, we can confidently identify causal factors and give clients a much better understanding of what’s impacting their organic efforts.”

— Nigel Stevens (CEO @ OGM)

Building the SEO data pipeline comes with a significant learning curve, but the effort has been worth it. We’ve repeated and scaled the model for over a dozen SaaS companies. Here are some of the nuanced takeaways that come with the territory:

  • Google Data Is Notorious for “Unreliability”: many of you might be familiar with Google Analytics (GA) sampling; however, we find that sampling isn’t a problem if you query the APIs for each date rather than for a date range (a minimal sketch of this per-date pattern follows this list). GA also lags by 2 days (2.5 days if you have a larger site). Google Search Console (GSC) only returns data that is between 3 and 180 days old, so taking snapshots is important. GSC also obfuscates the metrics, so it’s important to choose the most relevant grain for the analytical question: if a specific analysis looks at keywords and pages, use the (page, query) grain to get the most accurate data; if you need metrics at the keyword level, use the query grain; and if you just need daily stats, use the date grain.
  • Time-To-Development: development time really depends on your expectations and investment. With dedicated engineering and analytics resources, you might be able to get a simple dashboard up in 3 months. Anecdotally, “simple” isn’t good enough and you could be iterating on improvements for years. We are 1.5 years into this project and only getting “started” as we continue to explore.
  • Budgeting: the majority of expenses are going to be talent. Aside from people, you will incur some data expenses. Our BigQuery (and GCP) costs per site can range from $25 to $300+/mo. All the Google API data is free. DataForSEO is relatively inexpensive for procuring search volume, SERP data, and scraping (<$100). Backlink vendors like Ahrefs’ API can add up depending on site size ($500+).
  • Maintenance: code and pipeline maintenance is obvious housekeeping. Equally challenging are the manual tasks like updating the “config” sheets. Recall that we use regex patterns in Google Sheets to match URL paths to different categories. We also maintain Google Sheets tracking every optimization made to a page and every time we publish a new page. We’ve built a cadence around these tasks to keep reporting up to date.
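As mentioned in the first takeaway, querying GA one day at a time keeps reports unsampled. Here is a minimal sketch of that pattern against the Google Analytics Reporting API (v4); the view ID, dates, and credentials file are placeholders.

```python
import pandas as pd
from google.oauth2 import service_account
from googleapiclient.discovery import build

creds = service_account.Credentials.from_service_account_file(
    "service-account.json",                       # placeholder credentials file
    scopes=["https://www.googleapis.com/auth/analytics.readonly"],
)
analytics = build("analyticsreporting", "v4", credentials=creds)

# One request per day keeps each report small enough to stay unsampled
for day in pd.date_range("2022-01-01", "2022-01-07").strftime("%Y-%m-%d"):
    report = analytics.reports().batchGet(body={
        "reportRequests": [{
            "viewId": "123456789",                # placeholder GA view ID
            "dateRanges": [{"startDate": day, "endDate": day}],
            "metrics": [{"expression": "ga:sessions"}],
            "dimensions": [{"name": "ga:landingPagePath"}],
        }]
    }).execute()
    rows = report["reports"][0]["data"].get("rows", [])
    print(day, sum(int(r["metrics"][0]["values"][0]) for r in rows))
```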
A Little Bit of Self-Promotion

Lastly, we are always looking for talent to join our team. Regardless, if you think this project is cool, connect! You never know how we could work together. Feel free to reach out to me via LinkedIn.
