If you’ve spent any meaningful time with GA4, you’ve probably arrived at the same conclusion most data teams reach: the GA4 user interface is not where the real work happens. Between data thresholding, cardinality limits, and a reporting structure that fights you at every turn, BigQuery is the only realistic way to extract genuine value from your analytics data.
The problem is that GA4’s BigQuery export schema is far from being a good data structure for analytics work. It’s nested where it should be flat, missing where it should be complete, and silent where it should be explicit. We spent years working with it, complaining about it, and building workarounds for it. When we designed d8a’s schema, we decided to fix all of it.
Let’s be specific. These aren’t vague grievances; they’re structural problems that cost data teams real time every single week.
Nested and repeated structures everywhere. GA4 stores event parameters as a repeated RECORD field called event_params, where each parameter is a key-value pair buried inside a nested structure. Want the page URL for an event? You can’t just select a column. You have to UNNEST the array and filter by key name. The same applies to user_properties and items. Every query starts with boilerplate that has nothing to do with your actual question.
Here’s what it takes to get a page URL from GA4 BigQuery:
SELECT
(SELECT value.string_value
FROM UNNEST(event_params)
WHERE key = 'page_location') AS page_location
FROM `project.dataset.events_*`
The issue is not line count; it’s complexity. Multiply this pattern by every parameter you need, and your query logic turns into nested extraction boilerplate before you’ve done any real analysis.
session_id is a timestamp, not an identifier. GA4 derives its ga_session_id from a Unix timestamp with one-second precision. The same value can appear for different users when sessions start in the same second. On its own, it’s not a unique identifier and only becomes useful when combined with a user-level identifier.
No real session scope. The BigQuery export is a flat event log with no session-level entity. There are no session rows, no session aggregates, no entry or exit pages, and no session duration column. Want to calculate bounce rate? You first need to build helper query blocks. Want the landing page for each session? You need advanced SQL functions like FIRST_VALUE. Want session duration? You need to subtract the first event timestamp from the last event timestamp for each session. Every session-level question requires you to rebuild the session from scratch.
Timestamp reliability issues. GA4 provides event_timestamp, event_server_timestamp, and event_previous_timestamp, three different time references with no clear guidance on which to trust for chronological ordering. Client-side timestamps can drift, arrive out of order, or reflect throttled background tabs. Server timestamps depend on when the hit was received, not when the action happened. Sorting events reliably within a session is harder than it should be.
Attribution bugs. Missing gclid values are a well-known issue. Paid campaigns that should carry a Google Click ID arrive without one, causing paid traffic to appear as organic. UTM parameters are also sometimes missing. Channel grouping logic is opaque and controlled entirely by Google. You can’t audit it, you can’t override it at the data level, and when it miscategorizes your traffic, your only option is to build your own attribution from scratch.
Consent turns your data into a black hole. When a user doesn’t grant consent, GA4 still collects events but strips all meaningful identifiers. No client_id, no session_id, no user properties. Those events land in BigQuery as orphaned rows, impossible to stitch into sessions and impossible to attribute to any user journey. You see the volume, but you can’t do anything with it. There’s no fallback mechanism to group those events into even basic sessions.
d8a takes the opposite approach to data structure. Every known event parameter recommended by Google, including the full ecommerce specification (purchase, refund, add_to_cart, view_item, and the rest), is flattened into its own dedicated column.
Want the page URL? It’s page_location. Search term? params_search_term. Transaction ID? params_transaction_id. Currency? params_currency. No UNNEST, no subqueries, no key-value lookups.
The same query from before becomes:
SELECT
page_location
FROM events
One column reference with no nested extraction. Works in any SQL client, any BI tool, any notebook. Your analysts don’t need to memorize GA4’s nested structure. They just query columns.
But flattening known parameters doesn’t mean we throw away the rest. Every event also carries a params array column that preserves the complete set of parameters exactly as they were sent. If your implementation includes custom event parameters that d8a doesn’t have a dedicated column for, they’re still in the data. Nothing is lost. You get the convenience of flat columns for standard parameters and the completeness of the full payload for everything else.
d8a computes sessions entirely on the backend. There is no session_start event. There is no client-side session timer. Sessions are formed server-side by stitching events together using the standard client_id from the cookie and, when available, the user_id set by your implementation.
This approach works identically for web tracking and server-to-server tracking. Session logic does not depend on browser behavior or perfect client-side implementation by the developer, which makes sessions more reliable regardless of how events arrive.
For situations where client_id and user_id are unreliable (no consent granted, aggressive cookie blocking, or simply a new visitor with no stored identifiers), d8a uses a feature called Session Stamp. It’s a privacy-friendly fallback identifier built from backend heuristics. The concept borrows from fingerprinting, but with a critical design constraint: Session Stamp has a short TTL window, which means it can only be used to group events into sessions within that window. It cannot be used to build persistent user profiles or track individuals across sessions.
That distinction matters. A fingerprint is a sticky, long-lived identifier that follows users across time. Session Stamp is a short-lived, session-scoping tool that expires quickly and serves exactly one purpose: making sure events that belong together end up in the same session, even when cookies aren’t available.
The practical impact is significant. In GA4, events collected without consent are orphaned: volume without context. In d8a, those same events can still be stitched into meaningful sessions, giving you usable analytics data even in strict consent environments. A session is usually the clearest signal of user intent, and that’s something analytics teams should not lose.
Server-side session computation unlocks something GA4 fundamentally cannot offer: a real session scope baked directly into the database schema.
In GA4 BigQuery, every session-level question requires you to aggregate, window, or self-join across the event log. In d8a, the answer is already a column. Here’s what ships out of the box:
Navigation context. Every event carries its position within the session. session_hit_number gives the sequence position of any event in the session. session_page_number gives the sequence position of the related page_view within the session. previous_page_location and next_page_location give you the pages before and after the current one, along with their titles. You can reconstruct the full user journey without a single window function.
Entry and exit tracking. session_entry_page_location, session_exit_page_location, and their title counterparts are available directly. We also track session_second_page_location and session_second_page_title, which is useful for analyzing what users do immediately after landing. Need to know if an event is on the first or last page of a session? session_is_entry_page and session_is_exit_page are integer flags ready to filter on.
Time on page. time_on_page is calculated automatically from event sequences. No need to subtract timestamps between consecutive page_view events or deal with the “last page has no duration” problem. It’s handled for you.
Session aggregates. Core engagement metrics are pre-calculated at the session level: session_total_page_views, session_unique_page_views, session_total_purchases, session_total_scrolls, and many others. Calculating bounce rate or identifying high-engagement sessions is a WHERE clause, not a subquery.
Behavioral flags. session_is_engaged tells you whether the session had meaningful interaction. session_returning_user flags repeat visitors. session_abandoned_cart identifies sessions with an add_to_cart event but no purchase, so you don’t need to write that logic yourself.
Session metadata. session_duration, session_first_event_time, session_last_event_time, and session_total_events are all available directly. If a session was split because a new UTM campaign parameter arrived mid-session or the user_id changed, session_split_cause records exactly why.
Attribution at session scope. All UTM parameters, click IDs (gclid, fbclid, msclkid, and others), and calculated session_source, session_medium, and session_term are duplicated at the session level. You don’t need to join back to the first event or write a window function to find the landing page’s campaign parameters. They’re already there on every row in the session.
d8a is an ingestion engine for your data warehouse. That means your data lands in infrastructure you control, and we don’t impose arbitrary restrictions on what you can collect or how much of it you keep.
No PII limitations. If your use case requires storing full IP addresses, device identifiers, or other personal data, you can. It’s your system, your infrastructure, your compliance responsibility. We recommend handling personal data consciously and in accordance with applicable regulations, but we don’t make that decision for you. d8a gives you the pipeline; you define the policy.
Higher data limits. All string fields in d8a support up to 8,192 characters. In GA4, the default event parameter value limit is 100 characters, with a few exceptions (page_title up to 300, page_referrer up to 420, and page_location up to 1,000). For many custom parameters, long values still get truncated. In d8a, your data arrives complete. No silent truncation, no lost context.
Pluggable geolocation. d8a ships with a default geolocation provider, but the architecture allows you to plug in any service. If you need higher accuracy for regional analysis, you can use a premium MaxMind license or any other provider that fits your requirements. The choice is yours, not ours.
Traffic attribution in d8a is calculated purely from raw data. Every session gets a session_source, session_medium, and session_term derived from the landing page URL, HTTP referrer, UTM parameters, and ad network click identifiers.
Detection follows a clear, documented priority order: paid click IDs (like gclid, fbclid, msclkid) are checked first, followed by video platforms, email providers, social media, AI assistants, search engines, generic referral, and finally direct traffic. There’s no black box. The logic is deterministic, and the priority order is fully documented.
UTM parameters always override auto-detected values, and they’re applied independently per field. If you only provide utm_source, the detected medium and term remain unchanged. This gives you precise control over attribution without losing automatic detection for the fields you don’t override.
The reference lists that power detection (search engines, social networks, video platforms, email providers, and AI tools) are maintained as open YAML files in the codebase. You can inspect them, extend them, or modify them for your specific needs.
For full details, see the traffic attribution documentation.
The database schema documentation maps every column to its name, type, scope, and description. This documentation is integrated directly with your data warehouse, which means AI agents and copilots can understand the underlying data structure and write queries for you. When your team asks an AI assistant to “show me the top landing pages by engaged sessions last month,” the agent has everything it needs to write correct SQL on the first try.
Clean, structured metadata is not a nice-to-have anymore. It’s the difference between an AI that guesses and one that gets it right.
d8a currently supports the GA4 gtag protocol and its own native protocol. The architecture is ready for additional protocols (Matomo, PostHog, Snowplow, or others) if user demand calls for it. There are no technical limitations preventing that, and when the need arises, you won’t have to rebuild your pipeline.
The same applies to database schema customization. We’re building d8a into a flexible pipeline that adapts to how your business measures success. The foundation is there, and as users push the boundaries of what they need, we’ll keep expanding what the schema can do.
GA4 puts data teams in an impossible position: a user interface too limited for serious analysis, and a BigQuery export schema that demands extensive engineering just to answer basic questions. d8a was designed to resolve that tension. A flat, queryable schema with real session scope, honest attribution, and no arbitrary limitations.