The event schema
Every event answers the same four questions:

| Field | Question | Example |
|---|---|---|
| actor | Who did it? | Jane Smith (canonical identity, not a per-system handle) |
| action | What did they do? | message_sent, document_created, deal_stage_changed |
| entity | To which thing? | Deal "Northwind", document "Contract_v3.docx" |
| timestamp | When? | The source system's time, not ingestion time |
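As a rough illustration, the four fields can be modeled as a small immutable record. This is only a sketch: the `Event` class, the `normalize_document_created` helper, and the raw payload keys (`file_name`, `created_at`) are hypothetical names chosen for the example, not the product's actual types.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    actor: str           # canonical identity, not a per-system handle
    action: str          # e.g. "message_sent", "document_created"
    entity: str          # the thing the action was applied to
    timestamp: datetime  # the source system's time, not ingestion time


def normalize_document_created(raw: dict, canonical_actor: str) -> Event:
    """Map a hypothetical raw document payload onto the four-field schema."""
    return Event(
        actor=canonical_actor,
        action="document_created",
        entity=raw["file_name"],
        # Assumes the source reports an ISO 8601 timestamp with an offset.
        timestamp=datetime.fromisoformat(raw["created_at"]),
    )


raw_record = {
    "file_name": "Contract_v3.docx",
    "created_at": "2024-03-01T09:30:00+00:00",
}
event = normalize_document_created(raw_record, canonical_actor="Jane Smith")
```

The timestamp is taken from the payload rather than from the clock at ingestion, which matches the "source system's time" rule in the table.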
What normalization handles for you
- Deduplication. Raw records are hashed on ingestion; message edits are detected by body hash. Re-syncing unchanged content produces zero new events.
- Edits update in place. An edited message updates its existing event rather than spawning a duplicate; the event history stays clean.
- Deletes are soft. Deletions in the source mark events deleted rather than removing them, because removing them would orphan the provenance chain above.
- Sender classification. Every message event carries a sender class (member/guest/application), so bot noise and guest chatter can be weighed differently from employee activity.
- System-message filtering. Joins, renames, and app-install notices are dropped before they ever become events.
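The deduplication, edit, and delete rules above can be sketched as a small in-memory store. Everything here is an assumption made for illustration: the `EventStore` and `StoredEvent` names, keying events by a `source_id`, and the choice of SHA-256 as the body hash (the hash function actually used is not stated).

```python
import hashlib
from dataclasses import dataclass


@dataclass
class StoredEvent:
    source_id: str
    body_hash: str
    body: str
    deleted: bool = False


class EventStore:
    """Hypothetical store keyed by the source system's record ID."""

    def __init__(self) -> None:
        self.events: dict[str, StoredEvent] = {}

    def ingest(self, source_id: str, body: str) -> str:
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        existing = self.events.get(source_id)
        if existing is None:
            self.events[source_id] = StoredEvent(source_id, digest, body)
            return "created"
        if existing.body_hash == digest:
            # Re-sync of unchanged content: no new event.
            return "unchanged"
        # Edited message: update the existing event in place.
        existing.body_hash = digest
        existing.body = body
        return "updated"

    def mark_deleted(self, source_id: str) -> None:
        # Soft delete: the event stays so provenance links remain valid.
        self.events[source_id].deleted = True


store = EventStore()
results = [
    store.ingest("m1", "Draft attached"),
    store.ingest("m1", "Draft attached"),
    store.ingest("m1", "Draft attached (v2)"),
]
store.mark_deleted("m1")
```

After these calls the store still holds exactly one event for `m1`, flagged as deleted rather than removed.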
Actor identity resolution
The same person is j.smith@acme.com in email, Jane Smith in Teams/Slack, and JS in a document filename. Process extraction is about who does what, so identity resolution is foundational, not cosmetic.
Resolution is deterministic: exact email and name matching merges per-system identities into one canonical actor. When a document's author initials can't be resolved to a known person, a placeholder identity is created and merged later when the evidence arrives.
Deterministic-only matching is a deliberate tradeoff. It can leave two identities unmerged (a maiden name, a nickname), but it never silently merges two different people, which would corrupt every "who does this step" answer downstream. We chose the failure mode you can detect over the one you can't.
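A minimal sketch of deterministic resolution, under stated assumptions: the `IdentityResolver` and `Actor` names are hypothetical, "exact" is taken to mean equal after case folding, and the later merge of a placeholder is modeled as an explicit call made once evidence links it to a known person.

```python
import itertools
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # compare actors by identity, not by field values
class Actor:
    actor_id: int
    emails: set = field(default_factory=set)
    names: set = field(default_factory=set)
    placeholder: bool = False


class IdentityResolver:
    def __init__(self) -> None:
        self.actors: list = []
        self._ids = itertools.count(1)

    def _find(self, email: Optional[str], name: Optional[str]) -> Optional[Actor]:
        # Exact matches only: no fuzzy or similarity scoring.
        for actor in self.actors:
            if email and email.casefold() in actor.emails:
                return actor
            if name and name.casefold() in actor.names:
                return actor
        return None

    def resolve(self, email: Optional[str] = None, name: Optional[str] = None) -> Actor:
        actor = self._find(email, name)
        if actor is None:
            actor = Actor(actor_id=next(self._ids))
            self.actors.append(actor)
        if email:
            actor.emails.add(email.casefold())
        if name:
            actor.names.add(name.casefold())
        return actor

    def placeholder_for(self, initials: str) -> Actor:
        # Unresolvable initials get their own identity until evidence arrives.
        actor = Actor(next(self._ids), names={initials.casefold()}, placeholder=True)
        self.actors.append(actor)
        return actor

    def merge(self, placeholder: Actor, target: Actor) -> Actor:
        target.emails |= placeholder.emails
        target.names |= placeholder.names
        self.actors.remove(placeholder)
        return target


resolver = IdentityResolver()
jane = resolver.resolve(email="j.smith@acme.com", name="Jane Smith")
same = resolver.resolve(name="Jane Smith")    # exact name match: merged
other = resolver.resolve(name="Jane Smyth")   # near miss: kept separate
unknown = resolver.placeholder_for("JS")
resolver.merge(unknown, jane)                 # evidence arrived later
```

The near-miss name stays a separate actor, which is the detectable failure mode: an unmerged duplicate can be spotted and fixed, whereas a wrong merge would go unnoticed.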
Why a common schema, and what it costs
The alternative (reasoning over raw payloads directly) would preserve more per-source nuance. We normalize anyway, because:
- Cross-system processes are the point. "Contract draft circulated by email, discussed in Teams/Slack, signed version filed in SharePoint" is only one process if all three systems emit comparable events.
- Extraction stays source-agnostic. A new connector means writing one normalizer, not retraining every analysis stage.
- Provenance stays walkable. A uniform event layer gives the audit chain one consistent middle hop.
Related
Knowledge
What extraction does with the events this layer produces.
Connectors
Where the raw records come from.