The extraction pipeline never sees a Teams/Slack message, an email payload, or a PDF. It sees events: one common schema, regardless of source. Events are the layer that makes cross-system reasoning possible. A process that spans email, chat, and documents only becomes visible when all three speak the same language.

The event schema

Every event answers the same four questions:
| Field | Question | Example |
| --- | --- | --- |
| actor | Who did it? | Jane Smith (canonical identity, not a per-system handle) |
| action | What did they do? | message_sent, document_created, deal_stage_changed |
| entity | To which thing? | Deal “Northwind”, document “Contract_v3.docx” |
| timestamp | When? | The source system’s time, not ingestion time |
Every event also carries provenance: the IDs of the raw records it was derived from. One raw record can fan out into many events, and every link is recorded.
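To make the schema concrete, here is a minimal sketch of an event record and of one raw record fanning out into two events. The `Event` class, the `normalize_email` function, the raw record layout, and the `document_attached` action name are all assumptions made for illustration; they are not the pipeline's actual types or action vocabulary.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    actor: str                   # canonical identity, not a per-system handle
    action: str                  # e.g. "message_sent"
    entity: str                  # the thing acted on
    timestamp: datetime          # the source system's time, not ingestion time
    provenance: tuple[str, ...]  # IDs of the raw records this event came from


def normalize_email(raw: dict) -> list[Event]:
    """Hypothetical normalizer: one raw email fans out into several events."""
    sent_at = datetime.fromisoformat(raw["sent_at"])
    source = (raw["id"],)
    events = [Event(raw["sender"], "message_sent", raw["thread"], sent_at, source)]
    for filename in raw.get("attachments", []):
        # "document_attached" is an illustrative action name only.
        events.append(Event(raw["sender"], "document_attached", filename, sent_at, source))
    return events


# Illustrative raw record; the field names are assumed, not a real connector payload.
raw_email = {
    "id": "raw-001",
    "sender": "Jane Smith",
    "thread": "Deal Northwind",
    "sent_at": "2024-03-01T09:30:00+00:00",
    "attachments": ["Contract_v3.docx"],
}
events = normalize_email(raw_email)
```

Both resulting events point back to the same raw record ID, which is what keeps the fan-out traceable.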

What normalization handles for you

  • Deduplication. Raw records are hashed on ingestion; message edits are detected by body hash. Re-syncing unchanged content produces zero new events.
  • Edits update in place. An edited message updates its existing event rather than spawning a duplicate; the event history stays clean.
  • Deletes are soft. Deletions in the source mark events deleted rather than removing them, because removing them would orphan the provenance chain above.
  • Sender classification. Every message event carries a sender class (member / guest / application), so bot noise and guest chatter can be weighed differently from employee activity.
  • System-message filtering. Joins, renames, and app-install notices are dropped before they ever become events.
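The deduplication, edit, delete, and system-message rules above can be sketched as a small in-memory store. This is only an illustration under stated assumptions: the `EventStore` class, its method names, the `kind` values, and the choice of SHA-256 are hypothetical, and the real pipeline may hash and store records differently.

```python
import hashlib
from dataclasses import dataclass

# Assumed labels for system notices (joins, renames, app installs).
SYSTEM_KINDS = {"join", "rename", "app_install"}


@dataclass
class MessageEvent:
    message_id: str
    body: str
    body_hash: str
    deleted: bool = False


class EventStore:
    def __init__(self) -> None:
        self.events: dict[str, MessageEvent] = {}

    def upsert(self, message_id: str, body: str, kind: str = "message") -> str:
        """Apply one raw message record and report what happened to its event."""
        if kind in SYSTEM_KINDS:
            return "dropped"  # system notices never become events
        body_hash = hashlib.sha256(body.encode("utf-8")).hexdigest()
        existing = self.events.get(message_id)
        if existing is None:
            self.events[message_id] = MessageEvent(message_id, body, body_hash)
            return "created"
        if existing.body_hash == body_hash:
            return "unchanged"  # re-sync of identical content: no new event
        # Edited message: update the existing event instead of adding a duplicate.
        existing.body = body
        existing.body_hash = body_hash
        return "updated"

    def mark_deleted(self, message_id: str) -> None:
        """Soft delete: keep the event so its provenance links stay valid."""
        event = self.events.get(message_id)
        if event is not None:
            event.deleted = True


store = EventStore()
results = [
    store.upsert("m1", "Draft attached"),
    store.upsert("m1", "Draft attached"),      # re-sync, same body
    store.upsert("m1", "Draft v2 attached"),   # edit
    store.upsert("s1", "Jane joined", kind="join"),
]
store.mark_deleted("m1")
```

After this sequence the store still holds exactly one event for `m1`, flagged as deleted rather than removed.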

Actor identity resolution

The same person is j.smith@acme.com in email, Jane Smith in Teams/Slack, and JS in a document filename. Process extraction is about who does what, so identity resolution is foundational, not cosmetic. Resolution is deterministic: exact email and name matching merges per-system identities into one canonical actor. When a document’s author initials can’t be resolved to a known person, a placeholder identity is created and merged later when the evidence arrives.
Deterministic-only matching is a deliberate tradeoff. It can leave two identities unmerged (a maiden name, a nickname), but it never silently merges two different people, which would corrupt every “who does this step” answer downstream. We chose the failure mode you can detect over the one you can’t.
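A minimal sketch of deterministic resolution, assuming exact matching after trivial normalization (lowercasing and whitespace collapsing, which is an assumption of this sketch, as are the `Actor` and `IdentityResolver` names). Placeholder creation and later merging are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Actor:
    actor_id: int
    emails: set = field(default_factory=set)
    names: set = field(default_factory=set)


class IdentityResolver:
    def __init__(self) -> None:
        self.actors: list[Actor] = []
        self._by_email: dict[str, Actor] = {}
        self._by_name: dict[str, Actor] = {}

    def resolve(self, email: Optional[str] = None, name: Optional[str] = None) -> Actor:
        """Return the canonical actor for a per-system identity, creating one if needed."""
        email_key = email.strip().lower() if email else None
        name_key = " ".join(name.split()).lower() if name else None

        # Exact lookups only: email first, then name. No fuzzy matching.
        actor = self._by_email.get(email_key) if email_key else None
        if actor is None and name_key:
            actor = self._by_name.get(name_key)
        if actor is None:
            actor = Actor(actor_id=len(self.actors) + 1)
            self.actors.append(actor)

        if email_key:
            actor.emails.add(email_key)
            self._by_email[email_key] = actor
        if name_key:
            actor.names.add(name_key)
            # Keep the first owner of a name; this sketch never re-points an index entry.
            self._by_name.setdefault(name_key, actor)
        return actor


resolver = IdentityResolver()
from_email = resolver.resolve(email="j.smith@acme.com", name="Jane Smith")
from_chat = resolver.resolve(name="Jane Smith")
near_miss = resolver.resolve(name="Jane Smyth")  # not an exact match
```

Here `from_email` and `from_chat` are the same actor, while `near_miss` stays separate: the detectable failure (an unmerged identity) rather than the undetectable one (a wrong merge).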

Why a common schema, and what it costs

The alternative (reasoning over raw payloads directly) would preserve more per-source nuance. We normalize anyway, because:
  • Cross-system processes are the point. “Contract draft circulated by email, discussed in Teams/Slack, signed version filed in SharePoint” is only one process if all three systems emit comparable events.
  • Extraction stays source-agnostic. A new connector means writing one normalizer, not retraining every analysis stage.
  • Provenance stays walkable. A uniform event layer gives the audit chain one consistent middle hop.
The cost is real: events omit source-specific richness (formatting, reactions, thread structure beyond reply links). The raw record keeps everything verbatim, so nothing is lost; it is one provenance hop away rather than in the event itself.

Knowledge

What extraction does with the events this layer produces.

Connectors

Where the raw records come from.