Documents are where the richest process knowledge hides, and they are the hardest source to handle well. A project archive isn’t just file contents: the folder structure, the filename conventions, and the version history are all signal. Seyn reads all of it.

What gets extracted, beyond the text

| Signal | What Seyn reads from it |
| --- | --- |
| File contents | DOCX, PDF, XLSX/XLSM, MSG email files, MPP, parsed to text and sheets. |
| Folder path | Client or project name, phase, meeting type, document type. /Acme Corp/Onboarding/Meeting Notes/ is metadata. |
| Filename | Conventions like Project_2024-03-15_JS.docx yield dates and author initials, resolved to actor identities. |
| Version structure | “Old” subfolders and date-ordered files are paired into version histories. What changed between drafts is its own signal. |
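
To make the path and filename signals concrete, here is a minimal sketch of how such metadata could be read from a single path. It is not Seyn's implementation: the regular expression, the `read_path_signals` function, and the returned keys are all assumptions made for illustration, and real archives will use other naming conventions.

```python
import re
from datetime import date
from pathlib import PurePosixPath

# Hypothetical pattern for names like Project_2024-03-15_JS.docx:
# <title>_<ISO date>_<author initials>.<extension>
FILENAME_RE = re.compile(
    r"^(?P<title>.+?)_(?P<date>\d{4}-\d{2}-\d{2})_(?P<initials>[A-Z]{2,3})\.(?P<ext>\w+)$"
)


def read_path_signals(path: str) -> dict:
    """Collect folder and filename metadata from one archive path (illustrative)."""
    p = PurePosixPath(path)
    # Every component except the filename and the root is a folder-level signal.
    folders = [part for part in p.parts[:-1] if part != "/"]
    signals = {
        "folders": folders,
        # An "Old" subfolder hints that the file is a superseded version.
        "in_old_folder": any(f.lower() == "old" for f in folders),
    }
    match = FILENAME_RE.match(p.name)
    if match:
        try:
            signals["date"] = date.fromisoformat(match["date"])
        except ValueError:
            pass  # Looks like a date but is not a valid one; leave it out.
        signals["initials"] = match["initials"]
    return signals
```

Called on `/Acme Corp/Onboarding/Meeting Notes/Project_2024-03-15_JS.docx`, this yields the three folder names, the date 2024-03-15, and the initials `JS`. Resolving initials to an actor identity would need a directory lookup that is outside the scope of this sketch.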

Typed extraction

Parsed documents are routed to one of several purpose-built extractors based on detected document type:
| Extractor | Built for |
| --- | --- |
| Meeting notes | Decisions, attendees, action items, follow-ups. |
| Contracts & agreements | Parties, key terms, obligations, effective and renewal dates. |
| Reports & spreadsheets | Tabular figures, key metrics, and the assumptions behind them. |
| Policies & procedures | Steps, owners, approval requirements, and exceptions. |
| General purpose | Everything else, flagged as such, so an unrecognised document never masquerades as a confidently-parsed one. |

Every extraction is a centrally logged LLM call, so every extracted fact carries full provenance back to the source file.
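
The routing rule above, including the explicit fallback flag, can be sketched as follows. How Seyn detects document type is not described here, so the sketch takes the detected type as an input; the type labels, the `Routing` class, and the `route` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels for the four purpose-built extractors.
TYPED_EXTRACTORS = {"meeting_notes", "contract", "report", "policy"}


@dataclass(frozen=True)
class Routing:
    extractor: str
    is_fallback: bool  # True when the general-purpose extractor was used


def route(detected_type: Optional[str]) -> Routing:
    """Pick an extractor; anything unrecognised is routed and flagged as fallback."""
    if detected_type in TYPED_EXTRACTORS:
        return Routing(extractor=detected_type, is_fallback=False)
    return Routing(extractor="general_purpose", is_fallback=True)
```

Carrying the flag on the routing result, rather than inferring it later, is what lets downstream consumers tell a general-purpose extraction apart from a typed one.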

Processing lifecycle

Each document moves through an explicit state machine, visible in the dashboard’s document monitor. Failed documents can be reprocessed from the dashboard with a bounded retry count. Processing runs with a per-organisation concurrency cap, so one tenant’s bulk upload can’t starve another’s.
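
One common way to build a per-organisation concurrency cap is a semaphore per tenant. The sketch below shows that technique with `asyncio`; it is an illustration under assumptions, not a description of Seyn's workers, and `OrgLimiter`, the cap of 2, and the organisation IDs are invented for the example.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable, Dict


class OrgLimiter:
    """Runs jobs with at most `cap` in flight per organisation (illustrative)."""

    def __init__(self, cap: int) -> None:
        self._cap = cap
        self._semaphores: Dict[str, asyncio.Semaphore] = {}

    def _semaphore(self, org_id: str) -> asyncio.Semaphore:
        # Create each organisation's semaphore lazily, on first use.
        if org_id not in self._semaphores:
            self._semaphores[org_id] = asyncio.Semaphore(self._cap)
        return self._semaphores[org_id]

    async def run(self, org_id: str, job: Callable[[], Awaitable[None]]) -> None:
        # Jobs from one organisation queue behind that organisation's
        # semaphore only, so they never block another organisation's jobs.
        async with self._semaphore(org_id):
            await job()


async def demo() -> Dict[str, int]:
    """Submit six jobs for org-a and one for org-b; return peak concurrency per org."""
    limiter = OrgLimiter(cap=2)
    running: Dict[str, int] = defaultdict(int)
    peak: Dict[str, int] = defaultdict(int)

    def make_job(org_id: str) -> Callable[[], Awaitable[None]]:
        async def job() -> None:
            running[org_id] += 1
            peak[org_id] = max(peak[org_id], running[org_id])
            await asyncio.sleep(0.01)  # Stand-in for document processing.
            running[org_id] -= 1
        return job

    jobs = [limiter.run("org-a", make_job("org-a")) for _ in range(6)]
    jobs.append(limiter.run("org-b", make_job("org-b")))
    await asyncio.gather(*jobs)
    return dict(peak)
```

Running `asyncio.run(demo())` shows org-a never exceeding two concurrent jobs while org-b's single job proceeds without waiting behind org-a's backlog.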

Document Inbox

The Document Inbox is the self-serve path: any member can upload loose files or a ZIP archive (up to 5 GB) straight from the dashboard, with no source-system credentials required. It’s the fastest way to get a real corpus into Seyn: export a project archive, drop the ZIP, watch it process. Uploads go directly to object storage via presigned URLs; ZIP archives are unpacked server-side by a streaming unpacker, then fan out into per-file processing jobs.
Self-serve upload means hostile files are a design assumption. The unpacker enforces per-file and total size caps, entry-count and compression-ratio limits (ZIP bombs), path-traversal and symlink rejection, and refuses encrypted archives outright. The full defense table is on Security.
Operational guardrails:
  • Format whitelist. Only .docx, .pdf, .xlsx, .xlsm, .msg, .mpp are parsed; everything else is skipped with a recorded reason, not guessed at.
  • Watchdog. An unpack with no progress for 30 minutes is failed automatically. No silent zombies.
  • Archive lifecycle. Uploaded ZIPs are auto-purged from storage after 30 days; the extracted records and their provenance remain.
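
The unpacker defenses and the format whitelist can be illustrated with Python's standard `zipfile` module. This is a sketch, not Seyn's unpacker: every numeric limit below is a placeholder, the names `triage_archive` and `ArchiveRejected` are hypothetical, and the split between skipping one entry and rejecting the whole archive is an assumption (only the refusal of encrypted archives is stated above).

```python
import stat
import zipfile
from pathlib import PurePosixPath
from typing import Dict, List, Tuple

# Extensions taken from the format whitelist above.
ALLOWED_EXTENSIONS = {".docx", ".pdf", ".xlsx", ".xlsm", ".msg", ".mpp"}

# Placeholder limits for illustration; the real values are not documented here.
MAX_ENTRIES = 10_000
MAX_FILE_BYTES = 500 * 1024**2
MAX_TOTAL_BYTES = 5 * 1024**3
MAX_COMPRESSION_RATIO = 100


class ArchiveRejected(Exception):
    """The whole archive is refused."""


def triage_archive(zf: zipfile.ZipFile) -> Tuple[List[str], Dict[str, str]]:
    """Return (accepted entry names, {skipped entry name: reason})."""
    infos = zf.infolist()
    if len(infos) > MAX_ENTRIES:
        raise ArchiveRejected("too many entries")

    accepted: List[str] = []
    skipped: Dict[str, str] = {}
    total_bytes = 0
    for info in infos:
        if info.is_dir():
            continue
        if info.flag_bits & 0x1:  # Bit 0 of the general-purpose flags: encrypted.
            raise ArchiveRejected("encrypted archive")

        name = info.filename.replace("\\", "/")
        parts = PurePosixPath(name).parts
        if name.startswith("/") or ".." in parts:
            skipped[info.filename] = "path traversal"
            continue
        # Unix mode bits live in the high 16 bits of external_attr.
        if stat.S_ISLNK(info.external_attr >> 16):
            skipped[info.filename] = "symlink"
            continue
        if PurePosixPath(name).suffix.lower() not in ALLOWED_EXTENSIONS:
            skipped[info.filename] = "extension not in whitelist"
            continue
        if info.file_size > MAX_FILE_BYTES:
            skipped[info.filename] = "file too large"
            continue
        # Header sizes can be forged, so a streaming unpacker should also
        # count the bytes it actually writes; this check is a first pass only.
        if info.file_size / max(info.compress_size, 1) > MAX_COMPRESSION_RATIO:
            skipped[info.filename] = "compression ratio too high"
            continue

        total_bytes += info.file_size
        if total_bytes > MAX_TOTAL_BYTES:
            raise ArchiveRejected("total uncompressed size over cap")
        accepted.append(info.filename)
    return accepted, skipped
```

Recording a reason for every skipped entry, instead of silently dropping it, mirrors the "skipped with a recorded reason" behaviour described in the guardrails.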

Cloud documents

SharePoint-hosted documents ride the same pipeline, with one addition: cloud files can be parsed by an external parsing service with retry and size guards instead of being downloaded for local parsing. Teams meeting transcripts (in testing) also enter as documents.

Common mistakes

| Symptom | Cause | Fix |
| --- | --- | --- |
| Upload succeeds, document stuck in queued | Background workers process per-org with a concurrency cap; large batches drain gradually | Watch the document monitor; counts advance as workers progress |
| A file shows skipped | Extension outside the whitelist, or an archive entry tripped a defense | The skip reason is recorded per file in the monitor |
| Extracted facts look generic for a specialised document | The type detector routed it to the general-purpose extractor | Check the extractor flag on the document; folder and filename conventions improve routing |
| ZIP upload rejected immediately | Encrypted archive, or over 5 GB | Re-export without encryption; split very large archives |

Connectors

The full source catalog, including the cloud document sources.

Events

How parsed documents become events in the common schema.