Documents - Seyn

Documents are where the richest process knowledge hides, and they are the hardest source to handle well. A project archive isn’t just file contents: the folder structure, the filename conventions, and the version history are all signal. Seyn reads all of it.

What gets extracted, beyond the text

Signal	What Seyn reads from it
File contents	DOCX, PDF, XLSX/XLSM, MSG email files, MPP, parsed to text and sheets.
Folder path	Client or project name, phase, meeting type, document type. `/Acme Corp/Onboarding/Meeting Notes/` is metadata.
Filename	Conventions like `Project_2024-03-15_JS.docx` yield dates and author initials, resolved to actor identities.
Version structure	“Old” subfolders and date-ordered files are paired into version histories. What changed between drafts is its own signal.
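As a rough illustration of the folder-path and filename rows above, the sketch below pulls folder names, a date, and author initials out of a path. The regular expression, the `read_path_signals` name, and the returned dictionary shape are assumptions made for this example, not Seyn's actual parser; resolving initials to actor identities is left out.

```python
import re
from datetime import date
from pathlib import PurePosixPath

# Hypothetical pattern for names like "Project_2024-03-15_JS.docx":
# <label>_<ISO date>_<author initials>.<extension>
FILENAME_RE = re.compile(
    r"^(?P<label>.+)_(?P<date>\d{4}-\d{2}-\d{2})_(?P<initials>[A-Z]{2,3})\.(?P<ext>\w+)$"
)


def read_path_signals(path: str) -> dict:
    """Collect metadata carried by the path itself, before any file parsing."""
    p = PurePosixPath(path)
    # Every folder above the file is a candidate for client, phase, or document type.
    folders = [part for part in p.parts[:-1] if part != "/"]
    signals = {"folders": folders, "filename": p.name}

    match = FILENAME_RE.match(p.name)
    if match:
        signals["date"] = date.fromisoformat(match.group("date"))
        signals["initials"] = match.group("initials")
    return signals


signals = read_path_signals(
    "/Acme Corp/Onboarding/Meeting Notes/Project_2024-03-15_JS.docx"
)
```

Filenames that do not follow the convention simply yield no `date` or `initials` key, so a missing signal is never replaced by a guess.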

Typed extraction

Parsed documents are routed to one of several purpose-built extractors based on detected document type:

Extractor	Built for
Meeting notes	Decisions, attendees, action items, follow-ups.
Contracts & agreements	Parties, key terms, obligations, effective and renewal dates.
Reports & spreadsheets	Tabular figures, key metrics, and the assumptions behind them.
Policies & procedures	Steps, owners, approval requirements, and exceptions.
General purpose	Everything else, flagged as such, so an unrecognised document never masquerades as a confidently-parsed one.

Every extraction is a centrally logged LLM call, so every extracted fact carries full provenance back to the source file.
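The routing table above can be sketched as a lookup with an explicit fallback. The type keys (`meeting_notes`, `contract`, and so on), the `Routing` record, and the `route` function are hypothetical names chosen for this example; how Seyn detects the document type is not shown.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical detected-type keys mapped to the extractors listed above.
EXTRACTORS = {
    "meeting_notes": "Meeting notes",
    "contract": "Contracts & agreements",
    "report": "Reports & spreadsheets",
    "policy": "Policies & procedures",
}


@dataclass(frozen=True)
class Routing:
    extractor: str
    is_fallback: bool  # True when the general-purpose extractor was used


def route(detected_type: Optional[str]) -> Routing:
    """Pick a purpose-built extractor, or fall back and say so."""
    name = EXTRACTORS.get(detected_type or "")
    if name is None:
        # Unrecognised types are flagged instead of being forced into a typed extractor.
        return Routing("General purpose", is_fallback=True)
    return Routing(name, is_fallback=False)
```

Carrying the fallback flag on the result is what lets a downstream consumer tell a confidently typed extraction from a general-purpose one.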

Processing lifecycle

Each document moves through an explicit state machine, visible in the dashboard’s document monitor. Failed documents can be reprocessed from the dashboard, up to a bounded retry count. Processing runs under a per-organisation concurrency cap, so one tenant’s bulk upload can’t starve another’s.
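One way to picture the per-organisation cap and the bounded retry is a semaphore per organisation plus a simple retry check. This is a minimal sketch under assumptions: the cap of 2, the limit of 3 retries, the `failed` state label, and every name below are illustrative, and Seyn's real scheduler may work quite differently.

```python
import asyncio

MAX_CONCURRENT_PER_ORG = 2  # hypothetical cap, not a documented value
MAX_RETRIES = 3             # hypothetical retry bound, not a documented value


class OrgLimiter:
    """Hands out one semaphore per organisation so tenants queue independently."""

    def __init__(self, cap: int = MAX_CONCURRENT_PER_ORG):
        self._cap = cap
        self._semaphores = {}

    def for_org(self, org_id: str) -> asyncio.Semaphore:
        if org_id not in self._semaphores:
            self._semaphores[org_id] = asyncio.Semaphore(self._cap)
        return self._semaphores[org_id]


async def process_with_cap(limiter: OrgLimiter, org_id: str, job):
    # Jobs beyond the cap wait here; other organisations are unaffected.
    async with limiter.for_org(org_id):
        return await job()


def can_reprocess(state: str, retries_used: int) -> bool:
    """A failed document may be retried only while retries remain."""
    return state == "failed" and retries_used < MAX_RETRIES
```

Because each organisation waits on its own semaphore, a bulk upload from one tenant only ever queues behind that tenant's own jobs.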

Document Inbox

The Document Inbox is the self-serve path: any member can upload loose files or a ZIP archive (up to 5 GB) straight from the dashboard, with no source-system credentials required. It’s the fastest way to get a real corpus into Seyn: export a project archive, drop the ZIP, watch it process. Uploads go directly to object storage via presigned URLs; ZIP archives are unpacked server-side by a streaming unpacker, then fan out into per-file processing jobs.

Self-serve upload means hostile files are a design assumption. The unpacker enforces per-file and total size caps, applies entry-count and compression-ratio limits (to stop ZIP bombs), rejects path traversal and symlinks, and refuses encrypted archives outright. The full defense table is on the Security page.
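The kinds of checks listed above can be illustrated with Python's standard `zipfile` module. This is a sketch, not Seyn's unpacker: all four limits are hypothetical values, `check_archive` is an invented name, and it trusts the sizes declared in the archive headers, whereas a streaming unpacker would also count bytes as it inflates.

```python
import stat
import zipfile
from pathlib import PurePosixPath
from typing import List

# Hypothetical limits for illustration only.
MAX_ENTRIES = 10_000
MAX_FILE_BYTES = 500 * 1024**2
MAX_TOTAL_BYTES = 5 * 1024**3
MAX_RATIO = 100  # uncompressed size divided by compressed size


def check_archive(zf: zipfile.ZipFile) -> List[str]:
    """Return the reasons to reject an archive; an empty list means it passed."""
    problems = []
    infos = zf.infolist()
    if len(infos) > MAX_ENTRIES:
        problems.append("too many entries")

    total = 0
    for info in infos:
        name = info.filename
        if name.startswith("/") or ".." in PurePosixPath(name).parts:
            problems.append(f"path traversal: {name}")
        if info.flag_bits & 0x1:  # bit 0 of the general-purpose flags marks encryption
            problems.append(f"encrypted entry: {name}")
        if stat.S_ISLNK(info.external_attr >> 16):  # Unix mode lives in the high 16 bits
            problems.append(f"symlink: {name}")
        if info.file_size > MAX_FILE_BYTES:
            problems.append(f"file too large: {name}")
        if info.compress_size and info.file_size / info.compress_size > MAX_RATIO:
            problems.append(f"compression ratio too high: {name}")
        total += info.file_size

    if total > MAX_TOTAL_BYTES:
        problems.append("archive too large when unpacked")
    return problems
```

Returning a list of reasons instead of a single boolean fits the monitor's habit of recording why something was rejected.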

Operational guardrails:

Format whitelist. Only .docx, .pdf, .xlsx, .xlsm, .msg, .mpp are parsed; everything else is skipped with a recorded reason, not guessed at.
Watchdog. An unpack with no progress for 30 minutes is failed automatically. No silent zombies.
Archive lifecycle. Uploaded ZIPs are auto-purged from storage after 30 days; the extracted records and their provenance remain.
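The first two guardrails above reduce to small checks. In the sketch below the extension set and the 30-minute window come from the text; the function names, the wording of the skip reasons, and the case-insensitive extension match are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from pathlib import PurePosixPath
from typing import Optional

ALLOWED_EXTENSIONS = {".docx", ".pdf", ".xlsx", ".xlsm", ".msg", ".mpp"}
WATCHDOG_TIMEOUT = timedelta(minutes=30)


def skip_reason(filename: str) -> Optional[str]:
    """Return why a file is skipped, or None if it should be parsed."""
    ext = PurePosixPath(filename).suffix.lower()  # case-insensitive match is an assumption
    if not ext:
        return "no file extension"
    if ext not in ALLOWED_EXTENSIONS:
        return f"unsupported extension: {ext}"
    return None


def is_stalled(last_progress_at: datetime, now: datetime) -> bool:
    """True once an unpack has made no progress for the watchdog window."""
    return now - last_progress_at >= WATCHDOG_TIMEOUT
```

For example, `skip_reason("photo.png")` returns a reason string that can be stored next to the file, while `skip_reason("Plan.docx")` returns `None` and the file proceeds to parsing.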

Cloud documents

SharePoint-hosted documents ride the same pipeline, with one addition: cloud files can be parsed by an external parsing service with retry and size guards instead of being downloaded for local parsing. Teams meeting transcripts (in testing) also enter as documents.

Common mistakes

Symptom	Cause	Fix
Upload succeeds, document stuck in `queued`	Background workers process per-org with a concurrency cap; large batches drain gradually	Watch the document monitor; counts advance as workers progress
A file shows `skipped`	Extension outside the whitelist, or an archive entry tripped a defense	The skip reason is recorded per file in the monitor
Extracted facts look generic for a specialised document	The type detector routed it to the general-purpose extractor	Check the extractor flag on the document; folder and filename conventions improve routing
ZIP upload rejected immediately	Encrypted archive, or over 5 GB	Re-export without encryption; split very large archives

Related

Connectors

The full source catalog, including the cloud document sources.

Events

How parsed documents become events in the common schema.

