What gets extracted, beyond the text
| Signal | What Seyn reads from it |
|---|---|
| File contents | DOCX, PDF, XLSX/XLSM, MSG email files, MPP, parsed to text and sheets. |
| Folder path | Client or project name, phase, meeting type, document type. /Acme Corp/Onboarding/Meeting Notes/ is metadata. |
| Filename | Conventions like Project_2024-03-15_JS.docx yield dates and author initials, resolved to actor identities. |
| Version structure | "Old" subfolders and date-ordered files are paired into version histories. What changed between drafts is its own signal. |
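
As a rough illustration of how folder and filename signals like those in the table could be read, here is a minimal sketch. It is not Seyn's implementation: the regular expression, the `read_path_signals` function, and the returned field names are assumptions made for this example, and resolving initials to actor identities is left out.

```python
import re
from datetime import date
from pathlib import PurePosixPath

# Hypothetical pattern for names shaped like Project_2024-03-15_JS.docx:
# a free-form label, an ISO date, then 2-3 uppercase author initials.
FILENAME_RE = re.compile(
    r"^(?P<label>.+?)_(?P<date>\d{4}-\d{2}-\d{2})_(?P<initials>[A-Z]{2,3})$"
)


def read_path_signals(path: str) -> dict:
    """Collect folder names plus any date and initials found in the filename."""
    p = PurePosixPath(path)
    signals = {
        # Every folder above the file is kept as candidate metadata.
        "folders": [part for part in p.parts[:-1] if part != "/"],
        "extension": p.suffix.lower(),
        "label": None,
        "date": None,
        "author_initials": None,
    }
    match = FILENAME_RE.match(p.stem)
    if match:
        try:
            signals["date"] = date.fromisoformat(match["date"])
        except ValueError:
            # Looks like a date but is not a real calendar day; leave it unset.
            return signals
        signals["label"] = match["label"]
        signals["author_initials"] = match["initials"]
    return signals
```

Calling `read_path_signals("/Acme Corp/Onboarding/Meeting Notes/Project_2024-03-15_JS.docx")` would yield the three folder names, the date 2024-03-15, and the initials `JS`. A filename that does not follow the convention simply produces no date or initials rather than a guess.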
Typed extraction
Parsed documents are routed to one of several purpose-built extractors based on detected document type:

| Extractor | Built for |
|---|---|
| Meeting notes | Decisions, attendees, action items, follow-ups. |
| Contracts & agreements | Parties, key terms, obligations, effective and renewal dates. |
| Reports & spreadsheets | Tabular figures, key metrics, and the assumptions behind them. |
| Policies & procedures | Steps, owners, approval requirements, and exceptions. |
| General purpose | Everything else, flagged as such, so an unrecognised document never masquerades as a confidently-parsed one. |
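
The routing idea in the table can be sketched as a lookup with an explicitly flagged fallback. Everything named here (the type keys, the `Extraction` record, the stub extractors, and `route`) is hypothetical and only illustrates the pattern; the real detector and extractors are not described in this page.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Extraction:
    extractor: str          # which extractor handled the document
    general_purpose: bool   # True when no specialised extractor matched
    facts: dict


# Stub extractors standing in for the purpose-built ones (assumed names).
def extract_meeting_notes(text: str) -> dict:
    return {"decisions": [], "attendees": [], "action_items": []}


def extract_contract(text: str) -> dict:
    return {"parties": [], "key_terms": [], "renewal_date": None}


def extract_general(text: str) -> dict:
    return {"summary": text[:200]}


EXTRACTORS: Dict[str, Callable[[str], dict]] = {
    "meeting_notes": extract_meeting_notes,
    "contract": extract_contract,
}


def route(detected_type: Optional[str], text: str) -> Extraction:
    """Dispatch on the detected type; unknown types fall back and are flagged."""
    handler = EXTRACTORS.get(detected_type or "")
    if handler is None:
        return Extraction("general_purpose", True, extract_general(text))
    return Extraction(detected_type, False, handler(text))
```

The point of carrying `general_purpose` on the result is that downstream consumers can tell a fallback extraction apart from a specialised one without re-running detection.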
Processing lifecycle
Each document moves through an explicit state machine, visible in the dashboard's document monitor. Failed documents can be reprocessed from the dashboard with a bounded retry count. Processing runs with a per-organisation concurrency cap, so one tenant's bulk upload can't starve another's.
Document Inbox
The Document Inbox is the self-serve path: any member can upload loose files or a ZIP archive (up to 5 GB) straight from the dashboard, with no source-system credentials required. It's the fastest way to get a real corpus into Seyn: export a project archive, drop the ZIP, watch it process. Uploads go directly to object storage via presigned URLs; ZIP archives are unpacked server-side by a streaming unpacker, then fan out into per-file processing jobs. Operational guardrails:
- Format whitelist. Only .docx, .pdf, .xlsx, .xlsm, .msg, and .mpp are parsed; everything else is skipped with a recorded reason, not guessed at.
- Watchdog. An unpack with no progress for 30 minutes is failed automatically. No silent zombies.
- Archive lifecycle. Uploaded ZIPs are auto-purged from storage after 30 days; the extracted records and their provenance remain.
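
The format whitelist and the watchdog above can be sketched in a few lines. This is an illustrative approximation using Python's standard `zipfile` module, not the server-side streaming unpacker; `plan_archive`, `is_stalled`, and the wording of the skip reasons are assumptions.

```python
import zipfile
from datetime import datetime, timedelta
from pathlib import PurePosixPath

# Extensions taken from the whitelist above.
ALLOWED_EXTENSIONS = {".docx", ".pdf", ".xlsx", ".xlsm", ".msg", ".mpp"}

# Stall limit taken from the watchdog rule above.
STALL_LIMIT = timedelta(minutes=30)


def plan_archive(zf: zipfile.ZipFile):
    """Split archive entries into files to parse and files skipped with a reason."""
    to_parse, skipped = [], []
    for info in zf.infolist():
        if info.is_dir():
            continue  # directory entries carry no content of their own
        ext = PurePosixPath(info.filename).suffix.lower()
        if ext in ALLOWED_EXTENSIONS:
            to_parse.append(info.filename)
        else:
            reason = f"unsupported extension: {ext or '(none)'}"
            skipped.append((info.filename, reason))
    return to_parse, skipped


def is_stalled(last_progress_at: datetime, now: datetime) -> bool:
    """True when an unpack has made no progress for longer than the limit."""
    return now - last_progress_at > STALL_LIMIT
```

Running `plan_archive` over a locally exported ZIP before uploading gives a preview of which entries would likely be skipped, assuming the extension check is the only filter applied.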
Cloud documents
SharePoint-hosted documents ride the same pipeline, with one addition: cloud files can be parsed by an external parsing service with retry and size guards instead of being downloaded for local parsing. Teams meeting transcripts (in testing) also enter as documents.
Common mistakes
| Symptom | Cause | Fix |
|---|---|---|
| Upload succeeds, document stuck in queued | Background workers process per-org with a concurrency cap; large batches drain gradually | Watch the document monitor; counts advance as workers progress |
| A file shows skipped | Extension outside the whitelist, or an archive entry tripped a defense | The skip reason is recorded per file in the monitor |
| Extracted facts look generic for a specialised document | The type detector routed it to the general-purpose extractor | Check the extractor flag on the document; folder and filename conventions improve routing |
| ZIP upload rejected immediately | Encrypted archive, or over 5 GB | Re-export without encryption; split very large archives |
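
For the last row, a local pre-flight check can catch both causes before the upload starts. The sketch below is an assumption-laden illustration: `preflight_zip` is a hypothetical helper, and the page does not say whether 5 GB is measured in decimal or binary units, so the limit is a parameter that defaults to 5 × 1024³ bytes. Encryption is detected through bit 0 of each entry's general-purpose flags, which the ZIP format uses to mark encrypted entries.

```python
import os
import zipfile
from typing import BinaryIO, List

# Assumption: "5 GB" read as 5 GiB; pass 5 * 10**9 for a stricter decimal limit.
DEFAULT_MAX_BYTES = 5 * 1024**3


def preflight_zip(source: BinaryIO, max_bytes: int = DEFAULT_MAX_BYTES) -> List[str]:
    """Return a list of likely rejection reasons; an empty list means none found."""
    problems = []

    # Measure the archive by seeking to its end, then rewind for reading.
    size = source.seek(0, os.SEEK_END)
    source.seek(0)
    if size > max_bytes:
        problems.append("archive exceeds size limit")

    try:
        with zipfile.ZipFile(source) as zf:
            # Bit 0 of the general-purpose flags marks an encrypted entry.
            if any(info.flag_bits & 0x1 for info in zf.infolist()):
                problems.append("archive contains encrypted entries")
    except zipfile.BadZipFile:
        problems.append("not a valid ZIP archive")
    return problems
```

A typical call would open the exported archive in binary mode and pass the file object, for example `with open("project-archive.zip", "rb") as fh: print(preflight_zip(fh))`, where the filename is a placeholder.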
Related
Connectors
The full source catalog, including the cloud document sources.
Events
How parsed documents become events in the common schema.