Session Management Evaluation Survey
9-dimensional framework for assessing LLM session management systems.
How to Use This Survey
- For each system being evaluated, score it on all 9 dimensions (0-5 scale)
- Use the guiding questions for each dimension
- Calculate average across all 9 dimensions for overall viability rating
- Use interpretation guide to assess if system meets your needs
Dimension 1: Architecture & Coordination
Definition: How session data is stored, discovered, and related to other sessions.
Scoring Questions
0-1 (Critical gaps):
- Storage format is proprietary binary (not human-readable)
- No fork/branch support at all
- Session discovery only works through proprietary client
- No API for external tools
2-3 (Partial support):
- Sessions stored in JSON but no fork metadata
- Fork detection requires manual linking
- Limited to single session at a time
- API exists but undocumented
4-5 (Production-ready):
- Sessions stored in human-readable format (JSON, JSONL)
- Forks tracked as DAG (Directed Acyclic Graph)
- Session relationships queryable via API
- Offline extraction possible without main client
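At the 4-5 level, fork metadata stored in plain JSONL is enough to reconstruct the session DAG offline, without the main client. A minimal sketch (the `parent_id` and `fork_point` field names are hypothetical, not any particular tool's schema):

```python
import json

# Illustrative JSONL: one session record per line, parent links form a DAG.
jsonl = """\
{"id": "s1", "parent_id": null, "fork_point": null}
{"id": "s2", "parent_id": "s1", "fork_point": 12}
{"id": "s3", "parent_id": "s1", "fork_point": 30}
"""
sessions = [json.loads(line) for line in jsonl.splitlines()]

def children(sessions, session_id):
    """Query the fork DAG: direct children of a session."""
    return [s["id"] for s in sessions if s["parent_id"] == session_id]

def lineage(sessions, session_id):
    """Walk parent links back to the root session."""
    by_id = {s["id"]: s for s in sessions}
    chain = [session_id]
    while by_id[chain[-1]]["parent_id"] is not None:
        chain.append(by_id[chain[-1]]["parent_id"])
    return chain

print(children(sessions, "s1"))  # ['s2', 's3']
print(lineage(sessions, "s2"))   # ['s2', 's1']
```

Because the store is line-oriented JSON, any external tool can answer these queries with a few lines of code.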
Red Flags
- "Our proprietary format is optimized" (= not human-readable)
- "Forks are handled automatically" with no explanation
- No way to access raw data outside the tool
Dimension 2: Memory & State Management
Definition: How context, state, and knowledge are tracked across turns and sessions.
Scoring Questions
0-1 (No real management):
- Token counts missing or unreliable
- No separation of roles (system/user/assistant)
- Memory treated as opaque blob
- No tracking of context window exhaustion
2-3 (Basic management):
- Token counts present but not per-message
- Roles separated but not consistently
- Memory structure exists but limited query
- Context window mentioned but not managed
4-5 (Sophisticated):
- Token counts tracked per message, per role
- Clear separation of tactical (current) vs. strategic (cross-session) memory
- CLAUDE.md or equivalent persistent context parsed and indexed
- Context window exhaustion detected and warned
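Per-message token and role tracking at the 4-5 level can be sketched with ordinary records; the field names and the context-window limit below are illustrative, not any specific tool's schema:

```python
# Hypothetical per-message records carrying role and token counts.
messages = [
    {"role": "system", "tokens": 350},
    {"role": "user", "tokens": 120},
    {"role": "assistant", "tokens": 480},
    {"role": "user", "tokens": 90},
    {"role": "assistant", "tokens": 610},
]

CONTEXT_WINDOW = 2000  # illustrative limit

def tokens_by_role(messages):
    """Aggregate token usage per role (system/user/assistant)."""
    totals = {}
    for m in messages:
        totals[m["role"]] = totals.get(m["role"], 0) + m["tokens"]
    return totals

def remaining_context(messages, window=CONTEXT_WINDOW):
    """How much of the context window is left; warn before exhaustion."""
    return window - sum(m["tokens"] for m in messages)

print(tokens_by_role(messages))     # {'system': 350, 'user': 210, 'assistant': 1090}
print(remaining_context(messages))  # 350
```

With counts at this granularity, exhaustion warnings and per-role breakdowns fall out of simple aggregation rather than requiring the client's cooperation.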
Red Flags
- "Memory just works" (= no visibility into how)
- Thinking tokens are discarded or hidden
- No way to distinguish system prompts from user input
Dimension 3: Session Lifecycle Operations
Definition: What operations are supported on sessions (create, resume, fork, merge, etc.).
Scoring Questions
0-1 (Limited operations):
- Only "create new" and "view" supported
- No fork/branch capability
- No checkpoint/snapshot feature
- Resuming always from latest point
2-3 (Standard operations):
- Create, resume, fork, view supported
- Can resume from specific point in history
- Can manually checkpoint
- No atomic guarantees
4-5 (Advanced):
- Create, resume, fork, merge, archive all supported
- Resume from any message in history
- Automatic checkpoints (e.g., every N tokens)
- Atomic state transitions with rollback
- Pause/unpause capability
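Resuming from an arbitrary message is naturally modeled as a fork: copy history up to that point into a new session that records its parent, leaving the original untouched. A hedged sketch (`fork_at` and the record layout are hypothetical):

```python
import copy
import uuid

def fork_at(session, message_index):
    """Create a new branch containing history up to message_index (inclusive).

    The original session is never mutated, so "resume" cannot overwrite a
    previous attempt; both branches remain comparable side-by-side.
    """
    return {
        "id": str(uuid.uuid4()),
        "parent_id": session["id"],
        "fork_point": message_index,
        "messages": copy.deepcopy(session["messages"][: message_index + 1]),
    }

session = {"id": "root", "parent_id": None,
           "messages": [{"n": i} for i in range(10)]}
branch = fork_at(session, 4)

print(len(branch["messages"]), branch["parent_id"])   # 5 root
print(len(session["messages"]))                       # 10 (original intact)
```

The deep copy is what gives the "never overwrites" guarantee; a real implementation might share immutable message storage instead of copying.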
Red Flags
- "We don't support forks; just create new sessions"
- "Resuming always overwrites the previous attempt"
- No way to compare two branches side-by-side
Dimension 4: Discovery & Retrieval
Definition: Finding sessions and searching within them.
Scoring Questions
0-1 (Discovery is manual):
- Only manual list or file browser
- No search capability
- No filtering by date, directory, status
- Linear scan required to find anything
2-3 (Basic discovery):
- List sessions with some metadata
- Keyword search but not full-text
- Filter by date OR directory (not both)
- No fork tracing UI
4-5 (Powerful discovery):
- List with all metadata (date, directory, tokens, status)
- Full-text search across all messages
- Complex filtering (AND/OR logic)
- Fork tracing and lineage visualization
- Fast on 1000+ sessions
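Full-text search at this level does not require exotic infrastructure: SQLite's FTS5 extension, bundled with most Python builds, already covers it. An illustrative sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 virtual table: full-text index over message bodies.
conn.execute("CREATE VIRTUAL TABLE messages USING fts5(session_id, body)")
conn.executemany(
    "INSERT INTO messages VALUES (?, ?)",
    [("s1", "refactor the auth module"),
     ("s1", "add unit tests for login"),
     ("s2", "discuss database schema migration")],
)

# MATCH uses the full-text index, not a linear scan.
rows = conn.execute(
    "SELECT session_id, body FROM messages WHERE messages MATCH ?", ("auth",)
).fetchall()
print(rows)  # [('s1', 'refactor the auth module')]
```

FTS5 queries also compose with ordinary SQL `WHERE` clauses, which is what makes AND/OR filtering by date, directory, and status cheap to add on top.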
Red Flags
- "Use Ctrl-F in the file browser" (not real search)
- Search works but takes >10 seconds
- Can't filter by multiple criteria simultaneously
Dimension 5: Export & Integration
Definition: Getting data out in usable formats and integrating with other tools.
Scoring Questions
0-1 (Data is trapped):
- No export capability
- Only screenshots or copy-paste allowed
- No API for external access
- Proprietary format prevents reuse
2-3 (Basic export):
- Export to JSON exists
- Markdown export but loses metadata
- No reimport capability
- No tool integration
4-5 (Seamless integration):
- Export to Markdown with full metadata preserved
- Export to JSON/JSONL losslessly
- Reimport and resume from export
- Git integration (can commit sessions)
- MCP (Model Context Protocol) support for agent handoff
- Webhook/API for external processing
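"Lossless" here means an exported session can be reimported and compare equal to the original. A sketch of a JSONL round-trip, using an invented `meta`/`message` record layout:

```python
import json

def export_jsonl(session):
    """Serialize a session (metadata + messages) to JSONL, one object per line."""
    lines = [json.dumps({"type": "meta", **session["meta"]})]
    lines += [json.dumps({"type": "message", **m}) for m in session["messages"]]
    return "\n".join(lines)

def import_jsonl(text):
    """Rebuild the session from its export; the inverse of export_jsonl."""
    records = [json.loads(line) for line in text.splitlines()]
    meta = {k: v for k, v in records[0].items() if k != "type"}
    msgs = [{k: v for k, v in r.items() if k != "type"} for r in records[1:]]
    return {"meta": meta, "messages": msgs}

session = {
    "meta": {"id": "s1", "created": "2025-01-01T00:00:00Z"},
    "messages": [{"role": "user", "body": "hi"},
                 {"role": "assistant", "body": "hello"}],
}
assert import_jsonl(export_jsonl(session)) == session  # round-trip is lossless
```

A round-trip assertion like this makes a good acceptance test when scoring this dimension: if it fails, metadata is being dropped somewhere.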
Red Flags
- "Export is for backup only"
- Metadata lost on export
- No way to use exported data elsewhere
Dimension 6: Performance & Scalability
Definition: How the system handles large numbers of sessions and messages.
Scoring Questions
0-1 (Not scalable):
- Listing 1000 sessions takes >10 seconds
- Search takes >30 seconds
- System becomes sluggish with 100+ sessions
- Full-text search not implemented
2-3 (Acceptable performance):
- List 1000 sessions in 2-5 seconds
- Search across 10K messages in 5-10 seconds
- Some indexing but not comprehensive
- Works with 100s of sessions but not 1000s
4-5 (Optimized):
- List 1000+ sessions in <1 second
- Search 100K+ messages in <5 seconds
- SQLite or equivalent with proper indexes
- Incremental indexing (only new data)
- Caching for frequent queries
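Incremental indexing means re-indexing only rows added since the last run, typically by persisting a high-water mark. A simplified sketch (the naive word-splitting index stands in for a real tokenizer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("CREATE TABLE index_state (last_id INTEGER)")
conn.execute("INSERT INTO index_state VALUES (0)")
conn.execute("CREATE TABLE search_index (msg_id INTEGER, term TEXT)")
conn.execute("CREATE INDEX idx_term ON search_index(term)")

def index_new_messages(conn):
    """Index only rows added since the last run (incremental indexing)."""
    (last_id,) = conn.execute("SELECT last_id FROM index_state").fetchone()
    new = conn.execute(
        "SELECT id, body FROM messages WHERE id > ?", (last_id,)
    ).fetchall()
    for msg_id, body in new:
        conn.executemany(
            "INSERT INTO search_index VALUES (?, ?)",
            [(msg_id, term) for term in body.lower().split()],
        )
    if new:
        conn.execute("UPDATE index_state SET last_id = ?", (new[-1][0],))
    return len(new)

conn.executemany("INSERT INTO messages (body) VALUES (?)",
                 [("hello world",), ("fork the session",)])
print(index_new_messages(conn))  # 2 rows indexed
print(index_new_messages(conn))  # 0 -- nothing new since the high-water mark
```

The high-water mark is why reindexing cost stays proportional to new data rather than total history, which is the property this dimension is probing for.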
Red Flags
- "Performance gets worse with more data"
- Search implemented as linear scan
- No mention of indexing strategy
Dimension 7: Reliability & Durability
Definition: Data integrity, crash recovery, and safety guarantees.
Scoring Questions
0-1 (Risky):
- No backup mechanism
- Crashes may lose recent data
- No ACID compliance
- No recovery documentation
2-3 (Basic durability):
- Automatic backups but frequency unclear
- Some protection against crashes
- Manual recovery procedure exists
- No verification of data integrity
4-5 (Production-grade):
- ACID compliance (atomic writes)
- Automatic backups at regular intervals
- Crash recovery tested and documented
- Data integrity checks (checksums, etc.)
- Versioning/rollback capability
- Clear retention policy
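The standard pattern behind "atomic writes" is write-to-temp-then-rename, which answers the mid-write-crash question directly. A sketch using only the standard library:

```python
import json
import os
import tempfile

def atomic_write_json(path, data):
    """Write JSON to a temp file, then os.replace() it into place.

    os.replace() is atomic on POSIX and Windows, so a crash mid-write
    leaves either the old file or the new file, never a truncated one.
    """
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())  # force bytes to disk before the rename
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise

path = os.path.join(tempfile.gettempdir(), "session.json")
atomic_write_json(path, {"id": "s1", "messages": []})
with open(path) as f:
    print(json.load(f))  # {'id': 's1', 'messages': []}
```

A system that cannot describe an equivalent of this pattern (or SQLite's WAL journaling) probably scores 0-1 here regardless of its backup story.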
Red Flags
- "Just use git" (not a substitute for database safety)
- No mention of what happens if process crashes mid-write
- Backups are manual
Dimension 8: Security & Privacy
Definition: Access control, encryption, and sensitive data handling.
Scoring Questions
0-1 (No security):
- No file permissions enforcement
- No encryption at rest
- All users can read all sessions
- No audit trail
2-3 (Basic security):
- File permissions respected
- Optional encryption
- Per-user access control mentioned
- Limited audit capability
4-5 (Secure):
- File permissions enforced at application level
- Encryption at rest by default
- Per-session access control with roles
- Audit trail (who accessed what when)
- Sensitive data detection and redaction
- Secure deletion option
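Two of these capabilities are cheap to sketch with the standard library on POSIX systems: restrictive file permissions and pattern-based redaction of likely secrets. The patterns below are illustrative, not exhaustive:

```python
import os
import re
import stat
import tempfile

# Illustrative secret patterns (an OpenAI-style key, an AWS access key id).
SECRET_PATTERN = re.compile(r"(sk-[A-Za-z0-9]{8,}|AKIA[0-9A-Z]{16})")

def redact(text):
    """Replace likely API keys with a placeholder before export."""
    return SECRET_PATTERN.sub("[REDACTED]", text)

path = os.path.join(tempfile.gettempdir(), "session.jsonl")
with open(path, "w") as f:
    f.write(redact('{"body": "my key is sk-abcdef123456"}'))
os.chmod(path, 0o600)  # owner read/write only

print(oct(stat.S_IMODE(os.stat(path).st_mode)))  # 0o600
print(open(path).read())                          # key is now [REDACTED]
```

Real redaction needs a broader detector (entropy checks, provider-specific patterns), but even this much prevents the most common leak: a raw API key pasted into a session that later gets exported or committed.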
Red Flags
- "Security isn't relevant for local tools"
- No mention of file permissions
- No way to control who sees which sessions
Dimension 9: Developer Experience
Definition: How easy and pleasant it is to use the tool.
Scoring Questions
0-1 (Poor UX):
- Complex configuration required
- Error messages are cryptic
- No documentation
- Frequent crashes or bugs
2-3 (Acceptable):
- Configuration has reasonable defaults
- Error messages are sometimes helpful
- Basic documentation exists
- Stable but occasional issues
4-5 (Excellent):
- Zero configuration or sensible defaults
- Clear, actionable error messages
- Comprehensive documentation with examples
- IDE integration (VS Code, Cursor plugin)
- CLI is simple and intuitive
- Active community support
Red Flags
- "You have to read the source code to use it"
- "We don't support any IDE integration"
- Error messages like "ERROR: XXXXXX"
Scoring & Interpretation
Calculation
For each dimension, assign a 0-5 score based on which capability tier the system matches and how many red flags apply.
Overall Viability Score = (D1 + D2 + D3 + D4 + D5 + D6 + D7 + D8 + D9) / 9
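The calculation is a plain arithmetic mean. As a sanity check, the nine scores from Example 1 below average to its stated 2.7:

```python
# Scores from Example 1 (claude-conversation-extractor).
scores = {
    "architecture": 3, "memory": 2, "lifecycle": 1,
    "discovery": 2, "export": 4, "performance": 4,
    "reliability": 3, "security": 2, "developer": 3,
}
overall = sum(scores.values()) / len(scores)
print(round(overall, 1))  # 2.7
```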
Viability Interpretation
| Score | Rating | Decision |
|---|---|---|
| 0.0 to <1.5 | Unusable | Fundamental gaps. Do not use. |
| 1.5 to <2.5 | Problematic | Critical features missing. Significant work needed. |
| 2.5 to <3.5 | Viable | Usable but has gaps. Can be improved. |
| 3.5 to <4.5 | Good | Production-ready. Minor improvements possible. |
| 4.5-5.0 | Excellent | Best-in-class. Ready for team rollout. |
Comparison Template
Use this template to evaluate multiple systems side-by-side.
```
System A: _____________ | System B: _____________ | System C: _____________

Architecture & Coordination
Score: ___/5            | Score: ___/5            | Score: ___/5
Notes: ________________ | Notes: ________________ | Notes: ________________

Memory & State Management
Score: ___/5            | Score: ___/5            | Score: ___/5
Notes: ________________ | Notes: ________________ | Notes: ________________

Session Lifecycle Operations
Score: ___/5            | Score: ___/5            | Score: ___/5
Notes: ________________ | Notes: ________________ | Notes: ________________

Discovery & Retrieval
Score: ___/5            | Score: ___/5            | Score: ___/5
Notes: ________________ | Notes: ________________ | Notes: ________________

Export & Integration
Score: ___/5            | Score: ___/5            | Score: ___/5
Notes: ________________ | Notes: ________________ | Notes: ________________

Performance & Scalability
Score: ___/5            | Score: ___/5            | Score: ___/5
Notes: ________________ | Notes: ________________ | Notes: ________________

Reliability & Durability
Score: ___/5            | Score: ___/5            | Score: ___/5
Notes: ________________ | Notes: ________________ | Notes: ________________

Security & Privacy
Score: ___/5            | Score: ___/5            | Score: ___/5
Notes: ________________ | Notes: ________________ | Notes: ________________

Developer Experience
Score: ___/5            | Score: ___/5            | Score: ___/5
Notes: ________________ | Notes: ________________ | Notes: ________________

OVERALL SCORE: ___/5    | OVERALL SCORE: ___/5    | OVERALL SCORE: ___/5

Recommendation: _________________________________
```
Examples
Example 1: Existing "claude-conversation-extractor" Tool
| Dimension | Score | Notes |
|---|---|---|
| Architecture | 3 | JSON output, but no fork tracking |
| Memory | 2 | Exports content but no token/role separation |
| Lifecycle | 1 | Export only; no fork/resume support |
| Discovery | 2 | Search works but limited filtering |
| Export | 4 | Markdown export with metadata |
| Performance | 4 | Handles large exports well |
| Reliability | 3 | Straightforward extraction, unlikely to corrupt |
| Security | 2 | No access control; trusts filesystem |
| Developer | 3 | Pip installable; basic docs |
| OVERALL | 2.7 | Viable for one-time export, not session management |
Example 2: Proposed "claude-session-manager" Package
| Dimension | Score | Notes |
|---|---|---|
| Architecture | 5 | DAG fork tracking, API-first design |
| Memory | 5 | Per-message tokens, role separation, tactical/strategic memory |
| Lifecycle | 4 | Create, resume, fork, archive; merge TBD |
| Discovery | 5 | Full-text search, complex filtering, fork visualization |
| Export | 5 | Markdown + JSON + reimport + git integration |
| Performance | 4 | SQLite indexes, search on 100K+ messages |
| Reliability | 4 | ACID compliance, automatic backups, tested recovery |
| Security | 3 | File permissions, audit trail; encryption TBD |
| Developer | 4 | CLI tool, Python library, comprehensive docs |
| OVERALL | 4.4 | Good - production-ready with minor enhancements |
Decision Framework
If evaluating existing tools:
- Score < 2.5: Not worth integrating; build custom solution
- Score 2.5-3.5: Integrate but plan enhancements
- Score > 3.5: Adopt with confidence
If building new system:
- Target score: 4.0+
- Focus first on Dimensions 1, 4, 5 (data structure, discovery, export)
- Dimensions 6-8 can be deferred until users accumulate hundreds of sessions
If multi-agent coordination is required:
- Dimension 5 (Export & Integration, especially MCP) becomes critical
- Dimension 2 (Memory management) becomes critical for context preservation
- Must support agent-to-agent session handoff