Session Management Evaluation Survey

IAIP Research

9-dimensional framework for assessing LLM session management systems.


How to Use This Survey

  1. For each system being evaluated, score it on all 9 dimensions (0-5 scale)
  2. Use the guiding questions for each dimension
  3. Calculate the average across all 9 dimensions to get an overall viability rating
  4. Use the interpretation guide to assess whether the system meets your needs

Dimension 1: Architecture & Coordination

Definition: How session data is stored, discovered, and related to other sessions.

Scoring Questions

0-1 (Critical gaps):

  • Storage format is proprietary binary (not human-readable)
  • No fork/branch support at all
  • Session discovery only works through proprietary client
  • No API for external tools

2-3 (Partial support):

  • Sessions stored in JSON but no fork metadata
  • Fork detection requires manual linking
  • Limited to single session at a time
  • API exists but undocumented

4-5 (Production-ready):

  • Sessions stored in human-readable format (JSON, JSONL)
  • Forks tracked as DAG (Directed Acyclic Graph)
  • Session relationships queryable via API
  • Offline extraction possible without main client
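
As an illustration of what "human-readable storage with queryable fork relationships" can look like, here is a minimal Python sketch. It assumes a hypothetical layout in which each session is a JSONL file whose first record carries `session_id` and an optional `parent_session_id`; under that assumption, the fork DAG can be rebuilt offline, without the main client.

```
import json
from collections import defaultdict
from pathlib import Path

def build_fork_dag(sessions_dir: str) -> dict[str, list[str]]:
    """Map each parent session ID to the sessions forked from it.

    Assumes each *.jsonl file starts with a metadata record containing
    'session_id' and an optional 'parent_session_id' (hypothetical schema).
    """
    children = defaultdict(list)
    for path in Path(sessions_dir).glob("*.jsonl"):
        with path.open() as f:
            meta = json.loads(f.readline())  # first record = session metadata
        parent = meta.get("parent_session_id")
        if parent:
            children[parent].append(meta["session_id"])
    return dict(children)

# Usage: print every fork point and its branches.
# for parent, forks in build_fork_dag("./sessions").items():
#     print(parent, "->", forks)
```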

Red Flags

  • "Our proprietary format is optimized" (= not human-readable)
  • "Forks are handled automatically" with no explanation
  • No way to access raw data outside the tool

Dimension 2: Memory & State Management

Definition: How context, state, and knowledge are tracked across turns and sessions.

Scoring Questions

0-1 (No real management):

  • Token counts missing or unreliable
  • No separation of roles (system/user/assistant)
  • Memory treated as opaque blob
  • No tracking of context window exhaustion

2-3 (Basic management):

  • Token counts present but not per-message
  • Roles separated but not consistently
  • Memory structure exists but limited query
  • Context window mentioned but not managed

4-5 (Sophisticated):

  • Token counts tracked per message, per role
  • Clear separation of tactical (current) vs. strategic (cross-session) memory
  • CLAUDE.md or equivalent persistent context parsed and indexed
  • Context window exhaustion detected and warned
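
A minimal sketch of per-message, per-role token accounting with an exhaustion warning. The message schema (`role` and `token_count` keys) and the 200K-token window are assumptions made for illustration, not any particular tool's format.

```
from collections import Counter

def summarize_context(messages, context_window=200_000, warn_ratio=0.9):
    """Aggregate token usage per role and flag approaching exhaustion.

    Assumes each message is a dict with 'role' and 'token_count' keys
    (hypothetical schema); real systems may report tokens differently.
    """
    per_role = Counter()
    for msg in messages:
        per_role[msg["role"]] += msg.get("token_count", 0)
    total = sum(per_role.values())
    return {
        "tokens_by_role": dict(per_role),
        "tokens_total": total,
        "near_exhaustion": total >= warn_ratio * context_window,
    }
```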

Red Flags

  • "Memory just works" (= no visibility into how)
  • Thinking tokens are discarded or hidden
  • No way to distinguish system prompts from user input

Dimension 3: Session Lifecycle Operations

Definition: What operations are supported on sessions (create, resume, fork, merge, etc.).

Scoring Questions

0-1 (Limited operations):

  • Only "create new" and "view" supported
  • No fork/branch capability
  • No checkpoint/snapshot feature
  • Resuming always from latest point

2-3 (Standard operations):

  • Create, resume, fork, view supported
  • Can resume from specific point in history
  • Can manually checkpoint
  • No atomic guarantees

4-5 (Advanced):

  • Create, resume, fork, merge, archive all supported
  • Resume from any message in history
  • Automatic checkpoints (e.g., every N tokens)
  • Atomic state transitions with rollback
  • Pause/unpause capability
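
A sketch of a fork operation with an atomic state transition, assuming the same hypothetical one-JSONL-file-per-session layout as above: the new branch is written to a temporary file and renamed into place, so a crash mid-write cannot leave a half-written session behind.

```
import json
import os
import uuid
from pathlib import Path

def fork_session(sessions_dir: str, session_id: str, upto_index: int) -> str:
    """Fork a session at a given message index (hypothetical JSONL layout).

    Record 0 is session metadata; records 1..N are messages. The fork copies
    metadata plus messages up to and including upto_index into a new file.
    """
    src = Path(sessions_dir) / f"{session_id}.jsonl"
    records = [json.loads(line) for line in src.open()]
    new_id = uuid.uuid4().hex
    meta = {**records[0], "session_id": new_id,
            "parent_session_id": session_id, "forked_at_index": upto_index}
    dst = Path(sessions_dir) / f"{new_id}.jsonl"
    tmp = dst.with_suffix(".jsonl.tmp")
    with tmp.open("w") as f:
        for rec in [meta] + records[1:upto_index + 1]:
            f.write(json.dumps(rec) + "\n")
    os.replace(tmp, dst)  # atomic rename: the fork either exists fully or not at all
    return new_id
```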

Red Flags

  • "We don't support forks; just create new sessions"
  • "Resuming always overwrites the previous attempt"
  • No way to compare two branches side-by-side

Dimension 4: Discovery & Retrieval

Definition: Finding sessions and searching within them.

Scoring Questions

0-1 (Discovery is manual):

  • Only manual list or file browser
  • No search capability
  • No filtering by date, directory, status
  • Linear scan required to find anything

2-3 (Basic discovery):

  • List sessions with some metadata
  • Keyword search but not full-text
  • Filter by date OR directory (not both)
  • No fork tracing UI

4-5 (Powerful discovery):

  • List with all metadata (date, directory, tokens, status)
  • Full-text search across all messages
  • Complex filtering (AND/OR logic)
  • Fork tracing and lineage visualization
  • Fast on 1000+ sessions
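
One common way to reach this level is an SQLite FTS5 mirror of the message store. The sketch below assumes a hypothetical `messages` FTS5 table; the point is that full-text search and metadata filters compose in a single indexed query rather than a linear scan.

```
import sqlite3

def search_messages(db_path: str, query: str, project_dir=None):
    """Full-text search over a hypothetical 'messages' FTS5 table.

    Assumed schema: CREATE VIRTUAL TABLE messages USING fts5(
        session_id UNINDEXED, project_dir UNINDEXED, role UNINDEXED, content);
    """
    conn = sqlite3.connect(db_path)
    sql = ("SELECT session_id, snippet(messages, 3, '[', ']', '…', 12) "
           "FROM messages WHERE messages MATCH ?")
    params = [query]
    if project_dir:
        sql += " AND project_dir = ?"  # combine FTS with metadata filtering
        params.append(project_dir)
    rows = conn.execute(sql + " ORDER BY rank LIMIT 20", params).fetchall()
    conn.close()
    return rows
```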

Red Flags

  • "Use Ctrl-F in the file browser" (not real search)
  • Search works but takes >10 seconds
  • Can't filter by multiple criteria simultaneously

Dimension 5: Export & Integration

Definition: Getting data out in usable formats and integrating with other tools.

Scoring Questions

0-1 (Data is trapped):

  • No export capability
  • Only screenshots or copy-paste allowed
  • No API for external access
  • Proprietary format prevents reuse

2-3 (Basic export):

  • Export to JSON exists
  • Markdown export but loses metadata
  • No reimport capability
  • No tool integration

4-5 (Seamless integration):

  • Export to Markdown with full metadata preserved
  • Export to JSON/JSONL losslessly
  • Reimport and resume from export
  • Git integration (can commit sessions)
  • MCP (Model Context Protocol) support for agent handoff
  • Webhook/API for external processing
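
A sketch of "Markdown export with full metadata preserved": the session's metadata travels in a machine-readable front-matter block so the export can later be reimported. The `session` dict shape is a hypothetical example, not a specific tool's schema.

```
import json

def export_markdown(session: dict) -> str:
    """Render a session as Markdown with metadata kept machine-readable.

    'session' is a hypothetical dict: {'session_id', 'created_at',
    'parent_session_id', 'messages': [{'role', 'content', 'token_count'}]}.
    Keeping metadata in the front-matter block is what makes reimport possible.
    """
    meta = {k: v for k, v in session.items() if k != "messages"}
    lines = ["---", json.dumps(meta, indent=2), "---", ""]
    for msg in session["messages"]:
        lines.append(f"## {msg['role']} ({msg.get('token_count', 0)} tokens)")
        lines.append(msg["content"])
        lines.append("")
    return "\n".join(lines)
```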

Red Flags

  • "Export is for backup only"
  • Metadata lost on export
  • No way to use exported data elsewhere

Dimension 6: Performance & Scalability

Definition: How the system handles large numbers of sessions and messages.

Scoring Questions

0-1 (Not scalable):

  • Listing 1000 sessions takes >10 seconds
  • Search takes >30 seconds
  • System becomes sluggish with 100+ sessions
  • Full-text search not implemented

2-3 (Acceptable performance):

  • List 1000 sessions in 2-5 seconds
  • Search across 10K messages in 5-10 seconds
  • Some indexing but not comprehensive
  • Works with 100s of sessions but not 1000s

4-5 (Optimized):

  • List 1000+ sessions in <1 second
  • Search 100K+ messages in <5 seconds
  • SQLite or equivalent with proper indexes
  • Incremental indexing (only new data)
  • Caching for frequent queries
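
The gap between the 2-3 and 4-5 bands is usually indexing discipline rather than raw speed. A sketch, assuming a hypothetical SQLite schema with `sessions`, `messages`, and `messages_fts` tables:

```
import sqlite3

def ensure_indexes(conn: sqlite3.Connection) -> None:
    """Create the indexes that keep listing and filtering fast (hypothetical schema)."""
    conn.executescript("""
        CREATE INDEX IF NOT EXISTS idx_sessions_updated ON sessions(updated_at);
        CREATE INDEX IF NOT EXISTS idx_sessions_dir     ON sessions(project_dir);
        CREATE INDEX IF NOT EXISTS idx_messages_session ON messages(session_id);
    """)

def incremental_index(conn: sqlite3.Connection, new_messages: list[dict]) -> None:
    """Index only records added since the last run instead of rebuilding."""
    conn.executemany(
        "INSERT INTO messages_fts(session_id, content) VALUES (:session_id, :content)",
        new_messages,
    )
    conn.commit()
```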

Red Flags

  • "Performance gets worse with more data"
  • Search implemented as linear scan
  • No mention of indexing strategy

Dimension 7: Reliability & Durability

Definition: Data integrity, crash recovery, and safety guarantees.

Scoring Questions

0-1 (Risky):

  • No backup mechanism
  • Crashes may lose recent data
  • No ACID compliance
  • No recovery documentation

2-3 (Basic durability):

  • Automatic backups but frequency unclear
  • Some protection against crashes
  • Manual recovery procedure exists
  • No verification of data integrity

4-5 (Production-grade):

  • ACID compliance (atomic writes)
  • Automatic backups at regular intervals
  • Crash recovery tested and documented
  • Data integrity checks (checksums, etc.)
  • Versioning/rollback capability
  • Clear retention policy
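
A sketch of one integrity-check building block: storing a SHA-256 digest alongside each session file so silent corruption is detectable on read. The file layout is an illustrative assumption.

```
import hashlib
from pathlib import Path

def write_with_checksum(path: str, data: bytes) -> None:
    """Store a SHA-256 digest next to the file so corruption is detectable."""
    p = Path(path)
    p.write_bytes(data)
    p.with_suffix(p.suffix + ".sha256").write_text(hashlib.sha256(data).hexdigest())

def verify(path: str) -> bool:
    """Return True only if the file still matches its recorded checksum."""
    p = Path(path)
    expected = p.with_suffix(p.suffix + ".sha256").read_text().strip()
    return hashlib.sha256(p.read_bytes()).hexdigest() == expected
```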

Red Flags

  • "Just use git" (not a substitute for database safety)
  • No mention of what happens if process crashes mid-write
  • Backups are manual

Dimension 8: Security & Privacy

Definition: Access control, encryption, and sensitive data handling.

Scoring Questions

0-1 (No security):

  • No file permissions enforcement
  • No encryption at rest
  • All users can read all sessions
  • No audit trail

2-3 (Basic security):

  • File permissions respected
  • Optional encryption
  • Per-user access control mentioned
  • Limited audit capability

4-5 (Secure):

  • File permissions enforced at application level
  • Encryption at rest by default
  • Per-session access control with roles
  • Audit trail (who accessed what when)
  • Sensitive data detection and redaction
  • Secure deletion option
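
A sketch of two of these controls for a local, file-backed store: restricting permissions to the owning user and redacting secret-looking strings before export. The regex patterns are deliberately crude placeholders; real redaction needs a maintained rule set.

```
import os
import re

# Hypothetical patterns for illustration only.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),                  # API-key-like tokens
    re.compile(r"(?i)aws_secret_access_key\s*=\s*\S+"),  # credential assignments
]

def harden_session_file(path: str) -> None:
    """Restrict a session file to the owning user (POSIX permissions 0600)."""
    os.chmod(path, 0o600)

def redact(text: str) -> str:
    """Replace anything matching a known secret pattern before export."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```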

Red Flags

  • "Security isn't relevant for local tools"
  • No mention of file permissions
  • No way to control who sees which sessions

Dimension 9: Developer Experience

Definition: How easy and pleasant it is to use the tool.

Scoring Questions

0-1 (Poor UX):

  • Complex configuration required
  • Error messages are cryptic
  • No documentation
  • Frequent crashes or bugs

2-3 (Acceptable):

  • Configuration has reasonable defaults
  • Error messages are sometimes helpful
  • Basic documentation exists
  • Stable but occasional issues

4-5 (Excellent):

  • Zero configuration or sensible defaults
  • Clear, actionable error messages
  • Comprehensive documentation with examples
  • IDE integration (VS Code, Cursor plugin)
  • CLI is simple and intuitive
  • Active community support
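
For a sense of what a "simple and intuitive CLI" can mean in practice, a hypothetical sketch using only the standard library: explicit subcommands, built-in help, and actionable failure messages instead of tracebacks. The `session-mgr` command name is invented for illustration.

```
import argparse
import sys

def main() -> None:
    """A hypothetical CLI shape: explicit subcommands, helpful errors."""
    parser = argparse.ArgumentParser(
        prog="session-mgr", description="List, search, and fork sessions.")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("list", help="List sessions with metadata")
    search = sub.add_parser("search", help="Full-text search across sessions")
    search.add_argument("query", help="Search terms")
    args = parser.parse_args()
    # Actionable failure instead of a cryptic traceback.
    if args.command == "search" and not args.query.strip():
        sys.exit('search: query is empty; try `session-mgr search "refactor plan"`')

if __name__ == "__main__":
    main()
```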

Red Flags

  • "You have to read the source code to use it"
  • "We don't support any IDE integration"
  • Error messages like "ERROR: XXXXXX"

Scoring & Interpretation

Calculation

For each dimension, assign a 0-5 score based on how well the system matches the listed capabilities and guiding questions, and how many red flags it triggers.

Overall Viability Score = (D1 + D2 + D3 + D4 + D5 + D6 + D7 + D8 + D9) / 9
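
A small helper that applies this formula and maps the result to the rating bands defined in the interpretation table below (band boundaries treated as half-open intervals, an assumption where the table is ambiguous):

```
def overall_viability(scores: dict[str, float]) -> tuple[float, str]:
    """Average the nine dimension scores and map the result to a rating band."""
    if len(scores) != 9:
        raise ValueError("Expected scores for all 9 dimensions")
    avg = sum(scores.values()) / 9
    bands = [(1.5, "Unusable"), (2.5, "Problematic"), (3.5, "Viable"),
             (4.5, "Good"), (5.01, "Excellent")]
    rating = next(name for bound, name in bands if avg < bound)
    return round(avg, 1), rating

# Usage:
# overall_viability({"D1": 3, "D2": 2, "D3": 1, "D4": 2, "D5": 4,
#                    "D6": 4, "D7": 3, "D8": 2, "D9": 3})  # -> (2.7, "Viable")
```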

Viability Interpretation

Score | Rating | Decision
0.0-1.5 | Unusable | Fundamental gaps. Do not use.
1.5-2.5 | Problematic | Critical features missing. Significant work needed.
2.5-3.5 | Viable | Usable but has gaps. Can be improved.
3.5-4.5 | Good | Production-ready. Minor improvements possible.
4.5-5.0 | Excellent | Best-in-class. Ready for team rollout.

Comparison Template

Use this template to evaluate multiple systems side-by-side.

```
System A: _____________ | System B: _____________ | System C: _____________

Architecture & Coordination
Score: ___/5 | Score: ___/5 | Score: ___/5
Notes: ________________ | Notes: ________________ | Notes: ________________

Memory & State Management
Score: ___/5 | Score: ___/5 | Score: ___/5
Notes: ________________ | Notes: ________________ | Notes: ________________

Session Lifecycle Operations
Score: ___/5 | Score: ___/5 | Score: ___/5
Notes: ________________ | Notes: ________________ | Notes: ________________

Discovery & Retrieval
Score: ___/5 | Score: ___/5 | Score: ___/5
Notes: ________________ | Notes: ________________ | Notes: ________________

Export & Integration
Score: ___/5 | Score: ___/5 | Score: ___/5
Notes: ________________ | Notes: ________________ | Notes: ________________

Performance & Scalability
Score: ___/5 | Score: ___/5 | Score: ___/5
Notes: ________________ | Notes: ________________ | Notes: ________________

Reliability & Durability
Score: ___/5 | Score: ___/5 | Score: ___/5
Notes: ________________ | Notes: ________________ | Notes: ________________

Security & Privacy
Score: ___/5 | Score: ___/5 | Score: ___/5
Notes: ________________ | Notes: ________________ | Notes: ________________

Developer Experience
Score: ___/5 | Score: ___/5 | Score: ___/5
Notes: ________________ | Notes: ________________ | Notes: ________________

OVERALL SCORE: ___/5 | OVERALL SCORE: ___/5 | OVERALL SCORE: ___/5

Recommendation: _________________________________
```


Examples

Example 1: Existing "claude-conversation-extractor" Tool

Dimension | Score | Notes
Architecture | 3 | JSON output, but no fork tracking
Memory | 2 | Exports content but no token/role separation
Lifecycle | 1 | Export only; no fork/resume support
Discovery | 2 | Search works but limited filtering
Export | 4 | Markdown export with metadata
Performance | 4 | Handles large exports well
Reliability | 3 | Straightforward extraction, unlikely to corrupt
Security | 2 | No access control; trusts filesystem
Developer | 3 | Pip installable; basic docs
OVERALL | 2.7 | Viable for one-time export, not session management

Example 2: Proposed "claude-session-manager" Package

Dimension | Score | Notes
Architecture | 5 | DAG fork tracking, API-first design
Memory | 5 | Per-message tokens, role separation, tactical/strategic memory
Lifecycle | 4 | Create, resume, fork, archive; merge TBD
Discovery | 5 | Full-text search, complex filtering, fork visualization
Export | 5 | Markdown + JSON + reimport + git integration
Performance | 4 | SQLite indexes, search on 100K+ messages
Reliability | 4 | ACID compliance, automatic backups, tested recovery
Security | 3 | File permissions, audit trail; encryption TBD
Developer | 4 | CLI tool, Python library, comprehensive docs
OVERALL | 4.3 | Good - production-ready with minor enhancements

Decision Framework

If evaluating existing tools:

  • Score < 2.5: Not worth integrating; build custom solution
  • Score 2.5-3.5: Integrate but plan enhancements
  • Score > 3.5: Adopt with confidence

If building new system:

  • Target score: 4.0+
  • Focus first on Dimensions 1, 4, 5 (data structure, discovery, export)
  • Dimensions 6-8 can be deferred until users accumulate hundreds of sessions

If multi-agent coordination is required:

  • Dimension 5 (Export & Integration, especially MCP) becomes critical
  • Dimension 2 (Memory management) becomes critical for context preservation
  • Must support agent-to-agent session handoff