Session Management Evaluation Survey

IAIP Research

9-dimensional framework for assessing LLM session management systems.


How to Use This Survey

  1. For each system being evaluated, score it on all 9 dimensions (0-5 scale)
  2. Use the guiding questions for each dimension
  3. Calculate the average across all 9 dimensions to get an overall viability rating
  4. Use the interpretation guide to assess whether the system meets your needs

Dimension 1: Architecture & Coordination

Definition: How session data is stored, discovered, and related to other sessions.

Scoring Questions

0-1 (Critical gaps):

  • Storage format is proprietary binary (not human-readable)
  • No fork/branch support at all
  • Session discovery only works through proprietary client
  • No API for external tools

2-3 (Partial support):

  • Sessions stored in JSON but no fork metadata
  • Fork detection requires manual linking
  • Limited to single session at a time
  • API exists but undocumented

4-5 (Production-ready):

  • Sessions stored in human-readable format (JSON, JSONL)
  • Forks tracked as DAG (Directed Acyclic Graph)
  • Session relationships queryable via API
  • Offline extraction possible without main client
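
As an illustration of what "human-readable storage with queryable fork relationships" can look like, here is a minimal Python sketch. It assumes a hypothetical layout in which each session is a JSONL file whose first record carries `session_id` and an optional `parent_session_id`; under that assumption, the fork DAG can be rebuilt offline, without the main client.

```
import json
from collections import defaultdict
from pathlib import Path

def build_fork_dag(sessions_dir: str) -> dict[str, list[str]]:
    """Map each parent session ID to the sessions forked from it.

    Assumes each *.jsonl file starts with a metadata record containing
    'session_id' and an optional 'parent_session_id' (hypothetical schema).
    """
    children = defaultdict(list)
    for path in Path(sessions_dir).glob("*.jsonl"):
        with path.open() as f:
            meta = json.loads(f.readline())  # first record = session metadata
        parent = meta.get("parent_session_id")
        if parent:
            children[parent].append(meta["session_id"])
    return dict(children)

# Usage: print every fork point and its branches.
# for parent, forks in build_fork_dag("./sessions").items():
#     print(parent, "->", forks)
```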

Red Flags

  • "Our proprietary format is optimized" (= not human-readable)
  • "Forks are handled automatically" with no explanation
  • No way to access raw data outside the tool

Dimension 2: Memory & State Management

Definition: How context, state, and knowledge are tracked across turns and sessions.

Scoring Questions

0-1 (No real management):

  • Token counts missing or unreliable
  • No separation of roles (system/user/assistant)
  • Memory treated as opaque blob
  • No tracking of context window exhaustion

2-3 (Basic management):

  • Token counts present but not per-message
  • Roles separated but not consistently
  • Memory structure exists but limited query
  • Context window mentioned but not managed

4-5 (Sophisticated):

  • Token counts tracked per message, per role
  • Clear separation of tactical (current) vs. strategic (cross-session) memory
  • CLAUDE.md or equivalent persistent context parsed and indexed
  • Context window exhaustion detected and warned
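
A minimal sketch of per-message, per-role token accounting with an exhaustion warning. The message schema (`role` and `token_count` keys) and the 200K-token window are assumptions made for illustration, not any particular tool's format.

```
from collections import Counter

def summarize_context(messages, context_window=200_000, warn_ratio=0.9):
    """Aggregate token usage per role and flag approaching exhaustion.

    Assumes each message is a dict with 'role' and 'token_count' keys
    (hypothetical schema); real systems may report tokens differently.
    """
    per_role = Counter()
    for msg in messages:
        per_role[msg["role"]] += msg.get("token_count", 0)
    total = sum(per_role.values())
    return {
        "tokens_by_role": dict(per_role),
        "tokens_total": total,
        "near_exhaustion": total >= warn_ratio * context_window,
    }
```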

Red Flags

  • "Memory just works" (= no visibility into how)
  • Thinking tokens are discarded or hidden
  • No way to distinguish system prompts from user input

Dimension 3: Session Lifecycle Operations

Definition: What operations are supported on sessions (create, resume, fork, merge, etc.).

Scoring Questions

0-1 (Limited operations):

  • Only "create new" and "view" supported
  • No fork/branch capability
  • No checkpoint/snapshot feature
  • Resuming always from latest point

2-3 (Standard operations):

  • Create, resume, fork, view supported
  • Can resume from specific point in history
  • Can manually checkpoint
  • No atomic guarantees

4-5 (Advanced):

  • Create, resume, fork, merge, archive all supported
  • Resume from any message in history
  • Automatic checkpoints (e.g., every N tokens)
  • Atomic state transitions with rollback
  • Pause/unpause capability
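
A sketch of a fork operation with an atomic state transition, assuming the same hypothetical one-JSONL-file-per-session layout as above: the new branch is written to a temporary file and renamed into place, so a crash mid-write cannot leave a half-written session behind.

```
import json
import os
import uuid
from pathlib import Path

def fork_session(sessions_dir: str, session_id: str, upto_index: int) -> str:
    """Fork a session at a given message index (hypothetical JSONL layout).

    Record 0 is session metadata; records 1..N are messages. The fork copies
    metadata plus messages up to and including upto_index into a new file.
    """
    src = Path(sessions_dir) / f"{session_id}.jsonl"
    records = [json.loads(line) for line in src.open()]
    new_id = uuid.uuid4().hex
    meta = {**records[0], "session_id": new_id,
            "parent_session_id": session_id, "forked_at_index": upto_index}
    dst = Path(sessions_dir) / f"{new_id}.jsonl"
    tmp = dst.with_suffix(".jsonl.tmp")
    with tmp.open("w") as f:
        for rec in [meta] + records[1:upto_index + 1]:
            f.write(json.dumps(rec) + "\n")
    os.replace(tmp, dst)  # atomic rename: the fork either exists fully or not at all
    return new_id
```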

Red Flags

  • "We don't support forks; just create new sessions"
  • "Resuming always overwrites the previous attempt"
  • No way to compare two branches side-by-side

Dimension 4: Discovery & Retrieval

Definition: Finding sessions and searching within them.

Scoring Questions

0-1 (Discovery is manual):

  • Only manual list or file browser
  • No search capability
  • No filtering by date, directory, status
  • Linear scan required to find anything

2-3 (Basic discovery):

  • List sessions with some metadata
  • Keyword search but not full-text
  • Filter by date OR directory (not both)
  • No fork tracing UI

4-5 (Powerful discovery):

  • List with all metadata (date, directory, tokens, status)
  • Full-text search across all messages
  • Complex filtering (AND/OR logic)
  • Fork tracing and lineage visualization
  • Fast on 1000+ sessions
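
One common way to reach this level is an SQLite FTS5 mirror of the message store. The sketch below assumes a hypothetical `messages` FTS5 table; the point is that full-text search and metadata filters compose in a single indexed query rather than a linear scan.

```
import sqlite3

def search_messages(db_path: str, query: str, project_dir=None):
    """Full-text search over a hypothetical 'messages' FTS5 table.

    Assumed schema: CREATE VIRTUAL TABLE messages USING fts5(
        session_id UNINDEXED, project_dir UNINDEXED, role UNINDEXED, content);
    """
    conn = sqlite3.connect(db_path)
    sql = ("SELECT session_id, snippet(messages, 3, '[', ']', '…', 12) "
           "FROM messages WHERE messages MATCH ?")
    params = [query]
    if project_dir:
        sql += " AND project_dir = ?"  # combine FTS with metadata filtering
        params.append(project_dir)
    rows = conn.execute(sql + " ORDER BY rank LIMIT 20", params).fetchall()
    conn.close()
    return rows
```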

Red Flags

  • "Use Ctrl-F in the file browser" (not real search)
  • Search works but takes >10 seconds
  • Can't filter by multiple criteria simultaneously

Dimension 5: Export & Integration

Definition: Getting data out in usable formats and integrating with other tools.

Scoring Questions

0-1 (Data is trapped):

  • No export capability
  • Only screenshots or copy-paste allowed
  • No API for external access
  • Proprietary format prevents reuse

2-3 (Basic export):

  • Export to JSON exists
  • Markdown export but loses metadata
  • No reimport capability
  • No tool integration

4-5 (Seamless integration):

  • Export to Markdown with full metadata preserved
  • Export to JSON/JSONL losslessly
  • Reimport and resume from export
  • Git integration (can commit sessions)
  • MCP (Model Context Protocol) support for agent handoff
  • Webhook/API for external processing
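
A sketch of "Markdown export with full metadata preserved": the session's metadata travels in a machine-readable front-matter block so the export can later be reimported. The `session` dict shape is a hypothetical example, not a specific tool's schema.

```
import json

def export_markdown(session: dict) -> str:
    """Render a session as Markdown with metadata kept machine-readable.

    'session' is a hypothetical dict: {'session_id', 'created_at',
    'parent_session_id', 'messages': [{'role', 'content', 'token_count'}]}.
    Keeping metadata in the front-matter block is what makes reimport possible.
    """
    meta = {k: v for k, v in session.items() if k != "messages"}
    lines = ["---", json.dumps(meta, indent=2), "---", ""]
    for msg in session["messages"]:
        lines.append(f"## {msg['role']} ({msg.get('token_count', 0)} tokens)")
        lines.append(msg["content"])
        lines.append("")
    return "\n".join(lines)
```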

Red Flags

  • "Export is for backup only"
  • Metadata lost on export
  • No way to use exported data elsewhere

Dimension 6: Performance & Scalability

Definition: How the system handles large numbers of sessions and messages.

Scoring Questions

0-1 (Not scalable):

  • Listing 1000 sessions takes >10 seconds
  • Search takes >30 seconds
  • System becomes sluggish with 100+ sessions
  • Full-text search not implemented

2-3 (Acceptable performance):

  • List 1000 sessions in 2-5 seconds
  • Search across 10K messages in 5-10 seconds
  • Some indexing but not comprehensive
  • Works with 100s of sessions but not 1000s

4-5 (Optimized):

  • List 1000+ sessions in <1 second
  • Search 100K+ messages in <5 seconds
  • SQLite or equivalent with proper indexes
  • Incremental indexing (only new data)
  • Caching for frequent queries
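
The gap between the 2-3 and 4-5 bands is usually indexing discipline rather than raw speed. A sketch, assuming a hypothetical SQLite schema with `sessions`, `messages`, and `messages_fts` tables:

```
import sqlite3

def ensure_indexes(conn: sqlite3.Connection) -> None:
    """Create the indexes that keep listing and filtering fast (hypothetical schema)."""
    conn.executescript("""
        CREATE INDEX IF NOT EXISTS idx_sessions_updated ON sessions(updated_at);
        CREATE INDEX IF NOT EXISTS idx_sessions_dir     ON sessions(project_dir);
        CREATE INDEX IF NOT EXISTS idx_messages_session ON messages(session_id);
    """)

def incremental_index(conn: sqlite3.Connection, new_messages: list[dict]) -> None:
    """Index only records added since the last run instead of rebuilding."""
    conn.executemany(
        "INSERT INTO messages_fts(session_id, content) VALUES (:session_id, :content)",
        new_messages,
    )
    conn.commit()
```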

Red Flags

  • "Performance gets worse with more data"
  • Search implemented as linear scan
  • No mention of indexing strategy

Dimension 7: Reliability & Durability

Definition: Data integrity, crash recovery, and safety guarantees.

Scoring Questions

0-1 (Risky):

  • No backup mechanism
  • Crashes may lose recent data
  • No ACID compliance
  • No recovery documentation

2-3 (Basic durability):

  • Automatic backups but frequency unclear
  • Some protection against crashes
  • Manual recovery procedure exists
  • No verification of data integrity

4-5 (Production-grade):

  • ACID compliance (atomic writes)
  • Automatic backups at regular intervals
  • Crash recovery tested and documented
  • Data integrity checks (checksums, etc.)
  • Versioning/rollback capability
  • Clear retention policy
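
A sketch of one integrity-check building block: storing a SHA-256 digest alongside each session file so silent corruption is detectable on read. The file layout is an illustrative assumption.

```
import hashlib
from pathlib import Path

def write_with_checksum(path: str, data: bytes) -> None:
    """Store a SHA-256 digest next to the file so corruption is detectable."""
    p = Path(path)
    p.write_bytes(data)
    p.with_suffix(p.suffix + ".sha256").write_text(hashlib.sha256(data).hexdigest())

def verify(path: str) -> bool:
    """Return True only if the file still matches its recorded checksum."""
    p = Path(path)
    expected = p.with_suffix(p.suffix + ".sha256").read_text().strip()
    return hashlib.sha256(p.read_bytes()).hexdigest() == expected
```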

Red Flags

  • "Just use git" (not a substitute for database safety)
  • No mention of what happens if process crashes mid-write
  • Backups are manual

Dimension 8: Security & Privacy

Definition: Access control, encryption, and sensitive data handling.

Scoring Questions

0-1 (No security):

  • No file permissions enforcement
  • No encryption at rest
  • All users can read all sessions
  • No audit trail

2-3 (Basic security):

  • File permissions respected
  • Optional encryption
  • Per-user access control mentioned
  • Limited audit capability

4-5 (Secure):

  • File permissions enforced at application level
  • Encryption at rest by default
  • Per-session access control with roles
  • Audit trail (who accessed what when)
  • Sensitive data detection and redaction
  • Secure deletion option
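
A sketch of two of these controls for a local, file-backed store: restricting permissions to the owning user and redacting secret-looking strings before export. The regex patterns are deliberately crude placeholders; real redaction needs a maintained rule set.

```
import os
import re

# Hypothetical patterns for illustration only.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),                  # API-key-like tokens
    re.compile(r"(?i)aws_secret_access_key\s*=\s*\S+"),  # credential assignments
]

def harden_session_file(path: str) -> None:
    """Restrict a session file to the owning user (POSIX permissions 0600)."""
    os.chmod(path, 0o600)

def redact(text: str) -> str:
    """Replace anything matching a known secret pattern before export."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```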

Red Flags

  • "Security isn't relevant for local tools"
  • No mention of file permissions
  • No way to control who sees which sessions

Dimension 9: Developer Experience

Definition: How easy and pleasant it is to use the tool.

Scoring Questions

0-1 (Poor UX):

  • Complex configuration required
  • Error messages are cryptic
  • No documentation
  • Frequent crashes or bugs

2-3 (Acceptable):

  • Configuration has reasonable defaults
  • Error messages are sometimes helpful
  • Basic documentation exists
  • Stable but occasional issues

4-5 (Excellent):

  • Zero configuration or sensible defaults
  • Clear, actionable error messages
  • Comprehensive documentation with examples
  • IDE integration (VS Code, Cursor plugin)
  • CLI is simple and intuitive
  • Active community support
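
For a sense of what a "simple and intuitive CLI" can mean in practice, a hypothetical sketch using only the standard library: explicit subcommands, built-in help, and actionable failure messages instead of tracebacks. The `session-mgr` command name is invented for illustration.

```
import argparse
import sys

def main() -> None:
    """A hypothetical CLI shape: explicit subcommands, helpful errors."""
    parser = argparse.ArgumentParser(
        prog="session-mgr", description="List, search, and fork sessions.")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("list", help="List sessions with metadata")
    search = sub.add_parser("search", help="Full-text search across sessions")
    search.add_argument("query", help="Search terms")
    args = parser.parse_args()
    # Actionable failure instead of a cryptic traceback.
    if args.command == "search" and not args.query.strip():
        sys.exit('search: query is empty; try `session-mgr search "refactor plan"`')

if __name__ == "__main__":
    main()
```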

Red Flags

  • "You have to read the source code to use it"
  • "We don't support any IDE integration"
  • Error messages like "ERROR: XXXXXX"

Scoring & Interpretation

Calculation

For each dimension, assign a 0-5 score based on how well the system matches the listed capabilities and guiding questions, and how many red flags it triggers.

Overall Viability Score = (D1 + D2 + D3 + D4 + D5 + D6 + D7 + D8 + D9) / 9
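
A small helper that applies this formula and maps the result to the rating bands defined in the interpretation table below (band boundaries treated as half-open intervals, an assumption where the table is ambiguous):

```
def overall_viability(scores: dict[str, float]) -> tuple[float, str]:
    """Average the nine dimension scores and map the result to a rating band."""
    if len(scores) != 9:
        raise ValueError("Expected scores for all 9 dimensions")
    avg = sum(scores.values()) / 9
    bands = [(1.5, "Unusable"), (2.5, "Problematic"), (3.5, "Viable"),
             (4.5, "Good"), (5.01, "Excellent")]
    rating = next(name for bound, name in bands if avg < bound)
    return round(avg, 1), rating

# Usage:
# overall_viability({"D1": 3, "D2": 2, "D3": 1, "D4": 2, "D5": 4,
#                    "D6": 4, "D7": 3, "D8": 2, "D9": 3})  # -> (2.7, "Viable")
```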

Viability Interpretation

Score | Rating | Decision
0.0-1.5 | Unusable | Fundamental gaps. Do not use.
1.5-2.5 | Problematic | Critical features missing. Significant work needed.
2.5-3.5 | Viable | Usable but has gaps. Can be improved.
3.5-4.5 | Good | Production-ready. Minor improvements possible.
4.5-5.0 | Excellent | Best-in-class. Ready for team rollout.

Comparison Template

Use this template to evaluate multiple systems side-by-side.

```
System A: _____________ | System B: _____________ | System C: _____________

Architecture & Coordination
Score: ___/5 | Score: ___/5 | Score: ___/5
Notes: ________________ | Notes: ________________ | Notes: ________________

Memory & State Management
Score: ___/5 | Score: ___/5 | Score: ___/5
Notes: ________________ | Notes: ________________ | Notes: ________________

Session Lifecycle Operations
Score: ___/5 | Score: ___/5 | Score: ___/5
Notes: ________________ | Notes: ________________ | Notes: ________________

Discovery & Retrieval
Score: ___/5 | Score: ___/5 | Score: ___/5
Notes: ________________ | Notes: ________________ | Notes: ________________

Export & Integration
Score: ___/5 | Score: ___/5 | Score: ___/5
Notes: ________________ | Notes: ________________ | Notes: ________________

Performance & Scalability
Score: ___/5 | Score: ___/5 | Score: ___/5
Notes: ________________ | Notes: ________________ | Notes: ________________

Reliability & Durability
Score: ___/5 | Score: ___/5 | Score: ___/5
Notes: ________________ | Notes: ________________ | Notes: ________________

Security & Privacy
Score: ___/5 | Score: ___/5 | Score: ___/5
Notes: ________________ | Notes: ________________ | Notes: ________________

Developer Experience
Score: ___/5 | Score: ___/5 | Score: ___/5
Notes: ________________ | Notes: ________________ | Notes: ________________

OVERALL SCORE: ___/5 | OVERALL SCORE: ___/5 | OVERALL SCORE: ___/5

Recommendation: _________________________________
```


Examples

Example 1: Existing "claude-conversation-extractor" Tool

Dimension | Score | Notes
Architecture | 3 | JSON output, but no fork tracking
Memory | 2 | Exports content but no token/role separation
Lifecycle | 1 | Export only; no fork/resume support
Discovery | 2 | Search works but limited filtering
Export | 4 | Markdown export with metadata
Performance | 4 | Handles large exports well
Reliability | 3 | Straightforward extraction, unlikely to corrupt
Security | 2 | No access control; trusts filesystem
Developer | 3 | Pip installable; basic docs
OVERALL | 2.7 | Viable for one-time export, not session management

Example 2: Proposed "claude-session-manager" Package

Dimension | Score | Notes
Architecture | 5 | DAG fork tracking, API-first design
Memory | 5 | Per-message tokens, role separation, tactical/strategic memory
Lifecycle | 4 | Create, resume, fork, archive; merge TBD
Discovery | 5 | Full-text search, complex filtering, fork visualization
Export | 5 | Markdown + JSON + reimport + git integration
Performance | 4 | SQLite indexes, search on 100K+ messages
Reliability | 4 | ACID compliance, automatic backups, tested recovery
Security | 3 | File permissions, audit trail; encryption TBD
Developer | 4 | CLI tool, Python library, comprehensive docs
OVERALL | 4.3 | Good - production-ready with minor enhancements

Decision Framework

If evaluating existing tools:

  • Score < 2.5: Not worth integrating; build custom solution
  • Score 2.5-3.5: Integrate but plan enhancements
  • Score > 3.5: Adopt with confidence

If building new system:

  • Target score: 4.0+
  • Focus first on Dimensions 1, 4, 5 (data structure, discovery, export)
  • Dimensions 6-8 can be deferred until users accumulate hundreds of sessions

If multi-agent coordination is required:

  • Dimension 5 (Export & Integration, especially MCP) becomes critical
  • Dimension 2 (Memory management) becomes critical for context preservation
  • Must support agent-to-agent session handoff