
Academic Critique: The Interrogative Turn Survey & Literature Review

IAIP Research

Reviewer: Senior Academic Reviewer (Interdisciplinary Research Methodology, Computational Linguistics, Philosophy of AI, Prompt Engineering)
Date of Review: April 6, 2026
Documents Under Review:

  1. Survey 01 — APE Methods Evolution
  2. Survey 02 — Computational Linguistics
  3. Survey 03 — Philosophy of AI
  4. Survey 04 — PDE State of the Art
  5. MASTER-SURVEY (Cross-disciplinary synthesis)
  6. LITERATURE-REVIEW (Final integrated review)

Executive Assessment

This is an ambitious, intellectually stimulating, and intermittently brilliant piece of interdisciplinary scholarship. The central conceit—holding technical, linguistic, and philosophical lenses simultaneously on the evolution of prompt decomposition—is genuinely novel, and the "cross-disciplinary convergences" identified in the MASTER-SURVEY (particularly the isomorphism between question semantics and tree-of-thought reasoning, and the Gricean reading of evolutionary optimisation) represent real intellectual contributions that deserve publication and further investigation. The writing is consistently lucid, the architecture is well-conceived, and the sheer scope of the bibliographic coverage—120+ sources across five domains—is impressive for a survey produced in this format.

However, the work suffers from three systemic weaknesses that, in their current form, would prevent acceptance at a top-tier interdisciplinary venue (e.g., Artificial Intelligence, Minds and Machines, or AI & Society). First, the central thesis is presented as an empirical claim ("the interrogative turn is happening") but defended primarily through interpretive argument and selective evidence—the strongest technical systems remain overwhelmingly imperative, and the "conversational decomposition" evidence is thin, recent, and largely unreplicated. The work does not adequately confront the possibility that the "shift" is an artefact of the authors' interpretive framework rather than a phenomenon in the field. Second, the citation apparatus contains several fragile elements: references to blog posts and grey literature treated as equivalent to peer-reviewed findings; at least one arXiv ID that appears anachronistic; incomplete attributions for several papers; and a focal work (González Arocha, 2025) published in a regional journal that has not been widely cited or engaged with by the philosophical community. Third, the Indigenous epistemology treatment, while well-intentioned, risks the very instrumentalisation it claims to avoid—Wilson's (2008) framework is deployed as a "lens" for reinterpreting Western AI problems rather than centred on its own terms, and the diversity of Indigenous scholarly voices is insufficient.

Verdict on publishability: Not publishable in current form at a top venue. With substantial revision addressing the three systemic issues above—particularly strengthening the empirical basis for the central thesis, cleaning the citation apparatus, and deepening the Indigenous engagement—this could become an important contribution. The cross-disciplinary convergences and the research questions (especially RQ2, RQ6, and RQ9) are genuinely novel and worth developing into a research programme.


Detailed Critique by Dimension

1. SCHOLARLY RIGOR — Score: 5/10

Key Strengths:

  • The technical surveys (01 and 04) demonstrate strong command of the peer-reviewed APE and agent architecture literatures. Citation of venue, year, and key benchmark results is generally accurate and well-organised.
  • The chronological evolution in Survey 01 is carefully staged with appropriate evidence quality gradations (the "Evidence Quality Assessment" table in Survey 01 is a model of honest self-assessment).
  • The philosophy survey engages primary texts (Wittgenstein, Searle, Dreyfus, Ihde, Buber, Levinas) rather than relying solely on secondary interpretations.

Key Weaknesses:

  1. Evidence hierarchy violations. The work frequently treats non-peer-reviewed sources as equivalent to peer-reviewed findings:

    • Anthropic's "Effective Context Engineering" (2025) is a corporate blog post, yet is treated as establishing a "paradigm" with specific performance claims ("up to 54% improvement"—PDE survey). This claim appears nowhere in peer-reviewed literature.
    • IBM's 2026 guide is a marketing document cited as if it were scholarship.
    • The SciELO Bakhtinian analysis (2025) is a blog post, not an article.
    • "STRV analysis" (2024) appears to be a tech company blog.
    • "Sholzman (2024)" is cited repeatedly in the philosophy survey but never fully attributed—no first name, no title, no venue. This is unacceptable.
    • "DTPA" (Deep Thinking Prompting Agents) is referenced with no proper citation at all.
  2. Incomplete attributions. Multiple papers in the computational linguistics survey lack full author names: items 7, 17, 19, 20, 29, 30, 32, 33, 35, 37, 38 in the Sources section are cited without named authors. Several entries in the philosophy survey suffer similarly. This is not merely a formatting issue—it suggests reliance on search results rather than actual reading.

  3. Possible citation ghosts. Several references require verification:

    • GEPA (arXiv:2507.19457): The arXiv ID prefix "2507" indicates July 2025 submission, which is valid given a review date of April 2026. However, this specific paper should be verified—the claimed results (open-source models outperforming Claude Opus 4.1 at 90× lower cost) are extraordinary and require correspondingly strong evidence, especially from a preprint.
    • FATA (arXiv:2508.08308): August 2025 submission—again, plausible but should be verified. The title and framing vary slightly across surveys.
    • Coeckelbergh (2025) Communicative AI: Referenced in the MASTER-SURVEY (Section 1) but absent from both the philosophy survey's source list and the literature review's bibliography. If this book exists, it should be cited; if it does not, it is a fabricated reference.
    • "Conversation Routines" (arXiv:2501.11613v4): Cited without authors—who wrote this?
  4. Conflation of primary and secondary sources. The philosophy survey frequently cites Wittgenstein, Searle, and Dreyfus through secondary applications (Jolma, 2024; STRV, 2024) rather than demonstrating direct engagement with the primary arguments. The claim that Wittgenstein's language games provide "the most direct philosophical framework" for understanding prompting is attributed to secondary sources, not argued from Philosophical Investigations itself.

  5. Overclaiming from weak evidence. The claim that the interrogative turn is "accelerating" (MASTER-SURVEY, Section 10) rests primarily on three 2025 systems (ACT, FATA, Tri-Agent), all of which are either preprints, workshop papers, or described as "emerging." This is a fragile empirical basis for a strong directional claim.

Specific Recommendations:

  • Audit every citation: verify existence, verify that the cited paper says what is attributed to it, supply full author names.
  • Clearly distinguish peer-reviewed findings from preprints, blog posts, and grey literature in-text, not just in the evidence quality table.
  • Remove or flag all references that cannot be fully attributed.
  • Tone down directional claims about the "interrogative turn" to match the actual evidence base.

2. ARGUMENTATIVE COHERENCE — Score: 6/10

Key Strengths:

  • The overall argumentative arc—from technical description through linguistic analysis to philosophical interpretation—is well-designed and builds cumulatively. The literature review (Document 6) is particularly well-structured.
  • The identification of "tensions" (MASTER-SURVEY Section 6) is a genuine strength. Tension 1 ("Is the interrogative turn real or metaphorical?") demonstrates intellectual honesty that is too rare in survey papers.
  • The five "cross-disciplinary convergences" are the strongest intellectual contribution. Convergence 1 (question semantics / tree search / Socratic inquiry isomorphism) and Convergence 3 (context-sensitivity as first principle) are genuinely insightful.

Key Weaknesses:

  1. The central thesis is underdetermined by the evidence. The "instruction-to-inquiry shift" is presented as an empirical phenomenon documented across disciplines. But the evidence is equivocal:

    • The most successful and widely deployed PDE systems (DSPy, LangChain, MetaGPT, LATS) are overwhelmingly structured and imperative.
    • The "conversational decomposition" evidence (ACT, FATA, Tri-Agent) is from 2025 preprints/workshops that have not been replicated or widely adopted.
    • The majority of APE methods produce static prompts—even GEPA and TextGrad, which the survey positions as "conversational," are fundamentally optimisation loops, not dialogues.
    • The work acknowledges this in Tension 1 but then proceeds as if the thesis were established.
  2. False equivalence across disciplinary "convergences." Some claimed convergences are weaker than presented:

    • Convergence 2 ("Cooperative Principles Across All Three Domains") conflates Grice's cooperative principle (a normative model of conversation), technical fitness functions (optimisation metrics), and Floridi's epistemic cooperation (an ethical framework). These are structurally different kinds of "cooperation." The analogy is suggestive but the claim of genuine convergence overstates the case.
    • Convergence 5 ("Feedback Loop as Dialogical Structure") equates TextGrad's gradient-like updates with Gadamer's hermeneutic circle. The structural parallel is real, but calling TextGrad a "hermeneutic circle implemented in code" anthropomorphises a numerical optimisation process.
  3. The synthesis juxtaposes more than it synthesises. In several places, the MASTER-SURVEY lists what each discipline says about a topic without demonstrating how the combination produces insight beyond what any single discipline provides. True synthesis would show how the intersection produces knowledge that is not merely additive but emergent. The "cross-disciplinary insights" (numbered 1–6 in the MASTER-SURVEY) do this well; the surrounding text often does not.

  4. The research questions are partially grafted on. While RQ1, RQ2, and RQ6 genuinely emerge from the reviewed literature, RQ3 (Indigenous PDE) and RQ10 (temporal phenomenology) appear to reflect the authors' prior commitments rather than emerging organically from the gaps identified. RQ3 in particular is important but its connection to the specific reviewed literatures is thin—no reviewed paper suggests this direction.

  5. Non-sequitur in the relational turn argument. The move from "the inquiry paradigm is relational" to "this maps onto Indigenous relational epistemology" (MASTER-SURVEY Section 7, Literature Review Section 2.4) involves a leap. That two things are "relational" does not make them the same kind of relational. Wilson's relational epistemology concerns networks of accountability involving human, more-than-human, and spiritual relations; the "relational" character of conversational prompting concerns information exchange patterns between a human and a statistical model. Conflating these risks trivialising Indigenous epistemology by reducing it to a general label for "not-extractive."

Specific Recommendations:

  • Reframe the thesis: rather than claiming the interrogative turn is happening, argue that the conditions for it are emerging and that it should happen (a normative rather than empirical claim).
  • Strengthen the convergence arguments by specifying the formal properties that make the structures genuinely isomorphic (not merely analogous).
  • Rework the Indigenous epistemology section to acknowledge the qualitative difference between Wilson's relational accountability and conversational prompting.

3. DISCIPLINARY DEPTH vs. BREADTH — Score: 6/10

Key Strengths:

  • The APE technical survey (01) is genuinely comprehensive and well-organised. The chronological evolution, annotated key papers, and evidence quality assessment would be publishable as a standalone technical survey.
  • The PDE survey (04) covers an impressive range of systems with accurate taxonomic organisation.
  • The philosophy survey (03) demonstrates real engagement with primary philosophical texts and makes genuinely philosophical arguments (not just name-dropping).

Key Weaknesses:

  1. Computational linguistics: broad but shallow. The linguistics survey (02) correctly identifies RST, speech act theory, Gricean pragmatics, question semantics, and compositionality as relevant frameworks. But the engagement remains at the "framework identification" level—none of these theories is applied in detail to actual prompt examples. For instance:

    • The RST analysis never annotates an actual prompt with RST relations.
    • The speech act analysis never performs a detailed felicity condition analysis on specific prompt types.
    • The Gricean analysis identifies maxim violations in general but never demonstrates a detailed pragmatic annotation of a real LLM interaction.
    • Zhang & Cao (2025) is cited for information-theoretic foundations but their specific formal results are not engaged with.
  2. Philosophy: uneven depth. Wittgenstein, Ihde, and Floridi are well-treated. But:

    • Levinas is mentioned but his arguments about the face-to-face encounter are never developed. Simply stating that "Levinas raises the question" is not philosophical engagement.
    • Gadamer is cited for the hermeneutic circle but Truth and Method's actual arguments about prejudice, horizon-fusion, and effective-historical consciousness are not engaged.
    • The Bakhtin section relies on one SciELO blog post for its application to LLMs—this needs substantial strengthening through actual literary/discourse analysis.
    • Harman's object-oriented ontology is mentioned once in a parenthetical (MASTER-SURVEY) and never developed.
  3. Technical: missing recent architectures. The PDE survey, while comprehensive, misses several important 2025 systems:

    • OpenAI's recent work on process reward models and step-level verification
    • The "reasoning traces" line of work (DeepSeek-R1, QwQ, etc.)
    • The emerging "scaffolding" paradigm distinct from both prompting and fine-tuning
    • No engagement with Constitutional AI or RLHF as they relate to instruction-following
  4. Cognitive science is entirely absent. This is a significant gap for a review about how humans interact with AI. Dual process theory (Kahneman), bounded rationality (Simon), and cognitive load theory (Sweller) are directly relevant to prompt decomposition and are never mentioned.

Specific Recommendations:

  • Add worked examples: annotate 2–3 actual prompts with RST structure, speech act analysis, and Gricean evaluation (a minimal sketch of such an annotation follows this list).
  • Deepen Levinas and Gadamer engagement or remove them (mentioning without engaging is worse than omitting).
  • Add a cognitive science subsection addressing dual process theory and cognitive load as they relate to decomposition.
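
To indicate what such a worked example might look like, here is a minimal sketch in Python; the prompt, its segmentation, and all labels are hypothetical illustrations, not annotations drawn from the surveys under review:

```python
# A minimal sketch of the kind of worked example recommended above.
# The prompt, segmentation, and all labels are hypothetical illustrations,
# not annotations taken from the surveys under review.

annotated_prompt = {
    "prompt": (
        "Summarise the attached paper. "
        "Focus on its methodology, because we need to assess replicability."
    ),
    # RST layer: elementary discourse units (EDUs) and the relations
    # holding between them (relation names from the standard RST inventory).
    "rst": {
        "edus": [
            "Summarise the attached paper.",
            "Focus on its methodology,",
            "because we need to assess replicability.",
        ],
        "relations": [
            {"type": "Elaboration", "nucleus": 0, "satellite": 1},
            {"type": "Justify", "nucleus": 1, "satellite": 2},
        ],
    },
    # Speech-act layer: illocutionary force per EDU (Searle's taxonomy).
    "speech_acts": ["directive", "directive", "assertive"],
    # Gricean layer: which maxims the prompt risks violating, and why.
    "gricean_evaluation": {
        "quantity": "under-specified: no target length for the summary",
        "manner": "acceptable: instructions are ordered and unambiguous",
        "relation": "satisfied: the justification links task to goal",
    },
}
```

Even a handful of such annotations would demonstrate that the theoretical frameworks are doing analytical work rather than decorating the survey.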

4. CRITICAL OMISSIONS — Score: 5/10

Missing Papers/Authors (with justification):

  1. Bender, Gebru, McMillan-Major & Shmitchell (2021), "On the Dangers of Stochastic Parrots." Listed in the bibliography of the MASTER-SURVEY but never discussed in any survey text. This is the most-cited critical paper on LLM limitations—its "stochastic parrot" framing directly challenges the attribution of communicative capacity to LLMs and should have been central to the philosophy survey's discussion of machine dialogue.

  2. Shanahan (2024), "Talking About Large Language Models" (Communications of the ACM). Provides a careful philosophical analysis of the metaphors we use for LLMs, directly relevant to the question of whether "conversation" and "inquiry" are appropriate framings.

  3. Mahowald, Ivanova, Blank, Kanwisher, Tenenbaum & Fedorenko (2024), "Dissociating Language and Thought in Large Language Models" (Trends in Cognitive Sciences). The formal competence/functional competence distinction is essential for the linguistics survey's claims about pragmatic capacity.

  4. Kahneman (2011), Thinking, Fast and Slow. Dual process theory is absent yet directly relevant: prompt decomposition can be understood as enforcing System 2 (deliberative) processing on what would otherwise be System 1 (heuristic) model responses.

  5. Suchman (2007), Human-Machine Reconfigurations. The foundational STS text on situated action and the gap between plans and situated action—directly relevant to the decomposition-as-plan-then-execute paradigm.

  6. Nussbaum (2011) or Sen (1999), capabilities approach. The ethical analysis relies on Floridi and Coeckelbergh but ignores the capabilities tradition, which offers an alternative framework for evaluating whether AI interactions support human flourishing.

  7. Ouyang et al. (2022), "Training Language Models to Follow Instructions with Human Feedback" (InstructGPT/RLHF). Foundational for understanding why instruction-tuned models respond as they do—entirely absent from the technical surveys.

  8. Ziegler et al. (2019), Christiano et al. (2017) — RLHF foundations. The alignment/instruction-following pipeline is barely mentioned despite being the technical mechanism that makes both imperative and interrogative prompting possible.

  9. Mitchell et al. (2019), "Model Cards" and Gebru et al. (2021), "Datasheets for Datasets." Documentation frameworks for responsible AI development—relevant to Djeffal's reflexive prompt engineering.

  10. Hovy & Spruit (2016), "The Social Impact of NLP." An early and influential paper on the ethical dimensions of NLP systems.

Missing Theoretical Perspectives:

  1. Cognitive Science. No engagement with dual process theory, cognitive load theory, bounded rationality, or the extensive cognitive science literature on question-asking as a cognitive strategy. This is a glaring gap for a review about how humans formulate and decompose questions.

  2. Science and Technology Studies (STS). Beyond Coeckelbergh, no engagement with Jasanoff (co-production), Latour (actor-network theory), Suchman (situated action), or the broader STS tradition. The "relational turn" discussion would benefit enormously from ANT's treatment of human-nonhuman relations.

  3. Human-Computer Interaction (HCI). No empirical user studies of how people actually prompt LLMs. The ACL "Bridging HCI and NLP" workshop is mentioned in passing but none of its papers are cited. The extensive literature on conversational user interface design is absent.

  4. Argumentation Theory. Despite citing Russo, Schliesser & Wagemans (who are argumentation theorists), no engagement with argumentation theory proper—which offers formal models of dialogical reasoning directly relevant to the "inquiry" paradigm.

  5. Relevance Theory (Sperber & Wilson, 1986/1995). An alternative to Gricean pragmatics that provides a cognitive model of communication based on relevance maximisation—potentially more applicable to LLM interaction than Grice's cooperative model.

Specific Recommendations:

  • Add a cognitive science section addressing dual process theory and cognitive load.
  • Engage with Bender et al. (2021) explicitly—its omission from the discussion is conspicuous.
  • Include at least one STS perspective (Suchman's situated action is most directly relevant).
  • Reference the RLHF/alignment literature as the technical substrate for instruction-following.

5. METHODOLOGICAL CONCERNS — Score: 5/10

Key Strengths:

  • The multi-agent survey methodology is innovative and well-documented. Assigning different disciplinary "agents" to produce independent surveys, then synthesising, is a reasonable approach to interdisciplinary review.
  • The evidence quality assessment tables (Survey 01) demonstrate methodological self-awareness.

Key Weaknesses:

  1. The focal work selection (González Arocha, 2025) is inadequately justified. The literature review positions González Arocha's "Critical Phenomenology of Prompting" as the focal lens, but:

    • Sophia (Universidad Politécnica Salesiana, Ecuador) is a regional journal with limited international visibility. This is not an inherent disqualification, but it means the work has not been subject to the scrutiny that a publication in, say, Philosophy & Technology or AI & Society would have received.
    • The paper has had insufficient time to accumulate citations or scholarly responses.
    • The review does not explain why this work was chosen over more established alternatives (Djeffal at FAccT, Floridi in Philosophy & Technology, Coeckelbergh's books).
    • The selection creates the appearance of choosing a focal work that supports the thesis rather than one that challenges it.
  2. The MECE decomposition is not genuinely MECE. The four survey agents are divided as: (1) APE Methods, (2) Computational Linguistics, (3) Philosophy of AI, (4) PDE State of the Art. But:

    • Overlap: Speech act theory appears in both Survey 02 (CL) and Survey 03 (Philosophy). Question semantics appears in both. Wittgenstein appears in both. The CoT/ToT/GoT progression appears in Surveys 01, 02, and 04. DSPy appears in Surveys 01 and 04. This is not merely benign redundancy—it means the "synthesis" is partly synthesising the same material told three times.
    • Gap: There is no agent for cognitive science, HCI, or STS. The MECE claim requires that the four surveys collectively exhaust the relevant disciplinary space; they do not.
  3. Echo-chamber risk. All four surveys converge on the same thesis (the interrogative turn). While this convergence could reflect genuine cross-disciplinary alignment, it could also reflect:

    • A shared prompt/context given to all agents that biases toward the thesis.
    • The absence of a "devil's advocate" agent tasked with finding counter-evidence.
    • Confirmation bias in source selection.

    The fact that all four surveys identify the same "critical gap" (conversational prompt optimisation) and all four frame the instruction-to-inquiry shift positively suggests insufficient methodological diversity. A robustly designed multi-agent survey should include at least one agent tasked with building the strongest possible case against the central thesis.

  4. Search strategy is undocumented. No survey describes its search methodology: which databases were searched, what search terms were used, what inclusion/exclusion criteria were applied, how many initial results were screened, or how the final set was selected. Without this, the surveys cannot be evaluated as systematic reviews.

  5. Temporal bias. The surveys are heavily weighted toward 2023–2026 publications, which is appropriate for the technical literature but problematic for the linguistics and philosophy sections. Foundational works in pragmatics (Levinson, 1983), discourse analysis (Schiffrin, 1994), and philosophy of technology (Feenberg, 1999) are absent, creating the impression that these fields suddenly discovered AI in 2023.

Specific Recommendations:

  • Justify the González Arocha focal work selection explicitly, or consider alternative focal works (see "Alternative Focal Works" section below).
  • Add a search methodology section to each survey.
  • Commission a counter-thesis agent in the multi-agent protocol.
  • Acknowledge the MECE violations and address the disciplinary gaps.

6. WRITING QUALITY — Score: 7/10

Key Strengths:

  • The prose is consistently clear, well-paced, and accessible to an interdisciplinary audience. Technical concepts are explained without condescension; philosophical concepts are presented without unnecessary obscurantism.
  • The structural parallel between documents (each survey has Key Findings, Key Papers, Open Problems, Sources) creates navigational consistency.
  • The Literature Review (Document 6) is particularly well-written—the opening paragraph is compelling and the cumulative argument builds effectively.
  • Effective use of cross-referencing between surveys ("as the APE survey documents...," "the philosophy survey argues...").

Key Weaknesses:

  1. Jargon inflation in the synthesis. Phrases like "algorithmic monologism," "epistemic co-construction," "phenomenological reorientation," and "relational turn" are introduced without sufficient definitional work. By the time the literature review deploys all of them simultaneously, the prose risks becoming a jargon thicket.

  2. Rhetorical overreach. Several passages make claims that exceed what the evidence supports:

    • "the most significant yet undertheorised developments in contemporary artificial intelligence" (Lit Review abstract)—superlative claims require superlative evidence.
    • "the interrogative turn is not merely a technical optimisation strategy but a reconstitution of the human-AI epistemic relationship" (MASTER-SURVEY abstract)—"reconstitution" is very strong.
    • "prompting is 'an inherently philosophical act'" (González Arocha, quoted repeatedly)—this is a claim by one philosopher in one paper, not an established consensus.
  3. Repetition across documents. The same key points (CoT as pivotal, DSPy as paradigm shift, compositionality gap, Wilson's four principles) are repeated in nearly identical language across multiple documents. This is partly inherent to the multi-agent format but should have been addressed in synthesis.

  4. Bibliography formatting inconsistencies. The surveys use different citation formats: Survey 01 uses numbered references; Survey 02 uses author-date in text but numbered in the sources; Survey 03 uses a hybrid. The Literature Review standardises to APA-like format but has some internal inconsistencies (e.g., some entries have full author lists, others use "et al." inconsistently).

Specific Recommendations:

  • Provide a glossary or definition section for coined terms.
  • Replace superlative claims with qualified language.
  • Standardise bibliography format across all documents.
  • Reduce cross-document repetition in the synthesis.

7. INDIGENOUS EPISTEMOLOGY TREATMENT — Score: 4/10

Key Strengths:

  • The work clearly signals awareness of the risk of appropriation.
  • The CARE Principles are accurately presented and their mapping to AI design is thoughtful.
  • The reframing in the Literature Review Section 2.4—"what appears as progressive discovery is better understood as recovery"—is a genuinely powerful rhetorical move that demonstrates philosophical depth.
  • The explicit statement that RQ3 "must be conducted with Indigenous communities, not merely about Indigenous epistemology" shows ethical awareness.

Key Weaknesses:

  1. Wilson (2008) is virtually the only Indigenous voice. Research is Ceremony was published 18 years ago. Where are:

    • Linda Tuhiwai Smith (Decolonizing Methodologies, 1999/2021, 3rd ed.)—the foundational text on Indigenous research methodology
    • Margaret Kovach (Indigenous Methodologies, 2009/2021)
    • Bagele Chilisa (Indigenous Research Methodologies, 2012/2019)
    • The extensive work by Māori scholars (Royal, 2002; Mead, 2003) on relational knowledge
    • Leroy Little Bear's writings on Indigenous science and Blackfoot metaphysics

    Relying on a single Indigenous scholar to represent "Indigenous epistemology" is itself a form of tokenism.

  2. The work falls into the trap it identifies. Despite acknowledging the risk, the surveys use Indigenous epistemology as a "lens" for reinterpreting Western AI problems. Wilson's framework is deployed to validate the instruction-to-inquiry thesis—to show that the Western field is "belatedly recognising what relational epistemologies have always known." This instrumentalises Indigenous knowledge: it is valued for what it confirms about the authors' thesis, not engaged on its own terms.

  3. No engagement with Indigenous critiques of the "relational" framing as applied to AI. Indigenous scholars have raised pointed questions about whether AI systems—which are products of extractive data practices, built on colonial linguistic corpora, and deployed by tech corporations—can ever be sites of relational knowledge-production. The Lewis et al. (2020) IP//AI Position Paper is cited but its specific arguments and tensions are not engaged.

  4. The CARE Principles mapping is suggestive but ungrounded. Mapping "Collective Benefit" to "PDE systems should optimise for community outcomes" is an assertion, not an argument. How would "collective benefit" be operationalised? What community? Who decides? The CARE Principles were developed for Indigenous data governance specifically—extending them to general AI interaction design requires substantial argumentative work that is not provided.

  5. Absence of Indigenous AI practitioners' actual work. Running Wolf (FLAIR), Abdilla (Old Ways New), and the Abundant Intelligences programme are mentioned but their actual technical and theoretical contributions are not described. What does FLAIR's language technology actually do? What does "Old Ways New" actually propose? These are not just names to list but bodies of work to engage.

  6. The "recovery not progress" framing is risky. While intellectually interesting, claiming that Western AI's conversational turn is a "return" to Indigenous ways of knowing risks suggesting that Indigenous knowledge systems are simply earlier versions of what Western technology will eventually rediscover. This is a sophisticated form of the "noble savage" trope—Indigenous people as temporally prior rather than contemporaneously different.

Specific Recommendations:

  • Add at least 3–4 additional Indigenous scholars (Smith, Kovach, Chilisa, Little Bear).
  • Engage the IP//AI Position Paper's actual arguments, not just its existence.
  • Describe what FLAIR and Abundant Intelligences actually do.
  • Add a paragraph explicitly addressing Indigenous critiques of applying relational frameworks to AI.
  • Rework the "recovery not progress" framing to avoid temporal priority claims.

8. RESEARCH QUESTIONS ASSESSMENT — Score: 7/10

Key Strengths:

  • RQ2 (formalising decomposition as compositional question semantics) is genuinely novel, well-specified, and testable. This is the strongest research question in the set and could anchor a PhD programme.
  • RQ6 (Gricean maxim violations predicting decomposition failures) is pragmatically valuable and feasible with existing data.
  • RQ9 (discourse grammar of prompt decomposition) addresses a real gap and the proposed methodology is sound.
  • The structuring by priority (primary/secondary/exploratory) with feasibility and novelty assessments is well done.

Per-Question Assessment:

| RQ | Answerable? | Novel? | Framing bias? | Sharper version? |
| --- | --- | --- | --- | --- |
| RQ1 (illocutionary force → epistemological quality) | Partially—"epistemological quality" needs operationalisation | Moderate—related to existing prompt sensitivity studies | Yes—assumes interrogative is superior | "Under what task conditions does prompt illocutionary force affect output quality beyond accuracy?" |
| RQ2 (DecomP as question semantics) | Yes, with formal work | High | Minimal | Good as stated |
| RQ3 (Indigenous relational PDE) | Very difficult without community partnership already in place | High as concept; low in specificity | Significant—presupposes outcome | Should be reformulated as participatory design question |
| RQ4 (Bakhtinian monologism in multi-agent systems) | Yes, testable | Moderate-high | Yes—assumes Bakhtin's critique applies | "To what extent does architectural diversity in multi-agent systems produce output diversity beyond single-model sampling?" |
| RQ5 (Reflexive decomposition engines) | Partially—"ethical self-awareness" is hard to operationalise | Moderate | Yes—assumes reflexivity improves outcomes | "Can decomposition engines monitor their own epistemic adequacy using automated feedback mechanisms?" |
| RQ6 (Gricean violations → decomposition failures) | Yes | Moderate | Minimal | Good as stated |
| RQ7 (Phenomenological stance measurement) | Difficult—phenomenological studies are not easily quantified | High | Yes—implies interrogative is phenomenologically richer | Should separate the phenomenological study from the outcome correlation |
| RQ8 (Socratic PromptBreeder) | Yes, testable | Very high | Moderate—may be a category error | Need to operationalise "Socratic" mutation concretely |
| RQ9 (Discourse grammar) | Yes, with substantial linguistic work | Very high | Minimal | Good as stated |
| RQ10 (Temporal phenomenology / designed latency) | Yes, testable | High | Moderate—assumes latency helps | Should be two-directional: does pacing affect engagement (positive or negative)? |

Missing Research Questions:

  • No question about failure modes. When does the interrogative approach fail? Under what conditions do imperative prompts outperform interrogative ones? This would strengthen the thesis by bounding it.
  • No question about user diversity. How do different users (experts vs. novices, different cultural backgrounds, different languages) experience and benefit from the interrogative turn?
  • No question about scale. Does the interrogative turn scale? Conversational decomposition is expensive—how does it perform under cost constraints?
  • No question about adversarial robustness. Are interrogative prompt systems more or less vulnerable to prompt injection or adversarial manipulation?

Specific Recommendations:

  • Operationalise "epistemological quality" in RQ1 before testing.
  • Add an explicitly failure-oriented RQ: "Under what conditions does imperative decomposition outperform interrogative decomposition?"
  • Add a user-diversity RQ.
  • Reframe RQ3 as a participatory design proposal rather than an abstract research question.

Citation Audit

Suspect or Unverifiable Citations

| Citation | Concern | Severity |
| --- | --- | --- |
| Coeckelbergh (2025) Communicative AI | Referenced in MASTER-SURVEY but absent from all bibliographies | HIGH — possible fabrication |
| "Sholzman (2024)" | No first name, title, or venue provided | HIGH — unverifiable |
| "DTPA" / "Deep Thinking Prompting Agents" | No proper citation anywhere | MODERATE — grey literature or fabricated |
| "Conversation Routines" (arXiv:2501.11613v4) | No authors listed | MODERATE — incomplete |
| SciELO Bakhtinian analysis (2025) | Blog post treated as scholarship | MODERATE — evidence hierarchy |
| "STRV analysis" (2024) | Appears to be a tech company blog | MODERATE — evidence hierarchy |
| "Shaka analysis" (2024) | No full citation | MODERATE — unverifiable |
| Anthropic (2025) "54% improvement" claim | Blog post; specific metric may be exaggerated/decontextualised | MODERATE — overclaiming |
| GEPA (arXiv:2507.19457) | Extraordinary claims from a preprint | LOW — but needs verification |
| Ferrario & Loi (2026) in Phil & Tech 39 | Recent publication; may not be available yet | LOW — but verify |

Claims Requiring Verification

  1. "DSPy achieves 25–65% accuracy improvements over manual prompting" (Survey 01, multiple) — Does the Khattab et al. paper actually claim this range, or is this aggregated across different reports?
  2. "GEPA enables open-source models to outperform Claude Opus 4.1" (Survey 01) — Verify the specific model comparison.
  3. "TextGrad published in Nature 2025" (Survey 01) — Verify publication venue (was it Nature proper or a Nature family journal?).
  4. "Zeldes et al. (2025), eRST, Computational Linguistics 51(1), 23–72" — The DOI and journal details check out structurally, but the page range should be verified.
  5. "Zhang & Cao (2025) demonstrate at ACL 2025" — Verify this is an ACL 2025 main conference paper.

Missing Literature

Essential Additions (Alphabetical)

  1. Bender, E.M., Gebru, T., McMillan-Major, A. & Shmitchell, S. (2021). "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜" FAccT 2021. — The "stochastic parrot" framing directly challenges attributing communicative capacity to LLMs and is the single most-cited critical perspective on the phenomenon these surveys discuss.

  2. Christiano, P., et al. (2017). "Deep Reinforcement Learning from Human Preferences." NeurIPS 2017. — Foundation of RLHF, the technical mechanism that made instruction-following possible.

  3. Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux. — Dual process theory provides the cognitive science framework for understanding why decomposition helps (enforcing System 2 processing).

  4. Kovach, M. (2009/2021). Indigenous Methodologies: Characteristics, Conversations, and Contexts. University of Toronto Press. — Essential second voice for Indigenous research methodology beyond Wilson.

  5. Levinson, S. (1983). Pragmatics. Cambridge University Press. — Foundational text for the pragmatic theory heavily deployed in Survey 02.

  6. Mahowald, K., Ivanova, A.A., Blank, I.A., Kanwisher, N., Tenenbaum, J.B. & Fedorenko, E. (2024). "Dissociating Language and Thought in Large Language Models." Trends in Cognitive Sciences, 28(6), 517–540. — The formal/functional competence distinction is essential for the linguistics survey's pragmatic competence claims.

  7. Ouyang, L., et al. (2022). "Training Language Models to Follow Instructions with Human Feedback." NeurIPS 2022. — InstructGPT/RLHF — the technical foundation of instruction-following that the entire technical survey presupposes.

  8. Shanahan, M. (2024). "Talking About Large Language Models." Communications of the ACM, 67(2), 68–79. — Careful philosophical analysis of the metaphors used for LLMs.

  9. Smith, L.T. (1999/2021). Decolonizing Methodologies: Research and Indigenous Peoples. 3rd ed. Zed Books. — The foundational text on Indigenous research methodology; its absence when Wilson is so central is a significant gap.

  10. Sperber, D. & Wilson, D. (1986/1995). Relevance: Communication and Cognition. Blackwell. — Relevance Theory provides an alternative to Gricean pragmatics that may be more applicable to LLM interaction.

  11. Suchman, L. (2007). Human-Machine Reconfigurations: Plans and Situated Actions. 2nd ed. Cambridge University Press. — The foundational STS text on situated action, directly challenging the plan-then-execute model of decomposition.


Recommended Revisions (Priority-Ordered)

Critical (Must Fix)

  1. Citation audit. Verify every citation. Remove or flag Coeckelbergh (2025) Communicative AI if it cannot be confirmed. Supply full attributions for Sholzman, Conversation Routines, DTPA. Fix all anonymous references.

  2. Reframe the central thesis. The current framing presents the interrogative turn as an established empirical phenomenon. Reframe as: "we identify an emerging trajectory toward conversational decomposition, supported by early evidence from [specific systems], theoretically motivated by [specific linguistic and philosophical arguments], but not yet empirically established as reliably superior." This is both more honest and more defensible.

  3. Strengthen the Indigenous epistemology section. Add Smith, Kovach, and at least one more Indigenous scholar. Engage the IP//AI Position Paper's actual arguments. Describe FLAIR's and Abundant Intelligences' actual work. Address Indigenous critiques of applying relational frameworks to AI. Rework the "recovery not progress" framing.

  4. Add explicit counter-evidence. Dedicate a section to conditions under which imperative prompting outperforms interrogative prompting. Engage with the fact that the most successful deployed systems are structured, not conversational.

Important (Should Fix)

  1. Add cognitive science perspective. At minimum, address dual process theory and cognitive load theory as they relate to decomposition.

  2. Engage Bender et al. (2021) explicitly. The "stochastic parrot" critique is the elephant in the room for any claim about machine dialogue.

  3. Add search methodology. Document databases searched, search terms, inclusion/exclusion criteria, and screening process for each survey.

  4. Deepen linguistic analysis with worked examples. Annotate 2–3 actual prompts with RST structure, speech act analysis, and Gricean evaluation to demonstrate that the theoretical frameworks are not merely applied decoratively.

  5. Commission a counter-thesis agent. In future iterations of the multi-agent protocol, include an agent specifically tasked with building the strongest case against the central thesis.

  6. Add a failure-conditions research question. "Under what task, model, and user conditions does imperative decomposition outperform interrogative decomposition?" is essential for bounding the thesis.

Minor (Nice to Fix)

  1. Standardise bibliography format across all documents.

  2. Reduce cross-document repetition in the synthesis.

  3. Provide a glossary for coined terminology (algorithmic monologism, interrogative turn, context engineering, etc.).

  4. Fix incomplete attributions in the computational linguistics survey (papers 7, 17, 19, 20, etc. in the Sources section).

  5. Engage Gadamer's actual arguments (prejudice, horizon-fusion, effective-historical consciousness) rather than just citing the hermeneutic circle.


Counter-Thesis

The Steel-Man Argument Against the "Instruction-to-Inquiry Shift"

The strongest argument against the interrogative turn thesis runs as follows:

The "shift" is an artefact of observer bias, not a phenomenon in the field. The authors have identified a progression from static to dynamic systems—which is real—and interpreted it through the lens of "instruction vs. inquiry"—which is imposed. The actual technical evolution is toward greater automation and optimisation, not toward more question-asking. DSPy, the most influential recent framework, moves away from natural-language prompting entirely—toward programmatic compilation. LATS and ToolChain*, the performance frontier for decomposition, are tree-search algorithms, not dialogues. The systems positioned as "conversational" (CAMEL, AutoGen, ChatDev) use dialogue as an implementation mechanism, not as an epistemological stance. Their inter-agent "conversations" are optimisation procedures dressed in natural language—they are no more "inquiries" than a genetic algorithm's crossover operations are "sexual reproduction."

The performance evidence favours structure, not inquiry. The highest scores on every major benchmark (SWE-bench, AgentBench, WebArena) are achieved by structured, imperative systems with carefully designed pipelines—not by systems that "ask clarifying questions." The "conversational decomposition" systems cited (ACT, FATA, Tri-Agent) are all 2025 preprints that have not demonstrated superiority on standard benchmarks. Until interrogative systems consistently outperform imperative ones on hard benchmarks, the "shift" is aspirational, not empirical.

The linguistic analysis conflates syntax with semantics. Whether a prompt is phrased as "Summarize X" or "What are the key points of X?" is a surface-level syntactic variation. Instruction-tuned LLMs process both through the same attention mechanisms and produce functionally similar outputs. Leidner & Plachouras (2023)—cited by the survey itself—showed that linguistic form does not reliably predict output quality. The illocutionary force distinction is philosophically interesting but may have no measurable computational consequence.

The philosophical arguments are unfalsifiable. Claims about "phenomenological reorientation" and "epistemic co-construction" cannot be tested against the technical evidence. They are interpretive overlays that could be applied to any change in AI interaction design. The fact that the philosophy of dialogue has concepts (Buber's I-Thou, Bakhtin's polyphony, Gadamer's hermeneutic circle) that can be mapped onto technical systems does not mean those systems instantiate those concepts. The "convergences" identified are analogies, not isomorphisms.

The Indigenous framing obscures the actual power dynamics. The invitation to see conversational prompting as "relational" and "ceremonial" (in Wilson's sense) obscures the fact that LLM-based systems are products of massive data extraction, corporate control, and environmental cost. Framing them through Indigenous epistemology without addressing these material conditions risks legitimising extractive technology through the language of relationality—precisely the kind of co-optation that Indigenous scholars warn against.

What would falsify the thesis? If the interrogative turn is real, we should expect: (a) interrogative systems consistently outperforming imperative ones on controlled benchmarks; (b) measurable differences in output epistemological quality (not just accuracy) as a function of prompt illocutionary force; (c) user studies showing that inquiry-based interaction produces different epistemic engagement. None of these has been demonstrated. Until they are, the thesis remains a compelling interpretive framework—not an empirical finding.


Alternative Focal Works

The literature review selects González Arocha (2025) as its focal work. Here are three alternatives that might have been more productive:

1. Djeffal (2025), "Reflexive Prompt Engineering" — FAccT 2025

Case for: Published at a top-tier interdisciplinary venue (FAccT) with peer review. Provides an actionable five-component framework that bridges philosophy and engineering. Already cited by the review but subordinated to González Arocha. Djeffal's "reflexivity" concept is more operationalisable than González Arocha's "mediating space" and more directly connected to the technical literature (it could be implemented in DSPy, TextGrad, or a decomposition engine). The FAccT audience guarantees engagement with both CS and ethics communities.

Case against: Narrower in scope—focuses on responsible prompt practice, not on the instruction-to-inquiry shift specifically. Less philosophically ambitious.

Verdict: Would have produced a more technically grounded and actionable review, at the cost of some philosophical depth.

2. Krause & Vossen (2024), "The Gricean Maxims in NLP" — INLG 2024

Case for: Provides the definitive current mapping between pragmatic theory and NLP practice. Published in a peer-reviewed NLP venue. Directly bridges linguistics and technical AI. The Gricean framework is empirically testable (unlike phenomenological claims) and offers specific, falsifiable predictions about prompt effectiveness. Would anchor the review in a framework the computational community can engage with.

Case against: Purely linguistic—lacks philosophical depth and says nothing about Indigenous epistemology or phenomenology.

Verdict: Would have produced a more empirically grounded and falsifiable review, at the cost of philosophical and ethical dimensions.

3. The compositionality gap papers: Press et al. (2023) + Khot et al. (2023) together

Case for: The compositionality gap is the strongest empirical evidence for why decomposition (and specifically interrogative decomposition) works. Press et al.'s finding that self-ask (interrogative) outperforms linear reasoning (imperative) is the most concrete evidence for the instruction-to-inquiry shift. Combined with Khot et al.'s DecomP framework, this provides both theoretical and practical foundations. Both are published at EMNLP/ICLR.

Case against: Narrowly technical—would lose the philosophical and Indigenous dimensions entirely.

Verdict: Would have produced a tighter, more defensible technical argument, at the cost of the interdisciplinary ambition that is the review's distinguishing feature.

Overall assessment: González Arocha is the riskiest choice (least-established author, regional journal, hardest to test) but also the most ambitious (only work that treats prompting as inherently philosophical). The review should either justify this choice more explicitly or consider dual focal works—pairing González Arocha with Krause & Vossen to balance philosophical ambition with empirical grounding.


Revised Research Questions

Based on this critique, I propose five sharper, more novel, and more testable research questions:

RQ-A: Under What Conditions Does Prompt Illocutionary Force Predict Task Performance?

A controlled, falsifiable reformulation of the original RQ1.

Given Leidner & Plachouras's (2023) finding that linguistic form does not reliably predict output quality, and the survey's claim that illocutionary force is computationally consequential: For which task types, model architectures, and complexity levels does the choice between imperative and interrogative framing produce statistically significant differences in (a) task accuracy, (b) output diversity, (c) explanation quality, and (d) error type distribution? This explicitly tests the boundary conditions of the thesis rather than assuming its validity.

Methods: Factorial experiment: {task type: factual, analytical, creative, multi-step} × {prompt form: imperative, interrogative, mixed} × {model: at least 3 different architectures} × {complexity: low, medium, high}. N ≥ 200 prompts per cell. Evaluate on standard + epistemological metrics.
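
To make the design concrete, here is a minimal sketch of the factorial enumeration in Python; `run_model` and `sample_prompt` are hypothetical placeholders (no such harness exists in the reviewed work), and the scoring stage is indicated only in comments:

```python
# Sketch of the RQ-A factorial design. run_model() and sample_prompt()
# are hypothetical placeholders; the factor levels come from the design above.
import itertools

TASK_TYPES = ["factual", "analytical", "creative", "multi-step"]
PROMPT_FORMS = ["imperative", "interrogative", "mixed"]
MODELS = ["model_a", "model_b", "model_c"]  # >= 3 architectures
COMPLEXITY = ["low", "medium", "high"]
N_PER_CELL = 200  # minimum prompts per cell

def run_model(model: str, prompt: str) -> str:
    """Placeholder for an actual model call."""
    return f"[output of {model}]"

def sample_prompt(task: str, form: str, complexity: str) -> str:
    """Placeholder: draw from a pre-built, balanced prompt bank."""
    return f"[{form} {task} prompt, complexity={complexity}]"

results = []
for task, form, model, complexity in itertools.product(
    TASK_TYPES, PROMPT_FORMS, MODELS, COMPLEXITY
):
    for _ in range(N_PER_CELL):
        prompt = sample_prompt(task, form, complexity)
        output = run_model(model, prompt)
        results.append({
            "cell": (task, form, model, complexity),
            "prompt": prompt,
            "output": output,
            # Downstream: score accuracy, output diversity, explanation
            # quality, and error type; then fit a mixed-effects model with
            # prompt form as the factor of interest.
        })
```

The design deliberately makes "no effect of prompt form" a possible outcome, which is the point: it tests the thesis rather than assuming it.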

RQ-B: Can Pragmatic Well-Formedness of Decomposition Predict Task Failure Before Execution?

A tightened version of the original RQ6 with a specific falsifiable hypothesis.

Hypothesis: Decomposition plans that violate Gricean maxims (as operationalised by Krause & Vossen, 2024) at the sub-task specification level will fail at statistically higher rates than pragmatically well-formed decompositions, controlling for task difficulty and model capability.

Methods: Annotate 500+ decomposition traces from AgentBench and SWE-bench for Gricean maxim violations at each decomposition step. Build a logistic regression model predicting task failure from pragmatic features. Test predictive power against baseline models using only task difficulty and model size.
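
A minimal sketch of the proposed analysis, assuming the maxim-violation counts come from manual annotation; the data rows below are hypothetical stand-ins for the 500+ annotated traces, and scikit-learn supplies the regression:

```python
# Sketch of the RQ-B analysis: predict task failure from Gricean-violation
# features of each decomposition trace, controlling for difficulty and
# model size. All data values are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# One row per annotated trace:
# [quantity_violations, quality_violations, relation_violations,
#  manner_violations, task_difficulty, log_model_size]
X = np.array([
    [2, 0, 1, 3, 0.7, 10.5],
    [0, 0, 0, 0, 0.7, 10.5],
    [1, 1, 0, 2, 0.3, 9.8],
    [0, 0, 1, 0, 0.3, 9.8],
    # ... 500+ annotated traces in the real study
])
y = np.array([1, 0, 1, 0])  # 1 = task failed

full = LogisticRegression(max_iter=1000)       # pragmatic features + controls
baseline = LogisticRegression(max_iter=1000)   # controls only

full_auc = cross_val_score(full, X, y, cv=2, scoring="roc_auc").mean()
base_auc = cross_val_score(baseline, X[:, 4:], y, cv=2, scoring="roc_auc").mean()
print(f"Pragmatic features add {full_auc - base_auc:+.3f} AUC over baseline")
```

The hypothesis is falsified if the pragmatic features add no predictive power over the difficulty-and-capability baseline.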

RQ-C: Does Multi-Agent Architectural Diversity Produce Output Diversity Beyond Single-Model Variance?

A deflationary reformulation of the original RQ4 that makes the Bakhtinian critique testable.

Multi-agent systems (CAMEL, AutoGen, ChatDev) are claimed to produce "dialogical" decomposition. But is the output diversity of N agents using the same underlying model statistically greater than that of N independent samples from that model? If not, multi-agent "dialogue" is computationally equivalent to repeated sampling—"algorithmic monologism" with extra steps. If it is greater, what architectural features (role differentiation, communication protocols, disagreement mechanisms) contribute to genuine diversity?

Methods: Compare output distributions (semantic similarity, solution strategy diversity, error pattern diversity) across: (a) N independent samples from one model, (b) N agents with different roles but same model, (c) N agents with different models. Information-theoretic measures of diversity (entropy, mutual information).
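
As a sketch of the information-theoretic comparison, assuming the outputs have already been embedded with a sentence encoder (the random arrays below are placeholders for those embeddings):

```python
# Sketch of the RQ-C diversity comparison across the three conditions.
# The embeddings are random placeholders; in the real study they would
# come from encoding actual model outputs.
from collections import Counter
import math

import numpy as np
from sklearn.cluster import KMeans

def strategy_entropy(embeddings: np.ndarray, n_clusters: int = 8) -> float:
    """Cluster outputs into solution strategies and return the Shannon
    entropy (bits) of the cluster distribution: higher = more diverse."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

rng = np.random.default_rng(0)
independent_samples  = rng.normal(size=(64, 384))  # (a) N samples, one model
role_differentiated  = rng.normal(size=(64, 384))  # (b) N agents, same model
heterogeneous_models = rng.normal(size=(64, 384))  # (c) N agents, N models

for name, emb in [("(a) independent samples", independent_samples),
                  ("(b) same-model agents", role_differentiated),
                  ("(c) different-model agents", heterogeneous_models)]:
    print(name, f"{strategy_entropy(emb):.2f} bits")
```

If condition (b) does not exceed condition (a) on such measures, the Bakhtinian "polyphony" claim reduces to repeated sampling, exactly as the deflationary reformulation predicts.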

RQ-D: What Is the Cognitive Load Profile of Conversational vs. Structured Decomposition for Human Users?

Fills the cognitive science gap entirely absent from the current review.

If conversational decomposition requires humans to engage in multi-turn dialogue to specify tasks, while structured decomposition requires completing forms or writing specifications: What are the comparative cognitive load profiles (intrinsic, extraneous, germane) of these interaction modes, and how do they interact with user expertise, task complexity, and time pressure?

Methods: Within-subjects experiment with eye tracking, NASA-TLX, and think-aloud protocols. Users perform identical tasks through (a) structured interface (JSON/form-based), (b) conversational interface (dialogue-based), (c) hybrid. Measure completion time, error rate, cognitive load, and user satisfaction.
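
A minimal sketch of the paired analysis for the workload measure; the NASA-TLX scores below are hypothetical placeholders for illustration only:

```python
# Sketch of the RQ-D within-subjects analysis: paired comparison of
# NASA-TLX composite workload scores across two interface conditions.
import numpy as np
from scipy import stats

# One NASA-TLX composite score (0-100) per participant per condition;
# the same participants appear in both arrays (within-subjects design).
tlx_structured     = np.array([55, 62, 48, 70, 66, 59, 51, 63])
tlx_conversational = np.array([49, 58, 50, 61, 60, 57, 46, 55])

# Paired t-test; a Wilcoxon signed-rank test is the non-parametric
# alternative if the difference scores are not approximately normal.
t, p = stats.ttest_rel(tlx_structured, tlx_conversational)
diff = tlx_structured - tlx_conversational
d = diff.mean() / diff.std(ddof=1)  # within-subjects effect size
print(f"t = {t:.2f}, p = {p:.3f}, d = {d:.2f}")
```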

RQ-E: A Formal Discourse Grammar of Well-Formed Prompt Decomposition

Combines the original RQ9 with a specific formal target.

Can a context-free or mildly context-sensitive grammar be induced from a corpus of successful prompt decompositions, and does compliance with this grammar predict decomposition success on held-out tasks? Extend RST annotation (Zeldes et al., 2025) with question-semantic types (Hamblin, 1973) to create a decomposition-specific annotation scheme. Induce grammar from annotated corpus. Test predictive validity.

Methods: (1) Annotate 300+ successful decompositions from diverse benchmarks using extended RST + question-semantic categories. (2) Induce grammar via grammar induction algorithms. (3) Annotate 100+ failed decompositions. (4) Test whether grammar violations discriminate success from failure. (5) Implement grammar checker as a decomposition validation layer in a PDE system.
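
To illustrate step (5), here is a toy sketch of a grammar checker over categorised decomposition traces; the category labels and the grammar itself are hypothetical illustrations of the proposed annotation scheme, not an induced grammar:

```python
# Sketch of step (5): a toy grammar checker over decomposition traces.
# The nonterminals and rules are hypothetical; the real grammar would be
# induced from the annotated corpus described in steps (1)-(2).
import nltk

# Each decomposition step is first mapped (by an annotator or classifier)
# to a category token; the grammar encodes well-formed step sequences.
decomposition_grammar = nltk.CFG.fromstring("""
    S      -> SETUP BODY VERIFY
    SETUP  -> 'clarify_q' | 'restate'
    BODY   -> STEP BODY | STEP
    STEP   -> 'sub_question' 'answer' | 'instruction'
    VERIFY -> 'check_q' 'answer' | 'self_ask'
""")
parser = nltk.ChartParser(decomposition_grammar)

def is_well_formed(trace: list[str]) -> bool:
    """True if the categorised trace parses under the grammar."""
    return any(True for _ in parser.parse(trace))

ok_trace = ["clarify_q", "sub_question", "answer", "check_q", "answer"]
bad_trace = ["instruction", "instruction"]  # no setup, no verification
print(is_well_formed(ok_trace))   # True
print(is_well_formed(bad_trace))  # False
```

Wiring such a checker into a PDE system as a validation layer would make the grammar's predictive claim directly testable in deployment.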


Summary Scorecard

| Dimension | Score (1–10) | Brief Justification |
| --- | --- | --- |
| Scholarly Rigor | 5 | Multiple citation concerns, evidence hierarchy violations, incomplete attributions |
| Argumentative Coherence | 6 | Strong architecture, thesis underdetermined by evidence, some false equivalences |
| Disciplinary Depth/Breadth | 6 | Good technical coverage, linguistics broad but shallow, philosophy uneven |
| Critical Omissions | 5 | Missing cognitive science, STS, HCI, key papers (Bender, Ouyang, Shanahan) |
| Methodological Concerns | 5 | Focal work selection unjustified, MECE violations, echo-chamber risk, no search methodology |
| Writing Quality | 7 | Clear and well-paced, some jargon inflation and rhetorical overreach |
| Indigenous Epistemology | 4 | Single-scholar dependence, instrumentalisation risk, insufficient Indigenous voices |
| Research Questions | 7 | RQ2, RQ6, RQ9 are strong; missing failure-conditions and user-diversity questions |

Composite Score: 5.6 / 10 — Promising but requiring substantial revision before publication.


This critique was produced on April 6, 2026, as part of the IAIP Polyphonic Discussion research protocol (RCH-CTX-Polyphonic-discussion--2604060040). It is intended to strengthen the work, not to discourage the research programme. The cross-disciplinary ambition of this project is its greatest asset; the execution needs to match that ambition.