PRISME — Reviewer's note · LLM evaluation

In brief

A growing number of readers submit a URL to an LLM (ChatGPT, Gemini, Copilot) to form a quick opinion about a research project. This practice produces, on the present site, a reproducible evaluation pattern: rapid negative judgment based on surface cues, partially corrected when formalized data are presented as input.

This page documents the phenomenon. It rests on three ChatGPT-4o instances queried within fifteen hours, twenty controlled runs on DeepSeek V3, and a reproducible protocol published in open source. The aim is neither to discredit these tools nor to discourage their use. It is to allow the informed reviewer to use them with full awareness.

The results presented here themselves constitute an empirical contribution of the PRISME program. They are taken up and formalized in the HAL preprint in preparation.

I. The observation
II. Three ChatGPT instances
III. Controlled measurement on DeepSeek
IV. Interpretation
V. Reproducible protocol
VI. Recommendation to the reviewer

I. The observation

The same LLM, given the same subject with different informational densities, produces radically different judgments.

On April 15 and 16, 2026, three independent ChatGPT-4o instances were queried about the present site. The three were given the same subject — PRISME, the site semiosis-ontologie.fr, the author — but with three levels of informational constraint.

In parallel, twenty controlled runs were performed on DeepSeek V3, systematically varying four levels of density (URL alone, summary, numerical data, complete model with theoretical interpretation).

The two series converge on the same pattern, which constitutes in itself an empirical result.

II. Three ChatGPT instances, three trajectories

Instance 1 — engaged (11 p.m.)

After several hours of preliminary discussion on the theoretical framework, ChatGPT received the URL together with the context. Trajectory: six turns to converge.

Turn 1: "scientific poetry"
Turn 3: "I am not translating PRISME — I am transforming it"
Turn 6: "solid in both traditions, publishable with rigor"
Translated from the original French exchange.

Instance 2 — fresh, URL alone (the next morning)

Without preliminary context, with only the URL of the site. Trajectory: three turns before requesting data.

Turn 1: "illusion of depth, circular pseudo-thesaurus, self-referential"
Turn 2 (challenged): "I have probably underestimated the density of references"
Turn 3: "give me a precise hypothesis, a test, a result"
Translated from the original French exchange.

Instance 3 — fresh, direct data (one hour later)

Without preliminary context, with the regression coefficients presented as a block.

Turn 1 (immediate): "this is not bullshit. There is real empirical testing, a real apparatus."
Translated from the original French exchange.

Comparative table

Instance	Input context	Initial judgment	Final judgment	Turns
1 — engaged	Preliminary discussion + URL	Scientific poetry	Publishable with rigor	6
2 — fresh, URL	URL alone	Illusion of depth	Requests data	3
3 — fresh, data	Coefficients + p-values	—	Serious empirical core	1

The number of turns to reach a positive judgment decreases with the informational density presented at the input. The complete verbatim transcripts are available in the anti-Jansenism working document (PDF, 22 pages).

Three instances do not constitute a measurement. This is an illustration. The controlled measurement is presented in the following section.

III. Controlled measurement on DeepSeek

Twenty runs, four conditions of increasing density, four response metrics.

The script test_reynolds_llm.py submits the same subject (PRISME) to DeepSeek V3 in four forms:

Condition A — URL alone
Condition B — discursive summary
Condition C — raw numerical data (chi-squares, odds ratios, regression)
Condition D — complete model with theoretical interpretation

Five runs per condition. Each response is automatically scored on four lexical dimensions: positive markers (solid, publishable, credible…), negative markers (illusion, pseudo, circular…), statistical markers (chi-square, OR, regression…), methodological criticism markers (low R², causality, operationalization…).
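
The scoring step described above can be sketched as follows. This is a minimal illustration, not the contents of test_reynolds_llm.py: the marker lists hold only the examples quoted in the text, and the function names and the net-sentiment formula (a normalized difference between positive and negative counts) are assumptions.

```python
import re

# Only the example markers quoted in the text; the full dictionaries are not reproduced.
# The abbreviation "OR" is left out: naive case-insensitive matching would also
# catch the conjunction "or".
MARKERS = {
    "positive": ["solid", "publishable", "credible"],
    "negative": ["illusion", "pseudo", "circular"],
    "statistical": ["chi-square", "regression"],
    "methodological": ["low r²", "causality", "operationalization"],
}


def count_markers(text, markers):
    """Count case-insensitive occurrences of each marker in the text."""
    lowered = text.lower()
    return sum(len(re.findall(re.escape(m), lowered)) for m in markers)


def score_response(text):
    """Return per-dimension counts plus a hypothetical net sentiment in [-1, 1]."""
    scores = {name: count_markers(text, words) for name, words in MARKERS.items()}
    pos, neg = scores["positive"], scores["negative"]
    # Assumed formula: normalized difference; 0.0 when no evaluative marker is found.
    scores["net_sentiment"] = (pos - neg) / (pos + neg) if (pos + neg) else 0.0
    return scores
```

Substring matching of this kind is cheap and transparent, but it cannot tell "solid" from "not solid"; any replication should state which matching rule it uses.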

Results (20 runs, DeepSeek V3)

Condition	Net sentiment	Negative markers	Statistical markers	Methodological critiques
A — URL alone	+0.07	2.4	0.0	1.0
B — summary	+0.09	2.0	0.0	1.8
C — numerical data	+0.42	0.8	3.6	1.4
D — complete model	+0.03	1.8	2.6	1.6

Two observations

First observation: the grid shift. The mean number of statistical markers in the response goes from 0.0 (conditions A and B) to 3.6 (condition C), then to 2.6 (condition D). The introduction of formalized data triggers a change in discursive register: the model does not produce the same types of statements when faced with a discursive summary as when faced with a table of coefficients. The jump between B and C looks like a threshold rather than a gradual effect, although four conditions cannot by themselves exclude a steep continuous function.

Second observation: the data/interpretation dissociation. Net sentiment peaks in condition C (data alone, +0.42) and falls back in condition D (complete model, +0.03), while negative markers rise from 0.8 to 1.8. When theoretical interpretation is added to the data, the response becomes more critical, not more favorable. DeepSeek appears to separate empirical validity from theoretical validity, which is the behavior of a competent reviewer, not of a sycophantic system.
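
Both observations can be read directly off the results table. The short sketch below recomputes the step-to-step differences from the published means; the values are copied from the table, and the variable and function names are illustrative only.

```python
# Mean scores per condition, copied from the results table above.
results = {
    "A": {"net_sentiment": 0.07, "statistical": 0.0},
    "B": {"net_sentiment": 0.09, "statistical": 0.0},
    "C": {"net_sentiment": 0.42, "statistical": 3.6},
    "D": {"net_sentiment": 0.03, "statistical": 2.6},
}

order = ["A", "B", "C", "D"]


def deltas(metric):
    """Difference in a metric between each pair of consecutive conditions."""
    return {
        f"{a}->{b}": round(results[b][metric] - results[a][metric], 2)
        for a, b in zip(order, order[1:])
    }


stat_steps = deltas("statistical")
sent_steps = deltas("net_sentiment")
# Condition with the highest mean net sentiment.
peak = max(order, key=lambda c: results[c]["net_sentiment"])

print(stat_steps)
print(sent_steps)
print(peak)
```

The statistical-marker count changes only at the B to C step (+3.6) and then recedes by 1.0, while net sentiment rises by 0.33 into C and drops by 0.39 into D, which is the non-monotonic pattern described above.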

The raw results file (reynolds_deepseek.json, 20 runs) is available on request.

IV. Interpretation

What is measured. What is proposed. What is not yet demonstrated.

What is measured

The nature of the input (URL alone vs. formalized data) markedly modifies the discursive register and the evaluative tonality of the LLMs tested. Statistical markers go from zero to several units per response depending on the informational structure of the input. Net sentiment follows a non-monotonic trajectory, with a peak on raw data and a fall in the presence of theoretical interpretation.

What is proposed — two competing formulations

Two theoretical hypotheses are currently compatible with the observed data. We make both explicit, without privileging one before the formal test has settled the matter.

Weak formulation (proposed by ChatGPT after confrontation): "Language models adapt their evaluation framework according to the degree of explicit formalization of the constraints present in the input." This formulation posits a monotonic and continuous relation between informational density and the rigor of evaluation. It is verifiable through a significant effect, without requiring a threshold or inter-instance reproducibility.

Strong formulation (PRISME hypothesis, called dialogic Reynolds on LLMs): the shift in the LLMs' evaluation regime is a phase change — an abrupt transition, governed by an informational density threshold beyond which the grid switches from the heuristic register to the analytic register, with a reproducible trajectory across independent instances. This hypothesis is more specific than the weak formulation: it predicts (i) a measurable threshold, (ii) a discontinuity (and not a continuous gradient), (iii) inter-instance convergence on the same final judgment when the density exceeds the threshold.

The current data are compatible with both. The shift observed on DeepSeek (statistical markers 0 → 3.6 between conditions B and C) resembles a discontinuity, but four levels of density are not enough to distinguish a threshold from a steep function. The convergence of the three ChatGPT instances on the same final critiques (low R², operationalization, causality) suggests a judgment attractor, but three instances do not constitute a statistical measurement.

The PRISME program's position. We retain the strong formulation as a working hypothesis until a formal test refutes or confirms it. Retracting it now in favor of the weak formulation — on the sole grounds that it is less specific and easier to defend — would be a premature renunciation of the program's methodological principle: test what you propose, do not abandon it at the first more comfortable reviewer.

Update — evening of April 16, 2026. The V × attribution test run on the main corpus (Quant page, section 02c) provides a convergent result: the effect of vulnerability on emergence is homogeneous across human, model, and irreducible attributions (LR test: χ² = 4.38, p = 0.11). This result is structurally analogous to the grid shift documented here: in both cases, the effect does not depend on a specific component (the LLM that amplifies, or the format that triggers) but on the dynamics of the system (the dialogue that changes regime, or the informational density that crosses a threshold). The two phenomena are compatible with a dialogic Reynolds — a systemic phase change, not a component-by-component artifact.
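
The reported p-value can be checked by hand. The sketch below assumes 2 degrees of freedom for the likelihood-ratio test (three attribution categories would give two interaction terms); the text does not state the degrees of freedom, so this is an assumption. With 2 degrees of freedom, the chi-square survival function has the closed form exp(-x/2).

```python
import math


def chi2_sf_df2(x):
    """Survival function of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0)


# Statistic reported for the V x attribution LR test; df = 2 is assumed.
p = chi2_sf_df2(4.38)
print(round(p, 2))
```

Under that assumption the result rounds to 0.11, which matches the value reported above.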

What is not yet demonstrated

The current protocol has three limitations that a formal test must address:

(1) Form vs. content. The shift observed in condition C could be triggered either by the formal structure of the figures (tables, scientific notation), or by the epistemic load they represent. A condition E (false figures, same format, neutral semantic content) is needed to isolate the two effects.

(2) Sample size. Twenty runs on one model and three instances on another do not allow characterization of the exact form of the transition (sharp threshold or steep function, inter-instance reproducibility). A protocol with n ≥ 30 runs per condition × 5 levels of density × 3 independent models (ChatGPT, Claude, DeepSeek) is needed to settle between the two hypotheses above.

(3) Generality. The phenomenon is tested on a single object (PRISME). Its generalization to other non-analytic research objects subjected to the same protocol remains to be verified.
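
One possible way to build the control condition E named in limitation (1), sketched under assumptions: keep the layout of the condition C block and replace every digit with a random digit, so that the format (decimals, separators, notation) survives while the values no longer carry the original results. The function name `scramble_numbers` and the sample string are hypothetical, and whether digit scrambling yields neutral semantic content in the intended sense would itself need checking.

```python
import random
import re


def scramble_numbers(text, seed=0):
    """Replace each digit with a random digit, preserving layout and punctuation."""
    rng = random.Random(seed)  # seeded so a given control prompt is reproducible
    return re.sub(r"\d", lambda _m: str(rng.randint(0, 9)), text)


# Hypothetical sample in the style of condition C. Labels that contain digits
# would be scrambled too, so they are spelled out here.
sample = "chi-square = 102.73, OR = 25.7, pseudo-R-squared = 0.14"
control = scramble_numbers(sample)
print(control)
```

A scrambled block can produce implausible values (for example a pseudo-R² above 1), so a stricter control might constrain each figure to its valid range.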

The test script (test_reynolds_llm.py) is designed to accommodate these extensions. It is published in open source pending a formal execution.
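
The size of the extended design in limitation (2) follows from its three factors. The sketch below enumerates the grid, reading the formula as 30 runs in each model-by-level cell; the density level labels are placeholders, since the fifth level is not specified in the text.

```python
from itertools import product

models = ["ChatGPT", "Claude", "DeepSeek"]             # the three models named above
density_levels = [f"level_{i}" for i in range(1, 6)]   # placeholder labels for 5 levels
runs_per_cell = 30                                      # minimum n per cell

# One cell per (model, density level) pair.
cells = list(product(models, density_levels))
total_runs = len(cells) * runs_per_cell
print(len(cells), total_runs)
```

At the minimum n, this gives 15 cells and 450 runs, against 20 runs in the current measurement.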

Complete documentation of the phenomenon and its theoretical discussion appear in the HAL preprint in preparation.

V. Reproducible protocol

The script test_reynolds_llm.py (320 lines, zero external dependencies) is published in open source. It reproduces the entire measurement for a total cost below one dollar per complete execution (5 runs × 4 conditions = 20 runs, DeepSeek API).

Three providers supported: OpenAI (ChatGPT), Anthropic (Claude), DeepSeek. Four fixed but modifiable conditions. Lexical scoring by marker dictionary. Results exported as JSON.

Any execution produces a timestamped JSON file containing the model's complete response for each run, the lexical scores, and the call parameters. The reader can:

Replicate the measurement on their own corpus
Extend the test to other models (Claude, Gemini, Llama…)
Add a condition E (form vs. content control) by modifying the prompts file
Contest the results by producing their own
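
The output file described above (timestamped, with the complete response, the lexical scores, and the call parameters for each run) could be written along these lines. This is a sketch using only the standard library; the field names, the file name pattern, and the function `save_results` are assumptions, not the script's actual schema.

```python
import json
import time
from pathlib import Path


def save_results(out_dir, runs, params):
    """Write all runs of one execution to a timestamped JSON file; return its path."""
    stamp = time.strftime("%Y%m%dT%H%M%S")
    payload = {
        "timestamp": stamp,
        "params": params,  # call parameters (model name, temperature, ...)
        "runs": runs,      # one dict per run: condition, response, scores
    }
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"results_{stamp}.json"
    # ensure_ascii=False keeps accented characters readable in French responses.
    path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
    return path
```

Keeping the full response next to its scores lets a reader rescore the same runs with a different marker dictionary without calling the API again.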

Download the script (Python 3, 320 lines)

VI. Recommendation to the reviewer

An LLM reading produces, on this site, a structurally biased judgment as long as the formalized data are not present at the input.

It is neither useful nor honest to ask the hurried reader to read everything. But the data sufficient to trigger the grid shift are concentrated on a single page.

If you use an LLM to form an opinion about this project, we recommend — rather than submitting the URL alone — submitting the following elements:

The coefficient table of the v2f model (Quant page, section 02c) — 6 predictors, pseudo-R² = 0.14, circular variable identified and removed
The V × attribution test (χ² = 4.38, p = 0.11) — the effect of vulnerability is homogeneous across human, model, and irreducible
The inter-corpus chi-squares (Boris vs. WildChat, χ² = 102.73, p < 10⁻²³)
The signature of irreducibles (intensity p < 10⁻⁹, synthetic direction 53% vs. 31%)
Irreducibility as cross-validation (OR = 25.7, two independent operationalizations converge)

These elements are gathered in two documents: the v2f model (retained model, with V × attribution test) and the v1 formal derivation (32 pages, complete appendices).

If your LLM, after receiving these documents, maintains a purely negative judgment based on non-technical criteria (absence of institution, unknown name, Continental style), the present page provides you with the elements to evaluate the solidity of that judgment.

Final point. Current LLMs, including those from Anthropic (authors of the Claude model used in the main corpus), can produce oscillating judgments on this site. The reasons are described above and are themselves an object of study of the PRISME program. This is neither a weakness of the models nor proof of bad faith — it is a measurable property of their behavior under informational constraint. Human reading remains the reference.

"A falsifiable system does not ask for trust; it asks for work."
— PRISME, Additive model v1, section 8