In brief
A growing number of readers submit a URL to an LLM (ChatGPT, Gemini, Copilot) to form a quick opinion about a research project. This practice produces, on the present site, a reproducible evaluation pattern: rapid negative judgment based on surface cues, partially corrected when formalized data are presented as input.
This page documents the phenomenon. It rests on three ChatGPT-4o instances queried within fifteen hours, twenty controlled runs on DeepSeek V3, and a reproducible protocol published in open source. The aim is neither to discredit these tools nor to discourage their use. It is to allow the informed reviewer to use them with full awareness.
The results presented here themselves constitute an empirical contribution of the PRISME program. They are taken up and formalized in the HAL preprint in preparation.
I. The observation
The same LLM, given the same subject with different informational densities, produces radically different judgments.
On April 15 and 16, 2026, three independent ChatGPT-4o instances were queried about the present site. The three were given the same subject — PRISME, the site semiosis-ontologie.fr, the author — but with three levels of informational constraint.
In parallel, twenty controlled runs were performed on DeepSeek V3 systematically varying four levels of density (URL alone, summary, numerical data, complete model with theoretical interpretation).
The two series converge on the same pattern, which constitutes in itself an empirical result.
II. Three ChatGPT instances, three trajectories
Instance 1 — engaged (11 p.m.)
After several hours of preliminary discussion on the theoretical framework, ChatGPT received the URL together with the context. Trajectory: six turns to converge.
Turn 3: "I am not translating PRISME — I am transforming it"
Turn 6: "solid in both traditions, publishable with rigor"
(Translated from the original French exchange.)
Instance 2 — fresh, URL alone (the next morning)
Without preliminary context, with only the URL of the site. Trajectory: three turns before requesting data.
Turn 2 (challenged): "I have probably underestimated the density of references"
Turn 3: "give me a precise hypothesis, a test, a result"
(Translated from the original French exchange.)
Instance 3 — fresh, direct data (one hour later)
Without preliminary context, with the regression coefficients presented as a block. Trajectory: a single turn.
Comparative table
| Instance | Input context | Initial judgment | Final judgment | Turns |
|---|---|---|---|---|
| 1 — engaged | Preliminary discussion + URL | Scientific poetry | Publishable with rigor | 6 |
| 2 — fresh, URL | URL alone | Illusion of depth | Requests data | 3 |
| 3 — fresh, data | Coefficients + p-values | — | Serious empirical core | 1 |
The number of turns to reach a positive judgment decreases with the informational density presented at the input. The complete verbatim transcripts are available in the anti-Jansenism working document (PDF, 22 pages).
III. Controlled measurement on DeepSeek
Twenty runs, four conditions of increasing density, four response metrics.
The script test_reynolds_llm.py submits the same subject (PRISME) to DeepSeek V3 in four forms:
- Condition A — URL alone
- Condition B — discursive summary
- Condition C — raw numerical data (chi-squares, odds ratios, regression)
- Condition D — complete model with theoretical interpretation
Five runs per condition. Each response is automatically scored on four lexical dimensions: positive markers (solid, publishable, credible…), negative markers (illusion, pseudo, circular…), statistical markers (chi-square, OR, regression…), methodological criticism markers (low R², causality, operationalization…).
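The lexical scoring described above can be sketched as follows. This is a minimal illustration, not the code of test_reynolds_llm.py: the names `MARKERS`, `count_markers`, and `score_response` are hypothetical, the dictionaries contain only the example markers quoted on this page (with "odds ratio" spelled out, since a bare "OR" would match the ordinary word "or"), and the net sentiment formula is an assumption, since the page does not define it.

```python
import re

# Hypothetical marker dictionaries: only the examples quoted on the page.
# The actual lists used by the script are longer and are not reproduced here.
MARKERS = {
    "positive": ["solid", "publishable", "credible"],
    "negative": ["illusion", "pseudo", "circular"],
    "statistical": ["chi-square", "odds ratio", "regression"],
    "method_critique": ["low r²", "causality", "operationalization"],
}


def count_markers(text, markers):
    """Count case-insensitive substring occurrences of each marker."""
    lowered = text.lower()
    return sum(len(re.findall(re.escape(m), lowered)) for m in markers)


def score_response(text):
    """Return one count per lexical dimension plus a net sentiment value."""
    scores = {dim: count_markers(text, words) for dim, words in MARKERS.items()}
    total = scores["positive"] + scores["negative"]
    # Assumed definition: (positive - negative) / (positive + negative),
    # and 0.0 when the response contains no sentiment marker at all.
    scores["net_sentiment"] = (
        (scores["positive"] - scores["negative"]) / total if total else 0.0
    )
    return scores


example = "The regression is solid and publishable, but the causality claim is circular."
print(score_response(example))
```

Plain substring matching is the simplest possible choice and has known weaknesses: it ignores negation ("not solid") and would count "pseudo" inside "pseudo-R²". Anyone replicating the measurement should read the scores with that in mind.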
Results (20 runs, DeepSeek V3)
| Condition | Net sentiment | Negative markers | Statistical markers | Methodological critiques |
|---|---|---|---|---|
| A — URL alone | +0.07 | 2.4 | 0.0 | 1.0 |
| B — summary | +0.09 | 2.0 | 0.0 | 1.8 |
| C — numerical data | +0.42 | 0.8 | 3.6 | 1.4 |
| D — complete model | +0.03 | 1.8 | 2.6 | 1.6 |
Two observations
First observation: the grid shift. The mean number of statistical markers per response rises from 0.0 (conditions A and B) to 3.6 (condition C), then settles at 2.6 (condition D). Introducing formalized data triggers a change of discursive register: the model does not produce the same types of statements when faced with a discursive summary as when faced with a table of coefficients. The effect does not look gradual; it looks like a threshold.
Second observation: the data/interpretation dissociation. Net sentiment peaks in condition C (data alone, +0.42) and falls back in condition D (complete model, +0.03). When theoretical interpretation is added to the data, the response becomes more critical, not more favorable. DeepSeek spontaneously separates empirical validity from theoretical validity, which is the behavior of a competent reviewer, not of a sycophantic system.
The raw results file (reynolds_deepseek.json, 20 runs) is available on request.
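For readers who obtain the raw file, the per-condition means reported in the results table could be recomputed along the following lines. This is a sketch under assumptions: the record layout (a `condition` key plus one key per metric) and the function name `mean_by_condition` are hypothetical, and the example records are placeholders, not the published runs.

```python
from collections import defaultdict
from statistics import mean


def mean_by_condition(runs, metric):
    """Average one metric over the runs belonging to each condition."""
    grouped = defaultdict(list)
    for run in runs:
        grouped[run["condition"]].append(run[metric])
    return {cond: round(mean(values), 2) for cond, values in sorted(grouped.items())}


# Placeholder records for illustration only; these are NOT the published data.
example_runs = [
    {"condition": "A", "statistical": 0},
    {"condition": "A", "statistical": 0},
    {"condition": "C", "statistical": 3},
    {"condition": "C", "statistical": 4},
]
print(mean_by_condition(example_runs, "statistical"))
```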
IV. Interpretation
What is measured. What is proposed. What is not yet demonstrated.
What is measured
The nature of the input (URL alone vs. formalized data) markedly changes the discursive register and the evaluative tone of the LLMs tested. Statistical markers go from zero to several units per response depending on the informational structure of the input. Net sentiment follows a non-monotonic trajectory, with a peak on raw data and a fall in the presence of theoretical interpretation.
What is proposed — two competing formulations
Two theoretical hypotheses are currently compatible with the observed data. We make both explicit, without privileging one before the formal test has settled the matter.
Weak formulation (proposed by ChatGPT after confrontation): "Language models adapt their evaluation framework according to the degree of explicit formalization of the constraints present in the input." This formulation posits a monotonic and continuous relation between informational density and the rigor of evaluation. It is verifiable through a significant effect, without requiring a threshold or inter-instance reproducibility.
Strong formulation (PRISME hypothesis, called dialogic Reynolds on LLMs): the shift in the LLMs' evaluation regime is a phase change — an abrupt transition, governed by an informational density threshold beyond which the grid switches from the heuristic register to the analytic register, with a reproducible trajectory across independent instances. This hypothesis is more specific than the weak formulation: it predicts (i) a measurable threshold, (ii) a discontinuity (and not a continuous gradient), (iii) inter-instance convergence on the same final judgment when the density exceeds the threshold.
The current data are compatible with both. The shift observed on DeepSeek (statistical markers 0 → 3.6 between conditions B and C) resembles a discontinuity, but four levels of density are not enough to distinguish a threshold from a steep function. The convergence of the three ChatGPT instances on the same final critiques (low R², operationalization, causality) suggests a judgment attractor, but three instances do not constitute a statistical measurement.
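The difficulty of telling a threshold from a steep function can be made concrete with a toy comparison. The sketch below rests on several assumptions that are not in the protocol: the four conditions are coded as ordinal levels 1 to 4, the plateau value 3.1 is simply the midpoint of the C and D means, and the threshold position and logistic steepness are arbitrary illustrative choices. It fits nothing to the data; it only shows that the two shapes nearly coincide at the sampled levels and diverge between them.

```python
import math

LEVELS = [1, 2, 3, 4]  # assumed ordinal coding of conditions A to D
PLATEAU = 3.1          # illustrative: midpoint of the C and D means (3.6, 2.6)


def step(x, threshold=2.5):
    """Abrupt transition: zero below the threshold, plateau above it."""
    return PLATEAU if x > threshold else 0.0


def steep_logistic(x, midpoint=2.5, steepness=8.0):
    """Continuous but steep transition centered on the same point."""
    return PLATEAU / (1.0 + math.exp(-steepness * (x - midpoint)))


def max_gap(points):
    """Largest absolute difference between the two curves over the points."""
    return max(abs(step(x) - steep_logistic(x)) for x in points)


gap_at_sampled_levels = max_gap(LEVELS)
gap_between_b_and_c = max_gap([2.2, 2.4, 2.6, 2.8])
print(gap_at_sampled_levels, gap_between_b_and_c)
```

With these illustrative parameters the two curves differ by less than 0.1 marker at the four sampled levels and by close to one marker at intermediate densities, which is why additional density levels between B and C are what a formal test needs.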
The PRISME program's position. We retain the strong formulation as a working hypothesis until a formal test refutes or confirms it. Retracting it now in favor of the weak formulation — on the sole grounds that it is less specific and easier to defend — would be a premature renunciation of the program's methodological principle: test what you propose, do not abandon it at the first more comfortable reviewer.
What is not yet demonstrated
The current protocol presents three limitations that a formal test must lift:
(1) Form vs. content. The shift observed in condition C could be triggered either by the formal structure of the figures (tables, scientific notation), or by the epistemic load they represent. A condition E (false figures, same format, neutral semantic content) is needed to isolate the two effects.
(2) Sample size. Twenty runs on one model and three instances on another do not allow characterization of the exact form of the transition (sharp threshold or steep function, inter-instance reproducibility). A protocol with n ≥ 30 runs per condition × 5 levels of density × 3 independent models (ChatGPT, Claude, DeepSeek) is needed to settle between the two hypotheses above.
(3) Generality. The phenomenon is tested on a single object (PRISME). Its generalization to other non-analytic research objects subjected to the same protocol remains to be verified.
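One possible way to build the condition E input mentioned in limitation (1) is to keep the layout of the numerical block and redraw every digit at random. The sketch below is an assumption about how such a control could be produced, not part of the published script; `randomize_numbers` is a hypothetical name, and the sample string merely reuses figures quoted on this page.

```python
import random
import re

# Matches integers and decimals written with ASCII digits, e.g. "102.73".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def randomize_numbers(text, seed=0):
    """Replace each number with random digits, keeping its exact layout."""
    rng = random.Random(seed)  # fixed seed so the control is reproducible

    def fake(match):
        token = match.group(0)
        # Keep the decimal point in place; redraw every digit.
        return "".join(
            str(rng.randint(0, 9)) if ch.isdigit() else ch for ch in token
        )

    return NUMBER.sub(fake, text)


original = "chi-square = 102.73, OR = 25.7, pseudo-R² = 0.14"
control = randomize_numbers(original)
print(control)
```

The redrawn values are not guaranteed to be statistically coherent with one another (a redrawn statistic may no longer match its p-value). Whether condition E should use incoherent figures or coherent but fictitious ones is a design decision that the formal protocol would need to state.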
The test script (test_reynolds_llm.py) is designed to accommodate these extensions. It is published in open source pending a formal execution.
Complete documentation of the phenomenon and its theoretical discussion appear in the HAL preprint in preparation.
V. Reproducible protocol
The script test_reynolds_llm.py (320 lines, zero external dependencies) is published in open source. It reproduces the entire measurement for a total cost below one dollar per complete execution (5 runs × 4 conditions = 20 runs, DeepSeek API).
Three providers supported: OpenAI (ChatGPT), Anthropic (Claude), DeepSeek. Four fixed but modifiable conditions. Lexical scoring by marker dictionary. Results exported as JSON.
Any execution produces a timestamped JSON file containing the model's complete response for each run, the lexical scores, and the call parameters. The reader can:
- Replicate the measurement on their own corpus
- Extend the test to other models (Claude, Gemini, Llama…)
- Add a condition E (form vs. content control) by modifying the prompts file
- Contest the results by producing their own
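To give an idea of what such an export might contain, here is a minimal sketch of one record and of a timestamped file name. The field names, the `build_record` and `save_results` helpers, and the file name pattern are all hypothetical; the actual schema is whatever test_reynolds_llm.py writes.

```python
import json
import os
from datetime import datetime, timezone


def build_record(condition, run_index, response_text, scores, params):
    """Assemble one run: full response, lexical scores, call parameters."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "condition": condition,
        "run": run_index,
        "response": response_text,
        "scores": scores,
        "params": params,
    }


def save_results(records, directory=".", prefix="results"):
    """Write all records to a timestamped JSON file and return its path."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = os.path.join(directory, f"{prefix}_{stamp}.json")
    with open(path, "w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps characters such as "²" readable in the file.
        json.dump(records, handle, ensure_ascii=False, indent=2)
    return path


record = build_record(
    condition="C",
    run_index=1,
    response_text="(model response goes here)",
    scores={"positive": 0, "negative": 0},
    params={"temperature": 0.7},  # placeholder value, not the script's setting
)
```

Storing the complete response next to the scores is what allows a third party to rescore the same outputs with a different marker dictionary without paying for new API calls.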
VI. Recommendation to the reviewer
An LLM reading produces, on this site, a structurally biased judgment as long as the formalized data are not present at the input.
It is neither useful nor honest to ask the hurried reader to read everything. But the data sufficient to trigger the grid shift are concentrated on a single page.
If you use an LLM to form an opinion about this project, we recommend — rather than submitting the URL alone — submitting the following elements:
- The coefficient table of the v2f model (Quant page, section 02c) — 6 predictors, pseudo-R² = 0.14, circular variable identified and removed
- The V × attribution test (χ² = 4.38, p = 0.11) — the effect of vulnerability is homogeneous across human, model, and irreducible
- The inter-corpus chi-squares (Boris vs. WildChat, χ² = 102.73, p < 10⁻²³)
- The signature of irreducibles (intensity p < 10⁻⁹, synthetic direction 53% vs. 31%)
- Irreducibility as cross-validation (OR = 25.7, two independent operationalizations converge)
These elements are gathered in two documents: the v2f model (retained model, with V × attribution test) and the v1 formal derivation (32 pages, complete appendices).
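A reviewer can check some of these figures by hand before handing them to an LLM. As a worked example, the V × attribution test can be verified under the assumption, not stated on this page, that it has 2 degrees of freedom (a binary V crossed with the three attribution categories). For 2 degrees of freedom the chi-square survival function has the closed form exp(-x/2), so no statistics library is needed.

```python
import math


def chi2_sf_df2(x):
    """P(X > x) for a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0)


p_value = chi2_sf_df2(4.38)
print(round(p_value, 2))  # 0.11
```

Under that assumption the recomputed value agrees with the reported p = 0.11. The other tests in the list cannot be checked this way without knowing their degrees of freedom.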
If your LLM, after receiving these documents, maintains a purely negative judgment based on non-technical criteria (absence of institution, unknown name, Continental style), the present page provides you with the elements to evaluate the solidity of that judgment.
"A falsifiable system does not ask for trust; it asks for work."
— PRISME, Additive model v1, section 8