Briefing
This paper asks how cooperation in a spatial public goods game (PGG) changes when (i) players’ investment is heterogeneous and depends on both the reputation of group organizers and the population’s cooperation willingness, (ii) individuals’ payoffs are augmented by reputation through a weighted fitness function, and (iii) reputation evolves nonlinearly with potentially abrupt changes rather than monotonic drift. The question matters because cooperation dilemmas are central to evolutionary game theory and to real collective-action settings (public projects, institutions, online communities), where trust and reputation are known to be fragile, state-dependent, and capable of sudden collapse or recovery. Most prior reputation-based PGG models either treat reputation as monotonic or incorporate it in a limited way (e.g., only as a selection/interaction modifier). Here, reputation is integrated simultaneously into investment heterogeneity, into payoff evaluation, and into a nonlinear update rule, creating a richer mechanism for indirect reciprocity.
Methodologically, the authors implement an agent-based evolutionary game on a two-dimensional L × L lattice with periodic boundary conditions, using von Neumann neighborhoods (each player interacts with its four nearest neighbors). Each player participates in five groups (one centered on itself and one on each neighbor). The game is repeated for a fixed number of steps per run, and results are averaged over independent runs. Initial strategies are randomly assigned as cooperation (C) or defection (D). The public goods payoff from each group is computed with a synergy factor r: cooperators contribute a fixed investment and defectors contribute nothing, and the group's amplified contributions are shared equally among group members. The total game payoff is the sum of payoffs across the five groups. Reputation enters via a weighted "actual income" (fitness) function that combines the game payoff with reputation, where a weighting parameter controls how strongly reputation dominates fitness.
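A minimal sketch of the payoff and fitness computation, assuming unit contributions (the HIORC heterogeneous-investment mechanism is not modeled here), a toroidal lattice, and a convex payoff-reputation mix; the names `r` (synergy factor) and `delta` (reputation weight) and the exact fitness form are assumptions, since the excerpt's notation is elided:

```python
import numpy as np

def neighbor_sum(a):
    """Sum of array a over the four von Neumann neighbors (periodic torus)."""
    return (np.roll(a, 1, axis=0) + np.roll(a, -1, axis=0)
            + np.roll(a, 1, axis=1) + np.roll(a, -1, axis=1))

def pgg_payoff(strategies, r):
    """Total public-goods payoff per player on an L x L torus.
    strategies: L x L array with 1 = cooperate, 0 = defect.
    Each site hosts one group of 5 (the site plus its 4 neighbors), so every
    player belongs to 5 groups. Cooperators pay a unit investment per group;
    each group's pool is multiplied by r and split equally among 5 members."""
    s = strategies.astype(float)
    group_coop = s + neighbor_sum(s)      # cooperators in the group at each site
    share = r * group_coop / 5.0          # share paid out to each group member
    return share + neighbor_sum(share) - 5.0 * s  # receive 5 shares, pay 5 costs

def fitness(payoff, reputation, delta):
    """Reputation-weighted 'actual income'; a convex mix is one plausible
    reading of the weighting described above (delta = reputation weight)."""
    return (1.0 - delta) * payoff + delta * reputation
```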
Reputation dynamics follow a nonlinear reputation transfer (NRT) rule. Each player's reputation is initialized uniformly at random within a bounded range. At each step, reputation changes according to the player's last action: if the player cooperated, reputation increases by an increment that shrinks as reputation approaches its upper bound; if the player defected, reputation decreases by a decrement that can be large for high-reputation players. This creates diminishing returns for high reputation under cooperation and potentially abrupt drops under defection, with reputation clipped to the allowed range. A sensitivity parameter scales the magnitude of these updates.
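The excerpt does not give the exact increment formulas, so the following is a hedged sketch only: a saturating gain under cooperation (growth slows near the upper bound) and a reputation-proportional loss under defection (high reputations can fall abruptly), scaled by an assumed sensitivity parameter `lam`:

```python
import numpy as np

def update_reputation(rep, cooperated, lam, rep_max=1.0):
    """Hedged sketch of the nonlinear reputation transfer (NRT) rule.
    lam is the reputation sensitivity; the paper's exact increments are
    elided in the excerpt. Gains saturate near rep_max (diminishing
    returns), losses scale with current reputation (abrupt falls from
    high reputation). Values are clipped to [0, rep_max]."""
    gain = lam * (rep_max - rep)   # growth slows near saturation
    loss = lam * rep               # larger drops for high-reputation defectors
    rep = np.where(cooperated, rep + gain, rep - loss)
    return np.clip(rep, 0.0, rep_max)
```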
Crucially, strategy updating is performed using reinforcement learning rather than imitation. The paper replaces traditional Q-learning (TQL) with double Q-learning (DQL) to mitigate overestimation bias. Each agent maintains two Q-tables, Q^A and Q^B, over a two-element state set (the agent's current strategy) and a two-element action set (cooperate or defect). Action selection uses the larger Q-value between the two tables (with epsilon-greedy exploration in the baseline experiments), while each update randomly modifies either Q^A or Q^B. When updating one table, the target uses the other table's evaluation, separating action selection from value estimation; this decoupling is the key theoretical mechanism for reducing overestimation bias.
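A minimal tabular sketch of this update, following van Hasselt's standard double Q-learning; here `reward` would be the reputation-weighted fitness, and `alpha`, `gamma`, `epsilon` stand in for the fixed baseline learning parameters mentioned under Limitations:

```python
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS = 2  # 0 = defect, 1 = cooperate; the state is the current strategy

def select_action(QA, QB, state, epsilon):
    """Epsilon-greedy selection using the larger Q-value between the two
    tables, as described in the text."""
    if rng.random() < epsilon:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(np.maximum(QA[state], QB[state])))

def dql_update(QA, QB, state, action, reward, next_state, alpha, gamma):
    """One double Q-learning step: pick a table at random, choose the greedy
    next action with that table, but evaluate it with the *other* table --
    the decoupling that curbs overestimation."""
    if rng.random() < 0.5:
        a_star = int(np.argmax(QA[next_state]))
        QA[state, action] += alpha * (reward + gamma * QB[next_state, a_star]
                                      - QA[state, action])
    else:
        a_star = int(np.argmax(QB[next_state]))
        QB[state, action] += alpha * (reward + gamma * QA[next_state, a_star]
                                      - QB[state, action])
```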
The primary empirical outcome is the stationary cooperation level, computed by averaging the cooperation fraction over the final 500 steps of each run and then averaging across independent runs. The authors also track time-dependent cooperation and analyze Q-values and reputation distributions in the stationary regime.
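In code, the outcome measure reduces to a per-run tail average, then a mean across runs (a sketch):

```python
import numpy as np

def stationary_cooperation(coop_fraction_series, tail=500):
    """Average cooperation fraction over the final `tail` steps of one run."""
    return float(np.mean(coop_fraction_series[-tail:]))

# then averaged across independent runs, e.g.:
# f_C = np.mean([stationary_cooperation(run) for run in runs])
```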
Key findings are reported through simulation comparisons and parameter sweeps. First, DQL substantially outperforms TQL in promoting cooperation when reputation is meaningfully coupled to fitness, i.e., when both the reputation sensitivity and the reputation weight are nonzero. The paper notes that when the sensitivity is zero (reputation effectively fixed and uninformative), TQL can show slightly higher cooperation, but this difference is not meaningful because reputation is not dynamically connected to fitness. Under realistic reputation sensitivity, increasing the sensitivity and the weight raises cooperation for both algorithms, but the improvement under DQL is "remarkably more pronounced"; the authors emphasize that DQL's advantage is clearest in the relevant parameter regimes, particularly when the synergy factor lies in the cooperation-supporting range, though exact p-values are not provided.
Second, the coevolution of strategy and reputation reveals a mechanism: when reputation is weighted heavily in fitness, the system can reverse the typical tragedy-of-the-commons outcome. The authors describe spatial snapshots showing that when the reputation weight is zero (reputation irrelevant), defection dominates and only low-reputation classes survive. When the weight is increased, cooperators form small clusters and defectors are eliminated even at relatively low synergy, while reputation levels rise and become aligned with cooperation. Importantly, the paper argues that the large cooperative clusters do not arise from network reciprocity, because the agents' Q-tables are not directly conditioned on neighbors' strategies; instead, updates are driven primarily by self-reward (fitness) and the reinforcement-learning process. This design makes the emergence of cooperation less dependent on topology and more dependent on the reputation-payoff coupling.
Third, reputation dynamics exhibit a non-monotonic, state-dependent structure. The authors observe that, in part of the parameter range, some players maintain moderate reputation and repeatedly flip between cooperation and defection, consistent with a trade-off: maintaining high reputation requires sustained cooperation, while defection yields short-term payoff but erodes reputation. They also report that the middle-reputation class persists robustly across the sensitivity values examined, with its final fraction only mildly parameter-dependent. The explanation is that reputation growth slows near saturation under the NRT rule, so reaching higher reputation requires sustained investment and becomes costly.
Fourth, the combined parameter effects are mapped on the plane spanned by reputation sensitivity and reputation weight. For a representative low-synergy case, the authors report that high cooperation (red region) appears only when both the sensitivity and the weight are sufficiently large. For a higher-synergy case, the low-cooperation region disappears and cooperation shifts to medium or high levels (green/yellow regions). The authors summarize the key message as: the synergy factor, the reputation sensitivity, and the reputation weight must be jointly tuned; no single parameter alone reliably produces high cooperation.
Fifth, the paper provides quantitative Q-table evidence. Table 3 reports average Q-values in the stationary state under fixed reputation parameters across several synergy factors. At the lowest synergy factor, only the defection-related averages are substantial, indicating that defection is optimal. As the synergy factor increases, the cooperation-related Q-values rise and eventually exceed the defection-related ones. The authors interpret cooperation dominance as occurring when the Q-value of cooperating exceeds the Q-value of defecting in both states, and they argue that larger gaps between these quantities make strategy switching toward cooperation easier.
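The exact inequalities are elided in the excerpt; a natural reconstruction for the 2-state/2-action tables is that the cooperative action has the higher average Q-value in both states. A minimal check, under that assumed reading:

```python
def cooperation_dominant(Q):
    """Assumed reading of the dominance condition: in both states
    (last action 0 = D, 1 = C), cooperating has the higher average
    Q-value. Q is a 2x2 array indexed [state, action]."""
    return Q[0, 1] > Q[0, 0] and Q[1, 1] > Q[1, 0]
```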
Finally, robustness is supported via a mean-field approximation. The authors derive a differential equation for the cooperation frequency from the transition probabilities between cooperation and defection, yielding a closed-form stationary expression for the cooperation level. They report close agreement between mean-field predictions and simulation results, suggesting that the observed cooperation enhancement is not purely a finite-size artifact.
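A standard mean-field form consistent with this description, with W_{D→C} and W_{C→D} the per-step switching probabilities and f_C the cooperation frequency (the paper's exact expression may differ):

```latex
\frac{df_C}{dt} = (1 - f_C)\, W_{D \to C} - f_C\, W_{C \to D},
\qquad
f_C^{*} = \frac{W_{D \to C}}{W_{D \to C} + W_{C \to D}} .
```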
Limitations include the reliance on a specific lattice topology (square lattice with von Neumann neighborhood) and a specific two-action state/action representation for reinforcement learning (the state equals the current strategy). The paper also fixes the baseline learning parameters (learning rate, discount factor, and exploration rate) and does not provide statistical significance tests (e.g., p-values or confidence intervals) for the cooperation-level differences between DQL and TQL, though it averages over independent runs and reports error bars qualitatively in figures. Additionally, the HIORC mechanism is described conceptually, but the provided excerpt emphasizes the payoff weighting and NRT reputation dynamics more than the full mathematical specification of organizer reputation and bandwagon-driven heterogeneous investment; readers should verify how fully HIORC is operationalized in the simulation.
Practically, the results suggest that cooperation can be stabilized when (1) reputation is not merely an interaction label but enters agents' effective fitness, (2) reputation dynamics reflect realistic fragility and nonlinear recovery, and (3) decision-making uses reinforcement learning methods that avoid overestimation bias. This is relevant to designing institutions and platforms where reputation affects both resource allocation (who invests) and evaluation (how rewards are computed). Stakeholders in online marketplaces, crowdsourcing, and public project governance, where reputation systems and learning-based adaptation occur, should care because the model predicts that coupling reputation to payoff can shift populations from defection to cooperation, but only when reputation sensitivity and payoff weighting are jointly strong enough relative to the underlying public goods synergy factor.
Cornell Notes
The paper studies cooperation in a spatial public goods game where reputation affects both investment heterogeneity and agents’ fitness, with reputation evolving via a nonlinear, potentially abrupt update rule. It shows that using double Q-learning (instead of traditional Q-learning) substantially increases cooperation and that the resulting cooperation is explained by changes in agents’ Q-values and reputation distributions, with mean-field theory matching simulations.
What is the central research question of the paper?
How does integrating reputation into both public goods investment and payoff evaluation—while letting reputation evolve nonlinearly—affect the emergence and stability of cooperation, and how does double Q-learning change that outcome compared with traditional Q-learning?
What game and population structure are used?
A spatial public goods game on a square lattice with periodic boundary conditions, where each player interacts with its four von Neumann neighbors and participates in five groups (centered on itself and on each neighbor).
How is payoff computed in the model?
Each player's game payoff is the sum of group payoffs across five groups, with cooperators contributing a fixed investment and defectors contributing nothing; the pooled contributions are amplified by the synergy factor and shared equally among group members. Reputation then augments payoff through the weighted "actual income" (fitness) function.
How does reputation evolve over time?
Reputation is initialized uniformly at random within a bounded range and updated nonlinearly: cooperation increases it by an increment that saturates near the upper bound, defection decreases it by a potentially large decrement, and the result is clipped to the allowed range.
What reinforcement learning method is used for strategy updates?
Double Q-learning (DQL), in which each agent maintains two Q-tables, Q^A and Q^B. Action selection uses the larger of the two tables' Q-values with epsilon-greedy exploration, while updates separate action selection from value estimation to reduce overestimation bias.
How are cooperation outcomes measured?
The stationary cooperation level is computed by averaging the cooperation fraction over the final 500 steps of each run and then averaging across independent runs.
What is the main empirical comparison between DQL and TQL?
When reputation is dynamically relevant (nonzero sensitivity and weight), DQL yields a substantially higher cooperation level than TQL, especially as the reputation weight increases; when the sensitivity is zero, differences are minor because reputation is effectively non-informative.
What parameter combinations promote high cooperation?
At low synergy, high cooperation appears only when both the reputation sensitivity and the reputation weight are sufficiently large. At higher synergy, low-cooperation regions disappear and cooperation shifts to medium or high levels.
How do Q-values explain the cooperation mechanism?
Cooperation dominance occurs when the Q-value of cooperating exceeds the Q-value of defecting in both states. At the lowest synergy factor in Table 3, only the defection-related averages are substantial, while at higher synergy the cooperation-related Q-values grow much larger.
How is robustness validated?
Through a mean-field approximation that predicts the stationary cooperation frequency using transition probabilities; the authors report close agreement between theory and simulation.
Review Questions
Which parts of the model make reputation “non-monotonic” and capable of abrupt changes, and how do those parts enter the fitness function?
Why does double Q-learning reduce overestimation bias compared with traditional Q-learning, and how is that reflected in the agent’s update equations?
Under what conditions (in terms of the synergy factor, reputation sensitivity, and reputation weight) does the paper report high cooperation, and what does the sensitivity-weight heatmap imply about parameter interactions?
How do the reported average Q-values (Table 3) support the claim that cooperation becomes dominant only when specific inequalities between cooperation- and defection-related Q-values hold?
What does the mean-field approximation assume, and how does its agreement with simulations support the paper’s conclusions?
Key Points
- 1
The model integrates reputation into three channels: heterogeneous investment driven by organizer reputation and cooperation willingness (HIORC), nonlinear reputation transfer (NRT) with abrupt changes, and reputation-weighted fitness.
- 2
Reputation evolves nonlinearly: cooperation increases reputation by an increment that saturates near the upper bound, while defection decreases it by a potentially large decrement, with values clipped to a bounded range.
- 3
Using double Q-learning (DQL) for strategy updates substantially improves cooperation compared with traditional Q-learning (TQL) whenever the reputation sensitivity is nonzero.
- 4
High cooperation requires joint tuning of parameters: at low synergy, it occurs only when both the reputation sensitivity and the reputation weight are sufficiently large; at higher synergy, low-cooperation regions vanish.
- 5
Cooperative behavior is not attributed to network reciprocity because Q-tables are not conditioned on neighbors; instead, cooperation emerges from self-reward dynamics under reputation-coupled fitness.
- 6
Q-table analysis shows cooperation dominance when the Q-value of cooperating exceeds that of defecting in both states; Table 3 quantifies how these values shift as the synergy factor increases.
- 7
A mean-field approximation for the stationary cooperation frequency matches simulation results, supporting the robustness of the mechanism beyond lattice-specific effects.