The study compares ChatGPT, a human expert, checklist-based writing analytics, and no support on intrinsic motivation, SRL processes, and learning outcomes in a randomized lab experiment.
Briefing
This paper asks whether using generative artificial intelligence (GenAI)—specifically ChatGPT—changes learners’ intrinsic motivation, their self-regulated learning (SRL) processes, and their learning performance when compared with other forms of support (a human expert or writing analytics tools) or with no support. The question matters because “hybrid intelligence” in education is often framed as augmenting learners rather than replacing teachers, yet the mechanisms by which learners benefit (or fail to benefit) from GenAI remain insufficiently evidenced by controlled, process-level studies. In particular, the authors argue that GenAI may enable short-term performance gains while simultaneously encouraging “metacognitive laziness,” a hypothesized pattern of over-reliance on external help that reduces learners’ engagement in monitoring, evaluation, and other metacognitive activities needed for durable learning.
To address these questions, the authors conducted a randomized laboratory experiment with 117 university students (mean age 22.61, SD = 3.39; 70% female; 55% undergraduates). All participants were English-as-a-second-language learners. Participants completed a two-stage reading-and-writing task (including training videos, a 2-hour stage-1 reading/writing phase, and a 1-hour stage-2 revising phase) followed by a post-test within one day. The task required writing an essay envisioning the future of education in 2035 while integrating three provided topics (AI, differentiated teaching, and scaffolding teaching) and using a rubric for scoring.
Participants were randomly assigned to four groups: (1) a control group with no extra support (CN, n = 30 in the main description; later tables show slightly different Ns because of scoring/test availability), (2) an AI group supported by ChatGPT (GPT-4), accessed through an OpenAI API interface and restricted to the task topics (AI, n = 35), (3) a human expert group (HE, n = 25), who received real-time one-on-one feedback from an expert in a chat, and (4) a checklist tools group (CL, n = 27), which used a writing analytics toolkit providing feedback on spelling/grammar, academic style, originality, and rhetorical structure.
Motivation was measured post-task using the Intrinsic Motivation Inventory (IMI), which includes four dimensions: interest/enjoyment, perceived competence, effort/importance, and pressure/tension. Group differences were tested with ANOVA followed by Tukey’s HSD.
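As a concrete illustration of this analysis pipeline, the sketch below runs a one-way ANOVA and Tukey's HSD on one IMI dimension in Python (scipy/statsmodels). The data are simulated and the column names are assumptions made for illustration; the paper does not publish analysis code.

```python
"""Sketch of the motivation analysis as described: one-way ANOVA across the
four groups on an IMI dimension, followed by Tukey's HSD post hoc tests.
All data are simulated; column names are illustrative assumptions."""
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": np.repeat(["CN", "AI", "HE", "CL"], [30, 35, 25, 27]),
    "interest_enjoyment": rng.normal(5.0, 1.0, 117),  # simulated 7-point IMI scores
})

# Omnibus test: does mean interest/enjoyment differ across the four groups?
samples = [g["interest_enjoyment"].to_numpy() for _, g in df.groupby("group")]
f_stat, p_val = stats.f_oneway(*samples)
print(f"ANOVA: F = {f_stat:.3f}, p = {p_val:.3f}")

# Tukey-adjusted pairwise comparisons between all group pairs.
print(pairwise_tukeyhsd(df["interest_enjoyment"], df["group"]))
```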
To capture SRL processes, the authors collected multi-channel learning trace data (navigational logs, click streams, mouse movements, keyboard strokes). They used a trace parser to map raw actions into SRL “processes” (e.g., orientation, planning, monitoring, evaluation, reading, elaboration/organization, and an “other” category capturing interactions with agents). For frequency differences, they used Kruskal–Wallis tests with Mann–Whitney post hoc comparisons. For temporal/sequence differences, they applied process mining using a first-order Markov model (via pMineR), visualizing transition probabilities between processes and comparing process maps across groups.
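The sketch below mirrors both analysis steps in Python: a Kruskal-Wallis test with one Mann-Whitney post hoc comparison on simulated process counts, and a first-order Markov transition matrix estimated from a toy coded trace. The authors worked with a trace parser and pMineR in R, so the labels, counts, and sequence here are purely illustrative.

```python
"""Sketch of the two SRL analyses as described: frequency comparisons
(Kruskal-Wallis, Mann-Whitney post hoc) and a first-order Markov model of
process transitions. Data and process labels are invented for illustration."""
import numpy as np
import pandas as pd
from scipy import stats

# --- Frequency differences: Kruskal-Wallis omnibus, Mann-Whitney post hoc ---
rng = np.random.default_rng(1)
counts = {g: rng.poisson(lam, 25) for g, lam in
          [("CN", 4), ("AI", 9), ("HE", 8), ("CL", 8)]}  # simulated per-learner counts of one SRL process
h_stat, p_val = stats.kruskal(*counts.values())
print(f"Kruskal-Wallis: H = {h_stat:.3f}, p = {p_val:.3f}")
u_stat, p_pair = stats.mannwhitneyu(counts["AI"], counts["CN"])  # one post hoc pair
print(f"AI vs CN: U = {u_stat:.1f}, p = {p_pair:.3f}")

# --- Temporal patterns: first-order Markov transition probabilities ---
# One learner's trace, already parsed into coded SRL processes.
trace = ["orientation", "reading", "elaboration", "monitoring", "other",
         "elaboration", "evaluation", "other", "elaboration"]
pairs = pd.DataFrame({"src": trace[:-1], "dst": trace[1:]})
transitions = pd.crosstab(pairs["src"], pairs["dst"], normalize="index")
print(transitions.round(2))  # row i, column j = P(next = j | current = i)
```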
Learning performance was assessed in three dimensions: (a) essay score improvement (the difference between essay scores before and after revision), (b) knowledge gain (the difference between pre- and post-test scores on AI in education), and (c) knowledge transfer (post-test score on AI in healthcare). Essay scoring used a rubric, with two independent raters for the first 12 essays and single-rater scoring thereafter; inter-rater reliability was high (intraclass correlation coefficients all greater than 0.85). Knowledge tests were multiple-choice items developed and validated for reliability in prior work. Performance differences were analyzed with ANOVA and Tukey-adjusted pairwise comparisons.
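For the reliability step, here is a minimal sketch of the ICC computation, assuming the pingouin package (the paper does not name its software); the simulated scores stand in for the 12 double-scored essays.

```python
"""Sketch of an inter-rater reliability check via intraclass correlation.
pingouin is an assumed tool choice, and all scores are simulated."""
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(2)
quality = rng.normal(7.0, 2.0, 12)  # latent quality of the 12 double-scored essays
long = pd.DataFrame({
    "essay": np.tile(np.arange(12), 2),
    "rater": np.repeat(["rater1", "rater2"], 12),
    "score": np.concatenate([quality + rng.normal(0, 0.5, 12),
                             quality + rng.normal(0, 0.5, 12)]),
})

# intraclass_corr reports the standard ICC variants; the paper reports all ICCs > 0.85.
icc = pg.intraclass_corr(data=long, targets="essay", raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])
```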
The key findings are threefold. First, the groups did not differ significantly in intrinsic motivation after the task. Across IMI dimensions, ANOVA results showed no significant differences for interest/enjoyment (F = 1.087, p = 0.358, η² = 0.029), perceived competence (F = 0.453, p = 0.716, η² = 0.012), effort/importance (F = 1.152, p = 0.332, η² = 0.030), or pressure/tension (F = 0.546, p = 0.652, η² = 0.015). Although not statistically significant, descriptive patterns suggested CN had the lowest interest/enjoyment and highest pressure/tension, while CL showed the highest interest/enjoyment, perceived competence, and effort and the lowest pressure/tension.
Second, SRL processes differed significantly, especially during the revising stage. Frequency analysis indicated that during revising, AI, HE, and CL groups engaged more in elaboration/organization (writing-related processes) than CN. Conversely, AI and HE groups engaged less in reading, consistent with their reliance on conversational feedback during revision rather than returning to reading materials. Orientation processes were also higher in AI, HE, and CL than in CN during revising, suggesting more frequent revisiting of task instructions and rubric. Notably, CL uniquely showed a significant increase in evaluation processes, plausibly because the checklist tools explicitly guided rubric-based self-evaluation.
Process mining revealed temporal “loops” consistent with the authors’ metacognitive laziness hypothesis. In the AI vs. CN comparison, the AI group showed stronger transitions back to the “other” node (interactions with ChatGPT) after engaging in processes such as monitoring and evaluation, forming a prominent loop among Other, elaboration/organization, and evaluation. This pattern suggests that ChatGPT consultation became the dominant revision strategy. In contrast, the CN group connected revising more strongly to reading materials and task instructions. In the AI vs. HE comparison, the AI group again showed a closed reliance loop between revising and Other (ChatGPT), whereas the HE group exhibited more transitions linking revising to reading and linking orientation to evaluation, implying that human expert feedback reinforced learners’ own metacognitive connections rather than replacing them.
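One way to make the loop claim concrete: in a fitted first-order Markov model, the strength of a closed loop can be summarized as the product of its transition probabilities. The sketch below does this for a hypothetical evaluation → Other → elaboration cycle; the metric and the matrix values are invented for illustration and do not reproduce the paper's pMineR models.

```python
"""Illustration of quantifying a 'reliance loop' in a first-order Markov
model as the product of transition probabilities around a cycle. The matrix
values are invented; they are not the paper's fitted models."""
import pandas as pd

states = ["elaboration", "evaluation", "other", "reading"]
P = pd.DataFrame(
    [[0.10, 0.30, 0.50, 0.10],   # from elaboration
     [0.20, 0.05, 0.60, 0.15],   # from evaluation
     [0.55, 0.25, 0.10, 0.10],   # from other (agent interaction)
     [0.40, 0.20, 0.20, 0.20]],  # from reading
    index=states, columns=states)

def loop_strength(P: pd.DataFrame, cycle: list[str]) -> float:
    """Product of transition probabilities around a closed cycle of states."""
    prob = 1.0
    for src, dst in zip(cycle, cycle[1:] + cycle[:1]):
        prob *= P.loc[src, dst]
    return prob

# Larger values mean the cycle rarely exits to reading or instructions:
# candidate evidence for the tool-consultation pattern described above.
print(loop_strength(P, ["evaluation", "other", "elaboration"]))  # 0.60 * 0.55 * 0.30
```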
Third, performance outcomes diverged by dimension. Essay score improvement differed significantly across groups (ANOVA: F = 4.549, p = 0.005, η² = 0.108), with the AI group outperforming the other groups in post-revision gains. Pairwise comparisons (Tukey-adjusted) showed AI had greater improvement than CN (mean difference = 1.970, p-adjusted = 0.037), greater than HE (mean difference = 2.120, p-adjusted = 0.025), and greater than CL (mean difference = 2.200, p-adjusted = 0.012). However, knowledge gain showed no significant group differences (pre-test: F = 1.294, p = 0.281, η² = 0.036; post-test: F = 0.913, p = 0.438, η² = 0.030), and knowledge transfer also showed no differences (F = 0.019, p = 0.996, η² = 0.000).
The authors interpret these results as evidence that GenAI can improve short-term task performance without improving intrinsic motivation or durable learning outcomes (knowledge gain/transfer). They argue that the AI group’s process patterns—particularly the reliance loop involving ChatGPT—are consistent with metacognitive laziness: learners offload metacognitive work to the tool, reducing the internal monitoring/evaluation cycles that support transfer.
Limitations are acknowledged by the authors and are also apparent from the design. The authors explicitly note that the absence of significant differences might reflect constraints of task duration and sample size, and they call for larger samples and longer-term follow-up. The sample is also not fully representative (70% female), and the task is limited to a single reading-and-writing domain, which may not capture how metacognitive laziness manifests across other learning activities. Finally, the authors note that no mature, targeted measures specifically designed to assess metacognitive laziness yet exist, so future work should develop and integrate more direct measurement protocols.
Practically, the findings suggest that educators and instructional designers should not assume that GenAI will automatically improve motivation or learning transfer. ChatGPT may be effective for tasks with clear rubrics and criteria, improving essay revision scores, but it may also shift learners’ SRL behavior toward tool consultation loops that weaken metacognitive engagement. Teachers and learning designers should therefore scaffold AI use: encourage learners to explicitly evaluate and justify changes against rubrics, require monitoring steps that are not delegated to the model, and use structured supports (e.g., checklist-like evaluation prompts) that keep metacognitive processes active. Learners should be cautioned against using AI as a shortcut to complete outputs; instead, they should treat AI feedback as input to their own evaluation and planning. Researchers should extend this work with multi-task, cross-context, and longitudinal studies, and with better operationalization of metacognitive laziness.
Cornell Notes
This randomized lab study compares ChatGPT support, human expert support, checklist-based writing analytics, and no support on intrinsic motivation, SRL process behavior, and learning outcomes. The authors find no significant differences in intrinsic motivation, but ChatGPT produces the largest short-term essay score improvements while leaving knowledge gain and transfer unchanged, alongside SRL process patterns consistent with “metacognitive laziness.”
What research question(s) does the paper address?
Whether different learning agents (ChatGPT, human expert, checklist tools, or no support) change (1) intrinsic motivation, (2) SRL process frequency and temporal sequences, and (3) learning performance (essay improvement, knowledge gain, and knowledge transfer).
What study design and setting were used?
A randomized experimental study in a lab setting with a two-stage reading-and-writing task (writing then revising), followed by post-testing within one day.
How were participants assigned to conditions and how many were there?
117 university students were randomly assigned to four groups: CN (no support, n = 30 in the main description), AI (ChatGPT, n = 35), HE (human expert, n = 25), and CL (checklist tools, n = 27).
How was intrinsic motivation measured and analyzed?
Using the Intrinsic Motivation Inventory (IMI) post-task; ANOVA followed by Tukey’s HSD tested differences across groups on four IMI dimensions.
What was the main motivation result?
No significant group differences in intrinsic motivation on any IMI dimension (e.g., interest/enjoyment: F = 1.087, p = 0.358).
How were SRL processes measured?
By collecting multi-channel learning trace data (logs, clicks, mouse/keyboard) and parsing them into SRL processes; frequency differences used Kruskal–Wallis/Mann–Whitney, and temporal patterns used process mining with a first-order Markov model (pMineR).
What SRL process differences emerged during revising?
AI, HE, and CL showed more elaboration/organization and more orientation than CN; AI and HE showed less reading; CL uniquely increased evaluation processes, while AI exhibited strong temporal loops linking revising to repeated ChatGPT consultations.
What were the performance results for essay improvement?
Essay score improvement differed significantly across groups (F = 4.549, p = 0.005). The AI group had the largest improvement and outperformed CN (mean difference = 1.970, p-adjusted = 0.037), HE (2.120, p-adjusted = 0.025), and CL (2.200, p-adjusted = 0.012).
Did ChatGPT improve knowledge gain or transfer?
No. Knowledge gain showed no significant differences (pre-test: F = 1.294, p = 0.281; post-test: F = 0.913, p = 0.438), and knowledge transfer was also non-significant (F = 0.019, p = 0.996).
What is the paper’s central interpretation of these mixed outcomes?
ChatGPT improves short-term task performance but may trigger metacognitive laziness—offloading metacognitive work to the tool—leading to weaker or absent gains in deeper learning outcomes like transfer.
Review Questions
How do the authors operationalize and detect “metacognitive laziness” using SRL process mining patterns?
Why might intrinsic motivation remain unchanged even when essay performance improves significantly?
What evidence in the process maps supports the claim that AI use creates a reliance loop during revision?
How do the three performance dimensions (essay improvement, knowledge gain, knowledge transfer) help distinguish short-term optimization from durable learning?
Key Points
- 1. The study compares ChatGPT, a human expert, checklist-based writing analytics, and no support on intrinsic motivation, SRL processes, and learning outcomes in a randomized lab experiment.
- 2. Intrinsic motivation did not significantly differ across groups on any IMI dimension (all ANOVA p-values > 0.05).
- 3. SRL processes differed during the revising stage: AI/HE/CL increased elaboration/organization and orientation versus CN, while AI/HE reduced reading engagement.
- 4. Process mining showed ChatGPT-driven temporal loops linking revising to repeated tool consultation, consistent with “metacognitive laziness.”
- 5. ChatGPT produced the largest essay score improvement (ANOVA F = 4.549, p = 0.005; AI outperformed CN/HE/CL with Tukey-adjusted p-values 0.012–0.037).
- 6. Despite better essay revision scores, ChatGPT did not improve knowledge gain or knowledge transfer (knowledge transfer: F = 0.019, p = 0.996).
- 7. The authors caution that GenAI may boost rubric-aligned output quality while not supporting deeper learning, and they call for longer-term, multi-task studies and better measures of metacognitive laziness.