
Battle of the AIs: Can Bing and Bard Beat ChatGPT at Research?

Andy Stapleton · 5 min read

Based on Andy Stapleton's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Bing is strongest for research discovery tasks—finding papers, generating seed references, and supporting early literature-review direction.

Briefing

The most practical takeaway from the comparison is that no single AI tool dominates research writing end-to-end: ChatGPT (with GPT-4) tends to win when the task is turning a paper’s content into polished language, while Microsoft Bing is strongest when the job is finding starting points—especially references—and navigating the early stages of a literature search.

All three systems were tested on common research workflows: locating papers, summarizing PDFs, and transforming academic material into outputs such as bullet-point summaries, press releases, and blog posts. For paper-finding, the baseline reality is that ChatGPT lacked internet access during the tests, so it couldn’t reliably surface the newest literature. That limitation showed up immediately when asked for the latest papers on organic photovoltaic materials. Bing, with internet access, produced usable links and references, and the results improved compared with earlier attempts using tools without strong retrieval. Still, even Bing’s paper suggestions sometimes required verification and follow-up expansion using external tools like Connected Papers, Litmaps, or ResearchRabbit.

When the workflow shifted from retrieval to reading and summarizing, ChatGPT’s performance stood out. The transcript describes a “text splitter” approach to feed paper text into ChatGPT, after which it produced structured five-bullet summaries that included more of the paper’s concrete details (and felt more trustworthy to the tester) than Bard’s higher-level, figure-light summaries. Bing also summarized, but it didn’t consistently follow the requested format (for example, not delivering exactly five bullets) and tended to be less detailed.
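The split step is straightforward to reproduce: a full paper usually exceeds what can be pasted into a single prompt, so the text is divided into overlapping chunks, each chunk is summarized, and the pieces are then merged into a final five-bullet summary. Below is a minimal Python sketch of that kind of text splitter; the chunk size, overlap, and prompt wording are illustrative assumptions, not the exact settings used in the video.

```python
# Minimal sketch of a "text splitter" workflow for summarizing a long paper
# with ChatGPT. Chunk size, overlap, and prompt wording are illustrative
# assumptions, not the settings used in the video.

def split_text(text: str, chunk_size: int = 3000, overlap: int = 200) -> list[str]:
    """Split a paper's plain text into overlapping chunks that fit in a prompt."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # overlap keeps sentences from being cut between chunks
    return chunks


def build_prompts(paper_text: str) -> list[str]:
    """Wrap each chunk in a summarization request to paste into ChatGPT."""
    chunks = split_text(paper_text)
    prompts = []
    for i, chunk in enumerate(chunks, 1):
        prompts.append(
            f"Part {i} of {len(chunks)} of a research paper. "
            "Summarize the most important points of this part as concise bullets, "
            "keeping concrete details (methods, materials, numbers):\n\n" + chunk
        )
    # A final prompt then asks ChatGPT to merge the per-part bullets into
    # "the most important five bullet points" for the whole paper.
    return prompts


if __name__ == "__main__":
    paper = open("paper.txt", encoding="utf-8").read()  # plain text extracted from the PDF
    for prompt in build_prompts(paper):
        print(prompt[:120], "...\n")
```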

The same pattern repeated for rewriting tasks. Asked to convert a paper into a press release, ChatGPT produced a more faithful, publication-ready structure and captured the paper’s essence more accurately. Bard generated a quick draft but introduced factual errors—such as attributing findings to the wrong journal—while Bing’s output was less aligned with strict press-release conventions, though it still offered subheadings and a usable narrative.

For blog-style writing, ChatGPT again matched the brief more closely, producing language appropriate for a science publication audience. Bard’s output skewed toward a more formal, exploratory tone that wouldn’t easily pass editorial standards, and Bing’s responses leaned toward generic “how to write a science blog” guidance rather than generating an actual publishable draft.

Finally, when starting from scratch—like drafting an introduction for a literature review on transparent electrode materials—Bing looked best for providing seed references and field orientation, even though the transcript cautions that reference accuracy still needs checking. Bard also performed reasonably on mapping the topic (e.g., noting indium tin oxide and alternatives like carbon-based materials), but Bing’s reference scaffolding made it the better launchpad.

The conclusion is a division of labor: use Bing for the hardest parts of research discovery (references, initial exploration, and PDF interaction), and use ChatGPT for language-heavy transformation once the source material is in hand. Bard is described as comparatively weak for research-specific nuance and referencing reliability in these tests.

Cornell Notes

The comparison finds a split between “research discovery” and “research writing.” Bing (with internet access) is strongest at finding papers, generating seed references, and helping with early literature-review scaffolding, though outputs still require verification. ChatGPT (GPT-4) performs best when the input is already available—summarizing papers into precise bullet points and rewriting content into press releases and blog drafts with better fidelity to the source. Bard tends to produce higher-level or format-mismatched drafts and shows more risk of factual mistakes in rewriting tasks. The practical workflow is to use Bing to gather and orient, then use ChatGPT to turn that material into publishable language.

Why did ChatGPT struggle with “latest papers,” and how did that affect the results?

ChatGPT lacked internet access during the tests, so it couldn’t reliably retrieve newly published research. When prompted to find the latest papers on new materials for organic photovoltaic devices, it produced weak or non-current references. The transcript notes that browsing access became available after the video was edited, but the comparison still treats the original limitation as the reason the “latest papers” task underperformed.

Which tool handled paper-to-summary tasks best, and what evidence supports that?

ChatGPT was judged the winner for summarizing a paper into “the most important five bullet points.” It produced more detailed, paper-faithful bullet content and better captured the paper’s findings. Bard’s summary was considered suspiciously high-level and lacked figures or concrete details, while Bing’s summary was less aligned with the requested format (it didn’t reliably deliver exactly five bullets) and provided less information overall.

How did the tools perform when rewriting a paper into a press release?

ChatGPT produced the most usable press-release draft, capturing the paper’s essence and following the expected structure more closely. Bard was fast but introduced factual errors, such as claiming a study appeared in Nature Energy when it had not. Bing generated a press-release-like output with subheadings and body/conclusion elements, but it missed some of the strict press-release structure.

What happened when the task shifted to blog writing for a publication audience?

ChatGPT produced language that matched the brief for a blog suitable for outlets like ScienceAlert, with a tone and depth the tester considered publishable. Bard produced something more formal and exploratory, described as unlikely to be sent to an editor. Bing’s output was closer to generic instruction (“how to write a science blog”) rather than generating a tailored blog draft from the paper’s content.

When no paper was provided and the goal was to start a literature review, which system was most helpful and why?

Bing was viewed as the best starting point because it generated references and seed links along with an overview of the field. For transparent electrode materials, it identified indium tin oxide as predominant and also named alternatives such as carbon-based materials and nanostructures. The transcript emphasizes that references may still be inaccurate and should be checked, but Bing’s scaffolding made it more useful for launching the research process.

Review Questions

  1. In this comparison, what specific tasks separate “research discovery” from “research writing,” and which tool is favored for each?
  2. What kinds of errors were observed when converting papers into press releases, and how did those errors differ across ChatGPT, Bard, and Bing?
  3. Why does the transcript repeatedly warn that reference accuracy must be verified, even when a tool provides citations?

Key Points

  1. Bing is strongest for research discovery tasks—finding papers, generating seed references, and supporting early literature-review direction.

  2. ChatGPT (GPT-4) is strongest for language-heavy transformation of known content, including five-bullet summaries, press releases, and publishable blog drafts.

  3. ChatGPT’s lack of internet access during testing limited its ability to retrieve the newest papers, making it unreliable for “latest literature” queries.

  4. Bard’s outputs often skew high-level or miss requested formatting, and it showed a higher risk of factual mistakes when rewriting for press-release style.

  5. Bing can interact with PDFs more directly in the workflow described, reducing friction compared with text-pasting approaches.

  6. Even when citations are provided, reference accuracy is not guaranteed; verification remains essential.

  7. A practical workflow emerges: use Bing to gather and orient, then use ChatGPT to produce polished research communication.

Highlights

ChatGPT looked best when the source material was provided and the goal was precise, publication-ready rewriting—especially five-bullet summaries and press-release drafts.
Bing’s advantage showed up most clearly in reference-heavy tasks: finding papers, producing links, and generating seed citations for literature review introductions.
Bard was the most inconsistent across formats and factual accuracy, including at least one clear journal attribution error in a press-release rewrite.