Exporting data from SCOPUS and WOS for Bibliometric Analysis

TL;DR

Start with a clear research question and translate it into reliable keywords before searching Scopus or Web of Science.

Briefing

Bibliometric analysis depends on getting the right dataset first, especially when working with large, journal-based citation databases like Scopus and Web of Science. The core workflow starts with defining a clear research question and then translating it into reliable keywords. Keyword selection is treated as the most difficult and most consequential step: without targeted, dependable terms, neither the analysis nor the resulting manuscript can be trusted.

Once keywords are finalized, the next step is running structured searches in Scopus, then repeating the same logic in Web of Science. The search begins by entering the keywords exactly as intended, then using Boolean logic to narrow results. When multiple terms are involved, the transcript emphasizes operators like AND and OR to control subject scope, and double quotation marks so that the database matches a multi-word term as a phrase in the searched fields (title, abstract, or author keywords). Without quotation marks, searches can return an unmanageably large volume of irrelevant records.
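As a minimal sketch of the Boolean logic described above, the Python snippet below assembles a query string from groups of synonyms: terms inside a group are joined with OR, and the groups are joined with AND. The helper names and the example keywords ("urbanisation", "flood risk") are illustrative assumptions, not terms from the transcript. The TITLE-ABS-KEY wrapper is the Scopus advanced-search field code for title, abstract, and keywords; the transcript itself works through the search form rather than typing field codes.

```python
def quote(term: str) -> str:
    """Wrap a term in double quotes so its words are searched as a phrase."""
    return f'"{term}"'


def any_of(terms: list[str]) -> str:
    """Join synonyms with OR and group them in parentheses."""
    return "(" + " OR ".join(quote(t) for t in terms) + ")"


def all_of(groups: list[list[str]]) -> str:
    """Require every concept group (AND) within title, abstract, and keywords."""
    return "TITLE-ABS-KEY(" + " AND ".join(any_of(g) for g in groups) + ")"


# Hypothetical keyword groups: main theme first, subfield second.
query = all_of([["urbanization", "urbanisation"], ["flooding", "flood risk"]])
print(query)
# TITLE-ABS-KEY(("urbanization" OR "urbanisation") AND ("flooding" OR "flood risk"))
```

Keeping each concept in its own group makes it easy to add or drop a synonym and rerun the search without retyping the whole query.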

A practical example narrows a broad topic to a specific subfield: “urban flooding” is treated as an intersection in which “urbanization” is the main theme, but the search must also require flooding-related terminology. The same idea applies to other domains (e.g., supply chain), where the database may treat related spellings differently (such as “supply chain” versus “supply-chain”), so the search strategy should account for variations. The transcript also mentions wildcards (the asterisk, which it calls “star strings”) and grouping terms in parentheses to capture similar concepts while keeping the query focused.
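One way to handle spelling variants, sketched in Python under stated assumptions: generate the spaced, hyphenated, and closed-up forms of a phrase, then join them with OR inside parentheses. Terms ending in an asterisk are left unquoted here so they read as wildcard stems. The function names and the "flood*" and "inundation" terms are hypothetical, and whether a given database needs all the variants should be checked against a trial search.

```python
def phrase_variants(phrase: str) -> list[str]:
    """Return spaced, hyphenated, and closed-up spellings of a phrase."""
    words = phrase.replace("-", " ").split()
    variants = [" ".join(words), "-".join(words), "".join(words)]
    # dict.fromkeys drops duplicates while keeping the original order.
    return list(dict.fromkeys(variants))


def or_group(terms: list[str]) -> str:
    """Group alternatives in parentheses; leave wildcard stems unquoted."""
    parts = [t if t.endswith("*") else f'"{t}"' for t in terms]
    return "(" + " OR ".join(parts) + ")"


supply = or_group(phrase_variants("supply chain"))
flood = or_group(["flood*", "inundation"])
print(supply)  # ("supply chain" OR "supply-chain" OR "supplychain")
print(flood)   # (flood* OR "inundation")
```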

After the initial search returns thousands of documents, filtering becomes essential. The transcript describes narrowing by time window (such as limiting to the last 10 years), restricting document types (e.g., focusing on journal articles rather than conference proceedings, books, or book series), and limiting language to English when the research context requires it. These constraints reduce noise and make downstream bibliometric analysis more coherent.
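The transcript applies these filters in the database interface. The same rules can be re-checked on an exported file, which is what the Python sketch below illustrates. The column names ("Year", "Document Type", "Language of Original Document") are assumed to match a typical Scopus CSV export and should be verified against the actual file. The four sample rows are invented placeholders, not real records.

```python
import csv
import io

# Invented placeholder rows standing in for an exported CSV (column names assumed).
SAMPLE = """Title,Year,Document Type,Language of Original Document
Urban flood mapping,2021,Article,English
Flood risk in cities,2009,Article,English
Drainage models,2020,Conference paper,English
Inondations urbaines,2019,Article,French
"""


def keep(row: dict, first_year: int, doc_types: set, languages: set) -> bool:
    """True if the record passes the time, document-type, and language filters."""
    return (
        int(row["Year"]) >= first_year
        and row["Document Type"] in doc_types
        and row["Language of Original Document"] in languages
    )


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
kept = [r for r in rows if keep(r, 2015, {"Article"}, {"English"})]
print([r["Title"] for r in kept])  # ['Urban flood mapping']
```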

Finally, exporting the dataset is presented as a key step in the data acquisition pipeline. In Scopus, export options include CSV (recommended because it fits Excel-style workflows), plain text, and other text-based formats. The export should include the bibliographic and citation fields needed for analysis, such as citation information, bibliographic details, abstracts, and author keywords, while optionally excluding funding details. The same general approach is then applied to Web of Science, using equivalent selection and export choices (including plain text). The overall message is straightforward: careful keyword engineering, disciplined filtering, and structured export turn massive database results into usable bibliometric data.
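Before starting any analysis, it can help to confirm that the export really contains the fields selected in the export dialog. The Python sketch below reads only the header row and reports which expected columns are missing. The REQUIRED list is an assumption about typical Scopus CSV column names, and the file name in the comment is hypothetical.

```python
import csv
import io

# Assumed column names; adjust to match the header of the actual export.
REQUIRED = ["Authors", "Title", "Year", "Source title",
            "Cited by", "Abstract", "Author Keywords"]


def missing_columns(csv_file, required=REQUIRED) -> list[str]:
    """Return the required column names absent from the CSV header row."""
    header = next(csv.reader(csv_file), [])
    # Strip whitespace and a possible byte-order mark from the header cells.
    present = {name.strip().lstrip("\ufeff") for name in header}
    return [name for name in required if name not in present]


# With a real file (hypothetical name):
#   with open("scopus.csv", newline="", encoding="utf-8-sig") as f:
#       print(missing_columns(f))
demo = io.StringIO("Authors,Title,Year,Source title,Cited by\n")
missing = missing_columns(demo)
print(missing)  # ['Abstract', 'Author Keywords']
```

An empty list means every expected field is present; anything else suggests the export should be repeated with more fields ticked.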

Cornell Notes

Bibliometric work using Scopus and Web of Science starts with a research question and ends with an exportable dataset. The transcript stresses that keyword selection is the hardest and most important step: targeted keywords determine whether the analysis will be meaningful. Searches should use Boolean operators (AND/OR), quotation marks to match terms in title/abstract/keywords, and sometimes wildcard/string methods to capture phrase variations. After retrieving results, filtering by time range, document type (e.g., journals only), and language (often English) reduces irrelevant records. Exporting in a structured format like CSV, along with citation/bibliographic fields and abstracts/keywords, prepares the data for bibliometric analysis.

Why is keyword selection treated as the most critical step in bibliometric data acquisition?

Because dataset quality depends on whether the query reliably captures the intended research area. If keywords are too broad, miss key terminology, or do not match how the databases index terms, the results become noisy or incomplete, making the final bibliometric analysis unreliable. The transcript frames this as a direct link between keyword targeting and whether the manuscript and analysis can be considered sound.

How do Boolean operators and quotation marks change what Scopus returns?

Boolean operators (AND/OR) control how terms combine: AND requires that a record discuss both the main topic and the subfield concept, while OR accepts any of several related terms. Quotation marks make the database match a multi-word term as a phrase in indexed fields like title, abstract, or author keywords. Without quotes, the database can return a much larger set of records, including items that only loosely relate to the intended terms.

What’s the purpose of narrowing results after an initial keyword search returns thousands of documents?

Filtering turns an unmanageable result set into an analyzable dataset. The transcript gives examples such as limiting to the last 10 years, restricting to journal articles (excluding proceedings, books, and book series), and selecting English-language records when the research context requires it. These steps reduce irrelevant records and improve the coherence of subsequent analysis.

How can a search be designed to focus on a subfield rather than a broad topic?

By combining terms so the query captures the intersection of concepts. The transcript’s example uses “urban flooding”, where “urbanization” is the main topic but the query must also include flooding-related terminology. This is done with Boolean logic and terms grouped in parentheses, so the database returns records that discuss both aspects, not just the broader theme.

Why is CSV recommended for exporting Scopus results, and what fields should be included?

CSV is recommended because it works smoothly with Excel-style workflows, making it easier to screen and process data for analysis. The transcript advises exporting citation information, bibliography details, abstracts, and author keywords (and optionally excluding funding details). Including these fields ensures the dataset supports citation-based and content-based bibliometric methods.

How does the workflow in Web of Science relate to Scopus?

The same overall logic applies: use the search interface with structured queries (including quotation marks and Boolean logic), select the relevant records, and export in an appropriate text format. The transcript notes that Web of Science offers options like plain text for export, mirroring Scopus’s selection-and-export steps.
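For readers who export plain text from Web of Science, the Python sketch below shows one way to read such a file into records. It assumes the common tagged layout: each field starts with a two-letter tag (for example AU, TI, PY), continuation lines are indented, ER ends a record, and FN, VR, and EF are file-level markers. The transcript does not describe the file layout, so these are assumptions to check against a real export, and the sample record is invented.

```python
def parse_wos(text: str) -> list[dict[str, list[str]]]:
    """Parse tagged plain text into a list of {tag: [values]} records."""
    records, current, tag = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            continue                      # skip blank separator lines
        head, value = line[:2], line[3:].strip()
        if head == "ER":                  # end of the current record
            records.append(current)
            current, tag = {}, None
        elif head in ("FN", "VR", "EF"):  # file-level markers, not record fields
            continue
        elif head.strip():                # a new two-letter field tag
            tag = head
            current.setdefault(tag, []).append(value)
        elif tag:                         # indented continuation of the last tag
            current[tag].append(value)
    return records


# Invented sample record in the assumed layout.
SAMPLE = (
    "FN Clarivate Analytics Web of Science\nVR 1.0\n"
    "PT J\nAU Smith, J\n   Doe, A\n"
    "TI Urban flooding and\n   drainage design\nPY 2021\nER\n\nEF\n"
)
records = parse_wos(SAMPLE)
print(records[0]["AU"])            # ['Smith, J', 'Doe, A']
print(" ".join(records[0]["TI"]))  # Urban flooding and drainage design
```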

Review Questions

When would using quotation marks in Scopus materially change your results compared with not using them?
Describe a filtering strategy (time window, document type, language) that would reduce noise in a bibliometric dataset.
What combination of query techniques (Boolean logic, brackets, phrase variants) would you use to target a subfield like “urban flooding” rather than “urbanization” alone?

Key Points

1. Start with a clear research question and translate it into reliable keywords before searching Scopus or Web of Science.
2. Use Boolean operators (AND/OR) to control whether records must include multiple concepts or any of several related terms.
3. Use double quotation marks so multi-word terms are matched as phrases in indexed fields such as title, abstract, and author keywords, preventing overly broad results.
4. Account for phrase variations (e.g., “supply chain” vs “supply-chain”) using wildcards or explicit variants, together with grouped terms.
5. After retrieving results, narrow by time range (e.g., last 10 years), document type (journal articles), and language (often English).
6. Export only the fields needed for analysis (citation and bibliographic data, plus abstracts and keywords), optionally excluding funding details.
7. Prefer CSV export for Scopus when the next step involves Excel-style screening and processing.

Highlights

Keyword engineering determines whether bibliometric results are usable; broad or unreliable keywords produce noisy datasets.

Quotation marks plus Boolean logic are the main tools for forcing databases to match terms in title/abstract/keywords rather than returning massive, loosely related sets.

Filtering by recency, document type (journals only), and language can shrink thousands of records into an analysis-ready corpus.

Exporting structured fields (citations, bibliography, abstracts, keywords) in CSV format supports downstream bibliometric workflows.

Topics

Bibliometric Data Acquisition
Scopus Export
Web of Science Search
Keyword Strategy
Boolean Querying