
Research Data Analysis with SciSpace Agent: Step‑by‑Step Guide

SciSpace · 6 min read

Based on SciSpace's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

SciSpace Agent supports an end-to-end research workflow: extract structure from PDFs, assess and clean CSV data, run descriptive/inferential statistics, and generate shareable outputs.

Briefing

SciSpace Agent is positioned as an end-to-end workflow tool that turns messy research inputs—PDFs, CSVs, and raw interview transcripts—into structured outputs, cleaned datasets, statistical findings, visualizations, and even publication-ready reports, all with prompts instead of hand-written code. The core value demonstrated is speed plus structure: upload unstructured material, ask for a specific deliverable (JSON metadata, data-quality diagnostics, regression results, qualitative themes), and get back consistent, exportable artifacts that can be shared with collaborators.

The walkthrough begins with a simple test using a PDF research paper about “buy now pay later” transactions. After uploading the document, the agent is prompted to extract a five-sentence plain-language summary, key insights, and a metadata table (including title, authors, year, journal, DOI, keywords). The output is returned as a JSON object, illustrating a repeatable pattern: raw text becomes uniform structured data. The interface also logs actions via clickable “yellow badges,” letting users see what the agent did while reading the PDF.

Next comes the analytic phase, starting with a real-world problem: unclean data. A CSV of data science job postings scraped from “blast dot” is described as deliberately messy. Rather than cleaning immediately, the agent performs a comprehensive data quality assessment—producing a data structure overview (row/column counts, column names, apparent data types), sample previews (first and last rows), and checks for completeness, consistency, accuracy, and formatting. The assessment flags multiple issues: company names are contaminated (with ratings embedded), logical duplicates exist (the same job at the same company), salary formats are highly inconsistent (30+ formats), and missing values are inconsistently encoded. The agent then generates a Python script, runs it, and provides an executive summary plus dashboards and charts summarizing the problems.

After diagnosing the dataset, the agent is prompted to clean it in priority order based on the earlier findings. It writes and executes the cleaning code, applies transformations such as standardizing salary data and handling missing values, and outputs a final cleaned CSV along with a cleaning dashboard showing what changed.

The workflow then shifts to quantitative analysis using a medical study dataset (30 patients; 20 treatment, 10 control) with baseline and follow-up systolic/diastolic blood pressure, medication adherence, side effects, smoking status, BMI, and quality-of-life scores. The agent produces descriptive statistics and baseline comparisons, then moves into inferential testing to confirm whether blood pressure reduction differs between groups. Results highlighted include statistically significant reductions in both systolic and diastolic pressure for the treatment group, with gender differences in response, while medication adherence and smoking status do not significantly affect efficacy in this dataset. It also flags limitations such as small sample size and missing data.

Finally, the agent extends beyond numbers into qualitative research. Using a CSV of interview transcripts, it performs inductive thematic analysis (six primary themes with 18 sub-themes), provides theme definitions and quote-level examples, and generates visualizations like heat maps and network maps. It then runs sentiment analysis with role-based breakdowns, and builds an interactive, publication-oriented web dashboard (HTML/CSS/vanilla JavaScript with Chart.js) for stakeholder sharing. The session closes with an automated LaTeX-ready, production-style research report compiled into downloadable files, reinforcing the “prompt-to-paper” promise.

Overall, the demonstration frames SciSpace Agent as a practical pipeline for research teams: extract structure from unstructured sources, diagnose and repair data quality, run descriptive and inferential statistics, model relationships (including multiple regression), and package findings into shareable visuals and reports—without requiring users to write code themselves.

Cornell Notes

SciSpace Agent is demonstrated as a prompt-driven pipeline that converts research inputs into usable outputs across the full analysis lifecycle. It first extracts structured information from unstructured PDFs into formats like JSON (summary, key insights, and metadata). It then handles messy CSV data by running automated data-quality diagnostics (completeness, consistency, accuracy, formatting), generating and executing Python scripts, and producing dashboards. After cleaning, it performs descriptive and inferential statistics on a medical dataset, highlighting significant blood pressure reductions for treatment and gender-linked response differences while noting limitations like small sample size. For qualitative work, it performs inductive thematic analysis and sentiment analysis on interview transcripts and can compile results into an interactive dashboard and a LaTeX-ready research report.

How does the agent turn a PDF research paper into structured, reusable data?

A PDF is uploaded and a prompt requests a five-sentence plain-language summary, key insights, and a metadata table (title, authors, year, journal, DOI, keywords). The agent returns the results inside a single JSON object, demonstrating a consistent extraction pattern from unstructured text. The interface also provides action logs via clickable “yellow badges,” showing what the agent did while reading the document.
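The exact schema the agent returns is not shown in the summary, but the extraction pattern can be sketched as a JSON object with the requested fields, plus a downstream check that the structure is complete. The field names and placeholder values below are assumptions standing in for real output.

```python
import json

# Hypothetical example of the kind of JSON object the prompt requests;
# field names mirror the metadata table described (title, authors, year,
# journal, DOI, keywords), but the exact schema is an assumption.
raw = """{
  "summary": "Five plain-language sentences about the paper.",
  "key_insights": ["placeholder insight one", "placeholder insight two"],
  "metadata": {
    "title": "Placeholder Title: Buy Now Pay Later Transactions",
    "authors": ["A. Author", "B. Author"],
    "year": 2023,
    "journal": "Placeholder Journal",
    "doi": "10.0000/example",
    "keywords": ["BNPL", "credit", "consumer behavior"]
  }
}"""

record = json.loads(raw)

# A lightweight validation that downstream tooling can rely on:
# every requested metadata field must be present.
required = {"title", "authors", "year", "journal", "doi", "keywords"}
missing = required - record["metadata"].keys()
assert not missing, f"metadata is missing fields: {missing}"
```

The value of the pattern is that every extracted paper lands in the same shape, so a collection of PDFs becomes a uniform dataset.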

What does a “comprehensive data quality assessment” include, and what kinds of problems were found in the job postings CSV?

The agent is prompted to report dataset structure (row/column counts, column names, apparent data types), preview the first and last rows, and evaluate completeness, consistency, and accuracy. It checks missing values (counts and percentages), inconsistent formatting (case/special characters), salary representation (currency units, ranges, formats), company-name duplication/formatting issues, rating ranges (0–5), and outliers such as negative or unrealistic salary values. In the demo, the critical issues were contaminated company names (ratings embedded in names), a 27.2% duplication rate (logical duplicates), and highly inconsistent salary formats (30+ formats), plus missing value encoding problems.
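The checks described map onto a few lines of pandas each. Below is a minimal sketch run on a tiny inline sample; the column names (`job_title`, `company`, `salary`, `rating`) are assumptions standing in for the real job-postings CSV.

```python
import io
import pandas as pd

# Tiny synthetic sample exhibiting the issue types found in the demo.
csv = io.StringIO(
    "job_title,company,salary,rating\n"
    'Data Scientist,Acme Corp 4.2,"$120K-$150K",4.2\n'
    'Data Scientist,Acme Corp 4.2,"$120K-$150K",4.2\n'
    'ML Engineer,Beta Inc,"90000",3.8\n'
    "Analyst,Gamma LLC,,-1\n"
)
df = pd.read_csv(csv)

# Structure overview: row/column counts and apparent dtypes.
print(df.shape, dict(df.dtypes.astype(str)))

# Completeness: missing-value percentage per column.
missing_pct = df.isna().mean() * 100

# Consistency: logical duplicates (same job at the same company).
dup_rate = df.duplicated(subset=["job_title", "company"]).mean() * 100

# Accuracy: ratings outside the expected 0-5 range.
bad_ratings = df[(df["rating"] < 0) | (df["rating"] > 5)]

# Contamination: company names with a rating embedded at the end.
contaminated = df["company"].str.contains(r"\d\.\d$", na=False)
```

Each check yields a concrete number (here a 25% duplicate rate, one out-of-range rating, two contaminated company names), which is what makes the executive summary and dashboards possible.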

How does the workflow shift from diagnosing data problems to producing a cleaned dataset?

After the assessment, the agent is given a cleaning prompt that prioritizes fixes based on the earlier findings. It writes and executes cleaning code rather than relying on manual steps, then outputs a transformed final CSV. The demo emphasizes transformations like parsing and standardizing salary data and handling missing values, along with a cleaning dashboard that documents what was changed.
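The priority-ordered cleaning described could look roughly like the sketch below: strip embedded ratings from company names, drop logical duplicates, then normalize salary strings to a single numeric unit. Column names and the range-to-midpoint convention are assumptions, not the demo's exact code.

```python
import io
import re
import pandas as pd

csv = io.StringIO(
    "job_title,company,salary\n"
    'Data Scientist,Acme Corp 4.2,"$120K-$150K"\n'
    'Data Scientist,Acme Corp 4.2,"$120K-$150K"\n'
    'ML Engineer,Beta Inc,"90000"\n'
)
df = pd.read_csv(csv)

# 1. Fix contaminated company names (trailing rating like " 4.2").
df["company"] = df["company"].str.replace(r"\s+\d\.\d$", "", regex=True)

# 2. Drop logical duplicates (same job at the same company).
df = df.drop_duplicates(subset=["job_title", "company"]).reset_index(drop=True)

# 3. Standardize salaries: collapse formats like "$120K-$150K" to a
#    numeric annual midpoint in USD.
def parse_salary(s):
    nums = [float(n) * (1000 if k else 1)
            for n, k in re.findall(r"(\d+(?:\.\d+)?)\s*([Kk])?", s)]
    return sum(nums) / len(nums) if nums else None

df["salary_usd"] = df["salary"].map(parse_salary)
```

Writing the fixes as code (rather than hand-editing) is what makes the cleaning reproducible and auditable via the dashboard.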

What statistical workflow was used for the medical study dataset, and what conclusions were highlighted?

The agent first generates descriptive statistics (mean/median/mode, standard deviation, min/max for numerical variables) and frequency distributions for categorical variables, then compares baseline characteristics by population. It then performs inferential testing focused on whether blood pressure reduction differs between treatment and control, using baseline and follow-up values per patient. Highlighted results include statistically significant reductions in both systolic and diastolic blood pressure for the treatment group, gender differences in treatment response, and no significant impact from medication adherence or smoking status in this dataset. It also notes limitations such as small sample size (20 treatment vs 10 control) and missing data.
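The between-group comparison described can be sketched as an independent-samples t-test on per-patient blood pressure reduction (baseline minus follow-up). The numbers below are synthetic illustrations, not the study's data, and Welch's test is one reasonable choice given the unequal group sizes.

```python
import numpy as np
from scipy import stats

# Synthetic reduction scores: 20 treated patients, 10 controls,
# mirroring the demo's group sizes (values are made up).
rng = np.random.default_rng(42)
treatment_reduction = rng.normal(loc=12, scale=5, size=20)
control_reduction = rng.normal(loc=2, scale=5, size=10)

# Welch's t-test (no equal-variance assumption) on the reductions.
t_stat, p_value = stats.ttest_ind(
    treatment_reduction, control_reduction, equal_var=False
)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

The same pattern repeats for diastolic pressure and for subgroup comparisons (e.g., by gender); with only 30 patients, the small-sample caveat the agent raises applies to every such test.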

How does the agent handle qualitative research inputs like interview transcripts?

A CSV containing participant IDs, interview duration, and transcript text is uploaded. The agent is prompted for inductive thematic analysis: identify 4–6 main themes with 2–3 sub-themes each, provide theme names and definitions, include direct quote examples, and indicate which participant roles appear in each theme. The demo reports six primary themes and 18 sub-themes, including “growth and development” as the most universal theme, and it generates visualizations such as heat maps and network maps. It then performs sentiment analysis with role-based sentiment distributions and quote-level validation checks.
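Once the agent has labeled each quote's sentiment, the role-based breakdown reduces to a cross-tabulation. The sketch below assumes quote-level labels already exist (in the demo the agent assigns them); the roles and labels are illustrative.

```python
import pandas as pd

# Hypothetical quote-level sentiment labels by participant role.
quotes = pd.DataFrame({
    "participant_role": ["Manager", "Manager", "Engineer",
                         "Engineer", "Engineer"],
    "sentiment": ["positive", "neutral", "positive",
                  "negative", "positive"],
})

# Percentage of each sentiment category within each role.
dist = (
    pd.crosstab(quotes["participant_role"], quotes["sentiment"],
                normalize="index")
    * 100
)
print(dist.round(1))
```

The same table feeds directly into the dashboard's sentiment-distribution charts and role filters described below.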

What does “sharing” look like after analysis—dashboards, exports, and reports?

The agent produces export-ready charts and a cleaning dashboard for data prep. For qualitative analysis, it generates an interactive web-based dashboard (single HTML with embedded CSS/vanilla JavaScript) using Chart.js, including components like sentiment distribution charts, interactive data tables, filtering by sentiment categories and participant roles, and hover/click interactions. For formal write-ups, it can compile findings into a LaTeX-ready, production-style research report and generate downloadable files (including a compiled PDF).

Review Questions

  1. When extracting from a PDF, what specific structured outputs were requested (and in what format) to make the results reusable?
  2. What categories of data quality issues did the agent check for in the job postings CSV, and which two issues were described as most critical?
  3. In the medical study analysis, which variables were reported as significantly affecting treatment response, and which were reported as not significantly impacting efficacy?

Key Points

  1. SciSpace Agent supports an end-to-end research workflow: extract structure from PDFs, assess and clean CSV data, run descriptive/inferential statistics, and generate shareable outputs.

  2. Prompting for structured formats (like JSON) enables consistent metadata and summaries from otherwise unstructured text.

  3. For messy datasets, the agent can run a comprehensive data-quality assessment first—then generate and execute Python code to clean data in priority order.

  4. In the medical study example, the agent highlighted significant blood pressure reductions for treatment and gender-linked response differences while flagging small sample size and missing data as limitations.

  5. For qualitative datasets, the agent performs inductive thematic analysis with hierarchical themes, quote-level examples, and role-based theme mapping.

  6. The agent can produce interactive stakeholder dashboards (HTML/vanilla JavaScript with Chart.js) and compile LaTeX-ready research reports for publication workflows.

  7. Action logs in the UI (e.g., clickable badges) help users audit what the agent did during extraction and analysis.

Highlights

A PDF-to-JSON workflow turns paper text into a structured bundle: five-sentence summary, key insights, and a metadata table (title/authors/journal/DOI/keywords) returned as a single JSON object.
The job postings dataset was diagnosed before cleaning; the agent found contaminated company names (ratings embedded), logical duplicates (27.2%), and 30+ salary formats that blocked reliable salary analysis.
In the medical study, treatment showed statistically significant reductions in both systolic and diastolic blood pressure, with gender differences in response, while medication adherence and smoking status were not significant in this dataset.
Interview transcripts were converted into six primary themes (18 sub-themes) via inductive thematic analysis, then paired with role-based sentiment scoring and export-ready visualizations.
The agent can generate an interactive web dashboard and a LaTeX-ready, publication-style report from the same analysis outputs.

Topics

  • Agent Workflow
  • Data Quality Assessment
  • Statistical Inference
  • Thematic Analysis
  • Interactive Dashboards