
Session 27 - Data Gathering | Data Analysis Process | DSMP 2023

CampusX · 5 min read

Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

The session’s five-step data analysis workflow is: ask questions, gather/prepare data (often the hardest), explore patterns, draw conclusions, and communicate results.

Briefing

Data analysis is framed as a five-step workflow—asking the right questions, wrangling/cleaning and transforming raw data, exploring patterns, drawing conclusions, and communicating results—built on the idea that the hardest part is often getting data into usable shape. The session’s immediate focus is “data gathering,” the first practical step: pulling datasets from multiple sources and formats so later cleaning and analysis don’t collapse under missing values, wrong types, or inconsistent structure.

The course schedule is also addressed directly. Due to illness and missed live classes, the instructor announces that normal classes resume on the 25th, while two earlier sessions (Open Source on the 20th and Web Scraping on the 21st) will be handled via recorded parts. A data-cleaning live session is planned for Wednesday, and tasks from the previous two weeks are promised once health allows. The session itself is positioned as foundational: after Python, pandas, and basic visualization libraries, the next big leap is learning how to import data correctly and reliably.

The “data gathering” portion is broken down into what to read and how to read it. It starts with importing data from common file formats—especially CSV—then extends to text files, Excel sheets, JSON, and SQL tables. For CSV, the emphasis is on mastering pandas’ read_csv function parameters (not just a minimal example), because real datasets often deviate from assumptions: missing column headers, commas vs tabs, encoding mismatches, malformed lines, and incorrect data types. The session highlights practical fixes: specifying column names when headers are wrong, skipping bad rows, setting encoding for non-UTF inputs (including emoji-heavy datasets), controlling dtype to reduce memory use, and parsing date columns into real datetime objects so filtering works.
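The fixes listed above map directly onto read_csv parameters. A minimal sketch, using an inline string as a stand-in for a real headerless file (the column names here are invented for illustration):

```python
import io

import pandas as pd

# Inline stand-in for a headerless CSV with a date column and no header row.
raw = io.StringIO(
    "alice,2023-01-05,120\n"
    "bob,2023-02-10,85\n"
)

df = pd.read_csv(
    raw,
    names=["user", "signup_date", "score"],  # supply headers the file lacks
    dtype={"score": "int32"},                # smaller dtype, less memory
    parse_dates=["signup_date"],             # real datetimes, so filters work
    on_bad_lines="skip",                     # drop malformed rows instead of failing
    # encoding="latin-1" would be passed here for non-UTF-8 files
)

# Date filtering now works because signup_date is datetime64, not str.
recent = df[df["signup_date"] > "2023-01-31"]
```

Passing names implies there is no header row, so the first line is read as data rather than silently consumed as column labels.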

Beyond files, the session covers pulling data from APIs and from websites. APIs are described as an interface that lets one system request structured data from another—illustrated with a movie database workflow where an API key is used to fetch top-rated movies, convert JSON responses into a pandas DataFrame, and paginate through results to build a large dataset. Web scraping is treated as a fallback when no API exists: the approach uses HTTP requests plus BeautifulSoup to parse HTML, then extracts repeated elements (company names, ratings, review counts, employee counts, and headquarters) by inspecting the page structure. A key operational detail is avoiding request blocks: adding browser-like headers to prevent “bad request” or bot rejection.

Finally, the session connects gathering to the broader pipeline by showing how to export results and move data between systems. pandas DataFrames can be exported to CSV, Excel, HTML, JSON, and SQL, including writing to multiple SQL tables and appending large datasets. The overall message is that robust data gathering—handling formats, types, pagination, encoding, and extraction reliability—sets the foundation for everything that follows in data cleaning, exploratory analysis, and decision-ready communication.

Cornell Notes

The session treats data analysis as a five-step cycle: ask questions, gather and prepare data (often the most time-consuming part), explore patterns, draw conclusions, and communicate results. It then drills into “data gathering,” showing how to import datasets from CSV, text, Excel, JSON, and SQL using pandas, and how to handle real-world issues like missing headers, malformed lines, encoding problems, wrong dtypes, and date parsing. It also covers two ways to fetch data beyond local files: APIs (structured JSON responses) and web scraping (HTML extraction with BeautifulSoup and browser-like headers). The practical takeaway is that reliable ingestion—correct structure and types—makes later cleaning and analysis far easier.

Why is “data gathering” treated as foundational rather than a preliminary chore?

Because later steps—cleaning, exploratory analysis, and modeling—depend on the dataset being structurally consistent. The session frames data preparation as the “wrangling” stage: raw inputs often contain errors, missing values, duplicates, and wrong data types. If ingestion fails (bad encoding, malformed rows, incorrect headers, or dates left as strings), downstream analysis becomes harder or incorrect. The workflow’s five steps are designed so gathering and preparation happen early, before conclusions are drawn.

What are the most common ingestion problems with CSV/text files, and how does pandas help address them?

The session highlights: (1) missing or incorrect column headers (the first row becomes a data row), fixed by passing column names; (2) malformed lines with inconsistent column counts, fixed by skipping bad rows (on_bad_lines="skip" in current pandas; older versions used error_bad_lines); (3) encoding mismatches (e.g., emoji or Hindi datasets), fixed by specifying the correct encoding; (4) memory waste from wrong dtypes, fixed by supplying a dtype mapping; and (5) date columns arriving as strings, fixed by parsing them into datetime with parse_dates so filtering works correctly.

How does the session distinguish APIs from web scraping in practice?

APIs provide structured data via an endpoint, typically returning JSON. The session demonstrates fetching top-rated movies using an API key, converting the JSON list into a pandas DataFrame, and paginating through results to build a larger dataset. Web scraping is used when no API exists: it fetches HTML pages and extracts data by parsing the DOM with BeautifulSoup. It also requires handling anti-bot behavior by sending browser-like headers to avoid request rejection.

What does “pagination” mean in the API workflow shown, and why is it necessary?

The API returns results in pages. The session shows using a loop across page numbers (e.g., from page 1 up to the total pages) to fetch batches of movies, convert each batch into a temporary DataFrame, and append it to a master DataFrame. This is how the dataset grows from a small sample (like 20 movies per page) to the full set (thousands of movies).

How does the session connect ingestion to export and data portability?

After building or transforming DataFrames, the session shows exporting them to multiple formats: CSV, Excel, HTML, JSON, and SQL. For SQL, it demonstrates creating a database (via phpMyAdmin), then using pandas’ to_sql to write DataFrames into tables, appending when tables already exist. This closes the loop: data gathered from files/APIs/scraping can be stored and shared for later analysis or downstream systems.

Review Questions

  1. What five-step data analysis process is used as the session’s backbone, and where does “data gathering” fit within it?
  2. List at least four real-world issues that can break ingestion (e.g., encoding, headers, malformed lines, dtypes, date parsing) and describe the corresponding fix mentioned.
  3. In the API-based movie dataset workflow, how are JSON responses converted into a pandas DataFrame and how does the code ensure it collects all pages of results?

Key Points

  1. The session’s five-step data analysis workflow is: ask questions, gather/prepare data (often the hardest), explore patterns, draw conclusions, and communicate results.
  2. Data gathering focuses on importing from CSV, text, Excel, JSON, and SQL, with emphasis on pandas read_* functions and their parameters.
  3. Real datasets frequently break assumptions—so ingestion must handle missing headers, malformed lines, encoding mismatches, and incorrect dtypes.
  4. Date columns should be parsed into datetime objects during ingestion so filtering and time-based logic work reliably.
  5. When data isn’t directly available, APIs provide structured JSON responses; web scraping extracts data from HTML when APIs aren’t offered.
  6. Pagination is essential for APIs that return results in chunks; each page’s results must be appended into a master DataFrame.
  7. pandas DataFrames can be exported to multiple formats (including SQL) to move analysis outputs into other tools and workflows.

Highlights

Data analysis is framed as a five-step loop, but the session stresses that data preparation/gathering is where most failures and delays happen.
CSV ingestion isn’t “one line”—it requires handling headers, encoding, malformed rows, dtype control, and parse_dates for correct downstream behavior.
APIs and scraping are treated as two complementary ingestion routes: structured JSON via endpoints versus HTML extraction via BeautifulSoup with browser-like headers.
The movie API example demonstrates building a large dataset by paginating through results and appending DataFrames.
The session closes the loop by showing exports to CSV/Excel/HTML/JSON and SQL, including appending to existing tables.
