Session 27 - Data Gathering | Data Analysis Process | DSMP 2023
Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Data analysis is framed as a five-step workflow (asking the right questions, wrangling/cleaning and transforming raw data, exploring patterns, drawing conclusions, and communicating results), built on the idea that the hardest part is often getting data into usable shape. The session’s immediate focus is “data gathering,” the first practical step: pulling datasets from multiple sources and formats so later cleaning and analysis don’t collapse under missing values, wrong types, or inconsistent structure.
The course schedule is also addressed directly. Due to illness and missed live classes, the instructor announces that normal classes resume on the 25th, while two earlier sessions (Open Source on the 20th and Web Scraping on the 21st) will be handled via recorded parts. A data-cleaning live session is planned for Wednesday, and tasks from the previous two weeks are promised once health allows. The session itself is positioned as foundational: after Python, pandas, and basic visualization libraries, the next big leap is learning how to import data correctly and reliably.
The “data gathering” portion is broken down into what to read and how to read it. It starts with importing data from common file formats, especially CSV, then extends to text files, Excel sheets, JSON, and SQL tables. For CSV, the emphasis is on mastering the parameters of pandas’ read_csv function (not just a minimal example), because real datasets often deviate from assumptions: missing column headers, comma vs. tab delimiters, encoding mismatches, malformed lines, and incorrect data types. The session highlights practical fixes: specifying column names when headers are missing or wrong, skipping bad rows, setting the encoding for non-UTF-8 inputs (including emoji-heavy datasets), controlling dtype to reduce memory use, and parsing date columns into real datetime objects so filtering works.
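A minimal sketch of the read_csv options described above, assuming hypothetical file names and columns rather than the datasets used in the session:

```python
import pandas as pd

# File with no header row and tab-separated values: supply names explicitly.
df = pd.read_csv(
    "movies.tsv",
    sep="\t",                                   # tab delimiter, not a comma
    header=None,
    names=["title", "genre", "rating", "release_date"],
)

# File with a non-UTF-8 encoding and a few malformed lines.
df = pd.read_csv(
    "reviews.csv",
    encoding="latin-1",       # avoids UnicodeDecodeError on non-UTF-8 input
    on_bad_lines="skip",      # drop rows with the wrong number of fields
)

# Control dtypes to cut memory use and parse dates into real datetime objects.
df = pd.read_csv(
    "sales.csv",
    dtype={"store_id": "int32", "units": "int16"},
    parse_dates=["order_date"],
)

# Date filtering works because order_date is datetime64, not a string column.
recent = df[df["order_date"] > "2023-01-01"]
```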
Beyond files, the session covers pulling data from APIs and from websites. APIs are described as an interface that lets one system request structured data from another—illustrated with a movie database workflow where an API key is used to fetch top-rated movies, convert JSON responses into a pandas DataFrame, and paginate through results to build a large dataset. Web scraping is treated as a fallback when no API exists: the approach uses HTTP requests plus BeautifulSoup to parse HTML, then extracts repeated elements (company names, ratings, review counts, employee counts, and headquarters) by inspecting the page structure. A key operational detail is avoiding request blocks: adding browser-like headers to prevent “bad request” or bot rejection.
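A sketch of the paginated API pattern, assuming a TMDB-style top-rated-movies endpoint; the URL, field names, page count, and API key are placeholders rather than the exact values used in the session:

```python
import pandas as pd
import requests

API_KEY = "YOUR_API_KEY"
URL = "https://api.themoviedb.org/3/movie/top_rated"

frames = []
for page in range(1, 6):                       # loop over the pages the API exposes
    resp = requests.get(URL, params={"api_key": API_KEY, "page": page})
    results = resp.json()["results"]           # list of movie dicts for this page
    frames.append(pd.DataFrame(results))

movies = pd.concat(frames, ignore_index=True)  # one master DataFrame across pages
```

And a sketch of the scraping fallback with browser-like headers; the URL and CSS classes are illustrative, not the actual page structure inspected in the session:

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}  # look like a browser
page = requests.get("https://www.example.com/companies", headers=headers)

soup = BeautifulSoup(page.text, "html.parser")
rows = []
for card in soup.find_all("div", class_="company-card"):   # repeated element per company
    rows.append({
        "name": card.find("h2").get_text(strip=True),
        "rating": card.find("span", class_="rating").get_text(strip=True),
    })

companies = pd.DataFrame(rows)
```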
Finally, the session connects gathering to the broader pipeline by showing how to export results and move data between systems. pandas DataFrames can be exported to CSV, Excel, HTML, JSON, and SQL, including writing to multiple SQL tables and appending large datasets. The overall message is that robust data gathering—handling formats, types, pagination, encoding, and extraction reliability—sets the foundation for everything that follows in data cleaning, exploratory analysis, and decision-ready communication.
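A sketch of the export step, using a local SQLite database and placeholder file and table names rather than the targets used in the session:

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"title": ["Movie A", "Movie B"], "rating": [8.1, 7.9]})

df.to_csv("movies.csv", index=False)
df.to_excel("movies.xlsx", index=False)        # requires openpyxl
df.to_json("movies.json", orient="records")
df.to_html("movies.html", index=False)

# Appending to a SQL table lets repeated runs (or paginated chunks) accumulate
# rows instead of overwriting the table each time.
conn = sqlite3.connect("movies.db")
df.to_sql("top_rated", conn, if_exists="append", index=False)
conn.close()
```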
Cornell Notes
The session treats data analysis as a five-step cycle: ask questions, gather and prepare data (often the most time-consuming part), explore patterns, draw conclusions, and communicate results. It then drills into “data gathering,” showing how to import datasets from CSV, text, Excel, JSON, and SQL using pandas, and how to handle real-world issues like missing headers, malformed lines, encoding problems, wrong dtypes, and date parsing. It also covers two ways to fetch data beyond local files: APIs (structured JSON responses) and web scraping (HTML extraction with BeautifulSoup and browser-like headers). The practical takeaway is that reliable ingestion—correct structure and types—makes later cleaning and analysis far easier.
- Why is “data gathering” treated as foundational rather than a preliminary chore?
- What are the most common ingestion problems with CSV/text files, and how does pandas help address them?
- How does the session distinguish APIs from web scraping in practice?
- What does “pagination” mean in the API workflow shown, and why is it necessary?
- How does the session connect ingestion to export and data portability?
Review Questions
- What five-step data analysis process is used as the session’s backbone, and where does “data gathering” fit within it?
- List at least four real-world issues that can break ingestion (e.g., encoding, headers, malformed lines, dtypes, date parsing) and describe the corresponding fix mentioned.
- In the API-based movie dataset workflow, how are JSON responses converted into a pandas DataFrame and how does the code ensure it collects all pages of results?
Key Points
1. The session’s five-step data analysis workflow is: ask questions, gather/prepare data (often the hardest part), explore patterns, draw conclusions, and communicate results.
2. Data gathering focuses on importing from CSV, text, Excel, JSON, and SQL, with emphasis on pandas read_* functions and their parameters.
3. Real datasets frequently break assumptions, so ingestion must handle missing headers, malformed lines, encoding mismatches, and incorrect dtypes.
4. Date columns should be parsed into datetime objects during ingestion so filtering and time-based logic work reliably.
5. When data isn’t directly available, APIs provide structured JSON responses; web scraping extracts data from HTML when APIs aren’t offered.
6. Pagination is essential for APIs that return results in chunks; each page’s results must be appended into a master DataFrame.
7. pandas DataFrames can be exported to multiple formats (including SQL) to move analysis outputs into other tools and workflows.