Python Pandas Tutorial (Part 1): Getting Started with Data Analysis - Installation and Loading Data
Based on Corey Schafer's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Pandas is positioned as a practical entry point for Python-based data analysis—especially for working with CSV and Excel-style datasets—because it reads data quickly and organizes it into a structure that’s easy to manipulate. The core takeaway from this installment is a complete “from zero to loaded” workflow: install pandas, set up a Jupyter notebook environment, download a real dataset, and load it into a DataFrame so analysis can begin.
The setup starts with installation commands. In a clean virtual environment, pandas is installed with pip install pandas, followed by Jupyter with pip install jupyterlab (the transcript garbles both package names). Jupyter is treated as optional for learning, but it's recommended for day-to-day work because it renders DataFrames as interactive tables in the browser, making it easier to inspect columns and values than a plain text editor.
For the dataset, the workflow uses the Stack Overflow Developer Survey results, downloaded in CSV form. The tutorial points viewers to a download page (linked in the description) where CSV files are available by year; it specifically downloads the 2019 survey CSV. After downloading, the file is unzipped and placed into a project folder (created on the desktop as “pandas demo” in the walkthrough), where the notebook will run. The directory is expected to include not only the main survey results CSV but also a schema CSV and often a README explaining what each file contains.
Once Jupyter is running (started from the terminal with the jupyter notebook command, leaving the terminal open), a new notebook is created and named. The first code step imports pandas under its conventional alias: import pandas as pd. The next step loads the survey CSV into a DataFrame with df = pd.read_csv(<path_to_csv>). Evaluating df directly in a Jupyter cell renders an interactive table, with no separate print call required.
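A minimal sketch of that loading step. Since the full survey download isn't included here, the snippet writes a tiny stand-in CSV first; the file name survey_sample.csv and its columns are illustrative, not the survey's real schema:

```python
import pandas as pd

# Stand-in for the downloaded survey CSV, so the sketch is self-contained.
with open("survey_sample.csv", "w") as f:
    f.write("Respondent,Country,YearsCode\n")
    f.write("1,United States,5\n")
    f.write("2,India,3\n")

# Load the CSV into a DataFrame, exactly as in the tutorial.
df = pd.read_csv("survey_sample.csv")

# In Jupyter, evaluating `df` in a cell renders an interactive table;
# in a plain script, print() shows a text version.
print(df)
```

In the actual walkthrough, the argument to read_csv would be the path to the unzipped survey results file inside the project folder.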
From there, the tutorial focuses on quick ways to understand what was loaded. It notes that Jupyter initially shows only a subset of columns (20 by default), even though the dataset contains 85. To confirm the dimensions, it uses df.shape, which reports 88,883 rows and 85 columns. It then uses df.info() to display the row/column counts along with each column's data type, highlighting that many columns are strings (dtype object) alongside numeric types such as int64 and float64.
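The same inspection calls can be sketched on a small hand-built frame (the column names and values below are illustrative; on the real survey data, shape would report (88883, 85)):

```python
import pandas as pd

# A small frame mixing string and numeric columns, like the survey data.
df = pd.DataFrame({
    "Respondent": [1, 2, 3],
    "Country": ["United States", "India", "Germany"],
    "ConvertedComp": [61000.0, None, 55000.0],
})

# .shape is a (rows, columns) tuple.
print(df.shape)  # (3, 3)

# .info() prints row/column counts plus each column's dtype:
# strings show up as 'object', numbers as int64 / float64.
df.info()
```

Note that a numeric column containing missing values (the None above) is stored as float64, which is one reason float dtypes show up so often in survey data.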
To make inspection manageable, the notebook adjusts display settings with pd.set_option, raising display.max_columns to 85 so every column name can be viewed. It also loads the accompanying schema CSV into a separate DataFrame (schema_df) so each survey column can be mapped to its question text. Finally, it demonstrates the common preview methods for large datasets: df.head(n) for the first rows and df.tail(n) for the last, which helps validate filters and transformations without printing tens of thousands of records.
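Those three inspection tools can be sketched together. The schema_df contents below are a tiny illustrative stand-in (the real schema ships as a separate CSV next to the results file), and the 100-row df exists only to make head/tail visible:

```python
import pandas as pd

# Show up to 85 columns instead of Jupyter's truncated default of 20.
pd.set_option("display.max_columns", 85)

# The schema CSV maps each survey column name to its full question text;
# in the tutorial this would come from pd.read_csv on the schema file.
schema_df = pd.DataFrame({
    "Column": ["Hobbyist", "Country"],
    "QuestionText": [
        "Do you code as a hobby?",
        "In which country do you currently reside?",
    ],
})

# A stand-in "large" frame to preview.
df = pd.DataFrame({"Respondent": range(1, 101)})

print(df.head(10))  # first 10 rows
print(df.tail(10))  # last 10 rows
```

head() and tail() default to 5 rows when called without an argument, which is usually enough to sanity-check a load or a filter.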
Overall, the installment matters because it turns pandas from an abstract library into a working analysis environment: real data is downloaded, loaded into a DataFrame, inspected for shape and types, and cross-referenced with schema metadata—setting up the next steps in DataFrame manipulation and data typing.
Cornell Notes
This lesson walks through a practical setup for pandas: install pandas and Jupyter, download the Stack Overflow Developer Survey (2019) as a CSV, and load it into a pandas DataFrame. After importing pandas as pd, the dataset is read with pd.read_csv, then inspected using df.shape and df.info() to confirm row/column counts and data types. Because Jupyter initially displays only a limited number of columns, display options are adjusted so all 85 columns can be viewed. The schema CSV is also loaded to map each column name to its corresponding survey question, and preview helpers like df.head(n) and df.tail(n) are used to inspect subsets of large data.
What is the end-to-end workflow for getting a real CSV dataset into pandas for analysis?
How can you verify how much data was loaded after reading a CSV?
Why might not all columns appear when printing a DataFrame in Jupyter, and how is that handled?
What role does the schema CSV play, and how does it help interpret survey data?
How do head() and tail() help when datasets are too large to print fully?
Review Questions
- After loading a CSV into df, which two commands would you use to check both its dimensions and its column data types?
- How does the schema DataFrame help interpret columns in the survey results dataset?
- What display setting is adjusted in Jupyter to show all 85 columns, and why is it needed?
Key Points
1. Install pandas and Jupyter so DataFrames can be loaded and inspected interactively in a browser-based notebook environment.
2. Download the Stack Overflow Developer Survey CSV (2019 here), unzip it, and place the resulting files into a project folder used by the notebook.
3. Load the CSV into a DataFrame with df = pd.read_csv(<path_to_csv>) after importing pandas as pd.
4. Confirm dataset size with df.shape and inspect column types with df.info().
5. Adjust Jupyter display settings (e.g., display.max_columns) when only a subset of columns appears by default.
6. Use the schema CSV to map each survey column name to its question text, creating a schema_df reference.
7. Preview large datasets with df.head(n) and df.tail(n) instead of printing everything.