
Python Pandas Tutorial (Part 1): Getting Started with Data Analysis - Installation and Loading Data

Corey Schafer · 5 min read

Based on Corey Schafer's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Install pandas and Jupyter so DataFrames can be loaded and inspected interactively in a browser-based notebook environment.

Briefing

Pandas is positioned as a practical entry point for Python-based data analysis—especially for working with CSV and Excel-style datasets—because it reads data quickly and organizes it into a structure that’s easy to manipulate. The core takeaway from this installment is a complete “from zero to loaded” workflow: install pandas, set up a Jupyter notebook environment, download a real dataset, and load it into a DataFrame so analysis can begin.

The setup starts with installation commands. In a clean virtual environment, pandas is installed with pip install pandas, and Jupyter is added with pip install jupyterlab (the transcript garbles these as “pianist” and “Jupiter lab”). Jupyter is treated as optional for learning, but it’s recommended for day-to-day work because it renders DataFrames as interactive tables in the browser, making it easier to inspect columns and values than a plain text editor.
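Assuming pip and an activated virtual environment, the two install commands (with the transcript's misspellings corrected) look like:

```shell
# Install pandas into the active virtual environment
pip install pandas

# Optional but recommended: JupyterLab for browser-based notebooks
pip install jupyterlab
```

JupyterLab can be skipped entirely if you prefer to follow along in a plain script or REPL.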

For the dataset, the workflow uses the Stack Overflow Developer Survey results, downloaded in CSV form. The tutorial points viewers to a download page (linked in the description) where CSV files are available by year; it specifically downloads the 2019 survey CSV. After downloading, the file is unzipped and placed into a project folder (created on the desktop as “pandas demo” in the walkthrough), where the notebook will run. The directory is expected to include not only the main survey results CSV but also a schema CSV and often a README explaining what each file contains.

Once Jupyter is running (started from the terminal with the jupyter notebook command, leaving the terminal open), a new notebook is created and named. The first code step imports pandas using the conventional alias: import pandas as pd. The next step loads the survey CSV into a DataFrame with df = pd.read_csv(<path_to_csv>). Evaluating df directly in Jupyter displays an interactive table rather than requiring a separate print call.
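As a minimal sketch of those two steps, using a tiny in-memory CSV as a stand-in for the downloaded survey file (the column names below are invented for illustration):

```python
import io
import pandas as pd  # conventional lowercase alias

# Stand-in for the survey CSV: a small in-memory example
csv_text = io.StringIO("Respondent,Hobbyist\n1,Yes\n2,No\n")
df = pd.read_csv(csv_text)  # in the tutorial, the path to the downloaded CSV goes here

# In Jupyter, leaving `df` as the last expression in a cell renders it as an
# interactive table; in a plain script, print it instead.
print(df)
```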

From there, the tutorial focuses on quick ways to understand what was loaded. It notes that Jupyter initially shows only a subset of columns (defaulting to 20), even though the dataset contains 85 columns. To confirm dataset dimensions, it uses df.shape, reporting 88,883 rows and 85 columns. It then uses df.info() to display row/column counts and the data type of each column, highlighting that many columns are strings (object dtype) while others carry numeric types such as int64 and float64.
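A minimal illustration of both checks on a small stand-in DataFrame (the real survey reports (88883, 85)):

```python
import io
import pandas as pd

# Three-row stand-in for the 88,883-row survey DataFrame; columns are invented
df = pd.read_csv(io.StringIO(
    "Age,Country,Salary\n25,US,70000.0\n30,DE,\n41,IN,52000.0\n"
))

print(df.shape)   # a (rows, columns) tuple, here (3, 3)
df.info()         # entry count, per-column non-null counts, and dtypes
print(df.dtypes)  # Age: int64, Country: object (string), Salary: float64
```

Note that a numeric column containing missing values (Salary above) comes back as float64 rather than int64.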

To make inspection manageable, the notebook adjusts display settings using pd.set_option to increase the maximum displayed columns (to 85) so all column names can be viewed. It also loads the accompanying schema CSV into a separate DataFrame (schema_df) so each survey column can be mapped to its question text. Finally, it demonstrates common preview methods for large datasets: df.head(n) for the first rows and df.tail(n) for the last rows, which helps validate filters and transformations without printing tens of thousands of records.
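The display and schema steps might look like this; the schema contents below are a made-up one-row stand-in for the real schema file:

```python
import io
import pandas as pd

# Widen Jupyter's truncated column display so all 85 survey columns are visible
pd.set_option('display.max_columns', 85)

# Load the accompanying schema CSV into its own DataFrame
# (invented miniature stand-in shown here)
schema_df = pd.read_csv(io.StringIO(
    "Column,QuestionText\nHobbyist,Do you code as a hobby?\n"
))

print(pd.get_option('display.max_columns'))  # 85
print(schema_df)
```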

Overall, the installment matters because it turns pandas from an abstract library into a working analysis environment: real data is downloaded, loaded into a DataFrame, inspected for shape and types, and cross-referenced with schema metadata—setting up the next steps in DataFrame manipulation and data typing.

Cornell Notes

This lesson walks through a practical setup for pandas: install pandas and Jupyter, download the Stack Overflow Developer Survey (2019) as a CSV, and load it into a pandas DataFrame. After importing pandas as pd, the dataset is read with pd.read_csv, then inspected using df.shape and df.info() to confirm row/column counts and data types. Because Jupyter initially displays only a limited number of columns, display options are adjusted so all 85 columns can be viewed. The schema CSV is also loaded to map each column name to its corresponding survey question, and preview helpers like df.head(n) and df.tail(n) are used to inspect subsets of large data.

What is the end-to-end workflow for getting a real CSV dataset into pandas for analysis?

The workflow is: (1) install pandas and Jupyter, (2) download the Stack Overflow Developer Survey CSV (2019 in the walkthrough), (3) unzip and place the data files into a project folder, (4) start Jupyter Notebook from the terminal, (5) create a new notebook, (6) import pandas as pd, and (7) load the CSV into a DataFrame with df = pd.read_csv(<csv_path>). Once loaded, the DataFrame can be inspected directly in Jupyter.

How can you verify how much data was loaded after reading a CSV?

Use df.shape to get a tuple of (rows, columns). In the walkthrough, df.shape reports 88,883 rows and 85 columns. Then use df.info() for a more detailed summary: the number of entries, the total columns, and each column’s data type (for example, object for strings, int64 for integers, and float64 for decimals).

Why might not all columns appear when printing a DataFrame in Jupyter, and how is that handled?

Jupyter displays only a limited number of columns by default (the tutorial notes it shows 20 columns initially). To view all 85 columns, the notebook changes a display setting: pd.set_option('display.max_columns', 85). After rerunning the DataFrame display, scrolling shows the full set of column names.

What role does the schema CSV play, and how does it help interpret survey data?

The schema CSV provides the mapping between each survey column name and the human-readable question text. The tutorial loads it into schema_df using pd.read_csv on the schema file path, then uses it as a reference to understand what each column means (e.g., a column like MainBranch or Hobbyist corresponds to a specific survey question). This avoids guessing column meanings when working with many columns.
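One way to use such a mapping, sketched with an invented two-row schema (the real file's column names and question text will differ):

```python
import io
import pandas as pd

# Invented miniature schema: survey column name -> question text
schema_df = pd.read_csv(io.StringIO(
    "Column,QuestionText\n"
    "MainBranch,Which of the following best describes you?\n"
    "Hobbyist,Do you code as a hobby?\n"
))

# Index by column name so a question can be looked up directly
schema_df = schema_df.set_index('Column')
question = schema_df.loc['Hobbyist', 'QuestionText']
print(question)  # Do you code as a hobby?
```

Setting the index to the column-name field turns the schema into a simple lookup table.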

How do head() and tail() help when datasets are too large to print fully?

Instead of printing the entire DataFrame (tens of thousands of rows), preview methods show manageable slices. df.head() returns the first five rows by default, and df.head(10) returns the first ten. Similarly, df.tail() returns the last five rows by default, and df.tail(10) returns the last ten. These previews help confirm that loading and later filtering steps behave as expected.
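A short sketch of both previews on a generated 100-row stand-in (the survey itself has 88,883 rows):

```python
import pandas as pd

# 100-row stand-in DataFrame with a single invented column
df = pd.DataFrame({'Respondent': range(1, 101)})

print(df.head())    # first 5 rows by default
print(df.head(10))  # first 10 rows
print(df.tail())    # last 5 rows
print(df.tail(10))  # last 10 rows
```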

Review Questions

  1. After loading a CSV into df, which two commands would you use to check both dimensions and column data types?
  2. How does the schema DataFrame help interpret columns in the survey results dataset?
  3. What display setting is adjusted in Jupyter to show all 85 columns, and why is it needed?

Key Points

  1. Install pandas and Jupyter so DataFrames can be loaded and inspected interactively in a browser-based notebook environment.

  2. Download the Stack Overflow Developer Survey CSV (2019 here), unzip it, and place the resulting files into a project folder used by the notebook.

  3. Load the CSV into a DataFrame with df = pd.read_csv(<path_to_csv>) after importing pandas as pd.

  4. Confirm dataset size with df.shape and inspect column types with df.info().

  5. Adjust Jupyter display settings (e.g., display.max_columns) when only a subset of columns appears by default.

  6. Use the schema CSV to map each survey column name to its question text, creating a schema_df reference.

  7. Preview large datasets with df.head(n) and df.tail(n) instead of printing everything.

Highlights

A complete setup path is demonstrated: install pandas + Jupyter, download real survey CSV data, and load it into a DataFrame ready for analysis.
df.shape and df.info() provide immediate validation of row/column counts and data types after reading the CSV.
Jupyter’s default column display limit can hide most columns; setting display.max_columns to 85 reveals the full schema.
Loading the survey schema CSV turns cryptic column names into readable question text for interpretation.
df.head(n) and df.tail(n) offer quick sanity checks without flooding the notebook with tens of thousands of rows.

Topics

  • Pandas Installation
  • Jupyter Notebook Setup
  • Loading CSV Data
  • DataFrame Inspection
  • Survey Schema Mapping