Python Pandas Tutorial (Part 2): DataFrame and Series Basics

TL;DR

Treat a DataFrame as a 2D table: rows are records and columns are fields.

Briefing

Pandas’ core data structures, DataFrame (2D) and Series (1D), become much easier to work with once they’re treated as containers of rows and columns rather than “mysterious pandas objects.” A DataFrame behaves like a table: each row is one record (in the example, one survey respondent), and each column is a field (one survey question). A Series behaves like a single column: when one column is pulled from a DataFrame, the result is a Series with its own index and its own set of methods.

The tutorial starts by grounding DataFrames in a real dataset: survey results are loaded into a main DataFrame (DF), and a separate schema DataFrame describes what each question/column means. Calling DF.head() shows the first few rows and makes the table structure visible, while the schema DataFrame clarifies column semantics (for instance, which questionnaire item “hobbyist” corresponds to). This framing matters because it explains why selecting data in pandas feels like slicing a table: columns are fields, rows are records.

To build intuition, the lesson contrasts pandas with plain Python dictionaries. A dictionary of lists can represent multiple records: keys act like column names (e.g., first, last, email), and each list element corresponds to a row. Creating a DataFrame from that dictionary (pd.DataFrame) produces a tabular view with an index on the left (initially 0, 1, 2, …). When a single column is accessed (DF['email']), pandas returns a Series, not a raw list, which the tutorial confirms by checking the object type. The Series is described as “rows of a single column,” and the DataFrame is presented as a container holding multiple Series objects.
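The dictionary-to-DataFrame step can be sketched as follows. This is a minimal illustration, not the tutorial's exact code: the names and email addresses are made-up sample values, and the variable df plays the role of DF in the text.

```python
import pandas as pd

# Hypothetical sample records; the values are invented for illustration.
people = {
    "first": ["Jane", "John", "Ada"],
    "last": ["Doe", "Smith", "Lovelace"],
    "email": ["jane@example.com", "john@example.com", "ada@example.com"],
}

# Keys become column names; each list position becomes one row.
df = pd.DataFrame(people)

# Selecting one column returns a Series, not a plain list.
email = df["email"]
print(isinstance(df, pd.DataFrame))  # True
print(isinstance(email, pd.Series))  # True

# The Series carries the same default index (0, 1, 2) as the DataFrame.
print(list(email.index))  # [0, 1, 2]
```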

From there, the tutorial moves into practical selection patterns. Single-column selection uses bracket notation (DF['column']). Dot notation (DF.column) is shown as an alternative, but the tutorial prefers brackets because column names can collide with DataFrame attributes and methods (e.g., a column named count would be shadowed by the count method). Selecting multiple columns requires an inner list (DF[['last', 'email']]); without it, pandas receives the names as a single tuple key and raises a KeyError. The lesson also shows how to list all columns via DF.columns.
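A short sketch of listing columns and of the two single-column notations, assuming a small made-up DataFrame (df stands in for DF; the sample values are hypothetical):

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "first": ["Jane", "John"],
    "last": ["Doe", "Smith"],
    "email": ["jane@example.com", "john@example.com"],
})

# df.columns holds the column labels in order.
print(list(df.columns))  # ['first', 'last', 'email']

# When no name collision exists, both notations return the same Series.
print(df["email"].equals(df.email))  # True
```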

Row selection introduces two indexers: .iloc for integer-location access and .loc for label-based access. .iloc uses positional integers (DF.iloc[0] for the first row) and can select multiple rows with a list of integers. .loc uses index labels (DF.loc[0] for the row labeled 0). Both indexers accept a second argument for columns: .iloc takes column positions, while .loc takes column names. Selecting several rows of one column returns a Series; selecting several rows of several columns returns a smaller DataFrame.
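The row-and-column combinations described above might look like this. It is a sketch on made-up data (df stands in for DF), using the default 0, 1, 2 index so that labels and positions happen to coincide.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "first": ["Jane", "John", "Ada"],
    "last": ["Doe", "Smith", "Lovelace"],
    "email": ["jane@example.com", "john@example.com", "ada@example.com"],
})

# .iloc: everything is an integer position.
first_row = df.iloc[0]               # Series: the first row
two_rows = df.iloc[[0, 1]]           # DataFrame: rows at positions 0 and 1
email_by_pos = df.iloc[[0, 1], 2]    # Series: column at position 2 ("email")

# .loc: rows by index label, columns by name.
row_by_label = df.loc[0]                        # Series: the row labeled 0
email_by_name = df.loc[[0, 1], "email"]         # Series: one column
subset = df.loc[[0, 1], ["email", "last"]]      # DataFrame: two columns

print(email_by_pos.equals(email_by_name))  # True
print(list(subset.columns))                # ['email', 'last']
```

Note that the column order in the result follows the order of the list passed in, not the order in the original DataFrame.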

Finally, the tutorial applies these ideas to the Stack Overflow survey dataset. It uses DF.shape to confirm scale (nearly 89,000 rows and 85 columns), selects the entire “hobbyist” column (DF['hobbyist']), and uses value_counts() to count “yes” vs “no” responses. It then reads a single respondent’s answer with DF.loc[0, 'hobbyist'], pulls multiple rows for that column, and slices ranges of rows and columns. Unlike ordinary Python slices, .loc slices include the end label, so slicing from hobbyist through employment includes the employment column. The takeaway is that once the DataFrame/Series mental models and the .loc/.iloc rules are clear, extracting exactly the rows and columns needed becomes straightforward.

Cornell Notes

Pandas’ DataFrame is a two-dimensional table (rows and columns), while a Series is one-dimensional data (a single column with an index). Pulling a column from a DataFrame (e.g., DF['email']) returns a Series, and selecting multiple columns (DF[['last','email']]) returns a smaller DataFrame. Row selection uses two indexers: .iloc selects by integer position, and .loc selects by label (and can mix row labels with column names). These tools let you slice survey data efficiently: count responses with value_counts(), pick out specific respondents, and slice column ranges with inclusive .loc end labels. Understanding these mechanics is the foundation for more advanced filtering later.

How should a learner mentally model a pandas DataFrame versus a Series?

A DataFrame is treated as a table: rows represent records (e.g., one survey respondent per row) and columns represent fields/questions (e.g., “hobbyist,” “open source,” etc.). A Series is treated as one column’s worth of data: when a single column is accessed from a DataFrame (DF['email']), pandas returns a Series object. The Series includes its own index (the row labels), while the DataFrame includes both row and column structure.

Why does bracket notation (DF['col']) get preferred over dot notation (DF.col) for column access?

Dot notation can collide with DataFrame methods. If a DataFrame has a method like count and there is also a column named count, DF.count would refer to the method rather than the column. Bracket notation DF['count'] avoids that ambiguity and reliably targets the column name.
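The collision can be reproduced on a tiny made-up DataFrame. The column names fruit and count are hypothetical, chosen so that one of them clashes with the real DataFrame.count method; df stands in for DF.

```python
import pandas as pd

# Hypothetical data with a column deliberately named "count".
df = pd.DataFrame({"fruit": ["apple", "pear"], "count": [3, 5]})

# Dot access finds the DataFrame.count method, not the column.
print(callable(df.count))  # True

# Bracket access always targets the column.
print(list(df["count"]))  # [3, 5]

# A name that does not clash still works with dot access.
print(list(df.fruit))  # ['apple', 'pear']
```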

What’s the practical difference between .iloc and .loc when selecting rows?

.iloc selects by integer location (positional indexing). For example, DF.iloc[0] returns the first row by position. .loc selects by label (index values). With the default RangeIndex, labels often match positions (so DF.loc[0] looks similar), but the key distinction is that .loc is label-based and becomes more important once custom indexes are set.
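The difference becomes visible as soon as the index is not 0, 1, 2. The sketch below assumes a made-up DataFrame whose index is set to email addresses (a hypothetical choice of custom index); df stands in for DF.

```python
import pandas as pd

# Hypothetical data with string labels as the index.
df = pd.DataFrame(
    {"first": ["Jane", "John"], "last": ["Doe", "Smith"]},
    index=["jane@example.com", "john@example.com"],
)

# Position 0 and the label "jane@example.com" refer to the same row.
by_position = df.iloc[0]
by_label = df.loc["jane@example.com"]
print(by_position.equals(by_label))  # True

# There is no row *labeled* 0 any more, so .loc[0] raises a KeyError.
try:
    df.loc[0]
    has_label_zero = True
except KeyError:
    has_label_zero = False
print(has_label_zero)  # False
```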

How do you select multiple columns, and why is the inner list required?

Multiple columns require an inner list: DF[['last','email']]. Without the inner list, DF['last','email'] passes the tuple ('last', 'email') as a single key; pandas looks for one column with that name and raises a KeyError. When multiple columns are selected, the result is a DataFrame (not a Series), because it contains more than one column.
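A minimal demonstration of both cases, using made-up sample values (df stands in for DF):

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "first": ["Jane", "John"],
    "last": ["Doe", "Smith"],
    "email": ["jane@example.com", "john@example.com"],
})

# Inner list: a list of column names selects several columns.
subset = df[["last", "email"]]
print(isinstance(subset, pd.DataFrame))  # True

# No inner list: Python builds the tuple ("last", "email") and pandas
# looks it up as one column label, which does not exist here.
try:
    df["last", "email"]
    raised_key_error = False
except KeyError:
    raised_key_error = True
print(raised_key_error)  # True
```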

How can you count how many survey respondents answered “yes” vs “no” for a question like “hobbyist”?

First select the column as a Series (DF['hobbyist']), then apply value_counts(). The tutorial reports that about 71,000 respondents said “yes” and about 18,000 said “no” for the hobbyist question, using value_counts() to compute the distribution quickly.
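The same call on a five-row stand-in for the survey looks like this. The answers below are invented, so the counts are not the survey's; normalize=True is a standard value_counts option that the summary does not mention, included here as an optional extra. df stands in for DF.

```python
import pandas as pd

# Hypothetical five-row stand-in for the survey's hobbyist column.
df = pd.DataFrame({"hobbyist": ["Yes", "No", "Yes", "Yes", "No"]})

# value_counts() returns a Series mapping each distinct value to its count.
counts = df["hobbyist"].value_counts()
print(counts["Yes"], counts["No"])  # 3 2

# normalize=True reports proportions instead of raw counts.
shares = df["hobbyist"].value_counts(normalize=True)
print(shares["Yes"])  # 0.6
```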

What does slicing do for rows and columns, and what’s special about .loc slicing endpoints?

Slices select contiguous ranges of rows and columns, much like list slices, with one difference: .loc slicing includes the end label, whereas list slices and .iloc slices exclude the end position. The tutorial demonstrates slicing columns from hobbyist through employment using something like hobbyist:employment, and notes that the employment column is included, so the slice returns exactly the range of columns named.
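The endpoint rule can be checked on a small made-up table. The columns and values below are hypothetical; in particular, the real survey has more columns between hobbyist and employment. df stands in for DF.

```python
import pandas as pd

# Hypothetical miniature of the survey table.
df = pd.DataFrame({
    "respondent": [1, 2, 3, 4],
    "hobbyist": ["Yes", "No", "Yes", "Yes"],
    "open_source": ["Never", "Often", "Never", "Sometimes"],
    "employment": ["Full-time", "Student", "Full-time", "Part-time"],
    "country": ["US", "IN", "DE", "BR"],
})

# .loc slices include both endpoints, for rows and for columns.
by_label = df.loc[0:2, "hobbyist":"employment"]
print(by_label.shape)          # (3, 3)
print(list(by_label.columns))  # ['hobbyist', 'open_source', 'employment']

# .iloc slices exclude the end position, like ordinary Python slices.
by_position = df.iloc[0:2, 1:3]
print(by_position.shape)          # (2, 2)
print(list(by_position.columns))  # ['hobbyist', 'open_source']
```

Slices are written without inner brackets, unlike the list form used to pick individual rows or columns.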

Review Questions

When a single column is selected from a DataFrame, what object type is returned, and how does that differ from selecting multiple columns?
Give one example of how .iloc and .loc differ in what they accept for row selection.
Why might DF.col fail for a column named the same as a DataFrame method, and how does DF['col'] prevent that?

Key Points

1. Treat a DataFrame as a 2D table: rows are records and columns are fields.
2. Treat a Series as a 1D column view; selecting DF['column'] returns a Series with its own index.
3. Use bracket notation for column access to avoid collisions with DataFrame methods (e.g., a column named count).
4. Select multiple columns with an inner list (DF[['a','b']]) to return a DataFrame, not a Series.
5. Use .iloc for integer-position row selection and .loc for label-based row selection; both can select columns too.
6. Use value_counts() on a Series to quickly compute distributions like “yes” vs “no.”
7. When slicing with .loc, remember the end label is inclusive, which affects column-range selections.

Highlights

A single-column selection from a DataFrame (DF['email']) returns a Series, while selecting multiple columns (DF[['last','email']]) returns a DataFrame.

.iloc uses integer locations; .loc uses labels—default indexes can make them look similar until custom indexes are introduced.

Bracket notation is safer than dot notation because column names can overlap with DataFrame methods.

value_counts() turns a question column into an instant frequency table (e.g., hobbyist “yes” vs “no”).

.loc slicing includes the end label, making column-range selections like hobbyist through employment straightforward.

Topics

DataFrame Basics
Series Basics
Column Selection
Row Selection
.iloc vs .loc

Mentioned

Corey Schafer
pandas (pd)

Python Pandas Tutorial (Part 2): DataFrame and Series Basics - Selecting Rows and Columns