Python Pandas Tutorial (Part 2): DataFrame and Series Basics - Selecting Rows and Columns
Based on Corey Schafer's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Pandas’ core data structures, DataFrame (2D) and Series (1D), become much easier to work with once they are treated as containers of rows and columns rather than “mysterious pandas objects.” A DataFrame behaves like a table: each row is one record (in the example, one survey respondent), and each column is a field (one survey question). A Series behaves like a single column: when one column is pulled from a DataFrame, the result is a Series that carries its own index and its own methods.
The tutorial starts by grounding DataFrames in a real dataset setup: survey results are loaded into a main DataFrame (DF) and a separate schema DataFrame describes what each question/column means. Using DF.head() reveals the table structure—multiple rows and multiple columns—while the schema DataFrame clarifies column semantics (for instance, identifying what “hobbyist” corresponds to in the questionnaire). This framing matters because it explains why selecting data in pandas feels like slicing a table: columns are fields, rows are records.
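A minimal sketch of that setup, assuming tiny in-memory stand-ins for the survey and schema files. The file contents, the schema column names (Column, QuestionText), and the question wording are invented for illustration; `df` plays the role of DF.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the survey results and schema CSV files.
results_csv = io.StringIO(
    "Respondent,hobbyist,employment\n"
    "1,Yes,Employed full-time\n"
    "2,No,Not employed\n"
    "3,Yes,Employed part-time\n"
)
schema_csv = io.StringIO(
    "Column,QuestionText\n"
    "hobbyist,Do you code as a hobby?\n"
    "employment,What is your employment status?\n"
)

df = pd.read_csv(results_csv)
# Index the schema by column name so a question can be looked up by label.
schema_df = pd.read_csv(schema_csv, index_col="Column")

print(df.head())   # first rows of the table (at most five)
print(df.shape)    # (number of rows, number of columns)

# Look up what the "hobbyist" column means.
question = schema_df.loc["hobbyist", "QuestionText"]
print(question)
```

With real files, the StringIO objects would be replaced by file paths passed to pd.read_csv.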
To build intuition, the lesson contrasts pandas with plain Python dictionaries. A dictionary of lists can represent multiple records: keys act like column names (e.g., first, last, email), and the values at the same position across the lists form one row. Creating a DataFrame from that dictionary (pd.DataFrame) produces a tabular view with an index on the left (initially 0, 1, 2, …). When a single column is accessed (DF['email']), pandas returns a Series, not a raw list, which can be confirmed by checking the object's type. The Series is described as “rows of a single column,” and the DataFrame is presented as a container holding multiple Series objects.
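The dictionary-to-DataFrame step can be sketched as follows. The names and email addresses are placeholder values, and `df` plays the role of DF.

```python
import pandas as pd

# Keys act as column names; each list holds one value per row.
people = {
    "first": ["Corey", "Jane", "John"],
    "last": ["Schafer", "Doe", "Doe"],
    "email": ["corey@example.com", "jane@example.com", "john@example.com"],
}

df = pd.DataFrame(people)
emails = df["email"]      # a single column comes back as a Series

print(df)
print(type(df))           # the DataFrame class
print(type(emails))       # the Series class, not a list
print(list(df.index))     # default integer index labels
```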
From there, the tutorial moves into practical selection patterns. Single-column selection uses bracket notation (DF['column']). Dot notation (DF.column) is shown as an alternative, but brackets are preferred because a column name can collide with a DataFrame attribute or method (e.g., for a column named count, DF.count refers to the count method rather than the column). Selecting multiple columns requires an inner list (DF[['last', 'email']]); without it, DF['last', 'email'] is read as a single key made of both strings, which matches no column and raises a KeyError. The lesson also shows how to list all columns via DF.columns.
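A short sketch of these column-selection patterns, using a small made-up frame in which `df` plays the role of DF. The count column is added here only to demonstrate the name collision.

```python
import pandas as pd

df = pd.DataFrame({
    "first": ["Jane", "John"],
    "last": ["Doe", "Smith"],
    "email": ["jane@example.com", "john@example.com"],
    "count": [3, 7],  # deliberately shares its name with DataFrame.count
})

single = df["email"]             # one column -> Series
subset = df[["last", "email"]]   # list of columns -> DataFrame

bracket_count = df["count"]      # the column, as expected
dot_count = df.count             # the DataFrame method, not the column
print(callable(dot_count))

# Without the inner list, the two strings form one tuple key.
raised_key_error = False
try:
    df["last", "email"]
except KeyError:
    raised_key_error = True
print(raised_key_error)

print(list(df.columns))          # all column names
```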
Row selection introduces two indexers: .iloc for integer-position access and .loc for label-based access. .iloc takes positional integers (DF.iloc[0] for the first row) and can select multiple rows with a list of integers. .loc takes index labels (DF.loc[0] for the row labeled 0) and accepts column names as its second selector, whereas .iloc expects integer positions for columns as well. Both indexers take a row selector followed by an optional column selector; the result is a Series when a single row or a single column is selected, and a smaller DataFrame when lists of rows and columns are given.
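The two indexers can be compared side by side in a small sketch. The data is placeholder content and `df` plays the role of DF. Because the frame uses the default index, labels and positions happen to coincide, so both indexers return the same rows.

```python
import pandas as pd

df = pd.DataFrame({
    "first": ["Corey", "Jane", "John"],
    "last": ["Schafer", "Doe", "Doe"],
    "email": ["corey@example.com", "jane@example.com", "john@example.com"],
})

first_row = df.iloc[0]               # row at position 0 -> Series
two_rows = df.iloc[[0, 1]]           # list of positions -> DataFrame

# .iloc addresses columns by integer position (email is column 2).
email_by_pos = df.iloc[[0, 1], 2]

# .loc addresses rows by label and columns by name.
email_by_label = df.loc[[0, 1], "email"]

# Lists for both rows and columns give a smaller DataFrame,
# with the columns in the order requested.
sub = df.loc[[0, 1], ["email", "last"]]

print(first_row)
print(email_by_pos.equals(email_by_label))
print(sub)
```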
Finally, the tutorial applies these ideas to the Stack Overflow survey dataset. It uses DF.shape to confirm scale (roughly 88,000 rows and 85 columns), selects the entire “hobbyist” column (DF['hobbyist']), and demonstrates value_counts() to count “yes” vs “no” responses. It then reads a single respondent's answer with DF.loc[0, 'hobbyist'], pulls that column for several rows by passing a list of row labels, and uses slicing to grab ranges of rows and columns. Unlike ordinary Python slices, .loc slices include the end label, so slicing from hobbyist through employment includes the employment column. The takeaway is that once the DataFrame/Series mental models and the .loc/.iloc rules are clear, extracting exactly the rows and columns needed becomes straightforward.
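A sketch of these steps on a four-row stand-in for the survey data. The values and the intermediate columns (country, student) are invented for illustration, and `df` plays the role of DF. The .iloc slice at the end is added only as a contrast, since positional slices exclude the end position.

```python
import pandas as pd

df = pd.DataFrame({
    "hobbyist": ["Yes", "No", "Yes", "Yes"],
    "country": ["A", "B", "C", "D"],            # placeholder values
    "student": ["No", "Yes", "No", "No"],
    "employment": ["Full-time", "None", "Part-time", "Full-time"],
})

print(df.shape)                                  # (rows, columns)

counts = df["hobbyist"].value_counts()           # tally of each answer
print(counts)

one = df.loc[0, "hobbyist"]                      # single cell -> scalar
several = df.loc[[0, 1, 2], "hobbyist"]          # several rows, one column

# Label slices include both endpoints: rows 0..2, hobbyist..employment.
block = df.loc[0:2, "hobbyist":"employment"]

# Positional slices exclude the end: rows at positions 0 and 1 only.
pos_block = df.iloc[0:2]

print(block.shape, len(pos_block))
```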
Cornell Notes
Pandas’ DataFrame is a two-dimensional table (rows and columns), while a Series is one-dimensional data (a single column with an index). Pulling a column from a DataFrame (e.g., DF['email']) returns a Series, and selecting multiple columns (DF[['last','email']]) returns a smaller DataFrame. Row selection uses two indexers: .iloc selects by integer position, and .loc selects by label (and can mix row labels with column names). These tools let you slice survey data efficiently—count responses with value_counts(), filter specific respondents, and slice column ranges with inclusive .loc end labels. Understanding these mechanics is the foundation for more advanced filtering later.
How should a learner mentally model a pandas DataFrame versus a Series?
Why does bracket notation (DF['col']) get preferred over dot notation (DF.col) for column access?
What’s the practical difference between .iloc and .loc when selecting rows?
How do you select multiple columns, and why is the inner list required?
How can you count how many survey respondents answered “yes” vs “no” for a question like “hobbyist”?
What does slicing do for rows and columns, and what’s special about .loc slicing endpoints?
Review Questions
- When a single column is selected from a DataFrame, what object type is returned, and how does that differ from selecting multiple columns?
- Give one example of how .iloc and .loc differ in what they accept for row selection.
- Why might DF.col fail for a column named the same as a DataFrame method, and how does DF['col'] prevent that?
Key Points
1. Treat a DataFrame as a 2D table: rows are records and columns are fields.
2. Treat a Series as a 1D column view; selecting DF['column'] returns a Series with its own index.
3. Use bracket notation for column access to avoid collisions with DataFrame methods (e.g., a column named count).
4. Select multiple columns with an inner list (DF[['a','b']]) to return a DataFrame, not a Series.
5. Use .iloc for integer-position row selection and .loc for label-based row selection; both can select columns too.
6. Use value_counts() on a Series to quickly compute distributions like “yes” vs “no.”
7. When slicing with .loc, remember the end label is inclusive, which affects column-range selections.