
Session 15 - Numpy Tricks | Data Science Mentorship Program (DSMP) 2022-23 | Free Session

CampusX·
5 min read

Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Use `np.sort` to keep results as NumPy arrays and control sorting direction with `axis` (row-wise vs column-wise).

Briefing

The session’s core focus is a fast, practical tour of lesser-known NumPy functions—especially those that turn common data-wrangling and analytics tasks into one-liners. Instead of sticking to the usual “sort, reshape, basic indexing” toolkit, the class targets functions that many learners either forget or never learn, arguing that remembering them directly boosts speed and problem-solving power in real projects and interviews.

After outlining the broader DSMP schedule (maths/stats and then machine learning later), the instructor lays out today’s plan: cover roughly 20–25 NumPy functions, using a shared notebook as the working reference. Each function is introduced with a concrete purpose and a small example, with emphasis on what the function returns (for instance, whether it returns a NumPy array vs. a Python list) and how parameters change behavior. The session repeatedly frames these as “hidden” utilities—less famous than core operations, but extremely useful once they’re in muscle memory.

The walkthrough begins with sorting. NumPy’s `np.sort` is highlighted as a way to sort arrays while returning NumPy arrays (not Python lists), with options for sorting along a specified `axis`, choosing sorting algorithm via `kind` (defaulting to quicksort), and controlling order using slicing tricks for descending. Next comes `np.append`, showing how appending behaves differently for 1D vs 2D arrays and how the `axis` parameter determines whether a new row or column is added.

The session then moves into “shape and uniqueness” utilities: `np.concatenate` for joining multiple arrays along rows or columns, `np.unique` for extracting distinct values, and `np.expand_dims` for adding a new dimension—an operation the instructor connects to machine learning workflows where models often expect batched inputs (e.g., converting single samples into batch-like 2D/3D structures).
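A minimal sketch of the joining and uniqueness utilities mentioned above, with illustrative arrays (not taken from the session's notebook):

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# axis=0 stacks vertically (join along rows); axis=1 joins side by side (columns)
rows = np.concatenate((a, b), axis=0)   # shape (4, 2)
cols = np.concatenate((a, b), axis=1)   # shape (2, 4)

# np.unique returns the distinct values, sorted
u = np.unique(np.array([3, 1, 2, 3, 1]))   # array([1, 2, 3])
```

Note that `np.concatenate` requires the arrays to match in shape along every axis except the one being joined.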

For conditional logic and reduction-style queries, the class covers `np.where` (both returning indices/values based on a condition and replacing values using boolean masks), plus `np.argmax`/`np.argmin` to find the index positions of maximum/minimum values along a chosen axis—useful for “where is the peak?” questions in analytics. It also introduces `np.cumsum` and `np.cumprod` for cumulative sums/products along an axis, linking cumulative sums to real-world “running totals” like total subscribers over time.
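The "running total" idea can be sketched in a couple of lines (the subscriber numbers here are made up for illustration):

```python
import numpy as np

# Hypothetical monthly new-subscriber counts
new_subs = np.array([100, 150, 80, 200])

# Running total of subscribers over time
running_total = np.cumsum(new_subs)               # [100, 250, 330, 530]

# Cumulative product works the same way
running_prod = np.cumprod(np.array([1, 2, 3, 4]))  # [1, 2, 6, 24]
```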

Statistical and distribution tools follow. `np.percentile` is used to interpret percentile ranks (including 0%/100% mapping to min/max and the median at 50%), and the instructor notes how percentile-based measures connect to summaries like five-number summaries and box plots. `np.histogram` is presented as a frequency counter over bins (returning counts rather than drawing a plot by default). Finally, `np.corrcoef` is used to compute correlation coefficients (Pearson correlation), with the caution that the diagonal of the result is always 1 (each variable correlated with itself), so the meaningful values are the off-diagonal entries between different variables.

The session also includes array manipulation utilities: `np.isin` for checking whether multiple values exist in an array, `np.flip` for reversing along an axis (or both row/column depending on axis specification), `np.put` for overwriting elements at specified indices (noting it performs in-place changes), and `np.delete` for removing elements by index (including multiple deletions). It closes by adding set operations (`np.union1d`, `np.intersect1d`, `np.setdiff1d`, `np.in1d`) and `np.clip` for constraining values to a range—useful for handling outliers or enforcing valid bounds.

Overall, the session’s message is straightforward: build a practical memory of these NumPy “power” functions and practice mapping them to problem statements, because the same utilities recur across data science, machine learning preprocessing, and interview-style questions.

Cornell Notes

This session is a rapid, practical roundup of NumPy functions that many learners overlook—aimed at making data tasks and interview problems faster. It starts with sorting and array construction (`np.sort`, `np.append`, `np.concatenate`) and then moves into “shape/uniqueness” utilities (`np.unique`, `np.expand_dims`) that matter for ML-style batching. Conditional and query functions (`np.where`, `np.argmax`, `np.argmin`, `np.isin`) are used to filter, locate peaks, and check membership. The session also covers analytics tools like `np.cumsum`, `np.cumprod`, `np.percentile`, `np.histogram`, and `np.corrcoef`, plus manipulation utilities (`np.flip`, `np.put`, `np.delete`, `np.clip`) and set operations (`np.union1d`, `np.intersect1d`, `np.setdiff1d`).

How does `np.sort` differ from sorting a Python list, and how do `axis` and `kind` change the result?

`np.sort` returns a NumPy array, which keeps the workflow inside NumPy (useful when later operations expect NumPy arrays). For 2D arrays, sorting direction is controlled by `axis`: setting `axis=0` sorts column-wise (each column independently), while `axis=1` sorts row-wise. The `kind` parameter selects the sorting algorithm; the default is quicksort, which has a quadratic worst case, and `kind='mergesort'` or `kind='heapsort'` can be chosen instead. Descending order is handled with slicing, i.e., reversing the sorted output with `[::-1]`.
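A short sketch of these options, using illustrative values:

```python
import numpy as np

m = np.array([[3, 1, 2],
              [0, 9, 4]])

by_row = np.sort(m, axis=1)   # each row sorted:    [[1 2 3], [0 4 9]]
by_col = np.sort(m, axis=0)   # each column sorted: [[0 1 2], [3 9 4]]

# kind selects the algorithm (e.g., a stable mergesort instead of quicksort)
stable = np.sort(m, axis=1, kind='mergesort')

# Descending order via the slicing trick
desc = np.sort(np.array([5, 2, 9]))[::-1]   # [9 5 2]
```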

When should `np.append` be used, and why does `axis` matter for 2D arrays?

`np.append` resembles Python's list `append` but works on NumPy arrays and, unlike the list method, returns a new array rather than modifying the input in place. For 1D arrays, it simply adds elements to the end. For 2D arrays, omitting `axis` flattens the result, which may not match the intended "add a new column" or "add a new row" operation. Passing `axis` determines whether the appended data extends along rows or columns. The session demonstrates adding a column by specifying the correct `axis` so the new values align as a column rather than being appended in flattened form.
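The 1D vs 2D behavior can be sketched as follows (values are illustrative):

```python
import numpy as np

a = np.array([1, 2, 3])
appended = np.append(a, 4)      # [1 2 3 4]; a itself is unchanged

m = np.array([[1, 2], [3, 4]])
flat = np.append(m, [5, 6])                 # no axis: result flattens to 1D
new_row = np.append(m, [[5, 6]], axis=0)    # adds a row    -> shape (3, 2)
new_col = np.append(m, [[5], [6]], axis=1)  # adds a column -> shape (2, 3)
```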

What problem does `np.expand_dims` solve, and how is it connected to machine learning inputs?

`np.expand_dims` adds an extra dimension to an array. The instructor explains it as converting shapes like 1D → 2D, or 2D → 3D, by inserting a new axis at a chosen position (`axis`). This matters in ML because many models expect batched inputs (e.g., multiple samples at once). If a model expects a batch dimension, `expand_dims` can turn a single sample into a batch-like structure by adding the missing dimension.
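A minimal sketch of the batching use case (the sample values are invented for illustration):

```python
import numpy as np

sample = np.array([0.1, 0.2, 0.3])        # one sample with 3 features, shape (3,)

batch = np.expand_dims(sample, axis=0)    # batch of one sample, shape (1, 3)
col = np.expand_dims(sample, axis=1)      # column vector, shape (3, 1)
```

`sample[np.newaxis, :]` is an equivalent spelling for the `axis=0` case.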

How do `np.where`, `np.argmax`, and `np.argmin` work together for conditional analysis?

`np.where` supports two common patterns: (1) returning indices/locations where a condition is true (e.g., values greater than 50), and (2) replacing values using a boolean mask (e.g., set all values > 50 to 0 while leaving others unchanged). `np.argmax` and `np.argmin` then answer “where is the extreme?” by returning index positions of the maximum/minimum along a specified `axis`. In 2D, choosing `axis=0` vs `axis=1` changes whether extremes are found per column or per row.
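Both `np.where` patterns, plus the `argmax`/`argmin` axis behavior, can be sketched as:

```python
import numpy as np

a = np.array([10, 60, 30, 80, 20])

idx = np.where(a > 50)[0]          # indices where the condition holds: [1 3]
replaced = np.where(a > 50, 0, a)  # values > 50 become 0: [10 0 30 0 20]

m = np.array([[1, 9],
              [7, 3]])
col_max = np.argmax(m, axis=0)     # per-column max positions: [1 0]
row_max = np.argmax(m, axis=1)     # per-row max positions:    [1 0]
lowest = np.argmin(a)              # index of the overall minimum: 0
```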

What do `np.percentile`, `np.histogram`, and `np.corrcoef` measure, and what are their key outputs?

`np.percentile` computes threshold values at a given percentile (e.g., 100% equals the maximum; 0% equals the minimum; 50% corresponds to the median). `np.histogram` bins numeric data and returns frequency counts per bin (it’s a counting tool rather than a plotting tool by default). `np.corrcoef` returns correlation coefficients (Pearson correlation) between variables; the session notes that correlation between a variable and itself yields 1, and the meaningful cross-correlation comes from pairing different variables.
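A compact sketch of all three, with illustrative data:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
p0 = np.percentile(a, 0)       # minimum -> 1.0
p50 = np.percentile(a, 50)     # median  -> 3.0
p100 = np.percentile(a, 100)   # maximum -> 5.0

# np.histogram returns counts per bin and the bin edges, not a plot
counts, edges = np.histogram(a, bins=[0, 2, 4, 6])   # counts: [1 2 2]

x = np.array([1, 2, 3, 4])
y = np.array([2, 4, 6, 8])     # y is perfectly correlated with x
r = np.corrcoef(x, y)          # 2x2 matrix; diagonal 1, off-diagonal ~1.0 here
```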

How do `np.flip`, `np.put`, `np.delete`, and `np.clip` differ as array-manipulation tools?

`np.flip` reverses elements along an axis (or multiple axes), preserving shape while changing order. `np.put` overwrites elements at specified indices with new values; it performs updates on the array (in-place behavior is emphasized). `np.delete` removes elements by index and returns a new array with those elements removed. `np.clip` constrains values to a range by replacing anything below the lower bound with the lower bound and anything above the upper bound with the upper bound—useful for limiting outliers or enforcing valid ranges.
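The four utilities side by side, on illustrative data (note the copy before `np.put`, since `put` modifies its argument in place):

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

flipped = np.flip(a)             # reversed order: [5 4 3 2 1]

b = a.copy()
np.put(b, [0, 4], [10, 50])      # in-place overwrite: b becomes [10 2 3 4 50]

trimmed = np.delete(a, [1, 3])   # new array without indices 1 and 3: [1 3 5]
bounded = np.clip(a, 2, 4)       # values forced into [2, 4]: [2 2 3 4 4]
```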

Review Questions

  1. Which NumPy functions would you use to (a) find indices where values exceed a threshold, and (b) replace those values while keeping other values unchanged?
  2. For a 2D array, how would you decide whether to use `axis=0` or `axis=1` with `np.argmax` to find row-wise vs column-wise maxima?
  3. Give one example of when `np.expand_dims` is necessary before feeding data into a machine learning model.

Key Points

  1. Use `np.sort` to keep results as NumPy arrays and control sorting direction with `axis` (row-wise vs column-wise).

  2. Prefer `np.concatenate` over manual stacking when joining multiple arrays; choose `axis` to control whether you join rows or columns.

  3. Apply `np.expand_dims` to add missing batch/feature dimensions required by many ML pipelines.

  4. Use `np.where` for both conditional indexing and conditional replacement via boolean masks.

  5. For “peak/lowest location” queries, `np.argmax` and `np.argmin` return index positions along a chosen `axis`.

  6. Track running totals with `np.cumsum` and running products with `np.cumprod` for cumulative analytics.

  7. For constraints and cleanup, combine `np.clip` (range enforcement) with `np.delete`/`np.put` (removal vs overwrite).

Highlights

`np.where` supports two powerful modes: returning indices where a condition is true, or performing value replacement using boolean masks.
`np.argmax`/`np.argmin` answer “where is the extreme?” by returning indices, and `axis` determines whether extremes are computed per row or per column.
`np.expand_dims` is directly tied to ML batching—adding a dimension so models can process single samples as batch-like inputs.
`np.histogram` provides frequency counts per bin (not just a picture), making it ideal for distribution summaries.
`np.put` overwrites values at specified indices, while `np.delete` removes indices entirely—two different operations with different outcomes.

Topics

  • NumPy Sorting
  • Array Shaping
  • Conditional Indexing
  • Statistical Functions
  • Array Manipulation