Session 15 - Numpy Tricks | Data Science Mentorship Program (DSMP) 2022-23 | Free Session
Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
The session’s core focus is a fast, practical tour of lesser-known NumPy functions—especially those that turn common data-wrangling and analytics tasks into one-liners. Instead of sticking to the usual “sort, reshape, basic indexing” toolkit, the class targets functions that many learners either forget or never learn, arguing that remembering them directly boosts speed and problem-solving power in real projects and interviews.
After outlining the broader DSMP schedule (maths/stats and then machine learning later), the instructor lays out today’s plan: cover roughly 20–25 NumPy functions, using a shared notebook as the working reference. Each function is introduced with a concrete purpose and a small example, with emphasis on what the function returns (for instance, whether it returns a NumPy array vs. a Python list) and how parameters change behavior. The session repeatedly frames these as “hidden” utilities—less famous than core operations, but extremely useful once they’re in muscle memory.
The walkthrough begins with sorting. `np.sort` returns a sorted copy as a NumPy array (not a Python list), sorts along a specified `axis`, and accepts a `kind` argument to choose the sorting algorithm (quicksort by default). It has no descending option, so descending order is obtained by reversing the sorted result with a slice such as `[::-1]`. Next comes `np.append`, showing how appending behaves differently for 1D vs 2D arrays: without `axis` the inputs are flattened before appending, while on a 2D array `axis=0` adds a new row and `axis=1` adds a new column.
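A minimal sketch of the sorting and appending behavior described above. The arrays and variable names are illustrative and are not taken from the session notebook.

```python
import numpy as np

a = np.array([3, 1, 2])
sorted_a = np.sort(a)        # ascending copy, still an ndarray: [1, 2, 3]
desc_a = np.sort(a)[::-1]    # descending via a reversed slice: [3, 2, 1]

m = np.array([[9, 1, 8],
              [3, 7, 2]])
rows_sorted = np.sort(m, axis=1)  # sort within each row: [[1, 8, 9], [2, 3, 7]]
cols_sorted = np.sort(m, axis=0)  # sort within each column: [[3, 1, 2], [9, 7, 8]]

# kind selects the algorithm; "stable" is one of the accepted values
stable_sorted = np.sort(a, kind="stable")

flat = np.append(m, [0, 0, 0])                # no axis: result is flattened, shape (9,)
with_row = np.append(m, [[0, 0, 0]], axis=0)  # adds a row, shape (3, 3)
with_col = np.append(m, [[0], [0]], axis=1)   # adds a column, shape (2, 4)
```

Note that with `axis` set, the appended values must already have a matching number of dimensions, which is why the new row and column are written as nested lists.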
The session then moves into “shape and uniqueness” utilities: `np.concatenate` for joining multiple arrays along rows or columns, `np.unique` for extracting distinct values, and `np.expand_dims` for adding a new dimension—an operation the instructor connects to machine learning workflows where models often expect batched inputs (e.g., converting single samples into batch-like 2D/3D structures).
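The three utilities above can be sketched as follows. The sample data is made up for illustration, and the "one sample with three features" framing is an assumption about a typical ML input rather than the session's exact example.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
joined_rows = np.concatenate((a, b), axis=0)  # stack vertically, shape (4, 2)
joined_cols = np.concatenate((a, b), axis=1)  # stack side by side, shape (2, 4)

values = np.array([3, 1, 3, 2, 1])
distinct = np.unique(values)                       # sorted distinct values: [1, 2, 3]
_, counts = np.unique(values, return_counts=True)  # occurrences of each: [2, 1, 2]

sample = np.array([0.5, 1.5, 2.5])         # one sample with 3 features, shape (3,)
batch = np.expand_dims(sample, axis=0)     # batch of one sample, shape (1, 3)
column = np.expand_dims(sample, axis=1)    # column vector, shape (3, 1)
```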
For conditional logic and reduction-style queries, the class covers `np.where` (both returning indices/values based on a condition and replacing values using boolean masks), plus `np.argmax`/`np.argmin` to find the index positions of maximum/minimum values along a chosen axis—useful for “where is the peak?” questions in analytics. It also introduces `np.cumsum` and `np.cumprod` for cumulative sums/products along an axis, linking cumulative sums to real-world “running totals” like total subscribers over time.
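A short sketch of these conditional and cumulative functions. The `scores` and `new_subscribers` arrays are hypothetical data chosen to keep the results easy to check by hand.

```python
import numpy as np

scores = np.array([[10, 55, 30],
                   [70, 20, 90]])

# One-argument form: a tuple of index arrays, one per dimension
row_idx, col_idx = np.where(scores > 50)     # rows [0, 1, 1], cols [1, 0, 2]

# Three-argument form: replace where the condition holds, keep the rest
replaced = np.where(scores > 50, 0, scores)  # [[10, 0, 30], [0, 20, 0]]

peak_flat = np.argmax(scores)             # index into the flattened array: 5
peak_per_col = np.argmax(scores, axis=0)  # row index of each column's max: [1, 0, 1]
peak_per_row = np.argmax(scores, axis=1)  # column index of each row's max: [1, 2]
low_per_row = np.argmin(scores, axis=1)   # column index of each row's min: [0, 1]

new_subscribers = np.array([100, 50, 75])
running_total = np.cumsum(new_subscribers)           # [100, 150, 225]
running_product = np.cumprod(np.array([1, 2, 3, 4])) # [1, 2, 6, 24]
```

For the `axis` choice in `np.argmax`, `axis=0` collapses the rows and so answers "which row holds the maximum of each column", while `axis=1` answers the same question per row.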
Statistical and distribution tools follow. `np.percentile` returns the value below which a given percentage of the data falls (0 and 100 map to the min and max, and 50 gives the median), and the instructor notes how percentile-based measures connect to summaries like five-number summaries and box plots. `np.histogram` is presented as a frequency counter over bins: it returns the counts and the bin edges and does not draw a plot. Finally, `np.corrcoef` computes Pearson correlation coefficients and returns them as a correlation matrix, with a caution that the correlation is computed between the variables supplied in the input arrays (by default each row of a 2D input is treated as one variable).
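A minimal sketch of these three functions on small, made-up arrays. The explicit bin edges are an illustrative choice, not a default.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])
lowest = np.percentile(data, 0)      # same as the minimum: 1.0
median = np.percentile(data, 50)     # same as the median: 3.0
highest = np.percentile(data, 100)   # same as the maximum: 5.0
five_number = np.percentile(data, [0, 25, 50, 75, 100])  # [1., 2., 3., 4., 5.]

values = np.array([1, 2, 2, 3, 5, 7, 8, 9])
# Two bins: [0, 5) and [5, 10]; the last bin includes its right edge
counts, edges = np.histogram(values, bins=[0, 5, 10])    # counts [4, 4]

x = np.array([1, 2, 3, 4])
y = 2 * x + 1                        # perfectly linear in x
corr_matrix = np.corrcoef(x, y)      # 2x2 matrix; off-diagonal is the x-y correlation
r_xy = corr_matrix[0, 1]             # approximately 1.0
```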
The session also includes array manipulation utilities: `np.isin` for checking whether multiple values exist in an array, `np.flip` for reversing along a given axis (or along every axis when `axis` is omitted), `np.put` for overwriting elements at specified indices (noting that it modifies the array in place), and `np.delete` for removing elements by index, including several at once (it returns a new array and leaves the original unchanged). It closes with set operations (`np.union1d`, `np.intersect1d`, `np.setdiff1d`, `np.in1d`) and `np.clip` for constraining values to a range, which is useful for handling outliers or enforcing valid bounds.
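The manipulation and set utilities above can be sketched as follows. All arrays are illustrative. `np.in1d` is left out of the sketch because it is deprecated in NumPy 2.0 in favor of `np.isin`.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

membership = np.isin(arr, [20, 50, 99])  # [False, True, False, False, True]

reversed_arr = np.flip(arr)              # [50, 40, 30, 20, 10]
grid = np.array([[1, 2], [3, 4]])
flipped_rows = np.flip(grid, axis=0)     # row order reversed: [[3, 4], [1, 2]]
flipped_all = np.flip(grid)              # every axis reversed: [[4, 3], [2, 1]]

target = arr.copy()
np.put(target, [0, 1], [-1, -2])         # in place, returns None: [-1, -2, 30, 40, 50]

trimmed = np.delete(arr, [0, 2])         # new array [20, 40, 50]; arr is unchanged

a = np.array([1, 2, 3, 4])
b = np.array([3, 4, 5])
union = np.union1d(a, b)                 # [1, 2, 3, 4, 5]
common = np.intersect1d(a, b)            # [3, 4]
only_in_a = np.setdiff1d(a, b)           # [1, 2]

clipped = np.clip(np.array([-5, 3, 120]), 0, 100)  # [0, 3, 100]
```

The copy before `np.put` is a deliberate choice here: because `np.put` mutates its argument, working on a copy keeps `arr` available for the later `np.delete` call.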
Overall, the session’s message is straightforward: build a practical memory of these NumPy “power” functions and practice mapping them to problem statements, because the same utilities recur across data science, machine learning preprocessing, and interview-style questions.
Cornell Notes
This session is a rapid, practical roundup of NumPy functions that many learners overlook—aimed at making data tasks and interview problems faster. It starts with sorting and array construction (`np.sort`, `np.append`, `np.concatenate`) and then moves into “shape/uniqueness” utilities (`np.unique`, `np.expand_dims`) that matter for ML-style batching. Conditional and query functions (`np.where`, `np.argmax`, `np.argmin`, `np.isin`) are used to filter, locate peaks, and check membership. The session also covers analytics tools like `np.cumsum`, `np.cumprod`, `np.percentile`, `np.histogram`, and `np.corrcoef`, plus manipulation utilities (`np.flip`, `np.put`, `np.delete`, `np.clip`) and set operations (`np.union1d`, `np.intersect1d`, `np.setdiff1d`).
How does `np.sort` differ from sorting a Python list, and how do `axis` and `kind` change the result?
When should `np.append` be used, and why does `axis` matter for 2D arrays?
What problem does `np.expand_dims` solve, and how is it connected to machine learning inputs?
How do `np.where`, `np.argmax`, and `np.argmin` work together for conditional analysis?
What do `np.percentile`, `np.histogram`, and `np.corrcoef` measure, and what are their key outputs?
How do `np.flip`, `np.put`, `np.delete`, and `np.clip` differ as array-manipulation tools?
Review Questions
- Which NumPy functions would you use to (a) find indices where values exceed a threshold, and (b) replace those values while keeping other values unchanged?
- For a 2D array, how would you decide whether to use `axis=0` or `axis=1` with `np.argmax` to find row-wise vs column-wise maxima?
- Give one example of when `np.expand_dims` is necessary before feeding data into a machine learning model.
Key Points
1. Use `np.sort` to keep results as NumPy arrays and control sorting direction with `axis` (row-wise vs column-wise).
2. Prefer `np.concatenate` over manual stacking when joining multiple arrays; choose `axis` to control whether you join rows or columns.
3. Apply `np.expand_dims` to add missing batch/feature dimensions required by many ML pipelines.
4. Use `np.where` for both conditional indexing and conditional replacement via boolean masks.
5. For “peak/lowest location” queries, `np.argmax` and `np.argmin` return index positions along a chosen `axis`.
6. Track running totals with `np.cumsum` and running products with `np.cumprod` for cumulative analytics.
7. For constraints and cleanup, combine `np.clip` (range enforcement) with `np.delete`/`np.put` (removal vs overwrite).