
Before you touch a model, you need fluency in arrays and dataframes. This episode covers vectorization, broadcasting, views versus copies, axis semantics, and the pandas selection and groupby muscle memory every later ML episode silently assumes.
Phase 1 foundations: the two libraries every later machine learning episode assumes you already know. This is array and dataframe fluency, not modeling yet.
NumPy. Why Python lists are slow for numbers (boxed objects, pointer-chasing) and why the ndarray is fast (contiguous typed memory, vectorized C loops, BLAS/LAPACK). Anatomy of an array: dtype, shape, strides, and why slicing returns a view. Creating arrays (arange, linspace, zeros, random.default_rng), ufuncs and vectorization, broadcasting rules and the (3,) vs (3,1) pitfall, views vs copies (slices are views, fancy/boolean indexing copies), the axis argument, and linear algebra (@ vs *, np.linalg.solve for the normal equations). Pitfalls: silent integer overflow, NaN propagation, float equality (np.isclose), and ragged arrays.
pandas. What it adds over NumPy: labeled axes, heterogeneous columns, missing data, time series, and relational ops. Reading data with read_csv (parse_dates, dtype, na_values), .loc vs .iloc, boolean filtering, the historical SettingWithCopyWarning and why Copy-on-Write replaced it, missing data handling, split-apply-combine with groupby, merge/join/concat and many-to-many explosions, vectorized .str and .dt, reshaping and tidy data, categoricals, method chaining, and the PyArrow backend. Crossing back to ML with df.to_numpy().
Version notes: NumPy 2.0 (June 2024, NEP-50 promotion, StringDType), current line 2.4.x, and pandas 3.0 (January 2026: Copy-on-Write is the only mode, SettingWithCopyWarning removed, string dtype by default, non-nanosecond datetimes). See What's new in pandas 3.0 and InfoQ's coverage.
News roundup. SpaceX agreed to acquire Cursor parent Anysphere for $60B all-stock (CNBC, Fortune). Tencent reportedly backed ex-Qwen lead Junyang Lin's new world-models lab (Techmeme). AI leaders met G7 at Evian on voluntary commitments (CNBC). OpenAI published Deployment Simulation. And Gemini API preview image/video models are slated to shut down late June 2026.
Let's start with the news, and the headline this week is more consolidation in the AI coding tool market. On June sixteenth, SpaceX, which merged with xAI earlier this year, agreed to buy Anysphere, the company behind the Cursor coding tool, in an all-stock deal valuing Cursor at sixty billion dollars. Cursor shareholders convert into SpaceX Class A shares, priced off a seven-trading-day volume-weighted average. This landed about four days after SpaceX's record Nasdaq debut under the ticker S-P-C-X, which raised seventy-five billion dollars at a hundred thirty-five dollars a share. That earlier initial public offering we already covered. This sixty-billion-dollar Cursor acquisition is the new, distinct event, and it's expected to close in the third quarter of 2026 pending regulatory approval.
The numbers around Cursor tell a mixed story. Its annualized revenue had climbed to roughly four billion dollars by early June, up from about one billion last November. But its market share reportedly slid from around forty-one percent a year ago to about twenty-six percent in May, and it had burned through roughly three point two billion dollars cumulatively. SpaceX says the two companies have spent recent months building a shared AI model that will ship inside both Cursor and xAI's Grok. SpaceX shares reportedly jumped about seventeen percent on the news. Why this matters to you as a learner: the coding-tool market keeps consolidating around the big labs. GitHub Copilot sits with Microsoft, Claude Code with Anthropic, Codex and Windsurf with OpenAI, and now Cursor with SpaceX. Fewer independent vendors, and coding agents increasingly tied to a lab's own frontier model.
In China, a new lab founded by Junyang Lin, the former technical lead of Alibaba's Qwen models, has reportedly raised several hundred million dollars at a roughly two-billion-dollar post-money valuation. The lab focuses on world models and embodied intelligence. Tencent reportedly put in about twenty million dollars, with HongShan and Gaorong Ventures reportedly in talks. That's an unusually high valuation for a Chinese AI startup with no product yet, and the round isn't formally closed, so treat it as reported. It signals capital still flowing into world models and embodied AI specifically.
On policy, AI leaders including Sam Altman, Dario Amodei, Demis Hassabis, Arthur Mensch, and Aidan Gomez met G7 leaders at Evian on June seventeenth. The agenda covered frontier risk, youth online safety, and infrastructure. The expected outcome is a package of voluntary, non-binding commitments, and those tend to become the de facto global baseline before any binding rules arrive.
Two quick practitioner notes. OpenAI published a method called Deployment Simulation, which replays past de-identified user conversations through a candidate model to predict new bad behaviors before shipping. And a heads-up: several Gemini preview image and video models are reportedly slated to shut down in late June, so check your pinned model IDs and migrate. Now, on to arrays and dataframes.
Last episode we set up your toolkit: Python, notebooks, git, and a virtual environment manager. Today we pick up the two libraries that everything else in this course quietly assumes you already know: NumPy and pandas. This is not modeling yet. We are not training anything. What we're building today is fluency with arrays and dataframes, the working tools that every later episode, train-test split, linear regression, gradient descent, leans on without explaining. Open a notebook and run these snippets as we go, ideally with the percent-timeit magic so you can feel the speed differences yourself.
Let me set some version context up front, because the libraries have changed in ways that matter when you read older tutorials. NumPy two point oh shipped in June of 2024. That was the first major release since version one back in 2006, eleven months of work by two hundred twelve contributors. It came with breaking changes. There was a binary compatibility break, meaning C extensions had to be rebuilt. There were revised type-promotion rules, which I'll come back to. And there was a Python-side cleanup where aliases like np dot float-underscore, np dot complex-underscore, and the capitalized np dot NaN were removed, some functions were renamed, and about a hundred little-used members were reorganized. The bare np dot float and np dot int aliases you see in very old tutorials had already been removed back in one point twenty-four. New in two point oh were a variable-length UTF-8 string dtype, a strings namespace, and compliance with the array-API standard. NumPy two point x is current today. Later point releases added more, and the latest patch line in mid-2026 is the two point four series.
Pandas has its own timeline. Pandas two point oh shipped in April 2023, and its headline was an optional Arrow backend plus opt-in Copy-on-Write. Then pandas three point oh shipped in January of 2026, and it flipped three big defaults. First, Copy-on-Write is now the default and the only mode, chained assignment no longer works, and the old SettingWithCopyWarning has been removed entirely. Second, strings now default to a dedicated string dtype instead of the generic object dtype, Arrow-backed when PyArrow is installed. Third, the default datetime resolution is no longer nanoseconds, it's microseconds or whatever resolution the input had, which avoids the old out-of-bounds limits that capped dates before sixteen seventy-eight and after twenty-two sixty-two. I'll narrate the SettingWithCopyWarning story later even though it's now historical, because you will see it constantly in pre-2026 tutorials and older codebases, and understanding why it existed is exactly the mental model that makes Copy-on-Write make sense.
Let's start with why NumPy exists at all. A plain Python list is slow for numeric work, and the reason is structural. A list is really an array of pointers to boxed Python objects. Each element is a full heap object. A Python integer, for instance, is around twenty-eight bytes, carrying a reference count, a type pointer, and the actual value. So iterating over a list means pointer-chasing across scattered memory, and every time you add two elements, the operation dispatches through Python's dynamic type machinery, per element, one at a time. There's no cache locality and no use of the processor's vector instructions.
NumPy's core type, the ndarray, fixes this by being one contiguous block of typed memory. It's a single run of raw bytes, all the same dtype, plus a small header. A float64, the standard double-precision float, takes eight bytes. So a million float64 values is eight megabytes packed end to end. Compare that to a Python list of a million floats, where you pay for the float payloads, the object boxing, and the pointer array on top, often four or five times the memory. The array is dense and the elements sit next to each other, which the CPU loves.
That density is what makes vectorization possible. Vectorization means expressing an operation on the whole array at once, like adding two arrays together or taking the square root of every element, so that the actual loop runs in precompiled C rather than in the Python interpreter. For linear algebra specifically, NumPy hands off to highly tuned libraries called BLAS and LAPACK, things like OpenBLAS or Intel's Math Kernel Library. So you make one Python call, and then a tight C loop runs with no per-element interpreter overhead, often using the processor's single-instruction-multiple-data vector units. Concretely, for summing or multiplying a million numbers, the NumPy vectorized operation is commonly ten to a hundred times faster than the equivalent Python loop, and matrix multiplication routed through BLAS can exceed a hundred times. Try it: make an array with arange of one million, then time a Python generator expression that squares each element and sums the results, which typically takes tens to hundreds of milliseconds, against the vectorized version that squares the whole array and sums it, which typically comes in around a millisecond. The exact numbers depend on your machine, but the gap is hard to miss.
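Here is a minimal sketch of that timing experiment for anyone following along without the percent-timeit magic. The function names are made up for illustration, and the timings it prints will vary by machine, so no numbers are promised here.

```python
import timeit

import numpy as np

a = np.arange(1_000_000, dtype=np.int64)


def python_loop_sum_of_squares(arr):
    # Element-by-element work in the interpreter: one boxed value at a time.
    return sum(int(x) * int(x) for x in arr)


def vectorized_sum_of_squares(arr):
    # One whole-array expression: both the multiply and the sum run in compiled loops.
    return int((arr * arr).sum())


loop_seconds = timeit.timeit(lambda: python_loop_sum_of_squares(a), number=1)
vec_seconds = timeit.timeit(lambda: vectorized_sum_of_squares(a), number=10) / 10

print(f"python loop: {loop_seconds * 1e3:.1f} ms")
print(f"vectorized:  {vec_seconds * 1e3:.3f} ms")
```

Both functions return the same integer; only the place where the loop runs changes.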
There's a trade you're making. The ndarray is fixed dtype and fixed size. You give up the flexibility of a Python list, which can hold mixed types and grow on demand, in exchange for speed and memory density. Appending to an array actually reallocates the whole buffer, so the idiom is to size your array up front, or build a Python list first and convert it once at the end.
Now let's open up the anatomy of the ndarray, because a few attributes explain almost everything. An array has a dtype, like int64, float64, float32, bool, or complex128. It has a shape, which is a tuple, and a number of dimensions called ndim, and a total element count called size. It has itemsize, the bytes per element, and nbytes, the total. And it has strides. Strides are the secret to how NumPy works. A two-dimensional array is, underneath, just one flat buffer. The strides tell NumPy how many bytes to jump to move one row down and how many to move one column over. This is exactly why slicing can return a view: a view is the same underlying buffer with a different shape, different strides, and a different starting offset, with no data copied at all.
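To make the anatomy concrete, here is a small sketch that prints those attributes for a three-by-four array of int64 and shows that a strided slice is the same buffer seen through different strides. The array contents and variable names are illustrative only.

```python
import numpy as np

m = np.arange(12, dtype=np.int64).reshape(3, 4)

print(m.dtype, m.shape, m.ndim, m.size)  # int64 (3, 4) 2 12
print(m.itemsize, m.nbytes)              # 8 96
print(m.strides)                         # (32, 8): 32 bytes per row step, 8 per column step

# Taking every other column copies nothing: same buffer, doubled column stride.
every_other_column = m[:, ::2]
print(every_other_column.shape)          # (3, 2)
print(every_other_column.strides)        # (32, 16)
print(np.shares_memory(m, every_other_column))  # True
```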
Related to strides is memory order. The default is C order, also called row-major, where consecutive elements of a row sit next to each other in memory. The alternative is Fortran order, column-major, where consecutive elements of a column are adjacent. This matters for cache performance, you want to iterate along the contiguous axis, and it matters for interoperability with Fortran and BLAS code. You can request Fortran order at creation, convert with asfortranarray, and check the contiguity flags on an array. And the reason the dtype is fixed and homogeneous is precisely so NumPy can pack the data tightly and run typed C loops with no per-element type checking. That same fixedness is also a footgun, because it opens the door to overflow and to confusing integers with floats, which we'll hit in the pitfalls section.
Let's create some arrays. The most direct is to pass a Python list, and NumPy infers the dtype. There's arange, which works like Python's range but returns an array, so arange from zero to ten with step two gives you zero, two, four, six, eight. Be careful using arange with floating-point steps, because floating-point rounding makes the endpoint, and even the number of elements, unreliable; for evenly spaced floats use linspace instead. Linspace from zero to one with five points gives you zero, a quarter, a half, three-quarters, and one, and notice it includes both endpoints and you specify the number of points, not the step. Then there are the filled constructors: zeros, which takes a shape tuple, ones, full for an arbitrary fill value, and empty, which leaves the memory uninitialized and is therefore fast but you must not read it before writing. There's eye for an identity matrix, and the "like" variants, zeros-like, ones-like, and so on, written with an underscore, that match the shape and dtype of an existing array.
For random data, the modern interface is to make a generator with default-rng and an optional seed, then call methods on it: random for uniform values in a shape, normal for a Gaussian with a given mean and standard deviation, and integers for random whole numbers in a range. Reshaping is its own small toolkit. Reshape gives you a new shape, returning a view when it can. Passing negative one for one dimension tells NumPy to infer that size, so reshape to negative one comma one turns a flat array into a column vector. To flatten back down, ravel returns a view when possible, while flatten always returns a copy.
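The constructors, the generator, and the reshaping tools from the last two paragraphs fit in one short sketch. The seed of 42 and the shapes are arbitrary choices for illustration.

```python
import numpy as np

evens = np.arange(0, 10, 2)        # array([0, 2, 4, 6, 8])
grid = np.linspace(0.0, 1.0, 5)    # five points, both endpoints included
zeros = np.zeros((2, 3))           # shape is passed as a tuple
same_layout = np.zeros_like(zeros) # matches shape and dtype of an existing array
identity = np.eye(3)

rng = np.random.default_rng(42)    # seeded generator
uniform = rng.random((2, 3))       # uniform values in [0, 1)
gaussian = rng.normal(loc=0.0, scale=1.0, size=4)
dice = rng.integers(1, 7, size=10) # upper bound is exclusive, so values are 1..6

flat = np.arange(6)
column = flat.reshape(-1, 1)       # -1 means "infer this size": shape (6, 1)
back = column.ravel()              # a view here, because the data is contiguous
independent = column.flatten()     # always a copy
```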
Now the heart of the matter: vectorization and ufuncs. A ufunc, short for universal function, is an elementwise operation implemented in C, things like add, square-root, exp, maximum, and the comparison operators. When you write a plus sign between two arrays, that dispatches to the add ufunc under the hood. The mantra to internalize is that a Python for loop over array elements is almost always a sign you're doing it wrong, and you should replace it with a whole-array expression. For example, to normalize an array by subtracting its mean and dividing by its standard deviation, you write that as one expression, subtract a dot mean from a, divide by a dot std, and the entire thing runs in C. Ufuncs also support extras: an out parameter to write into an existing array, a where parameter to operate only on selected positions, and reduction methods like reduce and accumulate.
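As a sketch of those ideas, here is the normalization written as one expression, followed by the out and where parameters and the reduce and accumulate methods. The sample values are chosen so the mean is five and the standard deviation is two; everything else is illustrative.

```python
import numpy as np

a = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Whole-array normalization: no Python loop anywhere.
z = (a - a.mean()) / a.std()
print(z)  # [-1.5 -0.5 -0.5 -0.5  0.   0.   1.   2. ]

# The + operator and the add ufunc are the same operation.
same = np.array_equal(a + a, np.add(a, a))

# out= writes into an existing array instead of allocating a new one.
roots = np.empty_like(a)
np.sqrt(a, out=roots)

# where= only computes at selected positions; other positions keep
# whatever the output array already held (zeros here).
log_of_large = np.zeros_like(a)
np.log(a, out=log_of_large, where=a > 4)

total = np.add.reduce(a)        # same result as a.sum()
running = np.add.accumulate(a)  # same result as a.cumsum()
```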
Broadcasting is the next idea, and it's where a lot of clean NumPy code comes from. The rule is that NumPy aligns the shapes of two arrays from the right, and two dimensions are compatible when they're equal, or when one of them is one and gets stretched, or when one is missing and is treated as one. If none of those hold, you get a value error saying the operands could not be broadcast together. The simplest case is a scalar plus an array, the scalar is stretched across everything. A shape of three-by-four plus a shape of just four adds that row to every row of the matrix. A column plus a row produces an outer grid: take a column of three values and add a row of four values, and you get a three-by-four addition table. And mean-centering a matrix of a hundred rows by five columns is just the matrix minus its column means, where the means have shape five and broadcast down every row.
Broadcasting matters for two reasons. It avoids materializing huge intermediate arrays, you don't need to tile a vector out to a full matrix, and it keeps the code clean. Nearly all machine learning feature math, centering, scaling, adding a bias vector to a batch, is written this way. But there's a classic pitfall worth burning into memory. A one-dimensional array of shape three and a column of shape three-by-one are not the same. If you add an array of shape three to that same array reshaped to three-by-one, you don't get a length-three result, you get a three-by-three matrix, because the shape-three array is treated as one-by-three and broadcasts against the three-by-one column into a full grid. The fix when you reduce along an axis is to pass keepdims equals true, so that a mean taken along axis one keeps its shape as a hundred-by-one rather than collapsing to a hundred, which means it broadcasts cleanly back against the original hundred-by-five matrix.
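The broadcasting cases and the shape-three versus three-by-one pitfall can be checked directly. This is a sketch with made-up data; the hundred-by-five matrix is random and only its shape matters.

```python
import numpy as np

col = np.array([[0], [10], [20]])  # shape (3, 1)
row = np.array([1, 2, 3, 4])       # shape (4,)
table = col + row                  # outer grid, shape (3, 4)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
centered = X - X.mean(axis=0)      # (100, 5) minus (5,): column means broadcast down rows

# The pitfall: (3,) plus (3, 1) is a (3, 3) grid, not a length-3 vector.
v = np.array([1, 2, 3])
grid = v + v.reshape(3, 1)
print(grid.shape)                  # (3, 3)

# Row-wise reduction without keepdims has shape (100,), which does not
# line up with (100, 5) from the right.
try:
    X - X.mean(axis=1)
except ValueError as err:
    print("broadcast error:", err)

# keepdims=True keeps the reduced axis as length 1, so it broadcasts back.
row_means = X.mean(axis=1, keepdims=True)  # shape (100, 1)
row_centered = X - row_means
```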
Now, indexing and slicing, and specifically the difference between views and copies, which is the single most important gotcha in NumPy. Basic slicing returns a view. If you take a slice of an array, the slice shares memory with the original, so assigning into the first element of the slice actually mutates the parent array. This is by design, because you do not want to copy a million rows every time you look at a subset, but it's also the number one aliasing bug people hit. When you need an independent array, call dot copy explicitly.
The contrast is that fancy indexing and boolean indexing return copies. Fancy indexing means indexing with a list or array of integer positions, like pulling out elements zero, two, and four. Boolean indexing means indexing with a mask, like selecting all elements greater than five. Both of those give you a fresh copy, not a view. So the rule of thumb is simple: slices are views, fancy and boolean indexing make copies. Boolean masking is also how you do conditional updates, you can assign zero into every element where the array is greater than five in one line. When you combine conditions, use the ampersand, the pipe, and the tilde for and, or, and not, and you must wrap each condition in parentheses, because the Python keywords and and or will raise an error about the truth value of an array being ambiguous. For two dimensions you index with a row and column, you can take a whole column or a range of rows, you can insert a new axis with the newaxis token, and the where function gives you a vectorized if-then-else: where the condition holds take from x, otherwise take from y.
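A short sketch makes the view-versus-copy rule easy to verify for yourself, along with masked updates, combined conditions, newaxis, and where. All array names are illustrative.

```python
import numpy as np

a = np.arange(10)
s = a[2:5]       # basic slice: a view
s[0] = 99        # writes through to the parent
print(a[2])      # 99

b = np.arange(10)
picked = b[[0, 2, 4]]  # fancy indexing: a copy
picked[0] = -1
print(b[0])            # 0, the parent is untouched

big = b[b > 5]         # boolean indexing: also a copy
print(np.shares_memory(b, big))  # False

c = np.arange(10)
c[c > 5] = 0           # masked assignment updates c in place

d = np.arange(10)
middle = d[(d > 2) & (d < 7)]     # parentheses are required around each condition
as_column = d[:, np.newaxis]      # shape (10, 1)
capped = np.where(d > 5, 5, d)    # vectorized if-then-else
```

When an independent array is wanted from a slice, `a[2:5].copy()` is the explicit way to get one.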
Aggregations bring us to the axis argument, which confuses everyone at first. Functions like sum, mean, std, min, max, argmin, argmax, and cumsum operate over the whole array by default. When you pass an axis, that axis names the dimension that gets collapsed. So sum with axis zero collapses the rows and gives you one value per column. Sum with axis one collapses the columns and gives you one value per row. The mnemonic is that axis zero goes down the rows, producing column results, and the axis you pass is the one that disappears from the shape. And again, keepdims equals true is your friend when you want to divide an array by its own row sums, because it keeps the result shaped so it broadcasts back.
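On a two-by-three array the axis rule is small enough to check by eye. A minimal sketch, with illustrative names:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])          # shape (2, 3)

print(m.sum())         # 21, whole array
print(m.sum(axis=0))   # [5 7 9], axis 0 disappears: one value per column
print(m.sum(axis=1))   # [ 6 15], axis 1 disappears: one value per row

# Divide each row by its own sum; keepdims keeps shape (2, 1) so it broadcasts.
row_share = m / m.sum(axis=1, keepdims=True)
print(row_share.sum(axis=1))
```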
Linear algebra is where this connects directly to the math the course needs soon. The dot product of two vectors, written with the dot function or the at operator, gives you a scalar, and that's exactly the operation linear regression leans on. For matrix multiplication you use the at operator or the matmul function, and the at operator is preferred over the older dot for matrices because it broadcasts over batches and reads cleanly. Here is a distinction that trips people constantly: the asterisk between two arrays is elementwise multiplication, the Hadamard product, which needs broadcastable shapes, while the at operator is true matrix multiplication, where the inner dimensions must match, an m-by-k times a k-by-n gives an m-by-n. Confusing those two is one of the most common NumPy mistakes. The linalg submodule gives you inverse, solve, determinant, eigenvalues, singular value decomposition, and norm. Prefer solve over computing an inverse, it's more numerically stable and faster. As a forward reference, the normal-equation solution to linear regression, X-transpose-X inverse times X-transpose-y, is written as solve of X-transpose-times-X and X-transpose-times-y. Transpose is the dot-T attribute, and eye builds the identity matrix.
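Here is a sketch of the asterisk versus at-operator distinction, followed by the normal equations solved with solve rather than an explicit inverse. The regression data is synthetic, and the true coefficients of two and minus three are assumptions chosen so the recovered values are easy to recognize.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[10.0, 20.0], [30.0, 40.0]])

print(A * B)  # elementwise: [[ 10.  40.] [ 90. 160.]]
print(A @ B)  # matrix product: [[ 70. 100.] [150. 220.]]

# Normal equations for least squares: solve (X^T X) beta = X^T y.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])  # design matrix with an intercept column
true_beta = np.array([2.0, -3.0])     # illustrative ground truth
y = X @ true_beta + rng.normal(scale=0.01, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to [2, -3]
```

For badly conditioned problems `np.linalg.lstsq` is the more robust call, but the solve form mirrors the textbook formula most directly.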
A word on randomness and reproducibility, because it matters more than it seems. The modern approach, since NumPy one point seventeen, is to make a generator with default-rng and a seed, then draw from it with random, normal, integers, choice, and shuffle. Seeding makes your runs reproducible, which is essential so that your train-test split and your weight initialization come out the same way every time. The legacy approach uses a global seed function plus the older rand, randn, and randint functions; you'll see it everywhere in tutorials and it works, but the new generator is preferred because it has better statistical quality, gives independent streams, and avoids hidden global state. Scikit-learn, by the way, takes a random-state argument for the same reason. The lesson is to seed everything stochastic so your experiments are comparable.
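A quick sketch of what seeding buys you: two generators built from the same seed produce identical draws. The seed value of 123 is arbitrary.

```python
import numpy as np

rng_a = np.random.default_rng(123)
rng_b = np.random.default_rng(123)

draw_a = rng_a.normal(size=5)
draw_b = rng_b.normal(size=5)
print(np.array_equal(draw_a, draw_b))  # True: same seed, same stream

indices = np.arange(10)
rng_a.shuffle(indices)                 # shuffles in place

sample = rng_b.choice(10, size=3, replace=False)  # three distinct values from 0..9
```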
Let me gather the NumPy pitfalls in one place, because they cause real bugs. First, the views-versus-copies aliasing we covered, where calling copy defensively trades a little memory for a lot of saved debugging. Second, integer versus float dtype. An array made with arange from integer arguments has an integer dtype, int64 on most platforms, and doing an in-place divide-by-two on it raises an error because the float result can't be cast back into the integer buffer, while assigning a value like two point seven into an integer array silently stores just two. Third, integer overflow in array arithmetic is silent: an int32 accumulation will wrap around past its maximum with no warning at all. The revised type-promotion rule in NumPy two point oh, the NEP-50 change, is related: a float32 scalar plus the Python float three point oh now stays float32, where older versions inspected the values and could hand back float64, which is something to watch when porting old code.
Fourth, not-a-number, written NaN. Any arithmetic that touches a NaN produces a NaN, and famously NaN is not equal to itself, so a comparison of NaN to NaN returns false. That means a sum over an array with any NaN in it comes out as NaN, so you reach for the NaN-aware versions like nansum and nanmean, and you detect NaNs with the isnan function. Fifth, never test floating-point values for equality with the double-equals operator; zero point one plus zero point two does not equal zero point three in floating point. Use isclose or allclose, which take relative and absolute tolerances. And sixth, ragged lists: trying to make an array from lists of different lengths raises an error in NumPy two point x, because the shapes have to be rectangular.
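Most of these pitfalls reproduce in a few lines. This is a sketch assuming a recent NumPy; the ragged-list case in particular raises only on newer versions, so it is wrapped in a try block rather than asserted.

```python
import numpy as np

# Integer dtype: truncation on assignment, error on in-place true division.
ints = np.arange(5)
ints[0] = 2.7
print(ints[0])  # 2
try:
    ints /= 2
except TypeError as err:
    print("in-place divide failed:", type(err).__name__)

# Silent wraparound in int32 array arithmetic.
big = np.array([2_147_483_647], dtype=np.int32)
wrapped = big + np.array([1], dtype=np.int32)
print(wrapped[0])  # -2147483648

# NaN propagation and NaN-aware reductions.
data = np.array([1.0, np.nan, 3.0])
print(np.nan == np.nan)    # False
print(data.sum())          # nan
print(np.nansum(data))     # 4.0
print(np.isnan(data))      # [False  True False]

# Float equality.
print(0.1 + 0.2 == 0.3)            # False
print(np.isclose(0.1 + 0.2, 0.3))  # True

# Ragged input is rejected on recent NumPy versions.
try:
    np.array([[1, 2], [3]])
except ValueError as err:
    print("ragged input rejected:", err)
```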
Now let's move up to pandas, which is built on top of NumPy and adds five things. First, labeled axes: every row and column has an index, so you select by meaning instead of by position. Second, heterogeneous columns, so one table can hold an integer age, a string name, a datetime signup, and a float balance side by side. Third, first-class missing data, NaN and NA. Fourth, real time-series tooling. And fifth, relational operations, groupby, join, and pivot, which feel like SQL and a spreadsheet combined. The two core types are the Series, which is one labeled one-dimensional column, values plus an index, and the DataFrame, which is a dictionary of Series that share one row index, essentially a spreadsheet or a SQL table held in memory. Classically each column is a NumPy array, and pandas is the labeling and relational layer sitting on top.
Reading data is usually where you start, and read-csv is the workhorse. Its friends are read-parquet, which is columnar, typed, and fast, and is preferred for real pipelines, plus read-excel, read-json, and read-sql. The read-csv parameters worth knowing: dtype to force column types, parse-dates to turn date columns into real datetimes, because otherwise they come in as object strings and the datetime accessor won't work; usecols to read only some columns for a memory win; na-values to declare which strings count as missing, like NA or a dash or a question mark; nrows to read a sample; and index-col to set the row index. Then always inspect what you loaded. Call head to peek at the top, info to see dtypes plus non-null counts plus memory, describe for summary statistics, and check dtypes, shape, and columns. This is where you catch a numeric column that loaded as object because of a stray dollar sign or comma. In pandas three point oh, text loads as the dedicated string dtype, Arrow-backed when PyArrow is present.
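To try those read-csv parameters without a file on disk, here is a sketch that reads from an in-memory string. The column names and values are invented for illustration.

```python
import io

import pandas as pd

csv_text = """customer_id,signup,city,balance
1,2024-01-15,Lyon,120.50
2,2024-02-03,Paris,?
3,2024-02-20,Lyon,87.00
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"customer_id": "int64"},                # force a column type
    parse_dates=["signup"],                        # real datetimes, so .dt works
    na_values=["?"],                               # treat "?" as missing
    usecols=["customer_id", "signup", "balance"],  # skip columns you don't need
)

df.info()                 # dtypes, non-null counts, memory
print(df.head())
print(df.isna().sum())    # missing values per column
```

Because the question mark is declared as missing, balance loads as a float column instead of falling back to strings.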
Selection in pandas centers on two accessors, dot-loc and dot-iloc. The dot-loc accessor is label-based, so you select by the index labels and column names, and importantly its slices are inclusive of the endpoint, so a label slice from a to c includes c. The dot-iloc accessor is integer-position-based, so you select by numeric position, and its slices are exclusive, like normal Python, so positions zero through three give you zero, one, and two. The reason the plain square brackets are ambiguous is that they do three different things depending on what you pass: a column name selects a column, a slice selects rows, and a boolean mask filters rows, all with the same brackets. A single column name gives you a Series; a list of column names gives you a DataFrame. For boolean filtering you write the frame indexed by a condition, and the cleaner form uses dot-loc with the condition and a list of columns together. Combine conditions with the ampersand and pipe and parentheses, just like NumPy, and there's also a query method that takes a string expression as an alternative.
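The inclusive-versus-exclusive slicing difference is easiest to see on a tiny labeled frame. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [31, 45, 27, 52], "city": ["Lyon", "Paris", "Lyon", "Nice"]},
    index=["a", "b", "c", "d"],
)

by_label = df.loc["a":"c"]   # label slice: includes "c", so rows a, b, c
by_position = df.iloc[0:2]   # position slice: excludes 2, so rows a, b

one_column = df["age"]       # a Series
one_column_frame = df[["age"]]  # a DataFrame with one column

# Rows and columns selected together in a single .loc call.
older_in_lyon = df.loc[(df["age"] > 30) & (df["city"] == "Lyon"), ["age"]]

# The same row filter as a string expression.
queried = df.query("age > 30 and city == 'Lyon'")
```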
Now the SettingWithCopyWarning story, which is historical as of pandas three point oh but still everywhere in older material. The bug it warned about was chained indexing. If you filtered a frame and then assigned into a column of that filtered result in two separate bracket steps, the first step might return a copy, and your assignment would silently do nothing to the original. Whether you got a view or a copy was unpredictable, and that unpredictability is exactly why the warning existed, telling you a value was being set on a copy of a slice. The fix was always to do it in a single dot-loc call, selecting the rows and the column together, or to take an explicit copy when you genuinely wanted an independent sub-table. Under pandas three point oh's Copy-on-Write, every subset behaves like a copy, so chained assignment reliably does nothing to the original with no silent partial success, and the warning has been removed. Copy-on-Write defers the actual copying until a write happens, so reads and slices stay cheap. The habit you keep is unchanged: assign through a single dot-loc.
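Here is a sketch of the two habits that work the same way before and after Copy-on-Write. The chained form appears only in a comment, since its behavior depends on the pandas version; the data is invented.

```python
import pandas as pd

df = pd.DataFrame({"city": ["Lyon", "Paris", "Lyon"], "sales": [10, 20, 30]})

# Chained form to avoid (two bracket steps, behavior varies by version):
#   df[df["city"] == "Lyon"]["sales"] = 0

# Single .loc call: rows and column selected together, original is updated.
df.loc[df["city"] == "Lyon", "sales"] = 0
print(df["sales"].tolist())  # [0, 20, 0]

# Explicit copy when you want an independent sub-table.
paris = df[df["city"] == "Paris"].copy()
paris["sales"] = 999
print(df["sales"].tolist())  # [0, 20, 0], unchanged by the edit to paris
```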
Missing data deserves its own treatment. The three flavors are NaN for floats, NaT for datetimes, and the newer pandas NA for nullable types. You detect missingness with isna and notna, and the standard first diagnostic is to call isna and then sum, which counts missing values per column. You handle it with dropna, which has options for how, for a subset of columns, and for dropping along columns instead of rows; or with fillna, filling with a constant or with column means; or with forward-fill and back-fill. Why does this matter for machine learning? Most scikit-learn estimators reject NaNs outright, so you have to impute or drop before fitting. And how you handle it is itself a modeling decision: dropping rows biases your sample, mean-imputation distorts the variance, and sometimes the fact that a value is missing is itself a signal. Note too that a column containing any NaN is forced to float unless you use the nullable integer type.
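The detection and handling calls look like this in practice. A minimal sketch on a four-row frame with one missing value per column; the data is made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0],
    "city": ["Lyon", "Paris", None, "Nice"],
})

missing_per_column = df.isna().sum()   # the standard first diagnostic
print(missing_per_column)

complete_rows = df.dropna()                 # drop any row with a missing value
age_known = df.dropna(subset=["age"])       # only require age to be present

age_mean_filled = df["age"].fillna(df["age"].mean())  # mean imputation
age_forward_filled = df["age"].ffill()                # carry the last value forward

# Nullable integer dtype keeps integers even with a missing entry.
nullable = pd.Series([1, None, 3], dtype="Int64")
print(nullable.dtype)  # Int64
```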
Groupby is the workhorse for aggregation, and the mental model, due to Hadley Wickham, is split-apply-combine. You split the rows into groups by a key, apply a function to each group, and combine the results. So grouping by city and taking the mean of a sales column gives you average sales per city. You can group by several keys at once and aggregate multiple columns with different functions, summing sales while counting unique customers. Know the difference between size, which counts rows per group, and count, which counts non-null values per column. And know transform, which returns a result the same shape as the input, aligned back to the original rows, perfect for group-mean centering, versus filter, which keeps or drops whole groups. This is the in-memory equivalent of SQL's GROUP BY, and it's how you build aggregate features.
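A sketch of split-apply-combine on a five-row frame, covering mean, multi-column aggregation, size versus count, transform, and filter. The column names and numbers are illustrative, with one missing sale so that size and count differ.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Lyon", "Lyon", "Paris", "Paris", "Paris"],
    "customer": ["a", "b", "a", "c", "c"],
    "sales": [10.0, 30.0, 5.0, None, 25.0],
})

mean_sales = df.groupby("city")["sales"].mean()  # Lyon 20.0, Paris 15.0

summary = df.groupby("city").agg(
    total_sales=("sales", "sum"),
    unique_customers=("customer", "nunique"),
)

rows_per_city = df.groupby("city").size()            # counts rows: Paris has 3
non_null_sales = df.groupby("city")["sales"].count() # counts non-null: Paris has 2

# transform returns one value per original row, aligned to the input index.
group_centered = df["sales"] - df.groupby("city")["sales"].transform("mean")

# filter keeps or drops whole groups.
big_cities = df.groupby("city").filter(lambda g: len(g) >= 3)
```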
Combining tables comes through merge, join, and concat. Merge is a SQL join: you merge a left and right frame on a key with a how parameter that can be inner, which is the default, or left, right, or outer. You can join on differently named columns with left-on and right-on, and the join method joins on the index. Concat stacks frames, along axis zero for rows, like a UNION ALL, or along axis one for columns. Watch out for the many-to-many explosion: if both sides have duplicate keys, the merge produces the Cartesian product within each key and your row count blows up. That's the classic why-did-my-row-count-explode bug, and you guard against it with the validate parameter set to one-to-many and the indicator parameter to see where each row came from.
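The many-to-many explosion and the validate guard can be reproduced on toy tables. In this sketch the orders table is the "many" side on the left, so the check used is many-to-one; pick whichever relationship you actually expect. Table and column names are invented.

```python
import pandas as pd
from pandas.errors import MergeError

orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10, 20, 30]})
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ana", "Bo", "Cy"]})

joined = orders.merge(
    customers,
    on="customer_id",
    how="left",
    validate="many_to_one",  # raise if the right side has duplicate keys
    indicator=True,          # adds a _merge column: left_only, right_only, both
)
print(len(joined))           # 3, same as orders

# Duplicate keys on both sides: each key yields a Cartesian product.
dup_customers = pd.DataFrame({"customer_id": [1, 1], "name": ["Ana", "Ana B."]})
exploded = orders.merge(dup_customers, on="customer_id")
print(len(exploded))         # 4: two orders times two customer rows for key 1

merge_error_raised = False
try:
    orders.merge(dup_customers, on="customer_id", validate="many_to_one")
except MergeError as err:
    merge_error_raised = True
    print("validate caught it:", err)

stacked = pd.concat([orders, orders], axis=0, ignore_index=True)  # like UNION ALL
```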
Pandas gives you vectorized string and datetime accessors, so you express the operation on the whole column at once instead of writing a loop. The string accessor offers lower, contains, strip, replace, split with expand to spread results into columns, and len. The datetime accessor offers year, month, day-of-week, hour, and to-period, but it needs a real datetime column, which is why parse-dates matters at load time. The datetime accessor runs in compiled code, and so does the string accessor on Arrow-backed strings; on old object-dtype columns the string accessor still walks Python strings internally, though it remains the cleaner choice. And here's a warning about apply: calling apply with a plain Python function runs a Python-level loop, the slow for-loop equivalent that forfeits vectorization. Reach for the string accessor, the datetime accessor, or plain column arithmetic whenever a vectorized path exists; apply is the last resort, not the reflex.
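A sketch of the two accessors on a three-row frame. The names and dates are invented; the weekday numbers follow the pandas convention where Monday is zero.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["  Ana Silva", "bo CHEN ", "Cy Okoro"],
    "signup": pd.to_datetime(["2024-01-15", "2024-02-03", "2024-02-20"]),
})

clean = df["name"].str.strip().str.lower()   # ['ana silva', 'bo chen', 'cy okoro']
has_chen = clean.str.contains("chen")
parts = clean.str.split(" ", expand=True)    # first and last name in separate columns
lengths = clean.str.len()

month = df["signup"].dt.month                # [1, 2, 2]
weekday = df["signup"].dt.dayofweek          # [0, 5, 1]: Monday, Saturday, Tuesday
period = df["signup"].dt.to_period("M")      # monthly periods
```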
Reshaping connects to the idea of tidy data. Pivot and pivot-table go from long to wide, for instance indexing by date with products as columns and summing sales into the cells, and pivot-table aggregates duplicate entries while bare pivot errors on them. Melt goes the other way, wide to long. Stack and unstack move levels between the row index and the column index. Tidy data, again from Hadley Wickham, means each variable is a column, each observation is a row, and each type of observational unit is its own table. Tidy, long-format data is what most machine learning feature code expects, so wide spreadsheets usually need melting first.
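Here is a sketch of the long-to-wide and wide-to-long round trip, including the case where bare pivot refuses duplicate entries. The sales data is made up, with two apple rows on the same date to trigger the duplicate.

```python
import pandas as pd

long = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-01", "2024-01-02"],
    "product": ["apple", "apple", "pear", "apple"],
    "sales": [3, 4, 5, 6],
})

# pivot_table aggregates the duplicate (date, product) pair.
wide = long.pivot_table(index="date", columns="product", values="sales", aggfunc="sum")
print(wide)

# Bare pivot errors on the same duplicate.
pivot_failed = False
try:
    long.pivot(index="date", columns="product", values="sales")
except ValueError as err:
    pivot_failed = True
    print("pivot refused duplicates:", err)

# melt goes back from wide to long; dropna removes the empty pear cell.
back_to_long = (
    wide.reset_index()
        .melt(id_vars="date", var_name="product", value_name="sales")
        .dropna()
)
```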
A few performance and memory notes round this out. For low-cardinality repeated strings, like a status column with a handful of distinct values, convert to the category dtype, which stores integer codes plus a lookup and saves memory while speeding up groupby; that's distinct from the high-cardinality Arrow-backed string default. Method chaining with assign makes readable pipelines, where you query, then assign a new computed column with a lambda, then group and aggregate, all in one flowing expression, and it pairs naturally with Copy-on-Write. On performance: never use iterrows, which is a Python loop that boxes each row into a Series; use itertuples if you truly must iterate, but really you should vectorize. Prefer column arithmetic, the string and datetime accessors, groupby, and merge over apply. The PyArrow backend gives you faster input-output, less memory, nullable types, string operations that can be five to ten times faster, and up to roughly fifty percent less memory on text. Downcasting dtypes and using the category type shrink memory further.
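A sketch of the category conversion and of a chained pipeline with query, assign, and groupby. The frame is synthetic, four rows repeated a thousand times, and the byte counts it prints depend on your pandas version and string backend.

```python
import pandas as pd

df = pd.DataFrame({
    "status": ["active", "churned", "active", "active"] * 1000,
    "price": [10.0, 20.0, 30.0, 40.0] * 1000,
    "qty": [1, 2, 3, 4] * 1000,
})

# Low-cardinality strings: integer codes plus a small lookup table.
as_category = df["status"].astype("category")
plain_bytes = df["status"].memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(plain_bytes, category_bytes)

# Method chaining: filter, add a computed column, then aggregate.
result = (
    df.query("qty > 1")
      .assign(revenue=lambda d: d["price"] * d["qty"])
      .groupby("status", as_index=False)
      .agg(total_revenue=("revenue", "sum"))
)
print(result)
```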
Let me close by connecting all of this to where the course goes next, because that's the whole point of today. The shape the next episodes assume is this: you load, clean, and engineer your data in pandas, then split it into X, the feature matrix, and y, the target, then hand NumPy arrays to scikit-learn or to your own math. Splitting X and y is as simple as dropping the target column to get X and selecting the target column to get y. To cross the boundary back into NumPy you call to-numpy, which is preferred over the older dot-values attribute, and you can ask for a specific dtype like float32. Scikit-learn often accepts DataFrames directly and even preserves your feature names, but under the hood it's NumPy plus BLAS, which is the full-circle payoff of everything today.
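The hand-off from pandas to NumPy is only a few lines. A minimal sketch with an invented housing frame, where price is assumed to be the target column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sqft": [50, 80, 120],
    "rooms": [2, 3, 4],
    "price": [150.0, 240.0, 360.0],
})

X = df.drop(columns="price")  # feature matrix, still a DataFrame
y = df["price"]               # target, a Series

X_np = X.to_numpy(dtype="float32")  # request a specific dtype at the boundary
y_np = y.to_numpy()

print(X_np.shape, X_np.dtype)  # (3, 2) float32
print(y_np.shape)              # (3,)
```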
So that's the muscle memory: vectorized thinking instead of Python loops, broadcasting, axis semantics, label-versus-position selection, missing-data handling, split-apply-combine, and tidy reshaping. This episode builds directly on the toolkit setup from last time, so run these snippets in a notebook with percent-timeit and feel the speed. And it unlocks the next two episodes, the machine learning workflow with train-test split, where you'll use X and y, to-numpy, seeding, and splitting, and then linear regression, which leans on dot products, the at operator, the linalg solve, and broadcasting to build the design matrix. The vectorization you learned today is exactly what makes from-scratch gradient descent later run fast enough to be worth typing. Get comfortable here, and the rest of the course stops feeling like magic.