MLA 003 Storage: HDF, Pickle, Postgres

May 24, 2018

This episode covers the practical workflow of loading, cleaning, and storing large datasets for machine learning, moving from ingesting raw CSVs or JSON files with pandas to saving processed datasets and neural network weights in HDF5 for efficient numerical storage. It distinguishes among storage options, explaining when to use HDF5, pickle files, or SQL databases, and highlights how libraries like pandas, TensorFlow, and Keras interact with these formats and why these choices matter for production pipelines.

Show Notes
Try a walking desk

Sitting for hours drains energy and focus. A walking desk boosts alertness, helping you retain complex ML topics more effectively. Discover the benefits.

Data Ingestion and Preprocessing

  • Data Sources and Formats:

    • Datasets commonly originate as CSV (comma-separated values), TSV (tab-separated values), fixed-width files (FWF), JSON from APIs, or directly from databases.
    • Typical applications include structured data (e.g., real estate features) or unstructured data (e.g., natural language corpora for sentiment analysis).
  • Pandas as the Core Ingestion Tool:

    • Pandas provides versatile functions such as read_csv, read_json, and others to load various file formats with robust options for handling edge cases (e.g., file encodings, missing values).
    • After loading, data cleaning is performed using pandas: dropping or imputing missing values, converting booleans and categorical columns to numeric form.
  • Data Encoding for Machine Learning:

    • All features must be numerical before being supplied to machine learning models like TensorFlow or Keras.
    • Categorical data is one-hot encoded using pandas.get_dummies, converting strings to binary indicator columns.
    • The underlying NumPy array of a DataFrame is accessed via df.values for direct integration with modeling libraries (see the sketch after this list).
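
For concreteness, here is a minimal sketch of the ingest, clean, encode, and extract flow described above; "housing.csv" and its column names are hypothetical placeholders, not from the episode.

```python
# A minimal sketch of the ingest -> clean -> encode -> extract flow.
# "housing.csv" and its column names are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("housing.csv", encoding="utf-8")   # also: read_json, read_fwf, read_sql, ...

# Clean: drop rows missing the target, impute a numeric feature.
df = df.dropna(subset=["price"])
df["sqft"] = df["sqft"].fillna(df["sqft"].median())

# Encode: booleans to 0/1, categorical strings to one-hot indicator columns.
df["has_garage"] = df["has_garage"].astype(int)
df = pd.get_dummies(df, columns=["neighborhood"])

# Extract the underlying NumPy arrays to hand off to TensorFlow/Keras.
X = df.drop(columns=["price"]).values
y = df["price"].values
```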

Numerical Data Storage Options

  • HDF5 for Storing Processed Arrays:

    • HDF5 (Hierarchical Data Format version 5) enables efficient storage of large multidimensional NumPy arrays.
    • Libraries like h5py and built-in pandas functions (to_hdf) allow seamless saving and retrieval of arrays or DataFrames.
    • Keras (and TensorFlow via Keras) saves neural network weights as multi-dimensional arrays in HDF5 for model checkpointing and early stopping, enabling robust recovery and rollback.
  • Pickle for Python Objects:

    • Python's pickle protocol serializes arbitrary objects, including machine learning models and arrays, into files for later retrieval.
    • While convenient for quick iterations or heterogeneous data, pickle is less efficient for NumPy ndarrays than HDF5, offers little compression, and poses security risks when loading files from untrusted sources (see the sketch after this list).
  • SQL Databases and Spreadsheets:

    • For mixed or heterogeneous data, or when producing results for sharing and collaboration, relational databases like PostgreSQL or spreadsheets such as CSVs are used.
    • Databases serve as the endpoint for production systems, where model outputs - such as generated recommendations or reports - are published for downstream use.
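
As referenced above, here is a short sketch contrasting the HDF5 and pickle paths for the same DataFrame. The file names are placeholders, and to_hdf assumes the PyTables package is installed.

```python
# A short sketch contrasting HDF5 and pickle storage for the same DataFrame.
# File names are placeholders; to_hdf requires the PyTables package.
import numpy as np
import pandas as pd
import pickle

df = pd.DataFrame(np.random.rand(1000, 4), columns=list("abcd"))

# HDF5: efficient, optionally compressed storage for numeric DataFrames/arrays.
df.to_hdf("data.h5", key="processed", mode="w", complevel=9, complib="zlib")
df_from_hdf = pd.read_hdf("data.h5", key="processed")

# Pickle: quick serialization of arbitrary Python objects; convenient, but little
# compression, and never unpickle files from untrusted sources.
with open("data.pkl", "wb") as f:
    pickle.dump(df, f)
with open("data.pkl", "rb") as f:
    df_from_pickle = pickle.load(f)
```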

Storage Workflow in Machine Learning Pipelines

  • Typical Process:

    • Data is initially loaded and processed with pandas, then converted to numerical arrays suitable for model training.
    • Intermediate states and model weights are saved using HDF5 during model development and training, ensuring recovery from interruptions and facilitating early stopping (see the Keras callback sketch after this list).
    • Final outputs, especially those requiring sharing or production use, are published to SQL databases or shared as spreadsheet files.
  • Best Practices and Progression:

    • Quick project starts may involve pickle for accessible storage during early experimentation.
    • For large-scale, high-performance applications, migration to HDF5 for numerical data and SQL for production-grade results is recommended.
    • Alternative options like Feather and PyTables (an interface on top of HDF5) exist for specialized needs.
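
As referenced above, this is a minimal sketch of HDF5-backed checkpointing and early stopping using Keras callbacks; the toy model, random data, and file name are assumptions, not from the episode.

```python
# A minimal sketch of HDF5-backed checkpointing and early stopping with Keras
# callbacks; the toy model, random data, and file name are assumptions.
import numpy as np
from tensorflow import keras

X = np.random.rand(500, 10)
y = np.random.rand(500, 1)

model = keras.Sequential([
    keras.Input(shape=(10,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

callbacks = [
    # Write the best weights seen so far to an HDF5 file after each epoch.
    keras.callbacks.ModelCheckpoint("checkpoint.weights.h5",
                                    save_best_only=True, save_weights_only=True),
    # Stop once validation loss has worsened for 5 consecutive epochs,
    # then roll back to the best snapshot.
    keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
]
model.fit(X, y, validation_split=0.2, epochs=100, callbacks=callbacks, verbose=0)
```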

Summary

  • HDF5 is optimal for numerical array storage due to its efficiency, built-in compression, and integration with major machine learning frameworks.
  • Pickle accommodates arbitrary Python objects but is suboptimal for numerical data persistence or security.
  • SQL databases and spreadsheets are used for disseminating results, especially when human consumption or application integration is required.
  • The selection of a storage format is determined by data type, pipeline stage, and end-use requirements within machine learning workflows.

Go from concept to action plan. Get expert, confidential guidance on your specific AI implementation challenges in a private, one-hour strategy session with Tyler. Book Your Session with Tyler.

Transcript
You're listening to Machine Learning Applied. In this episode, we're gonna talk about storing your NumPy and pandas data onto disk. Now here's the traditional workflow. First, you'll get your data from somewhere. You'll get your dataset — they call it your dataset. That's gonna be, you know, housing prices and all their features: number of bathrooms, number of bedrooms, whatever. Or if you're doing natural language processing, it's gonna be a corpus of text. Maybe it's just a whole bunch of text and you're building some sort of text-generating process, or you're building a supervised machine learning model, and what you have is a column of texts — big paragraphs of sayings — and then another column of sentiments: happy, sad, angry, et cetera. Maybe you're building a sentiment analysis bot for IMDB reviews or tweets or whatnot. So the original data you receive from your customer or your client, or from online, will usually be in a spreadsheet, a CSV file or a TSV file — tab-separated values instead of comma-separated values. Or another common format is called a fixed-width file, FWF. It's kind of like a TSV, except that it's not necessarily tabs; it could be, like, 10 spaces that always indicate a break in columns. Or maybe you're getting a JSON file from an API, or data from a database. It could be any number of sources. Well, as with the last episode, the magic of pandas is that pandas can come in here and it has a function for reading any possible dataset under the sun. You can read a CSV, a TSV, a FWF, a database, a JSON API response, et cetera. These are all just one-liners. So it's like pandas read_csv, and then it has all sorts of options for edge cases or file encodings or any sort of issues that you might bump up against. So you'll read in your dataset through pandas, and then you'll munge your data, you'll impute your data. You'll remove rows which have missing values, or you may forward-fill those missing values into the null slots — imputing your data. So you'll do all sorts of data manipulation in pandas before you want to feed your data to your machine learning model. Now, one thing to note: before your data hits the model, it needs to be numbers. TensorFlow and Keras neural networks can't handle strings, they can't handle true or false. Everything has to be a number. Everything has to be converted to a number. And so pandas has tools for converting all of your stuff to numbers. If you have, let's say, categorical strings like male or female or other, or S, M, L for small, medium, large, et cetera, you can do what's called one-hot encoding these strings. The pandas function for that is get_dummies. It's kind of a weird word, I think, dummies — it has a special meaning in statistics about indicator variables that are ones or zeros. So get_dummies will one-hot encode your string variables — your categorical string variables, mind you — into numbers. And you'll just figure out what to do with all the other things. You'll turn your booleans into zeros and ones, et cetera. Pandas will do this all for you. So you'll turn all of your dataset into numbers through pandas, and now you can pull out the NumPy array that's underlying your pandas data. All you have to do is take your df, your DataFrame — usually it's stored in a variable called df — and say df.values, and that's your NumPy array for all the data inside your DataFrame.
And then you'll pass that ndarray, your NumPy array, to TensorFlow or Keras. So you ingested your dataset through pandas, you did your data manipulation through pandas, now you have your clean, processed data, and you pull out the NumPy array version deep down inside and hand it off to Keras or to TensorFlow to crunch on those numbers. Now, TensorFlow and Keras both have the option of saving intermediate values of the training process of the model. Let's say you have a deep neural network that takes a very long time to train. If it's maybe one hour — and that's kind of a long time, right? You go and get a coffee, you come back and check if it's done — you don't necessarily need to save away the model. And when I say saving the model, I'm talking about the neuron weights. The weights inside each of your neurons in a neural network will be stored as NumPy arrays. Each neuron will be an array of weights. Every layer will be an array of those arrays — okay, so now you have a 2D matrix — and the whole neural network will be an array of those layers. So a 3D array, a 3D NumPy array. That's how Keras and TensorFlow are storing the weights of your neural network in RAM. Now, if it takes an hour to train your model, you don't really need to save those weights away somewhere. But if it takes three days, you might want to save your weights periodically to a file or to a database or something — maybe once every three or four hours. Save those weights away in case of any number of things: in case your computer crashes, or in case you want to kill your model on the second day so that you can start using it in its current state. You're like, okay, you've had enough time to train, I'll just start using you the way you are. Then you'll want to pull those weights out from disk, load them up into your model, and use them to make your predictions. Or what's called early stopping. Say your model is training well on the training set — it's getting a very good root mean squared error or accuracy score while training on the training set. But when we test how well it's doing on the validation set, it starts to do worse and worse over time. There's this curve you'll see online — it's kind of hard to describe. Your training error might, you know, come in from the top and move its way down, so it's getting better and better over time. Lower error is better, and so we're going down over time. And then your validation error goes down too. It's going down alongside your training error; they're moving in tandem, coming down and down. But at some point your validation error starts to go back up. It's like a parabola, like a U. And it's at that bowl in the validation error where your model starts to overfit and underperform on the validation set. We want to cut it off right there. We want your model to stop training and use the weights that it had before. Well, it's very difficult to do that if you're not saving away your weights. Because if you're saving your weights — say snapshot one, snapshot two, snapshot three — we can look at the point at which we're doing worse than before and say, oh look, five snapshots ago was the ideal. So stop training, delete the last five, and roll back to what was saved on disk five network-weight snapshots ago. So it allows you to roll back — all sorts of benefits to saving your neurons to disk. So that's a lot.
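
To make the snapshot-and-roll-back idea concrete, here is a self-contained sketch of manually saving and restoring Keras weights to an HDF5 file; the tiny model, random data, and file name are placeholders, not from the episode.

```python
# A self-contained sketch of manually snapshotting and restoring weights
# (stored as HDF5 via the ".weights.h5" file name); architecture, data, and
# file names are placeholders.
import numpy as np
from tensorflow import keras

def build_model():
    m = keras.Sequential([keras.Input(shape=(4,)),
                          keras.layers.Dense(8, activation="relu"),
                          keras.layers.Dense(1)])
    m.compile(optimizer="adam", loss="mse")
    return m

model = build_model()
model.fit(np.random.rand(100, 4), np.random.rand(100, 1), epochs=2, verbose=0)
model.save_weights("snapshot.weights.h5")     # periodic snapshot of the neuron weights

restored = build_model()                      # must rebuild the same architecture
restored.load_weights("snapshot.weights.h5")  # roll back / resume from the snapshot
print(restored.predict(np.random.rand(3, 4), verbose=0))
```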
A whole lot of stuff — I apologize. But now we're gonna talk about how we save those weights. Like I said, everything's in a NumPy ndarray in RAM. We're gonna take that and put it onto disk. So the name of the game is storing your NumPy array. We have a whole bunch of options. In the neural network frameworks like TensorFlow and Keras, the library they use is called HDF, and specifically the version of this file format we're on is HDF5. So you'll see things stored as .h5 files, or you'll see a library called h5py, but the core file format is called HDF. Now this is a really cool file format. What it allows you to do is open up your HDF store, just like you would open a file for writing to or appending to in Python, like normal. You open it, and now it allows you to save NumPy arrays to a key — you know, like a key-value store where the values are NumPy arrays. You don't actually have to save the file explicitly. Every time you assign a NumPy array to a key in your store, in your HDF file, it will just save it automatically. So let's say you create an HDF5 store and we'll just call it store — say store equals blah, blah, blah to create one of these things. And it's tied to a file on disk, so you'll actually save a file on disk called something.h5. Then you'll say store['network_weights'] = your NumPy array of network weights, and that was it — it actually saved it to disk. It's really efficient. It's very good at storing NumPy arrays because it works with NumPy's knowledge of the data type of the data it's storing. So in the same way that NumPy is really good at storing data in memory, HDF5 is good at storing that NumPy data on disk very efficiently, for very fast lookup and manipulation. It also has some really efficient, very fast compression utilities built in, so that as you save this data onto disk, it's also compressing it on the fly. So if you're gonna store NumPy ndarrays on disk, and that's pretty much it — you're, for the most part, just storing ndarrays — then HDF5 is the way to go. And of course, as always, pandas surprises you with the amount of functions it has. Coming to the rescue, there is a to_hdf function for saving your pandas DataFrame (and the NumPy data inside it) to an HDF store. Like I said, Keras and TensorFlow use HDF5 by default. So if you're saving your models away with Keras, you don't really have to know anything about the process — you just give it a file name and it will do all the work for you. But what it's doing is saving to an HDF file, and that's a very powerful file format for very efficient NumPy storage and manipulation. Now the next option is to store what's called a pickle file. A pickle file is a traditional Python file format for storing Python objects — Python numbers, or dicts, or arrays, anything that you constructed in Python, including a TensorFlow model in its current trained state, for example. You can save those away in a pickle file. Now, pickle is sort of a poor man's approach to saving data. Like I said, you can store anything, not just ndarrays. So if you're storing a lot of stuff beyond NumPy arrays, then it might be preferable to HDF5, but it doesn't really compress well. And the file format is actually known to be insecure, so be careful where you're storing these things if you have private data contained inside.
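
Here is a small sketch of that key/value style of HDF5 usage, shown with h5py for raw arrays and pandas.HDFStore for DataFrames; the keys and file names are made up for illustration.

```python
# A small sketch of key/value HDF5 storage: h5py for raw NumPy arrays,
# pandas.HDFStore for DataFrames. Keys and file names are made up.
import h5py
import numpy as np
import pandas as pd

weights = np.random.rand(3, 64, 32)           # e.g. layers x neurons x weights

# h5py: assigning data to a key writes it to disk, optionally compressed.
with h5py.File("weights.h5", "w") as f:
    f.create_dataset("network_weights", data=weights, compression="gzip")

with h5py.File("weights.h5", "r") as f:
    restored = f["network_weights"][:]        # read it back as a NumPy array

# pandas: an HDFStore behaves like a dict of DataFrames backed by one .h5 file.
with pd.HDFStore("frames.h5") as store:
    store["cleaned"] = pd.DataFrame(np.random.rand(10, 2), columns=["a", "b"])
    df_back = store["cleaned"]
```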
I don't know — for whatever reason, the pickle files I see tend to be used in industry kind of like training wheels, until you can move yourself off of pickle and onto a proper data storage format. You'll see people using pickle in the first 10 to 20 commits of their repository, the first month of coding, and then they'll move away from pickle towards something more dedicated to the type of data that they're storing specifically. So in the case of ndarrays, we like to store those in HDF5 files. And in the case of more heterogeneous, mixed data — okay, not just ndarrays, but lots of strings and dates and any other types of data — we like to store those in CSV files, spreadsheets, or proper databases: SQL databases, and Postgres tends to be the SQL darling of 2018. There are plenty of competitors out there — Oracle and SQL Server and MySQL, SQLite, et cetera — but a lot of people really gravitate towards Postgres. So: HDF5 files for NumPy arrays; pickle files for when, look, you just kind of quick-and-dirty wanna store some Python data into a file until you get a proper data storage solution in place; and then databases or spreadsheets for more production-ready, heterogeneous data. Now what I see a lot is people will store their data as CSV files — spreadsheets — if they want to send that to a manager or a customer, and then databases like Postgres sort of come into play at the end of the pipeline. Into Postgres, people will push the results of their machine learning model, because we're actually now taking those results to production. The data needs to get to the customer, or we need to share the data with fellow scientists or something like this. So usually what will happen is: we ingest the data from a dataset, we crunch it with pandas, we send it through our machine learning pipeline, we store intermediate steps using HDF5, and then we publish the results — whatever is the result of our machine learning model — into the database, into Postgres. As an example, take the website Pandora, which is all about taking thumbs-ups and thumbs-downs from people who are listening to songs and turning that into a new playlist based on a recommender system. What will happen is the user will either thumbs-up a song or thumbs-down a song; it will ingest that into the machine learning model, crunch the numbers, maybe store some intermediate results in HDF5 files. The result of all that might be to generate a new playlist, and we'll store that playlist in a Postgres database. So the SQL database is sort of the downstream end to your machine learning pipeline. Or, for example, clients that I work with who are very tech savvy and are familiar with RDBMSs themselves — they know SQL — I'll send them the connection string to the database where I save the results of our machine learning models. So it's a sort of collaborative central location where the results of our models get published, and we can all look at each other's tables and see what kind of results the others are getting. Now, those aren't the only three file storage options in Python, of course — there are plenty more. Those are three that are very common that you'll see. You may start to see some other competitors to HDF5 in particular. There's one called Feather, and then there's one called PyTables, which actually I think uses HDF5 under the hood — it's just a little bit more robust and feature-rich.
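
As a hedged illustration of that downstream publishing step, something like the following pandas + SQLAlchemy snippet could push a results table into Postgres; the connection string, table name, and columns are hypothetical, and it assumes a running Postgres server plus a driver such as psycopg2.

```python
# A hedged sketch of publishing model results to Postgres at the end of the
# pipeline; connection string, table name, and columns are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

results = pd.DataFrame({
    "user_id": [1, 2, 3],
    "recommended_playlist": ["chill_mix", "road_trip", "focus"],
})

engine = create_engine("postgresql://user:password@localhost:5432/mlpipeline")
results.to_sql("playlist_recommendations", engine, if_exists="replace", index=False)

# Colleagues or the production app can then read the published table directly.
published = pd.read_sql("SELECT * FROM playlist_recommendations", engine)
```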
So look into those offline — PyTables and Feather — but for the most part, the process is: start with pickle as your training wheels, graduate to HDF5 for your numerical storage, and then publish your results to a database, which is either gonna be used by your app or your website, or looked at by your fellow colleagues and researchers. Python machine-learning-oriented file storage.
Comments temporarily disabled because Disqus started showing ads (and rough ones). I'll have to migrate the commenting system.