Welcome back to Machine Learning Applied. In this episode, I'm interviewing Ming Chang from Raybeam. Raybeam is Dept Agency's latest acquisition. Remember, I mentioned earlier that Dept Agency is the parent organization. They're out of Amsterdam and they've been acquiring various companies. I come to them through Rocket Insights, one of their acquisitions, and Raybeam is their latest acquisition. Raybeam's primary focus is data science and data analytics, whereas Rocket Insights is primarily app development. So it's really good to have Raybeam on the team, because I'll be able to pick their brains about various topics deep in the data science space.

In this particular episode, we're talking about Databricks. Now, going into this episode, I actually thought Databricks was something of an analytics platform, like a desktop application similar to Tableau. But it turns out, as you'll see in this episode, I was wrong about that. Databricks is also an MLOps platform, which actually makes it a competitor to SageMaker and Kubeflow and some of the other tools we've talked about. So I'm probably gonna take a step back from MLOps after this episode so we can talk about other things in data science and machine learning. We've kind of beaten a dead horse on MLOps, but I do know that Databricks is a product that kept coming up over and over in my exposure to data science. And so I thought it would be worth deep diving, since it tends to be a favorite tool within Raybeam, who definitely know what they're doing in the data science space. So let's dive in and have Ming talk us through Databricks.

Today, we have Ming Chang from Raybeam. So Ming Chang, if you could introduce yourself, your role and what you do at Raybeam, and a little bit about Raybeam's background in data science.

Okay, cool. I'm Ming, part of Raybeam. I'm a software engineer at Raybeam and we just got acquired by Dept. Raybeam really focuses on data and analytics, and we're kind of expanding into MLOps and AI. At Raybeam, I'm mostly focusing on Python, Databricks and deployment of code. More of the bottom layer, the infrastructure layer of coding.

So in the past few episodes, we've talked a lot about MLOps. We've talked about SageMaker, using that as an end-to-end deployment solution on AWS, from ingesting your data, data lake, data transformation, feature store, and then pipelining that through to various locations, the final solution being a deployed machine learning model. So that's MLOps. Today, we're gonna be talking about Databricks. And I don't know anything about Databricks, to be honest. I understand that it is not an MLOps platform. It's something different, right?

Well, from a big-picture point of view, conceptually it's very similar to SageMaker, actually. So you can kind of think of them as competing products. What Databricks says they are is data analytics and AI, all hosted on the cloud. What they're offering is an interface for you to create notebooks and execute Python code. They have a hosted Spark cluster that you can create in the UI, where your notebooks and your Python code can execute. And they also have a hosted Delta Lake, which is a storage layer. So I haven't used SageMaker myself previously, but I think it's conceptually very, very similar.

It is conceptually similar, okay. And why does Raybeam choose Databricks over some alternative solutions? Like in the last episode, we talked about Kubeflow. I'm familiar with SageMaker.
And then of course there are other cloud solutions on GCP and Azure and so on. Why does Raybeam choose Databricks?

Well, we're definitely not just a Databricks house. We have multiple clients, and we recommend to each client whichever tool is best for their goals. So the client that I'm working with right now is using Databricks, and we did recommend that to them, but we do recommend SageMaker to others. We recommend other solutions to different clients, just because each company has different goals, they may have employees with different historical knowledge, or they have pipelines that are already running on a different platform, for example. So we're definitely not gonna go to a client and say, hey, let's rip out everything you have that's working well for you and let's go with Databricks, right? It's definitely whatever tool is gonna help them achieve their goals.

Got it. What other alternative tools are there in this space? Is Snowflake another competitive product like this?

Sure. So one of Databricks' strong points is big data and data analytics, and it also has a competing storage layer. So yeah, in that regard, Snowflake is a competitor. I think Databricks has a few more features in terms of machine learning that Snowflake doesn't really focus on, but at least in the data area, they are competitors. Though I think there are some differences also, right? One of Databricks' key value-adds is that they've rewritten Spark in C++, which they say is gonna give you a lot better execution time, and which we've seen with our current client, who has been able to improve their execution times a lot. So that's one of the cool things about Databricks.

And correct me if I'm wrong, but even before entering the machine learning space, I had heard of Databricks. It strikes me that they've been around the block for quite a while, and maybe some of these cloud solutions like SageMaker are relatively new. Is there maybe a sense in which Databricks is a traditional platform for the same types of tooling, and is maybe more tried and true, more mature, has been around the block for a while?

I actually don't know the exact year that Databricks started. I wouldn't categorize them as one being more traditional and one being newer, though. I think it's more so different feature sets with different focuses. So definitely not one is more traditional, one is newer. It's a different user experience and a different feature set for different tasks. But in the end, they're still similar; they're both just computing platforms.

So I know with SageMaker, the focus is primarily in code, deploying a stack to the cloud for MLOps. Does Databricks come with, say, a desktop application that has maybe some GUI for exploring your data? Maybe it starts on the desktop and then it eventually deploys to the cloud? Or is it quite similar to SageMaker in the sense that it's a code-to-cloud platform?

So I think this is one of the bigger differences. The interface on Databricks is basically on their website. On their website, it looks a little bit like JupyterHub, in that you have kind of a navigation area and you can create notebooks in that UI. But they do have another option, which is that they integrate with whichever IDE you're using locally on your laptop.
So I like using VS Code, for example; some people like to use PyCharm. But they have this thing called Databricks Connect where you can actually execute code from your IDE, and any Spark code executes on the hosted cluster that resides at Databricks, which is kind of cool, because your user experience is basically that you're coding in your IDE. There's no difference from testing stuff out on your actual laptop, and anything that accesses big amounts of data gets executed on their cluster, which in the UI you can configure to be as big as you need it to be. So that seems like a very cool feature that they have.

That's really slick. But they don't have a program that you download onto your desktop. So truly, it does just sound like a competitor in the space of MLOps, data engineering, and all that stuff.

Yeah, if you wanna deploy stuff to their hosted clusters, you can do that.
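To make that Databricks Connect workflow a bit more concrete, here is a minimal sketch of local-IDE-but-remote-Spark code. It assumes the databricks-connect package is installed and already pointed at a workspace and cluster (the exact setup steps vary by version), and the table name is hypothetical.

```python
# Minimal Databricks Connect sketch: this script runs locally (VS Code,
# PyCharm, etc.), but the Spark work executes on the hosted Databricks
# cluster. Assumes databricks-connect is installed and configured for a
# workspace/cluster; the table name below is made up.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# With Databricks Connect set up, this session is backed by the remote
# cluster rather than a local Spark instance.
spark = SparkSession.builder.getOrCreate()

# The heavy lifting (scan + aggregation) happens on the cluster; only the
# small aggregated result is pulled back to the laptop.
page_views = spark.table("analytics.page_views")
top_pages = (
    page_views
    .groupBy("page")
    .agg(F.count("*").alias("views"))
    .orderBy(F.desc("views"))
    .limit(20)
)
print(top_pages.toPandas())
```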
And can you also deploy it to your own cloud provider: AWS, GCP, Azure?

So how Databricks works is that they're actually using one of those three that you mentioned. Their servers are actually on either AWS, Azure, or GCP, and when you sign up, you actually choose one of those. They're just kind of providing the layer on top.

And when you choose one on registration, you connect it to your own account and it will spin up these services in your cloud?

Yes.

Can you walk me through maybe the decision steps you might make towards choosing one of these MLOps products over another? Let's say we've got Kubeflow, SageMaker, GCP's or Azure's own services, Databricks, Snowflake, and so on. The reason I'm reaching out to Raybeam through you is your exposure to the options in this space. And it seems like there is a strength in Databricks, but also a familiarity with various competing solutions. I'm a SageMaker guy. In the last interview, with Dirk, his specialty was Kubeflow, for the generality, the universality of it, so that it can be deployed on various cloud providers depending on the client. And then it seems that Raybeam has a broader territory of tools they use for different clients, a big one being Databricks. How might you decide which tool to use, given different circumstances?

Okay, that's a really good question. So when we go to a client, often the client may not know what big issues they have, and we don't know what kind of big issues they have. So we actually go through a pretty lengthy exploration process just to find out what kind of troubles they have during their day-to-day operations and what their biggest needs are. We'll figure out where the biggest weaknesses in their current process are, and then figure out what kind of new process we can bring them that can achieve their biggest goals. So say they're missing a piece of infrastructure on the compute side, then maybe we'll add a tool that fits their needs there. Maybe they're missing something on the data analytics side; we'll add a tool there. So it's a very dynamic thing. We explore the needs of every client and go from there.

Do you have any examples of specific decisions that might lean one way over another for the selection of some product?

Yeah, sure. So say a client has a lot of expertise in a particular product and their processes using that product are already going very well; then we're definitely gonna keep that product and keep those processes. Versus, for example, if you're running on a Spark cluster that your company created yourself, and it's kind of old and you're having problems with scaling that infrastructure, then we're definitely going to say, let us help you migrate to a cloud infrastructure that solves those scalability problems for you. And this is something that our current client had some trouble with. They had an on-premises cluster that couldn't scale to the size of their models. Databricks has helped solve that, because before you execute anything, you can choose a cluster that fits the size of the model, instead of having to tweak the code of the model to fit the size of the cluster.

And if you're up for it, can you give us a quick rundown of Databricks? Sort of beginning to end, some of the tools involved, the various phases like data lake, data warehouse, feature engineering and stuff like that. What does a day in the life of Databricks look like?

Yeah, sure. So I'm actually not as experienced in the machine learning flow on Databricks. So far, I've been dealing with a lot of the ETL pipelines and data pipelines, so I'll focus on that area. When you log into Databricks, the first thing you have to do is create a cluster, and you choose the number of CPUs, the amount of RAM, how many workers you want, et cetera. Then after that, you're able to go into the notebook interface, and that interface is very similar to something like JupyterHub, anything with notebooks, where you get to write some Python code and execute it on your data. They actually have kind of a tutorial notebook as well, where you get to use SQL to access your tables. It's kind of similar to Snowflake in that regard: you have an area where you can see your database and the tables, and an area for you to type in your SQL to execute. But what I found cool with Databricks was that your notebooks are a little bit more permanent, in that you create your notebooks, you edit them, you run them, but you can also, through their UI, commit your notebooks to a Git repo. If you have a Git repository created somewhere, you can version control your notebooks, which is kind of cool, so you don't accidentally lose your notebooks.

And I've done that with my own JupyterHub notebooks. I've done that plenty of times, where I accidentally deleted something and didn't notice until later. So having that Git integration is super useful. And then on the data side, does it integrate with whatever services you might already be using, like RDS or S3 or whatever the bucket equivalent is for whatever cloud provider you're using?

Yeah, you can definitely access data from S3, from Snowflake, wherever you have it stored.
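As a rough illustration of that notebook workflow, here is what a couple of cells might look like once a cluster is attached. In a Databricks notebook a SparkSession named `spark` is already available; the table and S3 paths below are hypothetical.

```python
# Sketch of Databricks notebook cells once a cluster is attached.
# `spark` is provided by the notebook environment; the table name and
# S3 path are made up.

# Query a table registered in the metastore with SQL
daily = spark.sql("""
    SELECT order_date, COUNT(*) AS n_orders, SUM(amount) AS revenue
    FROM sales.orders
    GROUP BY order_date
    ORDER BY order_date
""")
daily.show(10)

# Or read files directly from cloud storage (S3 here) into a DataFrame
events = spark.read.parquet("s3://my-bucket/raw/events/2021/")
events.printSchema()
```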
Okay. So it seems like, where I thought coming into this interview that it was quite different from the current MLOps solutions, it sounds quite similar. It sounds like there's simply a slew of various competitors in this space, and it's sort of just pick your poison, in the same way that there are various IDEs and web UI frameworks or whatever the case.

Yeah, I think that's the general picture.

And so probably the best deciding factor is where you're coming from. Are you already using a cloud provider? Or is the client already using some tool set, tangential to or even completely involved with one of the tools you might be selecting from?

And that's always the starting point.

And then if you have a blank slate, where the client is just getting started, they haven't even collected data or anything like this, they just know they want to start a project, how might you go about deciding? Do you have a favorite tool? Or would you go a different angle in deciding how to set up their stack?

So Databricks is actually a fairly well-rounded tool. It doesn't necessarily do everything the best out of all the tools, but it has a lot of features, it is generally easy to use, and it's good enough at everything. So I think it's actually a very good tool that I would recommend to someone, or some company, who's getting started out in the data space and in the ML space. It's got you covered in both data and ML, and we haven't found a feature that's really lacking in it. Whatever you want to do, it's there. So I would recommend Databricks to someone new, and the user experience in general is pretty good.

What about pricing? With SageMaker on AWS, the pricing is tied completely to just the services you're using; there's not an added cost. How does Databricks fit into this scenario?

So it's pretty interesting, because Databricks gives you a choice of which cloud to run on, and each of the clouds is a little bit different on pricing. But say you chose Databricks on AWS versus Databricks on GCP, the pricing would follow what those cloud providers charge. On top of that, Databricks charges a small additional fee for usage of their tool. So it's hard to say exactly how much you'll be charged; it depends on which cloud you're using. But what they charge on top is fairly small. I've seen people save a lot of costs after migrating from Snowflake to Databricks.

Is it conducive for single developers or startups, or is this more enterprise-focused, pricing-wise?

Yeah, I think pricing for startups and single developers is not going to be a problem. Depending on how much data you have, if you're just starting out as a single developer, you might not really need a cloud platform at all, but they actually do make things fairly easy for you to start out if you're an individual. So I don't think it's a bad choice to start out with Databricks even if you're an individual.

Cool. Have you used Kubeflow before? Do you know how Databricks compares to it in complexity?

I haven't used Kubeflow, but I know that Databricks provides a hosted solution for MLflow. So they use MLflow, and when you're interacting with Databricks, you can use MLflow.

Well, we're not that far into the episode. Do you have anything else to add on Databricks?

Let me think about this. Yeah, I think so. So Databricks offers something called Delta Lake, which is basically a layer on top of Parquet files. Delta Lake is open source, but they provide a hosted solution for it as well. So when you have Parquet files and a lot of data, one of the inconveniences with Parquet is that you can't edit or delete rows; you can only insert. So if you want to change something, you have to write a whole new Parquet file with that row changed, and that's a bit annoying. What Delta Lake does is keep track of the version history of Parquet files, so to the user it looks like, oh, I now have edit and delete functionality on my Parquet files. So that's pretty cool.
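A minimal sketch of what those row-level edits look like with the open-source delta-spark package (on Databricks the `delta` module is available out of the box). It assumes a Delta-enabled SparkSession named `spark`, and the path and columns are made up for the example.

```python
# Delta Lake sketch: row-level update/delete on top of Parquet files.
# Assumes a Delta-enabled SparkSession `spark` (as in a Databricks
# notebook, or delta-spark configured locally); path/columns are made up.
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Write a small DataFrame out as a Delta table (Parquet + transaction log)
df = spark.createDataFrame(
    [(1, "UK", False), (2, "US", True), (3, "UK", False)],
    ["user_id", "country", "deactivated"],
)
df.write.format("delta").mode("overwrite").save("/tmp/delta/users")

users = DeltaTable.forPath(spark, "/tmp/delta/users")

# Edit rows in place -- something plain Parquet files can't do
users.update(
    condition=F.col("country") == "UK",
    set={"country": F.lit("GB")},
)

# Delete rows; Delta writes new Parquet files and records the change in
# its transaction log instead of you rewriting files by hand
users.delete(F.col("deactivated"))

# Time travel: read the table as it was before these changes
original = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/users")
```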
Yeah, I'll justify that a little bit from my own experience too. We talk about this journal app in this podcast series a bunch, and I do embeddings of users' entries using sentence transformers, Hugging Face sentence-transformers. It takes a document and turns it into a 768-dimension vector, which I currently store to RDS as doubles inside square brackets, basically a stringified array. Bad idea. This thing wants to be stored somewhere else: a feature store, or as NumPy files on S3. And the management of those files, just as you indicated, does not operate the way a transactional database operates, where you can just update a row or delete a row. That can cause performance issues, and it can cause code complexity. Even if there aren't necessarily performance issues at stake, keeping that in mind means extra code. So you're saying that Delta Lake abstracts that process away for you and can treat your large storage files as if they're a transactional database.

Yes, exactly.

And what is a Parquet file? We haven't talked too much about that in this series, so this'll be a good opportunity to dive into that.

Okay. I guess if you imagine a CSV, for example, you're storing each record as a row in that text file. A Parquet file is similar, except each record is in a column. So it's kind of like a flipped CSV, if you think about it that way.

Flipped diagonally.

Yeah, just in its format. And that creates a lot of efficiencies when you're querying a lot of data. So it's basically a slightly different format than a CSV that is more optimized for big data.

Okay. Yeah, I always thought of it as an optimized version for data analytics, because in analytics you're gonna be running queries and aggregate functions over columns rather than rows. So if you want the max of some feature or some column, the way that Parquet is stored on disk, as opposed to a CSV, allows you to quickly scan for aggregation queries. So good for analytics. But I always wondered, does it have a place in machine learning? Because a lot of the way you work with machine learning code is kind of row-centric; it's a little bit more traditional. But at the very least, Parquet seems to be a standard, or the standard, file format of data analytics pipelines like these tools, Databricks and SageMaker and all that.

Yeah, it's very popular.

I know there's a tool called PyArrow that allows you to sort of scan over your S3, what do you call it, keys, folders basically, within the bucket, looking for certain entries within your bucket as if you're running SQL queries. But it's not using AWS Athena; it's not actually a SQL query engine. It's just a file seeker, and it works best when Parquet is being used. So something about the format of that file makes it really easy to work with and to navigate. It's a Python library for working with files stored on disk. So it sounds like Delta does all that for you. You wouldn't use a library; it's all handled out of the box. And a lot of even the SageMaker tooling, for example, there's this thing called AWS Athena that lets you query your file data in S3 as if you were running a SQL query, but you can't update files in place; it's all read-only. So PyArrow is just a Python library that lets you treat files on disk, basically, it sounds like, as a library version of Delta Lake. It lets you navigate your files on disk and make updates as necessary. Not quite SQL, but close to it. And from what I understand, it treats Parquet as a first-class citizen. They really like Parquet a lot.
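To ground the Parquet and PyArrow discussion, here is a small, self-contained sketch using pyarrow: it writes some invented embedding rows to a Parquet file, then shows column pruning and filter pushdown via pyarrow.dataset. The file name and data are made up for the example.

```python
# Why columnar Parquet helps for analytics-style access, sketched with
# PyArrow. The file name and the fake "embeddings" are invented here.
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# Store journal entries with their embedding vectors as columns
table = pa.table({
    "entry_id": pa.array(range(1000)),
    "user_id": pa.array(np.random.randint(0, 50, size=1000)),
    "embedding": pa.array(np.random.rand(1000, 768).tolist()),  # 768-dim vector per row
})
pq.write_table(table, "entries.parquet")

# Column pruning: read only the columns you need; the rest stay on disk
ids_only = pq.read_table("entries.parquet", columns=["entry_id", "user_id"])

# Filter pushdown with pyarrow.dataset: scan a file or a directory of
# Parquet files (local or S3) roughly like a WHERE clause, without a
# SQL engine
dataset = ds.dataset("entries.parquet", format="parquet")
one_user = dataset.to_table(
    columns=["entry_id", "embedding"],
    filter=ds.field("user_id") == 7,
)
print(one_user.num_rows)
```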
Okay. A lot of these Python packages, there's a chance that the implementations of some of the tools we're talking about could be using PyArrow already.

There's probably a very good chance.

Well, I like to wrap up these episodes, per Shippet.io's format, with a pick. So I like to ask you about maybe something you're really into lately, like a book or a hobby or some new big discovery that you're tinkering around with.

I guess, can I talk about two?

Yeah, of course.

So previous to working at Raybeam, I actually had a fair bit of time to work on my own project, because I was very interested in the stock market, and at the same time I like programming. So what I did was actually attempt to write this framework for automated trading. I connected to this service called Alpaca that provided a stock trading API, and I would download stock data and try to analyze the peaks and the valleys of the price fluctuations, and then try to make trades.
Unfortunately, I wasn't able to find a solution that actually made money consistently, but it was kind of a cool experience.

No success, huh? Well, actually, it's funny. I started this podcast series in, I think, 2016, and about halfway through, the pet project of the podcast, before I eventually switched to the journal app, was a Bitcoin trading bot using deep reinforcement learning, with a framework called TensorForce. And sure enough, no success. So, no success. Yeah, that would be pretty interesting to explore again sometime in the future. Do you have a repository on GitHub or anything people can take a look at?

No, unfortunately it wasn't advanced enough to really publish.

Yeah. And then number two?

Number two. So the other thing that I've been interested in is drones: building a small drone, putting a Raspberry Pi on there, and writing some Python programs on the Raspberry Pi to control where the drone flies. And that's been a very cool experience also, especially because once you write some code, you actually see the thing do stuff.

Physical, yeah. This isn't just a script. One thing I like about web app programming is that you see results on the screen, but this is a physical device in the world that you just programmed.

And the risks are real too, because if you have a bug in your code, your drone might run into a tree.

Have you gotten VR all wired up to it yet?

No, but that'd be cool, to be able to see everything that it sees while it's flying. That's something that I really want to do.

Yeah, let me know if you get that all wired up. I saw some people doing that at the Cliffs of Moher in Ireland, and they got to explore areas that no tourists get to explore, and do it all visually. It looks so cool. All right, Ming. Well, this has been wonderful. Thanks for joining.

Yeah, it was great chatting with you.