Databricks is a cloud-based platform for data analytics and machine learning operations, integrating features such as a hosted Spark cluster, Python notebook execution, Delta Lake for data management, and seamless IDE connectivity. Raybeam utilizes Databricks and other MLOps tools according to client infrastructure, scaling needs, and project goals, favoring Databricks for its balanced feature set, ease of use, and support for both startups and enterprises.
Welcome back to Machine Learning Applied. In this episode, I'm interviewing Ming Chang from Raybeam. Raybeam is Dept Agency's latest acquisition. Remember, I mentioned earlier that Dept Agency is the parent organization; they're out of Amsterdam and they've been acquiring various companies. I came to them through Rocket Insights, one of their acquisitions, and Raybeam is their latest.
Raybeam's primary focus is data science and data analytics, whereas Rocket Insights is primarily app development. So it's really good to have Raybeam on the team, because I'll be able to pick their brains about various topics deep in the data science space. In this particular episode, we're talking about Databricks.
Now, going into this episode, I actually thought Databricks was something of an analytics platform, like a desktop application similar to Tableau. But it turns out, as you'll see in this episode, I was wrong about that. Databricks is also an MLOps platform, which actually makes it a competitor to SageMaker, Kubeflow, and some of the other tools we've talked about.
So I'm probably going to take a step back from MLOps after this episode, so we can talk about other things in data science and machine learning. We've kind of beaten the dead horse on MLOps, but Databricks is a product that kept coming up over and over in my exposure to data science, so I thought it would be worth deep-diving, since it tends to be a favorite tool within Raybeam, who definitely know what they're doing in the data science space.
So let's dive in and have Ming talk us through Databricks. Today we have Ming Chang from Raybeam. Ming, if you could introduce yourself, your role and what you do at Raybeam, and a little bit about Raybeam's background in data science.
Okay, cool. I'm Ming, a software engineer at Raybeam. We just got acquired by Dept. Raybeam really focuses on data and analytics, and we're expanding into MLOps and AI.
At Raybeam I'm mostly focused on Python, Databricks, and deployment of code: more the bottom, infrastructure layer of the stack.
So in the past few episodes, we've talked a lot about MLOps. We've talked about SageMaker as an end-to-end deployment solution on AWS: ingesting your data into a data lake, data transformation, a feature store,
then pipelining that through to various locations, the final result being a deployed machine learning model. So that's MLOps. Today we're going to be talking about Databricks, and I don't know anything about Databricks, to be honest. I understand that it is not an MLOps platform; it's something different, right?
Well, from a big-picture point of view, it's conceptually very similar to SageMaker, actually, so you can think of them as competing products. What Databricks says they are is data analytics and AI, all hosted on the cloud. What they're offering is an interface for you to create notebooks and execute Python code.
They have a hosted Spark cluster that you can create in the UI, where your notebooks and your Python code can execute. They also have a hosted Delta Lake, which is a storage layer. I haven't used SageMaker myself, but I think it's conceptually very, very similar.
It is conceptually similar.
Okay. And why does Raybeam choose Databricks over some alternative solutions? In the last episode we talked about Kubeflow. I'm familiar with SageMaker, and then of course there are other cloud solutions on GCP, Azure, and so on. Why does Raybeam choose Databricks?
Well, so we're definitely not just a Databricks house.
We have multiple clients, and we recommend to each client whichever tool is best for their goals. The client I'm working with right now is using Databricks, and we did recommend that to them, but we recommend SageMaker to others, and other solutions to other clients, just because each company may have different goals, employees with different historical knowledge, or pipelines already running on a different platform. So we're definitely not going to go to a client and say, hey, let's rip out everything that's working well for you and go with Databricks. It's definitely whatever tool is going to help them achieve their goals.
Got it. What other alternative tools are there in this space? Is Snowflake another competing product like this?
Sure. One of Databricks' strong points is big data and data analytics, and it also has a competing storage layer, so in that regard Snowflake is a competitor. I think Databricks has a few more features in terms of machine learning that Snowflake doesn't really focus on. So at least in the data area they're competitors, though there are some differences too. One of Databricks' key value-adds is that they've rewritten Spark's engine in C++, which they say gives you much better execution times, and we've seen that with our current client, who have improved their execution times a lot.
And so that's one of the cool things about Databricks.
And correct me if I'm wrong: even before entering the machine learning space, I had heard of Databricks. It strikes me that they've been around the block for quite a while, and maybe some of these cloud solutions like SageMaker are relatively new.
Is there maybe a sense in which Databricks is the traditional platform for this type of tooling, and is more tried and true, more mature, has been around the block for a while?
I actually don't know the exact year that Databricks started, but I wouldn't categorize one as more traditional and the other as newer. I think it's more that they have different feature sets and different focuses. It's a different user experience and a different feature set for different tasks, but in the end they're still similar; they're both computing platforms.
So I know with SageMaker the focus is primarily on code: deploying a stack to the cloud for MLOps. Does Databricks come with a desktop application that has a GUI for exploring your data? Maybe it starts on the desktop and eventually deploys to the cloud? Or is it quite similar to SageMaker, in the sense that it's a code-to-cloud platform?
So I think this is one of the bigger differences.
The main interface on Databricks is their website. It looks a little bit like JupyterHub: you have a navigation area and you can create notebooks in that UI. They also have another option, which is to integrate with whichever IDE you're using locally on your laptop.
I like using VS Code, for example; some people like to use PyCharm. They have this thing called Databricks Connect, where you can execute code from your IDE and any Spark code runs on the hosted cluster at Databricks. Which is kind of cool, because your user experience is basically coding in your IDE.
It feels no different from testing stuff out on your actual laptop, but anything that accesses large amounts of data gets executed on their cluster, which you can configure in the UI to be as big as you need. That's really slick. But no, they don't have a program that you download onto your desktop.
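To make that concrete, here's a minimal sketch of the Databricks Connect workflow Ming describes, assuming you've installed the databricks-connect package and pointed it at your workspace and cluster; the table name is a placeholder.

```python
from pyspark.sql import SparkSession

# With databricks-connect configured (workspace URL, token, cluster ID),
# this session is bound to the hosted cluster, not a local Spark install.
spark = SparkSession.builder.getOrCreate()

# The query below executes on the remote cluster; only the results
# come back to your laptop. "samples.trips" is a hypothetical table.
df = spark.table("samples.trips").groupBy("pickup_zip").count()
df.show()
```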
It really does just sound like a competitor in the space of MLOps, data engineering, and all that stuff.
Yeah, if you want to deploy stuff to their hosted clusters, you can do that.
Can you also deploy it to your own cloud provider: AWS, GCP, Azure?
So how Databricks works is that they're actually using one of those three that you mentioned. Their servers run on either AWS, Azure, or GCP, and when you sign up, you choose one of those. They're just providing the layer on top.
And when you choose one at registration, you connect it to your own account and it spins up these services in your cloud?
Yes.
Can you walk me through the decision steps you might take toward choosing one of these MLOps products over another? Say we've got Kubeflow, SageMaker, GCP or Azure's own services, Databricks, Snowflake, and so on. The reason I'm reaching out to Raybeam through you is your exposure to the options in this space, and it seems like there's a strength in Databricks, but also familiarity with various competing solutions.
I'm a SageMaker guy. In the last interview, with Dirk, his specialty was Kubeflow, for the generality and universality of it: it can be deployed on various cloud providers depending on the client. And it seems that Raybeam has a broader territory of tools they use for different clients, a big one being Databricks.
How might you decide which tool to use given different circumstances?
Okay, that's a really good question. When we go to a client, often the client may not know what big issues they have, and we don't know either, so we actually go through a pretty lengthy exploration process.
We find out what kind of troubles they have in their day-to-day operations and what their biggest needs are, and we figure out where the biggest weaknesses in their current process lie. Then we figure out what kind of new process we can bring that achieves their biggest goals.
Say they're missing a piece of infrastructure on the compute side; then maybe we'll add a tool that fits their needs there. Maybe they're missing something on the data and analytics side; we'll add a tool there. So it's a very dynamic thing. We explore the needs of every client and go from there.
Do you have any example decisions that might lean one way or another for the selection of some product?
Yeah, sure.
Say a client has a lot of expertise in a particular product and their processes using that product are already going very well; then we're definitely going to keep that product and those processes. Versus, for example, if you're running on a Spark cluster that your company created yourself, and it's kind of old and you're having problems scaling that infrastructure, then we're definitely going to say, let's help you migrate to a cloud infrastructure that solves those scalability problems for you. This is something our current client had trouble with: they had an on-premises cluster that couldn't scale to the size of their models. Databricks helped solve that, because before you execute anything, you can choose a cluster that fits the size of the model, instead of having to tweak the model's code to fit the size of the cluster.
And if you're up for it, can you give us a quick rundown of Databricks from beginning to end: some of the tools involved, the various phases like data lake, data warehouse, feature engineering, and so on? What does a day in the life with Databricks look like?
Yeah, sure. I'm actually not as experienced in the machine learning flow on Databricks; so far I've been dealing with a lot of ETL and data pipelines, so I'll focus on that area. When you log into Databricks, the first thing you have to do is create a cluster: you choose the number of CPUs, the amount of RAM, how many workers you want, et cetera. After that you can go into the notebook interface, which is very similar to JupyterHub or anything else with notebooks, where you write some Python code and execute it on your data.
They actually have a tutorial notebook as well, where you get to use SQL to access your tables and so on. It's similar to Snowflake in that regard: you have an area where you can see your database and tables, and an area where you type in SQL to execute. But what I found cool with Databricks is that your notebooks are a little more permanent.
You create your notebooks, you edit them, you run them, but you can also, through their UI, commit your notebook to a git repo. If you have a git repository created somewhere, you can version-control your notebooks, which is kind of cool, so you don't accidentally lose your notebooks. I've done that with my own JupyterHub notebooks plenty of times, where I accidentally deleted something and didn't notice until later. So having that git integration is super useful.
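As a rough sketch of the notebook workflow Ming describes, assuming the preconfigured `spark` session that Databricks notebooks provide; the table and columns are made up:

```python
# In a Databricks notebook, `spark` is predefined, and you can mix
# SQL and Python in the same workflow.
df = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM sales.orders
    GROUP BY customer_id
""")

# Continue in Python on the SQL result.
df.orderBy("total_spend", ascending=False).show(10)
```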
And then on the data side, does it integrate with whatever services you might already be using, like RDS or S3, or whatever the equivalent bucket is for your cloud provider?
Yeah, you can definitely access data from S3, from Snowflake, wherever you have it stored.
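For example, a hedged sketch of both paths; the bucket path, credentials, and table names are placeholders, and the Snowflake option names may vary by connector version:

```python
# Reading Parquet straight out of S3 with the cluster's Spark session.
events = spark.read.parquet("s3://my-bucket/events/2021/")

# Reading a Snowflake table via the Spark-Snowflake connector.
sf = (spark.read.format("snowflake")
      .option("sfUrl", "myaccount.snowflakecomputing.com")
      .option("sfUser", "ANALYST")
      .option("sfPassword", "...")      # use a secrets manager in practice
      .option("sfDatabase", "PROD")
      .option("sfSchema", "PUBLIC")
      .option("sfWarehouse", "COMPUTE_WH")
      .option("dbtable", "EVENTS")
      .load())
```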
Okay. Coming into this interview, I thought it was quite different from the current MLOps solutions, but it sounds quite similar. It sounds like there's simply a slew of competitors in this space, and it's sort of pick your poison, in the same way that there are various IDEs and web UI frameworks, or whatever the case.
Yeah, and I think that's the general picture. So probably the best deciding factor is where you're coming from. Are you already using a cloud provider? Is the client already using some toolset tangential to, or even completely part of, one of the tools you might be selecting from? That's always the starting point.
And then if you have a blank slate, where the client is just getting started, they haven't even collected data or anything like that, they just know they want to start a project: how might you go about deciding? Do you have a favorite tool, or would you take a different angle in deciding how to set up their stack?
So Databricks is actually a fairly well-rounded tool. It doesn't necessarily do everything the best out of all the tools, but it has a lot of features, it's generally easy to use, and it's good enough at everything. So I think it's actually a very good tool that I would recommend to a person or company that's just getting started in the data space and the ML space. It's got you covered in both data and ML, and we haven't found a feature that's really lacking. Whatever you want to do, it's there. So I would recommend Databricks to someone new; the user experience in general is pretty good.
What about pricing? SageMaker's pricing on AWS is tied completely to the underlying services you're using; there's no added cost. How does Databricks fit into this scenario?
It's pretty interesting, because Databricks gives you a choice of which cloud to run on, and each of the clouds is a little different on pricing. Say you chose Databricks on AWS versus Databricks on GCP: the pricing follows what those cloud providers charge, but on top of that, Databricks charges a small additional fee for usage of their tool. So it's hard to say exactly how much you'll be charged; it depends on which cloud you're using. But what they charge on top is fairly small.
I've seen people save a lot of money after migrating from Snowflake to Databricks.
Is it conducive to single developers or startups, or is the pricing more enterprise-focused?
Yeah, I think pricing for startups and single developers is not going to be a problem. Depending on how much data you have, if you're just starting out as a single developer, you might not really need a cloud platform at all.
But they do make things fairly easy to start out with if you're an individual, so I don't think it's a bad choice to start with Databricks, even as an individual.
Cool. Have you used Kubeflow before? Do you know how Databricks compares to it in complexity?
I haven't used Kubeflow, but I know that Databricks provides a hosted solution for MLflow, so when you're interacting with Databricks, you can use MLflow.
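For a flavor of what that looks like, here's a minimal MLflow tracking sketch; on Databricks the tracking server is preconfigured, and the parameter and metric values here are made up:

```python
import mlflow

# Log a run's parameters and metrics to the tracking server.
with mlflow.start_run():
    mlflow.log_param("model", "random_forest")
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("rmse", 0.42)
```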
Well, we're not that far into the episode. Do you have anything else to add on Databricks?
Let me think about this. Yeah, I think so. Databricks offers something called Delta Lake, which is basically a layer on top of Parquet files. Delta Lake is open source, but they provide a hosted solution for it as well.
When you have Parquet files and a lot of data, one of the inconveniences with Parquet is that you can't edit or delete rows; you can only insert. So if you want to change something, you have to write a whole new Parquet file with that row changed, and that's a bit annoying. What Delta Lake does is keep track of a version history of Parquet files.
So to the user, it looks like: oh, I now have edit and delete functionality on my Parquet files. So that's pretty cool.
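A minimal sketch of those row-level edits, assuming the delta-spark package and an existing Spark session; the path and predicates are hypothetical:

```python
from delta.tables import DeltaTable

table = DeltaTable.forPath(spark, "s3://my-bucket/delta/users")

# Row-level operations that plain Parquet can't do in place.
table.update(condition="id = 42", set={"email": "'new@example.com'"})
table.delete("is_inactive = true")

# Delta writes new Parquet files plus a transaction log, so you can
# also read back an earlier version ("time travel").
old = (spark.read.format("delta")
       .option("versionAsOf", 0)
       .load("s3://my-bucket/delta/users"))
```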
Yeah, I'll justify that a little from my own experience. We talk about this journal app a bunch in this podcast series, and I do embeddings of users' entries using Sentence Transformers.
Hugging Face's sentence-transformers takes a document and turns it into a 768-dimension vector, which I currently store in RDS as a stringified array of doubles in square brackets. Bad idea. This thing wants to be stored somewhere else, in a feature store or as NumPy files on S3, and the management of those files, just as you indicated, doesn't work the way a transactional database does, where you can just update or delete a row. That can cause performance issues.
It can also cause code complexity: even if there aren't necessarily performance issues at stake, keeping all that in mind means extra code. So you're saying that Delta Lake abstracts that process away for you, and you can treat your large storage files as if they were a transactional database?
Yes, exactly.
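As a hypothetical sketch of the alternative Tyler describes: encode entries with the sentence-transformers library and store the 768-dimension vectors in Parquet instead of stringified arrays in a relational database. The model name and data are illustrative.

```python
import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")  # produces 768-dim vectors
entries = ["Slept well, went for a run.", "Stressful day at work."]
vectors = model.encode(entries)                   # shape (2, 768)

# A columnar file keeps the vectors bulk-scannable, instead of parsing
# stringified arrays back out of a transactional database.
df = pd.DataFrame({"entry": entries, "embedding": list(vectors)})
df.to_parquet("embeddings.parquet")
```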
And what is a Parquet file? We haven't talked much about that on this series, so this will be a good opportunity to dive into it.
Okay. If you imagine a CSV, for example, you're storing each record as a row in that text file. A Parquet file is similar, except the data is laid out column by column instead of row by row. It's kind of like a flipped CSV, if you think about it that way: flipped diagonally, transposed. And that gains you a lot of efficiency when you're querying a lot of data. So it's basically a slightly different format from CSV, one that's more optimized for big data.
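A toy illustration of that row-versus-column layout, assuming pandas with pyarrow installed; the data is made up:

```python
import pandas as pd

df = pd.DataFrame({"user": ["a", "b", "c"], "amount": [10.0, 25.5, 7.2]})
df.to_csv("sales.csv", index=False)  # laid out row by row
df.to_parquet("sales.parquet")       # laid out column by column

# Reading one column from Parquet touches only that column's bytes;
# a CSV has to be parsed row by row regardless.
amounts = pd.read_parquet("sales.parquet", columns=["amount"])
```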
Okay, yeah.
I always thought of it as optimized for data analytics, because in analytics you're going to be running queries and aggregate functions over columns rather than rows. So if you want the max of some feature or column, the way Parquet is stored on disk, as opposed to a CSV, lets you quickly scan for aggregation queries.
So, good for analytics. But I always wondered, does it have a place in machine learning? Because the way you work with machine learning code is kind of row-centric, a little more traditional. But at the very least, Parquet seems to be a standard, or the standard, file format of data analytics
pipelines, in tools like Databricks and SageMaker and all that. It's very popular. I know there's a tool called PyArrow that lets you scan over your S3 keys, folders basically, within the bucket, looking for certain entries as if you were running SQL queries.
But it's not using AWS Athena; it's not actually a SQL query engine. It's just a file seeker, and it works best when Parquet is being used. Something about that file format makes it really easy to work with and navigate. It's just a Python library for working with files stored on disk.
So it sounds like Delta Lake does all that for you; you wouldn't use a library, it's all handled out of the box. Even on the SageMaker side, there's this thing called AWS Athena that lets you query file data in S3 as if you were running a SQL query, but you can't update files in place.
It's all read-only. So PyArrow is just a Python library that lets you work with files on disk; it sounds like a library version of Delta Lake, basically. It lets you navigate your files on disk and make updates as necessary. Not quite SQL, but close to it. And from what I understand, Parquet is very much a first-class citizen there.
They really like Parquet a lot.
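A rough sketch of the PyArrow usage Tyler describes, scanning a directory or S3 prefix of Parquet files with a filter and no SQL engine involved; paths and column names are hypothetical:

```python
import pyarrow.dataset as ds

# Treat a folder of Parquet files (local or s3://...) as one dataset.
dataset = ds.dataset("s3://my-bucket/events/", format="parquet")

# Column pruning and predicate pushdown, no query engine required.
table = dataset.to_table(
    columns=["user_id", "amount"],
    filter=ds.field("amount") > 100,
)
df = table.to_pandas()
```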
With a lot of these Python packages, there's a chance that the implementations of some of the tools we're talking about are using PyArrow already; probably a very good chance.
Well, I like to wrap up these episodes, per the Ship It podcast's format, with a pick. I like to ask you about something you're really into lately: a book, a hobby, or some new discovery you're tinkering with.
Can I talk about two?
Yeah, of course.
So before working at Raybeam, I actually had a fair bit of time to write my own project, because I was very interested in the stock market and, at the same time, I like programming. So what I did was attempt to write a framework for automated trading.
I connected to this service called Alpaca, which provides a stock trading API, and I would download stock data, try to analyze the peaks and valleys and price fluctuations, and then try to make trades. Unfortunately, I wasn't able to get to a solution that actually made money consistently, but that was kind of a cool experience.
No success, huh? Well, it's funny: I started this podcast series in, I think, 2016, and about halfway through, the pet project of the podcast, before it eventually switched to the journal app, was a Bitcoin trading bot using deep reinforcement learning with a framework called TensorForce. And sure enough, no success.
No success, yeah. That would be pretty interesting to explore again sometime in the future.
Do you have a repository on GitHub or anything people can take a look at?
No, unfortunately it wasn't polished enough to really publish.
And then number two: the other thing I've been interested in is drones.
Just building a small drone, putting a Raspberry Pi on it, and writing some Python programs on the Raspberry Pi to control where the drone flies. That's been a very cool experience too, especially because once you write some code, you actually see the thing do stuff.
Physical, yeah. This isn't just a script. One thing I like about web app programming is that you see results on the screen, but this is a physical device in the world
that you just programmed.
Yeah. And the risks are real too, because if you have a bug in your code, your drone might run into a tree.
Have you gotten VR all wired up to it yet?
No, but that'd be cool, to be able to see everything it sees while it's flying. That's something I really want to do.
Let me know if you get that all wired up. I saw some people doing that at the Cliffs of Moher in Ireland, and they got to explore areas no tourists get to explore, all visually.
It looks so cool. All right, Ming, this has been wonderful. Thanks for joining.
Yeah, it was great chatting with you.