MLA 020 Kubeflow and ML Pipeline Orchestration on Kubernetes

Jan 28, 2022

Machine learning pipeline orchestration tools, such as SageMaker and Kubeflow, streamline the end-to-end process of data ingestion, model training, deployment, and monitoring, with Kubeflow providing an open-source, cross-cloud platform built atop Kubernetes. Organizations typically choose between cloud-native managed services and open-source solutions based on required flexibility, scalability, integration with existing cloud environments, and vendor lock-in considerations.

Resources
Designing Machine Learning Systems
Machine Learning Engineering for Production Specialization
Data Science on AWS: Implementing End-to-End, Continuous AI and Machine Learning Pipelines
Show Notes

Dirk-Jan Verdoorn - Data Scientist at Dept Agency

Managed vs. Open-Source ML Pipeline Orchestration

  • Cloud providers such as AWS, Google Cloud, and Azure offer managed machine learning orchestration solutions, including SageMaker (AWS) and Vertex AI (GCP).
  • Managed services provide integrated environments that are easier to set up and operate but often result in vendor lock-in, limiting portability across cloud platforms.
  • Open-source tools like Kubeflow extend Kubernetes to support end-to-end machine learning pipelines, enabling portability across AWS, GCP, Azure, or on-premises environments.

Introduction to Kubeflow

  • Kubeflow is an open-source project aimed at making machine learning workflow deployment on Kubernetes simple, portable, and scalable.
  • Kubeflow enables data scientists and ML engineers to build, orchestrate, and monitor pipelines using popular frameworks such as TensorFlow, scikit-learn, and PyTorch.
  • Kubeflow can integrate with TensorFlow Extended (TFX) for complete end-to-end ML pipelines, covering data ingestion, preprocessing, model training, evaluation, and deployment.
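
For a concrete sense of what this looks like in code, here is a minimal sketch using the Kubeflow Pipelines (KFP) v1 Python SDK, the SDK generation current around the time of this episode. The component bodies, base image, and bucket URIs are hypothetical placeholders, not anything prescribed in the episode:

```python
# Minimal Kubeflow Pipelines sketch (KFP v1 SDK). Each function becomes a
# containerized step; the compiler emits a spec you upload to a Kubeflow cluster.
import kfp
from kfp import dsl
from kfp.components import create_component_from_func

def ingest_data() -> str:
    # Placeholder: in practice this would pull from S3/GCS/BigQuery.
    return "gs://example-bucket/raw.csv"  # hypothetical URI

def train_model(data_uri: str) -> str:
    # Placeholder training step; returns where the model artifact lands.
    print(f"training on {data_uri}")
    return "gs://example-bucket/model"  # hypothetical URI

# Wrap the plain functions as pipeline components, each running in its own image.
ingest_op = create_component_from_func(ingest_data, base_image="python:3.9")
train_op = create_component_from_func(train_model, base_image="python:3.9")

@dsl.pipeline(name="demo-training-pipeline",
              description="Ingest data, then train a model on it.")
def demo_pipeline():
    ingested = ingest_op()
    train_op(data_uri=ingested.output)  # dependency inferred from the output

if __name__ == "__main__":
    # Compile to a spec that any Kubeflow cluster (EKS, GKE, AKS, local) can run.
    kfp.compiler.Compiler().compile(demo_pipeline, "demo_pipeline.yaml")
```

The compiled YAML is portable: the same artifact can be uploaded to a Kubeflow deployment on any of the clouds mentioned above, which is the cross-cloud point the episode emphasizes.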

Machine Learning Pipelines: Concepts and Motivation

  • Production machine learning systems involve not just model training but also complex pipelines for data ingestion, feature engineering, validation, retraining, and monitoring.
  • Pipelines automate retraining based on model performance drift or updated data, supporting continuous improvement and adaptation to changing data patterns.
  • Scalable, orchestrated pipelines reduce manual overhead, improve reproducibility, and ensure that models remain accurate as underlying business conditions evolve.
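
To make the retraining trigger concrete, below is a hedged sketch in plain Python of the kind of drift check an orchestrator might schedule between runs. The drift score, threshold, and function names are illustrative assumptions, not any particular library's API:

```python
# Sketch of a scheduled drift check that decides whether to kick off retraining.
import numpy as np

def population_drift(train_feature: np.ndarray, live_feature: np.ndarray) -> float:
    """Crude drift score: shift in mean, scaled by the training set's spread."""
    return abs(live_feature.mean() - train_feature.mean()) / (train_feature.std() + 1e-9)

def should_retrain(train_feature, live_feature, threshold: float = 0.5) -> bool:
    # Threshold is an illustrative assumption; in practice it is tuned per feature.
    return population_drift(np.asarray(train_feature), np.asarray(live_feature)) > threshold

# Example: behavior shifted (e.g., pre- vs. post-Covid purchase patterns).
rng = np.random.default_rng(0)
train_f = rng.normal(loc=0.0, scale=1.0, size=10_000)  # data the model saw
live_f = rng.normal(loc=1.2, scale=1.0, size=10_000)   # what production sees now

if should_retrain(train_f, live_f):
    print("Drift detected: trigger the retraining pipeline from the top.")
```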

Pipeline Orchestration Analogies and Advantages

  • ML pipeline orchestration tools fulfill a role similar to continuous integration and continuous deployment (CI/CD) in traditional software engineering.
  • Pipelines enable automated retraining, modularization of pipeline steps (such as ingestion, feature transformation, and deployment), and robust monitoring.
  • Adopting pipeline orchestrators, rather than maintaining standalone models, helps organizations handle multiple models and varied business use cases efficiently.

Choosing Between Managed and Open-Source Solutions

  • Managed services (e.g., SageMaker, Vertex AI) offer streamlined user experiences and seamless integration but restrict cross-cloud flexibility.
  • Kubeflow, as an open-source platform on Kubernetes, enables cross-platform deployment, integration with multiple ML frameworks, and minimizes dependency on a single cloud provider.
  • The complexity of Kubernetes and Kubeflow setup is offset by significant flexibility and community-driven improvements.

Cross-Cloud and Local Development

  • Kubeflow operates on any Kubernetes environment, including AWS EKS, GCP GKE, and Azure AKS, as well as on-premises or local clusters.
  • Local and cross-cloud development are facilitated in Kubeflow, while managed services like SageMaker and Vertex AI are better suited to cloud-native workflows.
  • Debugging and development workflows can be challenging in highly secured cloud environments; Kubeflow’s local deployment flexibility addresses these hurdles.
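
As a small illustration of that local-development flexibility, here is a sketch, assuming the KFP v1 SDK and a cluster whose Kubeflow Pipelines API has been port-forwarded to localhost (the endpoint below is hypothetical), of submitting a run from a laptop:

```python
# Sketch: submitting a pipeline run to a Kubeflow instance reachable locally.
import kfp
from kfp import dsl
from kfp.components import create_component_from_func

def say_hello() -> str:
    return "hello from the cluster"

hello_op = create_component_from_func(say_hello, base_image="python:3.9")

@dsl.pipeline(name="hello-pipeline")
def hello_pipeline():
    hello_op()

# Point the client at wherever the KFP API is reachable; with a local or
# remote cluster this is typically a port-forward (hypothetical endpoint).
client = kfp.Client(host="http://localhost:8080")
client.create_run_from_pipeline_func(hello_pipeline, arguments={})
```

The same script works whether the cluster behind the endpoint is minikube on the laptop or a managed cluster in any cloud, which is the portability argument made above.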

Relationship to TensorFlow Extended (TFX) and Machine Learning Frameworks

  • TensorFlow Extended (TFX) is an end-to-end platform for creating production ML pipelines, tightly integrated with Kubeflow for deployment and execution.
  • While Kubeflow originally focused on TensorFlow, it has grown to support PyTorch, scikit-learn, and other major ML frameworks, offering wider applicability.
  • TFX provides modular pipeline components (data ingestion, transformation, validation, model training, evaluation, and deployment) that execute within Kubeflow’s orchestration platform.
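
The following is a condensed sketch of that component chain in the TFX v1 Python API, loosely following the layout of the standard TFX tutorials. The data paths and the trainer module file are placeholders, and the local runner at the end stands in for the Kubeflow runner a cluster deployment would use:

```python
# Condensed TFX pipeline sketch: the modular components named above,
# chained by passing each component's outputs into the next.
from tfx import v1 as tfx

def build_pipeline(data_root: str, module_file: str, serving_dir: str,
                   pipeline_root: str) -> tfx.dsl.Pipeline:
    example_gen = tfx.components.CsvExampleGen(input_base=data_root)      # ingestion
    statistics_gen = tfx.components.StatisticsGen(
        examples=example_gen.outputs["examples"])                         # data stats
    schema_gen = tfx.components.SchemaGen(
        statistics=statistics_gen.outputs["statistics"])                  # schema inference
    trainer = tfx.components.Trainer(                                     # training
        module_file=module_file,  # user-provided file defining run_fn()
        examples=example_gen.outputs["examples"],
        train_args=tfx.proto.TrainArgs(num_steps=100),
        eval_args=tfx.proto.EvalArgs(num_steps=10))
    pusher = tfx.components.Pusher(                                       # deployment
        model=trainer.outputs["model"],
        push_destination=tfx.proto.PushDestination(
            filesystem=tfx.proto.PushDestination.Filesystem(
                base_directory=serving_dir)))
    return tfx.dsl.Pipeline(
        pipeline_name="demo-tfx-pipeline",
        pipeline_root=pipeline_root,
        components=[example_gen, statistics_gen, schema_gen, trainer, pusher])

# Local execution for development; on a cluster you would hand the same
# pipeline object to a Kubeflow runner instead.
tfx.orchestration.LocalDagRunner().run(
    build_pipeline("data/", "trainer_module.py", "serving/", "pipeline_root/"))
```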

Alternative Pipeline Orchestration Tools

  • Airflow is a general-purpose workflow orchestrator built around DAGs (directed acyclic graphs); it suits data engineering and automation, but it does not itself supply the compute for heavy ML training within the pipeline (see the sketch after this list).
    • Airflow typically submits resource-intensive workloads to external compute services (e.g., AI Platform) rather than running them on its own workers.
    • In organizations using both Kubeflow and Airflow, Airflow may handle data workflows, while Kubeflow is reserved for ML pipelines.
  • MLflow and other solutions also exist, each with unique integrations and strengths; their adoption depends on use case requirements.
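
The sketch referenced above illustrates the division of labor: a minimal Airflow 2.x DAG whose heavy step only submits work to external compute. The submit function here is a hypothetical stand-in for a real AI Platform or SageMaker job submission:

```python
# Airflow DAG sketch: lightweight steps run on Airflow workers; the heavy
# training step only *submits* a job to external compute and returns.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def export_data():
    print("Exporting tables to the data lake...")  # light, runs in Airflow

def submit_training_job():
    # Stand-in for submitting to AI Platform / SageMaker / a Kubernetes job;
    # Airflow itself never hosts the GPUs or the large training workload.
    print("Submitting training job to external compute...")

with DAG(
    dag_id="ml_data_and_training",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    export = PythonOperator(task_id="export_data", python_callable=export_data)
    train = PythonOperator(task_id="submit_training",
                           python_callable=submit_training_job)
    export >> train  # DAG edge: export must finish before training is submitted
```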

Selecting a Cloud Platform and Orchestration Approach

  • The optimal choice of cloud platform and orchestration tool is typically guided by client needs, existing integrations (e.g., organizational use of Google or Microsoft solutions), and team expertise.
  • Agencies with diverse client portfolios often benefit from open-source, cross-cloud tools like Kubeflow to maximize flexibility and knowledge sharing across projects.
  • Users entrenched in a single cloud provider may prefer managed offerings for ease of use and integration, while those prioritizing portability and flexibility often choose open-source solutions.

Cost Optimization in Model Training

  • Both AWS and GCP offer cost-saving compute options for training, such as spot instances (AWS) and preemptible instances (GCP), which are suitable for non-production, batch training jobs.
  • Production workloads that require high uptime and reliability do not typically utilize cost-saving transient compute resources, as these can be interrupted.
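
For concreteness, here is what opting into transient capacity can look like with managed spot training in the SageMaker Python SDK; the image URI, role, instance type, and bucket paths are placeholders, and the same idea maps onto GCP's preemptible VMs:

```python
# Managed spot training with the SageMaker Python SDK: checkpointing lets the
# job resume if the spot capacity is reclaimed mid-training.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/train:latest",  # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerRole",                    # placeholder
    instance_count=1,
    instance_type="ml.g4dn.2xlarge",   # GPU instance for the training job
    use_spot_instances=True,           # opt into the cheap, interruptible tier
    max_run=3600,                      # cap on actual training seconds
    max_wait=7200,                     # cap on training plus waiting for capacity
    checkpoint_s3_uri="s3://example-bucket/checkpoints/",  # placeholder resume point
)
estimator.fit("s3://example-bucket/training-data/")
```

Checkpointing is what makes interruption tolerable: if the spot capacity is reclaimed, the job can resume from the last checkpoint rather than restarting from scratch.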

Machine Learning Project Lifecycle Overview

  • Project initiation begins with data discovery and validation of the client’s requirements against available data.
  • Cloud environment selection is influenced by client infrastructure, business applications, and platform integrations rather than solely by technical features.
  • Data cleaning, exploratory analysis, model prototyping, advanced model refinement, and deployment are handled collaboratively with data engineering and machine learning teams.
  • The pipeline is gradually constructed in modular steps, facilitating scalable, automated retraining and integration with business applications.
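
Echoing the baseline-first prototyping described in the transcript, here is a sketch of that step with scikit-learn; the synthetic dataset is a placeholder for a client's cleaned data:

```python
# Baseline-first prototyping: a quick random forest gives a reference score
# before any advanced model (or the pipeline around it) is built.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the client's cleaned dataset.
X, y = make_classification(n_samples=5_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

baseline = RandomForestClassifier(n_estimators=100, random_state=42)
baseline.fit(X_train, y_train)

print(f"baseline accuracy: {accuracy_score(y_test, baseline.predict(X_test)):.3f}")
# If the baseline falls short, iterate on the data first, then move on to a
# more advanced model, building the pipeline around it in parallel.
```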

Educational Pathways for Data Science and Machine Learning Careers

  • Advanced mathematics or statistics education provides a strong foundation for work in data science and machine learning.
  • Master’s degrees in data science add the most value for candidates from non-technical undergraduate backgrounds; those with backgrounds in statistics, mathematics, or computer science may benefit more from self-study or targeted upskilling.
  • When evaluating online or accelerated degree programs, candidates should scrutinize the curriculum, instructor engagement, and peer interaction to ensure comprehensive learning.

Transcript
Welcome back to Machine Learning Applied. In this episode, I'm talking to Dirk from Dept Agency about Kubeflow, K-U-B-E-F-L-O-W. It is an extension on Kubernetes for machine learning pipeline orchestration. In the last few episodes, we talked about SageMaker for accomplishing the same thing, and then in the DevOps episode we talked about general DevOps deployment of your web stack to the cloud. And if you'll recall, there are solutions for hosting web stacks in the cloud, like AWS ECS, and then there's the open source equivalent to that called Kubernetes, and an AWS service called EKS, or Elastic Kubernetes Service, that allows you to use Kubernetes on AWS. So you can either use ECS or EKS. ECS is going to be using AWS as a managed service for hosting your Docker containers, so it's going to be easier to use. EKS, or Elastic Kubernetes Service, is going to allow you to host your Kubernetes cluster on AWS. The benefit of Kubernetes is that it is open source and cross-platform: you can host your Kubernetes cluster on AWS, GCP, or Azure. Setting it up in those various providers is going to be a bit different, so it's not a turnkey, one-size-fits-all thing, but it's going to be more cross-cloud compatible than if you had set up your cluster on ECS. So AWS ECS, their managed container hosting solution, is easier, but there's vendor lock-in; and EKS, their managed Kubernetes solution, is harder, more complex for certain, as you'll recall from the discussion with Jirawat, but it is open source and generally cross-cloud compatible, and you can use it on localhost or on-prem. The equivalent in the machine learning space is Amazon SageMaker for their hosted machine learning offerings, versus Kubeflow, an extension on top of Kubernetes, for the open source, cross-cloud compatible version of the same. And before we get into this interview, I just want to talk about what these pipeline orchestration tools are, because SageMaker is not just about hosting a machine learning model. That's the tip of the iceberg. It's a very useful solution: if you want to get your machine learning model in the cloud, it is easier to use SageMaker model endpoints or batch inference jobs than it is to try to spin up your own hosted machine learning model solution, even on ECS or EC2 or anything like that. So even just deploying your machine learning model via SageMaker is going to be simpler than rolling your own. But the main value of SageMaker is not just hosting your model; it's all the other stuff that comes as part of the package, what we call pipeline orchestration. When you're training your model, you have your data. It comes from somewhere like AWS S3, or their relational database service RDS, or DynamoDB, or something. And then that gets piped into a feature transformation step, and then that gets piped into the training step, the training phase of this pipeline. That training phase might be splitting out your data, or splitting out your compute instances into multiple instances, so that the training can happen in parallel and then rejoin the progress downstream as the next step in the pipeline. Do some checks, some bias checks, some drift detection, and then deploy your model to the cloud. And then additionally, as part of these pipeline orchestration tools, there is a continuous monitoring aspect where, if something goes wrong in the model pipeline, it might trigger a retraining job by rerunning the entire pipeline from the beginning.
Or let's say that new data comes in over time. The users have interacted with your website for a while. Let's say you have a matchmaking service for music or videos, and they thumbs-up and thumbs-down certain things, and so you want to retrain, not because anything went wrong, but because you're a little bit out of date. So these machine learning pipeline orchestration tools like SageMaker or Kubeflow are the end goal of creating a production-ready, deployed, end-to-end machine learning solution: not just the machine learning model in the cloud, but the entire process of what eventually goes into the trained and deployed inference model in the cloud. And so all those steps of the pipeline are important. Everything we've talked about with SageMaker: SageMaker has these tools for stringing together the steps of this pipeline, this pipeline orchestration. One will be ingesting the data, and one will be data engineering, so feature engineering and imputation and all that. Another will be training, including parallelization and so on. SageMaker is AWS's hosted offering. It's easy to use, but you're locked into AWS. So today we're talking about Kubeflow, which is an extension on top of Kubernetes. It's effectively the open source version of SageMaker. It is universal, so it can be hosted on all the different cloud providers. And again, per the discussion with Jirawat, the downside is it's going to be more complex to manage. So that's an introduction to the topic, and let's dive right in. Alright, welcome back to the show, everybody. Today we have Dirk from Dept proper, and Dirk, go ahead and introduce yourself. Yeah, thank you, Tyler. Hi, my name is Dirk-Jan, and I'm part of the Dutch data science team at Dept. I help our clients with building smart solutions to discover valuable insights in their data and to solve critical business challenges, both with machine learning as well as more low-level data analysis. And I work a lot on the machine learning engineering part, and that's also what we're going to talk about today. Fantastic. In the past few episodes, we've been talking a lot of DevOps, with a goal of achieving MLOps: getting your machine learning model into the cloud. And in this journey of MLOps, I've realized it's not just a matter of getting your model in the cloud as some REST endpoint, which SageMaker can accomplish very easily; a lot of times there's a pipeline you want to build out. So you've got your training phase that will periodically ingest the data, maybe shard that out to multiple nodes, split the data up, crunch it a certain way, bring it back together, pipe it through the machine learning model. And so there are all these different pipelining tools. I'm familiar with SageMaker; I've presented that on the show. In the last episode we talked a little bit about DevOps, a little bit about Kubernetes in DevOps. And today your specialty in MLOps is Kubeflow. Yeah, that's totally right. And it completely touches upon Kubernetes as well, as it's basically an orchestrator for pipelines on Kubernetes. So talk to us about pipelines in general. I have introduced the topic to my listeners, but I don't know that I've gone sufficiently deep in motivating the entire process. And then eventually we will get into what the options are out there, and why Kubeflow. Yeah, definitely.
So the first thing that's good to know is: often when you have these toy examples of machine learning problems, or any data problem, it's very much focused on the solution itself, training a model to predict something and then using that output to optimize business processes or decision making. However, what we see at our clients mainly is that a lot of our clients have different business problems for which we develop different models, and you get to a point where you have so many models in place that knowing how well a model is still performing, and retraining that model, is going to take a lot of time. You basically can't just spend time on retraining and checking if a model is still properly working if there are also new use cases that you're working on for that client. So at that point, you're kind of forced to move away from standalone models and hop on the machine learning pipeline train, because that really helps you to, first of all, keep track of the performance of the model. It helps you to automatically retrain models, and it just makes your life much easier, as it takes a lot of time away while keeping the quality of the models up. It almost sounds like what continuous integration and continuous deployment, CI/CD, is for the software world, machine learning pipelines are for the machine learning deployment world. Does that sound right? Yeah, good analogy. That's a perfect analogy. I think that if you look at the data science field in general, it's moving much more towards continuous integration and continuous deployment of models. Of course, the big companies like Google and Facebook have already done this for a long time. That's also why they now have these frameworks and these platforms that basically allow you as a data scientist or a machine learning engineer to do that as well, and they have already invested a lot of money in those solutions. But for the smaller companies, especially in the beginning, it was more focused around building a model and solving a business problem, and sometimes it kind of lacked the continuous development, deployment, and integration of the machine learning solutions. Got it. I think a lot of companies now start to realize that machine learning pipelines are just as important as building a model, because a model without a proper pipeline behind it is eventually not going to perform as well as when it was just developed. Data changes over time, especially now, for example, with Covid. When you have these models that predict customer behavior, or that take customer behavior into account to predict something else, that retraining part is more important than ever, because customer behavior really changed during that period. So you can imagine, when you use a model that was trained on data from before that period, it's not going to perform as well during or after Covid, because behavior totally changed. So these pipelines keep up with the times, with the change in data. Do you find yourself... I know that these frameworks... what are we going to call these frameworks? Is it right to call them pipeline frameworks, or... I think the proper term would be pipeline orchestrators.
Do you find yourself using these exclusively, or do you still do your model development on localhost, just in a Dockerfile, TensorFlow, whatever? It kind of depends. So especially with clients that we have worked with longer, we now move towards developing from a pipeline perspective. And of course, the development of the model, the testing of the model, and the first version of the evaluation of the model can still happen outside of that pipeline, but everything around it and the structure is basically set up to eventually be used in a pipeline kind of architecture. And then we also have these clients where we just started building solutions, and there we sometimes see that there's first a need to, you know, see how machine learning or data science solutions can help improve their business. You could say they're less mature. Then we often see that pipelines are not really the way to go yet, because we first need to really discover what data science and machine learning can bring for that client. But for long-term relationships and clients that are really mature, yeah, we really move towards developing from a pipeline perspective rather than a standalone model. I like that. It's measure twice, cut once. You're kind of already in the mindset and platform, developing towards the pipeline. Then when it's time for production, you're ten steps ahead of the game. Sorry for all the diversion. Your expertise is in this area, so I'm going to just go ahead and let you drive and take us on a journey. Talk to us about Kubeflow and the various other options and all that stuff. So the reason I took Kubeflow is because that's one of the main orchestrators, or one of the pipeline-focused projects, that we at Dept chose to go with. And to maybe start off with what Kubeflow really is, I already mentioned it before: Kubeflow is basically a solution that focuses on deploying machine learning pipelines on a Kubernetes backend, and that helps make the pipeline very scalable. That's of course one of the key benefits of Kubernetes, the scalability that you have. And at the same time, Kubeflow integrates a lot of these different frameworks within it. It allows you to use either TensorFlow or scikit-learn or PyTorch, and all of those modules or packages, however you like to call them, you can very easily incorporate in a Kubeflow solution. And Kubeflow itself doesn't necessarily have to be a pipeline: Kubeflow also offers the possibility to basically submit a job on a Kubernetes cluster, so just a training job or a data processing job. But at the same time, it also allows you, using the Kubeflow Pipelines architecture, to build full-on pipelines. And that really gives Kubeflow as a whole a lot of flexibility, besides it also being open source. A lot of people develop and contribute to the framework itself, and because it's open source, a lot of these challenges that we as a company deal with, other companies deal with as well, of course. Because of that open source nature of Kubeflow, you see that development goes really fast, and every day there's something new released, which might also solve one of your problems, which otherwise you would have had to spend quite some time on to solve yourself.
In the last episode with Jirawat, we were kind of comparing Kubernetes with AWS's ECS service, Elastic Container Service. So we have the cloud-native offerings: Microsoft Azure, Google Cloud Platform, and Amazon Web Services. In our episode, we recommended AWS, if for nothing else than popularity's sake. It's just like, if you want to find a job, you're probably most likely to find a job in the AWS marketplace because of its popularity. And in SageMaker land, the way they do pipelines, well, there's a lot of overlap with Kubeflow's offerings. So SageMaker would probably be the equivalent of Kubeflow in cloud-native on AWS, and then ECS is the equivalent of Kubernetes on AWS, a container orchestrator platform, but it's all closed source. We don't have visibility into how any of this stuff works. But by that standard, it would seem that Kubernetes and Kubeflow would be preferable: why not get open source if you're going to be getting the same thing in the first place on AWS? And Jirawat's statement was that Kubernetes is quite complex, a large pill to swallow from a developer's perspective, compared to ECS. And I wonder if that's similar to your own experience, and if so, how does Kubeflow compare in that analogy to cloud-native machine learning offerings? Yeah, so maybe the first good thing to know is: even though Kubeflow was originally built on Kubernetes, it's not just limited to Google Cloud. That's also a little bit of the history of how Kubeflow originated: it comes from Google and how they internally deploy machine learning pipelines for all their services, and they built this entire TensorFlow package around it called TFX, which we will touch upon later as well. And at the same time, Kubeflow takes a lot of the complex deployment process that you have with Kubernetes away and does it for you. So especially as a machine learning engineer or a data scientist, you might not be very knowledgeable about Kubernetes or the ins and outs of how to deploy something on Kubernetes, but Kubeflow really facilitates that for you and helps you with that. At the same time, Kubeflow can be deployed on most big platforms, like AWS, Azure, and Google Cloud, and in Google Cloud you even have the option to deploy it on Kubernetes, which it originated from, or on more of an AI platform, what they now call the Vertex AI platform. So whereas when you focus, for example, on just an AWS or just an Azure, you're very limited to that platform, with Kubeflow you can reach across platforms and have the possibility and opportunity to not just stick to one platform, but also deploy on other platforms if you need to. And of course, the setup process is going to be a little bit different, but a lot of the complex stuff that comes with it is handled by Kubeflow, and I think that makes a huge argument for using Kubeflow, together with the open source nature that it has. Yeah, and I think another great argument, with it being cross-platform compatible, is localhost development. With SageMaker, one thing I come up against in developing with the cloud in mind: the benefit is, again, that measure-twice-cut-once; you're already in the mindset of working with VPCs and security groups and IAM policies before you can even run your model. But on the downside, good luck with debugging and local development.
If you need to be in a VPC, you're not really going to spin this up on your local PC. You're going to have to create a bastion host in the cloud, or a client VPN that's able to connect to the VPC, and all that stuff. So the complexity of Kubernetes in the general use case, which Jirawat and I discussed, that's one thing, but many times we're not accounting for the complexity of developing towards ECS when your development environment is localhost. So it seems like there are plenty of pros and cons on both fronts if we want to talk about the debate of complexity. And then, do you guys use Kubernetes for the general DevOps stuff, for web hosting as well? I think that kind of depends per client. I'm not really aware of that part, as I'm part of the data science team, not the web and app development team. I know there are some deployments that happen on Kubernetes, but further than that, I'm really not sure how the DevOps or development teams tackle that part. Your preferred cloud is GCP, is that correct? I wouldn't say preferred. It's more that a lot of our clients work with it. I think in Europe in general, for a long time, Google, together with Azure, has been one of the bigger cloud platforms. AWS is kind of taking over at the moment, I think. But I think it's also because we are in the marketing business, or a lot of our clients are: the solutions that we build for them are marketing related, and a lot of those clients work with Google products like Google Ads, Google Analytics, those kinds of Google-based products. And that's also the reason why they often go for Google Cloud, because it's a logical decision to make if a lot of your other systems are also developed by Google. That makes sense. And how about Azure? We do see a lot of Azure, but what we do see is that the clients that use Azure are often less marketing focused. You might say it's more business operations, financially focused solutions that we then build for them. For example, we have a client that's in the real estate development business, and we built more operational solutions for them, and they are also on Azure. That adds a lot to the puzzle. When I meet people who are trying to get started in the cloud and they say, which one should I pick, I always say AWS, just because it's most popular, so it's an easy answer, but I never have any real technical argument for any of the specific cloud providers. Yeah, I think if you're in the marketing business and you're using a lot of the Google products, it's one of the most logical steps to make from a connectivity standpoint. Of course you can get the data from Google Analytics or Google Ads to any of the other platforms, but some of the integrations are much more straightforward with Google Cloud than with, for example, Google Analytics and AWS. And then on Azure's front, obviously if you're a Microsoft shop, it's a shoo-in. But then potentially, as you mentioned, in finance, as an example, or operations: people in finance are so used to Excel, and that's a Microsoft product. It seems like an easy step towards that platform if you're coming from an environment in which Microsoft products would have made sense on the desktop. Yeah, definitely.
And also, for example, a lot of the companies that are using Microsoft Dynamics or those kinds of packages, those kinds of software, for them it would also make sense to stay with a Microsoft-related cloud. I know some companies aren't allowed to use Google; it's mainly a big thing in, I think, Switzerland. A lot of the companies there want to stay away from Google, so they're either on their own custom setup, or still on-prem, or also on Azure. Fantastic. So that could also be a reason. Does GCP have its own version of SageMaker, its own non-Kubeflow-based MLOps offering, possibly including pipeline orchestration? Yeah. So until, I think, a couple of months ago, Google Cloud had what they call AI Platform. That's where all your AI-related stuff lives: your Jupyter notebooks, your job logs, your model endpoints, all that kind of stuff. And they recently changed that to something they call Vertex AI. And with that, they give some more possibilities. Whereas in AI Platform before, you had a pipeline option, but the pipeline option mainly consisted of a Kubeflow kind of pipeline, and it was based on Kubernetes. Basically what you had to do is spin up a Kubernetes cluster, deploy Kubeflow on that, and then you could build your models on that, for example with TFX, and then when you would go to the pipeline section in AI Platform, it would actually open up the UI of Kubeflow. Now with Vertex AI, they have a little bit of a different approach. You can still run Kubeflow on it, and you can also orchestrate your pipelines, still from within Kubeflow, but all the Kubernetes stuff is handled for you. So you don't have to spin up a Kubernetes cluster or those kinds of things. So it makes it a little bit easier than the old way; it's more managed. Yeah. Is it a rebranding? Are they migrating their original offerings to this new Vertex AI solution? Is it a rebranding, or is it an alternative solution? I think it's a little bit of both. Some of the stuff from AI Platform they're rebranding right now and including only within the new Vertex AI section, under the Vertex AI brand. But, for example, the option for pipelines is still available in both AI Platform as well as the Vertex platform. And the interfaces are different too. If you go to the AI Platform pipeline interface, you basically get the Kubeflow interface. However, when you go to the pipelines interface in Vertex AI, you really get a dedicated Google Cloud pipeline interface that's developed by Google and not necessarily by Kubeflow. So potentially Vertex AI is the SageMaker equivalent on GCP, the managed machine learning offering? Yeah. Kubernetes was developed by Google, for Google; obviously it's going to have first-class support on GCP. Kubeflow is developed by Google as well. Yeah, it's developed by the TensorFlow team. Or not necessarily the TensorFlow team, but it's built around TensorFlow and TensorFlow pipelines, to facilitate those TensorFlow pipelines. Okay. What is TensorFlow pipelines, what is TFX, TensorFlow Extended? Yeah, TensorFlow Extended. TensorFlow Extended basically started within Google, and it originally ran on Kubeflow. I think Kubeflow is still the only option through which you can deploy TensorFlow Extended pipelines. And that was kind of a way for Google to basically make pipelines for all their big services.
So all the models that are behind Gmail or Google Calendar or Google Maps, all these big services in which they have a lot of machine learning models running, they all needed continuous deployment, of course, because these are huge services, and it also needed to be scalable, because you can imagine those services are used by millions of people. So that's how they developed, as an internal service, what is now known as Kubeflow. And TensorFlow itself, for a long time, was just model building: deep learning models, machine learning models, probabilistic models. With the Extended package, it reaches further than just the model part. It basically takes up the entire part before building and training a model, like cleaning up your data, transforming your data, checking for anomalies, creating a schema from your data, but even reading your data from, for example, BigQuery or a cloud bucket in Google Cloud. And also the part after the machine learning model: the evaluation of the model, checking if the retrained version of a model is performing as well as the previous version. Because, for example, you can imagine that if you do continuous training for a model using those pipelines, you end up in a situation where you retrain your model, but its performance is not necessarily better than the previous version. So that is also something you have to take into account. And finally, of course, pushing your model to either an endpoint, a location in Google Cloud Storage, or creating a BigQuery model out of it. For all those steps, they basically built modules within the TensorFlow package to facilitate them, and all those steps together create what they call a TFX pipeline, which is then deployed through Kubeflow. This sounds like there's a lot of overlap between TFX and Kubeflow. What are the differences between the two? So basically, TensorFlow Extended is just the logic part: how are you building up your model, how are you generating your schema, how are you transforming your data, how are you evaluating your model. And for all of that, they made these components, which they call components. So you have a Trainer component, and you have an Evaluator component, a StatisticsGen component, and all these components form a pipeline, and they are deployed on Kubernetes, for example, through Kubeflow. So basically the backend of that pipeline is Kubeflow, and that executes the TensorFlow Extended code. I see. And so if you're using Kubeflow, especially on GCP, you're also very likely going to be using TensorFlow Extended. If you are working with a TensorFlow model, definitely. But you're not limited to it. You could just as well use PyTorch, you could use scikit-learn, or any of the other major machine learning frameworks that are out there. But originally it was developed around TensorFlow, et cetera, because it was an internal service at Google. Is there any amount of TensorFlow Extended you could use if you're developing your model in PyTorch? Because it sounds like there are a lot of nice utilities in there. Yeah, so then you get into the compatibility issues between PyTorch and TensorFlow. Theoretically you could definitely use those, but the way PyTorch is built up and TensorFlow is built up: theoretically they're the same, but technically they're sometimes quite different.
I'm not sure if you ever tried, but try using both PyTorch and TensorFlow in the same machine learning code. It's going to be a lot of debugging, and in a lot of cases, you're going to end up using either one or the other. Yeah, actually, the podcast's kind-of toy project is this journal app, and I do have to use both those frameworks in certain cases, just based on certain packages I'm making use of. And if I am using both of them in the same app, I actually kick off background jobs so they're running in two separate Python memory spaces, because, man, they just duke it out in a Python script. They'll battle with each other. Yeah, definitely. And also the types, how the outputs and inputs are defined in both packages, are very similar, but it's sometimes those little details that ruin the compatibility. I know there are quite some cases where people do manage to combine the two, but it's a painstaking process. So let's look at where we're at. If you want to get your machine learning model hosted, you have Azure, GCP, and AWS, and a whole bunch of others, but those are the main three. Each of those three allows you to do it yourself using open source tooling: that would be Kubernetes and Kubeflow. And they also allow you to do it through their own offerings, which makes it a little bit easier, but you're locked into their offering. Those are called managed solutions. AWS's is SageMaker; GCP's is Vertex AI. Yeah. And it sounds like there's a whole bunch of other open source offerings as well. We've got Airflow, MLflow, and what else are you familiar with? And for those that you can speak to, how do they compare to each other? So what I'm familiar with is pretty limited to Airflow and Kubeflow and Google Cloud's pipelines. I also know a little bit about Azure Pipelines, not practically, but more theoretically. What I like about Kubeflow: sometimes when I have these discussions with colleagues about this, and they work, for example, in Azure Pipelines, they have a lot of these small issues, like authentication issues, compatibility issues, information being shared between different components. That's all handled by the combination of TFX and Kubeflow. So that's really why I prefer Kubeflow. But I might be quite biased, because I use a lot of TensorFlow in what I do, in the solutions I build, and I just find the combination between TFX and Kubeflow to be very good. In the beginning, it's quite a painful process. It's a very steep learning curve. You can get very frustrated during the process of trying to understand how they work together, especially in the beginning, because I remember when I started working with it, the most recent version was 0.25, and now it's version 1.4. In just a timeframe of six months, the frameworks have improved so much that a lot of those little issues have all been taken away, and you just see that there's a lot of dedicated time and effort spent on improving these services. Whereas from what I've seen with, for example, Azure, sometimes it can take quite a while before some very basic issues, which you would expect to be solved much quicker, actually get solved. And regarding Airflow, from what I've read and what I've seen so far... And, real quick, what is Airflow? Yeah.
So Airflow is basically an orchestrator that works with DAGs, and... let me look up the official term: directed acyclic graphs. Yeah. Okay. So, in a pipeline of data, each step can connect to each other step. Yeah, basically that. But the thing with Airflow is that it's not resource-heavy. When you deploy something on Kubernetes, the processing power and the resources that you can use on Kubernetes are quite big. It's scalable; it can take quite some heavy loads. The thing with Airflow, however, is that for all the computationally heavy stuff, you have to submit, for example, a job to AI Platform. So when you want to train your model with Kubeflow on Kubernetes, you can choose to submit a job to AI Platform, or you can just as well train it on the backend of Kubeflow, which is Kubernetes. You can just train it on Kubernetes. With Airflow, you don't have that option, because it's basically just an orchestrator that executes the different steps in your pipeline. When you need something that's more resource-heavy, you always have to have something like an AI Platform to spin up a machine that can take the heavy workloads, do the computations and the model training, and then send back the information. I see. Okay, I think I understand. So Kubeflow is pipelining and model orchestration on Kubernetes with machine learning in mind, and Airflow is just a general-purpose stepper. Yeah. And it just so happens you can put machine learning into that pipeline, sure, but you're just going to be using some machine learning toolkit or cloud offering, like GCP's AI Platform. Yeah. You will never be able to, unless you have a very light model and very little data. But when we're talking about heavy models, machine learning, deep learning models, for example, that take in millions of rows of data, you always have to have something other than just Airflow, because otherwise it's not going to cut it. Do you use Airflow? Does it have a place? If you're already using Kubernetes and Kubeflow, is there any reason you would ever use Airflow for anything? So what we do at Dept is a lot of data engineering tasks on Airflow: triggering the export of data from one location to another, the creation of tables, all that stuff that happens within another environment but still needs to be scheduled. We do that mainly through Airflow, and for that it works very well, but just not for machine learning. Okay. So, and I hope I'm right in this, in the AWS land of managed step pipeline orchestration, there's a thing called Step Functions, AWS Step Functions, and you might have AWS Lambdas, so you might have a series of just raw functions that feed into each other in a specific order and then branch out and then come back together. Yeah. So, general purpose; it sounds like maybe Airflow is for that purpose. Yeah, I think you could say so. And then how about MLflow? Do you have any familiarity with that? I have heard about MLflow, but I've never really worked with their services. I was hoping you could speak to it to see if anything really stands out. Maybe I'll see if I can get somebody on here in the future to discuss that, because I do see MLflow coming up quite often. Yeah, I have not really seen anyone, within our team at least, that uses MLflow. Okay, that's good to know. But maybe at Rocket? Not that I've seen yet.
So the main contenders that I have really seen in my world are Kubeflow and SageMaker. Funny thing, actually: I'm going to work with SageMaker for Just Eat, which is GrubHub in the US, right? Yeah, yeah. Like delivery. Oh, cool. You know what I love? So, for the listeners, a little bit of a tangent. Dirk comes from Dept; Dept Agency is the master organization. I'm actually working for a company called Rocket, which is a subsidiary; it was recently acquired. There are a lot of smaller companies within Dept Agency, and so it's nice that Rocket's main skill set is AWS among the cloud offerings, and Dept's main skill set is Azure and GCP. So we're the perfect match. Yeah, together we basically cover the biggest share of cloud platforms, I guess. Do you have anything else to say about Kubernetes, Kubeflow, TensorFlow Extended, or any of the other offerings? Yeah, I think in general we talked about what the benefits are of Kubeflow compared to all the other options that are out there. The most important thing is not really what you're going to use eventually; just discover your options and see what is out there and what fits your purpose, because in the end, all these solutions and options serve certain purposes better than others. I can imagine that if you are completely set up in AWS, it wouldn't really make sense to go with something other than SageMaker, even though Kubeflow delivers these very cool solutions. And of course you can still deploy Kubeflow in AWS. I think the main lesson is to start thinking in machine learning pipelines rather than just standalone models, and then which platform serves you best is really up to your situation. Personally, I've always found Kubeflow and TFX to work really well together, and it has served most of the needs of our clients so far. But I guess there are situations in which Kubeflow is also not going to cut it for you. Yeah. I like the idea of starting to think towards orchestration instead of just, like you said, localhost model development. Even in training: if you're going to train your own model, by quickly just starting your training job on one of these platforms, you'll be able to shard it out to multiple nodes, ideally in the cloud, so that you're not confined to your own localhost GPU limitations. Or if you don't even have a GPU: a lot of people develop on Macs, and you can't do machine learning very well on Macs. If that's your limitation, hey, you could just kick off a Kubeflow pipeline or a SageMaker pipeline, and it will distribute your data out. If you have a huge dataset, it will train it on multiple nodes and then recombine the results downstream, and so your training jobs will go a lot faster. It doesn't have to be that expensive, because on AWS you can use what are called spot instances, which are super cheap. A little bit less reliable, but super, super cheap. And GCP has its own equivalent, right? What are those called? Machine types? You mean like the different virtual machines, the workers that they spin up? On AWS, these spot instances are actually: if there's excess compute capacity, like one organization reserved some amount of compute and they just so happen to not be using it at that time, you can fill in that gap, and you're paying substantially less for that compute.
But your machine might get taken offline if that organization comes back; you're basically house-sitting for them. I'm not sure if Google Cloud has something like that, because I haven't seen it before. Preemptible instances, that's the Google Cloud equivalent. Yeah. Oh, you can save like 90%. So for a lot of my friends who are doing machine learning and say, but I have a Mac, I'll say: you know, use AWS, kick off a training job on a GPU-based instance, but use a spot instance, and you can save 90%. I think that's the far end of the spectrum. So if an instance costs $1 per hour to run, like a G4 2xlarge costs something like $1 to $1.40 per hour, with 90% savings it's, you know, 14 cents per hour. It's a huge difference. But the downside is it can come offline. It's actually somewhat of an auction: you bid on it. You say, I'm willing to pay $2 if the price hikes and everybody else wants that house-sitting job; I'm willing to pay over the original cost. But you'll find yourself in that regime quite rarely. And so these spot instances are incredibly useful and valuable, and not as unstable as they would seem at a glance. You wouldn't use these for production-grade things; you wouldn't use these for long-running instances. Yeah. Phenomenal for training jobs. So would you also recommend them for already-deployed models that are running for clients? Because I can imagine... so, how we at Dept kind of build our stuff is that everything is dependent on one another. The output of a model is going to be taken up by some other job, to basically get the output to the location where it can then be used. But with these spot instances, how would that affect things when your worker is taken away because it's utilized by the original organization? Is your job then just waiting until it's finished, or is it going to fail, or how is that going to play out? You definitely wouldn't use it post-training. This is only for the development phase, or the training. Like, let's say your model drifts and you get some alert that it needs a retraining job; then you use one of these spot instances. It depends on the framework being able to pick up where it left off. If a spot instance goes down, and I did look it up, in GCP they're called preemptible instances, if that goes down, it depends on TensorFlow being able to take snapshots and then pick up from where the last snapshot left off. It would then have to notify itself to try again on a new spot instance, pick up where it left off, before saying that the job is done, to kick off the next step in the pipeline. Yeah. Okay. This is kind of the idea: you could do this instead of localhost development of your model's training part. You can kick this off to the cloud using spot instances for cost savings, and then steep yourself in the orchestration paradigm. That indeed sounds like a very good solution versus training locally or paying full price for a machine. So yeah, what I always like to do is just train on a small subset of data, just to debug, and then when everything is working, submit the entire dataset to the model within the pipeline. And so far that's been working pretty well, but I guess this would also be quite a good solution to cut down costs. Just for fun, what's your development environment? Are you PC, Linux? I'm both. So when I'm at home I use a Windows PC. Yeah.
I don't use Linux. And then at work it would also be a PC, but a Mac. A Mac, but no Linux. Gotcha. I tried it at university once, but I don't know. My focus is more towards model building and theoretical data science, and I didn't really enjoy all the hassle that goes into setting up Linux the proper way. Definitely a maker's thing, a tinkerer's job. I remember growing up, in the early days of Linux, there was a war between all the distributions and the flavors, and everybody was trying to find the best. There was Ubuntu on the easy side, on the managed Linux front, and Gentoo on the other side, and those two were the extremes. And I'm like, why would anybody go Gentoo? But there are makers out there; there are people who like to fiddle. Yeah. No, but our old data engineer, who was like the team lead, Jasper, he would use Linux at home. I use Linux quite often. I use all three: Mac, Windows, and Linux. Windows for gaming, Linux for model development and all that stuff. But increasingly I'm trying to do a lot of my development in the cloud, cloud-first. I would like to ask if you could take us through a project, a beginning-to-end journey. The beginning being: I know I want to build, let's say, an image classifier. Okay. And the end being: I want that hosted online. How would we decide which cloud provider to use, or whether we should use open source tooling? And what does the training process look like within the tools that you would end up selecting? What is the pipeline? Why the pipeline? And then how do we get that thing online and integrated into our website? So, from the perspective of a project that a client brings, or more like a general personal project? Let's say our listeners land a client. The client says: look, I know I need a recommender system, or an image classifier. Or, as an alternative, the client may also come with their own specific environment. Maybe they're a Microsoft shop, or they're already using Google for their analytics. So there are various forks in the road during this process where our machine learning engineer might choose one solution over another. And as you go through those steps, you just say, as we're going along, I'm choosing Kubeflow. Does that make sense? Yeah, for sure. So I think the first step, and that's not really even data science related, but the first step would really be to go back to the drawing table and have a look at their data, to really see if what they want, or what they request, is also what they need. What we often see is that clients come with a direct question and already some sort of an idea of what they want. However... because what we always do first is start with a process that's called data discovery. We see what systems are in place, how these systems are connected, what data we have, what we see in the data, so like a small version of an EDA: can we even support that solution or question that they have with the current data that is in place? So that would really be, I would even call it, step zero: go back to the drawing table, and see if the question that the client has is really also what they need. A very nice example, to stick with recommender systems: we had a client that came to us with the question, yeah, we want to have a recommender system on the website.
And when we started looking into their data and investigating customer behavior, through all the systems they already had in place, we actually discovered that they needed a better search function. So they didn't need a recommender on their website; they needed a way for people who visited the website to find products more easily, because what we could already see from that first discovery phase is that it was very hard for people to search for products. Often the products wouldn't really show up when they searched for them. So that really would be step zero. Just Elasticsearch? Or did you implement something... So the client went with a tool whose name I forget. I think it's something called, like, Seeker, but I'm not sure. So yeah, that's now where we're at. But okay, let's say the client comes with a question, and that is a valid question and a valid problem that we can solve in the way that they see it. After we have done the data discovery, and we see that the right systems are in place, or some systems still need to be implemented for everything to work well, we always first look: okay, is there already a platform that they are working on? A lot of the clients that come to us are bigger clients that already have a cloud platform, either AWS, Google Cloud, or Azure. But in case they don't have such a thing, we look at what services they already use, and which cloud platform might integrate with those other services most easily. So, like what I said in the beginning: if it's a company that uses a lot of Google products, then a logical recommendation would be to start using Google Cloud. But on the other side, if they already use a lot of Microsoft-related products, Azure would be the recommendation. So that would then be step one: setting up their cloud environment. And then step two would be for the data engineers to make sure that all the data is in place and everything is available. But this also depends on the use case, because there's a little bit of a separation in approach. In some cases, what we start doing is first see: okay, the client wants to see if the output of a certain analysis or model is even going to help them improve their decision making. And in some cases, they really want to have continuous deployment and use of the output from the model. In the first case, we say: okay, we first start with developing a model, validating the output, and then merging that to the cloud platform and deploying it there. In the other case, it happens that the client is like: okay, I want this eventually to live within my cloud platform. And then we start developing this together with the data engineers. We first make sure that all the data is in place, and that all the data is clean, because that's one of the most important things in this entire process. You have to have clean data; otherwise you're not really going to get anywhere with the outcome of your model. Are we cleaning data yet in your journey, or are we just looking at the data to see if it's clean? So in this case, cleaning... I think there are multiple steps in cleaning. There is already a pre-selection.
So if there's any faulty data that we already know about when integrating or getting all the data together, making sure that data is excluded from the dataset that's finally going to be used by the data scientist to build a solution on is, in my opinion, quite the right thing to do, because when you already know that it's faulty data, there is not really any reason to include it, unless there's a very specific use case, of course. So there is already a pre-selection in that process. Then, once all the data is in place... and this is of course also an iterative process that goes back and forth with the third step, which is the data scientist exploring the data and doing some EDA to see: okay, what do I already see in terms of statistics and visualizations in the data? And they go back to the data engineer, because maybe for some of the data, a lot of values are missing, or don't make sense, or are not in the right format. So steps two and three are kind of an iterative process. When steps two and three are finished, that's when we start developing the model: transforming your data into the right input. What we often like to do is take multiple model approaches. So we have a baseline model. Let's say it's a classification problem: we take something like a random forest classification model, which is very easy to implement but can already give you the first results. And from that we see: okay, where is it performing well, where is it not performing well? Maybe we need to change some stuff in the data, so we go back to steps three and two again. But from there we also start to develop a more advanced model if, for example, that base model is not where you want to have it. Then, when we get to the stage of developing that advanced model, we also get into the stage of making sure that the model can be deployed, and if we want to build a pipeline around it, making sure that all the components that are going to be in place around the model itself are being built up and integrate well with the part on the data engineering side. So that's kind of a parallel process: we develop the custom or advanced model, and at the same time, around it, we build the pipeline. And then, when we have evaluated the custom model and it's performing well and the output makes sense, we make sure that the entire pipeline is deployed, so that the data engineer can also use the output, or the model trained by the pipeline, to predict on new data, which can then be used for any activation purposes that were determined in the discovery phase. So that's kind of the process that we walk through with a lot of our clients, but definitely not all of the clients. And then how do you decide, at that pipeline phase, between, let's say if you're on GCP, Vertex AI versus Kubeflow? Yeah, so currently that depends a little bit on the framework that you're using, because currently, when you use TensorFlow Extended, even TensorFlow itself says that for now it's better to still use the Kubeflow implementation, like the Kubernetes cluster implementation, so the AI Platform implementation, because the Vertex AI implementation is still quite new and might not be as stable as the AI Platform implementation.
And then how do you decide, at that pipeline phase, between, let's say if you're on GCP, Vertex AI versus Kubeflow?

Currently that depends a little bit on the framework you're using, because when you use TensorFlow Extended, and even TensorFlow itself, they say that for now it's better to still use the Kubeflow, Kubernetes-cluster implementation, so the AI Platform implementation, because the Vertex AI implementation is still quite new and might not be as stable as the AI Platform implementation. So for now, whenever we deploy such a pipeline, it's still on AI Platform. Eventually it will depend on whether there are client-specific requirements that might not be met by either AI Platform or Vertex AI, or on AI Platform Pipelines disappearing entirely and getting merged into Vertex AI, of course.

Oh man, so many solutions. I guess in web development there are a million and one frameworks as well: Express, Fastify, FastAPI, and all those things. It's par for the course. Sometimes it makes selecting a tool set to become an expert in, and stick with, a little bit difficult. But this sounds incredibly powerful, and popular too, Kubeflow. It's one that definitely keeps coming up. For me it almost boils down to this: I do a lot of SageMaker work, and oftentimes, when I face a lot of issues, I reconsider whether I should just start learning Kubeflow.

Yeah, I think that's also the nature of Dept. We work with a lot of different clients, so we don't really have the luxury, let's call it that, of sticking with one platform. You have to know them all. And one of the fastest ways to connect with other platforms, with limited knowledge and without knowing all the dedicated pipeline orchestrators of each platform, is using Kubeflow, because it can connect to all those different platforms. Of course, in some cases it might be better to use the dedicated solution of a platform. But especially when you are in the business of Dept, where you have a lot of different clients and a lot of different platforms to work with, finding a solution that spans those platforms and does not depend on just one of them really helps us deliver value to all our clients. It also makes us much more flexible in the solutions we can build and in the clients whose problems we can solve.

That is a really powerful deciding point right there: we're an agency, we have lots of clients. If you're in the same boat, it probably pays to go Kubernetes and Kubeflow, because it's a generalized solution.

Yeah, I really think so. Of course, you can also take the route of having different people with different expertise. But from what I've seen, if there are just a couple of people who each do one thing, it will never get as good as when everyone works on the same thing and can improve themselves, and the team; you make progress much faster than when one person focuses on one solution and another person focuses on another. The knowledge sharing is also way less then. With Kubeflow, someone who works for a client on AWS can use Kubeflow, and another colleague who works with a client on Google Cloud can also use Kubeflow, and they can still share their knowledge with one another, which is going to be harder between someone dedicated to SageMaker and someone dedicated to Azure pipelines, for example.
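As a rough illustration of that portability, here is a minimal sketch using the Kubeflow Pipelines SDK (the kfp v1-style API); the component body, base image, and endpoint URL are hypothetical. The pipeline is defined once in Python, compiled to a Kubernetes-native spec, and submitted to whichever Kubeflow endpoint the client's cluster exposes, whether that cluster runs on AWS, GCP, Azure, or on-premises.

```python
# A minimal Kubeflow Pipelines sketch (kfp v1-style API); the component
# body, base image, and endpoint URL are hypothetical.
import kfp
from kfp import dsl
from kfp.components import create_component_from_func

def train_model(data_path: str) -> str:
    # Placeholder training step; real work would happen here.
    print(f"training on {data_path}")
    return "models/latest"  # illustrative artifact location

# Wrap the function as a containerized pipeline component.
train_op = create_component_from_func(train_model, base_image="python:3.9")

@dsl.pipeline(name="demo-pipeline")
def demo_pipeline(data_path: str = "/data/orders_clean.parquet"):
    train_op(data_path)

# Compile once; the output is a portable, Kubernetes-native pipeline spec.
kfp.compiler.Compiler().compile(demo_pipeline, "demo_pipeline.yaml")

# Submit to whichever cluster the client runs; only the host changes.
client = kfp.Client(host="https://<your-kubeflow-endpoint>")  # hypothetical
client.create_run_from_pipeline_package(
    "demo_pipeline.yaml", arguments={"data_path": "/data/new.parquet"}
)
```

Because only the `host` argument is cluster-specific, the same pipeline definition, and the knowledge behind it, carries over from an AWS client to a GCP client.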
Definitely. This is super valuable; I really appreciate it. Is there anything else you want to say about Kubeflow, TensorFlow Extended, or anything else?

No, I think what we just talked about sums it up pretty nicely. Kubeflow is a flexible solution that serves most of the platforms, not just the orchestration platforms but also the machine learning frameworks. In the end it also comes down to what works best for you, but I think for us as an agency, the biggest point is that we have a lot of different clients. Kubeflow is really something that any agency that works with machine learning engineering and machine learning pipelines should consider using.

I like to end these interview episodes, the way Ship It handles theirs, with a "pick," I call it. Incidentally, you seem like a really interesting guy. This is going to be in audio format, and you come across as a really fun, relaxed character, so I want to get to know you a little bit. Tell me some of your interests or things you like to do off the clock.

Oh, I developed quite a lot of new hobbies during COVID and the lockdowns. First of all, I'm a big keyboard nerd.

Okay, you build keyboards?

Yeah, I build custom keyboards. I got into that hobby during COVID, and it's a rabbit hole: once you get in, you don't get out. I think that's my most recent private hobby, as you could say. I'm also very interested in table tennis, which I recently picked up again as a sport. So those are basically my private interests. And I recently finished my master's degree in data science, but I have always worked on the side as a data scientist at marketing companies, before Dept at another company, and for about two years now at Dept. So besides it being my work, it's also a private interest. During my time off from work I like to read research papers, see what new developments are out there, and follow the latest trends in machine learning and its applications. I think that sums up both my private and my work interests.

On the topic of a master's degree: this is something a lot of my listeners are very interested in, and since it's so fresh for you, maybe you have some insight here. They want to know how much added value a master's adds over a bachelor's in finding a job in data science.

That's actually a good question; I see it popping up on Reddit a lot as well. I think it really depends on your situation. Take, for example, someone who comes from a computer science, statistics, or mathematics background. If you want to get into the data science field and you have a bachelor's in one of those three subjects, then personally I don't think a master's degree specifically in data science is going to add much value for you, because in data science the most important part is understanding the mathematics behind it, and the technical part is, in some cases, easier to learn than the mathematical and statistical part. So if you come from a background with a lot of mathematics and statistics, and computer science basically gives you a combination of mathematics and technical subjects, then in those cases a master's degree in data science is not really going to add much value, or at least not more than exploring data science yourself. That's also what I see around me: a lot of people in the data science field are not necessarily people who did a master's degree.
They had a technical or mathematical background, they were interested in data science, and they just read papers and did coding courses themselves. At that point, with such a background, a master's degree is not going to add more value than self-study. However, my bachelor's was more business- and economics-focused. I already had an interest in data, and I always tried to combine what I learned in my bachelor's with the data field, but it didn't really go further than economic and business mathematics and statistics. For me, the master's really helped to crank up my technical knowledge as well as my more specific, data-science-related mathematical knowledge. So if you come from a business-related background, I think a master's degree in data science really helps if that's where you want to go, and ultimately you end up with this powerful combination of understanding the business side and the data side.

You mentioned computer science, statistics, or math, and let's say the third option is machine learning or data science as an actual master's degree. Which of those would you pick if your intent is data science or machine learning? It sounds like the data science degree is obviously the right answer, but it doesn't seem like that's necessarily so. If you had to pick a master's for data science and machine learning, would you go computer science, math and stats, or data science and ML?

Personally, I think I would choose stats. Stats or mathematics.

I think that's where the conclusion keeps landing. Even though machine learning and data science feel like the obvious choice, you don't know what might come next as the equivalent of that field in the future, but there's old faithful all the while, which is stats.

That's true. In the end, you understand the solutions you build, and how they work, best if you know the stats and the mathematics behind them. From what I've seen, and of course this differs per university, these data science bachelor's programs might sound very related to the subject, but sometimes they stay too broad, so they don't really touch on the deeper mathematics and statistics behind it. In the end, it also comes down to which subjects you choose, the electives during your bachelor's and master's, which can really change what kind of knowledge you finish those degrees with. But I think statistics would definitely be the best backbone to have when going into the data science field.

I agree, and I've seen a lot of agreement on that front online as well. On universities: one thing I recommended a long time ago, though I don't really stand by it anymore just because I haven't kept up with it, so I'm kind of a blank slate on this front, was OMSCS, the Georgia Tech online master's degree. It was really inexpensive, like $5,000 or $8,000 or something, for an accelerated master's program, like one to two years, all online. So it was a really attractive option as opposed to going physically back to university. I guess it's COVID now, so everybody's online anyway.
Do you find there's a middle ground? Those Udacity nanodegrees aren't really going to get you much credibility in the job market, and then there's traditional university for your master's degree, but there seems to be something of a middle ground these days: degrees offered online, maybe by well-known universities, that seem to be accelerated programs.

I've seen a lot of advertisements for those kinds of degrees, especially on platforms like Reddit, and as far as I know you have to be very careful with the curriculum, because a lot of those online degrees seem very attractive because of their prices. What you really have to look out for, from what I've read, is that a lot of the time people who took those courses had barely any interaction with professors or other students. Personally, besides the courses themselves, what I found very valuable during my master's degree, even though 50% of it was during lockdown so I couldn't go to university physically either, which made me realize this all the more, is that the interaction between you and the other students is just as important as the courses themselves. Everyone there has a different background, and you can really learn from one another. I've had a lot of group projects and meetings with other students in which I learned so much that I didn't even get in the courses themselves, just because they came from, for example, a computer science or mathematics background. That's a huge component you miss when you take these online degrees.

On the other hand, I have seen some very good ones that offer courses from very credible professors, some even totally free, though I think they also have paid versions. MIT, for example, is publishing a lot of deep learning lectures. If I compare those to the deep learning lectures I got at my university, the content is pretty much the same; the only things you're missing out on are the interaction with other students and the professor, and the practical exercises. So what I would advise is to really dig deep into what the curriculum has to offer and who the people behind it are, and also to check sources like Reddit or other forums, because a lot of people have shared their opinions about these kinds of online degrees. Often it's really hit or miss: sometimes they're complete nonsense and you're not learning anything you wouldn't have learned just watching videos on YouTube, and sometimes they're really good. I think it really depends on the organization behind it.

I'm glad I asked you those questions; you have quite some insight on the topic.

Yeah, because it's a huge discussion on Reddit, for example. A lot of people are asking: do I even need a master's degree? I found this online degree, is it really necessary or valuable for furthering my career? There's a lot of discussion going on about this, and I always try to keep up with those things.
I find it quite interesting to see where those developments are going, because education in data science is changing very fast; even university curriculums change every two years.

Yeah, another vote for math. It's like: are you going to learn TensorFlow, PyTorch, or deep learning in university? Who knows what comes next.

Yeah. I remember that during my master's degree the focus was very much on machine learning and deep learning, but it's now already shifting more toward reinforcement learning. That's kind of the next step, or not necessarily the next step, but a level up, more advanced, and you can already see a shift going on to push more for those kinds of subjects as well.

This was a wonderful, wonderful interview. I'm so glad we had it.

Really enjoyed it too. Thank you.