How do we manage our Cloud Infrastructure at Cardo AI?
Our cloud infrastructure team is crucial to the success of our business. We take a cloud-based approach to our software, which lets us move fast during the startup stage of a project and grants high autonomy to our product teams.
When we designed our infrastructure team, we were well aware of the isolation that technical departments often suffer from in companies. That is why we built a structure that supports continuous innovation in our environments while also guaranteeing high flexibility and autonomy to our product teams.
The team’s mission is to provide support and expertise to all the product teams of the company and to be a touch point for all cloud-related initiatives that may be undertaken.
The team is structured into two main areas that closely cooperate with each other:
- Feature teams
- Resident team
The Infrastructure Engineering members working within the feature teams are responsible for getting a deployable artifact to production and managing it as swiftly as possible on behalf of the product teams.
This group must walk a fine line: giving developers enough flexibility to be productive and move fast, while maintaining aggregate efficiency across the organization and managing cost and risk.
To enforce the policies and decisions made by the Infrastructure team, only Infrastructure Engineering members (working within the feature teams) are able to operate freely in the teams’ cloud environments. On the other hand, to avoid creating a structure of hard dependencies, we grant developers full ownership of the release pipelines and deployment manifests.
The Resident team is responsible for driving innovation and defining the guidelines for our cloud infrastructure within the organization.
The team will be primarily composed of experienced cloud engineers who can anticipate trends in the technology landscape and identify potential areas of improvement in the current infrastructure.
They will normally operate by collecting business and infrastructure needs from a set of stakeholders, working from a backlog that is maintained internally but defined externally.
The Resident team will also act as an additional level of support for feature-team members when they need help executing their tasks.
Using a cloud-native approach to serve our products
We believe that cloud-native philosophy is a great way to build software. That’s why all of our products are built on Kubernetes.
We also think that locking into a particular vendor is a bad idea, especially when it comes to something as important as your software stack. By building on Kubernetes, we avoid vendor lock-in and keep our options open, backed by a large, highly skilled open-source community.
We use Kubernetes to manage our infrastructure, relying on the cloud provider’s managed distribution (e.g. AWS EKS or Azure AKS) to run it. We also use Infrastructure as Code tools such as Terraform to create a fully replicable, easy-to-manage infrastructure.
In addition to using Infrastructure as Code, we release our software with a pure GitOps approach, using Argo CD as our continuous delivery tool and Kustomize templates for our products, defined in dedicated Git projects. You can learn more about our continuous integration and delivery in the dedicated blog post.
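As an illustration of how such a Git project can be laid out, here is a minimal Kustomize overlay for a staging environment. The product name, image registry, and namespace are hypothetical, not our actual configuration:

```yaml
# overlays/staging/kustomization.yaml (illustrative sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

# Reuse the shared manifests defined once in base/
resources:
  - ../../base

# Environment-specific settings applied on top of the base
namespace: sample-product-staging
images:
  - name: registry.example.com/sample-product
    newTag: "1.4.2"   # the image version released to staging
```

Each environment (development, staging, production) gets its own overlay on the same base, so an environment differs from the others only by the patches in its `kustomization.yaml`.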
GitOps is a paradigm that allows us to create, update, and manage our deployments with the same tool our developers use for code: Git. This way we can automate everything, from building out our cloud-native infrastructure to managing the different environments (e.g. development, staging, and production) at a glance, and deploy it all with Argo CD.
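As a minimal sketch of this workflow, an Argo CD Application that tracks a Kustomize overlay in Git could look like the following; the repository URL, paths, and names are assumptions for illustration only:

```yaml
# An Argo CD Application: "deploy whatever this Git path says" (illustrative)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: sample-product-staging
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/deployments/sample-product.git
    targetRevision: main
    path: overlays/staging        # Kustomize overlay for this environment
  destination:
    server: https://kubernetes.default.svc
    namespace: sample-product-staging
  syncPolicy:
    automated:
      prune: true      # delete cluster resources that were removed from Git
      selfHeal: true   # revert manual drift back to the state declared in Git
```

With `automated` sync enabled, merging a change to the Git repository is all it takes to roll it out: Argo CD detects the new commit and reconciles the cluster to match it.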
Machine Learning: how we defined a flexible infrastructure to support our Data Scientists
Our infrastructure for Machine Learning follows a similar approach to the one we took when designing our software infrastructure, but we also needed to consider additional requirements that are mandatory when dealing with Data Science. Machine Learning is not just software: it needs to include data in its loops (both model training and serving) and requires different hardware capabilities than traditional software development.
Data scientists can count on three different environments, which are used depending on the activity they need to perform:
- Staging: base infrastructure for deploying development and staging environments.
- Kubeflow cluster: a dedicated cluster that serves only the Kubeflow environment, which provides data scientists with a useful set of tools for developing ML models while abstracting away the underlying infrastructure.
- Prod cluster: where all the production models and workloads run.
Models are served with KServe, which provides a serverless, autoscaling solution built on top of the Kubernetes stack.
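As a sketch, a KServe InferenceService that scales to zero when idle might look like this; the model name, framework, and storage location are hypothetical:

```yaml
# A KServe InferenceService (illustrative sketch, not a real model of ours)
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: pricing-model
spec:
  predictor:
    minReplicas: 0    # serverless: scale down to zero when there is no traffic
    maxReplicas: 4    # autoscale up under load
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://models-bucket/pricing-model/   # trained model artifacts
```

KServe then exposes the model behind an HTTP endpoint and handles revisioning and autoscaling through the underlying Kubernetes stack, so data scientists only declare what to serve, not how.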
Join our cloud infrastructure team
Our infrastructure team is still growing, and we are currently searching for several talented engineers. If this sounds like something you might be interested in, take a look at our job openings and don’t hesitate to apply!