
Site Reliability Engineering at Starship | by Martin Pihlak | Starship Technologies

Photo by Ben Davis, Instagram slovaceck_

Running autonomous robots on city streets is very much a software engineering challenge. Some of this software runs on the robot itself, but a lot of it actually runs in the backend. Things like remote control, route finding, matching robots to customers, fleet health management, but also interactions with customers and merchants. All of this needs to run 24×7, without interruption, and scale dynamically to match the workload.

SRE at Starship is responsible for providing the cloud infrastructure and platform services for running these backend services. We have standardized on Kubernetes for our microservices and run it on top of AWS. MongoDb is the primary database for most backend services, but we also like PostgreSQL, especially where strong typing and transactional guarantees are required. For asynchronous messaging Kafka is the platform of choice, and we use it for pretty much everything aside from shipping video streams from robots. For observability we rely on Prometheus and Grafana, Loki, Linkerd and Jaeger. CI/CD is handled by Jenkins.

A good portion of SRE time is spent on maintaining and improving the Kubernetes infrastructure. Kubernetes is our main deployment platform and there is always something to improve, whether it's tuning autoscaling parameters, adding Pod disruption policies or optimizing Spot instance usage. Sometimes it's like laying bricks – simply install a Helm chart to provide a particular piece of functionality. But often the "bricks" have to be carefully picked and evaluated (is Loki any good for log management, is a service mesh worth having, and if so which one), and occasionally the functionality doesn't exist in the world and has to be written from scratch. When that happens we usually reach for Python and Golang, but also Rust and C when needed.
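For a concrete flavor of one of those "bricks", here is a minimal sketch of what creating a Pod disruption policy programmatically could look like with client-go. In practice this is usually a few lines of YAML applied alongside the service; the service name, namespace and threshold below are invented for illustration.

```go
package main

import (
	"context"
	"log"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location; in-cluster config would also work.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// Keep at least 2 pods of the (made-up) route-service running during
	// voluntary disruptions such as node drains on Spot instance reclaim.
	minAvailable := intstr.FromInt(2)
	pdb := &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "route-service-pdb", Namespace: "test"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector:     &metav1.LabelSelector{MatchLabels: map[string]string{"app": "route-service"}},
		},
	}

	if _, err := clientset.PolicyV1().PodDisruptionBudgets("test").Create(context.TODO(), pdb, metav1.CreateOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Println("PodDisruptionBudget created")
}
```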

Another large piece of infrastructure that SRE is responsible for is data and databases. Starship started out with a single monolithic MongoDb – a strategy that has worked well so far. However, as the business grows, we need to revisit this architecture and start thinking about supporting robots by the thousands. Apache Kafka is part of the scaling story, but we also need to figure out sharding, regional clustering and microservice database architecture. On top of that, we are constantly developing tools and automation to manage the current database infrastructure. Examples: adding MongoDb observability with a custom sidecar proxy to analyze database traffic, enabling PITR support for databases, automating regular failover and recovery tests, collecting metrics for Kafka re-sharding, enabling data retention.
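To make the sidecar proxy idea more concrete, here is a heavily simplified sketch of the concept: a TCP proxy that sits next to the application, forwards bytes to mongod unchanged, and peeks at the standard 16-byte MongoDb wire protocol header to count messages per opcode. The real proxy does far more than this, and the listen and backend addresses below are placeholders.

```go
package main

import (
	"encoding/binary"
	"io"
	"log"
	"net"
)

// forwardAndCount copies data from src to dst while parsing MongoDb wire
// protocol headers (four little-endian int32 fields: messageLength, requestID,
// responseTo, opCode) and reporting each opcode seen.
func forwardAndCount(dst, src net.Conn, counts chan<- int32) {
	defer dst.Close()
	header := make([]byte, 16)
	for {
		if _, err := io.ReadFull(src, header); err != nil {
			return
		}
		msgLen := int32(binary.LittleEndian.Uint32(header[0:4]))
		opCode := int32(binary.LittleEndian.Uint32(header[12:16]))
		counts <- opCode
		if _, err := dst.Write(header); err != nil {
			return
		}
		// Forward the rest of the message unchanged.
		if _, err := io.CopyN(dst, src, int64(msgLen)-16); err != nil {
			return
		}
	}
}

func main() {
	counts := make(chan int32, 1024)
	go func() {
		seen := map[int32]int{}
		for op := range counts {
			seen[op]++
			log.Printf("opcode counts: %v", seen)
		}
	}()

	ln, err := net.Listen("tcp", "127.0.0.1:27018") // placeholder listen address
	if err != nil {
		log.Fatal(err)
	}
	for {
		client, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		backend, err := net.Dial("tcp", "127.0.0.1:27017") // placeholder mongod address
		if err != nil {
			log.Fatal(err)
		}
		go forwardAndCount(backend, client, counts) // client -> mongod, parse requests
		go io.Copy(client, backend)                 // mongod -> client, raw copy
	}
}
```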

Finally, one of the most important goals of Site Reliability Engineering is to minimize downtime for Starship's production. While SRE is occasionally called on to deal with infrastructure outages, the more impactful work goes into preventing outages and ensuring that we can recover quickly. This can be a very broad topic, ranging from rock-solid K8s infrastructure all the way to engineering practices and business processes. There are great opportunities to make an impact!

A day in the life of an SRE

Arrive at work some time between 9 and 10 (sometimes working remotely). Grab a cup of coffee, check Slack messages and emails. Review the alerts that fired overnight and see if there's anything interesting there.

Find that MongoDb connection latencies spiked during the night. Digging into the Prometheus metrics with Grafana, it turns out this happens while backups are running. Why is this suddenly a problem, we have been running those backups for ages? It turns out we compress the backups very aggressively to save on network and storage costs, and this consumes all the available CPU. It looks like the load on the database has grown just enough to make this noticeable. It happens on a standby node, so there is no production impact, but it would still be a problem if the primary failed. Add a Jira item to fix this.

In passing, change the MongoDb prober code (Golang) to add more histogram buckets to get a better understanding of the latency distribution. Run a Jenkins pipeline to put the new probe into production.
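The prober itself isn't shown here, but the change is roughly of this kind: a sketch of a Go probe exposing a Prometheus histogram with explicit, finer-grained buckets for connection latency. The metric name, bucket boundaries and probe interval are made up for illustration.

```go
package main

import (
	"log"
	"math/rand"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical metric: connection latency histogram with finer-grained buckets
// in the 1 ms to 2.5 s range, so the latency distribution shows up in Grafana.
var connectLatency = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "mongodb_probe_connect_duration_seconds", // made-up metric name
	Help:    "Time taken to establish a MongoDb connection.",
	Buckets: []float64{.001, .0025, .005, .01, .025, .05, .1, .25, .5, 1, 2.5},
})

func probeOnce() {
	start := time.Now()
	// A real probe would open a MongoDb connection here; simulated with a sleep.
	time.Sleep(time.Duration(rand.Intn(20)) * time.Millisecond)
	connectLatency.Observe(time.Since(start).Seconds())
}

func main() {
	go func() {
		for {
			probeOnce()
			time.Sleep(10 * time.Second) // made-up probe interval
		}
	}()
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```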

At 10 a.m. there is the standup meeting; share your updates with the team and learn what others have been up to: setting up monitoring for a VPN server, instrumenting a Python app with Prometheus, setting up ServiceMonitors for external services, debugging MongoDb connectivity issues, piloting canary deployments with Flagger.

After the meeting, resume the work planned for the day. One of the things I had planned to do today was to set up an additional Kafka cluster in a test environment. We run Kafka on Kubernetes, so it should be straightforward to take the existing cluster YAML files and tweak them for the new cluster. Or, on second thought, should we use Helm instead, or maybe there is a good Kafka operator available by now? No, I won't go there – too much magic, I want more explicit control over my StatefulSets. Raw YAML it is. An hour and a half later a new cluster is running. The setup was fairly straightforward; only the init containers that register Kafka brokers in DNS needed a configuration change. Generating credentials for the applications required a small bash script to set up the accounts in Zookeeper. One bit that was left dangling was setting up Kafka Connect to capture database change log events – it turns out the test databases aren't running in ReplicaSet mode and Debezium can't get an oplog from them. Backlog this and move on.
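Once the cluster is up, a quick client-side sanity check helps confirm that brokers are reachable and topics are visible. A minimal sketch using the sarama client, with a placeholder bootstrap address and broker version:

```go
package main

import (
	"log"

	"github.com/IBM/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V2_8_0_0 // assumption: adjust to the broker version in use

	// Placeholder bootstrap address for the new test cluster.
	client, err := sarama.NewClient([]string{"kafka-test-0.kafka-test:9092"}, cfg)
	if err != nil {
		log.Fatalf("connecting to cluster: %v", err)
	}
	defer client.Close()

	// Report how many brokers answered and which topics exist.
	log.Printf("connected, %d brokers visible", len(client.Brokers()))

	topics, err := client.Topics()
	if err != nil {
		log.Fatalf("listing topics: %v", err)
	}
	log.Printf("topics: %v", topics)
}
```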

Next, it's time to prepare a scenario for the Wheel of Misfortune exercise. At Starship we run these to improve our understanding of our systems and to share troubleshooting techniques. It works by breaking some part of the system (usually in the test environment) and having some unfortunate person try to troubleshoot and mitigate the problem. In this case I'll set up a load test with hey to overload the microservice that does route calculations. Deploy it as a Kubernetes Job called "haymaker" and hide it well enough so that it doesn't immediately show up in the Linkerd service mesh (yes, evil 😈). Later, run the "Wheel" exercise and take note of any gaps we have in playbooks, metrics, alerts, etc.
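For flavor, deploying such a load generator as a Kubernetes Job could look roughly like the sketch below. The container image, target URL, duration and concurrency are invented, and the Linkerd annotation simply opts the pod out of sidecar injection so its traffic doesn't show up in the mesh.

```go
package main

import (
	"context"
	"log"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "haymaker", Namespace: "test"},
		Spec: batchv1.JobSpec{
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{
					// Opt out of Linkerd sidecar injection so the load test
					// traffic doesn't show up in the service mesh dashboards.
					Annotations: map[string]string{"linkerd.io/inject": "disabled"},
				},
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:  "hey",
						Image: "williamyeh/hey", // placeholder image containing the hey binary
						// Made-up target: 50 concurrent workers for 30 minutes.
						Args: []string{"-z", "30m", "-c", "50", "http://route-service.test:8080/route"},
					}},
				},
			},
		},
	}

	if _, err := clientset.BatchV1().Jobs("test").Create(context.TODO(), job, metav1.CreateOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Println("load test job created")
}
```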

In the last few hours of the day, block all interruptions and try to get some coding done. I have reimplemented the Mongoproxy BSON parser as an asynchronous streaming one (Rust + Tokio) and want to figure out how well it works with real data. It turns out there's a bug somewhere in the parser guts and I need to add deep logging to figure it out. Find a wonderful tracing library for Tokio and get carried away with it …

Disclaimer: The events described here are based on a true story. Not all of it happened on the same day. Some meetings and interactions with coworkers have been edited out. We are hiring.

