Webinar #24

September 7th, 2022

Length: 47 min

Why lack of SRE kills your business

Learn what Site Reliability Engineering is and how it helps make large-scale sites and applications more reliable, efficient, and scalable.

Who is this webinar for:

  • Developers and administrators who want to learn how large companies keep their products consistently reliable
  • Managers who don’t want reliability-related problems to blow up in their faces
  • Engineers who want to learn practical tips about managing large-scale projects

What you're going to learn:

  • What exactly SRE is and why you want to implement it from the very beginning
  • The best practices that employing the SRE culture can bring to your engineering organization

Timestamps:

00:00  Intro

03:09  What is Site Reliability Engineering

06:54  What a typical day of an SR Engineer looks like

09:19  What is blameless debrief

15:17  People’s doubts towards the implementation of SRE methodology

18:56  What tools Site Reliability Engineers use

23:53  Who actually needs SRE

27:58  What set of skills a Site Reliability Engineer needs

30:37  How we typically react to a work accident

37:40  What is the structure of an SRE team

43:10  Q&A

Transcript

Introduction

00:00

Maciek

Hello, everyone, welcome to yet another Buddy webinar. Today, we are going to talk about something about which, I'll be honest, I didn't have a clue before this webinar. We're going to talk about why lack of SRE, that is, Site Reliability Engineering, kills your business. With me I have a true specialist in this matter, Filip Tepper. Filip, tell us a few words about yourself.

01:44

Filip

Hi Maciek, thanks for having me. So I'm Filip. I am in Warsaw, in Poland. I am a senior engineering manager for the SRE team on top, and my background is mostly software engineering. That's where I kind of started, then I dived into cloud and infrastructure management. And right now, finally, I'm in this realm of SRE. But in general, my responsibilities have always been kind of blurry. So I'm trying not to put a kind of, you know, definite label on my career so far.

02:21

Maciek

And my name is Maciek Palmowski. I work here at Buddy as a WordPress ambassador. So before we start: today we'll learn about a few things. First of all, what SRE is, and why and when you should invest in it, or maybe not. And what the consequences are of not having it, or of not having such a department in your company. Also, if you like what we are doing, remember to press the subscribe button on YouTube. And if you have any questions regarding today's webinar, just ask them in the comment section on YouTube, LinkedIn or Facebook. Okay, so I think that's all. Filip?

What is Site Reliability Engineering

03:09

Maciek

So yeah, let's start with what this site reliability engineering is. What does such a person do?

03:19

Filip

I think it's a million dollar question, really, because you can ask different people about what SRE is, and you're gonna get totally different answers. First of all, this term, Site Reliability Engineering, was coined at Google. And I would say that pretty much every company, every organisation that mentions SRE is kind of looking up to this idea that Google had for SRE. But then again, each implementation is very different. There's only one Google, and there's a lot of other companies that do things differently. You can call SRE kind of an implementation of the DevOps approach, meaning that it's a culture of fostering collaboration between software engineers and system operators. You can also call it a software engineering approach to operations. For me personally, but please take it with a grain of salt, this is just how I see it, it's a set of practices, principles, and ways of working that contribute to highly scalable and reliable software. It's about empowering engineers to own the reliability and availability of their systems, to let them understand the performance of their applications, to help them monitor their applications, to have a really good incident response process, because this is also a very important aspect of this, and essentially to build great tools for engineers to make sure that they can run their applications. Of course, here we're kind of diving a little bit into the platform side. But like I said, this is still a very vague definition of what SRE is. A huge part of the job is also working with automation and reducing toil for the most common tasks. But of course, there's always this temptation to automate everything. I definitely err on the side of being reasonable with, you know, the time spent versus the time saved. So that's another aspect of it. A lot of us are also involved in system design, in the architecture process. So this is an area that we contribute to as well.
Something that is a little bit more ambiguous for me is how SREs contribute to the CI/CD process, because many companies expect the SREs to do that, and many companies have a separate department to handle CI/CD. So again, this depends on the organisation. And there's a lot of cool stuff happening in this area, for example chaos engineering, which started at Netflix, where you essentially randomly break your application by, for example, shutting down random servers or injecting network latency into your application. And essentially, you learn from that. It's something you do on purpose, and it's something that tests the resiliency of your application. But then again, I could probably go on for hours and hours just talking about what SRE is, but essentially, I think that it boils down to having a certain way of working with infrastructure and with automation, and providing reliability to systems.
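To make the chaos engineering idea mentioned above a little more concrete, here is a minimal sketch in Python. It is not how Netflix or any specific tool does it: `ChaosInjector` and `fetch_with_retry` are hypothetical names made up for this illustration, and the "server shutdown" is simulated by raising a `ConnectionError` on a random fraction of calls.

```python
import random
import time


class ChaosInjector:
    """Randomly injects latency or failures into calls (hypothetical helper)."""

    def __init__(self, failure_rate=0.1, max_latency_s=0.0, seed=None):
        self.failure_rate = failure_rate    # fraction of calls that fail
        self.max_latency_s = max_latency_s  # upper bound for injected delay
        self._rng = random.Random(seed)

    def call(self, func, *args, **kwargs):
        # Inject artificial, network-style latency before the real call.
        if self.max_latency_s > 0:
            time.sleep(self._rng.uniform(0, self.max_latency_s))
        # Simulate a dependency going away by failing a fraction of calls.
        if self._rng.random() < self.failure_rate:
            raise ConnectionError("chaos: injected failure")
        return func(*args, **kwargs)


def fetch_with_retry(chaos, func, attempts=3):
    """The resilience mechanism under test: retry on connection failures."""
    last_error = None
    for _ in range(attempts):
        try:
            return chaos.call(func)
        except ConnectionError as err:
            last_error = err
    raise last_error
```

Running the application code through the injector on purpose, with a non-zero `failure_rate`, shows whether the retry logic actually copes with the failures it was written for.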

What a typical day of an SR Engineer looks like

06:54

Maciek

So we can say that such an engineer has to know really a lot about a very, very wide range of topics, right? Because we already mentioned development, response to any problems, system architecture. And, as mentioned, it also depends on the company you work at: CI/CD, and many other things. But in your case: okay, you start your day at work, and what does your typical day look like?

07:41

Filip

I start on Slack, like everybody these days. But in all seriousness, of course the technical skills are important, but it's also about the culture. So it's not only certain skills that are necessary, it's also about how you foster certain interactions in the company, how you foster certain practices to become more permanent. I would say that, you know, an SRE's everyday life is pretty much the same as every other software engineer's, except we focus on different areas, right? So we have, essentially, all the software engineering work, except it is focused mostly around infrastructure, around reliability and things like that. So it's not really that different. Of course, a lot of SREs are expected to be part of the incident response process. But then again, it feels to me, based on some conversations I had, that more and more companies actually put their software engineers on call as well. So again, this is a kind of blurry topic. On the other side, if you think about incident response, it's the SRE culture that essentially brought a lot of practices into this area, like for example blameless debriefs, how to learn from incidents, that chaos engineering part I mentioned. So I would say an SRE's life is just a slightly different software engineer's life at this point.

What is blameless debrief

09:19

Maciek

You mentioned this term, blameless debrief. Could you say a bit more about this? Because this sounds really interesting.

09:26

Filip

So I think it's essentially the essential part of the SRE culture. Whenever there's an incident, you are expected to get to the bottom of what happened and, mostly, why it happened. You're not interested in the who; you're interested in what happened, why it happened, and how we can prevent this from happening again. So essentially, you run through the incident, you gather all the information that you gathered to resolve the incident, and then you learn from it, right? You apply some immediate fixes that you need to apply in order to restore the service. But then again, an important part of that culture should be that you make room in your software development process to address the issues that caused a certain outage, right? Whether it was something on the infrastructure side, whether it was wrong permissions, whether it was a bug in the application that caused, I don't know, increased network latency, you have to make sure that you learn from these experiences, and you make room for that in the software development process.

10:39

Maciek

Okay. And you also mentioned here that who was responsible is less important, or even not important at all. So I see that it's kind of similar to code review: a very important part of code review is, first of all, not to take it personally. When we receive a code review, it's also very much about emotions, because our first reaction to such a review is always to try to defend the code, to defend it personally. And it's more than that. It's not about us, it's just about the code. And here, as I understand it, this is the same, but it's not just about the code. It's much wider. I can say that it sounds a bit like being a private detective in the company. As you mentioned, you gather all the clues from all the departments?

11:46

Filip

Well, I would say yes. I can give you an example. Let's say someone removed a critical part of infrastructure, and that caused, for example, downtime. For me, it's extremely unimportant who did it. What is important is that some action caused downtime. And my number one question here would be: why was it allowed to happen, right? It doesn't matter what the name of the person was, but why did a person who should not have certain permissions have these permissions granted, right? So this is a gap in the process. This is not a human error. I mean, it was a human error, but it should not be treated as one. We should design our systems in a way that prevents these kinds of actions, right? So this is, again, a learning experience. Okay, we did not foresee that a certain thing might happen. It was a system mistake, and we are here to correct that system mistake and not to blame anyone in particular. What I learned from one of my earliest bosses was that this is actually something that is extremely valuable. I deleted a production database on my first day at work. And I was expecting the worst, really, I was expecting to be fired. But the CEO of the company came over and said, "Thanks for discovering this extremely, extremely important bug for us." He said, "We're going to be fine. We have all the backups, we've lost a couple of messages from our instance, and that's it." And I think that this is the approach we should take. This is a learning experience. This is something that helps us improve our systems, helps us understand our systems better and make them more resilient. So it's the greatest opportunity for us to learn and understand how we can do things better.
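The point about designing systems so that a destructive action is not possible without an explicit grant can be sketched roughly as follows. Everything here is an assumption made for illustration: the `User` type, the `db:drop:<environment>` permission string and the `drop_database` function are hypothetical and not taken from any real tool.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    permissions: set = field(default_factory=set)


class PermissionDenied(Exception):
    """Raised when a destructive action was not explicitly granted."""


def drop_database(user, db_name, environment):
    """Refuse destructive actions unless the permission was explicitly granted."""
    required = f"db:drop:{environment}"  # hypothetical permission naming scheme
    if required not in user.permissions:
        # The message names the missing permission, not the person:
        # the gap is in the process, in the spirit of a blameless debrief.
        raise PermissionDenied(f"{required} is not granted")
    return f"dropped {db_name} in {environment}"
```

With a guard like this, a first-day engineer who only holds `db:drop:staging` gets a `PermissionDenied` when pointing the same call at production, instead of an outage.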

13:58

Maciek

Yeah, because the most important part in your case was not the fact that you deleted the database. Because, yeah, it happens. I mean, this engineer at GitLab probably still thinks about the day that he had. But the fact that you, as a first-day engineer at your workplace, had the possibility to do so, exactly that shouldn't happen. It shouldn't happen. Only a small group of people should have access to such a dangerous operation.

14:38

Filip

Absolutely. I would even argue that today, with all the fantastic tooling, especially related to infrastructure, there should actually be very few people, if any, with such a power. You can do really remarkable things with all the tooling that you have for infrastructure as code, all the automation that you have already available. So if you want to go down that path, then yes, I think that you should be very critical about, you know, handing out these superpowers. And there are certainly different ways of doing this now, and that gives you a lot of confidence in what you're doing in general.

People’s doubts towards the implementation of SRE methodology

15:17

Maciek

Also, the thing that you mentioned, and this is also the thing that I really like to hear, because it happens more and more often with different approaches, is the fact that SRE is a culture. We also say this about DevOps and about many, many other things. We also say this about accessibility: it's more than just accessibility, it's the way that we all think about it. So what is the hardest part of convincing other people that this SRE way of doing things is worth it, and what are the biggest problems in trying to convince some people?

16:02

Filip

I have to say, I've been mostly fortunate enough to not have to make a very strong case, because I think it's really a very reasonable approach to the software development lifecycle in general. Because in the very end, it actually serves a very important business case. So you have your software more resilient, more reliable, customers are happier, you understand the issues that you're facing much better. So you can have much better communication with your customers. You can be ahead of your customers, really: if you have good observability, good monitoring of your systems, then you don't have to wait for your customer to report a bug or downtime to you. You are ahead of them, and you can react faster, essentially shortening this cycle. So I would say I'm only seeing benefits from this perspective. Obviously, that means that there's more work, essentially, for everyone, right? But then again, you have SRE teams that are, to a certain degree, teams that help build tooling for maintaining this culture. In a way, looking from a software developer's perspective, if it's easy for me to add observability to my application, then I'm happily gonna do it, because I can clearly see the benefits of that, right? I can detect performance issues, I can detect latency, I can detect just about any issue with my application and very carefully monitor its health, be aware of its health, and be able to react faster. But I often find that, yes, it is not as easy as it might seem, mostly because tooling right now is kind of all over the place, right? There are certain, I would say, business leaders, if you think about this area, but there's still a lot on SRE teams that has to be essentially enabled, built, glued together so that engineers have a good experience. Because really, engineers are our customers. And we want to make our customers happy as well.
If we provide tooling that is difficult to use, has no documentation, and doesn't really bring the expected value, then it's really hard to justify this case. But then again, if you have good tooling, if you have good culture, good documentation, then at some point it becomes really obvious to everyone: yeah, I've released this application to production, but I have no visibility into how it's performing. I don't have that confidence that the tooling, that the observability gives me, and yeah, I would really like to have that, right? I feel more confident when I understand what is actually happening with my systems.
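As an illustration of what "easy to add observability" could look like from a developer's side, here is a small Python sketch. It assumes a hypothetical `observed` decorator and uses in-memory dictionaries as a stand-in for a real metrics backend; it is not the API of Prometheus, Datadog, or any other product.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory stores standing in for a real metrics backend (assumption).
LATENCIES_MS = defaultdict(list)
ERRORS = defaultdict(int)


def observed(name):
    """Record latency and error counts for the decorated function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                ERRORS[name] += 1  # count the failure, then re-raise it
                raise
            finally:
                # Runs for both successful and failed calls.
                elapsed_ms = (time.perf_counter() - start) * 1000
                LATENCIES_MS[name].append(elapsed_ms)
        return wrapper
    return decorator


@observed("checkout")
def checkout(cart_total):
    # Hypothetical business function used only for the example.
    if cart_total < 0:
        raise ValueError("negative total")
    return cart_total
```

The design choice worth noting is that the developer adds one line, the decorator, and the latency and error data then exist for every call, which is the kind of low-friction tooling described above.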

What tools Site Reliability Engineers use

18:56

Maciek

Okay, yeah, I understand. You mentioned tooling, so this is kind of a natural next question. You mentioned that there are some tools for SRE. So what are they? This is especially interesting when we think how wide a topic it is.

19:18

Filip

It is very wide. So on a daily basis, we really use a lot of tools. And I would say that my kind of, you know, favourite area is observability. And there's a lot of really great tooling that I would say is quite accepted as, you know, industry standard. If you're into Kubernetes stacks, then pretty much everyone is working with the Prometheus stack. Now there's Datadog, who is doing remarkable things on the market. There's the Elasticsearch, Logstash, Kibana stack for metrics and for observability in general. So this is the tooling that a lot of companies and organisations are using today. There's a lot of great tooling for incident management, so essentially to provide certain guidelines, a certain way of doing the incident response process. There's a lot of tooling for essentially running your on-call, like PagerDuty, which calls you in the middle of an incident to kind of put you in front of your computer to start solving the problem. But this is only one aspect of it. We, especially my team, we do a lot with infrastructure. So we use a lot of, you know, infrastructure as code tooling, configuration management, infrastructure management, whatever you want to call it. So there's Terraform, there's things like Ansible, all this, I would say, old-school tooling in this realm. But then again, they have proved to be really reliable, and essentially invaluable at a certain size of your organisation. Because otherwise it's just, you know, managing infrastructure manually, and that's not something you want to do, like, probably at all. Obviously there's a lot of, I would say, common tools. Like, for example, everybody's using Docker, and a lot of companies are moving or already are well into Kubernetes.
A very important part is that we also use a lot of CI/CD tooling, like for example Buddy, of course, because that gives us a lot of confidence in how we, for example, work with our infra code, right? We essentially treat infrastructure as code like any other software. So that really ties well into the whole software development process. And of course, if you're managing infrastructure and working with system design, that means you're evaluating a lot of storage engines. You're evaluating all sorts of network appliances that you can run in the cloud or in data centres. So yeah, essentially, you have fun with a lot of tools, really.

22:05

Maciek

Yeah, the list was very, very long, and from so many topics. But I'm sure that working in SRE is pure fun if you like to play around with different tools. Up to some point, I mean.

22:26

Filip

I would say that incident response, unfortunately, is an important part of the job. And that is always tricky. I would say this is a really important topic, because it's not only about how you work with tech, but it's also about how it influences your team. You know, it's a stressful process to be involved in incident response. It is a process that puts strain on your family life if you're on call off hours. So there are different aspects of it. But then again, being responsible for certain areas makes you think, essentially, twice before you commit to something that might wake you up in the middle of the night. Personally, I very much enjoy boring technology, meaning technology that has been proven, technology that is exactly boring, meaning there's a lot of people who understand how it works. It's fairly reliable. It is something that you have a lot of confidence in. So yeah, different aspects of this very topic.

23:30

Maciek

And probably some people would say that you are talking about many things that would be described as legacy software. But that legacy software, in many cases, just works in a very boring way, without any accidents.

23:45

Filip

Well, in distributed systems, you cannot have an incident-free environment. But yes, in general, yes.

Who actually needs SRE

23:53

Maciek

And tell me, who needs SRE? Should every company think about it, or maybe not about having the whole department? Because, as you mentioned, it's a culture. It's a culture like DevOps, like CI/CD, because CI/CD is also a way of dealing with software. It's not only about writing tests and everything. It's also a bit about the fact, as you mentioned, that I'd rather think twice and add some additional testing than be awake at night fixing something in production.

24:36

Filip

Well, since I see this mostly as a culture, I would answer with my favourite answer: it depends. If you have a small, nimble team that already knows their way around certain aspects of reliability, then I would say you're probably fine, then you probably don't need a separate team. If you have a team that lacks the skills, try embedding an SRE into this team, make SRE and the whole concept around SRE part of the process inside your team. Maybe you want to have a separate team, maybe you know that there's so much need for building stuff that it will keep a separate team occupied, and on top of that, add some embedded SREs. I would say there is really no good answer, and this very much depends on your organisation. But I remember when I started kind of building the first infrastructure team, a couple of years ago, I started doing that because I noticed gaps in how we essentially, at that time, monitored our infrastructure. So that was kind of a common-sense approach. Okay, I wasn't aware that it was called SRE at the time. But apparently, what I did was I started building this SRE culture, where we started understanding our application better, we got more insights into how we were performing. And yeah, we didn't call it that way, it was just a natural approach. So I think that it's more about fostering this culture and having an understanding that this is something that every serious, or even not so serious, business needs, just because it gives you confidence in your process.

26:32

Maciek

So you would say that it would be great if even the smallest company learned about how SRE works and tried to figure out how to fit at least some of its ideas into their style. That could mean providing a whole team, which is probably quite hard at the beginning of their journey, so the better approach would be to just use the methodology at the start, without the extra people.

27:04

Filip

Absolutely. There's a lot of great tools that have a very low barrier of entry. You don't need to have a separate team to, I don't know, start monitoring your application or add an APM or anything. If you have an early-stage application, you're building an MVP, it's so easy to add it and then build on top of that, that it would really be a sin not to do it at this point, right? Obviously, at a certain level of development, you're going to expect more from the stack, you're going to have a more thorough process around how you're building observability. But at the same time, it is so easy to start, it's mostly free to start. So really, it's a kind of gateway drug into having this kind of setup and starting to build with this mindset.

What set of skills a Site Reliability Engineer needs

27:58

Maciek

Okay, yeah. So it's like with many different methodologies we have in IT. The same goes for CI/CD, the same goes for DevOps: it's really easy to start. And what's interesting, getting started at a later stage will be much more difficult. And tell me, if I think about starting a career as a site reliability engineer, what sort of skills would I need?

28:39

Filip

That's a tricky question. I could answer that you need all of them. But then again, yes, this also depends on the kind of organisation that you have. But personally, you know, what doesn't hurt is having experience in software engineering and/or ops, obviously, because this is the area that kind of glues together both of these specialties. I think when building a team, it will be super important to actually have people with, you know, both kinds of backgrounds. So that's one thing. I think that people working in SRE need to be really good communicators and good team players. They should be able to foster collaboration, because it's not only about, you know, hard technical skills, but it's also about the soft skills that you need to have in order to be efficient in this role. Obviously, you need to know the basics of, you know, how to deploy an application, how to use Kubernetes, things like Terraform. But then again, every company will have a very different set of tools that they're using. I would say that, at a very high level, if you enjoy engineering work, software engineering work, and you have a knack for infrastructure, you're probably set. I would say that, you know, exploring AWS is a good starting point if you want to get into this topic, because you will learn so much just by looking at, you know, different areas there. I know that it kind of goes into the area of infrastructure management and possibly platform. But then again, it's so difficult to draw the line between these areas. So I would say the more you know, the better SRE you will be.

How we typically react to a work accident

30:37

Maciek

Yeah, I understand. I mean, it's kind of a consequence of what you mentioned before, that SRE in general is a very wide topic. So you have to know a bit about everything. And one thing that SRE does is react to different accidents. So tell me, in practice, maybe you'll have some great example: what does a typical reaction to an accident look like? Like, from the beginning, some big accident happened, and what do you do, step by step? I know, it depends.

31:23

Filip

I was actually going to say that it's a fairly common process that is shared between companies that actually do have this kind of incident response management process, right? So usually, you have someone who is on call, who is able to respond to an incident, whether that incident has been created manually by, for example, a customer support team, or whether that incident has been created automatically from your monitoring, because essentially, that's how incidents in a perfect organisation should start. And then essentially, you have someone who is on call, who is an incident commander, that's the kind of official title that is being used in this area. You would have a scribe who's taking notes, you would have responders who are actually diagnosing the problem. And then, yeah, you essentially dive in, you try to figure out what is actually happening. And I think that a really great way of understanding that is actually having monitoring set not on the underlying infrastructure, but on the customer journeys, right? So if I understand how a product is broken for my customer, then I can start looking into that area. And it's kind of an all-hands-on-deck situation. You're trying to find, or in a great environment you know how to find, people who are responsible for a certain area of the application, and essentially you bring them in to help you diagnose the problem. If it's an application problem, then obviously it requires a patch. If it's an infrastructure problem, because, I don't know, a database went down, then you have to figure out how to bring it back. So the immediate focus of the incident is restoring the service. That's the number one priority, because your customers are waiting for your service to be back online. And then there's the whole debrief process that we've talked about, and essentially making sure that we learn from this process, and we try to never ever be paged about it again.
I think that a great outcome of every incident is that, in a perfect world, it should happen only once, because then you find ways of preventing it from happening again. All in all, it's a super stressful situation, because you know that you're, you know, trying to beat the clock, and you know that the sooner you can do it, the better. But it's not always super obvious what exactly you need to do. A lot of good practices can be employed to make it a little bit easier. If you have good automation around your infrastructure and your application, and you have good monitoring available for that, then you can essentially make the process as easy as possible. So for example, what we try to do is that whenever you receive an alert, there's always something for you to look into, for example a dashboard or a runbook to follow, right? So we try to cover essentially the most common cases and have them documented, so that it's easier for the person who is already under a lot of stress to actually take action. If that fails, then, well, it's gonna be an interesting couple of minutes, or hours in the worst-case scenario. But essentially, it's a lot of debugging, it's a lot of trial and error. I would say that it's probably the most stressful piece of any engineer's work. When you're working with software, with the web, you essentially have to figure out what is happening, and sometimes it's not as easy as it might seem.
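The incident roles and the "every alert comes with something to look into" practice described above could be modelled roughly like this. It is only a sketch under stated assumptions: the `Alert` and `Incident` types, the `open_incident` function and the example.com URLs used with them are hypothetical and do not describe any particular incident management tool.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    title: str
    customer_journey: str    # what is broken from the customer's point of view
    runbook_url: str = ""    # hypothetical link to documented steps
    dashboard_url: str = ""  # hypothetical link to the relevant dashboard

    def is_actionable(self):
        # An alert should always give the responder something to look into.
        return bool(self.runbook_url or self.dashboard_url)


@dataclass
class Incident:
    alert: Alert
    commander: str           # the on-call person acting as incident commander
    scribe: str              # the person taking notes
    responders: list = field(default_factory=list)
    notes: list = field(default_factory=list)

    def log(self, entry):
        # Notes taken during the incident feed the later blameless debrief.
        self.notes.append(entry)


def open_incident(alert, on_call, scribe):
    """Open an incident, rejecting alerts with no runbook or dashboard."""
    if not alert.is_actionable():
        raise ValueError("alert has no runbook or dashboard link")
    return Incident(alert=alert, commander=on_call, scribe=scribe)
```

For example, `open_incident(Alert("Checkout failing", "checkout", runbook_url="https://example.com/runbooks/checkout"), "alice", "bob")` returns an incident with the roles assigned, while an alert without any link is rejected before it can page anyone.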

35:14

Maciek

Yeah. So as a developer, I had the pleasure to be in that situation. I mean, you have a chance to hear a client yelling at you, because he thinks that, thanks to the yelling, it will be a bit faster. Well, it's not. And on the other hand, you try to concentrate and do your job. And one thing: with such an approach, I understand that the first thing is to create a hotfix patch, and the next step is to make this hotfix a proper fix, with all the tests, with all the documentation and everything, right?

36:05

Filip

Yes, I would say that's a fair assessment, definitely. I mean, alert fatigue, on-call fatigue is real, it is a real problem. If you're being paged every day, a couple of times a day, for the same problem, that means you're gonna have a problem not with your customers, but you might simply lose your team, you might lose the people who are working with you. And that's the worst thing that can happen, because there's only so much heat people can take. And I just want to briefly mention what you said: an engineer, or the whole response team, should be absolutely shielded from the customers. They should be shielded from anyone who can interrupt this process and distract them from solving the issue. So if that was the case for you, then that was not a great incident response process, right?

37:02

Maciek

No, it wasn't, it wasn't, I'm sure.

37:05

Filip

So yeah, I think that, you know, the team should be given all the freedom they need to essentially solve the issue. Obviously, you have certain roles in the incident response process whose only goal is to maintain communication between the response team and, for example, the customer-facing teams. That is also an important part of the response process. But essentially, the people who are working, you know, in the heat of the moment should be given all the tools, all the resources, and as distraction-free an environment as they can get.

What is the structure of an SRE team

37:40

Maciek

Yeah, I understand. And the last thing I wanted to talk about, I mean, you mentioned it a bit, because you described that SRE teams can differ, especially when it comes to where they're put in the company. But yeah, I wanted to talk a bit more about the structure of such a team, what they look like: are those big teams, small teams, who works with them?

38:12

Filip

There are a couple of different ways that have actually been pretty well documented. There's a book called Team Topologies that I can recommend. It's really one of the best books that I have read in a really long time about the whole business of IT, software development, and team building. So I would say this is a very good resource, and it gives you a couple of different ideas for how you might want to build out your SRE, platform, or infrastructure teams. There's also this little kind of addition, DevOps Topologies, that shows you a couple of ways of doing that, such as embedding, having separate teams, or having, you know, time-fixed collaboration. So this very much depends on what you're trying to achieve and where your organisation is in terms of, let's call it, SRE maturity, right? There are as many good ways of doing it as bad ways of doing it. And I don't want to say it's trial and error, but you have to be very mindful about the goals that you have in mind for that particular team and choose the right alignment in the company.

My team actually moved from a separate, disconnected team into an embedded one. I feel like this, at our current stage, has given us a lot of value. There's definitely much better collaboration between engineering teams and SREs. There are dedicated people who know and understand the area they're working with. So I would say that, so far, the embedded model has been quite good for my team, but again, your mileage may vary.

I very much encourage you to have a look at the Team Topologies and DevOps Topologies readings. It's really good material if you want to start figuring out how to structure this inside your organisation. And of course, there's the Google SRE book, the Bible of every SRE. Well, it's a couple of years old now, really, and you should probably take some of these experiences with a grain of salt, especially today. But then again, it's a great starting point if you want to understand the area a little bit better and start figuring out how to fit this into your organisation.

40:37

Maciek

Okay, so yeah, really, thank you so much, because I learned a lot. Like I mentioned at the beginning, SRE wasn't something that I had heard of before meeting you. I mean, SRE itself isn't such a popular topic. CI/CD has, in some environments, a problem with popularity, but SRE is even less popular. But I really like the approach. I really like the way it can fix many things, because you also mentioned the things about the teams, about how much pressure they can take. Accident response can be horrible if we aren't shielded from the client, especially the yelling type. And in some cases, let's be honest, things happen. And having round after round with such a client can be just too much for some people. And yeah, this is a simple way to lose an administrator or a developer, depending on the role. And it is really interesting that SRE, maybe it's not its main role, but it also takes care of the team a bit, just through all the processes, all the procedures. And also, the moment I asked you about the typical response when an accident happens, you mentioned there is a process for that. So yeah, that's normal, there should be a process for everything in such a broad area. And also the blameless review, this is a very interesting concept. And I think, in some ways, it should also be adopted in other branches of IT, or in some cases it already is.

Q&A

43:10

Maciek

So from what I see, we have one question: when should you add an SRE team to your project?

43:23

Filip

I want to say: start as soon as possible. Like I said before, I don't think you need a team to begin with. I think that having someone who can enable this culture, even just a single person, is already a huge win for the team and for your system in general. So I would say start as soon as possible. But then again, I realise that for many companies, building out separate infrastructure teams or SRE teams is an afterthought, right? And I completely understand this from a business perspective, because if you're building a product, your first thought is to figure out: okay, do we have customers for that? What is the market fit? Can we even succeed? Can we have the scale that justifies building teams that support the main process? So I would say start as soon as possible; again, you're not going to regret it, probably. But the reality is that a lot of companies think about these kinds of support teams as an afterthought, which, again, is a very fair business case. So, a lot of the time, when the SRE team comes in, or an SRE person comes in, the first thing they do is evaluate what they have arrived at and try to find the areas of immediate work that needs to happen in order to bring certain aspects of this culture up to a certain level.

44:54

Maciek

Okay, yeah, I think that answers the question. And during the webinar, we asked a question in our YouTube poll: are you familiar with SRE practices? And what's interesting is that some people confirmed that they know at least some SRE practices. That's great. I'm kind of surprised because, like I said, I felt that fewer people would, but that's the truth, right? Great to hear. And I want to say that if you want to get notified about more of our webinars, join our Meetup.com group called Buddy CI/CD Group. Also, don't forget to subscribe to our YouTube, or maybe Twitter, Facebook or LinkedIn, it's up to you. It's also a great place to get informed about all the upcoming webinars. And in two weeks, we will be hosting Christian Dangl, and we will talk about how to automate tests with Cypress and TestRail, and of course with Buddy, so this should be a really interesting webinar. So Filip, thank you once again. It was a very, very interesting webinar. I really learned a lot, and I think that I will be buying this book, Team Topologies, because it really sounds interesting, because I know how different team topologies can affect the way we work. And also, from my perspective, I would like to learn how some people think about where to put the design team, because this is also a very interesting topic. Probably not in this book, but still an interesting one. So, Filip, thanks once again, and thank you, everyone, for being with us. Thank you while I interact, I see that you are with us every time. It's really great to have you here, and see you in two weeks. Bye, everyone.