The Bike Shed: 329: Fire Mode



Steph is excited to be headed on a retreat with her mom in the mountains, but before that, she details how she helped troubleshoot a production issue with her team and appreciated their process. She’s also looking into tooling around spinning up more machines to process more RSpec tests.

Chris had a developer start their new job at Sagewell and highlights how they involved the new person in rectifying potentially missing and/or confusing existing documentation. He also has a gripe, and that is accounts. Handling too many accounts. Additionally, he talks about triaging an error and how it was tough initially to understand if something was actually broken. And then it was even harder to understand what was broken. So he paired through it and used the power of putting two heads together.


This episode is brought to you by ScoutAPM. Give Scout a try for free today and Scout will donate $5 to the open source project of your choice when you deploy.


Become a Sponsor of The Bike Shed!

Transcript:

CHRIS: Hello and welcome to another episode of The Bike Shed, a weekly podcast from your friends at thoughtbot about developing great software. I’m Chris Toomey.

STEPH: And I’m Steph Viccari.

CHRIS: And together, we’re here to share a bit of what we’ve learned along the way. So, Steph, what’s new in your world?

STEPH: Hey, Chris, I am going on vacation next week, and I am so excited about that. It’s going to be pretty much a week long. It’s like a Tuesday through Friday ordeal. And it’s a trip that I’m taking with my mom. So over the past year, she’s gotten super serious about her health and nutrition and done a phenomenal job of being very focused on a plant-based diet, which is basically healthy vegan food is what that comes down to.

So there is a retreat that’s taking place in the North Carolina Mountains that she’s really excited about. I’m going to go with her. We’re going to do lots of cooking, and hiking, and hanging out in the mountains, and it’s going to be lovely.

CHRIS: Well, that does sound lovely.

STEPH: Yeah, it seems like a really perfect time to disconnect just because you’re headed into the mountains. So all you should take with you are books and things that are not iPhones, and tablets, and computers, and screens. So I’m looking forward to that, just to be away from screens for the week.

On some more technical news, this past week, I helped troubleshoot a production issue, which was a bit novel for me because the work that Joël and I are doing with our current project it’s all in the testing realm. And so it was probably around 10:00 o’clock at night my time, and I got a ping on Slack. And it looked like I was getting called in for a production issue.

And I was like, I have touched zero production code. [laughs] So I’m very intrigued how I could have broken production at this point. And so I looked into it, and it turned out that it wasn’t necessarily related to a commit that I had authored, but it was for a commit that I had reviewed and then approved. And so their strategy is they create a new channel. They’d gotten a ticket that an error was occurring.

And then the site reliability team created a new Slack channel, and then they pinged everybody who authored, reviewed, or approved that change to be like, hey, we think the issue is related to this commit. Our plan is we’d like to roll it back. But before we do, we just want to check in with folks who have more knowledge to help us confirm that, yes, this error message seems related. And I really liked that approach.

I really like the idea that it’s not just the person who merged the commit that then gets pinged on it, but it’s like everybody else who happened to look at this and review it come help us too. So we spent some time looking into it, confirmed that yes, indeed, it was related to that particular commit. And then their team did the wonderful thing of then rolling it back. So then, it was no longer an escalated issue.

And so then I asked, “What else can I do to help?” And they said, “Well, from here, it’s no longer a production issue. So tomorrow, just follow up with the author and let them know and issue a fix for the bug, and then merge it like normal.” So we’re back in that normal pull-request flow, very calm.

And overall, I just appreciated their process. I like very much how they pulled more people in because I think some of the other people that were involved weren’t online, which makes sense because it was really late. So that way, you spread the request around: in case some people really aren’t available, hopefully you’ll get lucky and one of those three or four people is available to help you troubleshoot.

CHRIS: That does sound like a really nice and thoughtful and intentional bug response, communication, procedure, rollback, et cetera. All of that sounds like it worked very well and is nice to have. And it’s the sort of thing that a larger organization ideally gets to, having these sorts of processes. Spoiler alert, later in the episode, I will talk about the other side of it of being a very young organization and trying to be like, wait, is this a bug? Is this not a bug? Should we roll back? What do we do? That’s actually my topic du jour.

But what you’re describing sounds like the calm even in the case that there is a fire sort of like, yep, we’ve got procedures. We have workflows. We have communication channels and ways that even the exceptional things can be handled in an ideally as calm as possible way. So that’s awesome that that’s what you got to experience there.

STEPH: Yeah, getting called in at 10:00 o’clock is never fun for anybody. But when it happens, because it’s going to happen, then I appreciate the thoughtfulness and that process that they put behind it. So it all went fairly smoothly. And it was also one of those fun things where I haven’t met…like this is a very big organization, so I hadn’t met any of those people.

So when I got pinged on it, and then I hopped in, I was like, hi, I don’t know anything about this process and what y’all are doing, but I am here. I’m here to help. Where can I look? What can I do? So it was also a fun endeavor in that regard to just be like, I don’t know what I’m doing, but I am here to help. Please let me know how I can help. And it ended up working pretty well. So yeah, that’s been a fun adventure for this week. How about you? What’s new in your world?

CHRIS: What is new in my world? Well, we had a developer start this week, which has been really wonderful. Unfortunately, we had scheduled their first day to be Monday, which was Presidents’ Day, and that’s a holiday. So we got out in front of that one and figured it out. We’re like, no, no, actually, feel free to start on Tuesday. We’ll not be around on Monday, so you shouldn’t be around on Monday. But then, on Tuesday, they started.

And we intentionally structured things such that we have a contractor that has been working with us for like seven or eight months now. So it’s been a long time and been very formative as well the work with that contractor. So this is their last week, and thus, we very purposefully brought the new person on the team and that contractor together to maximize the amount of pairing and overlap that we have there just to try and as intentionally as possible grab whatever is in their head, get another point of view.

Because this new individual on the team will be able to work with myself and the other full-time developer on the team a bunch moving forward, so we want to maximize their overlap with the person who is on their way out. But otherwise, it’s been great. We’re a young organization, so the version of onboarding it’s me running around setting up a lot of accounts, forgetting to set up other ones, getting pings in Slack, and then following up and setting up another account. Eventually, I hope that there are checklists and formalizations and, ideally, one-click SSO magic that makes all of that work. But for now, I’m happy to chase it down.

But really, we’re just leveraging pairing as much as possible as the onboarding tool to make sure that where we don’t have formalization, procedures, documentation, et cetera, as thoroughly built out as I would love to be at, we can shore that up with some time with other humans.

STEPH: That’s awesome. It’s always fun having someone new to join to highlight all the things you need to automate or at least have a checklist for to then help them onboard. But that’s really exciting that you’ve got a new teammate.

CHRIS: Yeah, definitely very exciting. And they’ve been great. They’ve hit the ground running with a couple of pull requests already and are just contributing very effectively within their first couple of days. So that’s always wonderful to see. We are definitely taking this moment to document what is undocumented or update the README where it needs to be and start to make that checklist. We have another person who will be starting in about two weeks’ time. And so, ideally, that will be even a little bit more fleshed out of a process. So slowly, incrementally, we get a little bit better with each person we add.

STEPH: How much do you involve the new person in creating that documentation? Is that something that you ask them to help build, or is it something you take ownership of? What’s that balance?

CHRIS: It’s interesting. So definitely some I want to be with that person because I think it can often be the easy first PR is an update to the README for like, oh, I tried to set up the app, and it did not work. For this reason, I have now updated the README, and now there’s a pull request. And we get to experience that flow via the very low-stakes change of updating the README. So that’s a definite one that I like to have.

The other is I’ll typically ask for the individual to capture as much as possible. There’s a very delicate line in my mind between empowering them and being like, yes, absolutely. We’re young. We don’t have everything documented. So feel free to make changes where that makes sense to you. But at the same time, I know that joining a new team can be complicated, can be intimidating in certain ways. You’re not sure what’s okay to change? What’s not okay to change? That sort of thing.

So I simultaneously don’t want to put the pressure on someone to be like, “Yeah, no, change anything you want. Literally, nothing is stable here. Nothing’s glued to the ground. So feel free to pick up anything and throw it out the window.” That feels too far in my mind. So I don’t have an actual answer like, I’m ideally calibrated at this point. But it’s sort of those two tensions that I’m holding in mind as I think about that.

STEPH: Well, I really like your answer. I like that balance because I think it’s really nice to include the person in those changes and also just because they’re going through it. So they happen to have that insight, and it’s fresh. But I agree, when you’re joining a job, you want some stability and confidence that the people that you are joining that team with are also working hard to make it a very positive onboarding experience.

And if you just were to push all of that responsibility on to them to be like, “Yeah, we know. We don’t have this organized yet. So you tell us everything that we need to do,” that would feel unkind to that new person. I think as a new person that I wouldn’t fully enjoy that. I don’t mind some of it, but I wouldn’t want all of it. I’d have nervousness around ownership, around improving processes, and who that belongs with.

CHRIS: Sort of a classic case of it depends, or it’s a little from Column A, a little from Column B, but definitely some, just hopefully not too much.

STEPH: The Goldilocks of onboarding, some onboarding responsibilities, but not all of them, just the right amount. [laughs]

CHRIS: Shifting gears slightly, though, I just want to gripe for a minute. I’m just going to gripe. This is not my normal mode, but I’m going to lean into it.

STEPH: Do it.

CHRIS: Accounts, just accounts. I have so many accounts now. There are so many across different systems, and I’m trying to do the good thing, which is let’s stop using personal accounts for anything and only use organizational accounts for the things that are for work. And some organizations do a great job with this.

GitHub, I’m looking at you; really well done, super happy with the way that you folks have implemented accounts. You get that I am one human being that contains multitudes. I am my personal self; I am my work self. I am maybe even another version of work, and you get that. And you usually let me exist as all of those versions of myself and, man, do I appreciate that.

Heroku, you’re okay. Like, it’s all right. You treat the different facets of me as different accounts, but that’s okay. You make it relatively easy to switch between. Although you do make me two-factor auth and re-login every single day, and I don’t love that. So I don’t know what’s going on there, but fine.

Trello, aka Atlassian, I guess at this point, come on, what are we doing? What’s going on here? So originally, I had started, and I had the one Trello account, and I had my personal boards. And then there was the Sagewell organizational account. And within that, there were some boards, and I would just bounce back and forth.

But I realized, no, I need to do the right thing. So I created a new Trello account. And now Atlassian just forces me to switch between them, and it loses the link that I’m going to often. It’s a different login interstitial screen. And it constantly shows me that like, hey, you don’t have access to this. Do you want to switch accounts? And I say yes. And then they take me to a screen where I can pick between two options, the one that I was that didn’t have the ability to do it and another.

And as a developer, I know that the thing I’m about to say is not fair. But come on, folks, you could know the answer to this question. There are two, and one is the wrong answer, so the other one is probably the right answer. You don’t need to autolog me into that; I get it. Just emphasize it because they almost look identical on the list.

I have now accidentally tried to request access with my secondary account to my other account, and I can’t get out of that state. So now, one of the ways that I try and do this it shows me a list of them to pick. The other it says, “You have requested access. We’re waiting to hear back.” And I’m like, no. So anyway, that’s a thing.

STEPH: So I know people can’t see me. [laughs] So I’ll narrate that I’m dying over here because I very much appreciate that we are positive people. We are very focused on bringing positive energy, but the descent into the amount of shade that you’re throwing at different applications [laughter] just really made my day, and I feel that pain.

I have felt that pain with Atlassian and can relate. And we should have some gripe sessions. This feels healthy. This feels very…okay, well, I don’t know for you. I’m the one that’s laughing and getting joy out of this. I don’t know if it’s helpful for you, but it feels very cathartic to me. [laughs]

CHRIS: It is definitely somewhat cathartic. I think there’s utility in having these sorts of conversations. And throwing shade at Atlassian, whatever, they’re doing fine, so I’m not super worried about it. But generally, we try and keep things positive because I think that’s, frankly, a more effective way to communicate.

But occasionally, it is useful to look at the things where I’m like, that is a pattern that I do not want to repeat. And I’m sure that there are complex organizational enterprise-y reasons that it has to be this way. But I can look at that and say never that. That experience as a user is like, wow, yeah, I just tripped over nine layers of your enterprise there just trying to do very simple day-to-day things for myself. So I want to avoid that. I’ve griped about that one login, not the company OneLogin.

But that one login page that I’ve experienced where I start to interact with the form, and suddenly some JWT handshake in the background happens, and I’m now logged in. And it just rips the page out from underneath me. That is unacceptable. That is not okay. And I really do think there’s something worth occasionally looking at those and being like, well, not that. But anyway, I should probably stop my gripe session now.

STEPH: [laughs] Well, if I may join in, I have one that I’d like to share. Since we’re on this —

CHRIS: Throw it on the pile. What else we got? [laughs]

STEPH: [laughs] So there was some code. There was a piece of code that I was looking at that was very not friendly. It was difficult to understand. It took a while to parse through what are they actually doing? What records are they creating? Why did they choose this manner? Why are we iterating over these particular numbers? What’s the outcome here? And I was pairing with Joël and was going back and forth having a conversation trying to be the detectives of why this code exists, and we finally got there. And we finally understood what it’s doing and why. And I just lost it for a minute once we finally got there. [laughs]

I just thought the way this code is written, it does not improve readability, and it doesn’t improve performance. All it did was make my life harder because it was very difficult to read. So all they did was become really clever with the code that they were writing and essentially drying it up, which I have such a beef with DRY because it has caused me pain. And so they essentially were drying up their code or introducing a way to make it just take up fewer lines that took up less vertical space. But overall, I was very grumpy about it.

And Joël was very kind about it and was like, “Well, this is the type of code I could see maybe why they did this.” But you’re right; it doesn’t help with readability and performance. And he was helping balance out my grumpy goose moment. I’ve been having a lot this week; maybe it’s just the week I’m in. I’m in more of a fiery mode this week [laughs] with some of the code that I’m seeing, and that was one of them. That was the please, please, please don’t DRY up your code. If it doesn’t improve readability or performance, there’s just no need. There is no benefit.

CHRIS: Well, I definitely know that feeling. And I think I’ve probably, as a developer, gone through that arc where early on I was just trying to make stuff work, and then I learned how to be clever. And suddenly, being clever became a game that I could play.

And then, pretty early on, I realized I would come back to my own code from two weeks ago and be like, what the heck does this do? I have no idea. And that’s when I was drawn to Ruby. That was one of the things. I’m like, oh, I can write code that looks so much like the clear words that I have in my head about the thing. I like that. And so much of my career has been spent in the let’s make it obvious and revisitable.

I actually remember very clearly early on in my time at thoughtbot, I was working on something and was working on it with Joe Ferris, who is the CTO of thoughtbot and a very clever individual, and I mean that in the truly positive sense of the term, one of the most capable engineers I’ve ever worked with. He was describing an anecdote, but it was basically he’d put up a pull request. And someone replied, “Oh, that’s clever.”

And Joe’s reaction was, “Oh, crap.” Just taking that as not an insult but as someone saying, oh, that’s clever in a positive way, and Joe hearing that in the negative form of I went too far here, or this is not obvious in its initial interpretation. That really stuck in my head from there, just his reaction to it immediately of that being not a good thing. And I was like, that is interesting. And all the more so over time, I’ve come to believe that clever is probably something to avoid in code.

STEPH: Yeah, agreed. I’m at the point that if I do see someone who’s done something that I do think is clever in a positive way, I will still abstain from using that word clever because I do want to make sure they don’t think that I’m saying in a bad way that this is clever, that it’s not readable, and it’s not friendly. So I totally avoid that word when I’m complimenting someone’s code just to make sure there’s no confusion.

CHRIS: It’s one of those words that got away from us that we lost the definition of, and then we came back, yeah.

Mid-roll Ad

Hi, friends, and now a quick break to hear from today’s sponsor, Scout APM.

Scout APM is an application performance monitoring tool that’s designed to help developers find and fix performance issues quickly. With an intuitive user interface, Scout will tie bottlenecks to source code, so you can quickly pinpoint and resolve performance abnormalities like N+1 queries, slow database queries, and memory bloat.

Scout also recently implemented external service monitoring, adding even more granularity when it comes to HTTP requests and API calls. So give Scout a try today with a free 14-day trial and experience first-hand why developers worldwide call Scout their best friend.

And as an added bonus for Bike Shed listeners, Scout will donate $5 to the open-source project of your choice when you deploy. To learn more, visit scoutapm.com/bikeshed. That’s scoutapm.com/bikeshed.

CHRIS: Let’s see. In other news, you had mentioned this earlier, and then I had mentioned my side of it but errors in alerting and all of those sorts of things. They’re an interesting question. We had a small situation over the weekend that turned out to be kind of real, kind of not real. But I happened to be away on vacation. I did have my computer with me because, at this point, we’re early enough. And I’m like, I’m going to take my computer everywhere and just be ready in case it’s necessary.

And in this case, I did get a ping. I looked into it and what was unfortunate is it wasn’t immediately obvious if something was broken or not. And to a certain degree, that’s always going to be kind of true. There’s so much noise, so many requests hitting a web application. And how do you tell the good ones from the bad ones? And ideally, I could threshold around certain volumes of traffic, but even that’s going to have spikes, and ebbs and flows and things like that.

So it was very hard initially to understand is something actually broken? And then all the more so to understand what was broken. Thankfully, it was tractable. It was solvable. And we’ve done, I think, some good work especially considering how early on we are and how we’ve instrumented things in Sentry, in particular, our usage of Sentry and also somewhat in the logs.

But again, I think I’ve talked about this before, but I’m feeling this tension around there’s data. There’s data just kind of like, what happened? And right now, we’ve got logs. That’s one of the places that goes to Sentry if it gets escalated up to that level. And we sort of have a weird Venn diagram between logs and Sentry. And then we also have analytics as another thing and then eventually data science, and what do we want to try and learn?

And all of these kind of want different facets; it’s not the same data set. But I wonder, is there a superset of data that then we could filter and slice and cut up, do all those sorts of things? I think this is the dream of Honeycomb and platforms like that, but I’m not even certain if that’s true. And so I’m in that awkward middle space is how I would describe it.

But in that particular case, I was able to resolve it. I did take away as an action it’s probably time to start thinking about PagerDuty anomaly detection, that sort of thing. When does alerting happen? When do engineers actually get calls when not just during the normal nine-to-five of the workday? So I’ll be investigating that in the coming weeks and see where we get to. But it’s sort of the first thing that really pushed us in that direction.
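
One way to picture the thresholding idea from this stretch of the conversation is to compare the current error rate against a recent baseline instead of a fixed number, so that ordinary spikes and ebbs in traffic do not page anyone. The sketch below is only an illustration under assumptions: the function names, the three-sigma cutoff, and the minimum sample count are hypothetical, and it does not represent how PagerDuty or Sentry implement anomaly detection.

```python
from statistics import mean, stdev


def error_rate(errors, requests):
    """Errors as a fraction of requests; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def is_anomalous(history, current, min_samples=5, sigmas=3.0):
    """Return True when `current` is well above the recent baseline.

    `history` is a list of earlier error rates (for example, one per
    five-minute window). The cutoff of `sigmas` standard deviations and
    the `min_samples` floor are assumptions chosen for illustration.
    """
    if len(history) < min_samples:
        return False  # too little data to judge, so stay quiet
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current > baseline
    return current > baseline + sigmas * spread


recent = [0.010, 0.012, 0.011, 0.009, 0.010, 0.011]
print(is_anomalous(recent, error_rate(12, 1000)))  # ordinary window
print(is_anomalous(recent, error_rate(50, 1000)))  # sharp jump
```

Using a rate rather than a raw error count is the design choice that matters here: a busy hour with proportionally more errors looks the same as a quiet one.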

The other thing I’ll say is we have the idea of the point dev, which I’ve talked about on a couple of episodes. But the idea is for each week, one individual on the engineering team is in charge of the noise, for lack of a better term. They’re looking at the error stream in Sentry. They’re looking at any ad hoc requests that are coming from our admin team, et cetera, et cetera. And that’s been really great.

But one thing that I’ve noticed is that dealing with the errors is particularly tricky and what we did in this particular case was just to pair on that. As an individual, it is really hard sometimes to reproduce, sometimes to just understand. These are the things you didn’t expect in your code, and therefore they are, by definition, harder to understand, harder to think about.

And then sometimes you get to an understanding. You’re like, ah, what do we do about that? Do we care? Do we not care? Is this just noise? Is this something we should solve? Is it something we should solve soon? Or is this something we can solve whenever we get to it in the backlog? And making that sort of determination is all the harder.

And so I’m increasingly of the mind that there should be some amount of time that is pairing on that error backlog to bring two heads together. I hadn’t been thinking of it this way, but I’ve now come around to thinking this is a really great place for pairing because it’s so hard for one individual to deal with that complexity to make the hard value judgments. And to do that, if each individual does that in a vacuum, then we have n different value systems at play that are hopefully very similar.

But if we start to pair up, then there’s osmosis between those groupings. And ideally, we sort of coalesce towards a shared value structure around, like, what can we ignore? What should we snooze for a week? What should we put in the backlog? What should we prioritize and fix immediately? Because I think those are really hard things to otherwise…that’s really hard to document, I would say. I would love to write up a page in the Wiki that says, “This is how you treat errors,” except each error is a unique snowflake, and you just have to follow your values.
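
The four outcomes named above (ignore, snooze for a week, backlog, fix immediately) can be written down as a small decision function. This is a rough sketch and not a replacement for the judgment being described: the `ErrorReport` fields, the `many_users` cutoff, and the ordering of the rules are all hypothetical, and a real team would fill them in with its own shared values.

```python
from dataclasses import dataclass
from enum import Enum


class Triage(Enum):
    IGNORE = "ignore"
    SNOOZE = "snooze for a week"
    BACKLOG = "put in the backlog"
    FIX_NOW = "prioritize and fix immediately"


@dataclass
class ErrorReport:
    # All fields are assumed inputs for illustration only.
    users_affected: int
    blocks_core_flow: bool
    known_noise: bool = False


def triage(report, many_users=50):
    """Map an error report to one of the four outcomes."""
    if report.known_noise:
        return Triage.IGNORE
    if report.blocks_core_flow:
        return Triage.FIX_NOW
    if report.users_affected >= many_users:
        return Triage.BACKLOG
    return Triage.SNOOZE


print(triage(ErrorReport(users_affected=3, blocks_core_flow=False)).value)
```

Writing the rules down, even this crudely, gives a pair something concrete to disagree with, which is where the shared value structure comes from.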

STEPH: I have been on teams where we’ve written up documentation that helps you triage an error because you’re right; you can’t write documentation around a specific error. But that I always found really helpful where it was like, here’s all the links that you can look at, here are some recommendations. When we were working on an application that was falling over more often, there were some specific outlines around if you see this problem, then this is typically how you can solve it. And then we had to fix that at a larger scale, but it was a nice band-aid to get us through at that point.

I like the idea of pairing, especially as you mentioned; it’s tricky. It’s funny when you mentioned capturing those errors and putting them into the backlog because I like that idea that then you can prioritize and bring those into the sprint. It just made me feel a bit hesitant. If we don’t work on it now, we’re never going to work on it. But then that feels unfair to say because it really comes down to the team.

If you have a team that’s going to be able to look at those errors and say, “Yes, we’re going to bring them in and prioritize them,” then that feels really good to then be able to say, “This is an error. Let’s capture it. Let’s provide some content around it. But it doesn’t need to be addressed at this moment. It’s still pretty low in terms of risk for users or at least low in impact for users.” So yeah, I guess it just depends as long as the team feels good about being able to prioritize errors, which I feel confident that your team would be able to do. And if you can’t, then y’all could reassess that plan.

CHRIS: That’s why we definitely have that. We’re revisiting the errors. They’re part of the same backlog as everything else. So they’re coming up in relative priority and getting worked on and getting resolved. But we’re also shifting our thinking just a little bit to say, “We should take a little bit more time in the moment to try and resolve some of these where we can.” I have the dream of there are just zero bugs ever. But that’s hard, especially in different platforms.

And we’re seeing a lot of mobile traffic from different older Android versions and so weird JavaScript edge cases and things like that. Like, why does your runtime not have Object? That feels like a thing every JavaScript runtime should have. But that’s a joke. Every JavaScript runtime, I’m pretty sure, does have Object but that sort of thing. It’s like, whoa, this is weird and specific to this one device. Cool, those are fun. So yeah, giving a little bit more time to do those.

And again, so we definitely do have the document that describes here are the places to look and how to think about this category of error and this category of error. But at the end of the day, you get one that’s just like, there’s not a ton of detail in the error. It’s hard to reproduce. It might be device-specific, et cetera. And so what do you do in that moment? And that’s where we’re trying to…I think pairing is a great way to share that thinking around the team. So overall, it’s been great, though. I think everyone who has been involved has been like, “This was better than when I did it on my own,” so cool.

STEPH: Awesome. That sounds great.

CHRIS: Yeah, I think so. This is one of those ever-evolving facets of how we work as a team and how we build the platform. So I will certainly report more in future episodes, but for now, happy with that. And yeah, what else is up in your world?

STEPH: Yeah. So we’ve been looking specifically into tooling around how we’re going to spin up more machines to process more RSpec tests. So specifically, we have around 80,000 RSpec tests that we are processing, and we have one machine that is parallelizing those. The whole build comes to about 30 minutes in total, because there are other tests and things that get run beyond that RSpec portion.

But for the RSpec portion, I think it’s probably around 20-ish minutes to process those 80,000 tests. So we split that across four different containers, and then we run those tests. And so we’d really like to spin up more machines to then process because we’ve reached the point that we have given as much power to that one machine as possible. So now we’re looking to add more machines.

And one of those solutions that we’re looking at is using Buildkite, which is built with the idea that you can add these build steps so then you can more easily say, “All right, once we get to this particular build step, hey Buildkite, we’d like to run n number of machines to process all these tests.” And that seems really nice. And it is something that we are interested in. It is actually what Shopify uses. They use Buildkite along with ci-queue, which is built for Minitest, which is what they use, and Redis to then run all of their tests.

But we are using TeamCity, so we’re not using Buildkite. And we would like to see if we can grow with our current CI infrastructure versus having to move to a new one. There’s a lot of just risk involved in moving to a new one. And so we’ve been studying hard if TeamCity will let us do this. And so far, the answer has been no.

But just recently, we found somewhere in the docs that it looks like there is a chance that with TeamCity, we can inform TeamCity that, hey, even though we have just this one build step, instead of only giving us one agent or one provisioned machine to then run these tests, instead that we actually want to spin up a couple of machines to then process these and then aggregate the results back to this one step. So we’re looking into that.

But I wanted to throw this out there in case anybody else is also using TeamCity and has already invested in this particular approach. I would love to hear about it because we are currently figuring out the capabilities and if this is something that we can stay with our current infrastructure or if we’re really going to have to look for a new solution.
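
Whichever CI platform ends up running the extra machines, the suite first has to be divided among them. The sketch below shows one simple option, static sharding by recorded runtime: sort spec files from slowest to fastest and always hand the next file to the least-loaded machine. It is an illustration under assumptions, not TeamCity's or Buildkite's API, and it is a different technique from a shared work queue that workers pull from. The file names and timings are made up.

```python
import heapq


def split_by_runtime(runtimes, machines):
    """Greedily assign spec files to machines, balancing total seconds.

    `runtimes` maps a spec file path to its last recorded duration in
    seconds. Returns one list of paths per machine.
    """
    # Heap of (total seconds assigned so far, machine index).
    load = [(0.0, index) for index in range(machines)]
    heapq.heapify(load)
    buckets = [[] for _ in range(machines)]

    slowest_first = sorted(runtimes.items(), key=lambda kv: kv[1], reverse=True)
    for path, seconds in slowest_first:
        total, index = heapq.heappop(load)  # least-loaded machine
        buckets[index].append(path)
        heapq.heappush(load, (total + seconds, index))
    return buckets


timings = {
    "a_spec.rb": 8,
    "b_spec.rb": 7,
    "c_spec.rb": 6,
    "d_spec.rb": 5,
    "e_spec.rb": 4,
}
for number, files in enumerate(split_by_runtime(timings, machines=2)):
    print(number, files)
```

Static sharding needs timing data from an earlier run and can drift out of balance as tests change, which is one reason a queue that workers pull from is attractive for a suite of this size.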

CHRIS: Well, I’m hopeful that someone out there can give you some input. I definitely get the idea that you’re stuck, and stuck is maybe too strong of a word. But if TeamCity is not ideal, the idea of moving off it does feel exceedingly heavy and the riskiness that you talked about. That’s, I think, a critical word here because I think it’s easy to think of CI as like it’s a very important thing. But that’s absolutely critical as part of your deploy pipeline, I assume.

This is speaking generically about CI, and so it is, in fact, a critical piece of the infrastructure. If you’ve got a bug on production and suddenly CI is down, what do you do? I guess you can test locally and decide you’re going to push past it, but then you have to circumvent it. And so I understand the intentional way that you’re thinking about that and the risk associated.

I do wonder, though, if TeamCity has felt like not the right platform for a while and if there are considerations. Is there the possibility of both trying to improve the world that you have now, so it’s not the big move off of it but then also in parallel start to work on an alternative implementation? This is perhaps not entirely fair, but it feels like a Rails application is this repository of code. And typically, CI is configured via a file.

And that’s like, if you’ve got your teamcity.yaml or whatever it happens to be, could there also be a buildkite.yaml that is not on the critical path for deploying or anything like that? But it is a way to, frankly, somewhat inefficiently test on two different platforms but start to see if you can get the code moving on a different platform and be able to gradually build out and make that transition possible without it being one big swap over sort of thing, which eventually it would need to be. But just wondering, is that happening in parallel? Is that a possibility?

STEPH: I think the short answer is, I’m sure there is. There’s a way to look at the existing system and then find ways that we can tweak it. But I also know that the team has already invested a lot into working with the current system and making it as efficient as possible. So I don’t know if there’s any true big impact but intermediary steps that we can take. We are definitely in that proof of concept world. So we’re not going to move anything over for the rest of the team until we can really prove that something is working for a small subset and then start to expand from there.

But currently, our idea is to dig further in TeamCity, which I think also includes just a call to their team and say, “Hey, we’d love to talk to one of your engineers and see if the thing that we’re trying to do if it’s possible. Let us know if it’s not and if we need to look elsewhere,” which is intriguing to me because having a lot of tests isn’t new. There are tons of companies that have lots of tests, and they want their CI test suite to be fast.

So a company that has built software that helps teams execute these steps, and then the ability to say, “Hey, I want more machines to process. I want to give you more money and to give us more machines, and we can process more things.” I feel like that should be a thing. And I’m getting at the edges of my knowledge. This is why we’re exploring all of this. But it has been surprising to me to realize that that doesn’t seem as easy of a thing as I would have expected it to be.

There are also some other concerns around here where the client that we’re working with if we’re going to work with third-party vendors, then we have to get special approval to work with them. It’s not just a hey, we can just go try it out. It’s a lengthy contract process that we’d have to go through. So there are also some constraints that we have to keep in mind where we can’t just work with anyone. We need to be careful to make sure that they’re certified in a particular way.

So yes, I like your idea. I will definitely keep it in mind. But I don’t know if there are any true intermediary steps yet other than the building out a proof of concept and then finding small ways that we could move over. Then I think that would be ideal for sure. And then hopefully, if there’s anybody that’s listening that has experience with TeamCity or Buildkite, that’s the other tool that we’re looking at using, let me know. I would love to chat about it and find out your experience. On that note, shall we wrap up?

CHRIS: Let’s wrap up. The show notes for this episode can be found at bikeshed.fm.

STEPH: This show is produced and edited by Mandy Moore.

CHRIS: If you enjoyed listening, one really easy way to support the show is to leave us a quick rating or even a review on iTunes, as it really helps other folks find the show.

STEPH: If you have any feedback for this or any of our other episodes, you can reach us at @_bikeshed or reach me on Twitter @SViccari.

CHRIS: And I’m @christoomey.

STEPH: Or you can reach us at [email protected] via email.

CHRIS: Thanks so much for listening to The Bike Shed, and we’ll see you next week.

ALL: Byeeeeeeee!!!!

ANNOUNCER: This podcast was brought to you by thoughtbot. thoughtbot is your expert design and development partner. Let’s make your product and team a success.

Support The Bike Shed




