Tuesday 31 March 2015

Blame-Free Culture

Today I managed to fill up our dev server so completely that you couldn't run rm -f, because the box didn't have enough memory. I couldn't even do an ls -l to see which file was the offender. As a side note, I found out this happens (on a Solaris box) when you fill up the /tmp directory: /tmp there is usually backed by virtual memory, so a full /tmp leaves no memory to start external commands like rm or ls, only the ones built into the shell. The solution is to do > filename, which redirects nothing into the file and truncates it to zero bytes (if you happen to know which file is the offending massive one). Because the shell handles the redirection itself, it still works. Hopefully that frees up enough memory to let you start deleting things. I just kept "zero-ing" files until I had enough space to list the directory contents.
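For anyone who wants to try the trick somewhere safe, here is a minimal sketch of the idea. It is not what I actually typed on the server: the scratch directory and the file name huge-build.log are made up for the demo, and I'm assuming a POSIX-style shell (sh, ksh or bash). The leading : is the shell's built-in no-op, which just makes the redirection explicit.

```shell
# Demo only: scratch_dir and huge-build.log are made-up names, not the real files.
scratch_dir=$(mktemp -d)   # mktemp is an external command; fine in a demo, not in the emergency
big_file="$scratch_dir/huge-build.log"
printf 'pretend this is gigabytes of build output\n' > "$big_file"

# Truncate the file to zero bytes using only the shell's own redirection.
# ':' is a built-in no-op, so no new process has to be started.
: > "$big_file"

# Stand-in for ls: the shell expands the glob itself and echo is a built-in.
echo "$scratch_dir"/*
```

In a real emergency you would skip the setup lines and point the redirection at the file you already suspect. The echo line is only a rough substitute for ls, as it shows names but not sizes.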

I didn't know why /tmp was suddenly full, as all I did in the morning was kick off a build for a project I had been working on, something I had done without issue hundreds of times. It turns out that someone had changed the deploy script we were using and introduced a typo, which caused it to copy every version of our build rather than just the specified one.

I told Intern Daniel about it, as he happened to walk by my desk while I was panicking about killing the dev server, and he made fun of me for breaking things. Usually it's me making fun of him for breaking things.
Me: I worked out what the problem was. There was a typo in the deploy script.
Intern Daniel: lol, so who did it?
Me: I don't know, can't be bothered looking through source control to see who checked it in.
Intern Daniel: You got your certification, and now you're so lazy!
Me: No, I'm not being lazy! It doesn't matter who broke it, because it's fixed now.

Quick note for non-programmers: Programmers tend to use source control to keep track of changes to the code base. This lets you revert to an older version in case some new code breaks everything, and also lets multiple developers work on a single project without doing crazy things like emailing each other files every time someone changes something. Checking in your code is when you decide to push your changes up to the "master copy" so that other people can pull them down and incorporate them. It's generally considered bad etiquette to push up broken code, as other people will pull it down and think that they have broken something when it was actually you. If you do check in broken code, it is your responsibility to fix it.

When I first started in this team I was really nervous about breaking anything. It took me so long to build up the courage to check in my first change because I was afraid the entire project would come tumbling down and the bank would explode. (If you're wondering whether I checked in any code in my last team, yes, but I had my own branch which nobody pulled from, and nobody looked at until my last week in the team, so it wasn't like my work was going to interfere with anybody else's.) Obviously, that meant that anything I did took a lot longer than it should have, because I was too afraid to try anything out. On one hand, it meant that every single thing I checked in had 100% code coverage, and I had run every single test I could think of against it. On the other hand, it really shouldn't take me a month to make a small config change. At least, from my team's point of view, I was a "free" resource at the time: the team wasn't paying for me, the graduate program was.

Eventually, I did screw up, and I broke something. I didn't even know that thing existed, but one of the other developers pulled me aside and said, "Hey, that change you made caused X to break, do you mind taking a look at it?" I said I'd take a look, but inside, I just wanted to melt into a puddle. It didn't take me long to fix it, and so I quickly checked in my fix. Phew, coast is clear, back to super cautious mode. But now that my broken code cherry had been popped, I was starting to realise that other people occasionally made mistakes, too. I'd often hear, "Oh, shit, I just accidentally deleted the deploy job, sorry guys, I'll fix it soon" or "Oops, I just ran the script to stop X application, it'll be back up in 5 minutes." And you know what? People just laughed it off, joked about the bank exploding, but nobody got up and yelled, "Oh, My. Fucking. God. You are so useless, I can't believe you are still working here." Nobody got mad. Half the time, nobody even looked up except to say, "That's fine, I wasn't using X anyway, take your time."

One time someone said, "Oh, John has broken the build again," and the lead developer replied, "Hey, it's a blame-free culture here," but in a really sarcastic way, like how people would say, "We're a synergistic team who kicks goals and we're seamlessly moving forward to the cloud." Despite his sarcasm, I do think that our team is very blame-free. If someone does break something, someone else might give them a heads up, not because they're trying to say, "You moron, look what you did", but because they know the person who broke it will likely be in the best position to fix it.

It's really comforting to me, and once I realised that nobody was going to kick me out of the building because I broke a build, my velocity increased quite a lot. Before, I found myself constantly checking with the other developers, "Is it OK if I do X?" or "I'm not sure how I can go about solving Y". One of the top developers on the team would often ask me, "What have you tried so far?" and I would say that I hadn't tried anything yet, but I had thought of solutions A, B and C. Then he would suggest that I try them, and I would.

Now, I just jump straight in. I've broken things, I've deleted things I shouldn't have. I created a job that ran over and over again for 4 hours, and a particular build of ours has over 200 tags (when it should only have 1) because my auto-tagging job kept running. I've made mistakes, and it's OK, because in making them, I am learning. As long as I don't leave a crazy backlog of things that need to be fixed, and I am not inconveniencing others too badly, of course. I know some people don't like the idea of a blame-free culture, as it means people don't need to be as accountable for their mistakes, but if you have a mature team like the one I am currently in, I think it works really well.

Should I be fixing other people's broken unit tests? That's a whole other story that isn't going to fit in this blog post.
