Why Is That Feature Taking So Long?

I've observed a recurring source of tension: building things fast vs. building them the right way. Usually you're not lucky enough to do both. This post explains a bit why we (engineers) care so much about building things right, even when things are overdue and our stakeholders (end-users, biz folks, product management) are pushing to just get it done.

You can describe the two poles here in pejorative terms.  Fast is slapdash, quick and dirty, bad software.  The right way is really just ivory-tower over-engineered turd polishing.  Is that fix really so important that we shouldn't ship it? Do you really need to refactor that now?

I see three things driving engineers to build it right.

  1. Wrong is insidious. A story: early on in Class2Go every page began with a fetch of some basic page data. If there was an error on lookup we would throw a 404. Not only is "not found" the wrong message, but it also isn't helpful. It covers up all kinds of other problems, in a way that is really difficult to debug. But for an engineer, these programming messes are broken windows. Not only do they cheapen the project, they encourage others to take bad shortcuts. Heck, if they can't be bothered to handle exceptions correctly, then why should I?
  2. Wrong is the express train to support hell. Unless you leave the company or project, you'll be called upon to support your own crappy code. That is certain. And there is almost nothing worse than being dragged back to debug some old code that you never meant to become production, and is now breaking the world.
  3. Wrong hurts your reputation as an engineer. People see you taking shortcuts and infer that you are a sloppy programmer. And really they aren't wrong to do so. One habit of really good engineers is they don't leave a wake of messes behind them. They are able to do things (mostly) right even when they are moving fast.
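To make the 404 story in the first point concrete, here is a minimal Python sketch of the alternative: return "not found" only when the page really is missing, and report everything else as a server error with a logged traceback. This is not the actual Class2Go code. The names `PageNotFound`, `fetch_page_data`, `render_page`, and the dict-like `store` are all hypothetical.

```python
import logging

log = logging.getLogger("pages")


class PageNotFound(Exception):
    """Raised only when the page genuinely does not exist."""


def fetch_page_data(store, slug):
    """Look up page data in a hypothetical dict-like store."""
    data = store.get(slug)  # a broken backend may raise here
    if data is None:
        raise PageNotFound(slug)
    return data


def render_page(store, slug):
    """Return an (HTTP status, body) pair for the requested page."""
    try:
        data = fetch_page_data(store, slug)
    except PageNotFound:
        # The one case where "not found" is the right message.
        return 404, "Not found"
    except Exception:
        # Anything else is our problem: log the traceback, don't hide it.
        log.exception("lookup failed for %r", slug)
        return 500, "Something broke on our end"
    return 200, data
```

The extra `except` branch costs a few lines, and it keeps a database outage from looking like a missing page.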

This last point is really the key one. Amongst engineers your reputation is your hard currency. It affects what projects you're invited to work on and what companies you'll be asked to join. Most importantly, good engineers only want to work with other good engineers, and they won't seek out someone sloppy. So I assume that any code I write will be looked at by a potential colleague or employer. This is especially true when working on an open-source project.

This doesn't mean you have to go slow -- indeed, on my current project, Class2Go, we get a lot done every week. What it does mean is that sometimes I'd rather not do something than do something I know will come out poorly.

So, product manager waiting on a feature: what to do if you're waiting on an engineer and you think it's taking too damn long? The tactic **not** to use: tell them this won't matter and to just ship it. That will just piss them off. Better: get your senior engineer to convince the junior one why it's good enough to ship and reason about what really needs to get done. Exception: if the engineer you're waiting on is your senior engineer, then trust her judgment and wait.

And finally, to you product managers and biz folks who push. You are right to do so. That's all that matters to the customers, all that matters at the end of the day. Keep it up.

Halloween Candy Data

Update with actuals from Halloween 2012:  It was a banner year.

You may be giving out candy later today. What can you expect? Let's look at some data.  This post summarizes the past three Halloweens.

Cumulative Trick or Treaters

As you can see we live in a pretty popular neighborhood.  Each year has its own story.

  • 2009 - our first year in our new neighborhood. We had no idea that this was such a popular trick-or-treating spot. I ran out of candy at 8:00, turned out the lights, and hid in the back of the house. Shameful.
  • 2010 - a fine year.  No complaints.
  • 2011 - we moved to a new house just around the corner.  I figured the quieter street would mean fewer kids -- not so! What I didn't appreciate was the attractive power of my next door neighbor's insane decorations. Luckily my wife came back with emergency supplies just in time.

And how busy do things get?  Darn busy.

Average and Max Trick or Treaters per Minute

During the busiest 15 minute period last year I was serving a kid every twenty seconds or so.  When bursting this is close to my max current candy-dispensing throughput.

If you come by my house this year you'll see me again, handing out candy with one hand and scribbling hash marks with the other.   I'll update the data in my public spreadsheet.
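Turning hash marks into numbers like these takes very little code. Below is a minimal Python sketch, assuming the tally is recorded as one count per minute. The function names and the sample counts are made up for illustration and are not taken from the spreadsheet.

```python
from itertools import accumulate


def cumulative(counts):
    """Running total of trick-or-treaters, one entry per minute."""
    return list(accumulate(counts))


def busiest_window(counts, width=15):
    """Return (start_minute, total_kids) for the busiest run of `width` minutes."""
    if len(counts) < width:
        raise ValueError("need at least `width` minutes of data")
    total = sum(counts[:width])
    best_start, best_total = 0, total
    # Slide the window one minute at a time: add the new minute, drop the old one.
    for start in range(1, len(counts) - width + 1):
        total += counts[start + width - 1] - counts[start - 1]
        if total > best_total:
            best_start, best_total = start, total
    return best_start, best_total


if __name__ == "__main__":
    sample = [1] * 5 + [3] * 15 + [1] * 5  # invented per-minute tallies
    start, kids = busiest_window(sample)
    print(f"busiest 15 minutes start at minute {start}: {kids} kids")
    print(f"one kid every {15 * 60 / kids:.0f} seconds")
```

With the invented sample, the busiest window holds 45 kids, which works out to three per minute, or one every twenty seconds.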

Managing Work-In-Progress Folders with "ls -ltr"

I've developed a nice little way to manage my work in progress folders.

I keep a bunch of folders around where I can stick things that are in-flight: one for home projects, one for blogs, etc. I like accumulating ideas in there, one idea per file, that I can come back to and fuss over until something is ready.

The folder is flypaper for ideas and notes.  My blog folder has about twenty files right now.

Using "ls", and relying on file and directory modification times, is a useful way to keep track of all these files.  Here's how.

As projects are finished I move them into a "done" folder. This gets them out of the way. A useful side effect is that the modification time of the done directory itself gets updated.

The in-flight projects I care about most are the ones most recently updated: as an idea gets older it becomes less and less interesting.  When looking at this directory I use the "ls -ltr" command.  "l" means long (so you can see dates), "t" means sort by modification time, and "r" reverses that sort, so you see latest modified on the bottom.

The old files are at the top where I don't worry about them. Some even scroll off the top of the screen, and that's fine. The done folder is a nice visual line for when something was last completed (say, published). Having a separate color for directories makes this visually apparent. At the bottom of the list are the files most recently modified. Generally I work from the bottom up. Sometimes I go back to an old idea and add some notes. That bumps it back down to the bottom of the list, the front of the line -- it is interesting again.
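For anyone who wants the same view from a script, here is a minimal Python sketch of the ordering that "ls -ltr" gives: entries sorted by modification time, oldest first, with directories marked. The function names and the output format are my own assumptions, and the output is not meant to match what ls prints.

```python
import os
import time


def list_oldest_first(folder="."):
    """Return (mtime, name, is_dir) tuples, oldest first, like "ls -tr"."""
    entries = []
    with os.scandir(folder) as it:
        for entry in it:
            if entry.name.startswith("."):
                continue  # plain "ls" hides dotfiles too
            entries.append((entry.stat().st_mtime, entry.name, entry.is_dir()))
    return sorted(entries)  # ascending mtime: newest ends up at the bottom


def print_listing(folder="."):
    """Print one line per entry with its modification date."""
    for mtime, name, is_dir in list_oldest_first(folder):
        stamp = time.strftime("%b %d %H:%M", time.localtime(mtime))
        suffix = "/" if is_dir else ""  # stand-in for a directory color
        print(f"{stamp}  {name}{suffix}")


if __name__ == "__main__":
    print_listing(".")
```

The "done/" line lands wherever its modification time puts it, which is what makes it work as a divider between stale and active files.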

I probably type "ls -ltr" at least 5 times an hour.  I've never aliased it since my muscle memory is so strong (but maybe I still should).  

The Day I Took Down 100,000 Web Sites

Every engineer has their "the day I did something very very bad" story. This is mine.

It was my last week at Ning. For about a year I ran the Infrastructure and Operations group. It was the end of a good stretch, working with a strong team on hard problems.

In early 2010 (before I joined) Ning had successfully switched from a freemium to a paid model. Of Ning's two million social networks some were thriving online communities, but many weren't. Trading those in for one hundred thousand good paying customers had been the right thing for the business. We kept around the 1.9M orphaned sites for some time in case someone wanted to assume ownership and start paying. By late 2011 the revival rate had dropped low enough that it was time to clear out the cruft.

I volunteered for the cleanup. I enjoyed little projects like this because it kept my hands in the code and took away a task that would be a chore and distraction for the engineers. The job was some SQL queries, scripting, load testing, checking, and double-checking. The hardest part was ensuring we had the right list of networks to delete.

It took about two months to write and run the scripts. I felt good that it had gone through uneventfully.

There was a legacy corner of our system that hadn't been included in my cleanup pass. I hadn't worried about it at first since it was just a few thousand networks. But now that I was leaving Ning I wanted to tie off my loose-end projects, and this was one. I assembled the list of stragglers and ran my scripts. The 2,000 deletes went quickly. This was 11:00 AM on a regular workday -- my last Wednesday on the job.

I immediately knew that I had done something very wrong. Since I sat close to the Customer Advocacy group I heard their IM clients and phones light up. Every Ning page was a "We're Sorry" 404 page. Oh crap.

It only took a couple of minutes to find the problem. One of the networks I had marked for deletion had core JavaScript and CSS needed by every Ning network -- important stuff. The network, "socialnetworkmain", was known to everyone as being important, and if I had done my spot checks better, I would have seen it straight away. Instead, I had been sloppy.

The backdoor method I was using bypassed the safeties we had in place to protect our "really important" networks.

Luckily, Phil, one of our top-notch infrastructure engineers, was sitting close by and quickly restored the network through the backend. Things returned to normal almost immediately. That feeling of panic started to back off. The jokes started and we began to relax.

It took us another few hours to find and fix the second- and third-order effects of my snafu. The network creation flow was broken -- fixed. An in-house monitoring system was dying silently -- fixed. And so on.

I learned four lessons:

  1. Undo is your friend. The tooling or features to revert your change will take time, but you will be happy to have it if you ever need it.
  2. Make changes during the day. Resolving a hairy problem is hard if you don't have the right people, or when the people you do have are tired and foggy. In the middle of the night this incident might have taken hours to resolve instead of minutes.
  3. It's not over until it's over. Once in crisis mode, you're eager to get out of it. Once the main problem is solved it is tempting to exhale, go to lunch, go back to bed, whatever. But aftereffects might take a while to show up, or may just be harder to find.
  4. Disproportionate effects. We all know that our complex systems have nonlinear failure modes. Incidents like these are reminders that small changes have big effects. Said another way, even small things should be taken seriously.
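The first lesson, together with the safeties my backdoor bypassed, can be sketched in a few lines of Python. This is an illustration under stated assumptions and not how Ning's tooling worked. Only the network name "socialnetworkmain" comes from the story. The `NetworkStore` class, its in-memory sets, and the `dry_run` flag are hypothetical.

```python
# Only "socialnetworkmain" comes from the story; everything else is hypothetical.
PROTECTED = frozenset({"socialnetworkmain"})


class NetworkStore:
    """In-memory stand-in for wherever the networks actually live."""

    def __init__(self, names):
        self.live = set(names)
        self.trash = set()  # soft-deleted networks, still restorable

    def delete(self, names, dry_run=True):
        """Soft-delete networks; refuse the whole batch if any is protected."""
        names = set(names)
        blocked = names & PROTECTED
        if blocked:
            raise ValueError(f"refusing to delete protected networks: {sorted(blocked)}")
        targets = names & self.live
        if not dry_run:
            self.live -= targets
            self.trash |= targets
        return sorted(targets)  # what was (or would be) deleted

    def restore(self, names):
        """Undo: move soft-deleted networks back to live."""
        back = set(names) & self.trash
        self.trash -= back
        self.live |= back
        return sorted(back)
```

Making the dry run the default and checking the protected list inside `delete` itself means a one-off script gets the same safeties as the front door.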

My time at Akamai had prepared me well for managing a crisis. But it's nice to have a refresher course every now and then.  

On-Line Education Is Really Interesting Right Now

I've joined a research project at Stanford University (my alma mater).  I am working with a small team to build a platform for on-line education. This post explains what we're building, my part in it, and why this is an interesting area right now.

Class2Go

We are building Class2Go, an application to put Stanford classes on line. Envision a video-driven web site with exercises and tests. It will run much like a class today (professors, TAs, lectures, homework, tests, schedules) but with everything happening on-line. For some kinds of classes this could improve the classroom experience for enrolled students; the excitement comes when you can bring in a much larger set of non-enrolled students.

The professors who have expressed interest so far are those who want to flip their classrooms or host a Massive Open On-Line Course (MOOC). Others have ideas on experimentation with the learning process itself. For example, do students learn better when they see slides, a talking head, or both? A large enough student population makes meaningful studies possible and cheap.

There are a bunch of features needed over and above basic video. First, can we devise a good system for students to grade each other's work? This is an absolute must to do anything in the humanities at large scale. We have some ideas of how to do peer evaluation, but it's an area that needs much more experimentation. A second important feature is enabling off-line use from laptops and tablets. When we think of disconnected students we typically think of commuters, but we've heard it's just as important (or maybe more so) for students in developing countries where bandwidth is precious.

While there are commercial offerings to do this, and even things from other established universities, Class2Go will be different in three important ways:

  • Research will be first and foremost. Hey prof, you want to do a wacky experiment? We can help.
  • Value produced by the course is retained and controlled by the professor and the university.  That value comes from the assets (video, homework) but even more so the community and technology.
  • The professors will have quick and ready access to their data. No waiting in lines for reports.

Stanford Professor Rob Reich made a very coherent argument for this in a recent blog post. (Getting to work with smart people like him is a major job perq, by the way.)

Part of what makes this doable is that there are so many great building blocks available.  We plan to use YouTube for video and the Khan Academy framework for exercises.   Piazza has great discussions and forums to start with. Because of all this good stuff, we feel good about getting something strong out by the Fall for a couple of courses, and then expanding from there.

And me? I'm the line manager. It's fun to work on a project of this size where I can keep my hands dirty. For example, this week I'll be working on authentication (OAuth).

Why Education, Why Now?

Two things brought me to education.

Firstly, education is uniquely important. I challenge you to name another human pursuit that is as important, consumes so many resources (for good reason), and can be so transformative when done well.  The other two that come to mind are health and agriculture. But I'd say education is right up there.

And second, education is changing right now. Of the meetings I've had while on sabbatical, some of the most interesting have been about education. Why can't information technology have as big an impact on how people learn as it has had on everything else? Technology could mean making education available to many more students, in different places and of different means. Educators could spend less time lecturing (dry stuff, to be sure) and more time working with students. Better measurement and analysis could give educators near-real-time feedback and students tailor-made homework. Those are just a few of the examples that I heard about that got me excited.

I look forward to doing something that will make even a little difference here. I'll blog about my experiences along the way.

PS - Peter Norvig's TED Talk captured the reasons and potential nicely.  That seven minutes is worth your time.

Two Things At Once

Engineering management is about solving problems and removing obstacles. You always have more problems than time and resources to solve them. How can you stay on top of it all?

A trick I learned from Julia Austin (my boss at Akamai some years back) was to force yourself to always solve two problems at once. Before jumping on today's problem, figure out something else you can fix at the same time.

Say you have a retention problem. Engineering department scuttlebutt is that people are burned out with maintenance projects. But you are also concerned about your low bus number. I had these two problems once, and my answer was to actively shuffle engineers around onto other projects. Not quite musical chairs, but close. It worked. The engineers liked the variety, and I gave my top performers first dibs on projects, which they appreciated. There was a short-term productivity hit, but it was worth the benefits from cross-pollination and engineer happiness.

Another real-life example. I had a remote office that felt disconnected and needed some knowledge-sharing love. I had to send someone over for a few weeks. I initially planned on sending my go-to senior person, but they had been before and to them an international trip was a chore. I also had an up-and-coming engineer who I wanted to invest in, so I sent her instead. This trip was a stretch for her, but I did some coaching and she did great. Plus she saw it as a perq and appreciated the responsibility.

This technique has driven a couple of good behaviors for me as an engineering manager. I try to take a minute before jumping in with a solution to see what else I can do while I'm at it. It also has helped me to keep a useful (and secret) worry list and to consult it from time to time. That list is a useful way for important but not urgent problems to get mindshare. Reviewing that list became part of my Sunday evening get-ready-for-the-week routine.

FizzBuzz Questions for Engineering Managers

You probably wouldn't hire an engineer that you hadn't seen code. But people tend to hire  engineering managers without seeing them manage. I contend that's a big oversight: you should see them manage on their feet, at least a little.  But how?

I'm thinking along the lines of the "FizzBuzz" class of questions. I believe they got some fame with this article in 2007, and Google seems to support this. Anyway, they are simple programming problems that should be no-brainers for any decent engineer. But they are useful exactly because of the surprisingly large number of candidates who flame out on them. By some accounts, as high as 50%.
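For anyone who hasn't run into it, the classic version of the problem goes: print the numbers 1 through 100, but print "Fizz" for multiples of three, "Buzz" for multiples of five, and "FizzBuzz" for multiples of both. A minimal Python sketch of one common solution:

```python
def fizzbuzz(n):
    """Return the FizzBuzz word for n, or n itself as a string."""
    if n % 15 == 0:  # divisible by both 3 and 5
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


if __name__ == "__main__":
    for i in range(1, 101):
        print(fizzbuzz(i))
```

The only trap is the order of the checks: the "both" case has to come first, or 15 comes out as plain "Fizz".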

So what we want is something similar for engineering managers: practical questions that can be done within the constraints of an interview. Here are a few that have been useful for me.

Role Playing

Instead of asking them how they would do something, just ask them to do that thing in front of you. Role playing can be awkward if you've never done it before, but give it a try. Once you get over the hump and try it, it can be fun. Here are three good scenarios I've used.

  • "I'm a product manager. Customer Foo (say, one of our biggest customers) wants a feature right before the release is to go out. You're the VP of Engineering. You feel like this will destabilize your release.  How do you say no?"  Sometimes it can be fun to play out the natural follow-up, when they go over your head to the CEO.
  • "I'm a slacking employee. I have the skills, but for some reason I haven't been productive lately. It's our weekly one-on-one time -- what do you say?"
  • "You have to lay someone off. How do you prepare? What do you say?"

If they have follow-up questions, or dispute the premise, then play it out. Usually those end up being the most insightful and fun interviews. Also, one warning: questions like these can take you far down the rabbit hole and use up the whole hour if you're not careful. Remember that it's your time. You should cut them off and switch to the next question, abruptly if need be, to get to everything you need to cover.

Nuts and Bolts

Ask candidates about the mechanics of how they do their job and you can get insight into their experience and values. What tools (wikis, bug tracking, source control, continuous testing) do they use? Do they run team meetings or daily standups? If so, how do these go?

What you're looking for are thoughtful answers. Anyone who has used these tools a lot should have opinions about them. If they don't, then either they don't care (bad) or they haven't done the work (bad). Are they negative about everything? Are they flexible?

Motivation

For me this often is the clincher. Why did you get into engineering management in the first place? Why are you doing it now? I won't say what I consider to be good answers, since this varies a lot and should be pretty personal. Again, what you're looking for is a thoughtful answer.

This can be most useful for turning up red-flag responses. You don't want the person who did it for power ("I wanted to boss people around") or ambition ("I wanted to make more money") or just fell into it.  Trust your gut for answers that don't sound genuine and ask probing follow-ups.

I find questions like these can lead to more insightful conversations and get the candidate off-script faster than the typical resume walkthrough.  

Why Quit? Because They Have Bigger Monitors

Good engineers are attracted to places with a strong engineering culture. But how can you see what the culture is really like from the outside? Here are my two quick-and-dirty indicators.

First a word about what I mean by an engineering culture. It means engineers are valued and important. Some implications:

  • How are decisions made? In an engineering culture, technical people have input into what gets built, when, and by whom.  Not signoff, but a real say.
  • Is there respect for the craft of making software? Coding is still creative work that requires the right time and space. For some projects it is tough to predict how long they will take, and that needs to be OK.
  • Infrastructure. How hard is it for the people who know (engineers, managers) to justify to their bosses when work is needed on non-feature driven stuff? This could be in the runtime system (like scaling work on the message queue) or back office (like build systems or version control).

Unfortunately, teasing this out in an interview can be tricky unless you have someone you really know and trust on the inside.

How big are the monitors?

A story from a prior company. I was an engineering manager who had a retention problem. One of the engineers on my team quit to go to a smaller, hipper company. This was from my exit interview:

Me: why are you leaving?

Him: because they have bigger monitors.

Me: (incredulous) are you kidding? we can get you a bigger monitor.

Him: it's not just me -- everyone has big monitors.

Me: why is that so important?

Him: because it shows how much they value my time. The extra money to cram that many more pixels into my retina must be worth it to them.

And now I understand that this is totally true. Places that value their people consider equipment expenses small compared to the productivity (and happiness) of their people.  The best engineers are given the best tools to do their jobs. Big monitors are a very visible sign of this.

Can people choose their own email addresses?

Non-engineers sometimes don't appreciate how important an email address is. It's your identity on line. A strict naming convention (first name last initial, or worse, last name first initial) indicates a place that values conformity over engineer happiness. Worse, it's a great way to make their people feel like cogs or "human resources," not the cool individuals that they are.

(Aside: let's do away with the term Human Resources.  It's horrible.)

This one is important for me personally since I have a weird first name. If you don't let me be [email protected] then you get major demerits in my book. And no, clunky alias tricks, like a mailing list with one member in it, don't count. It's what you see on your shell prompt that matters; it's what whoami returns that matters.

One final word: this isn't a slam on you hardworking IT guys and gals who keep important things running and have to enforce the rules you're given. Instead, I'm speaking to the bad policies (usually stemming from bad cultures) that can put you into bad positions. If you are at such a place, hunker down and pray for daylight.