Posts about War Stories

Launch Day


Today was launch day. It went really well.  I wanted to capture what a good launch feels like and contrast that with a more exciting launch, just five months ago.

Today we turned on our first class on Stanford's instance of the open-source edX platform, which we're calling OpenEdX. The class is Statistics in Medicine, taught by Kristin Sainani of the Stanford School of Medicine. With over thirteen thousand students signed up, it's a medium-sized MOOC (Massive Open Online Course).

We have launched MOOCs for Stanford before: two in Fall Quarter and one in Winter. The classes were huge successes, but the launch days weren't so smooth. We had written that platform, Class2Go, from the ground up with a small team in a dozen weeks in the Fall; in the weeks before the Winter launch we ripped out the whole evaluation system, about one-third of the code, and replaced it with a whole new engine. In both cases most of our code was fresh off the presses.

Those launches were rocky. I'll tell the story of the DB class launch in January. The first thing we do is a "soft launch," where you open the front door and some people find their way in. Those first visits give you a sense of how things will go.  Surprisingly, the servers were a bit busy.  But we wanted to keep going, so we scaled up capacity and moved on.

The thing that drives real traffic is the announcement email. That gets people to the site. The announcements started going out, students started coming in, and the site lit up. We were in hot water. Servers were overloaded, and most surprisingly, the database was getting hammered. This was scary and unexpected. We control-C'ed the mail job and quickly hacked additional caching into the site. We had to trickle out announcements over the next twelve hours. We made it, but it was a long, stressful day.
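For the curious, "trickling out" a big mailing just means sending it in small batches with a pause between them, so the site isn't slammed all at once. Here's a minimal sketch of the idea; the batch size, pause, and send_announcement() helper are made up for illustration, not what we actually ran that day.

    # Hypothetical sketch: rate-limited announcement sending.
    import time

    BATCH_SIZE = 500        # made-up numbers; tune to what the servers can absorb
    PAUSE_SECONDS = 300     # five minutes between batches

    def trickle_send(addresses, send_announcement):
        """Send the announcement to everyone, one small batch at a time."""
        for i in range(0, len(addresses), BATCH_SIZE):
            for addr in addresses[i:i + BATCH_SIZE]:
                send_announcement(addr)
            time.sleep(PAUSE_SECONDS)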

The days and weeks after launch were spent watching graphs, triaging 500 errors (user-visible "we're sorry" pages), and installing daily hotfixes. But we got through it. The classes were a success and the team was proud.

So, contrast that to today's launch.  Totally different.

Everyone came in early as usual. I bought bagels. We turned on the class (soft-launch) and the servers hardly noticed. We sent the announcement mails, people came and took their pre-course survey and watched the intro video. Hardly any load. This chart shows the average CPU on our four appservers from 8:00 AM PDT / 15:00 UTC until 10:45 AM or so.

[Chart: launch-day appserver CPU]

Those are happy servers. Other charts we watched (db connections, load, etc.) told the same story. The most impressive thing: not a single user-visible error, no 500s!

Those folks at edX made some solid software.  We're happy to be working with a strong group of engineers and a quality product.  We've had our hands on it only since April, and it was released open-source to the world on June 1.  I fully expect a lot of other universities and organizations are going to have a great time running classes on OpenEdX too.

I just turned off half the appservers since we're fine on capacity. Now off to bed with a good feeling.

The Day I Took Down 100,000 Web Sites


Every engineer has their "the day I did something very very bad" story. This is mine.

It was my last week at Ning. For about a year I ran the Infrastructure and Operations group. It was the end of a good stretch, working with a strong team on hard problems.

In early 2010 (before I joined) Ning had successfully switched from a freemium to a paid model. Of Ning's two million social networks, some were thriving online communities, but many weren't. Trading those in for one hundred thousand good paying customers had been the right thing for the business. We kept the 1.9M orphaned sites around for some time in case someone wanted to assume ownership and start paying. By late 2011 the revival rate had dropped low enough that it was time to clear out the cruft.

I volunteered for the cleanup. I enjoyed little projects like this because they kept my hands in the code and took away a task that would have been a chore and a distraction for the engineers. The job was some SQL queries, scripting, load testing, checking, and double-checking. The hardest part was ensuring we had the right list of networks to delete.
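To give a flavor of that query-and-verify step, here's a rough sketch of the shape of it. The table and column names (networks, status, last_active), the protected list, and the sqlite3-style database API are all hypothetical, not Ning's actual schema or tooling.

    # Hypothetical sketch: build the deletion list, skip protected networks,
    # and print counts so a human can sanity-check before anything is deleted.
    DELETE_CANDIDATES_SQL = """
        SELECT id, name
        FROM networks
        WHERE status = 'orphaned'
          AND last_active < :cutoff
    """

    def build_delete_list(conn, cutoff, protected_names):
        rows = conn.execute(DELETE_CANDIDATES_SQL, {"cutoff": cutoff}).fetchall()
        candidates = [(nid, name) for nid, name in rows if name not in protected_names]
        skipped = len(rows) - len(candidates)
        print(f"{len(candidates)} candidates, {skipped} protected networks skipped")
        return candidates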

It took about two months to write and run the scripts. I felt good that it had gone through uneventfully.

There was a legacy corner of our system that hadn't been included in my cleanup pass. I hadn't worried about it at first since it was just a few thousand networks. But now that I was leaving Ning I wanted to tie off my loose ends, and this was one of them. I assembled the list of stragglers and ran my scripts. The 2,000 deletes went quickly. This was 11:00 AM on a regular workday -- my last Wednesday on the job.

I immediately knew that I had done something very wrong. Since I sat close to the Customer Advocacy group I heard their IM clients and phones light up. Every Ning page was a "We're Sorry" 404 page. Oh crap.

It only took a couple of minutes to find the problem. One of the networks I had marked for deletion held core JavaScript and CSS needed by every Ning network -- important stuff. That network, "socialnetworkmain," was known to everyone as important, and if I had done my spot checks better I would have seen it straight away. Instead, I had been sloppy.

The backdoor method I was using bypassed the safeties we had in place to protect our "really important" networks.

Luckily, Phil, one of our top-notch infrastructure engineers, was sitting close by and quickly restored the network through the backend. Things returned to normal almost immediately. The feeling of panic started to back off, the jokes started, and we began to relax.

It took us another few hours to find and fix the second- and third-order effects of my snafu. The network creation flow was broken -- fixed. An in-house monitoring system was dying silently -- fixed. And so on.


I learned four lessons:

  1. Undo is your friend. Building the tooling or features to revert a change takes time, but you will be happy to have them if you ever need them. (There's a sketch of one approach after this list.)
  2. Make changes during the day. Resolving a hairy problem is hard if you don't have the right people around, and harder still when the people you do have are tired and foggy. In the middle of the night, this incident might have taken hours to resolve instead of minutes.
  3. It's not over until it's over. Once in crisis mode, you're eager to get out of it. When the main problem is solved it is tempting to exhale, go to lunch, go back to bed, whatever. But aftereffects can take a while to show up, or may just be harder to find.
  4. Disproportionate effects. We all know that our complex systems have nonlinear failure modes. Incidents like these are reminders that small changes can have big effects. Said another way, even small things should be taken seriously.
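On lesson 1, here's a minimal sketch of what an "undo" can look like for deletions: mark rows deleted instead of removing them, so a bad run can be reverted in seconds. Again, the schema and the sqlite3-style API are hypothetical, for illustration only.

    # Hypothetical sketch: soft delete with a matching undo.
    def soft_delete(conn, network_ids):
        conn.executemany(
            "UPDATE networks SET deleted_at = CURRENT_TIMESTAMP WHERE id = ?",
            [(nid,) for nid in network_ids],
        )
        conn.commit()

    def undo_delete(conn, network_ids):
        conn.executemany(
            "UPDATE networks SET deleted_at = NULL WHERE id = ?",
            [(nid,) for nid in network_ids],
        )
        conn.commit()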

My time at Akamai had prepared me well for managing a crisis. But it's nice to have a refresher course every now and then.  

Why Quit? Because They Have Bigger Monitors

Good engineers are attracted to places with a strong engineering culture. But how can you see what the culture is really like from the outside? Here are my two quick-and-dirty indicators.

First, a word about what I mean by an engineering culture: it means engineers are valued and important. Some implications:

  • How are decisions made? In an engineering culture, technical people have input into what gets built, when, and by whom.  Not signoff, but a real say.
  • Is there respect for the craft of making software? Coding is still creative work that requires the right time and space. It's tough to predict how long some projects will take, and that needs to be OK.
  • Infrastructure. How hard is it for the people who know (engineers, managers) to justify non-feature-driven work to their bosses? This could be in the runtime system (like scaling work on the message queue) or in the back office (like build systems or version control).

Unfortunately, teasing this out in an interview can be tricky unless you have someone you really know and trust on the inside.

How big are the monitors?

A story from a prior company. I was an engineering manager with a retention problem. One of the engineers on my team quit to go to a smaller, hipper company. This was from his exit interview:

Me: why are you leaving?

Him: because they have bigger monitors.

Me: (incredulous) are you kidding? we can get you a bigger monitor.

Him: it's not just me -- everyone has big monitors.

Me: why is that so important?

Him: because it shows how much they value my time. The extra money to cram that many more pixels into my retina must be worth it to them.

And now I understand that this is totally true. Places that value their people consider equipment expenses small next to the productivity (and happiness) those people gain. The best engineers are given the best tools to do their jobs. Big monitors are a very visible sign of this.

Can people choose their own email addresses?

Non-engineers sometimes don't appreciate how important an email address is. It's your identity online. A strict naming convention (first name plus last initial, or worse, last name plus first initial) indicates a place that values conformity over engineer happiness. Worse, it's a great way to make people feel like cogs or "human resources," not the cool individuals they are.

(Aside: let's do away with the term Human Resources.  It's horrible.)

This one is important for me personally since I have a weird first name. If you don't let me be [email protected] then you get major demerits in my book. And no, clunky alias tricks, like a mailing list with one member in it, don't count. It's what you see on your shell prompt that matters; it's what whoami returns that matters.

One final word: this isn't a slam on you hardworking IT guys and gals who keep important things running and have to enforce the rules you're given. Instead, I'm speaking to the bad policies (usually stemming from bad cultures) that can put you into bad positions. If you are at such a place, hunker down and pray for daylight.