Posts about Technology

Cleaning Up Photo Duplicates

Pie Chart

I did a data cleanup over the weekend. I doubt this is interesting for anyone else; I just wanted to capture my own notes.

Our family photos are stored in Google Photos. We have about 100k photos and short movies that take up 0.5TB. I wanted to back up to local storage, and I also liked the idea of trying the M-DISC format for long-term storage.

I ran Takeout and unzipped everything. I was surprised to see so many duplicates. Some are copies in the same folders with suffixes like "(1)"; most are multiple copies in different folders. Takeout seems to just store a copy for each album a photo is in.

Some poking at the data shows that 88% of the 302,755 files / 72% of the bytes were unique (histogram at the bottom of the post). Removing dups can save 200+GB. Sure, it's not really worth the trouble, but why not.

Steps

1. On the NAS, gather file checksums (md5sum) and sizes for all the files under Takeout/Google Files. Mostly variations on find -print0 | xargs -0.

2. Data cleanup to get into nice file paths and clean delimiters, mostly interactively with vi.

3. Insert into sqlite using .separator and .import.

Here's what the database ended up looking like: Photos and Sizes are inputs, HashCounts, Duplicates, and Candidates are outputs.

CREATE TABLE Photos(hash TEXT,path TEXT);
CREATE TABLE Sizes(size INT,path TEXT);
CREATE TABLE HashCounts(hash TEXT,found INT);
CREATE TABLE Duplicates(path STRING,hash STRING,found INT,size INT);
CREATE TABLE Candidates(path STRING,hash STRING,size INT,pos INT);
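Steps 1 through 3 could also be collapsed into a single pass. Here is a minimal Python sketch of that idea, not the md5sum / vi / .import route described above. It assumes the files are readable from wherever Python runs, and the function names md5_of_file and load_inventory are made up for illustration. It fills the same Photos and Sizes input tables.

```python
import hashlib
import os
import sqlite3


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_inventory(root, db):
    """Walk root and fill the Photos and Sizes tables, one row per file."""
    db.execute("CREATE TABLE IF NOT EXISTS Photos(hash TEXT, path TEXT)")
    db.execute("CREATE TABLE IF NOT EXISTS Sizes(size INT, path TEXT)")
    count = 0
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            db.execute("INSERT INTO Photos VALUES (?, ?)", (md5_of_file(path), path))
            db.execute("INSERT INTO Sizes VALUES (?, ?)", (os.path.getsize(path), path))
            count += 1
    db.commit()
    return count
```

Usage would be something like load_inventory("Takeout", sqlite3.connect("photos.db")). Bound parameters sidestep the delimiter cleanup, since paths with odd characters never pass through a text file.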

4. Find the dups. I'm surprised to find some with 50 or more copies, but it turns out some family favorites end up in lots of albums.

insert into HashCounts
select hash, sum(1) as found from Photos group by hash;

5. Figure out candidates for deletion

insert into Duplicates
select Photos.path, Photos.hash, HashCounts.found, Sizes.size
from Photos, HashCounts, Sizes
where HashCounts.hash=Photos.hash and Photos.path=Sizes.path
and HashCounts.found > 1;

The first of each set is the one we'll keep.

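insert into Candidates
select *
from (select path, hash, size,
        -- rank the copies within each hash; the explicit order keeps the ranking deterministic
        row_number() over (partition by hash order by path desc) as row_number
    from (select * from Duplicates order by hash, path desc)
) where row_number > 1;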

I was surprised that sqlite supports window functions, nice.

The inner reverse-alpha sort on "path" takes care of two cases. I tend to prefer keeping photos with names that start with years, and those come first alphabetically (nice). Also, within folders there are often many copies with "(1)" and "(2)" suffixes that are generally cruft and most worthy of removing, and those also sort last alphabetically (nice).
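To see how the ranking behaves, here is a toy run in Python's sqlite3 module. It is a sketch on made-up data: the three paths and the hash "aaa" are hypothetical, and it puts the ordering inside the window clause and skips the inner subquery. It needs a SQLite build with window functions (3.25 or later).

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Duplicates(path TEXT, hash TEXT, found INT, size INT)")
db.executemany(
    "INSERT INTO Duplicates VALUES (?, ?, ?, ?)",
    [
        ("Album/IMG_1.jpg", "aaa", 3, 100),
        ("Photos from 2019/IMG_1.jpg", "aaa", 3, 100),
        ("Photos from 2019/IMG_1(1).jpg", "aaa", 3, 100),
    ],
)

# Rank each hash group by path, descending; position 1 is the copy to keep.
ranked = db.execute(
    """
    SELECT path, row_number() OVER (PARTITION BY hash ORDER BY path DESC) AS pos
    FROM Duplicates
    ORDER BY hash, pos
    """
).fetchall()

keep = [path for path, pos in ranked if pos == 1]
candidates = [path for path, pos in ranked if pos > 1]
```

With this data, keep holds "Photos from 2019/IMG_1.jpg" and the other two paths land in candidates. The "(1)" copy ranks after the plain name in the descending sort because "(" compares lower than "." in SQLite's default byte-order collation.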

6. Dump out the "Candidates" using .output. Copy back to the NAS. Do lots of spot checks. Convert to a bash script of rm commands, run very carefully.
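The conversion to rm commands is where quoting can bite, since Takeout paths contain spaces and parentheses. A minimal sketch of that conversion in Python, assuming the candidate paths are already in a list; the helper name rm_script is made up for illustration.

```python
import shlex


def rm_script(paths):
    """Build a bash script that removes each path, quoting it for the shell."""
    lines = ["#!/bin/bash", "set -eu"]
    for path in paths:
        # "--" stops option parsing, so a path starting with "-" is still a file.
        lines.append("rm -- " + shlex.quote(path))
    return "\n".join(lines) + "\n"
```

Writing the script to a file and reading it over before running it keeps the "run very carefully" part intact; nothing here deletes anything by itself.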

Tools

Sqlite is my go-to tool for ad-hoc work like this. It's fast and simple, but only for small jobs -- this one is MB-scale.

select name, sum(pgsize) from dbstat group by name;
name           sum(pgsize)
-------------  -----------
Candidates     3006464
Duplicates     5541888
HashCounts     11051008
Photos         23240704
Sizes          13807616
sqlite_schema  4096

My Synology is a pretty good place for storage, with the ability to ssh in and run local commands. But if I had to do this again, I would just buy a large locally-connected SSD. All the transfers to and from the NAS were a hassle. Looking now, I'm stunned that you can get a 4 TB external SSD for under $300.

Some private notes here.

Histogram (source)

The iPhone's SIM Tray Went Away Too Soon

SIM

If you've traveled internationally, likely you've used a SIM card for local data and calls. There is a nice ecosystem around SIMs with a wealth of easy and affordable Pay As You Go (PAYG) options.

But the newer iPhones did away with the SIM slot in favor of some new eSIM hotness. Apple has all kinds of info claiming they have good international support, but I found reality falls short.

  • Only a few carriers support eSIMs,
  • The few that do require a contract. A tourist or student studying abroad is better served by a PAYG plan, and
  • Even if you can stomach a contract, that would require a UK bank account; no way to easily pull that off.

We ended up falling back to international roaming. It works but is expensive.

I think Apple made the wrong call removing the trusty old SIM tray. Clearly the new models can be made to work well with it, since that's how they are sold in the UK. If you're unlucky enough to have bought your recent iPhone in the US, you're out of luck.

I think this is an example of Apple's bad tendency to sometimes choose form over function, "courage" over usability.

YouTube

YouTube

I'm excited to start my new job at YouTube in a few weeks. I'll manage the engineering team building the data warehouse for usage metrics.

I like that YouTube is important. It's firmly a part of our culture and I'm sure it will be how my kids watch video. YouTube's impressive statistics are the result. You don't see usage like that without a bunch of hard problems, and hard problems attract bright people. Indeed that's the clincher for why I'm looking forward to working there. People vote with their feet, and I have a lot of friends who have opted for Google, and YouTube specifically. They tell me that it's a great place to work.

YouTube is one of the world's foremost platforms for social commentary, education, and free speech. And it's plenty of entertainment too. Sounds like fun.

Thick Apps Still Lose

Microsoft Excel 2016 Error Message

Thick apps won mobile. Fine.

On laptop (and desktop) it's not so clear. What is better, thick or thin? I tend to live mostly in thin land, although I use some thick apps regularly, like Twitter's Mac client and Apple Photos.

Every so often I give a big native app a try: Excel instead of Google Sheets, Mail.app instead of Gmail, Reminders instead of the barebones Tasks built into Gmail. (I can't bring myself to try Word). But it's disappointing to see how those fancy apps keep shooting themselves in the foot!

Take for example this Excel error message. Excel is whining that it can't verify my subscription the first time I ran Excel untethered (version 15.11.2, for what it's worth). Sure you can click through the warning, but would a newbie know to do that? At best off-putting, at worst downright disorienting. Why warn me of this at all? And why in a modal that stops me dead in my tracks?

It seems thick apps should win. They rock the unplugged use case. An even better situation is flaky networks -- tethered, conference WiFi, travelling. UIs deal notoriously poorly with intermittent or partial outages. A thick client, relying on that connection only for hitting APIs, can hide the network.

Another place they should shine is the UI itself. They should be fast, beautiful, and featureful. Too often they're not. For example I find Mail.app to be clunky, difficult to customize, and its keyboard shortcuts few and poorly done. Gmail is pretty good!

Finally there's the upgrade problem. Thick apps need a conscious effort from their users to upgrade before the developers' work sees user time and they get feedback. And that feedback is what drives innovation. Long cycles mean slower (less) invention. One example I love is Gmail's "undo send" feature. Boy, you sure do miss that when you need it and it's not there! That should be on every thick client by now, but I don't think it is. I do know that Gmail has it and Mail.app still doesn't.

Maybe the Internet can help. Look at Chrome with its awesome auto updates. What makes this work is solid engineering and exceptional quality control. I've never seen behind the Google curtain, but I bet there's no magic, just a lot of good engineering that leads to good software. Like: good design and code reviews, tons of test coverage across many scenarios, diverse and well-instrumented canaries, and thorough performance and resource use testing. If Google didn't do all of that so well, then we wouldn't accept frequent pushes. Without the frequent upgrade cycle, Chrome's feature cycle would languish.

Electron is another bright spot. This is the framework that gives Slack and GitHub's thick clients their fit and finish. It makes these feel like true native apps, even though they are mostly web controls with JavaScript under the covers. Right-clicking still doesn't do what I want, and text controls are finicky, but it's close. But what those rough edges buy you, and the software producer, are frequent, reliable, and clean upgrades.

My natural preference would be for thick apps. If they were done well, I'd use them.

My Next Job

Snowflake

I left my last job a few weeks back and it's high time to look for a new one. If you're working on something interesting and think I could help, let me know!

It's nice to not have a day job while looking for another. I was lucky enough to do this once before, in 2012, which turned out great. I learned then that time and flexibility let you talk to lots of friends and learn about a breadth of projects. I found a fun project in a new domain (online education), something I doubt I'd have found the normal way.

Maybe I'll get lucky again.

Enough small talk, what am I looking for?

I'm looking for some flavor of line manager. I'm a good senior manager and code-every-day engineer; but I'm exceptional leading a team and running a project. That's what line managers do: lead engineers, not other managers or departments or matrix-anything. Also, if you're some kind of executive then coding is an indulgence, and I'd rather it just be part of my job. Mostly I'm talking to small companies, say 10-100 people (fun-size).

I want to build on my experience. I know infrastructure and cloud, SaaS and enterprise, and online education. I'm probably not the best person for your storage, security, gaming, e-commerce, or cryptocurrency company. I want to stay working on Internet technology. I like the (micro)services model. For my own projects I choose Python, JavaScript (frontend and backend), and Java. I know web operations, especially the Amazon stack.

Location is important: I don't want to do a daily Menlo Park to San Francisco round-trip. I'd like to work with friends if possible. And I want to do something worthwhile.

You can always get to my resume from the header here, or via this short link. I'm open to a bunch of things, just no kick boxing. Let's have coffee/drink or take a walk.

Lessons from Three Years in AWS

AWS Logo

I've spent the last three years building and operating web sites with Amazon Web Services, and here are a few lessons I've learned. But I first have to come clean that I'm a fan of AWS with only casual experience with other IaaS/PaaS platforms.

S3 Is Amazing. They made the right engineering choices and compromises: cheap, practically infinitely scalable, fast enough, with good availability. $0.03/GB/mo covers up for a lot of sins. Knowing it's there changes how you build systems.

IAM Machine Roles From The Start. IAM with Instance Metadata is a powerful way to manage secrets and rights. Trouble is you can't add a role to existing machines. Provision with machine roles in big categories (e.g. app servers, utility machines, databases) at the start, even if just placeholders.

Availability Zones Are Only Mostly Decoupled. After the 2011 us-east-1 outage we were reassured that a coordinated outage wouldn't happen again, but it happened again just last month.

They Will Lock You In And You'll Like It. Their secondary services work well, are cheap, and are handy. I'm speaking of SQS, SES, Glacier, even Elastic Transcoder. Who wants to run a durable queue again?

CloudFormation No. It's tough to get right. My objection isn't programming in YAML (I don't mind writing Ansible plays); it's the complexity and structure of CloudFormation that is impenetrable. Plus even if you get it working once, you'd never run it again on something that is running.

Boto Yes. Powerful and expressive. Don't script the CLI, use Boto. Easy as pie.

Qualify Machines Before Use. Some VMs have lousy networking, presumably due to a chatty same-host or same-rack neighbor. Test for loss and latency to other hosts you own and to EBS. (I've used home-grown scripts; I don't know of a standard open-source widget. Someone should write one.)
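A rough idea of what such a home-grown check might look like, as a Python sketch. It assumes TCP connect time to a port you control is a fair stand-in for network latency, it does not cover EBS at all, and the thresholds in qualifies are made-up defaults rather than recommended values.

```python
import socket
import statistics
import time


def probe(host, port, attempts=20, timeout=1.0):
    """Time TCP connects to host:port; a failed attempt is recorded as None."""
    samples = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                samples.append(time.monotonic() - start)
        except OSError:
            samples.append(None)
    return samples


def summarize(samples):
    """Return (loss fraction, median seconds or None) for a list of samples."""
    good = [s for s in samples if s is not None]
    loss = 1 - len(good) / len(samples)
    median = statistics.median(good) if good else None
    return loss, median


def qualifies(samples, max_loss=0.05, max_median=0.005):
    """Decide whether a machine passes; both thresholds are made-up defaults."""
    loss, median = summarize(samples)
    return median is not None and loss <= max_loss and median <= max_median
```

Run probe from the new VM against a few hosts you already trust, then feed the samples to qualifies. A machine that fails gets terminated and replaced before it ever takes traffic.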

VPC Yes. If you have machines talking to each other (i.e. not a lone machine doing something lonely) then put them in a VPC. It's not hard.

NAT No. You think that'll improve security, but it will just introduce SPOFs and capacity chokepoints. Give your machines publicly routable IPs and use security groups.

Network ACLs Are A Pain. Try to get as far as you can with just security groups.

You'll Peer VPCs Someday. Choose non-overlapping subnet IP ranges at the start. It's hard to change later.
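Checking a planned set of ranges for overlap is quick with Python's standard ipaddress module. A minimal sketch; the function name overlapping_pairs and the example network names are made up for illustration.

```python
import ipaddress
from itertools import combinations


def overlapping_pairs(cidrs):
    """Return the pairs of named CIDR blocks that share any addresses."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]
```

For a plan like {"prod": "10.0.0.0/16", "staging": "10.1.0.0/16", "tools": "10.0.128.0/20"} this flags prod and tools, since the /20 sits inside the /16. An empty list means the ranges can be peered later without renumbering.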

Spot Instances Are Tricky. They're only for a very specific use case that likely isn't yours. Setting up a test network? You can spend the money you save by using spot on swear jar fees.

Pick a Management Toolset. Ansible, Chef, all those things aren't all that different when it comes down to it. Just don't dither back and forth. There's a little bit of extra Chef love w/ AWS but not enough to tip the scales in your decision I'd reckon.

Tech Support Is Terrible. My last little startup didn't get much out of the business level tech support we bought. We needed it so we could call in to get help when we needed it, and we used that for escalating some problems. It was nice to have a number to call when I urgently need to up a system limit, say. But debugging something real, like a networking problem? Pretty rough.

...Unless You Are Big. Stanford, on the other hand, had a named rep who was responsive and helpful. I guess she was sales, but I used her freely on support issues and she worked the backchannels for us. Presumably this is what any big/important customer would get, that's just not you, sorry.

The Real Power Is On Demand. I'm reaffirming cloud koolaid here. Running this way lets you build and run systems differently, much better. I've relied on this to bring up emergency capacity. I've used it to convert a class of machines on the fly to the double-price double-RAM tier when hitting a surprising capacity crunch. There is a whole class of problems that get much easier when you can have 2x the machines for just a little while. When someone comes to you with that cost/benefit spreadsheet arguing why you should self-host, that's when you need your file of "the cloud saved my bacon" stories at the ready.

CS Students: Learn to Write

Writing Hand

If I could do my college years over I would focus on writing. I would take courses that required a lot of writing, in the spirit of "learn by doing". I'd also take courses in the mechanics and craft: grammar, vocabulary, and rhetoric.

I used to consider myself a competent writer. And certainly good enough for an engineer, right? But I've learned that I have a long way to go. I've learned that engineers spend much more time writing than you expect. And I appreciate how hard it is to write well.

This came up just this past week. I came across a beautiful four page essay. It laid out the problem, described alternatives, and led you concisely to a well-reasoned conclusion. Sure, it was about technology, but what carried the day was the good writing. Humbling!

You're saying: but wait, you're a pointy-haired boss now Sef, one of those guys who doesn't do Real Work anymore. You write email and boss people around. I understand why managers like you need to write, I'm an engineer. I write code. Well, there's some truth to that. But I'd like to convince you that even monster coders, if they're any good, write a lot, write well, and value good writing in others.

It is rare for someone to do their work in isolation and have it matter. You're working as part of a team, whether in a company or in a community of developers, say on an open-source project. Sure, you build a great system or feature. But to have it matter to the world, for others to adopt it, you need to document, publicize, support, teach. 98% of that is writing.

I had a lunchtime conversation with Jacob Kaplan-Moss, Django's co-founder, at last year's PyCon. I asked him why Django caught on and was adopted by so many of us (including my last project and my current project). I was expecting him to point to features or timing. Instead he said it was "because we wrote good docs". The Django team didn't treat documentation as an afterthought. They have lots of docs, and they are good.

Consider that rare bird, the iconoclastic engineer working in isolation on their own project. I'm thinking of Antirez or Marco. (Maybe they aren't even truly on their own so much. Humor me!) They are prolific and strong coders. But they also write a lot of words! They write a ton about their projects, and also about the tech landscape and their place in it. Would their software have as much of an impact if they didn't write so much (and well)? I say no.

Case in point. Both Marco and I wrote blog posts riffing off the same topic two weeks ago. First, read mine, then his. It takes me a ton of words to get something basic out. His is concise and clear. Sheesh.

So kids, don't shortchange those liberal arts classes. It's not fluffy stuff on the side. That is core. You need to write well to be a good engineer.

Protip: the best way to do this is to major in the humanities. They write like crazy over there. Be like my friend Dan Chu and major in history, but secretly take CS courses on the side. If you're super smart like him, and can manage getting both degrees, then you're awesome. But don't sacrifice the BA for the BS. I would love to talk to a candidate with a History BA and a Computer Science MS.

Switching to Static

Nikola Tesla

For those of you who frequently read this blog (n ≈ 0) you'll notice that it looks a bit different. I've moved it from my own hosted Wordpress instance to static pages generated by Nikola and hosted up on Github.

Why bother? I didn't like my blog being something I wouldn't be proud to write or operate myself. Wordpress was overkill and not worth the trouble:

  • Wordpress is database driven for dynamic sites. My little blog isn't dynamic at all. Once you buy into the static idea, lots of other things fall into place.

  • Comments and account spam are still a nuisance. The nice people at Akismet have done a great job holding back the tide, but I still get at least one bogus registration a day. I expect a service like Disqus or Facebook or Google+ will be more likely to keep pace with the spammers.

  • I want to author in markdown; maybe even ReST someday. All this wysiwyg-ish stuff in browsers is for the birds.

  • I don't actually need to host anything. I'm not running my own IRC server or anything. Even though Dreamhost has been a perfectly good service, I just want something... well, less Wordpress-ey.

Choosing

So I need to pick a static web site hosting package, specifically one that is good for blogs. The big things I was looking for: static site generation, Markdown/ReST support, and ability to host on free services like Github Pages.

In addition I wanted these things (in no particular order):

  • active development

  • in real use on real sites

  • easy customization, ideally by theming

  • ability to move over posts and comments, probably via Wordpress's export function.

  • python, what I program in for fun

  • a bit extensible

I considered three options before settling on Nikola. There are a ton of static site generators, even when you limit them to Python. But the only alternatives I seriously looked at were Pelican and Hyde. Of these Nikola seems to have the most active development and the richest set of features. I was especially drawn by their good documentation and theming.

Up and Running

Here are my notes of what it took. This isn't a nicely-written howto, but if you want to see the blow-by-blow, read on.

  1. The software was easy to set up and get started

    • pip install nikola did the right thing. Of course within a virtual environment, you have to keep the lab clean!

    • Along the way I found a few other requirements, like requests, phpserialize, and markdown. These were easy to find/fix because of helpful error messages.

    • This is all now captured in the project's requirements.txt.

  2. Initial import from wordpress XML dump went ok but a fair bit of manual HTML prettification took some time.

    • The slugs aren't how I like them: 201311seven-things.wp instead of seven-things.wp. I tried a bit to do mass rewrite of those, but couldn't easily do that without breaking everything, so just decided to let this go for now. I'll just do slugs without dates going forward.

    • I considered html2text but figured I'd just let the HTML slide for now. Too much trouble to go back and clean all that up now.

    • The "wp" extension, by default, is mapped to markdown instead of html. Odd but easy to change.

    • Nikola has a nice simple redirect facility that the importer pre-populated for me.

  3. Migrate threads to Disqus - clean and simple, worked right out of the box.

    • wordpress plugin has a nice import facility. Said it'd take a day, but didn't take that long at all. Aside from one thread with 75 comments, not much else that should cause them much heartache.

    • nice simple URL mapping utility

  4. Publishing to github. Again, easy peasy. The way to make this work is to put all your static files in a repo titled USERNAME.github.io. In my case: sefk.github.io. Just a couple of "tricks" were needed to make this work smoothly:

    • So that it would correctly serve my own domain, you have to put a CNAME file in the root of the repo

    • I put my nikola themes and source files in the "src" subdirectory, and changed the OUTPUT_DIR configuration option to ... Github Pages wants things in the root.

  5. The final switch. I moved my domain from Dreamhost over to Gandi, which is the registrar I like to use. They throw in basic DNS with registration, which is sufficient for me.
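On the dated slugs from step 2: the rename itself is the easy part, and the hard part is keeping old links alive. A small Python sketch of the mapping, assuming every dated slug is exactly six digits (YYYYMM) glued onto the front of the real slug, as in 201311seven-things. The names strip_date and redirect_pairs are made up, and this does not touch Nikola's own files or configuration.

```python
import re

# Assumed pattern: six leading digits (YYYYMM) followed directly by a letter.
DATED_SLUG = re.compile(r"^\d{6}(?=[a-z])")


def strip_date(slug):
    """Drop a leading YYYYMM prefix, e.g. 201311seven-things -> seven-things."""
    return DATED_SLUG.sub("", slug)


def redirect_pairs(slugs):
    """Map each old dated slug to its undated form, skipping unchanged ones."""
    pairs = []
    for old in slugs:
        new = strip_date(old)
        if new != old:
            pairs.append((old, new))
    return pairs
```

Pairs like these could be the raw material for redirects from the old URLs to the new ones, so a mass rename would not break existing links. Slugs that merely start with a year, such as 2013-in-review, are left alone because the pattern wants six digits followed by a letter.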

So There You Have It

So far I am really pleased with the result. It's a bit more work to publish, but for me that's natural and what I want. I guess I wouldn't recommend this to someone who isn't adept with all these bits (virtual environments, DNS, github repos, etc.) but for me that's no problem.

Thanks to: