The Schema Lies: When Outlines Disagree With Drafts

I've been building a regional economic publication out of three government data sources, two federal and one state. Each one ships with a schema. Each schema is wrong.

Not maliciously wrong. The Bureau of Labor Statistics drops a single flat file with industry codes that map to a known taxonomy. The Colorado Secretary of State exposes 275,000 business entities with a tidy column for city. The IRS provides nonprofit filings with EINs you can join on. The schemas all look clean in the README. None of them tell you that "Colorado Springs" is spelled four different ways across 275,000 rows, that 11,000 Teller County entities will leak through if you filter naively, or that some entity in the SOS file has an effectivedate of January 30, year 0003.

That last one is real. Sixteen rows with formation dates before the year 1900. The schema says effectivedate is a DATE. The data says the schema doesn't know me.

I keep thinking about this because of how much it looks like writing.

When you write a long thing, a novel or an investigative piece or a series of essays, you usually start with an outline. The outline promises shape. It says: here are the chapters, here is the arc, here are the scenes you need. You spend a few hours getting it right because it feels like the kind of work you can actually finish. And it is. You finish it. Then you start writing and the outline starts lying.

It tells you chapter four is where the antagonist's motive becomes clear. You write chapter four, and the antagonist's motive doesn't become clear, because the version of him you have on the page would never explain himself there. He isn't ready. Maybe he's never ready. Maybe his motive was supposed to come from a different character entirely. The outline didn't know. It couldn't have known. It was built by someone (you) who hadn't met these people yet, working from a schema (story structure) that doesn't bend to the particular human at the center.

The same thing happens with data. The schema is built by people who knew the system would have an effectivedate column. They didn't know that one of its rows would carry the value "0003-01-30." They wrote the column. They didn't write the world.

The craft, in both cases, is the same. You stop trusting the artifact that claims to describe the material, and you start reading the material directly. In data work this looks like running a few diagnostic queries before you publish a number. In writing it looks like reading what you actually wrote, not what you remember intending to write, and noticing the gap. In both cases the cleaner version of the work is the one that absorbed what the schema couldn't say.
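For the data half of that, here is a minimal sketch of what "a few diagnostic queries" might look like in plain Python. The sample rows are invented for illustration; only the city and effectivedate column names come from the files described above, and the helper names are hypothetical.

```python
from collections import Counter
from datetime import date

# Invented sample rows. The column names follow the essay; the values do not
# come from the real file.
rows = [
    {"city": "Colorado Springs", "effectivedate": "2014-06-02"},
    {"city": "COLORADO SPRINGS", "effectivedate": "1998-11-17"},
    {"city": "Colorado Spgs", "effectivedate": "0003-01-30"},
    {"city": "colorado springs ", "effectivedate": "2021-03-09"},
]

def spelling_variants(rows, column):
    """Count each raw value exactly as stored, with no normalization."""
    return Counter(row[column] for row in rows)

def date_range(rows, column):
    """Return the earliest and latest ISO dates found in a column."""
    parsed = [date.fromisoformat(row[column]) for row in rows]
    return min(parsed), max(parsed)

variants = spelling_variants(rows, "city")
earliest, latest = date_range(rows, "effectivedate")

print(variants)
print(earliest, latest)
```

Neither check changes anything. They only read the material back, which is the point: a count of raw spellings and a minimum date are usually enough to show where the schema and the data part ways.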

There's a temptation, when the data violates the schema, to fix the data. Cast that date string to NULL. Round that outlier to the nearest decade. Drop the sixteen pre-1900 rows because they're "obviously errors." Sometimes that's right. Often it isn't. The sixteen rows are real records. Somebody really did file them with that date, for whatever reason (typo, legacy system, intentional misfiling). They are not a violation of the world. They are a violation of your expectations of the world. The schema was your expectation in code.
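One alternative to fixing the data is to mark it and leave it alone. The sketch below assumes ISO-formatted date strings and a 1900 cutoff, and the entityid field and suspect_date flag are names I made up for the example, not columns in the real file.

```python
from datetime import date

# Assumed threshold, taken from the "before the year 1900" description.
CUTOFF = date(1900, 1, 1)

def flag_early_dates(rows, column="effectivedate", cutoff=CUTOFF):
    """Return copies of the rows with a boolean 'suspect_date' field added.

    No row is dropped and no original value is changed.
    """
    flagged = []
    for row in rows:
        parsed = date.fromisoformat(row[column])
        flagged.append({**row, "suspect_date": parsed < cutoff})
    return flagged

# Invented sample records for illustration.
sample = [
    {"entityid": "A1", "effectivedate": "0003-01-30"},
    {"entityid": "B2", "effectivedate": "2009-08-14"},
]
result = flag_early_dates(sample)
```

The odd record stays in the file with its original date intact, and anything downstream can decide for itself whether to include it.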

The writing equivalent is the temptation to force the chapter to do what the outline said it should do. To make the antagonist explain himself even though he wouldn't, because you wrote down that he would. That isn't craft. That's stubbornness about a past version of yourself. The chapter you wrote is the truer document, and the outline is now a relic. Trust the chapter. Throw the outline out. Make a new one.

This is one of the quiet truths that separates people who get good at long work from people who don't. Beginners trust the plan. People who've been at it a while trust the work. The plan is a useful provocation, a starting move, a way to get going. But the moment the work disagrees with the plan, the plan is the part that's wrong.

There's a related move I keep making in the data work that I now realize I learned from writing. Exclude before you include. When I started filtering El Paso County entities by city name, I tried to match the cities I wanted. The result was a leak. About 11,000 Teller County rows snuck in because their city names shared substrings with my list. The fix was to invert the logic. Name the cities I knew should not be in the file, exclude them first, then do the broader match. The world doesn't sort itself into your includes. It sorts itself into your excludes.
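A rough sketch of the inversion, under loud assumptions: the city names here are invented placeholders, not the actual El Paso or Teller County names involved, and the function names are hypothetical. It only shows the shape of the logic.

```python
# Invented placeholder names. The real include and exclude lists are not
# reproduced here.
INCLUDE_SUBSTRINGS = ["NORTHFIELD", "LAKESIDE"]
EXCLUDE_EXACT = {"NORTHFIELD HEIGHTS", "LAKESIDE PARK"}

def normalize(city):
    """Uppercase and collapse whitespace so casing and padding are ignored."""
    return " ".join(city.upper().split())

def naive_filter(cities):
    """Include-only matching: keep anything containing a wanted substring."""
    return [
        c for c in cities
        if any(s in normalize(c) for s in INCLUDE_SUBSTRINGS)
    ]

def exclude_then_include(cities):
    """Drop known out-of-scope names first, then apply the broader match."""
    remaining = [c for c in cities if normalize(c) not in EXCLUDE_EXACT]
    return naive_filter(remaining)

cities = ["Northfield", "northfield heights", "Lakeside", "Lakeside Park", "Elsewhere"]
leaky = naive_filter(cities)
clean = exclude_then_include(cities)

print(leaky)
print(clean)
```

The include-only version keeps "northfield heights" and "Lakeside Park" because they contain a wanted substring. Running the exclusion first removes them before the loose match ever sees them.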

The same is true on a page. You don't write your way to a clean draft by figuring out what to add. You get there by figuring out what to remove. The first pass is a 6,000-word essay. The second pass is the 1,500-word essay it actually was, with the other 4,500 words taken out. The shape was always in there. The shape only revealed itself once you cut the things that were obviously not the shape.

I've been doing both kinds of work this past week, the investigative data work and the literal kind of writing this site is supposed to be about, and the longer I do them in parallel the more they feel like the same craft. The discipline isn't planning. It's listening to what you actually made, and being willing to throw away the document that promised it would be something else.

The schema lies. The data is the truth. The outline lies. The draft is the truth. Trust the artifact you made, not the artifact that told you what you were going to make.