Writing Diary #53: Cleaning Up Repeated Text

What makes a book different from an article? What makes a book different from an essay? One thing: it’s long. In my case, what I thought would be one book seems to be becoming two or maybe three. Over the years I’ve been working on it, it has grown to almost a million words. This creates a real writing challenge: how to organize, cut, edit, and work with the detritus of various versions, drafts, false starts, and notes on ideas. It all piles up in a chaotic, Escher-like jumble that is so overwhelming it leaves me lost. One problem is order. Another is duplication.

Order can be solved by coding: putting things about the same topic together. This addresses a problem created by writing different iterations of these books over quite a long time. Ideas have sometimes flourished in different drafts, in different versions of the same text, living in different places. I’ve long since lost track of where things are.

Pragmatically, they’re in a big Tinderbox file. But, how to turn that into a book?

When it comes to the final steps of editing, tightening, and turning rough ideas into book form, one step that is both banal and annoying is the general cutting of duplicate text.

There are different ways of doing this.

For a long time I’ve done it by hand. It’s not very efficient when dealing with so much text.

More recently, I have been using a script I wrote called structur.py. With structur.py I code text and send paragraphs and pieces of text to the right place, putting ideas about similar topics together. However, once structur.py has worked its magic, there is still a problem: I still have a lot of duplicate text. One way to deal with this is to keep coding until all the similar text is in the same place, then delete it. This is what I did, but it takes a long time just to read 10,000 words, let alone 30,000 or 100,000, which is my problem. What I want is a single copy of each piece of text, plus any good sentences I might want to put together with it.
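For anyone curious about the mechanics, the coding idea looks roughly like this. This is only a sketch, not structur.py itself; it assumes a code such as {pencils} written at the start of a paragraph and one output file per code, which is my own simplification for illustration.

```python
# Sketch of routing coded paragraphs to per-topic files.
# The {code} marker syntax and output layout are assumptions for
# illustration, not the actual structur.py.
import re
from pathlib import Path

CODE_RE = re.compile(r"^\{(?P<code>[\w-]+)\}\s*(?P<text>.*)", re.DOTALL)

def route_paragraphs(source_dir: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in Path(source_dir).glob("*.txt"):
        for para in path.read_text(encoding="utf-8").split("\n\n"):
            match = CODE_RE.match(para.strip())
            if match:
                # Append the coded paragraph to a file named after its code.
                target = out / f"{match.group('code')}.txt"
                with target.open("a", encoding="utf-8") as f:
                    f.write(match.group("text").strip() + "\n\n")

# route_paragraphs("notes", "coded")
```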

A few days ago, I put together a script called DejaText.py that flags files, paragraphs, sentences and even words that are duplicated. This is super useful for identifying where text is being reused in multiple places. The result was somewhat shocking. My drafts were full of duplicates.
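The flagging idea is simple at its core. Here is a minimal sketch of duplicate-paragraph detection, not the actual DejaText.py (which also works at the file, sentence, and word level), just enough to show how repeated text can be spotted across a folder of drafts.

```python
# Sketch of flagging paragraphs that appear in more than one place.
# Illustrative only; not the real DejaText.py.
from collections import defaultdict
from pathlib import Path

def find_duplicate_paragraphs(folder: str) -> dict[str, list[str]]:
    seen = defaultdict(list)
    for path in Path(folder).glob("*.txt"):
        for para in path.read_text(encoding="utf-8").split("\n\n"):
            key = " ".join(para.split()).lower()  # normalise whitespace and case
            if key:
                seen[key].append(path.name)
    # Keep only paragraphs that occur more than once.
    return {para: files for para, files in seen.items() if len(files) > 1}

# for para, files in find_duplicate_paragraphs("drafts").items():
#     print(f"{len(files)}x in {sorted(set(files))}: {para[:60]}")
```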

This morning, as I was trying to code six files to deal with the most egregious case of duplicated paragraphs and sentences, I realised that I was coding the same text over and over again. Why not write a script that deletes the second and subsequent instances of repeated paragraphs and sentences? It’s dangerous. But, point it at a temporary folder, and it works. I call it dejatext_cleanup.py. It’s on GitHub.
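In outline, the cleanup logic is something like the sketch below. Again, this is an illustration of the idea rather than the script itself; the assumption is that the first instance of a paragraph is kept, later repeats are dropped, and the results go to a fresh folder so the originals stay untouched.

```python
# Sketch of keeping the first instance of each paragraph and dropping
# later repeats. Illustrative only; the real dejatext_cleanup.py is on GitHub.
from pathlib import Path

def drop_repeated_paragraphs(source_dir: str, out_dir: str) -> None:
    seen: set[str] = set()
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(source_dir).glob("*.txt")):
        kept = []
        for para in path.read_text(encoding="utf-8").split("\n\n"):
            key = " ".join(para.split()).lower()
            if key and key not in seen:
                seen.add(key)
                kept.append(para.strip())
        # Write the de-duplicated version to a separate folder.
        (out / path.name).write_text("\n\n".join(kept) + "\n", encoding="utf-8")

# drop_repeated_paragraphs("drafts", "deduped")  # point it at a temporary folder
```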

This combination of deleting duplicate text, and then coding with structur.py, allowed me to cut 35,000 words down to 13,000 words. The scripts will speed things up considerably. How to proceed? The way forward is to work at the level of what I’ve already organised. Take a section, say a section on pencils. Delete duplicate text. Code the remainder. Put the codes in order. Then, I’ll have a complete section on pencils. Revise it a couple of times, and then I’ll have a draft.

These are power tools. They’re dangerous. But, they will speed up turning a folder of notes into a book.