pdf_splitr.py

One problem with scanning books for academic purposes is that one often ends up with two pages side by side on a single PDF page. This makes the scan hard to OCR, read, annotate, or process.

pdf_splitr is a (very) simple Python tool, which I ‘wrote’ with cursor.ai and Claude. It splits each page into left and right halves, while preserving annotations and handling different page sizes. It uses the Media Box, so as not to change the resulting file size.
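The halving step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not pdf_splitr's actual code: the Box type and the split_media_box function are hypothetical names, and a real script would read and write the boxes through a PDF library rather than plain tuples.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """A PDF rectangle in points: lower-left and upper-right corners."""
    left: float
    bottom: float
    right: float
    top: float


def split_media_box(box: Box) -> tuple[Box, Box]:
    """Return the left and right halves of a page's media box."""
    middle = (box.left + box.right) / 2
    left_half = Box(box.left, box.bottom, middle, box.top)
    right_half = Box(middle, box.bottom, box.right, box.top)
    return left_half, right_half


# A landscape scan holding two US Letter pages side by side (1224 x 792 pt).
spread = Box(0, 0, 1224, 792)
left_page, right_page = split_media_box(spread)
print(left_page)
print(right_page)
```

Because only the box changes, the page content itself is not re-encoded; each output page simply shows a different half of the same scan.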

It runs from the command line or as a drag-and-drop macOS app built with Automator, making it easy to turn two-page scans into single-page PDFs.

wordwright.py

When I was a graduate student, my supervisor sat me down one day and confessed that he used WordPerfect’s spell-checking window because it helped him find passive voice. He, like me, overused it.

From that moment on, I’ve found many ways to automate the flagging of passive voice in my writing. I’ve written scripts to find it in Tinderbox and BBEdit. I’ve written scripts to find words I don’t want or that are redundant. But those scripts only flag; I still have to read and remove the words by hand. Sometimes, this forces me to think of a better way of saying something. Sometimes, the words can be deleted without much care.
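A flagging script of this kind can be very small. The sketch below is an assumption-laden illustration, not one of the scripts mentioned here: the word list is made up, and the passive-voice pattern is a rough heuristic (a form of "to be" followed by a word ending in -ed or -en) that misses irregular participles and raises some false alarms.

```python
import re

# Illustrative list only; a real list would be personal and much longer.
UNWANTED = ["very", "really", "in order to", "basically"]

# Rough heuristic: a form of "to be" followed by a word ending in -ed or -en.
PASSIVE = re.compile(
    r"\b(?:is|are|was|were|be|been|being)\s+\w+(?:ed|en)\b", re.IGNORECASE
)


def flag(text: str) -> list[tuple[str, str]]:
    """Return (kind, matched text) pairs for review by hand."""
    hits = []
    for phrase in UNWANTED:
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        hits += [("unwanted", m.group(0)) for m in pattern.finditer(text)]
    hits += [("passive", m.group(0)) for m in PASSIVE.finditer(text)]
    return hits


sample = "The draft was written in order to be read. It is very rough."
for kind, match in flag(sample):
    print(f"{kind}: {match}")
```

The script only points; deciding whether to rewrite or delete is still the writer's job.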

WordWright.py is a collection of Python scripts that automates the editing steps I use on a regular basis. It’s a variation on what I described a few years ago in Writer’s Diary #09: On Freewriting a First Draft.

Put simply: I free-wrote > used ChatGPT to fix typos > used DeepL to make minor changes > used ProWritingAid to remove adverbs and redundant expressions and make minor stylistic fixes.

WordWright automates this process, except for the ProWritingAid step. With a keyboard shortcut, I can write a paragraph, then use WordWright to grammar check the text and remove stylistic bugaboos.

It’s not so different from what John McPhee describes in his New Yorker article on Structure, or in the book Draft No. 4. He uses tools to find duplicate expressions.

Writing is not one step; it is many steps. Hundreds. WordWright helps make a few of those steps easier, but it won’t help you figure out what you think, make your ideas your own, rework them, or make them sing. That takes time, at a desk, doing the work.

I don’t find that WordWright fundamentally changes this process.

But, it is a little easier to get into a state of flow, because I don’t have to stop and go to different apps to fix typos, grammar, adverbs, or overused expressions.

In fact, prolific writers have wonderful editors: sometimes it’s a spouse, an assistant, or a publisher.

What WordWright offers is not so much a first reader as a first editor. Never the last, mind.

Using AI this way is, in my mind, not so different from a spell checker or WordPerfect’s passive voice checker. Just more powerful.

WordWright GitHub page.

dejatext.py

DejaText is a Python script for finding duplicate and similar text in a directory of text or Markdown files. It scans the .txt or .md files, identifies duplicate and similar text segments, and produces organized reports for easy review. As part of my writing, I find it useful to go through a project and flag repeated words, phrases, or sentences. DejaText helps me with this.
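The core idea can be sketched with the standard library. This is a minimal illustration, not DejaText's actual code: the function names, the sentence-level segments, and the 0.8 similarity threshold are all assumptions, and comparing every pair of segments would be slow on a large project.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(sentence.lower().split())


def find_repeats(segments: dict[str, list[str]], threshold: float = 0.8):
    """segments maps a file name to its sentences.

    Returns (exact, similar): exact maps a normalized sentence to every file
    occurrence of it; similar lists (ratio, text_a, text_b) for near matches.
    """
    seen = defaultdict(list)
    for name, sentences in segments.items():
        for sentence in sentences:
            seen[normalize(sentence)].append(name)

    exact = {text: files for text, files in seen.items() if len(files) > 1}

    similar = []
    for a, b in combinations(sorted(seen), 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            similar.append((round(ratio, 2), a, b))
    return exact, similar


notes = {
    "a.md": ["Writing is many steps.", "Flow matters."],
    "b.md": ["writing is  many steps.", "Writing is many small steps."],
}
exact, similar = find_repeats(notes)
print(exact)
print(similar)
```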

q_transcribe.py

I want to introduce q_transcribe, a simple tool to transcribe images using QWEN 2 VL AI models.

What did I do to write q_transcribe? I’ve added some simple logic to a CLI wrapper that Andy Janco wrote to run QWEN 2 VL from the CLI.

q_transcribe can be used to transcribe typed and handwritten text from any image.

How could it be used?

  • Transcribe handwritten notes. One of the methods I use is freewriting longhand. Notetaking is often the first step in my writing process. But, at times, it can feel like a slog to transcribe 20 pages of handwritten notes. Enter q_transcribe.
  • Transcribe handwritten archives. One of the projects I am working on with colleagues is an archival project in Colombia. We’re using QWEN 2B to extract text from images as part of a longer pipeline.

q_transcribe is a simplification of our workflow, which works on an image, a folder of images, or a folder of folders of images.

What is my contribution? I added logic to Andy Janco’s CLI wrapper around QWEN 2 VL’s sample code. My logic handles JPG, JPEG, or PNG files, sorts them, skips files that have already been transcribed, and chooses between CUDA (Nvidia GPU), MPS (Apple Silicon GPU), or CPU.
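That file-handling logic might look roughly like the sketch below. It is not the code from the repository: the function names are hypothetical, and it assumes that a transcript is saved as a .txt file next to the image with the same stem, which may not match what q_transcribe actually does.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def pending_images(folder: Path) -> list[Path]:
    """Sorted images under folder that do not yet have a transcript.

    Assumption: a transcript is a .txt file beside the image, same stem.
    """
    images = sorted(
        p for p in folder.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    return [p for p in images if not p.with_suffix(".txt").exists()]


def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer an Nvidia GPU, then Apple Silicon, then fall back to the CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


print(pick_device(cuda_available=False, mps_available=True))
```

With PyTorch installed, the two flags would come from torch.cuda.is_available() and torch.backends.mps.is_available(); they are passed in as plain booleans here so the sketch runs without a GPU stack.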

In my testing, it works with QWEN 2B on my M1 MacBook Pro with 16 GB of RAM, and on a https://lightning.ai server which offers free access to a GPU for researchers.

To install, clone the repository from GitHub, install the necessary dependencies, and then run.

   git clone https://github.com/dtubb/q_transcribe.git
   cd q_transcribe
   pip install -r requirements.txt
   python q_transcribe.py images

structur.py

Structur is a simple, Python-based command-line tool to help extract and organize coded text from research notes.

I’ve been using it for a year now, from the Finder. It’s useful for finding the structure of longer pieces of text.

I was inspired by John McPhee’s writing process, which he describes in Draft No. 4.

Structur exploded my notes. It read the codes by which each note was given a destination or destinations (including the dustbin). It created and named as many new Kedit files as there were codes, and, of course, it preserved intact the original set. In my first I.B.M. computer, Structur took about four minutes to sift and separate fifty thousand words. My first computer cost five thousand dollars. I called it a five-thousand-dollar pair of scissors.

Structur is my take on what McPhee describes.
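The sorting step McPhee describes can be sketched briefly. This is an illustration only, not Structur.py's code or its actual tag syntax: the [[code]] convention, the blank-line note separator, and the "uncoded" bucket are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical convention: a note carries one or more [[code]] tags.
TAG = re.compile(r"\[\[(.+?)\]\]")


def sort_notes(text: str) -> dict[str, list[str]]:
    """Group blank-line-separated notes under each code they carry."""
    by_code = defaultdict(list)
    for note in text.split("\n\n"):
        note = note.strip()
        if not note:
            continue
        codes = TAG.findall(note)
        body = TAG.sub("", note).strip()
        # A note with several codes is copied to every destination.
        for code in codes or ["uncoded"]:
            by_code[code].append(body)
    return dict(by_code)


notes = """[[methods]] First note about methods.

[[methods]][[fieldwork]] A note that belongs in two places.

A stray thought with no code."""

for code, bodies in sort_notes(notes).items():
    print(f"{code}.txt: {len(bodies)} note(s)")
```

Writing each list out to its own file, while leaving the original notes untouched, would complete the pair of scissors.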

It is available on GitHub.

Cite as:

Tubb, Daniel. Structur.py. GitHub, 2024. https://github.com/dtubb/structur.