Fichero — Daniel Tubb

Fichero transcribes and auto-catalogues historical archives using vision large language models and artificial intelligence, running locally or in the cloud. It is in early development. To learn more about Fichero, please read the Fichero development blog and the frequently asked questions below.

Releases

0.1.0-dev -- Ongoing

I am working towards a new release. Follow along here.

0.0.3 -- June 5, 2025

Added support for multi-step LLM processing, using llm_process.py, llm_utils.py, and a config file. Tested with OpenAI models. The config file sets the prompts, chooses batch sizes, decides whether pages are sent individually or in groups up to a chunk size based on max tokens, and can combine results and send previous results.
Added a config file for auto-cataloguing. It uses LLMs to generate named entity recognition for people, places, organizations, and events, as well as specific things we are interested in (e.g. mines, animals, plants, injuries); a timeline of events; the most important people; ten tags or keywords; and a 150-word summary.
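
To illustrate the idea of sending pages in groups up to a chunk size based on max tokens, here is a minimal sketch. It is not the code in llm_process.py: the function names `estimate_tokens` and `chunk_pages`, and the rule of thumb of roughly four characters per token, are assumptions made for this example only.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    """Rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def chunk_pages(pages: List[str], max_tokens: int) -> List[List[str]]:
    """Group consecutive pages so each group stays within max_tokens.

    A single page that is larger than max_tokens is still sent,
    on its own, rather than being dropped or split.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    used = 0
    for page in pages:
        cost = estimate_tokens(page)
        # Close the current group if adding this page would exceed the budget.
        if current and used + cost > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(page)
        used += cost
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    # Hypothetical page transcriptions of about 10 tokens each.
    pages = ["a" * 40, "b" * 40, "c" * 40]
    for group in chunk_pages(pages, max_tokens=20):
        print(len(group), "page(s) in this request")
```

Setting `max_tokens` very low has the same effect as sending pages individually, so one setting can cover both modes in a sketch like this.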

0.0.2 -- May 25, 2025

Added support for running on multiple processors in parallel, and for making asynchronous transcription calls.
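
As a rough sketch of what asynchronous transcription calls can look like, the example below runs several calls concurrently with a cap on how many are in flight at once. It is not Fichero's implementation: `transcribe_page` is a hypothetical placeholder that sleeps instead of calling a model, and `max_concurrent` is an assumed parameter name.

```python
import asyncio
from typing import List


async def transcribe_page(page_id: str) -> str:
    """Hypothetical stand-in for a real transcription call to a model."""
    await asyncio.sleep(0.01)  # simulates waiting on a network response
    return f"transcript of {page_id}"


async def transcribe_all(page_ids: List[str], max_concurrent: int = 4) -> List[str]:
    """Transcribe pages concurrently, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(page_id: str) -> str:
        # The semaphore limits how many calls run at the same moment.
        async with semaphore:
            return await transcribe_page(page_id)

    # gather returns results in the same order as the input page_ids.
    return list(await asyncio.gather(*(bounded(p) for p in page_ids)))


if __name__ == "__main__":
    results = asyncio.run(transcribe_all(["page-1", "page-2", "page-3"], max_concurrent=2))
    for line in results:
        print(line)
```

Capping concurrency is a common way to stay within the rate limits of a cloud model, or within the memory of a local one.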

0.0.1 -- May 9, 2025

Initial development release, which crops, splits, rotates, enhances, removes background, and transcribes.

Frequently Asked Questions

What is 'Fichero'?

It's a digital tool being developed by me, Daniel Tubb, and Andy Janco, with support from many others. I am an anthropologist who has published work on the Choco, in Colombia, and I have some experience with the programming language Python. Andy Janco is a systems engineer who also holds a PhD in history. I am based at the University of New Brunswick in Canada, and Andy is at Princeton. Fichero is still very much a work in progress! About six people are testing it and providing input to Andy and me.

What does Fichero do?

Fichero can process digitized documents: it first transcribes them and then catalogues them automatically. This produces basic metadata so we can continue working with the documents.

What has our collaborative process been like?

Andy and I began working with a group of researchers and young people working on or from the Choco in 2023, in the context of a British Library project -- EAP1477 -- where we held a workshop on how to produce metadata and develop a catalogue. With the Semillero de Jovenes de Muntu Bantu -- a social foundation focused on the African Diaspora -- we began cataloguing with the goal of producing basic metadata for all the case files we had digitized. As part of that project, beyond cataloguing, each of us selected a case and developed a historical analysis, and we published our essays in a short book.