This "project" did not start with the ambition to build a generic document classifier or to compete with existing document management systems. It started with a much more personal and probably familiar situation.
I wanted to explore whether machine learning could help me organize my PDFs better. Not by reminding me of deadlines or summarizing documents, but by assisting with a very concrete task: taking a new PDF and proposing the folder where it belongs, based on how similar documents were filed in the past. And for safety reasons, the system should not act autonomously but make suggestions that I could understand and approve.
Over the years, I accumulated a large number of PDF documents. Bank statements, insurance contracts, correspondence with authorities, job-related paperwork, car documents, utility bills, and all the small things that somehow feel important enough not to delete. Everything is stored in a folder structure that grew organically over time. Some parts were carefully organized, others were created in a hurry, and many folders reflected assumptions I no longer remembered making.
The result is a creative mess. Structured enough to look intentional, chaotic enough to guarantee that I would never find a document again when I actually needed it.
And of course, privacy was a hard requirement. I know, sharing is caring, but my bank statements don't belong in a public cloud, so everything had to run locally. And finally, I didn't want to build something I often criticize: an AI that is really just a rules engine.
First approach: TF-IDF and centroid-based classification
I started with TF-IDF, not because it is fashionable, but because it is a well-understood baseline that forces you to think clearly about representations.
In this setup, each document is represented as a vector of term weights. The TF-IDF score increases when a term is frequent in a document but rare across the corpus. For each folder, I computed a centroid by averaging the vectors of all documents assigned to that folder. Classification then becomes a cosine similarity problem: take the vector of a new document and find the folder centroid it is closest to.
This approach has some attractive properties. It is interpretable. It is fast. It is easy to implement. And in domains where vocabulary differences actually matter, it can work surprisingly well.
However, it failed almost immediately in my use case, and for very fundamental reasons.
Most of my documents, especially bank statements, insurance contracts, and utility correspondence, share an overwhelming amount of boilerplate text. Headers, footers, legal disclaimers, standardized wording, page numbers, and repeated phrases dominate the content. When you average TF-IDF vectors across documents in the same folder, the centroid becomes a representation of the template rather than the individual identity of the document.
As a consequence, centroids for different folders belonging to the same issuer or document type end up extremely close to each other. Cosine similarity scores approach one, not because the documents truly belong together, but because they are dominated by the same repeated terms. From a mathematical perspective, the model behaves correctly. From a practical perspective, it becomes confidently wrong.
This was not a tuning problem. It was a structural limitation of the representation.
Moving to embeddings and why that alone was not sufficient
The next step was to move away from bag-of-words representations and towards semantic embeddings. Modern embedding models capture more than token frequency. They encode phrasing, intent, and contextual relationships, and they are much more robust to noise caused by repetitive text.
To keep everything local, I used Ollama to generate embeddings with models such as bge-m3 and mxbai-embed-large. The integration was straightforward, and the improvement over TF-IDF was immediately visible. The top-K suggestions became more plausible, and the system stopped collapsing everything into near-identical scores.
Still, a critical problem remained.
Many of my folders are distinguished by identifiers that are semantically uninteresting but operationally crucial. Account numbers, policy numbers, customer IDs, or contract references are exactly the information that determines where a document belongs. From the perspective of a language model, these identifiers are just long digit sequences with no inherent meaning. They do not influence semantic similarity in a strong way, especially when they are surrounded by pages of boilerplate text.
This led to a particularly instructive failure case. A document clearly contained the account number 123456789. The correct folder also contained this number in its path. Yet the embedding-based classifier preferred folders for other accounts because those folders contained more documents and therefore produced a more dominant centroid.
The model was not missing the number. It simply did not know that this number mattered.
Why the model cannot “just learn that”
At this point, it is tempting to say that the model should learn this automatically. In practice, that expectation is unrealistic.
Embedding models optimize for semantic similarity, not for alignment with a private folder taxonomy. They do not know which parts of the text represent identity keys and which parts are decorative. Without additional structure, the signal from a single identifier is drowned out by the surrounding template.
This is not a limitation of the specific model. It is a consequence of how the problem is posed.
Using the taxonomy as weak supervision
The key insight was that I already had a form of supervision available, encoded directly in my folder structure.
Folder paths like bank/xyz/accounts/123456789/statements are not arbitrary strings. They contain discriminators that humans intentionally used to separate documents. Instead of hardcoding document-type rules, I extracted these discriminators automatically from the label paths themselves.
During training, the system scans each folder label and extracts candidate anchors such as long digit sequences or long alphanumeric identifiers that appear in only a small number of labels. These anchors are stored as part of the model.
During prediction, the document text is scanned for these same anchors. If a document contains an anchor that exists in a folder path, the similarity score for that folder receives a small, capped boost. The embedding similarity remains the primary signal. The anchor only helps break ties where the semantic representation is ambiguous.
This approach is intentionally conservative. It does not force a routing decision. It does not depend on document type. It simply aligns the model’s representation with the structure of the taxonomy it is supposed to learn.
In the account number example, this immediately resolved the issue. The correct folder contained the same identifier as the document, and the boosted score was enough to move it to the top of the ranking.
Why this is not a rules engine
It is important to draw a clear distinction here.
A rules engine encodes domain knowledge explicitly and requires continuous maintenance as document types evolve. What I implemented is taxonomy-driven weak supervision. The system does not know what an account number or policy number is. It only knows that certain tokens appear in both the document and the folder structure, and that this coincidence is statistically informative.
The same mechanism applies to insurance contracts, real estate documents, or vehicle-related paperwork, as long as the folder taxonomy contains the relevant identifiers. If it does not, then the problem is genuinely underdetermined, and no amount of machine learning will recover information that is not there.
Keeping a human in the loop
The final design decision was to keep an explicit approval step. The system presents a ranked list of candidate folders and asks the user to select one or skip the document.
This is not a compromise. It is an acknowledgment of uncertainty. In many cases, the model is confident and correct. In others, the ambiguity is real, and pretending otherwise would only create silent errors. Those errors would compound: misfiled documents would shift the centroids in the wrong direction at the next training run, slowly degrading accuracy.
The approval step also makes the system explainable. When a folder is suggested, I can see whether the decision was driven by semantic similarity, by shared anchors, or by both.
What this experiment reinforced for me
First, document classification is not difficult because the models are weak. It is difficult because the data is boring. Boilerplate-heavy documents break naive assumptions very effectively.
Second, representation matters more than model choice. A strong model with the wrong input will fail just as reliably as a simple one.
Third, taxonomies are not legacy artifacts. They are condensed human knowledge, and treating them as such opens up surprisingly robust solutions.
In the end, this was not about building a universally smart classifier or proving that a particular model performs well on a benchmark. It was about exploring how taxonomy, representation, and a small amount of structure interact in practice. The system is not flawless, but it is understandable, predictable, and genuinely helpful in day-to-day use.
For me, it was also simply a good weekend. I learned something new, validated a few assumptions, corrected others, and ended up with a tool that solves a real problem I actually had. That is usually a much better outcome than chasing theoretical elegance for its own sake.
And finally, I didn’t vibe code it at all.

