The main goal of the CLARIAH-AT project Esperanto Newspaper Excerpts was to create full texts for newspaper articles from the Hachette collection. This collection contains articles about Esperanto from the period 1898 to 1915, which are held at the Department of Planned Languages and Esperanto Museum of the Austrian National Library (ONB). In this blog post we present the starting point, progress and results of the year-long research project.

The Hachette collection

The collection of newspaper cuttings processed in this project consists of around 17,000 articles taken from magazines published mainly in France, but also in many other European countries between 1898 and 1915. The articles themselves report on events and people related to Esperanto, e.g., reports from regional, national or world congresses, and therefore provide an excellent and unique opportunity to study the history of the Esperanto movement in Europe in the early 20th century.

Research questions

There is great interest in this period among interlinguists and historians. As a result of the project, this important collection of Esperanto articles became searchable. This enables further research as the text is now accessible for analysis using digital methods. Finally, we complemented the collection by moving from simple images to images with text files and metadata all in one place.

Overview of the pipeline

In the project, we developed a pipeline that extracts full texts from the original images. This pipeline consists of four steps:

  1. Segmentation of the images belonging to an article into individual text boxes
  2. Rotating the boxes so that the text is horizontal
  3. Applying Tesseract OCR to each box for text recognition
  4. Creating a IIIF metadata manifest per article from the generated text files

In the following, we would like to go into more detail about the individual steps in the hope that it may serve as inspiration for projects with similar historical material.

For the layout analysis task, we used the YOLOv8 model (see here for more information), which is a powerful model for common tasks such as object detection, image classification and segmentation. To improve the quality of layout recognition, we fine-tuned the model with manually annotated images. The model was then applied to all images in the dataset, and all areas annotated with “text” were passed on to the next stage. See Figure 1 for a visual representation of the data generated in this step.
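
The hand-over from layout analysis to the next stage can be pictured roughly as follows. This is only a minimal sketch, assuming the model output has already been converted into plain Python values; the names `Detection` and `select_text_boxes` and the confidence threshold of 0.5 are hypothetical and are not taken from the project code.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    """One region proposed by the layout model (hypothetical structure)."""
    label: str         # class name assigned by the model, e.g. "text"
    confidence: float  # detection score between 0 and 1
    box: tuple         # (x_min, y_min, x_max, y_max) in pixels


def select_text_boxes(detections, min_confidence=0.5):
    """Keep regions labelled "text" above an assumed confidence threshold.

    The result is sorted by the top-left corner (y first, then x) only to
    make the order deterministic; this is not a reading-order algorithm.
    """
    kept = [
        d for d in detections
        if d.label == "text" and d.confidence >= min_confidence
    ]
    return sorted(kept, key=lambda d: (d.box[1], d.box[0]))


if __name__ == "__main__":
    sample = [
        Detection("text", 0.91, (10, 200, 100, 300)),
        Detection("image", 0.95, (0, 0, 50, 50)),
        Detection("text", 0.30, (10, 400, 100, 500)),
        Detection("text", 0.84, (10, 20, 100, 120)),
    ]
    for detection in select_text_boxes(sample):
        print(detection.box)
```

Only the two confident "text" regions survive the filter; each remaining box would then be cropped from the page image and passed on.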

Figure 1: Applying the YOLOv8 network to a section of a two-column article in the dataset. Left: original image, right: overlay of the original image and the generated text annotations.

A typical problem with the newspaper clippings in this collection is that they were not glued parallel to the paper’s edges. The quality of the full texts generated with typical OCR software decreases significantly if the input text is not horizontal. We therefore developed a Python script based on the open-source library OpenCV, which detects the direction of a text block and then rotates the image back by this angle. See Figure 2 for a demonstration of the script’s application to two sample images.

Figure 2: Application of the developed Python script to extracted text boxes. The result of the calculation is a counterclockwise rotation of 3.36° (for the first image from the left) and 2.75° (for the third image from the left).
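
To illustrate the geometry behind such a correction, the following sketch estimates the direction of a set of ink pixels and rotates them back. It is not the project's OpenCV script: it swaps in a simple principal-axis estimate in pure Python, and it assumes the pixels belong to a region that is clearly wider than it is tall (such as a single text line). The function names are hypothetical.

```python
import math


def estimate_skew_degrees(points):
    """Angle (in degrees) of the principal axis of a cloud of (x, y) points.

    With image coordinates (y grows downward), a positive result means the
    text drops toward the right. Assumes the cloud is wider than it is tall.
    """
    pts = list(points)
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    # Second central moments of the point cloud.
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    syy = sum((y - mean_y) ** 2 for _, y in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return math.degrees(0.5 * math.atan2(2.0 * sxy, sxx - syy))


def rotate_points(points, degrees, center=(0.0, 0.0)):
    """Rotate points by the given angle around center (standard 2D rotation)."""
    cx, cy = center
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [
        (cx + (x - cx) * cos_t - (y - cy) * sin_t,
         cy + (x - cx) * sin_t + (y - cy) * cos_t)
        for x, y in points
    ]


if __name__ == "__main__":
    # Synthetic "text line" tilted by 3 degrees.
    slope = math.tan(math.radians(3.0))
    line = [(float(x), x * slope) for x in range(200)]
    angle = estimate_skew_degrees(line)
    leveled = rotate_points(line, -angle)
    print(round(angle, 2), round(max(abs(y) for _, y in leveled), 6))
```

Rotating by the negative of the estimated angle brings the synthetic line back to the horizontal. A real script would operate on the image itself and would need to handle multi-line blocks, which this sketch does not.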

We used the open-source software Tesseract OCR for text recognition, as Esperanto and 21 other languages in the dataset are directly supported. In the fourth and final step, the text files belonging to all text boxes on a page are combined into a single file, and this is converted into IIIF-compliant annotations. A IIIF metadata manifest is then created for each article with one or more images, each of which contains the text annotations in question.
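
As a rough illustration of the final step, the sketch below assembles a manifest with one canvas per image and attaches the recognized text as an annotation. It assumes the IIIF Presentation API 3.0 data model; the post does not state which API version or annotation layout the project uses. All URLs, the `build_manifest` helper and the keys of the page dictionaries are hypothetical.

```python
import json


def build_manifest(base_url, label, pages):
    """Build a IIIF Presentation 3.0 style manifest as a plain dict.

    pages: list of dicts with the (hypothetical) keys
    image_url, width, height, text and language.
    """
    canvases = []
    for index, page in enumerate(pages, start=1):
        canvas_id = f"{base_url}/canvas/{index}"
        canvases.append({
            "id": canvas_id,
            "type": "Canvas",
            "width": page["width"],
            "height": page["height"],
            # The image itself is "painted" onto the canvas.
            "items": [{
                "id": f"{canvas_id}/page",
                "type": "AnnotationPage",
                "items": [{
                    "id": f"{canvas_id}/image",
                    "type": "Annotation",
                    "motivation": "painting",
                    "body": {
                        "id": page["image_url"],
                        "type": "Image",
                        "format": "image/jpeg",
                        "width": page["width"],
                        "height": page["height"],
                    },
                    "target": canvas_id,
                }],
            }],
            # The OCR full text is attached as a supplementing annotation.
            "annotations": [{
                "id": f"{canvas_id}/fulltext",
                "type": "AnnotationPage",
                "items": [{
                    "id": f"{canvas_id}/fulltext/1",
                    "type": "Annotation",
                    "motivation": "supplementing",
                    "body": {
                        "type": "TextualBody",
                        "value": page["text"],
                        "format": "text/plain",
                        "language": page["language"],
                    },
                    "target": canvas_id,
                }],
            }],
        })
    return {
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": f"{base_url}/manifest.json",
        "type": "Manifest",
        "label": {"en": [label]},
        "items": canvases,
    }


if __name__ == "__main__":
    manifest = build_manifest(
        "https://example.org/iiif/article-1",  # placeholder URL
        "Sample article",
        [{
            "image_url": "https://example.org/images/article-1-1.jpg",
            "width": 1200,
            "height": 1800,
            "text": "Combined text of all boxes on this page.",
            "language": "eo",
        }],
    )
    print(json.dumps(manifest, indent=2, ensure_ascii=False))
```

An article with several images would simply pass several page dictionaries, producing one canvas with its own text annotation per image.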

Results: ONB catalog entries, full-text search and IIIF dataset

The metadata mentioned above has been integrated into the ONB catalog system as part of the project and can be accessed using the term “Sammlung Hachette”. This allows searches to be done using the catalog’s usual filter methods and includes a reference to the corresponding digital copies.

The main aim of the project was to make the full texts of the newspaper articles searchable. This has been done with the help of Solr and we offer a full-text search in the collection via the ONB Labs. In fact, the chosen implementation allows a combined search in all metadata (title, author, journal, date, place, language, keywords) as well as the full texts, including Solr’s usual similarity search.
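
A combined search over metadata and full text could, for example, be expressed as a single Solr request. The sketch below only builds the request URL; it assumes Solr's Extended DisMax query parser, and the core name, host and field names are hypothetical, since the actual index schema is not described here.

```python
from urllib.parse import urlencode


def build_search_url(base_url, query, language=None, rows=10):
    """Build a Solr select URL that searches several fields at once.

    The field names in "qf" and the "language" filter field are assumptions
    made for illustration, not the schema used by the ONB Labs.
    """
    params = [
        ("q", query),
        ("defType", "edismax"),  # Extended DisMax query parser
        ("qf", "title author journal keywords fulltext"),
        ("rows", str(rows)),
    ]
    if language:
        # Filter query: restrict results without affecting the ranking.
        params.append(("fq", f"language:{language}"))
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder host and core name.
    print(build_search_url("http://localhost:8983/solr/hachette",
                           "kongreso", language="eo"))
```

The resulting URL can be sent with any HTTP client; the response format and ranking depend on the configuration of the Solr instance.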

We offer the data generated in the project as a bundled dataset here. In addition to the project results as a download, further information on the scope and content of the dataset can be accessed there. There is also rights information, a citation suggestion and the possibility to browse the dataset.

In the spirit of interoperability and reusability, which was desired in the call for proposals, we have created a IIIF collection that contains all IIIF manifests generated in the project and thus allows access to all images, metadata and full texts. This collection can be accessed here.

Links, funding information and community

CLARIAH-AT

This project was supported by CLARIAH-AT. The website of the project at the funding body can be found here. A corresponding article for this project including more technical details appeared on the website of ONB Labs (see here).

The source code, as well as the training dataset and the generated model for layout analysis, can be found in the project’s GitLab repository. We are happy to answer questions about the project and look forward to connecting with the Esperanto community. Send us an e-mail at labs@onb.ac.at.
