Esprit: Resolving Location Hierarchies in Planning Documents

  • Project name: Geographic data extraction from Slovak documents
  • Project period: April 2024 – July 2024
  • Partner: Esprit

The main task was to prototype an algorithm that extracts geographic information from spatial planning documents. Working on text extracted from PDF files, it targeted specific administrative and cadastral entities: regions, districts, municipalities, cadastral territories, and parcels. The challenge went beyond simple text extraction, because geographic references are often ambiguous. Several municipalities in Slovakia share the same name, so each one had to be assigned to its correct district or region, and the full geographic hierarchy had to be determined even when parts of it were not stated in the document. For parcels, two different marking systems are currently in use in Slovakia, so the algorithm had to distinguish between them and assign each parcel to the correct marking type.

How we approached it

The prototype was designed to handle the specific complexities of geographic data extraction from spatial planning documents. The solution focused on several key areas:

  • Geographic hierarchy resolution: The algorithm was built to determine the full administrative hierarchy of a location (region, district, municipality), even when some levels were missing from the source document.
  • Municipality disambiguation: Given that multiple municipalities in Slovakia share identical names, the algorithm resolves ambiguity by assigning each municipality to the correct administrative district or region.
  • Parcel marking distinction: The two parcel marking systems currently used in Slovakia were accounted for, with the algorithm identifying the marking type and assigning it accordingly.
  • Handling text errors: Text similarity algorithms were explored to allow geographic data to be extracted correctly even when the text contains typos, missing diacritics, or minor grammatical errors.
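
To make the points above more concrete, here is a minimal Python sketch of how tolerant name matching and hierarchy resolution could fit together. It is not the prototype's actual code. The gazetteer is a tiny illustrative sample, the function names (normalize, best_match, resolve) and the 0.8 cutoff are assumptions made for this example, and the standard library's difflib stands in for whichever similarity measure the prototype uses.

```python
import difflib
import unicodedata

# Tiny illustrative sample: (municipality, district, region).
# A real gazetteer would cover every municipality in Slovakia.
GAZETTEER = [
    ("Lúčka", "Levoča", "Prešovský kraj"),
    ("Lúčka", "Rožňava", "Košický kraj"),
    ("Levoča", "Levoča", "Prešovský kraj"),
    ("Rožňava", "Rožňava", "Košický kraj"),
]


def normalize(text):
    """Lowercase and strip diacritics, so 'Roznava' compares equal to 'Rožňava'."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def best_match(name, candidates, cutoff=0.8):
    """Return the candidate most similar to name, or None if nothing reaches the cutoff."""
    by_norm = {normalize(c): c for c in candidates}
    hits = difflib.get_close_matches(normalize(name), list(by_norm), n=1, cutoff=cutoff)
    return by_norm[hits[0]] if hits else None


def resolve(municipality, district_hint=None):
    """Return every (municipality, district, region) row consistent with the input.

    One row means the hierarchy is resolved; several rows mean the name is
    still ambiguous; an empty list means nothing matched.
    """
    name = best_match(municipality, {m for m, _, _ in GAZETTEER})
    if name is None:
        return []
    rows = [row for row in GAZETTEER if row[0] == name]
    if district_hint is not None:
        # A district mentioned near the municipality narrows the candidates.
        district = best_match(district_hint, {d for _, d, _ in rows})
        rows = [row for row in rows if row[1] == district]
    return rows


if __name__ == "__main__":
    print(resolve("Lucka"))             # ambiguous: two candidate rows
    print(resolve("Lucka", "Roznava"))  # resolved: one row, region filled in
```

Returning all consistent rows, rather than picking one, keeps unresolved ambiguity visible so that a later step (for example, a district or region mentioned elsewhere in the document) can narrow it down.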

“Through our collaboration on this project, we have demonstrated that structured geographical data embedded within unstructured documents can be successfully extracted. The resulting tool not only identifies localities and parcels, but also enables this information to be linked with other geographical data, such as data on environmental hazards in the given area. The aim was to overcome the barriers in document processing and to automate a process that had until now been dependent on manual work.”

MIROSLAV BLŠTÁK
AI Specialist, NLP team

What we delivered

The collaboration resulted in a working prototype of the extraction algorithm, accompanied by training on how the algorithm works and how it can be extended. We also held an educational seminar that described, explained, and compared various algorithms for calculating the similarity of text strings, which the extraction prototype relies on to handle imperfect input text.
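
As an illustration of the kind of comparison such a seminar might cover, the sketch below places two common string similarity measures side by side: a normalized Levenshtein (edit distance) similarity and the ratio from Python's difflib.SequenceMatcher. The choice of these two measures and the helper names are assumptions made for this example; the algorithms actually presented are not listed here.

```python
from difflib import SequenceMatcher


def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Scale edit distance to the range 0..1, where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def ratio_similarity(a, b):
    """Similarity based on matching blocks, as computed by difflib."""
    return SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    pairs = [("Rožňava", "Roznava"), ("Levoča", "Levoca"), ("Levoča", "Lúčka")]
    for left, right in pairs:
        print(left, right,
              round(levenshtein_similarity(left, right), 3),
              round(ratio_similarity(left, right), 3))
```

The two measures can rank the same pair differently, which is one reason a matching threshold has to be tuned for the chosen measure rather than reused across measures.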

In the partner’s own words

“We still encounter a large number of documents containing important data for land-use planning, nature and landscape protection, and crisis management, which are not available in a format suitable for direct machine processing. Thanks to this collaboration, we were able to better verify the possibilities, limits, and suitable algorithmic approaches to processing such data, which provides us with a valuable foundation for the further development of solutions supporting effective spatial decision-making. The cooperation was of high professional quality, enriching, and has moved us significantly forward in this field.”

VERONIKA SOLDÁNOVÁ
GIS Developer at Esprit