Demystifying the Medallion Architecture for Geospatial Data Processing

Introduction

Geospatial data volumes and complexity are growing due to diverse sources such as GPS, satellite imagery, and sensor data. Traditional geospatial processing methods face challenges with scalability, heterogeneous formats, and data consistency. The medallion architecture offers a layered approach to data management that improves data processing, reliability, and scalability. While the medallion architecture is often associated with specific implementations such as Delta Lake, its concepts are applicable to other technical implementations. This post introduces the medallion architecture and discusses two workflows—traditional GIS-based and advanced cloud-native—to demonstrate how it can be applied to geospatial data processing.

Overview of the Medallion Architecture

The medallion architecture was developed to address the need for incremental, layered data processing, especially in big data and analytics environments. It is composed of three layers:

  • Bronze Layer: Stores raw data as-is from various sources.
  • Silver Layer: Cleans and transforms data for consistency and enrichment.
  • Gold Layer: Contains aggregated and optimized data ready for analysis and visualization.

The architecture is particularly useful in geospatial applications due to its ability to handle large datasets, maintain data lineage, and support both batch and real-time data processing. This structured approach ensures that data quality improves progressively, making downstream consumption more reliable and efficient.

Why Geospatial Data Architects Should Consider the Medallion Architecture

Geospatial data processing involves unique challenges, such as handling different formats (raster, vector), managing spatial operations (joins, buffers), and accommodating varying data sizes. Traditional methods struggle when scaling to large, real-time datasets or integrating data from multiple sources. The medallion architecture addresses these challenges through its layered approach. The bronze layer preserves the integrity of raw data, allowing transformations to be traced easily. The silver layer handles transformations of the data, such as projections, spatial joins, and data enrichment. The gold layer provides ready-to-consume, performance-optimized data for downstream systems.
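
To make the layering concrete, the three layers often map to nothing more than prefixes in object storage. The bucket and object names below are purely illustrative:

    s3://geo-lake/bronze/gps_logs/2024-05-01.csv      (raw, exactly as delivered)
    s3://geo-lake/bronze/imagery/scene_123.tif
    s3://geo-lake/silver/parcels_cleaned.parquet      (reprojected, validated, enriched)
    s3://geo-lake/silver/imagery/scene_123_cog.tif    (converted to a Cloud-Optimized GeoTIFF)
    s3://geo-lake/gold/region_summaries.parquet       (aggregated, analysis-ready)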

Example Workflow 1: Traditional GIS-Based Workflow  

For organizations that rely on established GIS tools or operate with limited cloud infrastructure, the medallion architecture provides a structured approach to data management while maintaining compatibility with traditional workflows. This method ensures efficient handling of both vector and raster data, leveraging familiar GIS technologies while optimizing data accessibility and performance.  

This workflow integrates key technologies to support data ingestion, processing, and visualization. FME serves as the primary ETL tool, streamlining data movement and transformation. Object storage solutions like AWS S3 or Azure Blob Storage store raw spatial data, ensuring scalable and cost-effective management. PostGIS enables spatial analysis and processing for vector datasets. Cloud-Optimized GeoTIFFs (COGs) facilitate efficient access to large raster datasets by allowing partial file reads, reducing storage and processing overhead. 

Bronze – Raw Data Ingestion 

The process begins with the ingestion of raw spatial data into object storage. Vector datasets, such as Shapefiles and CSVs containing spatial attributes, are uploaded alongside raster datasets like GeoTIFFs. FME plays a crucial role in automating this ingestion, ensuring that all incoming data is systematically organized and accessible for further processing.  

Silver – Data Cleaning and Processing

At this stage, vector data is loaded into PostGIS, where essential transformations take place. Operations such as spatial joins, coordinate system projections, and attribute filtering help refine the dataset for analytical use. Meanwhile, raster data undergoes optimization through conversion into COGs using FME. This transformation enhances performance by enabling GIS applications to read only the necessary portions of large imagery files, improving efficiency in spatial analysis and visualization.  
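
For teams that want to script this step rather than run it through FME, the same silver-layer transformations can be approximated in Python. This is a minimal sketch, not the workflow's prescribed tooling: the file paths, table name, and connection string are hypothetical, GeoPandas needs the optional GeoAlchemy2 dependency for to_postgis(), and the COG output format requires GDAL 3.1 or newer.

    import geopandas as gpd
    from sqlalchemy import create_engine
    from osgeo import gdal

    # Silver (vector): reproject a raw shapefile and load it into PostGIS.
    engine = create_engine("postgresql://user:password@localhost:5432/gis")
    parcels = gpd.read_file("bronze/parcels.shp").to_crs(epsg=3857)
    parcels.to_postgis("silver_parcels", engine, if_exists="replace")

    # Silver (raster): convert a raw GeoTIFF into a Cloud-Optimized GeoTIFF.
    gdal.Translate(
        "silver/imagery_cog.tif",
        "bronze/imagery.tif",
        format="COG",
        creationOptions=["COMPRESS=DEFLATE", "BLOCKSIZE=512"],
    )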

Gold – Optimized Data for Analysis and Visualization  

Once processed, the refined vector data in PostGIS and optimized raster datasets in COG format are made available for GIS tools. Analysts and decision-makers can interact with the data using platforms such as QGIS, Tableau, or GeoServer. These tools provide the necessary visualization and analytical capabilities, allowing users to generate maps, conduct spatial analyses, and derive actionable insights.

This traditional GIS-based implementation of medallion architecture offers several advantages. It leverages established GIS tools and workflows, minimizing the need for extensive retraining or infrastructure changes. It is optimized for traditional environments yet still provides the flexibility to integrate with hybrid or cloud-based analytics platforms. Additionally, it enhances data accessibility and performance, ensuring that spatial datasets remain efficient and manageable for analysis and visualization.  

By adopting this workflow, organizations can modernize their spatial data management practices while maintaining compatibility with familiar GIS tools, resulting in a seamless transition toward more structured and optimized data handling. 

Example Workflow 2: Advanced Cloud-Native Workflow  

For organizations managing large-scale spatial datasets and requiring high-performance processing in cloud environments, a cloud-native approach to medallion architecture provides scalability, efficiency, and advanced analytics capabilities. By leveraging distributed computing and modern storage solutions, this workflow enables seamless processing of vector and raster data while maintaining cost efficiency and performance.  

This workflow is powered by cutting-edge cloud-native technologies that optimize storage, processing, and version control. 

Object Storage solutions such as AWS S3, Google Cloud Storage, or Azure Blob Storage serve as the foundation for storing raw geospatial data, ensuring scalable and cost-effective data management. Apache Spark with Apache Sedona enables large-scale spatial data processing, leveraging distributed computing to handle complex spatial joins, transformations, and aggregations. Delta Lake provides structured data management, supporting versioning and ACID transactions to ensure data integrity throughout processing. RasterFrames or Rasterio facilitate raster data transformations, including operations like mosaicking, resampling, and reprojection, while optimizing data storage and retrieval.  

Bronze – Raw Data Ingestion

The workflow begins by ingesting raw spatial data into object storage. This includes vector data such as GPS logs in CSV format and raster data like satellite imagery stored as GeoTIFFs. By leveraging cloud-based storage solutions, organizations can manage and access massive datasets without traditional on-premises limitations.  
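
As a minimal sketch of the bronze step (the bucket name, prefixes, and file names are hypothetical), raw files can be landed with boto3 exactly as received; managed transfer services or streaming ingestion would serve equally well.

    import boto3

    s3 = boto3.client("s3")
    # Land raw files under the bronze prefix with no transformation applied.
    s3.upload_file("gps_logs_2024-05-01.csv", "geo-lake", "bronze/gps_logs/2024-05-01.csv")
    s3.upload_file("scene_123.tif", "geo-lake", "bronze/imagery/scene_123.tif")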

Silver – Data Processing and Transformation

At this stage, vector data undergoes large-scale processing using Spark with Sedona. Distributed spatial operations such as filtering, joins, and projections enable efficient refinement of large datasets. Meanwhile, raster data is transformed using RasterFrames or Rasterio, which facilitate operations like mosaicking, resampling, and metadata extraction. These tools ensure that raster datasets are optimized for both analytical workloads and visualization purposes.  
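
A minimal Spark-plus-Sedona sketch of the silver step might look like the following. It assumes Sedona's Spark packages are available on the cluster; the paths, view names, and column names (lon, lat, region_id, and so on) are hypothetical.

    from sedona.spark import SedonaContext

    config = SedonaContext.builder().appName("silver-processing").getOrCreate()
    sedona = SedonaContext.create(config)

    # Bronze inputs: raw GPS logs (CSV) and a reference polygon layer (GeoParquet).
    sedona.read.option("header", "true").csv("s3a://geo-lake/bronze/gps_logs/") \
        .createOrReplaceTempView("gps")
    sedona.read.format("geoparquet").load("s3a://geo-lake/silver/regions/") \
        .createOrReplaceTempView("regions")

    # Distributed point-in-polygon join: tag each GPS point with its containing region.
    joined = sedona.sql("""
        SELECT r.region_id, g.device_id, g.recorded_at
        FROM gps g
        JOIN regions r
          ON ST_Contains(r.geometry, ST_Point(CAST(g.lon AS DOUBLE), CAST(g.lat AS DOUBLE)))
    """)
    joined.write.mode("overwrite").parquet("s3a://geo-lake/silver/gps_by_region/")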

Gold – Optimized Data for Analysis and Visualization

Once processed, vector data is stored in Delta Lake, where it benefits from structured storage, versioning, and enhanced querying capabilities. This ensures that analysts can access well-maintained datasets with full historical tracking. Optimized raster data is converted into Cloud-Optimized GeoTIFFs, allowing efficient cloud-based visualization and integration with GIS tools. These refined datasets can then be used in cloud analytics environments or GIS platforms for advanced spatial analysis and decision-making.  
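
Continuing the sketch above, the gold step could aggregate the silver output and persist it as a versioned Delta table. This assumes the Delta Lake connector is configured on the cluster; the paths and column names remain hypothetical.

    from pyspark.sql import functions as F

    # Gold: aggregate the silver output into an analysis-ready table.
    silver = sedona.read.parquet("s3a://geo-lake/silver/gps_by_region/")
    gold = silver.groupBy("region_id").agg(F.count("*").alias("point_count"))

    # Delta Lake adds ACID guarantees and versioned history over this table.
    gold.write.format("delta").mode("overwrite").save("s3a://geo-lake/gold/region_point_counts/")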

This cloud-native implementation of the medallion architecture provides several advantages for large-scale spatial data workflows:

  • High scalability – efficient processing of vast datasets without the constraints of traditional infrastructure.
  • Parallelized data transformations – distributed computing frameworks significantly reduce processing time.
  • Cloud-native optimizations – seamless integration with advanced analytics platforms, storage solutions, and visualization tools.

By adopting this approach, organizations can harness the power of cloud computing to manage, analyze, and visualize geospatial data at an unprecedented scale, improving both efficiency and insight generation.  

Comparing the Two Workflows

Aspect | Traditional Workflow (FME + PostGIS) | Advanced Workflow (Spark + Delta Lake)
Scalability | Suitable for small to medium workloads | Ideal for large-scale datasets
Technologies | FME, PostGIS, COGs, file system or object storage | Spark, Sedona, Delta Lake, RasterFrames, object storage
Processing Method | Sequential or batch processing | Parallel and distributed processing
Performance | Limited by local infrastructure or on-premises servers | Optimized for cloud-native and distributed environments
Use Cases | Small teams, traditional GIS setups, hybrid cloud setups | Large organizations, big data environments

Key Takeaways

The medallion architecture offers much-needed flexibility and scalability for geospatial data processing. It meshes well with traditional workflows using FME and PostGIS, an approach that is effective for organizations with established GIS infrastructure. It can also underpin cloud-native workflows using Apache Spark and Delta Lake to provide scalability for large-scale processing. Either workflow can be adapted to the organization’s technological maturity and requirements.

Conclusion

Medallion architecture provides a structured, scalable approach to geospatial data management, ensuring better data quality and streamlined processing. Whether using a traditional GIS-based workflow or an advanced cloud-native approach, this framework helps organizations refine raw spatial data into high-value insights. By assessing their infrastructure and data needs, teams can adopt the workflow that best aligns with their goals, optimizing efficiency and unlocking the full potential of their geospatial data.

Three Ways to Use GeoPandas in Your ArcGIS Workflow

Introduction

When combining open-source GIS tools with the ArcGIS ecosystem, there are a handful of challenges one can encounter. The compatibility of data formats, issues with interoperability, tool chain fragmentation, and performance at scale come to mind quickly. However, the use of the open-source Python library GeoPandas can be an effective way of working around these problems. When working with GeoPandas, there’s a simple series of steps to follow – you start with the data in ArcGIS, process it with the GeoPandas library, and import it back into ArcGIS.

It is worth noting that ArcPy and GeoPandas are not mutually exclusive. Because of its tight coupling with ArcGIS, it may be advantageous to use ArcPy in parts of your workflow and pass your data off to GeoPandas for other parts. This post covers three specific ways GeoPandas can enhance ArcGIS workflows and why it can be better than ArcPy in some cases.

Scenario 1: Spatial Joins Between Large Datasets

Spatial joins in ArcPy can be computationally expensive and time-consuming, especially for large datasets, as they process row by row and write to disk. GeoPandas’ gpd.sjoin() provides a more efficient in-memory alternative for point-to-polygon and polygon-to-polygon joins, leveraging Shapely’s spatial operations. While GeoPandas can be significantly faster for moderately large datasets that fit in memory, ArcPy’s disk-based approach may handle extremely large datasets more efficiently. GeoPandas also simplifies attribute-based filtering and aggregation, making it easier to summarize data—such as joining customer locations to sales regions and calculating total sales per region. Results can be exported to ArcGIS-compatible formats, though conversion is required. For best performance, enabling spatial indexing (gdf.sindex) in GeoPandas is recommended.
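
A minimal sketch of that customer-to-region example is shown below; the file and column names (sales, region_name) are hypothetical, and sjoin() uses the spatial index under the hood.

    import geopandas as gpd

    # Hypothetical inputs exported from ArcGIS: customer points and sales-region polygons.
    customers = gpd.read_file("customers.shp")
    regions = gpd.read_file("sales_regions.shp")

    # Both layers must share a CRS before joining.
    customers = customers.to_crs(regions.crs)

    # In-memory point-in-polygon join, then total sales per region.
    joined = gpd.sjoin(customers, regions, how="inner", predicate="within")
    sales_by_region = joined.groupby("region_name")["sales"].sum().reset_index()

    # Export the joined features to an ArcGIS-compatible format.
    joined.to_file("customers_with_regions.shp")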

(Image credit: Bplewe, CC BY-SA 4.0, https://creativecommons.org/licenses/by-sa/4.0, via Wikimedia Commons)

Scenario 2: Geometric Operations (Buffering, Clipping, and Dissolving Features)

Buffering and dissolving in ArcPy can be memory-intensive and time-consuming, particularly for large or complex geometries. Using GeoPandas functions like buffer(), clip(), and dissolve() to preprocess geometries before importing them back into ArcGIS is an effective solution to that problem. These functions can make a multitude of processes more efficient: they can create buffer zones around road networks, dissolve any overlapping zones, and export the results for use as a new feature class in ArcGIS-based impact analysis.

These functions can be cleaner and more efficient for geometry processing than their ArcPy equivalents and require fewer steps to carry out. They also integrate well with data science workflows thanks to their pandas-like syntax.
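
A minimal buffer-and-dissolve sketch of the road-network example follows; the file names, buffer distance, and the UTM CRS (EPSG:26917) are illustrative assumptions.

    import geopandas as gpd

    # Hypothetical inputs: a road network and a study-area boundary from ArcGIS.
    roads = gpd.read_file("roads.shp").to_crs(epsg=26917)        # projected CRS so buffer units are meters
    study_area = gpd.read_file("study_area.shp").to_crs(epsg=26917)

    # Buffer each road segment by 100 meters.
    buffers = roads.copy()
    buffers["geometry"] = roads.buffer(100)

    # Clip the buffers to the study area, then dissolve overlapping zones into one layer.
    clipped = gpd.clip(buffers, study_area)
    impact_zones = clipped.dissolve()

    # Export for ArcGIS-based impact analysis.
    impact_zones.to_file("road_impact_zones.shp")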

Below is a detailed side-by-side comparison of GeoPandas and ArcPy for spatial analysis operations, specifically focusing on buffering and dissolving tasks.

Aspect | GeoPandas 🐍 | ArcPy 🌎
Processing Speed | Faster for medium-sized datasets due to vectorized NumPy/Shapely operations; slows down with very large datasets. | Slower for smaller datasets but optimized for large-scale GIS processing due to disk-based operations.
Memory Usage | Fully in-memory; efficient for moderately large data but can struggle with very large datasets. | Uses ArcGIS’s optimized storage and caching mechanisms, which help handle large datasets without running out of RAM.
Ease of Use | Requires fewer lines of code; syntax is cleaner for many operations. | More verbose; requires handling geoprocessing environments and ArcPy-specific data structures.
Buffering Capabilities | Uses GeoSeries.buffer(distance); efficient but requires a projected CRS. | arcpy.Buffer_analysis() supports geodesic buffers and handles larger datasets more reliably.
Dissolve Functionality | GeoDataFrame.dissolve(by="column"); vectorized and fast for reasonably large data. | arcpy.Dissolve_management(); slower for small datasets but scales better for massive datasets.
Coordinate System Handling | Requires explicit CRS conversion for accurate distance-based operations. | Natively supports geodesic buffering (without requiring projection changes).
Data Formats | Works with GeoDataFrames; exports to GeoJSON, Shapefile, Parquet, etc. | Works with File Geodatabases (.gdb), Shapefiles, and enterprise GIS databases.
Integration with ArcGIS | Requires conversion (e.g., gdf.to_file("data.shp")) before using results in ArcGIS. | Seamless integration with ArcGIS software and services.
Parallel Processing Support | Limited parallelism (can use Dask or multiprocessing as workarounds). | Can leverage ArcGIS Pro’s built-in multiprocessing tools.
License Requirements | Open-source, free to use. | Requires an ArcGIS license.

Scenario 3: Bulk Updates and Data Cleaning

When performing bulk updates (e.g., modifying attribute values, recalculating fields, or updating geometries), ArcPy and GeoPandas have different approaches and performance characteristics. ArcPy uses a cursor-based approach, applying updates row by row. GeoPandas uses an in-memory GeoDataFrame and vectorized operations via the underlying Pandas library. This can make GeoPandas orders of magnitude faster than ArcPy for bulk updates, but it can be memory intensive. Modern systems generally have ample memory, so this is rarely a concern, but if you are working in a memory-constrained environment, ArcPy may suit your needs better.
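
A minimal sketch of a vectorized bulk update follows; the parcels layer, its column names, and the thresholds are hypothetical.

    import geopandas as gpd

    # Hypothetical parcels layer needing attribute cleanup and recalculated areas.
    parcels = gpd.read_file("parcels.shp").to_crs(epsg=26917)

    # Vectorized, column-wide updates; no row-by-row cursor required.
    parcels["status"] = parcels["status"].str.upper()
    parcels["area_sqm"] = parcels.geometry.area
    parcels.loc[parcels["area_sqm"] < 50, "status"] = "REVIEW"

    # Write the cleaned layer back out for use in ArcGIS.
    parcels.to_file("parcels_cleaned.shp")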

Here is a side-by-side comparison:

Feature | GeoPandas 🐍 | ArcPy 🌎
Processing Model | Uses an in-memory GeoDataFrame for updates (vectorized with Pandas). | Uses a cursor-based approach (UpdateCursor), modifying records row by row.
Speed | Faster for large batch updates (leverages NumPy and vectorized operations). | Slower for large datasets due to row-by-row processing but scales well with large file geodatabases.
Memory Usage | Higher, since it loads the entire dataset into memory. | Lower, as it processes one row at a time and writes directly to disk.
Ease of Use | Simpler, using Pandas-like syntax. | More complex, requiring explicit cursor handling.
Parallel Processing | Can use multiprocessing/Dask to improve performance. | Limited, but ArcGIS Pro supports some multiprocessing tools.
Spatial Database Support | Works well with PostGIS, SpatiaLite, and other open formats. | Optimized for Esri File Geodatabases (.gdb) and enterprise databases.
File Format Compatibility | Reads/writes GeoJSON, Shapefiles, Parquet, etc. | Reads/writes File Geodatabases, Shapefiles, and enterprise databases.

When to Use ArcPy Instead

There are still times when ArcPy is the better solution. Tasks such as network analysis, topology validation, or anything requiring deeper integration with ArcGIS Enterprise are better handled in ArcPy than in GeoPandas. In the case of network analysis, ArcPy integrates with ArcGIS’s native Network Analyst extension. Out of the box, it supports finding the shortest path between locations, calculating service areas, origin-destination cost analysis, vehicle routing problems, and closest facility analysis. It also works natively with advanced network dataset features such as turn restrictions, traffic conditions, one-way streets, and elevation-based restrictions.

Conclusion

GeoPandas offers greater efficiency, speed, flexibility, and simplicity when incorporating open-source tools into ArcGIS workflows, especially for custom analysis and preprocessing. If you haven’t tried GeoPandas before, it is more than worth your time to experiment with.

Have you had your own positive or negative experiences using GeoPandas with ArcGIS? Feel free to leave them in the comments, or give us a suggestion of other workflows you would like to see a blog post about! 

Applying Porter’s Five Forces to Open-Source Geospatial

Introduction

The geospatial industry has seen significant transformation with the rise of open-source solutions. Tools like QGIS, PostGIS, OpenLayers, and GDAL have provided alternatives to proprietary GIS software, offering cost-effective, customizable, and community-driven mapping and spatial analysis capabilities. While open-source GIS thrives on collaboration and accessibility, it still operates within a competitive landscape influenced by external pressures.

Applying Porter’s Five Forces, a framework for competitive analysis developed by Michael E. Porter in 1979, allows us to analyze the industry dynamics and understand the challenges and opportunities open-source GIS solutions face. The five forces include the threat of new entrants, bargaining power of suppliers, industry rivalry, bargaining power of buyers, and the threat of substitutes. We will explore how these forces shape the world of open-source geospatial technology.

Porter’s Five Forces was conceived to analyze traditional market-driven dynamics. While open-source software development is not necessarily driven by a profit motive, successful open-source projects require thriving, supportive communities. Such communities still require resources – either money or, even scarcer and more important, time. As a result, a certain amount of market thinking can be useful when considering adopting open-source software into your operations or starting a new project.

Porter articulated the five forces in terms of “threats” and “power” and “rivalry.” We have chosen to retain that language here for alignment with the model but, in the open-source world, many of these threats can represent opportunities for greater collaboration.

1. Threat of New Entrants: Low to Moderate

The barriers to entry in open-source geospatial solutions are low for basic tool development compared to proprietary software development. Developers can utilize existing open-source libraries, open geospatial data, and community-driven documentation to build new tools with minimal investment.

However, gaining significant adoption or community traction presents higher barriers than described in traditional new entrant scenarios. Well-established open-source solutions like QGIS, PostGIS, and OpenLayers have strong community backing and extensive documentation, making it challenging for new entrants to attract users.

New players may find success by focusing on novel or emerging use cases like AI-powered GIS, cloud-based mapping solutions, or real-time spatial analytics. Companies that provide specialized integrations or enhancements to existing open-source GIS tools may also gain traction. DuckDB, with its edge deployability, is a good example of this.

While new tools are relatively easy to develop, achieving broad community engagement often requires differentiation, sustained innovation, and compatibility with established standards and ecosystems.

2. Bargaining Power of Suppliers: Low to Moderate

Unlike proprietary GIS, where vendors control software access, open-source GIS minimizes supplier dependence due to its open standards and community-driven development. The availability of open geospatial datasets (e.g., OpenStreetMap, NASA Earthdata, USGS) further reduces the influence of traditional suppliers.

Moderate supplier power can arise in scenarios where users depend heavily on specific service providers for enterprise-level support, long-term maintenance, or proprietary enhancements (e.g., enterprise hosting or AI-powered extensions). Companies offering such services, like Red Hat’s model for Linux, could gain localized influence over organizations that require continuous, tailored support.

However, competition among service providers ensures that no single vendor holds significant leverage. This can work to the benefit of users, who often require lifecycle support. Localized supplier influence can grow in enterprise settings where long-term support contracts are critical, making it a consideration in high-complexity deployments.

3. Industry Rivalry: Moderate to High

While open-source GIS tools are developed with a collaborative ethos, competition still exists, particularly in terms of user adoption, funding, and enterprise contracts. Users typically don’t choose multiple solutions in a single category, so a level of de facto competition is implied even though open-source projects don’t explicitly and directly compete with each other in the same manner as proprietary software.

  • Open-source projects compete for users: QGIS, GRASS GIS, and gvSIG compete in desktop GIS; OpenLayers, Leaflet, and MapLibre compete in web mapping.
  • Enterprise support: Companies providing commercial support for open-source GIS tools compete for government and business contracts.
  • Competition from proprietary GIS: Esri, Google Maps, and Hexagon offer integrated GIS solutions with robust support, putting pressure on open-source tools to keep innovating.

However, open-source collaboration reduces direct rivalry. Many projects integrate with one another (e.g., PostGIS works alongside QGIS), creating a cooperative rather than competitive environment. While open-source GIS projects indirectly compete for users and funding, collaboration mitigates this and creates shared value. 

Emerging competition from cloud-native platforms and real-time analytics tools, such as SaaS GIS and geospatial AI services, increases rivalry. As geospatial technology evolves, integrating AI and cloud functionalities may determine long-term competitiveness.

When looking to adopt open-source, consider that loose coupling through the use of open standards can add greater value. When considering starting a new open-source project, have integration and standardization in mind to potentially increase adoption.

4. Bargaining Power of Buyers: Moderate

In the case of open-source, “bargaining” refers to the ability of the user to switch between projects, rather than a form of direct negotiation. The bargaining power of buyers in the open-source GIS space is significant, primarily due to the lack of upfront capital expenditure. This financial flexibility enables users to explore and switch between tools without major cost concerns. While both organizational and individual users have numerous alternatives across different categories, this flexibility does not necessarily translate to strong influence over the software’s development.

Key factors influencing buyer power:

  • Minimal financial lock-in: In the early stages of adoption, users can easily migrate between open-source tools. However, as organizations invest more time in customization, workflow integration, and user training, switching costs increase, gradually reducing their flexibility.
  • Community-driven and self-support options: Buyers can access free support through online forums, GitHub repositories, and community-driven resources, lowering their dependence on paid services.
  • Customizability and adaptability: Open-source GIS software allows organizations to tailor the tools to their specific needs without vendor constraints. However, creating a custom version (or “fork”) requires caution, as it could result in a bespoke solution that the organization must maintain independently.

To maximize their influence, new users should familiarize themselves with the project’s community and actively participate by submitting bug reports, fixes, or documentation. Consistent contributions aligned with community practices can gradually enhance a user’s role and influence over time.

For large enterprises and government agencies, long-term support requirements – especially for mission-critical applications – can reduce their flexibility and bargaining power over time. This dependency highlights the importance of enterprise-level agreements in managing risk.

5. Threat of Substitutes: Moderate to High

Substitutes for open-source GIS tools refer to alternatives that provide similar functionality. These substitutes include:

  • Proprietary GIS software: Tools like ArcGIS, Google Maps, and Hexagon are preferred by many organizations due to their perceived stability, advanced features, and enterprise-level support.
  • Cloud-based and SaaS GIS platforms: Services such as Felt, MapIdea, Atlas, Mapbox, and CARTO offer user-friendly, web-based mapping solutions with minimal infrastructure requirements.
  • Business Intelligence (BI) and AI-driven analytics: Platforms like Tableau, Power BI, and AI-driven geospatial tools can partially or fully replace traditional GIS in certain applications.
  • Other open-source GIS tools: Users can switch between alternatives like QGIS, GRASS, OpenLayers, or MapServer with minimal switching costs.

However, open-source GIS tools often complement rather than fully replace proprietary systems. For instance, libraries like GDAL and GeoPandas are frequently used alongside proprietary solutions like ArcGIS. Additionally, many SaaS platforms incorporate open-source components, offering organizations a hybrid approach that minimizes infrastructure investment while leveraging open-source capabilities.

The emergence of AI-driven spatial analysis and real-time location intelligence platforms is increasingly positioning them as partial substitutes to traditional GIS, intensifying this threat. As these technologies mature, hybrid models integrating both open-source and proprietary elements will become more common.

Conclusion

Porter’s Five Forces analysis reveals that open-source geospatial solutions exist in a highly competitive and evolving landscape. While they benefit from free access, strong community support, and low supplier dependence, they also face competition from proprietary GIS, SaaS-based alternatives, and substitutes like AI-driven geospatial analytics.

To remain competitive, open-source GIS projects must not only innovate in cloud integration and AI-enhanced spatial analysis but also respond to the shifting landscape of real-time analytics and SaaS-based delivery models. Strengthening enterprise support, improving user-friendliness, and maintaining strong community engagement will be key to their long-term sustainability.

As geospatial technology advances, open-source GIS will continue to play a crucial role in democratizing access to spatial data and analytics, offering an alternative to fully proprietary systems while fostering collaboration and technological growth.

To learn more about how Cercana can help you develop your open-source geospatial strategy, contact us here.