A Novice Takes a Stab at GIS – Part 3

At this point in my entry-level upskilling project, the groundwork has been done. I have a polygon of the Chesapeake Bay laid over an OpenStreetMap layer and I know how to change its color. Going back to the initial post, my hope with this project is to show change over time in the crab population of the Bay. As a complete novice, I don’t even know if there’s a way for me to do that in QGIS, or if I’m going to make 15 different maps with the 15 years of data and turn images of them into a .gif. So, I went back to ChatGPT for guidance.

It also told me I could use style changes by time attribute, the TimeManager Plugin, or the manual process I had considered doing with turning a series of images into a .gif. 

I’ll be using the Temporal Controller since it was the first option. I asked ChatGPT for a step-by-step guide of how to do this.

Before getting bogged down in the process of creating the visualization, it’s important to have my data prepped and ready to go. I asked ChatGPT how it needed to be set up in order to use the temporal controller function.

In this case, I’ve decided to not do the thing that ChatGPT says is easier. The “Join External Time Data to Polygon” option seems to involve more data preparation work and be a better process to know for future projects. I began by taking a screen capture of the data table from the Maryland DNR’s Winter Dredge Survey history, uploaded it into ChatGPT, and had it use its OCR capabilities to make a table that I could paste into Excel and save as a .csv.

Step 1. 

Step 2. 

Step 3. 

After looking back at some of the steps in the process of using the temporal controller (Step 2 above), the final product ended up looking like this. I went into the attribute table of the polygon and saw that it already had an assigned ID of “2250”, so I added that column. Additionally, the geometry type is a polygon so that was added, as well.
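For readers who would rather script this preparation than edit the spreadsheet by hand, here is a minimal sketch in Python. It assumes a table with one row per survey year. The column names (`year`, `crabs_millions`, `id`, `geom_type`) and the numbers are hypothetical placeholders, not the Maryland DNR figures; only the ID value “2250” and the polygon geometry type come from the steps described above.

```python
import csv
import io

# Hypothetical stand-in for the survey table. The real values come from the
# Maryland DNR Winter Dredge Survey and are not reproduced here.
raw_csv = """year,crabs_millions
2010,100
2011,120
2012,90
"""

POLYGON_ID = "2250"        # ID found in the polygon's attribute table
GEOMETRY_TYPE = "Polygon"  # geometry type of the Bay layer


def add_join_columns(csv_text, polygon_id, geometry_type):
    """Return the table rows with a join key and geometry type added to each."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row["id"] = polygon_id
        row["geom_type"] = geometry_type
        rows.append(row)
    return rows


def to_csv(rows):
    """Serialize the rows back to CSV text, ready to be saved as a .csv file."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


prepared = add_join_columns(raw_csv, POLYGON_ID, GEOMETRY_TYPE)
print(to_csv(prepared))
```

Every row ends up carrying the same ID as the polygon, which is what lets a table join match all of the yearly records to the single Bay feature.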

With that, data preparation was complete and now I’m ready to move on to joining the data table to the polygon and creating the visualization. 

A Novice Takes a Stab at GIS – Part Two

Last week, I was able to settle on what the map I was creating would illustrate and find trustworthy data to use. This week, the focus is on actually creating the map itself. To do this, I need shapefiles of the Chesapeake Bay Watershed. 

I was able to source one from the Chesapeake Bay Program at data-chesbay.opendata.arcgis.com. This took me a handful of tries, as most of the publicly available shapefiles of the Bay are a polygon of all the land and water considered to be within the Chesapeake Bay watershed. For the purposes of this map, I was looking for just the water itself.

As a reminder, this is a self-guided process where I’m using ChatGPT to guide me through learning how to use QGIS. I’ve never loaded a shapefile before and ChatGPT gave me clear instructions.

In order to load the shapefile into QGIS, I dragged the downloaded folder, which included .shp, .xml, .shx, .prj, .dbf, and .cpg files, into a blank new project. I felt a brief moment of triumph before realizing that getting the land surrounding the Bay into the project would likely not be as simple, but it was actually even easier. 

QGIS has an OpenStreetMap layer built into the “XYZ Tiles” tab on the left side of the window. I turned it on, reordered the layers so that my shapefile of the water was over top of OSM, and that was all that needed to be done. The program had already lined up the shapefile of the Bay itself perfectly with where OSM had the Bay. 

Now it’s time to go back to Professor ChatGPT. I need to know how to change the color of the shapefile before I can even worry about assigning different colors to different levels of crab population, finding out how to automatically change the color based on data in a table, or anything else. 

Just to practice, I made the Bay crimson. 

Step 1. 

Step 2.

Step 3.

In my next post, I’ll be going back to ChatGPT to learn how I can set up a table of data and instruct QGIS to change the color of the water based on the data in said table. I’m not sure how that will work or look yet, but that’s part of the learning. 

A Novice Takes a Stab at GIS

For the last handful of months, I’ve been working with my father’s company, Cercana Systems, to assist with content marketing and business development. In college, I finished most of a public relations degree at The University of Alabama before the first of my two daughters graced us with her presence and we decided to move back home to Maryland in order for her to actually have her extended family in her life. Since that time, I’ve found myself wanting more and more to find a way to help build things that provide a more tangible contribution to the world around me. Then, two weeks ago, I had the opportunity to attend FedGeoDay 2025 and decided it was time to start teaching myself some novice-level GIS skills. 

This year’s FedGeoDay theme of “open resilience” brought out a host of presenters that were using open geospatial ecosystems to do highly critical work in disaster prediction, preparation, and response. Everyone there was doing work that was well worth doing.

That brings us to the purpose of this blog post. It is the first in a series of undetermined length about me teaching myself how to use QGIS as a first foray into learning GIS skills. I am a complete novice with no coding or technical background of any kind. My hope is that documenting the process of attempting to make my first data-visualization in a commonly used tool may help others that want to delve into the field get past any apprehension they have about getting off the starting block. 

Step 1 – Pick a topic and find data

Initially, I had wanted to create a handful of heat maps that showed the change in the number of youth lacrosse programs in the Mid-Atlantic over the last 15 years. I turned to ChatGPT to source the data, but the engine was unable to find historical records for me to use.

“I want to build a heat map showing change over time in the concentration of youth lacrosse programs in the states of Maryland, Virginia, Delaware, New Jersey, Pennsylvania, North Carolina, West Virginia, and the District of Columbia. Could you scour the internet for data that I can use for this and put the data in a spreadsheet? The spread sheet should have three columns. The first column will indicate the year, the second will indicate the county, and the third should indicate the number of youth teams in that county. Let’s look at Spring teams only, not Fall, Winter, or Summer. Data should go back 15 years (2010-2025). Let’s consider “youth lacrosse” to be teams for children as old as 8th grade or 14 years of age. Some teams will use the “under X” age categorization, some will use the projected year of high school graduation, and some will just indicate a standard age range or grade level for the children it caters to, so be sure to look out for all of those. Can you do this?”

Compiling a comprehensive dataset detailing the number of youth lacrosse teams by county from 2010 to 2025 across Maryland, Virginia, Delaware, New Jersey, Pennsylvania, North Carolina, West Virginia, and the District of Columbia presents significant challenges. The primary hurdle is the lack of publicly available, granular historical data spanning this 15-year period. While some organizations provide current information, historical records at the county level are scarce.

### Available Data and Resources

* **Maryland Youth Lacrosse Association (MYLA)**: Oversees youth lacrosse in Maryland, offering current season details and age groupings. However, historical data by county is not readily accessible on their website. ([mylalax.com][1])

* **Western Maryland Youth Lacrosse Conference (WMYLC)**: Covers counties in Western Maryland and parts of neighboring states, with a focus on promoting youth lacrosse. Specific team counts by county and year are not provided online. 

* **Anne Arundel County Recreation and Parks**: Provides detailed information on current boys’ lacrosse leagues, including age divisions and game schedules. Historical data is not available on their platform. ([Anne Arundel County Government][2])

* **Inside Lacrosse Articles**: Discuss trends in lacrosse participation, noting shifts in player demographics and regional growth. These articles offer insights but lack specific county-level data over the desired time frame. ([Inside Lacrosse][3])

### Challenges in Data Compilation

* **Data Availability**: Most organizations do not maintain or publicly share historical team counts by county.

In the real world, sourcing and compiling accurate data can often be the bulk of the work done on a GIS project. However, the point of this pursuit is to learn how to use QGIS, not spend countless hours calling up the Parks and Rec department of every county in the Mid-Atlantic, so I decided to pivot to something else.

So now, I’m looking for historical data over the last 15 years on the blue crab population in various sections of the Chesapeake Bay estuary. My new goal will be to create one map that shows the places where the population has increased the most, increased the least, and even decreased since 2010. 

This information was readily available from Maryland’s Department of Natural Resources, with one caveat. 

There was plenty of data on blue crab population available, but I wasn’t finding any that was split up into certain regions of the bay. Nonetheless, creating the map and shading the entire Bay based on percent change in population density from the median of the data year-to-year is a good beginner project to learn anything about QGIS at all, so we’re rolling with it. 
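As a rough illustration of the arithmetic, here is a minimal Python sketch of one reading of that calculation: take the median across all years as the baseline, then express each year as a percent difference from it. The yearly values are hypothetical placeholders, not the DNR survey numbers, and the function name is invented for this example.

```python
from statistics import median

# Placeholder values by year, NOT the Maryland DNR figures.
population_by_year = {2010: 100.0, 2011: 120.0, 2012: 90.0, 2013: 110.0}


def percent_change_from_median(values_by_year):
    """Map each year to its percent difference from the median of all years."""
    baseline = median(values_by_year.values())
    return {
        year: (value - baseline) / baseline * 100.0
        for year, value in values_by_year.items()
    }


changes = percent_change_from_median(population_by_year)
for year, pct in sorted(changes.items()):
    print(f"{year}: {pct:+.1f}%")
```

Positive results are years above the median and negative results are years below it, which maps naturally onto a two-sided color ramp.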

Step 2 – Installing QGIS

While it may seem like a silly step to document, this is supposed to be a properly novice guide to making a map in QGIS, and it’s a touch difficult to do that without installing the program. The machine I’m using is a 2020 M1 MacBook Air running Sonoma 14.6.1. I downloaded the installer for the “long term” version of QGIS from qgis.org, went through the install process, and attempted to open it.

Naturally, my MacBook was less than thrilled that I was attempting to run a program that I hadn’t downloaded from the App Store. It was completely blocking me from running the software when I opened it from the main application navigation screen. This issue was resolved by going to the “Applications” folder in Finder and using the control+left click method. A warning popped up about not being able to verify that the application contained no malware, I ran it anyway, and I have not had any issues opening the application since then.

The next step will be to actually crack QGIS open and begin creating a map of the Chesapeake Bay. 

Geospatial Without Maps

When most people hear “geospatial,” they immediately think of maps. But in many advanced applications, maps never enter the picture at all. Instead, geospatial data becomes a powerful input to machine learning workflows, unlocking insights and automation in ways that don’t require a single visual.

At its core, geospatial data is structured around location—coordinates, areas, movements, or relationships in space. Machine learning models can harness this spatial logic to solve complex problems without ever generating a map. For example:

  • Predictive Maintenance: Utility companies use the GPS coordinates of assets (like transformers or pipelines) to predict failures based on environmental variables like elevation, soil type, or proximity to vegetation (AltexSoft, 2020). No map is needed—only spatially enriched feature sets for training the model.
  • Crop Classification and Yield Prediction: Satellite imagery is commonly processed into grids of numerical features (such as NDVI indices, surface temperature, soil moisture) associated with locations. Models use these purely as tabular inputs to predict crop types or estimate yields (Dash, 2023).
  • Urban Mobility Analysis: Ride-share companies model supply, demand, and surge pricing based on geographic patterns. Inputs like distance to transit hubs, density of trip starts, or average trip speeds by zone feed machine learning models that optimize logistics in real time (MIT Urban Mobility Lab, n.d.).
  • Smart Infrastructure Optimization: Photometrics AI employs geospatial AI to enhance urban lighting systems. By integrating spatial data and AI-driven analytics, it optimizes outdoor lighting to ensure appropriate illumination on streets, sidewalks, crosswalks, and bike lanes while minimizing light pollution in residential areas and natural habitats. This approach not only improves safety and energy efficiency but also supports environmental conservation efforts (EvariLABS, 2025).

These examples show how spatial logic—such as spatial joins, proximity analysis, and zonal statistics—can drive powerful workflows even when no visualization is involved. In each case, the emphasis shifts from presenting information to enabling analysis and automation. Features are engineered based on where things are, not just what they are. However, once the spatial context is baked into the dataset, the model itself treats location-derived features just like any other numerical or categorical variable.
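To make the idea of map-free spatial features concrete, here is a minimal Python sketch of proximity analysis. It assumes assets and hazard sites are given as latitude/longitude pairs and derives two tabular features per asset: the distance to the nearest hazard and a flag for whether any hazard is nearby. The asset and hazard records, the 10 km threshold, and the function names are all hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical assets and hazard sites: (id, latitude, longitude).
assets = [("T-1", 39.00, -76.50), ("T-2", 38.50, -76.40)]
hazards = [("H-1", 39.00, -76.50), ("H-2", 38.00, -76.00)]


def proximity_features(assets, hazards, near_km=10.0):
    """Build one feature row per asset from its distance to the nearest hazard."""
    rows = []
    for asset_id, lat, lon in assets:
        nearest = min(haversine_km(lat, lon, hlat, hlon) for _, hlat, hlon in hazards)
        rows.append({
            "asset_id": asset_id,
            "km_to_nearest_hazard": nearest,
            "hazard_within_near_km": nearest <= near_km,
        })
    return rows


rows = proximity_features(assets, hazards)
for row in rows:
    print(row)
```

The output is an ordinary table of numbers and booleans. A model trained on it never sees coordinates or a map, only the location-derived columns.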

Using geospatial technology without maps allows organizations to focus on operational efficiency, predictive insights, and automation without the overhead of visualization. In many workflows, the spatial relationships between objects are valuable as data features rather than elements needing human interpretation. By integrating geospatial intelligence directly into machine learning models and decision systems, businesses and governments can act on spatial context faster, at scale, and with greater precision.

To capture these relationships systematically, spatial models like the Dimensionally Extended nine-Intersection Model (DE-9IM) (Clementini & Felice, 1993) provide a critical foundation. In traditional relational databases, connections between records are typically simple—one-to-one, one-to-many, or many-to-many—and must be explicitly designed and maintained. DE-9IM extends this by defining nuanced geometric interactions, such as overlapping, touching, containment, or disjointness, which are implicit in the spatial nature of geographic objects. This significantly reduces the design and maintenance overhead while allowing for much richer, more dynamic spatial relationships to be leveraged in analysis and workflows.

By embedding DE-9IM spatial predicates into machine learning workflows, organizations can extract richer, context-aware features from their data. For example, rather than merely knowing two infrastructure assets are ‘related,’ DE-9IM enables classification of whether one is physically inside a risk zone, adjacent to a hazard, or entirely separate—substantially improving the precision of classification models, risk assessments, and operational planning.

Machine learning and AI systems benefit from the DE-9IM framework by gaining access to structured, machine-readable spatial relationships without requiring manual feature engineering. Instead of inferring spatial context from raw coordinates or designing custom proximity rules, models can directly leverage DE-9IM predicates as input features. This enhances model performance in tasks such as spatial clustering, anomaly detection, and context-aware classification, where the precise nature of spatial interactions often carries critical predictive signals. Integrating DE-9IM into AI pipelines streamlines spatial feature extraction, improves model explainability, and reduces the risk of omitting important spatial dependencies.
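As a small illustration of how DE-9IM output can become model features, the Python sketch below matches nine-character DE-9IM matrix strings against the standard pattern masks for a few named predicates. The three example matrices are written by hand to represent polygon pairs that are inside, adjacent, and separate; in practice such strings would come from a spatial library or database function (for example, PostGIS `ST_Relate`). The helper names and the asset/risk-zone framing are assumptions made for this example.

```python
# Standard DE-9IM pattern masks for a few named predicates.
# 'T' = the intersection exists (dimension 0, 1, or 2), 'F' = it is empty,
# '*' = don't care. "touches" is satisfied by any one of three masks.
PREDICATE_PATTERNS = {
    "within": ["T*F**F***"],
    "contains": ["T*****FF*"],
    "disjoint": ["FF*FF****"],
    "touches": ["FT*******", "F**T*****", "F***T****"],
}


def matches(matrix, pattern):
    """Return True if a 9-character DE-9IM matrix satisfies a pattern mask."""
    if len(matrix) != 9 or len(pattern) != 9:
        raise ValueError("DE-9IM matrices and patterns must have 9 characters")
    for cell, want in zip(matrix, pattern):
        if want == "*":
            continue
        if want == "T":
            if cell not in ("0", "1", "2", "T"):
                return False
        elif cell != want:
            return False
    return True


def relation_features(matrix):
    """Turn one DE-9IM matrix into a dict of boolean predicate features."""
    return {
        name: any(matches(matrix, p) for p in patterns)
        for name, patterns in PREDICATE_PATTERNS.items()
    }


# Hand-written matrices for an asset footprint versus a risk zone (both polygons).
examples = {
    "inside": "2FF1FF212",    # asset lies strictly inside the zone
    "adjacent": "FF2F11212",  # asset shares an edge with the zone
    "separate": "FF2FF1212",  # asset and zone do not meet
}

for label, matrix in examples.items():
    print(label, relation_features(matrix))
```

Each matrix collapses into a handful of booleans that can sit beside any other column in a training table, so the model receives the nature of the relationship rather than raw geometry.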

Harnessing geospatial intelligence without relying on maps opens up powerful new pathways for innovation, operational excellence, and automation. Whether optimizing infrastructure, improving predictive maintenance, or enriching machine learning models with spatial logic, organizations can leverage these techniques to achieve better outcomes with less overhead. At Cercana Systems, we specialize in helping clients turn geospatial data into actionable insights that drive real-world results. Ready to put geospatial AI to work for you? Contact us today to learn how we can help you modernize and optimize your data-driven workflows.

References

Clementini, E., & Felice, P. D. (1993). A model for representing topological relationships between complex geometric objects. ACM Transactions on Information Systems, 11(2), 161–193. https://doi.org/10.1016/0020-0255(95)00289-8

AltexSoft. (2020). Predictive maintenance: Employing IIoT and machine learning to prevent equipment failures. AltexSoft. https://www.altexsoft.com/blog/predictive-maintenance/

Dash, S. K. (2023, May 10). Crop classification via satellite image time-series and PSETAE deep learning model. Medium. https://medium.com/geoai/crop-classification-via-satellite-image-time-series-and-psetae-deep-learning-model-c685bfb52ce

MIT Urban Mobility Lab. (n.d.). Machine learning for transportation. Massachusetts Institute of Technology. https://mobility.mit.edu/machine-learning

EvariLABS. (2025, April 14). Photometrics AI. https://www.linkedin.com/pulse/what-counts-real-roi-streetlight-owners-operators-photometricsai-vqv7c/

Reflections on the Process of Planning FedGeoDay 2025

What is FedGeoDay?

FedGeoDay is a single-track conference dedicated to federal use-cases of open geospatial ecosystems. The open ecosystems have a wide variety of uses and forms, but largely include anything designed around open data, open source software, and open standards. The main event is a one day commitment and is followed by a day of optional hands-on workshops. 

FedGeoDay has existed for roughly a decade, serving as a day of learning, networking, and collaboration in the Washington, D.C. area. Recently, Cercana Systems president Bill Dollins was invited to join the planning committee, and served as one of the co-chairs for FedGeoDay 2024 and 2025. His hope is that attendees are able to come away with practical examples of how to effectively use open geospatial ecosystems in their jobs.

Photo courtesy of OpenStreetMap US on LinkedIn.

“Sometimes the discussion around those concepts can be highly technical and even a little esoteric, and that’s not necessarily helpful for someone who’s just got a day job that revolves around solving a problem. Events like this are very helpful in showing practical ways that open software and open data can be used.”

Dollins joined the committee for a multitude of reasons. In this post, we will explore some of his reasons for joining, as well as what he thinks he brings to the table in planning the event and things he has learned from the process. 

Why did you join the committee?

When asked for some of the reasons why he joined the planning committee for FedGeoDay, Dollins indicated that his primary purpose was to give back to a community that has been very helpful and valuable to him throughout his career in a very hands-on way. 

“In my business, I derive a lot of value from open-source software. I use it a lot in the solutions I deliver in my consulting, and when you’re using open-source software you should find a way that works for you to give back to the community that developed it. That can come in a number of ways. That can be contributing code back to the projects that you use to make them better. You can develop documentation for it, you can provide funding, or you can provide education, advocacy, and outreach. Those last three components are a big part of what FedGeoDay does.”

He also says that while being a co-chair of such an impactful event helps him maintain visibility in the community, getting the opportunity to keep his team working skills fresh was important to him, too. 

“For me, also, I’m self-employed. Essentially, I am my team,” said Dollins. “It can be really easy to sit at your desk and deliver things and sort of lose those skills.”

What do you think you brought to the committee?

Dollins has had a long career in the geospatial field and has spent the majority of his time in leadership positions, so he was confident in his ability to contribute in this new form of leadership role. Event planning is a beast of its own, but early on in the more junior roles of his career, the senior leadership around him went out of their way to teach him about project cost management, staffing, and planning agendas. He then was able to take those skills into a partner role at a small contracting firm where he wore every hat he could fit on his head for the next 15 years, including still doing a lot of technical and development work. Following his time there, he had the opportunity to join the C-suite of a private sector SaaS company and was there for six years, really rounding out his leadership experience. 

He felt one thing he was lacking in was experience in community engagement, and event planning is a great way to develop those skills. 

“Luckily, there’s a core group of people who have been planning and organizing these events for several years. They’re generally always happy to get additional help and they’re really encouraging and really patient in showing you the rules of the road, so that’s been beneficial, but my core skills around leadership were what applied most directly. It also didn’t hurt that I’ve worked with geospatial technology for over 30 years and open-source geospatial technology for almost 20, so I understood the community these events serve and the technology they are centered around,” said Dollins.

Photo courtesy of Ran Goldblatt on LinkedIn.

What were some of the hard decisions that had to be made?

Photo courtesy of Cercana Systems on LinkedIn.

Attendees of FedGeoDay in previous years will likely remember that, in the past, the event has always been free for feds to attend. The planning committee, upon examining the revenue sheets from last year’s event, noted that the single largest unaccounted-for cost was the free luncheon. A post-event survey was sent out, and federal attendees largely indicated that they would not take issue with contributing $20 to cover the cost of lunch. However, the landscape of the community changed in a manner most people did not see coming.

“We made the decision last year, and keep in mind the tickets went on sale before the change of administration, so at the time we made the decision last year it looked like a pretty low-risk thing to do,” said Dollins.

Dollins continued to say that while the landscape changes any time the administration changes, even without changing parties in power, this one has been a particularly jarring change. 

“There’s probably a case to be made that we could have bumped up the cost of some of the sponsorships and possibly the industry tickets a little bit and made an attempt to close the gap that way. We’ll have to see what the numbers look like at the end. The most obvious variable cost was the cost of lunches against the free tickets, so it made sense to do last year and we’ll just have to look and see how the numbers play out this year.”**

What have you taken away from this experience?

Dollins says one of the biggest takeaways from the process of helping to plan FedGeoDay has been learning to apply leadership in a different context. Throughout most of his career, he has served as a leader in more traditional team structures with a clearly defined hierarchy and specified roles. When working with a team of volunteers that have their own day jobs to be primarily concerned with, it requires a different approach. 

“Everyone’s got a point of view, everyone’s a professional and generally a peer of yours, and so there’s a lot more dialogue. The other aspect is that it also means everyone else has a day job, so sometimes there’s an important meeting and the one person that you needed to be there couldn’t do it because of that. You have to be able to be a lot more asynchronous in the way you do these things. That’s a good thing to give you a different approach to leadership and team work,” said Dollins on the growth opportunity. 

Dollins has even picked up some new work from his efforts on the planning committee by virtue of getting to work and network with people that weren’t necessarily in his circle beforehand. Though he’s worked in the geospatial field for 30 years and focused heavily on open-source work for 20, he says he felt hidden away from the community in a sense during his time in the private sector. 

Photo courtesy of Lane Goodman on LinkedIn.

“This has helped me get back circulating in the community and to be perceived in a different way. In my previous iterations, I was seen mainly from a technical perspective, and so this has kind of helped me let the community see me in a different capacity, which I think has been beneficial.”

FedGeoDay 2025 has concluded and was a huge success for all involved. Cercana Systems looks forward to continuing to sponsor the event going forward, and Dollins looks forward to continuing to help this impactful event bring the community together in the future. 

Photo courtesy of Cercana Systems on LinkedIn.

**This interview was conducted before FedGeoDay 2025 took place. The event exceeded the attendance levels of FedGeoDay 2024. 

FedGeoDay 2025 Highlights

The Cercana Systems team had a wonderful time attending FedGeoDay 2025 in Washington, D.C.! It was fun to catch up with long-time colleagues, make new professional connections, and learn how a wide array of new projects are contributing to the ever-evolving world of open geospatial ecosystems. 

A standout highlight was the in-depth keynote by Katie Picchione of NASA’s Disasters Program on the critical role played by open geospatial data in disaster response. Additionally, Ryan Burley of GeoSolutions moderated an excellent panel on Open-Source Geospatial Applications for Resilience, and Eddie Pickle of Crunchy Data led an energetic panel on Open Data for Resilience. 

We were especially excited about the “Demystifying AI” panel with panelists Emily Kalda of RGi, Jason Gilman of Element 84, Ran Goldblatt of New Light Technologies, and Jackie Kazil of Bana Solutions, which was moderated by Cercana’s president Bill Dollins.

Location is an increasingly important component of cybersecurity, and FedGeoDay featured a fireside chat on cybersecurity led by Ashley Fairman of DICE Cyber. On either side of the lunch break, Wayne Hawkins of RGi moderated a series of informative lightning talks on a range of topics.

FedGeoDay was a content-rich event that was upbeat from beginning to end. We are grateful to all of the presenters and panelists for taking the time to share their knowledge and to the organizing committee for their work in pulling together such a high-quality event. Cercana is proud to support FedGeoDay and looks forward to continuing to do so for years to come.

Why Young Professionals Should Get Out of the Office and Into Industry Events

In today’s fast-paced professional world, it’s easy for young professionals to assume that hard work alone will get them ahead. While grinding at the desk and delivering results matters, relying solely on your work to speak for itself may leave you overlooked in a competitive field. Getting out of the office and into local conferences, workshops, and networking events can provide invaluable opportunities that simply can’t be replicated from behind a desk.

Build Meaningful Professional Relationships

Networking remains one of the most powerful tools for career growth. According to a 2023 LinkedIn survey, 85% of job roles are filled through networking, not traditional applications. Attending local conferences puts you face-to-face with people in your industry—from potential mentors and collaborators to future employers and clients. These relationships can open doors to new opportunities that might never make it to job boards or public listings.

Stay Current With Industry Trends

Local events are also a great way to keep your knowledge sharp and up to date. Industry leaders often use conferences as platforms to discuss the latest trends, tools, and innovations. The Harvard Business Review emphasizes that staying current with changes in your field helps you remain relevant and competitive, especially in industries being rapidly transformed by technology and globalization (HBR, 2021).

Showcase Yourself Beyond the Resume

When you attend events, you get the chance to show people not just what you do—but how you do it. Your communication style, curiosity, and initiative become part of the impression you make. This visibility can lead to referrals, collaborations, or speaking invitations, all of which enhance your professional reputation in ways your LinkedIn profile alone cannot.

Gain Confidence and Soft Skills

Finally, stepping into a room full of peers and industry veterans can be intimidating—but it’s also empowering. Each interaction hones your communication skills, boosts your confidence, and teaches you how to navigate complex social dynamics in a professional context—critical soft skills that employers value highly.

Bottom Line

If you’re a young professional looking to grow, staying in your comfort zone won’t cut it. Attending local conferences and events is more than just networking—it’s about investing in your personal and professional development. Get out there, be visible, and let the right people see what you’re capable of.

Sources

Demystifying the Medallion Architecture for Geospatial Data Processing

Introduction

Geospatial data volumes and complexity are growing due to diverse sources, such as GPS, satellite imagery, and sensor data. Traditional geospatial processing methods face challenges, including scalability, handling various formats, and ensuring data consistency. The medallion architecture offers a layered approach to data management, improving data processing, reliability, and scalability. While the medallion architecture is often associated with specific implementations such as Delta Lake, its concepts are applicable to other technical implementations. This post introduces the medallion architecture and discusses two workflows (traditional GIS-based and advanced cloud-native) to demonstrate how it can be applied to geospatial data processing.

Overview of the Medallion Architecture

The medallion architecture was developed to address the need for incremental, layered data processing, especially in big data and analytics environments. It is composed of three layers:

  • Bronze Layer: Stores raw data as-is from various sources.
  • Silver Layer: Cleans and transforms data for consistency and enrichment.
  • Gold Layer: Contains aggregated and optimized data ready for analysis and visualization.

The architecture is particularly useful in geospatial applications due to its ability to handle large datasets, maintain data lineage, and support both batch and real-time data processing. This structured approach ensures that data quality improves progressively, making downstream consumption more reliable and efficient.

Why Geospatial Data Architects Should Consider the Medallion Architecture

Geospatial data processing involves unique challenges, such as handling different formats (raster, vector), managing spatial operations (joins, buffers), and accommodating varying data sizes. Traditional methods struggle when scaling to large, real-time datasets or integrating data from multiple sources. The medallion architecture addresses these challenges through its layered approach. The bronze layer preserves the integrity of raw data, allowing transformations to be traced easily. The silver layer handles transformations of the data, such as projections, spatial joins, and data enrichment. The gold layer provides performance-optimized data that is ready for downstream systems to consume.
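The three layers can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions rather than a reference implementation: the sensor records are hypothetical, in-memory lists stand in for real storage, and the silver and gold steps (coordinate validation and counts per grid cell) are simple stand-ins for the projections, joins, and aggregations a production pipeline would perform.

```python
import math
from collections import Counter

# Bronze: raw records kept exactly as received (hypothetical sensor feed).
bronze = [
    "sensor-1,39.01,-76.49",
    "sensor-2,not-a-number,-76.40",  # malformed latitude
    "sensor-3,39.02,-76.48",
    "sensor-4,95.00,-76.40",         # latitude out of range
    "sensor-5,38.20,-76.10",
]


def to_silver(raw_lines):
    """Parse and validate raw lines, dropping records that cannot be cleaned."""
    cleaned = []
    for line in raw_lines:
        parts = line.split(",")
        if len(parts) != 3:
            continue
        sensor_id, lat_text, lon_text = parts
        try:
            lat, lon = float(lat_text), float(lon_text)
        except ValueError:
            continue
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            continue
        cleaned.append({"id": sensor_id, "lat": lat, "lon": lon})
    return cleaned


def to_gold(records, cell_size=0.5):
    """Aggregate cleaned points into counts per grid cell of cell_size degrees."""
    counts = Counter()
    for rec in records:
        cell = (math.floor(rec["lat"] / cell_size), math.floor(rec["lon"] / cell_size))
        counts[cell] += 1
    return dict(counts)


silver = to_silver(bronze)
gold = to_gold(silver)
print(len(bronze), "raw ->", len(silver), "clean ->", gold)
```

Because the bronze list is never modified, any silver or gold result can be recomputed or audited against the original input, which is the lineage property the layered approach is meant to protect.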

Example Workflow 1: Traditional GIS-Based Workflow  

For organizations that rely on established GIS tools or operate with limited cloud infrastructure, the medallion architecture provides a structured approach to data management while maintaining compatibility with traditional workflows. This method ensures efficient handling of both vector and raster data, leveraging familiar GIS technologies while optimizing data accessibility and performance.  

This workflow integrates key technologies to support data ingestion, processing, and visualization. FME serves as the primary ETL tool, streamlining data movement and transformation. Object storage solutions like AWS S3 or Azure Blob Storage store raw spatial data, ensuring scalable and cost-effective management. PostGIS enables spatial analysis and processing for vector datasets. Cloud-Optimized GeoTIFFs (COGs) facilitate efficient access to large raster datasets by allowing partial file reads, reducing data transfer and processing overhead.

Bronze – Raw Data Ingestion 

The process begins with the ingestion of raw spatial data into object storage. Vector datasets, such as Shapefiles and CSVs containing spatial attributes, are uploaded alongside raster datasets like GeoTIFFs. FME plays a crucial role in automating this ingestion, ensuring that all incoming data is systematically organized and accessible for further processing.  

Silver – Data Cleaning and Processing

At this stage, vector data is loaded into PostGIS, where essential transformations take place. Operations such as spatial joins, coordinate system projections, and attribute filtering help refine the dataset for analytical use. Meanwhile, raster data undergoes optimization through conversion into COGs using FME. This transformation enhances performance by enabling GIS applications to read only the necessary portions of large imagery files, improving efficiency in spatial analysis and visualization.  

Gold – Optimized Data for Analysis and Visualization  

Once processed, the refined vector data in PostGIS and optimized raster datasets in COG format are made available for GIS tools. Analysts and decision-makers can interact with the data using platforms such as QGIS, Tableau, or GeoServer. These tools provide the necessary visualization and analytical capabilities, allowing users to generate maps, conduct spatial analyses, and derive actionable insights.

This traditional GIS-based implementation of medallion architecture offers several advantages. It leverages established GIS tools and workflows, minimizing the need for extensive retraining or infrastructure changes. It is optimized for traditional environments yet still provides the flexibility to integrate with hybrid or cloud-based analytics platforms. Additionally, it enhances data accessibility and performance, ensuring that spatial datasets remain efficient and manageable for analysis and visualization.  

By adopting this workflow, organizations can modernize their spatial data management practices while maintaining compatibility with familiar GIS tools, resulting in a seamless transition toward more structured and optimized data handling. 

Example Workflow 2: Advanced Cloud-Native Workflow  

For organizations managing large-scale spatial datasets and requiring high-performance processing in cloud environments, a cloud-native approach to medallion architecture provides scalability, efficiency, and advanced analytics capabilities. By leveraging distributed computing and modern storage solutions, this workflow enables seamless processing of vector and raster data while maintaining cost efficiency and performance.  

This workflow is powered by cutting-edge cloud-native technologies that optimize storage, processing, and version control. 

Object Storage solutions such as AWS S3, Google Cloud Storage, or Azure Blob Storage serve as the foundation for storing raw geospatial data, ensuring scalable and cost-effective data management. Apache Spark with Apache Sedona enables large-scale spatial data processing, leveraging distributed computing to handle complex spatial joins, transformations, and aggregations. Delta Lake provides structured data management, supporting versioning and ACID transactions to ensure data integrity throughout processing. RasterFrames or Rasterio facilitate raster data transformations, including operations like mosaicking, resampling, and reprojection, while optimizing data storage and retrieval.  

Bronze – Raw Data Ingestion

The workflow begins by ingesting raw spatial data into object storage. This includes vector data such as GPS logs in CSV format and raster data like satellite imagery stored as GeoTIFFs. By leveraging cloud-based storage solutions, organizations can manage and access massive datasets without traditional on-premises limitations.  

Silver – Data Processing and Transformation

At this stage, vector data undergoes large-scale processing using Spark with Sedona. Distributed spatial operations such as filtering, joins, and projections enable efficient refinement of large datasets. Meanwhile, raster data is transformed using RasterFrames or Rasterio, which facilitate operations like mosaicking, resampling, and metadata extraction. These tools ensure that raster datasets are optimized for both analytical workloads and visualization purposes.  

Gold – Optimized Data for Analysis and Visualization

Once processed, vector data is stored in Delta Lake, where it benefits from structured storage, versioning, and enhanced querying capabilities. This ensures that analysts can access well-maintained datasets with full historical tracking. Optimized raster data is converted into Cloud-Optimized GeoTIFFs, allowing efficient cloud-based visualization and integration with GIS tools. These refined datasets can then be used in cloud analytics environments or GIS platforms for advanced spatial analysis and decision-making.  

This cloud-native implementation of medallion architecture provides several advantages for large-scale spatial data workflows: high scalability, which enables efficient processing of vast datasets without the constraints of traditional infrastructure; parallelized data transformations, which significantly reduce processing time through distributed computing frameworks; and cloud-native optimizations, which ensure seamless integration with advanced analytics platforms, storage solutions, and visualization tools.

By adopting this approach, organizations can harness the power of cloud computing to manage, analyze, and visualize geospatial data at an unprecedented scale, improving both efficiency and insight generation.  

Comparing the Two Workflows

Aspect | Traditional Workflow (FME + PostGIS) | Advanced Workflow (Spark + Delta Lake)
Scalability | Suitable for small to medium workloads | Ideal for large-scale datasets
Technologies | FME, PostGIS, COGs, file system or object storage | Spark, Sedona, Delta Lake, RasterFrames, object storage
Processing Method | Sequential or batch processing | Parallel and distributed processing
Performance | Limited by local infrastructure or on-premise servers | Optimized for cloud-native and distributed environments
Use Cases | Small teams, traditional GIS setups, hybrid cloud setups | Large organizations, big data environments

Key Takeaways

The medallion architecture offers much-needed flexibility and scalability for geospatial data processing. It meshes well with traditional workflows built on FME and PostGIS, which suit organizations with established GIS infrastructure. It can also be used in cloud-native workflows built on Apache Spark and Delta Lake, which provide scalability for large-scale processing. Both workflows can be adapted to the organization’s technological maturity and requirements.

Conclusion

Medallion architecture provides a structured, scalable approach to geospatial data management, ensuring better data quality and streamlined processing. Whether using a traditional GIS-based workflow or an advanced cloud-native approach, this framework helps organizations refine raw spatial data into high-value insights. By assessing their infrastructure and data needs, teams can adopt the workflow that best aligns with their goals, optimizing efficiency and unlocking the full potential of their geospatial data.

Using Hstore to Analyze OSM in PostgreSQL

OpenStreetMap (OSM) is a primary authoritative source of geographic information, offering a variety of community-validated feature types. However, efficiently querying and analyzing OSM poses unique challenges. PostgreSQL, with its hstore data type, can be a powerful tool in the data analyst’s arsenal.

Understanding hstore in PostgreSQL

Before getting into the specifics of OpenStreetMap, let’s understand the hstore data type. Hstore is a key-value store within PostgreSQL, allowing data to be stored in a schema-less fashion. This flexibility makes it ideal for handling semi-structured data like OpenStreetMap.
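As a rough mental model, an hstore value behaves like a dictionary of text keys mapped to text values. The Python sketch below is only an analogy, not PostgreSQL itself: the sample features and the helper names get_tag and has_key are made up for illustration, and they mimic the hstore -> (get value) and ? (key exists) operators that appear in the queries later in this post.

```python
# Each OSM feature carries a free-form set of tags. An hstore column holds
# them as text keys mapped to text values; a dict of str -> str is a rough
# analogy. The features below are made-up sample data.
features = [
    {"name": "Corner Slice", "tags": {"amenity": "restaurant", "cuisine": "pizza"}},
    {"name": "Bay Bench", "tags": {"amenity": "bench"}},
    {"name": "Dockside", "tags": {"amenity": "restaurant", "cuisine": "seafood"}},
]

def get_tag(tags, key):
    """Roughly `tags -> key` in SQL: the value, or None (NULL) if the key is absent."""
    return tags.get(key)

def has_key(tags, key):
    """Roughly `tags ? key` in SQL: does the key exist at all?"""
    return key in tags

pizza_places = [f["name"] for f in features if get_tag(f["tags"], "cuisine") == "pizza"]
with_cuisine = [f["name"] for f in features if has_key(f["tags"], "cuisine")]
print(pizza_places)
print(with_cuisine)
```

Note that no feature is required to have a cuisine key at all, which is exactly the schema-less flexibility described above.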

Setting Up Your Environment

To get started, you’ll need a PostgreSQL database with the PostGIS extension, which adds support for geographic objects. You will also need to add support for the hstore type. Both PostGIS and hstore are installed as extensions. The SQL to install them is:

create extension postgis;
create extension hstore;

After setting up your database, import OpenStreetMap data using a tool like osm2pgsql, making sure the hstore option is enabled. This step is crucial because it stores the key-value pairs of OSM tags in an hstore column. Be sure to install osm2pgsql using the instructions for your platform.

The syntax for importing is as follows:

osm2pgsql -c -d my_database -U my_username -W -H my_host -P my_port --hstore my_downloaded.osm

Querying OpenStreetMap Data

With your data imported, you can now unleash the power of hstore. Here’s a basic example: Let’s say you want to find all the named places that serve pizza. The SQL query would look something like this:

SELECT name, tags
FROM planet_osm_point
WHERE name IS NOT NULL
AND tags -> 'cuisine' = 'pizza';

This query demonstrates the power of using hstore to filter data based on specific key-value pairs (finding pizza shops in this case).

Advanced Analysis Techniques

While basic queries are useful, the real power of hstore comes with its ability to facilitate complex analyses. For example, you can aggregate data based on certain criteria, such as counting the number of amenities in a given area or categorizing roads based on their condition.

Here is an example that totals the sources for each type of cuisine available in Leonardtown, Maryland:

SELECT tags -> 'cuisine' AS cuisine_type, COUNT(*) AS total
FROM planet_osm_point
WHERE tags ? 'cuisine'
AND ST_Within(ST_Transform(way, 4326), ST_MakeEnvelope(-76.66779675183034, 38.285044882153485, -76.62251613561185, 38.31911201477845, 4326))
GROUP BY tags -> 'cuisine'
ORDER BY total DESC;

The above query combines hstore analysis with a PostGIS function to limit the query to a specific area. The full range of PostGIS functions can be used to perform spatial analysis in combination with hstore queries. For instance, you could analyze the spatial distribution of certain amenities, like public toilets or bus stops, within a city. You can use PostGIS functions to calculate distances, create buffers, and perform spatial joins.

Performance Considerations

Working with large datasets like OpenStreetMap can be resource-intensive, so indexing your hstore column is crucial for performance. A GIN (Generalized Inverted Index) index on an hstore column can significantly speed up queries that use the key-existence (?) and containment (@>) operators.

Challenges and Best Practices

While hstore is powerful, it also comes with challenges. The schema-less nature of hstore can lead to inconsistencies in data, especially if the source data is not standardized. It’s important to clean and preprocess your data before analysis. OSM tends to preserve local flavor in attribution, so a good knowledge of the geographic area you are analyzing will help you be more successful when using hstore with OSM.

Conclusion

The PostgreSQL hstore data type is a potent tool for analyzing OpenStreetMap data. Its flexibility in handling semi-structured data, combined with the spatial analysis capabilities of PostGIS, makes it a compelling resource for geospatial analysts. By understanding its strengths and limitations, you can harness the power of PostgreSQL and OpenStreetMap in your work.

Remember, the key to effective data analysis is not just about choosing the right tools but also understanding the data itself. With PostgreSQL and hstore, you are well-equipped to extract meaningful insights from OpenStreetMap data.

Contact us to learn more about our services and how we can help turn your geospatial challenges into opportunities.

Do You Need a Data Pipeline?

Do you need a data pipeline? That depends on a few things. Does your organization see data as an input into its key decisions? Is data a product? Do you deal with large volumes of data or data from disparate sources? Depending on the answers to these and other questions, you may be looking at the need for a data pipeline. But what is a data pipeline and what are the considerations for implementing one, especially if your organization deals heavily with geospatial data? This post will examine those issues.

A data pipeline is a set of actions that extract, transform, and load data from one system to another. A data pipeline may be set up to run on a specific schedule (e.g., every night at midnight), or it might be event-driven, running in response to specific triggers or actions. Data pipelines are critical to data-driven organizations, as key information may need to be synthesized from various systems or sources. A data pipeline automates accepted processes, enabling data to be efficiently and reliably moved and transformed for analysis and decision-making.
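As a minimal sketch of the extract, transform, and load stages, the Python below reads rows from a CSV, converts them to GeoJSON features, and writes a FeatureCollection. Everything here is an assumption for illustration: the station rows are sample data with approximate coordinates, the function names are our own, and an in-memory buffer stands in for a real destination system. Scheduling and event triggers are left out.

```python
import csv
import io
import json

# Hypothetical export from an upstream system (coordinates are approximate).
SOURCE_CSV = """station,lat,lon
Solomons,38.3200,-76.4500
Annapolis,38.9784,-76.4922
"""

def extract(text):
    """Extract: read rows from the source system (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: convert each row into a GeoJSON Feature (lon, lat order)."""
    return [
        {
            "type": "Feature",
            "geometry": {
                "type": "Point",
                "coordinates": [float(r["lon"]), float(r["lat"])],
            },
            "properties": {"station": r["station"]},
        }
        for r in rows
    ]

def load(features, destination):
    """Load: write a FeatureCollection to any writable file-like object."""
    json.dump({"type": "FeatureCollection", "features": features}, destination)

def run_pipeline(text, destination):
    """Run the three stages in order."""
    load(transform(extract(text)), destination)

out = io.StringIO()
run_pipeline(SOURCE_CSV, out)
result = json.loads(out.getvalue())
print(len(result["features"]), result["features"][0]["geometry"]["coordinates"])
```

Keeping each stage as its own function means any one of them can later be swapped for a more capable tool without touching the others.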

A data pipeline can start small (maybe a set of shell or Python scripts that run on a schedule) and it can be modified to grow along with your organization to the point where it may be driven by a full-fledged event-driven platform like Airflow or FME (discussed later). It can be confusing, and there are a lot of commercial and open-source solutions available, so we’ll try to demystify data pipelines in this post.

Geospatial data presents unique challenges in data pipelines. Geospatial datasets are often large and complex, containing multiple dimensions of information (geometry, elevation, time, etc.). Processing and transforming this data can be computationally intensive and may require significant storage capacity. Managing this complexity efficiently is a major challenge. Data quality and accuracy are also a challenge. Geospatial data can come from a variety of sources (satellites, sensors, user inputs, etc.) and can be prone to errors, inconsistencies, or inaccuracies. Ensuring data quality (dealing with missing data, handling noise and outliers, verifying the accuracy of coordinates) adds complexity to standard data management processes.

Standardization and interoperability, while not concerns unique to geospatial data, are harder because of the nature of the data. There are many different formats, standards, and coordinate systems in use (for example, data may need to be reconciled between WGS84, Mercator, state plane, and various national grids). Transforming between these can be complex, due to issues such as datum transformation. Furthermore, metadata (data about the data) is crucial in geospatial datasets to understand the context, source, and reliability of the data, which adds another layer of complexity to the processing pipeline.
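To give a feel for what a coordinate transformation involves, here is a sketch of one of the simpler cases: projecting WGS84 longitude and latitude to spherical Web Mercator using the standard formulas. The function name and the latitude cutoff constant are our own choices. This case needs no datum shift because both systems are based on WGS84; conversions involving state plane or national grids generally do, and in practice are best left to a library such as PROJ or GDAL.

```python
import math

EARTH_RADIUS_M = 6378137.0   # WGS84 semi-major axis, used by Web Mercator
MAX_LATITUDE = 85.051129     # Web Mercator is undefined toward the poles

def wgs84_to_web_mercator(lon_deg, lat_deg):
    """Project lon/lat in degrees (EPSG:4326) to Web Mercator metres (EPSG:3857)."""
    if not -180.0 <= lon_deg <= 180.0:
        raise ValueError(f"longitude out of range: {lon_deg}")
    if not -MAX_LATITUDE <= lat_deg <= MAX_LATITUDE:
        raise ValueError(f"latitude outside Web Mercator range: {lat_deg}")
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

x, y = wgs84_to_web_mercator(-76.64, 38.29)
print(round(x, 2), round(y, 2))
```

Rejecting out-of-range input up front is a small example of the coordinate validation mentioned earlier: a swapped latitude and longitude often fails this check immediately instead of silently producing a point on the wrong continent.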

While these challenges make the design, implementation, and management of data pipelines for geospatial data a complex task, they can provide significant benefits to organizations that process large amounts of geospatial data:

  • Efficiency and automation: Data pipelines can automate the entire process of data extraction, transformation, and loading (ETL). Automation is particularly powerful in the transformation stage. “Transformation” is a deceptively simple term for a process that can contain many enrichment and standardization tasks. For example, as the coordinate system transformations described above are validated, they can be automated and included in the transformation stage to remove human error. Additionally, tools like Segment Anything can be called during this stage to turn imagery into actionable, analyst-ready information.
  • Data quality and consistency: The transformation phase includes steps to clean and validate data, helping to ensure data quality. This can include resolving inconsistencies, filling in missing values, normalizing data, and validating the format and accuracy of geospatial coordinates. By standardizing and automating these operations in a pipeline, an organization can ensure that the same operations are applied consistently to all data, improving overall data quality and reliability.
  • Data Integration: So far, we’ve talked a lot about the transformation phase, but the extract phase provides integration benefits. A data pipeline allows for the integration of diverse data sources, such as your CRM, ERP, or support ticketing system. It also enables extraction from a wide variety of formats (shapefile, GeoParquet, GeoJSON, GeoPackage, etc.). This is crucial for organizations dealing with geospatial data, as it often comes from a variety of sources in different formats. Integration with data from business systems can provide insights into performance as it relates to the use of geospatial data.
  • Staging analyst-ready data: With good execution, a data pipeline produces clean, consistent, integrated data that enables people to conduct advanced analysis, such as predictive modeling, machine learning, or complex geospatial statistical analysis. This can provide valuable insights and support data-driven decision making.

A data pipeline is first and foremost about automating accepted data acquisition and management processes for your organization, but it is ultimately a technical architecture that will be added to your portfolio. The technology ecosystem for such tools is vast, but we will discuss a few with which we have experience.

  • Apache Airflow: Developed by Airbnb and later donated to the Apache Software Foundation, Airflow is a platform to programmatically author, schedule, and monitor workflows. It uses directed acyclic graphs (DAGs) to manage workflow orchestration. It supports a wide range of integrations and is highly customizable, making it a popular choice for complex data pipelines. Airflow is capable of being your entire data pipeline.
  • GDAL/OGR: The Geospatial Data Abstraction Library (GDAL) is an open-source translator library for raster and vector geospatial data formats. It provides a unified API for over 200 types of geospatial data formats, allowing developers to write applications that are format-agnostic. GDAL supports various operations like format conversion, data extraction, reprojection, and mosaicking. It is used in GIS software like QGIS, ArcGIS, and PostGIS. As a library, it can also be used in large data processing tasks and in Airflow workflows. Its flexibility makes it a powerful component of a data pipeline, especially where support for geospatial data is required.
  • FME: FME is a data integration platform developed by Safe Software. It allows users to connect and transform data between over 450 different formats, including geospatial, tabular, and more. With its visual interface, users can create complex data transformation workflows without coding. FME’s capabilities include data validation, transformation, integration, and distribution. FME is well established in the geospatial information market and is the most geospatially literate commercial product in the data integration segment. In addition, it supports a wide range of non-spatial sources, including proprietary platforms such as Salesforce. FME has a wide range of components, making it possible for it to scale up to support enterprise-scale data pipelines.
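The DAG idea behind workflow orchestration can be illustrated without any of these tools. The sketch below uses Python’s standard-library graphlib rather than Airflow’s own API, and the task names are hypothetical. Each task lists the tasks it depends on, and the sorter produces an order in which every task runs only after its dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract_parcels": set(),
    "extract_imagery": set(),
    "reproject_parcels": {"extract_parcels"},
    "build_cogs": {"extract_imagery"},
    "publish": {"reproject_parcels", "build_cogs"},
}

def run_order(graph):
    """Return one valid execution order; raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(graph).static_order())

order = run_order(dependencies)
print(order)
```

The "acyclic" part matters: if two tasks depended on each other, no valid order would exist, and the sorter reports that as an error rather than looping forever.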

In addition to the tools listed above, there is a fairly crowded market segment for hosted solutions, known as “integration platform as a service” or iPaaS. These platforms generally have ready-made connectors for various sources and destinations, but spatial awareness tends to be limited, as do the options for customizing them to add spatial capabilities. A good data pipeline is tightly coupled to the data governance procedures of your organization, so you’ll see greater benefits from technologies that allow you to customize to your needs.

Back to the original question: Do you need a data pipeline? If data-driven decisions are key to your organization, and consistent data governance is necessary to have confidence in your decisions, then you may need a data pipeline. At Cercana, we have experience implementing data pipelines and data governance procedures for organizations large and small. Contact us today to learn more about how we can help you.