OpenStreetMap PBF Parser

Protocol Buffers from scratch, no libraries

stable

An OSM .pbf parser in C with a hand-written Protocol Buffers deserializer. 2,672 lines across 11 files, one runtime dependency (zlib), and no libprotobuf or protoc-generated code. Five CLI query modes against real OpenStreetMap extracts.

What it is

Given a .pbf file, the tool decodes the blob stream, inflates compressed data blocks, reconstructs HeaderBlock and PrimitiveBlock messages, and exposes the resulting OSM_Map as nodes, ways, and a bounding box. All Protocol Buffers parsing (tag/wire-type decoding, varints, length-delimited fields, fixed-width I32/I64, packed repeated fields, embedded messages) is hand-written against the Protocol Buffers wire-format spec, behind the API declared in include/protobuf.h. The CLI answers structural and key/value queries in a single pass over the parsed map.
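
The tag and varint decoding mentioned above can be sketched in a few lines. This is a minimal illustration rather than the project's code: varint_decode and tag_split are hypothetical names, and the real entry points (PB_read_tag, PB_read_value) have signatures fixed by include/protobuf.h that are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one base-128 varint from buf[0..len).
 * Returns the number of bytes consumed, or 0 if the input is truncated
 * or longer than the 10 bytes a 64-bit value can need. */
size_t varint_decode(const uint8_t *buf, size_t len, uint64_t *out)
{
    assert(buf != NULL && out != NULL);
    uint64_t value = 0;
    for (size_t i = 0; i < len && i < 10; i++) {
        /* The low 7 bits carry payload, least-significant group first. */
        value |= (uint64_t)(buf[i] & 0x7F) << (7 * i);
        if ((buf[i] & 0x80) == 0) {   /* high bit clear: last byte */
            *out = value;
            return i + 1;
        }
    }
    return 0;
}

/* Hypothetical helper: split a decoded tag into its field number
 * (upper bits) and wire type (low 3 bits). */
void tag_split(uint64_t tag, uint32_t *field, unsigned *wire_type)
{
    assert(field != NULL && wire_type != NULL);
    *wire_type = (unsigned)(tag & 0x7);
    *field = (uint32_t)(tag >> 3);
}
```

For example, the bytes 0x96 0x01 decode to 150, and a tag value of 0x12 splits into field number 2 with wire type 2 (length-delimited).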

By the numbers

Metric                Value
C source              2,672 lines across 11 files
Runtime dependencies  1 (zlib)
CLI query modes       5 (summary, bbox, node, way, tag filter)
Sample map            46,415 nodes, 5,812 ways
Sample bbox           -73.1387 .. -73.1074 lon, 40.9040 .. 40.9290 lat
Compile flags         -std=gnu11 -Wall -Werror

Architecture

.pbf file
   |
   v
+--------------------+
| main.c             |  argv -> validate -> OSM_read_Map -> query
+--------------------+
           |
           v
+--------------------+
| osmpbf.c           |  blob loop: BlobHeader, Blob, inflate,
|                    |  HeaderBlock vs PrimitiveBlock, build OSM_Map
+--------------------+
           |
           v
+--------------------+
| protobuf.c         |  tag/varint/length-delimited/I32/I64/packed
|                    |  decoding into PB_Field linked list
+--------------------+
           |
           v
+--------------------+
| zlib_inflate.c     |  zlib inflate() over fmemopen / open_memstream
+--------------------+

Path                  Lines  Role
src/osmpbf.c          1,038  OSM model, blob loop, string table, delta + zig-zag decode
src/protobuf.c          912  wire-format decoder, packed field expansion
src/process_args.c      233  CLI validation and query dispatch
include/protobuf.h       95  PB_Field, PB_Message, wire-type enum
src/zlib_inflate.c       83  zlib stream inflation over FILE streams
src/main.c               68  entrypoint, two-pass argv handling
include/osm.h            63  opaque OSM_Map / OSM_Node / OSM_Way
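
The decoder's output is described as a linked list of PB_Field entries with a sentinel head. A sentinel-headed circular doubly-linked list can be sketched as below. The struct layout and the names field_node, list_init, list_append, and list_find are assumptions made for illustration; the actual PB_Field definition lives in include/protobuf.h and is not shown in this write-up.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node layout, standing in for the real PB_Field. */
enum node_type { NODE_SENTINEL, NODE_VARINT, NODE_LEN };

struct field_node {
    enum node_type type;
    int number;                 /* field number; unused on the sentinel */
    struct field_node *next;
    struct field_node *prev;
};

/* An empty list is a sentinel whose links point back at itself. */
void list_init(struct field_node *sentinel)
{
    assert(sentinel != NULL);
    sentinel->type = NODE_SENTINEL;
    sentinel->number = 0;
    sentinel->next = sentinel;
    sentinel->prev = sentinel;
}

/* Insert just before the sentinel, i.e. at the tail. Because the list
 * is circular, no link is ever NULL and no special cases are needed. */
void list_append(struct field_node *sentinel, struct field_node *node)
{
    assert(sentinel != NULL && node != NULL);
    node->prev = sentinel->prev;
    node->next = sentinel;
    sentinel->prev->next = node;
    sentinel->prev = node;
}

/* Return the first node with the given field number, scanning from the
 * head (forward != 0) or from the tail (forward == 0). Reaching the
 * sentinel again means the whole list has been visited. */
struct field_node *list_find(struct field_node *sentinel, int number,
                             int forward)
{
    assert(sentinel != NULL);
    struct field_node *n = forward ? sentinel->next : sentinel->prev;
    while (n->type != NODE_SENTINEL) {
        if (n->number == number)
            return n;
        n = forward ? n->next : n->prev;
    }
    return NULL;
}
```

Scanning backward is useful for protobuf's "last one wins" rule on repeated occurrences of a non-repeated field, since the most recent occurrence is the first one found from the tail.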

Key features

  • Hand-written Protocol Buffers deserializer: PB_read_tag splits wire type (3 bits) and field number; PB_read_value dispatches on VARINT_TYPE, I64_TYPE, LEN_TYPE, I32_TYPE; fields stream into a circular doubly-linked list headed by a SENTINEL_TYPE for forward/backward traversal.
  • Embedded-message and packed-field handling: PB_read_embedded_message parses nested messages from in-memory buffers, inflating zlib blobs on the fly; PB_expand_packed_fields expands packed repeated scalars into individual PB_Field entries for uniform traversal.
  • Zlib decompression over memory buffers: compressed Blob payloads pipe through zlib_inflate using fmemopen / open_memstream, so inflation works against in-memory buffers without temp files.
  • Delta + zig-zag decoding: DenseNodes store IDs, lat, and lon as zig-zag-encoded deltas; the parser reverses the zig-zag mapping on each value ((n >> 1) ^ -(n & 1)) and accumulates the running sum, so negative coordinates round-trip. Nanodegrees print as decimal degrees at 6-digit precision.
  • Opaque OSM object model: include/osm.h exposes OSM_Map, OSM_BBox, OSM_Node, OSM_Way as opaque handles; nodes and ways carry parallel keys / vals arrays built from the PrimitiveBlock string table.
  • Five CLI query modes: -s summary, -b bounding box, -n <id> node lookup, -w <id> [key ...] way lookup (node refs or tag values); -f <file> supplies the input path. Argument order is flexible: a first pass validates the command line, a second pass runs the queries.
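
The DenseNodes decoding step can be illustrated with a short sketch. In the OSM PBF format the id, lat, and lon columns are packed sint64 fields, so each value is a varint that is zig-zag decoded with (n >> 1) ^ -(n & 1) and then added to a running sum. The function names below are hypothetical and the code is an illustration under those assumptions, not an excerpt from src/osmpbf.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Zig-zag decode: 0, 1, 2, 3, ... maps to 0, -1, 1, -2, ... */
int64_t zigzag_decode(uint64_t n)
{
    return (int64_t)(n >> 1) ^ -(int64_t)(n & 1);
}

/* Hypothetical helper: walk a packed sint64 payload of delta-coded
 * values and write the absolute values to out[0..cap).
 * Returns the number of values written, or -1 on a truncated varint,
 * an over-long varint, or an output buffer that is too small. */
long packed_delta_decode(const uint8_t *buf, size_t len,
                         int64_t *out, size_t cap)
{
    assert(buf != NULL && out != NULL);
    size_t pos = 0, count = 0;
    int64_t running = 0;

    while (pos < len) {
        uint64_t raw = 0;
        unsigned shift = 0;
        int done = 0;

        /* Read one varint: 7 payload bits per byte, low group first. */
        while (pos < len && shift < 64) {
            uint8_t byte = buf[pos++];
            raw |= (uint64_t)(byte & 0x7F) << shift;
            shift += 7;
            if ((byte & 0x80) == 0) {
                done = 1;
                break;
            }
        }
        if (!done || count == cap)
            return -1;

        /* Undo zig-zag first, then undo the delta coding. */
        running += zigzag_decode(raw);
        out[count++] = running;
    }
    return (long)count;
}
```

As a worked case, the payload 0x0A 0x03 0x06 holds the zig-zag values 10, 3, 6, which decode to deltas +5, -2, +3 and therefore to the absolute values 5, 3, 6.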

What makes it stand out

  • No libprotobuf, no protoc. The entire wire format (varint, zig-zag, packed repeated, embedded messages) and the OSM-level delta coding are decoded by hand against the Protocol Buffers spec. The only runtime link is libz.
  • Two-pass CLI against a single loaded map. Validation and query phases share one in-memory OSM_Map, so -s -b -n <id> in the same invocation parses the file once.
  • Valgrind-clean. Opaque types, explicit ownership, no leaks on the sample extract.
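
The two-pass argument handling could look roughly like the sketch below. It is an illustration under stated assumptions, not the logic in src/process_args.c: the function names are hypothetical, -w is simplified to take only an id (no key list), -f is assumed to be required and to appear once, and the dispatch callback stands in for the real query functions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Pass 1 (hypothetical): check every flag and operand before any file
 * I/O. Returns the -f path, or NULL if the command line is invalid. */
const char *validate_args(int argc, char **argv)
{
    assert(argv != NULL);
    const char *path = NULL;
    for (int i = 1; i < argc; i++) {
        if (strcmp(argv[i], "-s") == 0 || strcmp(argv[i], "-b") == 0)
            continue;                          /* flags with no operand */
        if (strcmp(argv[i], "-f") == 0 || strcmp(argv[i], "-n") == 0 ||
            strcmp(argv[i], "-w") == 0) {
            if (i + 1 >= argc)
                return NULL;                   /* missing operand */
            if (argv[i][1] == 'f') {
                if (path != NULL)
                    return NULL;               /* assumed: one -f only */
                path = argv[i + 1];
            } else {
                char *end;
                (void)strtoll(argv[i + 1], &end, 10);
                if (end == argv[i + 1] || *end != '\0')
                    return NULL;               /* id is not a number */
            }
            i++;                               /* skip the operand */
            continue;
        }
        return NULL;                           /* unknown flag */
    }
    return path;
}

/* Pass 2 (hypothetical): runs after the map has been loaded once and
 * dispatches each query flag in command-line order. Must only be
 * called on an argv that passed validate_args. Returns the number of
 * queries seen; dispatch may be NULL to count without running. */
int run_queries(int argc, char **argv,
                void (*dispatch)(char flag, const char *operand))
{
    int n = 0;
    for (int i = 1; i < argc; i++) {
        char flag = argv[i][1];
        const char *operand = NULL;
        if (flag == 'f' || flag == 'n' || flag == 'w')
            operand = argv[++i];
        if (flag == 'f')
            continue;                          /* input path, not a query */
        if (dispatch != NULL)
            dispatch(flag, operand);
        n++;
    }
    return n;
}
```

Splitting the work this way means a malformed command line is rejected before the file is opened, and the second pass can index argv without re-checking bounds.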

Stack

Layer        Technology
Language     C (-std=gnu11)
Build        GNU Make, gcc, -MMD auto-deps
Compression  zlib (-lz)
Tests        Criterion (-lcriterion)
Platform     Linux, macOS

Built for CSE 320 (Systems Fundamentals) at Stony Brook, Jan–Feb 2025. The course fixed the public API in include/protobuf.h and include/osm.h; the src/ implementation is original.