OpenStreetMap PBF Parser
Protocol Buffers from scratch, no libraries
An OSM .pbf parser in C with a hand-written Protocol Buffers
deserializer. 2,672 lines across 11 files, one runtime dependency
(zlib), and no libprotobuf or protoc-generated code. Five CLI
query modes against real OpenStreetMap extracts.
What it is
Given a .pbf file, the tool decodes the blob stream, inflates
compressed data blocks, reconstructs HeaderBlock and PrimitiveBlock
messages, and exposes the underlying OSM_Map as nodes, ways, and a
bounding box. All Protocol Buffers parsing (tag/wire-type decoding,
varints, length-delimited fields, fixed-width I32/I64, packed
repeated fields, embedded messages) is hand-written against the
Protocol Buffers wire format, behind the API declared in
include/protobuf.h. The CLI answers structural and key/value queries
in a single pass over the parsed map.
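As a rough illustration of the tag and varint decoding mentioned above, here is a minimal sketch. The names `sketch_read_varint` and `sketch_split_tag` are hypothetical and are not the project's `PB_read_tag` / `PB_read_value`; the sketch also assumes the input is already in memory rather than read from a stream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not the project's API): decode one base-128 varint
 * from buf[0..len). Returns the number of bytes consumed, or 0 if the
 * input is truncated or longer than the 10-byte maximum. */
static size_t sketch_read_varint(const uint8_t *buf, size_t len, uint64_t *out)
{
    assert(buf != NULL && out != NULL);
    uint64_t value = 0;
    for (size_t i = 0; i < len && i < 10; i++) {
        /* The low 7 bits carry payload, least-significant group first. */
        value |= (uint64_t)(buf[i] & 0x7F) << (7 * i);
        if ((buf[i] & 0x80) == 0) {   /* high bit clear: last byte */
            *out = value;
            return i + 1;
        }
    }
    return 0;
}

/* Split a decoded tag into its field number and 3-bit wire type. */
static void sketch_split_tag(uint64_t tag, uint32_t *field, uint32_t *wire_type)
{
    *field = (uint32_t)(tag >> 3);
    *wire_type = (uint32_t)(tag & 0x7);
}
```

For instance, the bytes `0x96 0x01` decode to 150, and the tag byte `0x0A` splits into field number 1 with wire type 2 (length-delimited).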
By the numbers
| Metric | Value |
|---|---|
| C source | 2,672 lines across 11 files |
| Runtime dependencies | 1 (zlib) |
| CLI query modes | 5 (summary, bbox, node, way, tag filter) |
| Sample map | 46,415 nodes, 5,812 ways |
| Sample bbox | -73.1387 .. -73.1074 lon, 40.9040 .. 40.9290 lat |
| Compile flags | -std=gnu11 -Wall -Werror |
Architecture
    .pbf file
              |
              v
    +--------------------+
    | main.c             |  argv -> validate -> OSM_read_Map -> query
    +--------------------+
              |
              v
    +--------------------+
    | osmpbf.c           |  blob loop: BlobHeader, Blob, inflate,
    |                    |  HeaderBlock vs PrimitiveBlock, build OSM_Map
    +--------------------+
              |
              v
    +--------------------+
    | protobuf.c         |  tag/varint/length-delimited/I32/I64/packed
    |                    |  decoding into PB_Field linked list
    +--------------------+
              |
              v
    +--------------------+
    | zlib_inflate.c     |  zlib inflate() over fmemopen / open_memstream
    +--------------------+

| Path | Lines | Role |
|---|---|---|
| src/osmpbf.c | 1,038 | OSM model, blob loop, string table, delta + zig-zag decode |
| src/protobuf.c | 912 | wire-format decoder, packed field expansion |
| src/process_args.c | 233 | CLI validation and query dispatch |
| include/protobuf.h | 95 | PB_Field, PB_Message, wire-type enum |
| src/zlib_inflate.c | 83 | zlib stream inflation over FILE streams |
| src/main.c | 68 | entrypoint, two-pass argv handling |
| include/osm.h | 63 | opaque OSM_Map / OSM_Node / OSM_Way |
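The decoder stores fields in a linked list; the project describes it as circular and doubly linked, headed by a sentinel node. The following is only a sketch of that general data structure. The `sketch_field` layout and function names are hypothetical, and the real `PB_Field` in include/protobuf.h may look quite different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical node layout, for illustration only. */
enum sketch_type { SKETCH_SENTINEL, SKETCH_VARINT };

struct sketch_field {
    enum sketch_type type;
    int number;                  /* field number from the tag */
    unsigned long long value;    /* decoded varint payload */
    struct sketch_field *next, *prev;
};

/* An empty list is a sentinel whose next and prev point at itself. */
static struct sketch_field *sketch_list_new(void)
{
    struct sketch_field *s = calloc(1, sizeof *s);
    if (s == NULL) return NULL;
    s->type = SKETCH_SENTINEL;
    s->next = s->prev = s;
    return s;
}

/* Append at the tail (just before the sentinel). Because the list is
 * circular, no NULL checks on the neighbours are needed. */
static struct sketch_field *sketch_list_append(struct sketch_field *head,
                                               int number,
                                               unsigned long long value)
{
    assert(head != NULL && head->type == SKETCH_SENTINEL);
    struct sketch_field *f = calloc(1, sizeof *f);
    if (f == NULL) return NULL;
    f->type = SKETCH_VARINT;
    f->number = number;
    f->value = value;
    f->prev = head->prev;
    f->next = head;
    head->prev->next = f;
    head->prev = f;
    return f;
}

/* Free every field, then the sentinel itself. */
static void sketch_list_free(struct sketch_field *head)
{
    struct sketch_field *f = head->next;
    while (f != head) {
        struct sketch_field *next = f->next;
        free(f);
        f = next;
    }
    free(head);
}
```

Traversal in either direction starts at `head->next` or `head->prev` and stops on reaching the sentinel again, which is what makes forward and backward walks symmetric.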
Key features
- Hand-written Protocol Buffers deserializer: `PB_read_tag` splits the wire type (low 3 bits) from the field number; `PB_read_value` dispatches on `VARINT_TYPE`, `I64_TYPE`, `LEN_TYPE`, and `I32_TYPE`; fields stream into a circular doubly-linked list headed by a `SENTINEL_TYPE` node, for forward and backward traversal.
- Embedded-message and packed-field handling: `PB_read_embedded_message` parses nested messages from in-memory buffers, inflating zlib blobs on the fly; `PB_expand_packed_fields` expands packed repeated scalars into individual `PB_Field` entries for uniform traversal.
- Zlib decompression over memory buffers: compressed `Blob` payloads pipe through `zlib_inflate` using `fmemopen`/`open_memstream`, so inflation works against in-memory buffers without temp files.
- Delta + zig-zag decoding: `DenseNodes` store IDs, lat, and lon as zig-zag-encoded deltas; the parser undoes the zig-zag encoding on each delta (encoded as `(n << 1) ^ (n >> 63)`, decoded as `(n >> 1) ^ -(n & 1)`), then accumulates the running sum, so negative coordinates round-trip. Nanodegrees print as decimal degrees at 6-digit precision.
- Opaque OSM object model: `include/osm.h` exposes `OSM_Map`, `OSM_BBox`, `OSM_Node`, and `OSM_Way` as opaque handles; nodes and ways carry parallel `keys`/`vals` arrays built from the PrimitiveBlock string table.
- Five CLI query modes: `-s` summary, `-b` bounding box, `-n <id>` node lookup, `-w <id> [key ...]` way lookup (node refs or tag values), `-f <file>` input path. Argument order is flexible; the first pass validates and the second pass runs the queries.
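The delta and zig-zag step from the list above can be sketched as follows. This is a minimal illustration under the assumption that the raw varints have already been extracted into an array; `sketch_zigzag_decode` and `sketch_delta_decode` are hypothetical names, not functions from the project.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Undo zig-zag: 0, 1, 2, 3, ... maps back to 0, -1, 1, -2, ... */
static int64_t sketch_zigzag_decode(uint64_t n)
{
    return (int64_t)(n >> 1) ^ -(int64_t)(n & 1);
}

/* Convert zig-zag-encoded deltas into absolute values: each output is
 * the running sum of all decoded deltas up to that index. */
static void sketch_delta_decode(const uint64_t *raw, size_t count, int64_t *out)
{
    assert(count == 0 || (raw != NULL && out != NULL));
    int64_t running = 0;
    for (size_t i = 0; i < count; i++) {
        running += sketch_zigzag_decode(raw[i]);
        out[i] = running;
    }
}
```

As a worked case, the raw values 200, 3, 4 decode to deltas 100, -2, 2 and therefore to absolute values 100, 98, 100. Storing small signed deltas this way keeps each varint short even when the absolute coordinates are large or negative.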
What makes it stand out
- No `libprotobuf`, no `protoc`. The entire wire format (varint, zig-zag, delta, packed repeated, embedded messages) is decoded by hand against the Protocol Buffers spec. The only runtime link is `libz`.
- Two-pass CLI against a single loaded map. Validation and query phases share one in-memory `OSM_Map`, so `-s -b -n <id>` in the same invocation parses the file once.
- Valgrind-clean. Opaque types, explicit ownership, no leaks on the sample extract.
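Packed repeated fields are one of the wire-format pieces listed above. A minimal sketch of expanding a packed varint payload into individual values might look like this; `sketch_expand_packed` is a hypothetical name, and writing into a caller-supplied array is an assumption made for brevity (the project expands into `PB_Field` list entries instead).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: a packed repeated field is one length-delimited
 * payload holding back-to-back varints with no tags in between.
 * Decodes up to cap values into out[]. Returns the number of values,
 * or -1 if the payload is malformed or holds more than cap values. */
static long sketch_expand_packed(const uint8_t *payload, size_t len,
                                 uint64_t *out, size_t cap)
{
    assert(payload != NULL || len == 0);
    size_t pos = 0, n = 0;
    while (pos < len) {
        uint64_t value = 0;
        unsigned shift = 0;
        for (;;) {
            /* Ran out of bytes mid-varint, or more than 10 bytes. */
            if (pos == len || shift > 63) return -1;
            uint8_t byte = payload[pos++];
            value |= (uint64_t)(byte & 0x7F) << shift;
            if ((byte & 0x80) == 0) break;
            shift += 7;
        }
        if (n == cap) return -1;
        out[n++] = value;
    }
    return (long)n;
}
```

For example, the six-byte payload `03 8E 02 9E A7 05` expands to the three values 3, 270, and 86942. After expansion, each value can be treated like an ordinary non-packed field, which is the point of the uniform traversal described earlier.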
Stack
| Layer | Technology |
|---|---|
| Language | C (-std=gnu11) |
| Build | GNU Make, gcc, -MMD auto-deps |
| Compression | zlib (-lz) |
| Tests | Criterion (-lcriterion) |
| Platform | Linux, macOS |
Built for CSE 320 (Systems Fundamentals) at Stony Brook, Jan–Feb 2025.
The course fixed the public API in include/protobuf.h and
include/osm.h; the src/ implementation is original.