Tutorial

GLM-OCR MLX

High-Fidelity OCR on Apple Silicon

A local, zero-config OCR inference tool powered by GLM-4V and MLX.
Designed for Apple Silicon Macs, with a beautiful web interface.

Zero-Config · MLX Accelerated · Web UI · PDF + Images · Layout Detection
Overview

What does this project do?

GLM-OCR MLX wraps the GLM-OCR model in a turnkey macOS application. You upload a PDF or image, and it returns structured Markdown with tables, formulas, and layout-aware text — all running locally on your Mac's GPU.

📄

PDF & Images

Upload multi-page PDFs, PNGs, or JPEGs for instant OCR.

🧠

GLM-4V Model

State-of-the-art 0.9B-parameter model scoring 94.6 on OmniDocBench.

MLX Native

Runs on Apple Metal GPU via mlx-vlm for fast local inference.

🔍

Layout Detection

PP-DocLayoutV3 detects tables, formulas, images, and text blocks.

Architecture

How it works

Two servers work together: an MLX inference server and a Flask web UI.

Browser ──:5003──▶ Flask App (app.py) ──▶ GLM-OCR SDK (glmocr) ──:8080──▶ MLX Server ──▶ Metal GPU (Unified Memory)

MLX Server (:8080)

Serves the GLM-OCR-bf16 model via an OpenAI-compatible chat/completions endpoint. Handles the actual neural network inference on the Metal GPU.

Flask Web UI (:5003)

Accepts uploads, splits PDFs to images, queues OCR jobs, and displays results page-by-page with a live progress bar and Markdown rendering.
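Because the MLX server speaks the OpenAI-style chat/completions protocol, you can call it directly. A minimal sketch, assuming the server is running on port 8080 — the model name, the prompt, and the embedded 1×1 placeholder PNG are all illustrative, and the actual curl call is shown commented out since it needs a live server:

```shell
# Build an OpenAI-style chat/completions request with a base64 image.
# The tiny base64 string below is a placeholder 1x1 PNG, not a real page.
IMG="iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8z8BQDwAEhQGAhKmMIQAAAABJRU5ErkJggg=="
cat > payload.json <<EOF
{
  "model": "glm-ocr-bf16",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,${IMG}"}},
      {"type": "text", "text": "Extract this page as Markdown."}
    ]
  }]
}
EOF
# With the MLX server running, send the request:
#   curl -s http://localhost:8080/v1/chat/completions \
#     -H "Content-Type: application/json" -d @payload.json
echo "wrote payload.json"
```

In normal use you never need to do this — the Flask app and SDK handle it — but it is handy for smoke-testing the inference server in isolation.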

Before You Start

Prerequisites

🍎

macOS + Apple Silicon

M1, M2, M3, or M4 Mac required. MLX uses the Metal GPU and unified memory architecture.

🐍

Python 3.12+

Download from python.org/downloads/macos if not installed. The launcher checks automatically.

📦

Git

Needed to clone the GLM-OCR SDK on first launch. Install via Xcode Command Line Tools or Homebrew.

Disk Space & Memory

The model weights are ~20 GB (GLM-OCR-bf16 + PP-DocLayoutV3). They download automatically from Hugging Face on first launch. 16 GB of unified memory is the minimum; 32 GB or more is recommended for multi-page PDFs.

Files

Project Structure

glm-ocr-mlx/
  launch.command   ← double-click to start
  app.py            ← Flask web server
  glm_config.yaml   ← all settings
  requirements.txt  ← Python deps
  templates/
    index.html      ← web UI
  static/
    css/ style.css
    js/  main.js
  utils/
    download_weights.py
    deep_clean.command

Auto-generated directories

These folders are created at runtime:

weights/ — Downloaded AI model files (~20 GB)

output/ — OCR results (Markdown + JSON + images)

sessions/ — Job state files (JSON per job)

glm-ocr/ — Cloned GLM-OCR SDK

Quick Start

Launch in one click

The entire setup is automated. Just double-click launch.command in Finder.

  1. Double-click launch.command in Finder.
    If macOS blocks it: right-click → Open → confirm.
  2. First run auto-setup — The script clones the GLM-OCR SDK, creates a virtual environment, installs all Python dependencies, and downloads model weights from Hugging Face.
  3. MLX Server starts on port 8080. It loads the GLM-OCR-bf16 model into unified memory. The first load is slow (~30–60 s).
  4. Flask Web UI starts on port 5003 and your browser opens automatically to http://localhost:5003.
  5. Keep the terminal open. Press Ctrl+C to stop both servers when done.
Under the Hood

What launch.command does

A breakdown of the automated startup sequence:

# Step 0 — Clone SDK if missing
if [ ! -d "glm-ocr" ]; then
    git clone https://github.com/zai-org/GLM-OCR glm-ocr
fi

# Step 1 — Verify Python 3.12+
python3 -c "import sys; sys.exit(0 if sys.version_info >= (3,12) else 1)"

# Step 2 — Create venv & install deps (first run only)
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Step 3 — Download / verify model weights
python utils/download_weights.py

# Step 4 — Start MLX server (background)
mlx_vlm.server --trust-remote-code &

# Step 5 — Start Flask app (background)
python app.py &

# Step 6 — Open browser
open http://localhost:5003
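To verify that both servers came up, you can probe the two ports. A sketch — it only checks that each root path responds with a 2xx status, so a server whose root route returns an error will also read as "down":

```shell
# Probe the MLX server (8080) and the Flask UI (5003).
# A non-2xx response on the root path also reports as "down".
check_ports() {
  for port in 8080 5003; do
    if curl -sf -o /dev/null "http://localhost:${port}/"; then
      echo "port ${port}: up"
    else
      echo "port ${port}: down"
    fi
  done
}
check_ports
```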
Usage

Using the Web UI

1. Upload

Drag and drop a PDF, PNG, or JPEG onto the upload card — or click to browse. Accepted: .pdf .png .jpg .jpeg

2. Processing

A progress bar shows real-time status. PDFs are split into page images, then each page is OCR'd sequentially. Results stream as they finish.

3. Review Results

A split-panel view: original document on the left, rendered Markdown on the right. Navigate pages with Prev/Next or jump to any page.

4. Export

Click Export to download results as Markdown or JSON — either the current page or the full document.

Additional features

Layout Toggle — Switch between the original image and the layout-visualization overlay to see detected regions.

History — Click the History button to browse and reload previous scan results.

API

REST API Endpoints

The Flask app exposes these endpoints — useful for scripting or integration.

Method  Endpoint                      Description
POST    /api/upload                   Upload a file and start OCR. Returns job_id.
GET     /api/status/<job_id>          Poll job status, progress %, and page counts.
GET     /api/page/<job_id>/<idx>      Get the OCR result for a single page (JSON).
GET     /api/jobs                     List all jobs, newest first.
GET     /api/export/<job_id>          Export results. Params: format, scope, page_idx.
POST    /api/last_page/<job_id>       Save the user's last-viewed page index.
GET     /api/image/<job_id>/<path>    Serve an image file belonging to a job.
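The upload-then-poll flow can be scripted. A sketch, assuming response fields named job_id and status and a terminal status value of "done" — none of which are confirmed by the endpoint list above, so adjust to the actual JSON the app returns:

```shell
# Upload a document, poll until the job finishes, then export Markdown.
# Assumed fields: job_id (from /api/upload), status (from /api/status).
upload_and_export() {
  local file="$1" base="http://localhost:5003"
  local job_id
  job_id=$(curl -sf -F "file=@${file}" "${base}/api/upload" \
    | python3 -c 'import json,sys; print(json.load(sys.stdin)["job_id"])') || return 1
  while :; do
    status=$(curl -sf "${base}/api/status/${job_id}")
    echo "$status"
    # "done" is an assumed terminal value; check the real status strings
    printf '%s' "$status" | grep -q '"status": *"done"' && break
    sleep 2
  done
  curl -sf "${base}/api/export/${job_id}?format=markdown&scope=all" -o result.md
}
# usage (with both servers running): upload_and_export scan.pdf
```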
Config

Configuration

All settings live in glm_config.yaml. Key sections:

server

MLX server host, port (8080), and debug flag.

pipeline.ocr_api

Connection to the MLX server: host, port, model path, timeouts, retries.

pipeline.layout

PP-DocLayoutV3 settings: detection threshold, batch size, label-to-task mappings.

Common tweaks

Setting                            Default
pipeline.enable_layout             true
pipeline.max_workers               4
pipeline.ocr_api.api_port          8080
pipeline.page_loader.max_tokens    4096
pipeline.layout.threshold          0.3
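For instance, a glm_config.yaml tweak for a memory-constrained Mac might combine several of these keys. The values here are illustrative; the nesting mirrors the dotted paths in the table above:

```yaml
pipeline:
  enable_layout: false   # skip region detection; OCR the whole page
  max_workers: 2         # fewer parallel region-OCR workers
  page_loader:
    max_tokens: 4096     # per-page generation budget
```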

MaaS Mode (Cloud API)

Set pipeline.maas.enabled: true and provide a Zhipu API key to use the cloud API instead of local inference. This bypasses the MLX server entirely — no GPU needed.

pipeline:
  maas:
    enabled: true
    api_key: your-zhipu-key
Feature Deep Dive

Layout Detection Pipeline

When enable_layout: true, documents go through a two-stage pipeline:

Input Image → PP-DocLayoutV3 (Region Detection) → Task Routing (text / table / formula) → GLM-OCR (Per-Region OCR) → Markdown Output

text — Paragraphs, titles, references, seals, and vertical text

table — Tables, recognized with the table prompt

formula — Display & inline formulas, emitted as LaTeX

skip / abandon — Charts & images are kept; headers, footers, and page numbers are discarded

Export

Exporting Results

Two formats, two scopes — choose what you need from the Export dropdown.

Markdown (.md)

Clean Markdown with headings, tables, and LaTeX formulas. Pages separated by horizontal rules. Ideal for docs, Notion, Obsidian.

JSON (.json)

Structured JSON with page data, content strings, and image paths. Ideal for programmatic processing and data pipelines.

Scope: All Pages

Downloads the entire document's OCR output in one file.

Scope: Current Page

Downloads only the page you're currently viewing.

# Programmatic export via API
$ curl "http://localhost:5003/api/export/JOB_ID?format=markdown&scope=all" -o result.md
$ curl "http://localhost:5003/api/export/JOB_ID?format=json&scope=current&page_idx=0" -o page1.json
Troubleshooting

Common Issues & Fixes

"Python 3.12 or higher is required"

Install the latest Python from python.org/downloads/macos. The system Python on macOS is too old.

macOS blocks launch.command

Right-click the file → Open → confirm in the dialog. Or: System Settings → Privacy & Security → Open Anyway.

MLX Server won't start (port 8080)

Another process may be holding the port. Run lsof -i :8080 to check, then kill the offending PID, or use utils/deep_clean.command to stop stale processes.

First scan is very slow

Normal — the model weights load into unified memory on the first request. Subsequent scans are much faster.

Out of memory

The model needs ~8 GB of unified memory. Close other heavy apps. 16 GB Macs should work; 8 GB Macs may struggle.

Weight download fails

Check your internet connection. Run python utils/download_weights.py manually to retry. Weights come from Hugging Face.

Maintenance

Deep Clean / Reset

If something goes wrong, use the interactive reset utility:

$ ./utils/deep_clean.command

It prompts you to selectively reset each component:

Processes Kill stale servers

Stops any lingering processes on ports 8080 and 5003.

Venv Remove .venv

Deletes the virtual environment. Recreated on next launch.

SDK Delete glm-ocr/

Removes the cloned SDK. Re-cloned on next launch.

Data Clear uploads/sessions/output

Wipes all OCR results and job history.

Weights Delete AI models (20 GB+)

Nuclear option — deletes all downloaded weights. They'll re-download on next launch (requires internet).

Tips & Tricks

Getting the Best Results

📏

Use high-resolution scans

The default PDF rendering is 200 DPI. Higher-quality input images produce better OCR output, especially for small text.

🔄

Toggle layout mode

For simple single-column docs, try enable_layout: false in the config for faster, direct whole-image OCR.

⚙️

Tune max_workers

Default is 4 parallel workers for region OCR. Lower it to 1–2 on 8 GB Macs; raise it on 64 GB+ machines.

📂

Find raw output

Full SDK output lives in the output/ directory — including per-page Markdown, JSON, layout visualizations, and original images.

Results persist across app restarts — the History button lets you reload any previous scan.

Summary

You're all set!

Double-click launch.command, upload a document, and get structured Markdown in minutes — all running locally on your Mac.

1️⃣

Launch

launch.command

2️⃣

Upload

PDF or image

3️⃣

Export

Markdown or JSON

GLM-OCR MLX  ·  Powered by GLM-4V + Apple MLX  ·  MIT License
