The Problem
I inherited a WordPress site. You know the type:
- 1 million+ words across 500+ blog posts
- 3-4 second load times (on a good day)
- A plugin ecosystem that required monthly "security updates"
- $40/month hosting just to keep it limping along
The content was good. The platform was holding it hostage.
I wanted to migrate to a static Next.js site — mostly Markdown files, rendered at build time, served from a CDN. The dream: sub-200ms loads, $0 hosting on Vercel, and no more WordPress maintenance anxiety.
The reality: migrating 1M words of HTML content to clean Markdown is a nightmare.
Why Existing Tools Failed
I tried everything:
Turndown (JS library)
The go-to HTML-to-Markdown library. Works fine for simple content. But WordPress posts aren't simple.
Tables were mangled. Complex HTML tables with merged cells, nested content, or inline styles? Gone. Or worse, rendered as malformed Markdown that broke the build.
Media links were lost. Images hosted on wp-content/uploads/2019/03/image.jpg weren't preserved correctly. Relative paths, CDN URLs, various image plugins — all edge cases that required manual cleanup.
WordPress Export + Converters
The XML export → Markdown pipeline. Some tools exist. All of them:
- Lost formatting on edge cases
- Couldn't handle custom shortcodes
- Required post-processing anyway
I was spending more time fixing the output than I would have spent copy-pasting manually. That's when I knew I needed a different approach.
Building mdurl
Instead of converting exported XML, I went direct: hit the live URL, scrape the rendered HTML, convert to Markdown.
Why? Because the rendered HTML is the source of truth. It's what readers actually see. All the shortcodes are expanded, all the plugins have done their thing, all the content is there.
The Core Approach
mdurl -s ".post-content" https://oldsite.com/blog/some-post
- Fetch the URL
- Extract content using a CSS selector (skip the nav, sidebar, footer)
- Convert HTML → Markdown with proper handling for tables, media, code blocks
- Output a clean .md file
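To make the pipeline concrete, here is a toy, self-contained sketch of those four steps in pure shell. It runs on a canned HTML string instead of a live URL, and the sed/parameter-expansion "conversion" is illustrative only; real conversion needs an actual HTML parser.

```shell
#!/bin/bash
# Toy sketch of the four steps above on a canned page — no network, no mdurl.
nl=$'\n'

# 1) "fetch": stands in for an HTTP GET of the live post
html='<nav>menu</nav><div class="post-content"><h1>Title</h1><p>Body text.</p></div><footer>legal</footer>'

# 2) selector-style extraction of .post-content (assumes the div sits on one line)
content=$(printf '%s' "$html" | sed -n 's/.*<div class="post-content">\(.*\)<\/div>.*/\1/p')

# 3) naive tag-to-Markdown rewrites
content=${content//<h1>/"# "}
content=${content//<\/h1>/$nl$nl}
content=${content//<p>/}
content=${content//<\/p>/}

# 4) emit the Markdown (in the real tool this is written to a .md file)
printf '%s\n' "$content"
```

Running it prints `# Title`, a blank line, then `Body text.` — the nav and footer are gone because the selector step only kept the content div.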
What Made It Work
CSS selector targeting. WordPress themes vary wildly. Some use <article>, some use .entry-content, some use custom classes. Being able to specify -s ".post-content" and grab just the content was essential.
Advanced table support. This was 40% of the work. HTML tables can have:
- colspan and rowspan
- Nested tables (yes, really)
- Inline styles that affect rendering
- <thead>/<tbody>, or not
I wrote a custom table converter that handles most of these cases and falls back to HTML passthrough when Markdown can't represent the structure.
Media preservation. Images, embeds, iframes — all preserved with their full URLs. No broken links. Videos embedded from YouTube/Vimeo stay as iframes (Markdown doesn't have a native video syntax anyway).
The Migration Script
Once mdurl worked reliably, the migration was just a loop:
#!/bin/bash
# List of all post URLs (one per line)
urls="urls.txt"
# Output directory
out="./content/blog"
mkdir -p "$out"
while IFS= read -r url; do
# Extract slug from URL for filename
slug=$(basename "${url%/}") # strip trailing slash first, then take last segment
# Convert and save
mdurl -s ".entry-content" -o "$out/$slug.md" "$url"
echo "✓ $slug"
sleep 1 # Be nice to the server
done < "$urls"
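In practice a loop like this benefits from being resumable. Here is a hedged variant (same urls.txt and output layout assumed) that skips posts already converted and logs failures for a second pass, so a network hiccup halfway through 500 posts doesn't mean starting over:

```shell
#!/bin/bash
# Resumable variant: skip posts that already have a .md file, collect failures.
out="./content/blog"

slug_from_url() {            # trailing-slash-safe: .../some-post/ -> some-post
  basename "${1%/}"
}

migrate_all() {
  mkdir -p "$out"
  while IFS= read -r url; do
    slug=$(slug_from_url "$url")
    [ -f "$out/$slug.md" ] && continue                  # already converted — skip
    if mdurl -s ".entry-content" -o "$out/$slug.md" "$url"; then
      echo "✓ $slug"
    else
      echo "$url" >> failed-urls.txt                    # retry these in a second pass
    fi
    sleep 1                                             # be nice to the server
  done < urls.txt
}
```

After a full run, `wc -l failed-urls.txt` tells you exactly how much (if anything) needs a retry.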
Time to migrate 500+ posts: About 90 minutes (with the sleep delay).
Time I would have spent manually: Weeks. Maybe never finished.
Next.js Setup
With Markdown files in hand, the Next.js setup was standard:
/content
  /blog
    post-1.md
    post-2.md
    ...
/app
  /blog
    /[slug]
      page.tsx
Using @next/mdx or contentlayer for processing. Static generation at build time. Every post is a static HTML file on the CDN.
Frontmatter
mdurl outputs raw Markdown. I wrote a quick script to add frontmatter based on a CSV export of post metadata:
---
title: "Original Post Title"
date: "2019-03-15"
author: "Author Name"
tags: ["category1", "category2"]
---
Could've built this into mdurl, but keeping concerns separate made iteration faster.
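For illustration, a minimal sketch of that frontmatter step. The CSV layout here (columns slug,title,date,author, file named meta.csv) is hypothetical — adjust it to whatever your metadata export actually produces — and the parsing is deliberately naive (no quoted commas; tags would follow the same pattern):

```shell
#!/bin/bash
# Prepend YAML frontmatter to each converted .md file, driven by a metadata CSV.
# Assumed (hypothetical) CSV columns: slug,title,date,author

add_frontmatter() {
  local file="$1" title="$2" date="$3" author="$4"
  local tmp; tmp=$(mktemp)
  {
    printf '%s\n' '---'
    printf 'title: "%s"\n'  "$title"
    printf 'date: "%s"\n'   "$date"
    printf 'author: "%s"\n' "$author"
    printf '%s\n\n' '---'
  } > "$tmp"
  cat "$file" >> "$tmp" && mv "$tmp" "$file"   # frontmatter first, then the post body
}

# Walk the CSV and patch each matching post
if [ -f meta.csv ]; then
  while IFS=, read -r slug title date author; do
    [ -f "content/blog/$slug.md" ] && \
      add_frontmatter "content/blog/$slug.md" "$title" "$date" "$author"
  done < meta.csv
fi
```

Keeping this as a separate pass (rather than baking it into mdurl) also means you can re-run it alone when the metadata changes.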
Results
| Metric | WordPress | Next.js |
| ------ | --------- | ------- |
| TTFB | 800ms-1.2s | 50-80ms |
| Full load | 3-4s | 150-200ms |
| Lighthouse perf | 45-60 | 98-100 |
| Hosting cost | $40/mo | $0 (Vercel free) |
| Maintenance | Weekly updates | None |
The site is now:
- 15-20x faster on initial load
- Essentially free to host
- Zero maintenance (no PHP, no MySQL, no plugin updates)
- More secure (static files, nothing to hack)
Lessons Learned
1. Scrape the rendered output, not the source
The WordPress database/XML export is messy. The rendered HTML is clean (or at least, consistent). Work with what users actually see.
2. CSS selectors are essential
Every site is different. A tool that can't target specific content areas is useless at scale.
3. Tables are hard
Markdown tables are limited. Complex HTML tables need to either be simplified or passed through as raw HTML. There's no perfect solution, but "works in 90% of cases" beats "fails completely."
4. Build the tool you need
I spent 2 days building mdurl. I would have spent 2 weeks doing manual cleanup with existing tools. Sometimes the meta-work is the real work.
Get mdurl
If you're facing a similar migration, grab it:
# Coming to npm/Homebrew soon
# For now:
gh repo clone michaelmonetized/url-to-md
cd url-to-md
bun run build
which mdurl
# Usage
mdurl --help
mdurl https://michaelchurley.com # Basic
mdurl -s ".content" -o post.md https://... # With selector and output
It's designed for exactly this use case: bulk HTML → Markdown conversion where quality matters.
Conclusion
1 million words. 500+ posts. 3+ years of content.
Migrated in a weekend.
If you're stuck on WordPress (or any CMS) and want the speed and simplicity of static, it's more achievable than you think. The right tooling makes it trivial.
The future is static. The present doesn't have to be slow.
Have questions about the migration? Hit me up on Twitter @MichaelH_rley42.