Migrating content

Migrating existing content into WordPress is a very common task.

ImpEx provides tooling support for migrating data to WordPress.

Preparation

ImpEx imports data from a directory containing JSON files organized in chunk-* subdirectories.

my-exported-website
├── chunk-0001
│   ├── slice-0001.json
│   ├── slice-0002.json
│   ├── slice-0003.json
│   ├── slice-0004.json
│   └── slice-0005.json
├── chunk-0002
│   ├── slice-0001.json
│   ├── slice-0001-wes-walker-unsplash.jpg
│   ├── slice-0002-greysen-johnson-unsplash.jpg
│   ├── slice-0002.json
│   ├── slice-0003-james-wheeler-unsplash.jpg
│   └── slice-0003.json
...

Why the chunk-* subdirectory structure?

Organizing thousands of content documents and hundreds of images/videos in a single directory slows down file managers like Windows Explorer. That is the one and only reason for the chunk-* subdirectories.

Both the chunk-* subdirectories and the slice JSON files are suffixed with a 4-digit number.

ImpEx imports slice files ordered by name. So the slices in subdirectory chunk-0001 will be imported first, then the slices in chunk-0002, and so on.

The same rule applies to slice-*.json files within the same chunk-* subdirectory: slice-0001.json will be imported before slice-0002.json, and so on.
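The resulting order can be previewed with a short script. The following Python sketch is not part of ImpEx; `ordered_slices` is a hypothetical helper that only assumes the directory layout shown above and sorts chunk directories and slice files by name.

```python
from pathlib import Path


def ordered_slices(export_dir):
    """Return slice JSON files in name order: chunk by chunk, then slice by slice."""
    root = Path(export_dir)
    result = []
    # Chunk directories are visited in name order (chunk-0001, chunk-0002, ...).
    for chunk in sorted(p for p in root.glob("chunk-*") if p.is_dir()):
        # Within a chunk, slice files are visited in name order as well.
        # Media files such as slice-0001-*.jpg do not match the pattern.
        result.extend(sorted(chunk.glob("slice-*.json")))
    return result
```

Calling `ordered_slices("my-exported-website")` on the example directory would list the slices of chunk-0001 before those of chunk-0002.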

Knowing the import order is important. If content and the images/videos it references are imported in the wrong order, you will get broken links in your posts. ImpEx will rewrite the media links in the content if you import the content first and the media afterwards.

Have a look at the sample ImpEx export provided by the ImpEx plugin to see a minimal working ImpEx export containing content and referenced images.

Data files

slice-*.json files are JSON files containing data.

The real data is stored in the data property.

The data might be anything expressed in textual form. Besides the data itself, each slice-*.json file contains some metadata describing the contained data so that ImpEx knows how to import it.

A minimal slice file transporting a single WordPress post looks like this:

{
  "version": "1.0.0",
  "type": "php",
  "tag": "content-exporter",
  "meta": {
    "entity": "content-exporter"
  },
  "data": {
    "posts": [
      {
        "wp:post_id": 1,
        "wp:post_content": "<!-- wp:paragraph -->\n<p>Hello from first imported post !</p>\n<!-- /wp:paragraph -->",
        "title": "Hello first post!"
      }
    ]
  }
}

As you can see, the actual content is located in the data property.

Everything except the data property is used for versioning and content identification.

Content (aka WordPress posts/pages)

Content slice files wrap regular WordPress posts and pages.

Content slices may also transport further content like comments, custom fields, terms, taxonomies, categories, FSE templates/template-parts, global styles and so on. But that's another story.

To get an idea of the power of content slices, export an FSE-enabled WordPress instance and inspect the resulting slice-*.json files.

Below is the JSON Schema describing the content slice file format.

Download JSON Schema definition for content slices: slice-content.json

{
  "$schema": "http://json-schema.org/draft-07/schema",
  "title": "ImpEx Content Slice",
  "description": "a ImpEx slice containing content",
  "type": "object",
  "properties": {
    "$schema": {
      "type": "string",
      "description": "The JSON schema for this slice"
    },
    "version": {
      "title": "ImpEx Provider content version",
      "description": "Version will be used by ImpEx to know what data format to expect",
      "const": "1.0.0"
    },
    "type": {
      "title": "ImpEx slice type",
      "description": "Value will be used by ImpEx to know the content type of the slice",
      "const": "php"
    },
    "tag": {
      "title": "ImpEx slice tag",
      "description": "The ImpEx slice tag contains information about the responsible ImpEx provider for this slice",
      "const": "content-exporter"
    },
    "meta": {
      "title": "Metadata for the slice",
      "description": "Metadata is a JSON object used to store additional information about the slice",
      "const": {
        "entity": "content-exporter"
      }
    },
    "data": {
      "type": "object",
      "properties": {
        "posts": {
          "type": "array",
          "items": {
            "$ref": "#/definitions/posts-item"
          },
          "minItems": 1,
          "$comment": "@TODO: unique ids are not yet supported by jsonschema",
          "uniqueItems": true
        }
      },
      "title": "Data portion of this ImpEx slice",
      "description": "data contains the real ImpEx data.",
      "required": ["posts"]
    }
  },
  "additionalProperties": false,
  "required": ["version", "type", "tag", "meta", "data"],
  "definitions": {
    "posts-item": {
      "title": "WordPress posts stored in this ImpEx slice",
      "type": "object",
      "properties": {
        "wp:post_id": {
          "type": "integer",
          "minimum": 1,
          "title": "WordPress post_id",
          "description": "The unique WordPress post id of the post"
        },
        "title": {
          "type": "string",
          "minLength": 1,
          "title": "WordPress post title",
          "description": "The title of the WordPress post as it is stored in the database"
        },
        "wp:post_content": {
          "type": "string",
          "minLength": 1,
          "title": "WordPress post content",
          "description": "The content of the WordPress post",
          "examples": [
            "<!-- wp:paragraph -->\n<p>Hello from first imported post !</p>\n<!-- /wp:paragraph -->",
            "<!-- wp:paragraph -->\n<p>Hello world</p>\n<!-- /wp:paragraph -->\n\n<!-- wp:html -->\n<p>A bit of custom html utilizing the Gutenberg html block</p>\n<ul>\n  <li>hi</li>\n  <li>ho</li>\n  <li>howdy</li>\n</ul><!-- /wp:html -->"
          ]
        },
        "wp:post_type": {
          "title": "WordPress post type",
          "description": "The type of the WordPress post.\nContent related post types are 'post' and 'page'.\nIf not declared, type 'post' will be assumed.",
          "type": "string",
          "enum": [
            "post",
            "page",
            "nav_menu_item",
            "wp_template",
            "wp_template_part",
            "wp_block",
            "wp_global_styles"
          ],
          "default": "page"
        },
        "wp:status": {
          "type": "string",
          "title": "WordPress post status",
          "description": "The WordPress post status (https://wordpress.org/support/article/post-status/)",
          "enum": ["publish", "future", "draft", "pending", "private"],
          "default": "draft"
        },
        "wp:post_excerpt": {
          "type": "string",
          "title": "WordPress excerpt",
          "description": "The excerpt of the post.",
          "minLength": 1
        },
        "wp:post_name": {
          "type": "string",
          "title": "WordPress post slug",
          "description": "Used to generate the permalink. If not given, the sanitized post title will be used instead.",
          "minLength": 1
        },
        "wp:post_parent": {
          "type": "integer",
          "minimum": 1,
          "title": "WordPress post parent id",
          "description": "The WordPress post id of the parent post.\n If not given, the post will be created as a top level post."
        }
      },
      "required": ["wp:post_id", "title", "wp:post_content"],
      "additionalProperties": false
    }
  }
}
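Before importing, it can help to check a generated slice against the required properties of this schema. The following Python sketch is a hand-written approximation, not a JSON Schema validator and not part of ImpEx; `check_content_slice` is a hypothetical helper that only looks at the `required` lists and the `minItems` rule shown above.

```python
REQUIRED_TOP_LEVEL = ("version", "type", "tag", "meta", "data")
REQUIRED_POST_KEYS = ("wp:post_id", "title", "wp:post_content")


def check_content_slice(slice_doc):
    """Return a list of problems found; an empty list means the basic checks passed."""
    problems = [
        f"missing property: {key}" for key in REQUIRED_TOP_LEVEL if key not in slice_doc
    ]
    data = slice_doc.get("data")
    posts = data.get("posts") if isinstance(data, dict) else None
    # The schema requires data.posts to be an array with at least one item.
    if not isinstance(posts, list) or not posts:
        problems.append("data.posts must be a non-empty array")
        return problems
    for index, post in enumerate(posts):
        if not isinstance(post, dict):
            problems.append(f"posts[{index}] must be an object")
            continue
        for key in REQUIRED_POST_KEYS:
            if key not in post:
                problems.append(f"posts[{index}] is missing {key}")
    return problems
```

For complete validation (types, enums, additionalProperties), a real JSON Schema validator fed with the downloaded schema file is the better choice.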

A content slice may contain any number of WordPress posts/pages/etc.

When generating a content slice file, it's best to embed only a single page/post per slice-*.json file.

Each content document is identified by a unique wp:post_id property.

The title property is used as the title.

wp:post_content transports the content.

See the Content slice JSONSchema definition for all supported properties.
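Generating one slice file per post can be scripted. The following Python sketch writes a single-post content slice using only the three required post properties; `write_content_slice` and its parameters are hypothetical names chosen for this illustration, not an ImpEx API.

```python
import json
from pathlib import Path


def write_content_slice(chunk_dir, slice_number, post_id, title, content):
    """Write a single-post content slice and return the path of the created file."""
    slice_doc = {
        "version": "1.0.0",
        "type": "php",
        "tag": "content-exporter",
        "meta": {"entity": "content-exporter"},
        "data": {
            "posts": [
                {"wp:post_id": post_id, "title": title, "wp:post_content": content}
            ]
        },
    }
    chunk = Path(chunk_dir)
    chunk.mkdir(parents=True, exist_ok=True)
    # Slice file names carry a 4-digit, zero-padded number, e.g. slice-0001.json.
    target = chunk / f"slice-{slice_number:04d}.json"
    target.write_text(json.dumps(slice_doc, indent=2), encoding="utf-8")
    return target
```

Optional properties such as wp:post_type or wp:status could be added to the post object in the same way.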

Since WordPress expects block-annotated HTML, you need to transform your HTML content into block-annotated HTML.

There are 2 options to do that:

  • The gold solution: annotate almost every HTML tag with the matching Gutenberg block.

    <!-- wp:paragraph -->
    <p>A bit of custom html utilizing the Gutenberg html block</p>
    <!-- /wp:paragraph -->
    
    <!-- wp:list -->
    <ul>
      <li>hi</li>
      <li>ho</li>
      <li>howdy</li>
    </ul>
    <!-- /wp:list -->
    
    <!-- wp:image -->
    <figure class="wp-block-image">
      <img src="./greysen-johnson-unsplash.jpg" />
      <figcaption>Fly fishing</figcaption>
    </figure>
    <!-- /wp:image -->
    
  • The quick and dirty solution: wrap the whole HTML content into a WordPress Custom HTML block:

    <!-- wp:html -->
    <p>A bit of custom html utilizing the Gutenberg html block</p>
    <ul>
      <li>hi</li>
      <li>ho</li>
      <li>howdy</li>
    </ul>
    <figure>
      <img src="./greysen-johnson-unsplash.jpg" />
      <figcaption>Fly fishing</figcaption>
    </figure>
    <!-- /wp:html -->
    

    Why is this solution dirty?

    If you open a page/post containing a Custom HTML block in the Gutenberg editor, you will see only the raw HTML source; it is not rendered. So the quick and dirty solution is actually a no-go from a designer's perspective.

The HTML content must be encoded as a JSON string in the slice file. See this example content slice.
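Most JSON libraries take care of the string encoding. As a minimal sketch in Python, the snippet below wraps a piece of HTML in a Custom HTML block (the quick and dirty option) and encodes it with the standard `json` module; `wrap_in_html_block` and the sample markup are made up for this illustration.

```python
import json


def wrap_in_html_block(html):
    """Wrap raw HTML in a Custom HTML block (the quick and dirty option)."""
    return "<!-- wp:html -->\n" + html.strip() + "\n<!-- /wp:html -->"


html = '<p class="intro">Hello</p>\n<ul>\n  <li>hi</li>\n</ul>'
post_content = wrap_in_html_block(html)

# json.dumps escapes double quotes and line breaks,
# so the markup becomes a valid single-line JSON string.
encoded = json.dumps(post_content)
print(encoded)
```

When a whole slice is serialized with a JSON library, this escaping happens automatically for the wp:post_content value.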

See Attachments (like Pictures and Videos) for importing referenced media files.

Attachments (like Pictures and Videos)

Attachments are binary files like images/videos or anything else stored in the WordPress uploads directory.

Such binary data is handled a bit differently than textual data because it cannot be easily embedded into a JSON file.

Below is a JSON Schema describing the attachment slice file format.

Download JSON Schema definition for media files: slice-attachment.json

{
  "$schema": "http://json-schema.org/draft-07/schema",
  "title": "ImpEx Content Slice",
  "description": "a ImpEx slice containing content",
  "type": "object",
  "properties": {
    "$schema": {
      "type": "string",
      "description": "The JSON schema for this slice"
    },
    "version": {
      "title": "ImpEx Provider content version",
      "description": "Version will be used by ImpEx to know what data format to expect",
      "const": "1.0.0"
    },
    "type": {
      "title": "ImpEx slice type",
      "description": "Value will be used by ImpEx to know the content type of the slice",
      "const": "resource"
    },
    "tag": {
      "title": "ImpEx slice tag",
      "description": "The ImpEx slice tag contains information about the responsible ImpEx provider for this slice",
      "const": "attachment"
    },
    "meta": {
      "title": "Metadata for the slice",
      "description": "Metadata is a JSON object used to store additional information about the slice",
      "type": "object",
      "properties": {
        "entity": {
          "title": "Entity",
          "description": "The entity type of this slice",
          "type": "string",
          "default": "attachment",
          "enum": ["attachment"]
        },
        "impex:post-references": {
          "title": "Array of urls referencing this attachment in posts",
          "description": "When the attachment was imported, all references in this array will be replaced by the url of the imported attachment",
          "type": "array",
          "items": {
            "type": "string"
          },
          "minItems": 1,
          "$comment": "@TODO: unique ids are not yet supported by jsonschema",
          "uniqueItems": true
        }
      }
    },
    "data": {
      "format": "uri-template",
      "type": "string",
      "title": "Data portion of this ImpEx slice",
      "description": "Attachment slice data is expected to be a URI to the attachment file.\nFor media it has to be the URI to the image.\nURI can be absolute or relative.",
      "examples": [
        "./media/image.jpg",
        "https://www.example.com/attachment.jpg"
      ]
    }
  },
  "additionalProperties": false,
  "required": ["version", "type", "tag", "meta", "data"]
}

Let's say you have a reference to an image in your content:

<img src="./greysen-johnson-unsplash.jpg" />

So you need to import the image into your WordPress instance. To do so, you need to

  • create a slice-*.json file (let's name it slice-0002.json) declaring the attachment:

    {
      "version": "1.0.0",
      "type": "resource",
      "tag": "attachment",
      "meta": {
        "entity": "attachment"
      },
      "data": "./greysen-johnson-unsplash.jpg"
    }
    

    As you can see, only the data property actually references the image. The rest of the slice file is just metadata.

  • provide the image in the same chunk directory as its slice JSON file, prefixed with the slice file name without the .json extension (slice-0002):

    slice-0002-greysen-johnson-unsplash.jpg
    

If you import the slice file using ImpEx, the image will appear in the WordPress uploads directory and in the WordPress media page. If you referenced the image in your content, it will also appear in your imported pages/posts.
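Both steps can be automated when many media files have to be migrated. The following Python sketch writes the attachment slice and copies the media file under the prefixed name; `write_attachment_slice` and its parameters are hypothetical names for this illustration, and the slice properties simply mirror the example above.

```python
import json
import shutil
from pathlib import Path


def write_attachment_slice(chunk_dir, slice_number, source_file):
    """Create the attachment slice JSON and place the prefixed media file next to it."""
    chunk = Path(chunk_dir)
    chunk.mkdir(parents=True, exist_ok=True)
    source = Path(source_file)
    slice_name = f"slice-{slice_number:04d}"
    slice_doc = {
        "version": "1.0.0",
        "type": "resource",
        "tag": "attachment",
        "meta": {"entity": "attachment"},
        # The data property references the file by its original name.
        "data": f"./{source.name}",
    }
    slice_path = chunk / f"{slice_name}.json"
    slice_path.write_text(json.dumps(slice_doc, indent=2), encoding="utf-8")
    # The media file is stored with the slice name as prefix,
    # e.g. slice-0002-greysen-johnson-unsplash.jpg.
    media_path = chunk / f"{slice_name}-{source.name}"
    shutil.copyfile(source, media_path)
    return slice_path, media_path
```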

Remember: Content slices referencing media files should ALWAYS be imported before the attachment slices.

This can be achieved by giving the content slices lower numbers than the media slices or, much simpler, by keeping the content slices in a lower-numbered chunk-* directory than the attachments.

See the simple-import example in the ImpEx WordPress plugin GitHub repository for a full-featured, manually written import.

Adjusting attachment urls

If you import posts referencing an image using relative paths, you will need to adjust the image url in your imported posts to the newly imported attachment.

Suppose you have various posts referencing an image in different ways:

<!-- sub/page-one.html -->
...
<img src="../images/greysen-johnson-unsplash.jpg" />

<!-- page-two.html -->
...
<img src="/images/greysen-johnson-unsplash.jpg" />

<!-- page-three.html -->
...
<img src="./images/greysen-johnson-unsplash.jpg" />

After the import, the generated pages will still contain exactly the same IMG src attributes, but the url of the imported image attachment will be different.

In this case you can have ImpEx replace the original references with the url of the imported image by using the slice meta property impex:post-references. This property tells ImpEx that the given references should be replaced with the url of the imported attachment file.

{
  "version": "1.0.0",
  "type": "resource",
  "tag": "attachment",
  "meta": {
    "entity": "attachment",
    "impex:post-references": [
      "../images/greysen-johnson-unsplash.jpg",
      "./images/greysen-johnson-unsplash.jpg",
      "/images/greysen-johnson-unsplash.jpg"
    ]
  },
  "data": "./greysen-johnson-unsplash.jpg"
}
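ImpEx performs this replacement itself during the import. To make the effect tangible, the following Python sketch mimics it on a single post; it is an illustration under assumptions, not ImpEx's actual implementation, and both `rewrite_references` and the example URL are invented.

```python
import re


def rewrite_references(post_content, references, attachment_url):
    """Replace every listed reference in the post content with the attachment URL."""
    if not references:
        return post_content
    # Longest references first, so "../images/x.jpg" is matched as a whole
    # instead of the shorter "/images/x.jpg" it contains.
    ordered = sorted(references, key=len, reverse=True)
    pattern = "|".join(re.escape(reference) for reference in ordered)
    # A single pass avoids touching text that has already been replaced.
    return re.sub(pattern, lambda _match: attachment_url, post_content)


references = [
    "../images/greysen-johnson-unsplash.jpg",
    "./images/greysen-johnson-unsplash.jpg",
    "/images/greysen-johnson-unsplash.jpg",
]
# Hypothetical URL of the imported attachment.
new_url = "https://example.com/wp-content/uploads/greysen-johnson-unsplash.jpg"
print(rewrite_references('<img src="../images/greysen-johnson-unsplash.jpg" />', references, new_url))
```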

Other data

Although ImpEx provides a simple way to import content and media, you may also want to import more advanced data like database tables or settings into WordPress.

ImpEx provides built-in support for further data:

  • relational data like database tables

  • key/value based settings (aka wp_options)

@TODO: Add JSONSchema / examples for other data.