Lernapparat

Inlining Images in Jupyter Notebooks

June 29, 2021

I like Jupyter Notebooks a lot. But while linking images in Markdown is quite handy, I sometimes want a Jupyter Notebook to work on its own, without zipping up a lot of files. Because I could not find a tool that does this, I wrote a quick and dirty script to (approximately) inline images.

Packing up the baggage

So Jupyter notebooks are saved as JSON. (Not that a text-only primary format wouldn't be much nicer, but that is for another day.)

import json
import re
import mimetypes
import os
import base64
import copy

fn = './autograd_course_intro.ipynb'
# read the notebook and close the file again right away
with open(fn, encoding='utf-8') as f:
    origjson = json.load(f)

This gives us a top-level dict, in which the cells entry is a list of cells. Each cell is represented by a dict, with cell_type telling us whether it is "markdown". Then source is a list of strings, roughly one per line of markdown.
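To make the structure concrete, here is a minimal sketch that walks over a tiny hand-written notebook and collects the markdown source. The notebook text is made up for illustration and is not the file used in this post. As far as I know, the format also allows source to be a single string instead of a list, which this sketch does not handle.

```python
import json

# A tiny hand-written notebook, for illustration only (not the author's file).
nb_text = """
{
 "cells": [
  {"cell_type": "markdown", "metadata": {},
   "source": ["# Title\\n", "![plot](plot.png)"]},
  {"cell_type": "code", "metadata": {}, "source": ["print(1)"],
   "outputs": [], "execution_count": null}
 ],
 "metadata": {}, "nbformat": 4, "nbformat_minor": 5
}
"""
nb = json.loads(nb_text)

# Collect the source strings of all markdown cells, skipping code cells.
markdown_lines = []
for cell in nb["cells"]:
    if cell.get("cell_type") == "markdown":
        markdown_lines.extend(cell["source"])

print(markdown_lines)
```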

Now, properly parsing markdown is hard and brittle, but we only need an approximation, so we will use regular expressions. Matching ![text](image.img) is reasonably easy, and the regexp we use is r'\!\[([^\]]*)\]\(([^\)]*)\)'.
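A quick check of what this pattern captures, as a sketch on made-up input lines. The second example shows one of the places where the approximation gives way: an optional title after the file name ends up in the same group as the path.

```python
import re

IMG_MD_RE = r'\!\[([^\]]*)\]\(([^\)]*)\)'

# Made-up markdown line with one image link.
line = "Some text ![a plot](figs/plot.png) and more text."
alt_text, target = re.search(IMG_MD_RE, line).groups()
print(alt_text, target)

# With a title, the second group is no longer just a file name.
titled = '![a plot](figs/plot.png "My title")'
titled_groups = re.search(IMG_MD_RE, titled).groups()
print(titled_groups)
```

In the second case the captured target is not an existing file, so a script that checks for the file before replacing would simply leave such links alone.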

But now, my markdown sometimes has HTML img tags, too. And these are extremely hard to parse. (Actually, for that other day, I also have a parser for HTML5 with Markdown-like syntax and math mode.) But we try to get by with a regexp by assuming no one will use an escaped or quoted > before we find the src= attribute. Now the value of src could be quoted in single or double quotes, or not quoted at all. Also, space around the = is allowed. We are not necessarily interested in getting the entire tag, so ending after the src attribute is OK. Also, I don't believe I need to handle escaped quotation chars. This lets me think I might get away with r"""(<img[^>]*src\s*=\s*)([^\s'"]+|"[^"]*"|'[^']*')""". Putting these in re.sub gets me something like:

IMG_MD_RE = r'\!\[([^\]]*)\]\(([^\)]*)\)'
IMG_HTML_RE = r"""(<img[^>]*src\s*=\s*)([^\s'"]+|"[^"]*"|'[^']*')"""

newjson = copy.deepcopy(origjson)
for c in newjson['cells']:
    if c.get('cell_type') == 'markdown':
        # rewrite HTML img tags first, then Markdown image links
        c['source'] = [re.sub(IMG_MD_RE, replace_md_link,
                              re.sub(IMG_HTML_RE, replace_html_img, par)) for par in c['source']]

# write the result next to the original, with an _inlined suffix
with open(os.path.splitext(fn)[0] + '_inlined.ipynb', 'w') as f:
    json.dump(newjson, f)
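To see what the HTML pattern captures, here is a small sketch on made-up img tags. The second group still carries the surrounding quotes when there are any. The last sample shows a caveat: an unquoted value directly followed by > swallows the >, because the unquoted branch only stops at whitespace or a quote.

```python
import re

IMG_HTML_RE = r"""(<img[^>]*src\s*=\s*)([^\s'"]+|"[^"]*"|'[^']*')"""

# Made-up tags covering the quoting variants discussed above.
samples = [
    '<img src="a.png" width="200">',   # double quotes
    "<img alt='x' src = 'b.png'>",     # single quotes, spaces around =
    '<img src=c.png />',               # unquoted, followed by a space
    '<img src=d.png>',                 # unquoted, directly followed by >
]
captured = [re.search(IMG_HTML_RE, s).group(2) for s in samples]
for s, c in zip(samples, captured):
    print(f"{s!r} -> {c!r}")
```

For the last sample the captured name is d.png> rather than d.png, so such a tag would not be found on disk and would be left untouched.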

Now we just need the replace_... functions. We take the groups from the two regular expressions, check that the source doesn't start with data: (in which case it is already inlined) and that the file exists, and guess the MIME type. If everything works out, we create an img tag with a data: source and base64-encoded content. Being a quick and dirty script, I didn't care about abstracting the common bits. Of course, if you want to copy-paste this code, you need to put these functions above the loop.

def replace_md_link(match):
    txt, link = match.groups()
    if not os.path.isfile(link):
        return match.group(0)  # not a local file, keep the link as it is
    typ, enc = mimetypes.guess_type(link)
    if typ is None or enc is not None:
        return match.group(0)  # unknown type or compressed file, keep as is
    with open(link, 'rb') as f:
        data = base64.b64encode(f.read())
    # sometimes it seems to want a newline to get the following paragraph right
    return f'''<img src="data:{typ};base64,{data.decode()}" alt="{txt}" />\n'''

def replace_html_img(match):
    prefix, src = match.groups()
    if src[:1] in ('"', "'"):  # strip the surrounding quotes, if any
        src = src[1:-1]
    if src.startswith("data:") or not os.path.isfile(src):
        return match.group(0)  # already inlined or not a local file
    typ, enc = mimetypes.guess_type(src)
    if typ is None or enc is not None:
        return match.group(0)  # unknown type or compressed file, keep as is
    with open(src, 'rb') as f:
        data = base64.b64encode(f.read())
    return f'''{prefix}"data:{typ};base64,{data.decode()}"'''
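The common bit of the two functions, turning a file's bytes into a data: URI, can be sketched on its own. The helper name to_data_uri is hypothetical and not part of the script, and returning None when the type cannot be guessed or when mimetypes reports a compression encoding is a choice made for this sketch. The bytes passed in are just the PNG signature, not a real image.

```python
import base64
import mimetypes

def to_data_uri(path, data):
    """Hypothetical helper: build a data: URI for `data`, guessing the type from `path`."""
    typ, enc = mimetypes.guess_type(path)
    if typ is None or enc is not None:
        return None  # unknown type or compressed file: better left alone
    payload = base64.b64encode(data).decode("ascii")
    return f"data:{typ};base64,{payload}"

raw = b"\x89PNG\r\n\x1a\n"  # PNG signature only, standing in for file content
uri = to_data_uri("dot.png", raw)
header, payload = uri.split(",", 1)
print(header)

# Decoding the payload gives back the original bytes.
roundtrip = base64.b64decode(payload)

# A file name without a recognizable extension yields no URI.
unknown = to_data_uri("picture_without_extension", raw)
```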

Simple enough for me. It is not quite what I originally wanted to do, but I'm finally getting to that; the most avid readers will know. Here is hoping that my little code snippet saves you some time if you are in a similar situation of wanting to pack up a Jupyter Notebook.