Using MySQL’s LOAD DATA with Django

Thursday, January 23, 2020
Tags: django

While attempting to improve performance of bulk inserting data into MySQL database my coworker came across the LOAD DATA SQL statement. It allows you to read data from a text file (in a comma separated variable-like format) and quickly insert it into a table. There’s two variations of it, a local remote version. We did not experiment with the local version since we were connecting to a remote MySQL server and did not have access to the database’s local disk.

Since we are using Django we were hoping to match the calling behavior of the bulk_create() method of a QuerySet to do the following:

  1. Write the data to be inserted into a temporary file.
  2. Execute the LOAD DATA referencing the file.
  3. Clean-up the temporary file.

After a bit of experimentation we were able to come up with a sub-class of QuerySet with the desired features. This is heavily based on Django’s implementation of bulk_create(). There were a few details to get this to work properly:

  • Dynamically getting the fields to insert into the database.
  • Converting the Python value to a proper string value for database insertion.
  • Ensuring the proper database was used when using a read-replica with a router.

Warning

The below code is not fully vetted if the escaping of the database values is safe.

It is also missing error checking and handling of edge-cases (e.g. an empty list of objects; see the start of the bulk_create() implementation).

Without further ado, see the implementation:

from __future__ import absolute_import, print_function, unicode_literals

import tempfile

from django.db import connections, models
from django.db.models.fields import AutoField

import six


class LoadDataQuerySet(models.QuerySet):
    """A QuerySet with an additional load_data method which inserts data quickly in bulk."""

    def _convert_instance_to_line(self, fields, instance, connection):
        """Convert an object to a single line to be placed in the temporary file."""
        # Convert each field value to a database value.
        #
        # Escape the enclosure and escape characters since they are not handled
        # automatically.
        db_prep_values = [
            six.text_type(self.model._meta.get_field(field_name).get_db_prep_value(
                getattr(instance, field_name), connection)).replace('\\', '\\\\').replace('"', '\\"')
            for field_name in fields
        ]

        # Comma separate the wrapped values.
        return ','.join(map(lambda v: '"' + v + '"', db_prep_values)) + '\n'

    def load_data(self, objs):
        """
        Inserts each of the instances into the database. This does *not* call
        save() on each of the instances, does not send any pre/post save
        signals, and does not set the primary key attribute if it is an
        autoincrement field. Multi-table models are not supported.

        Write the data to a temporary file, then insert that data into MySQL via a LOAD DATA call.

        :param tuple fields: A tuple of string field names.
        :param list objs: The list of objects to insert.
        :returns: The number of inserted records.
        :rtype int:
        """

        # This is based on by bulk_create.
        self._for_write = True
        connection = connections[self.db]
        fields = self.model._meta.concrete_fields
        fields = [f.name for f in fields if not isinstance(f, AutoField)]

        # The table name and field names cannot be parameterized when executing
        # a SQL statement with the Django ORM. The name of the file where data
        # is loaded from can be parameterized, however.
        #
        # The result of this is a partially formatted string to be fed into MySQL.
        load_data_statement = """
            LOAD DATA LOCAL INFILE %s INTO TABLE {mysql_table_name} CHARACTER SET utf8 FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\\n' ({fields});
            """.format(
                mysql_table_name=self.model._meta.db_table,
                fields=', '.join(fields)
            ).strip()

        # Write each object to their own line in a temporary file and bulk
        # insert the data into MySQL using the LOAD DATA statement.
        with tempfile.NamedTemporaryFile(mode='w', suffix='.data', delete=True) as data_file:
            data_file.writelines(
                self._convert_instance_to_line(fields, obj, connection) for obj in objs
            )
            data_file.flush()

            with connection.cursor() as cursor:
                return cursor.execute(load_data_statement, [data_file.name])

This can be used by using a custom manager to the LoadDataQuerySet manager.

from django.db import models

class OpinionPoll(models.Model):
    question = models.CharField(max_length=200)
    poll_date = models.DateField()
    objects = LoadDataQuerySet.as_manager()

You can then easily call load_data() just like you would bulk_create()!

from datetime import date

# Ask the same poll on the first of each month.
polls = []
for month in range(12):
    polls.append(
        OpinionPoll(
            question="What's your favorite ice cream?",
            poll_date=date(2020, month, 1)
        )
    )

# Insert all the polls at once.
OpinionPoll.objects.load_data(polls)

Although the above code did work (using Django 1.11.x and MySQL 5.6, it should work without much work on similar versions) only a small increase in insertion rate was observed. In order to avoid complexity (and custom code) it was decided that the above code was not worth keeping, but it was a fun journey into extending Django’s QuerySet methods (and investigating obscure MySQL features).