While attempting to improve performance of bulk inserting data into MySQL database my coworker came across the LOAD DATA SQL statement. It allows you to read data from a text file (in a comma separated variable-like format) and quickly insert it into a table. There’s two variations of it, a local remote version. We did not experiment with the local version since we were connecting to a remote MySQL server and did not have access to the database’s local disk.
Since we are using Django we were hoping to match the calling behavior of the bulk_create() method of a QuerySet to do the following:
After a bit of experimentation we were able to come up with a sub-class of QuerySet with the desired features. This is heavily based on Django’s implementation of bulk_create(). There were a few details to get this to work properly:
Warning
The below code is not fully vetted if the escaping of the database values is safe.
It is also missing error checking and handling of edge-cases (e.g. an empty list of objects; see the start of the bulk_create() implementation).
Without further ado, see the implementation:
from __future__ import absolute_import, print_function, unicode_literals import tempfile from django.db import connections, models from django.db.models.fields import AutoField import six class LoadDataQuerySet(models.QuerySet): """A QuerySet with an additional load_data method which inserts data quickly in bulk.""" def _convert_instance_to_line(self, fields, instance, connection): """Convert an object to a single line to be placed in the temporary file.""" # Convert each field value to a database value. # # Escape the enclosure and escape characters since they are not handled # automatically. db_prep_values = [ six.text_type(self.model._meta.get_field(field_name).get_db_prep_value( getattr(instance, field_name), connection)).replace('\\', '\\\\').replace('"', '\\"') for field_name in fields ] # Comma separate the wrapped values. return ','.join(map(lambda v: '"' + v + '"', db_prep_values)) + '\n' def load_data(self, objs): """ Inserts each of the instances into the database. This does *not* call save() on each of the instances, does not send any pre/post save signals, and does not set the primary key attribute if it is an autoincrement field. Multi-table models are not supported. Write the data to a temporary file, then insert that data into MySQL via a LOAD DATA call. :param tuple fields: A tuple of string field names. :param list objs: The list of objects to insert. :returns: The number of inserted records. :rtype int: """ # This is based on by bulk_create. self._for_write = True connection = connections[self.db] fields = self.model._meta.concrete_fields fields = [f.name for f in fields if not isinstance(f, AutoField)] # The table name and field names cannot be parameterized when executing # a SQL statement with the Django ORM. The name of the file where data # is loaded from can be parameterized, however. # # The result of this is a partially formatted string to be fed into MySQL. load_data_statement = """ LOAD DATA LOCAL INFILE %s INTO TABLE {mysql_table_name} CHARACTER SET utf8 FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\\n' ({fields}); """.format( mysql_table_name=self.model._meta.db_table, fields=', '.join(fields) ).strip() # Write each object to their own line in a temporary file and bulk # insert the data into MySQL using the LOAD DATA statement. with tempfile.NamedTemporaryFile(mode='w', suffix='.data', delete=True) as data_file: data_file.writelines( self._convert_instance_to_line(fields, obj, connection) for obj in objs ) data_file.flush() with connection.cursor() as cursor: return cursor.execute(load_data_statement, [data_file.name])
This can be used by using a custom manager to the LoadDataQuerySet manager.
from django.db import models
class OpinionPoll(models.Model):
question = models.CharField(max_length=200)
poll_date = models.DateField()
objects = LoadDataQuerySet.as_manager()
You can then easily call load_data() just like you would bulk_create()!
from datetime import date
# Ask the same poll on the first of each month.
polls = []
for month in range(12):
polls.append(
OpinionPoll(
question="What's your favorite ice cream?",
poll_date=date(2020, month, 1)
)
)
# Insert all the polls at once.
OpinionPoll.objects.load_data(polls)
Although the above code did work (using Django 1.11.x and MySQL 5.6, it should work without much work on similar versions) only a small increase in insertion rate was observed. In order to avoid complexity (and custom code) it was decided that the above code was not worth keeping, but it was a fun journey into extending Django’s QuerySet methods (and investigating obscure MySQL features).