How to quickly import large datasets into Laravel

November 22nd, 2024 | Category: PHP

Chances are you’ve had to import a lot of data from some source, an old legacy codebase for example, into your database. These imports, if not properly optimised, can take a long time and consume a lot of resources. So let’s optimise your imports.

Setting everything up

For my imports, I like to use my own Artisan Command. This way I can simply execute the imports via the console. However, it can sometimes be useful to implement an import as part of a DatabaseSeeder.

app/Console/Commands/DataImportCommand.php

class DataImportCommand extends Illuminate\Console\Command
{
    protected $signature = 'data:import';

    public function handle()
    {
      //...
    }
}
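
If you would rather run the import as part of your database seeding, as mentioned above, a minimal sketch could simply delegate to the command. The PhotoSeeder class name below is just an example, not part of the original setup:

database/seeders/PhotoSeeder.php

class PhotoSeeder extends Illuminate\Database\Seeder
{
    public function run(): void
    {
        // Delegate to the Artisan command so the import logic lives in one place.
        \Illuminate\Support\Facades\Artisan::call('data:import');
    }
}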

The techniques we use here are also useful outside of DatabaseSeeders and Commands, and apply to importers in general.

We also need some data. For this test, I simply put a JSON file photos.json with 20000 entries in my storage directory. The structure of the individual entries looks like this:

storage/app/photos.json

[
  {
    "id": 1,
    "title": "Quia aliquam quo quod dolor reiciendis nihil. Delectus officia perspiciatis quas aliquam et molestias autem.",
    "url": "https://via.placeholder.com/640x480.png/00ee33?text=ullam",
    "thumbnailUrl": "https://via.placeholder.com/640x480.png/0022bb?text=repudiandae"
  },
  ...
]
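
If you want to generate a similar dummy file yourself, one rough way (purely illustrative, not necessarily how the original file was created) is to build the entries with Laravel’s fake() helper, for example in tinker:

// Generate 20000 fake entries and write them to storage/app/photos.json
$photos = [];

for ($i = 1; $i <= 20000; $i++) {
    $photos[] = [
        'id' => $i,
        'title' => fake()->sentence(10),
        'url' => fake()->imageUrl(640, 480),
        'thumbnailUrl' => fake()->imageUrl(640, 480),
    ];
}

Storage::put('photos.json', json_encode($photos));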

And last but not least, we need a simple Eloquent model and migration:

database/migrations/xx_create_photos_table.php

return new class extends Illuminate\Database\Migrations\Migration {
    public function up()
    {
        Schema::create('photos', function (Illuminate\Database\Schema\Blueprint $table) {
            $table->id();
            $table->string('title');
            $table->string('url');
            $table->string('thumbnail_url');
            $table->timestamps();
        });
    }
};

app/Models/Photo.php

class Photo extends Illuminate\Database\Eloquent\Model
{
    protected $guarded = [];
}

We will use mass assignment, so we either need to define the $fillable attributes here or unguard every attribute, like I did.
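
If you would rather keep mass assignment locked down, an explicit allow-list based on the columns from the migration would look roughly like this instead of unguarding:

app/Models/Photo.php

class Photo extends Illuminate\Database\Eloquent\Model
{
    // Only these columns may be mass assigned via create()/make().
    protected $fillable = ['title', 'url', 'thumbnail_url'];
}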

How to measure time and memory usage

ℹ️ If you do not care about the metrics of your importer you can skip this section

Now that everything is set up, we need a way to measure our success. We want to know how long our import took and how much memory we used in the process. Luckily, this isn’t a hard problem to solve, because Laravel added a Benchmark class in one of its recent releases that lets us easily measure how long a function takes to run. For the memory consumption, PHP has two built-in functions that give us the current and peak memory usage: memory_get_usage() and memory_get_peak_usage().

app/Console/Commands/DataImportCommand.php

// Requires these imports at the top of the command:
// use Illuminate\Support\Benchmark;
// use Symfony\Component\Console\Input\InputInterface;
// use Symfony\Component\Console\Output\OutputInterface;

public function run(InputInterface $input, OutputInterface $output): int
{
    $return = 0;

    // Measure how long the actual command (parent::run()) takes, in milliseconds.
    $benchmarkResult = Benchmark::measure(
        function () use (&$return, $input, $output) {
            $return = parent::run($input, $output);
        }
    );

    $timeSpent = ($benchmarkResult / 1000) . 's';
    $currentMemory = $this->convert(memory_get_usage());
    $peakMemory = $this->convert(memory_get_peak_usage());

    $this->table(
        ['time', 'memory_usage', 'memory_peak_usage'],
        [[$timeSpent, $currentMemory, $peakMemory]]
    );

    return $return;
}

// Convert a byte count into a human readable string like "18.52 mb".
protected function convert(int $size): string
{
    $units = ['b', 'kb', 'mb', 'gb', 'tb', 'pb'];
    $exponent = floor(log($size, 1024));

    return round($size / pow(1024, $exponent), 2) . ' ' . $units[$exponent];
}

To conveniently get the metrics after each run of our import command, we override the run() method of the command. We create a callable that calls parent::run() and pass it to the new Illuminate\Support\Benchmark class, which returns the time the callable took to execute in milliseconds. Because we also want the return value of parent::run(), we store it in a variable outside the callable (passed by reference). To get the current and peak memory usage we simply use the built-in PHP functions memory_get_usage() and memory_get_peak_usage() and convert the output into a more readable format. Finally, we put all of this into a table to get nicely formatted console output and return the value we got from parent::run(). If we run the command now, it should look like this:

php artisan data:import

+--------------+--------------+-------------------+
| time         | memory_usage | memory_peak_usage |
+--------------+--------------+-------------------+
| 0.027612416s | 18.52 mb     | 18.57 mb          |
+--------------+--------------+-------------------+

The first slow implementation

With all this out of the way, let’s start with the first, unoptimised implementation:

app/Console/Commands/DataImportCommand.php

public function handle()
{
    $data = json_decode(Storage::get('./photos.json'), true);

    $this->withProgressBar(
        $data,
        fn (array $photo) => Photo::create([
            'title' => $photo['title'],
            'url' => $photo['url'],
            'thumbnail_url' => $photo['thumbnailUrl'],
        ])
    );
}

First of all, we load our dummy data from the storage/app/photos.json file and parse the JSON into an array. Because we want nice output in the console, we use the $this->withProgressBar() method to loop through our dataset. This creates a progress bar that gives us real-time feedback on the import status.
In the second parameter we use the static create() method of our model to create a new model instance and save it directly to the database. Now let’s see how well our import performs:

php artisan data:import

 20000/20000 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓] 100%
+--------------+--------------+-------------------+
| time         | memory_usage | memory_peak_usage |
+--------------+--------------+-------------------+
| 5.017469125s | 21.69 mb     | 43.45 mb          |
+--------------+--------------+-------------------+

This is our baseline, let’s improve it.

Making it faster

Our import took about 5s, not the worst, but we can do better with some small adjustments. First we need to identify the part that takes so long, and in this case the offender is the create() method. If you think about it, the create() method saves every model on its own with a dedicated SQL query. Each of these round trips between the application and the database takes time, so it would be better to prepare all the data upfront and then send it to the database in a single SQL statement.

app/Console/Commands/DataImportCommand.php

public function handle()
{
    $data = collect(json_decode(Storage::get('./photos.json'), true));

    $bar = $this->output->createProgressBar($data->count());
    $bar->start();

    $photoData = $data->map(function (array $photo) use ($bar) {
        $bar->advance();

        return Photo::make([
            'title' => $photo['title'],
            'url' => $photo['url'],
            'thumbnail_url' => $photo['thumbnailUrl'],
        ]);
    });

    Photo::query()->insert($photoData->toArray());

    $bar->finish();
}

This time we need to build up the progress bar manually, because we now want to map over our dataset instead of just looping through it. I won’t explain the progress bar in detail here, but you can read all about it here. We also convert our data array into a collection with the collect() helper and map over it.

You could also use the built-in array_map() function, but I prefer working with collections.

Instead of creating the model and immediately saving it to the database, we only instantiate the model with the static make() method and put it into the collection. Now comes the trick: we use the insert() method to write multiple rows into the database at once. insert() expects an array of rows, each containing the column names as keys and the data as values, so we use the toArray() method on the collection to convert the collection and all the models inside it to arrays.
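
To make the shape of that payload concrete, here is a small illustration with dummy values (not taken from the real dataset) of what insert() ends up receiving:

Photo::query()->insert([
    // Each row is a plain column => value array; all rows go out in one multi-row INSERT.
    ['title' => 'First title', 'url' => 'https://example.com/a.png', 'thumbnail_url' => 'https://example.com/a_thumb.png'],
    ['title' => 'Second title', 'url' => 'https://example.com/b.png', 'thumbnail_url' => 'https://example.com/b_thumb.png'],
]);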

php artisan data:import

 20000/20000 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓] 100%
+--------------+--------------+-------------------+
| time         | memory_usage | memory_peak_usage |
+--------------+--------------+-------------------+
| 0.661941375s | 29.92 mb     | 86.46 mb          |
+--------------+--------------+-------------------+

This is a huge time improvement, from 5s down to 0.6s! But we see a new problem: our memory usage increased substantially. If we wanted to import even larger datasets, this could become a huge problem. So let’s fix it.

Fixing the memory creep

Let’s start with a really simple improvement. At the moment we create a collection of 20000 Eloquent models, but we never really use them and just convert them to arrays later, so why not convert them to arrays immediately? A plain array needs far less memory than a complete Eloquent model:

app/Console/Commands/DataImportCommand.php

public function handle()
{
    // ...
    $photoData = $data->map(function (array $photo) use ($bar) {
        $bar->advance();

        return Photo::make([
            'title' => $photo['title'],
            'url' => $photo['url'],
            'thumbnail_url' => $photo['thumbnailUrl'],
        ])->attributesToArray();
    });
    //...
}

php artisan data:import

 20000/20000 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓] 100%
+--------------+--------------+-------------------+
| time         | memory_usage | memory_peak_usage |
+--------------+--------------+-------------------+
| 0.602062459s | 29.69 mb     | 74.02 mb          |
+--------------+--------------+-------------------+

Instantly we save about 12mb of peak memory, not bad for such a small change, but we can do even better! At the moment we process all 20000 entries at once; this does not scale well, and at some point we will run out of memory. The solution is to chunk our input into smaller pieces and process them one by one.

app/Console/Commands/DataImportCommand.php

public function handle()
{
    //...
    $data->chunk(500)
        ->each(function (Collection $chunk) use ($bar) {
            $photoData = $chunk->map(function (array $photo) use ($bar) {
                $bar->advance();

                return Photo::make([
                    'title' => $photo['title'],
                    'url' => $photo['url'],
                    'thumbnail_url' => $photo['thumbnailUrl'],
                ])->attributesToArray();
            });

            // One INSERT per chunk of 500 rows instead of one giant statement.
            Photo::query()->insert($photoData->toArray());
        });
    //...
}

We use the chunk() method on the collection and then apply the same logic we used on the full dataset to each smaller chunk, inserting one chunk at a time.

php artisan data:import

 20000/20000 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓] 100%
+--------------+--------------+-------------------+
| time         | memory_usage | memory_peak_usage |
+--------------+--------------+-------------------+
| 0.797036583s | 30.2 mb      | 43.46 mb          |
+--------------+--------------+-------------------+

Now we save an additional 31mb of peak memory and only add about 0.2s to our overall time.

You can probably squeeze even more performance out of the importer if you use generators, but that is out of the scope of this blog post.
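
If you do want to experiment in that direction, a rough, unbenchmarked sketch using Laravel’s LazyCollection (which is built on generators) might look like the following. Note that json_decode() still loads the whole file into memory, so a streaming JSON parser would be needed to remove that limit entirely:

// requires: use Illuminate\Support\LazyCollection;

LazyCollection::make(function () {
    // Yield one entry at a time so downstream chunks are only built as needed.
    foreach (json_decode(Storage::get('./photos.json'), true) as $photo) {
        yield $photo;
    }
})
    ->chunk(500)
    ->each(function (LazyCollection $chunk) {
        Photo::query()->insert(
            $chunk->map(fn (array $photo) => [
                'title' => $photo['title'],
                'url' => $photo['url'],
                'thumbnail_url' => $photo['thumbnailUrl'],
            ])->all()
        );
    });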

And that is pretty much it. Now you can import huge amounts of data without it taking hours or eating up your resources.

Bonus tip: update or create if it doesn’t already exist

One quick tip: if you want your importer to update an existing model, or create a new one if no matching model exists, you can use the upsert() method instead of insert(). Keep in mind that the columns you pass to uniqueBy (here id) have to be present in the inserted data and, on most databases, need a unique or primary index.

app/Console/Commands/DataImportCommand.php

Photo::query()
    ->upsert(
        values: $photoData->toArray(),
        uniqueBy: ['id'],
        update: [
            'title',
            'url',
            'thumbnail_url',
        ]
    );