An accessible introduction to PHP optimization

2009-09-07 22:09 by Ian

This was a ticket that I had from a customer who wanted us to increase the upper-limit on PHP script execution because his crappy script wouldn't finish in time before our server-wide limit shut down his PHP process.
Um. No.

But in this case, I did decide to fix his badly-written script.
The goal was to upload a CSV file and rake it apart into columns for insertion into a database. Apparently, he uses this regularly, and it has only now been giving him problems as his CSV file passed about 50,000 lines. Both of the following scripts do the same job, but my version has noticeably less fail in it.

Here is the PHP code I found:
02:  ini_set('memory_limit','1000M');
03:  ini_set('upload_max_filesize','20M');
04:  ini_set('max_execution_time','3600');
05:  ini_set('max_input_time','120');
06:  ini_set('default_socket_timeout','120');
07:  set_time_limit(0);
08:  error_reporting(E_ALL);
09:  //include_once("checklog.php");
10:  //require_once "../Spreadsheet/Excel/reader.php";
11:  //require_once("globals.php");
12:  include "";
13:  //require_once("globals.php");
14:  $file="uploaded_csv_files/".$_GET['file'];
15:  $data_array = file($file);
16:     $query2 = mysql_query("truncate table jumak_quotation");
18:     echo "Transferring data...<div id='mydiv'>";
19:     for ( $i = 0; $i <  count($data_array); $i++ )
20:     {
21:        $delimiter=",";
22:        $parts_array[$i] = explode($delimiter,$data_array[$i]);
23:        //if ($i <> "0" && $parts_array[$i][0] <> "" && $parts_array[$i][0] <> "0") {
24:           //$part=str_replace("'","\'",$parts_array[$i][1]);
25:           //$desc=str_replace("'","\'",);
27:           $part = str_replace("'", "\'", $parts_array[$i][0]);
28:           $description = str_replace("'", "\'", $parts_array[$i][1]);
29:           $qty_available = $parts_array[$i][2];
30:           $price = $parts_array[$i][3];
31:           $isSystem = $parts_array[$i][4];
32:           $query = "INSERT INTO jumak_quotation(part_number, description, qty_available, price, isSystem) \
                  VALUES('". $part  ."', '". $description ."', '". $qty_available ."', '". $price ."','". $isSystem."')";
33:           //mysql_query($query) or die(mysql_error());
35:           //$query = "INSERT INTO jumak_quotation(part_number, description, qty_available, price)
36:                 //VALUES('". $part  ."', '','". $parts_array[$i][3] ."','". $parts_array[$i][2] ."')";
33:  //         echo $query ."<br><br>";
39:           echo "<script language='javascript'>mydiv.innerHTML='';</script>";
40:           echo ($i+1)."<br>";
41:           mysql_query($query) or die(mysql_error());
42:        //}
43:     }
44:     echo "</div>";
45:     echo "<br>Rows inserted : ".$i."<br><br><a href='products.php'>Click here to back</a>";
46:  //   header("location: products.php");
49:  //$file = $_REQUEST['file_name'];
51:  //read_csv($file,",");

This is the script I left him with:
02:  error_reporting(E_ALL);
03:  if(!trim($_GET['file']))
04:  {
05:  header("location:products.php");
06:  exit;
07:  }
08:  //include "";
09:  require_once("globals.php");
10:  $file="uploaded_csv_files_compressed/".$_GET['file'];
11:  $data_array = file($file);
12:  $query2 = mysql_query("truncate table jumak_quotation",GetMyConnection());
14:  $total_records =   count($data_array);
15:  $block_size    =   200;
16:  $block_count   =   $total_records / $block_size;
17:  $k             =   0;
18:  $delimiter     =   ",";
19:  while($block_count >= $k){
20:     $query = "INSERT INTO jumak_quotation(part_number, description, qty_available, price, isSystem) VALUES ";
21:     $i = 0;
22:     while(($i < $block_size) && (($block_size*$k)+$i < $total_records)){
23:        $index = ($block_size*$k)+$i;
24:        $parts_array[$index] = explode($delimiter,$data_array[$index]);
26:        $part          = addslashes($parts_array[$index][0]);
27:        $description   = addslashes($parts_array[$index][1]);
28:        $qty_available = $parts_array[$index][2];
29:        $price         = $parts_array[$index][3];
30:        $isSystem      = $parts_array[$index][4];
31:        $query .= "('". $part  ."', '". $description ."', '". $qty_available ."', '". $price ."','". $isSystem."'), ";
32:        $i++;
33:     }
34:     $query = substr($query, 0, strlen($query)-2);
35:     $query .= ";";
36:     mysql_query($query,GetMyConnection());
37:     $k++;
33:  }
39:  //echo "<br>Rows inserted : ".$total_records."<br><br><a href='products.php'>Click here to back</a>";
40:  header("location: products.php?tr=$total_records");

What follows is the ticket response I sent back. I'm posting it with the hope that it will be of help to someone new to PHP trying to write a cleaner script.


After talking with David regarding your goal and the imposed limitations, I am going to first recommend trying to condense the data that has to move between the webserver and the MySQL server. David tells me that he has measured about 50,000 records that made it out of the script before the FastCGI timeout expiration.
If we can increase the runtime efficiency by 33% or more, your script will run without major changes to the methods used. If we can get an extra 45% or so, it will ensure that the script will still execute within the alloted time frame, even under high server load.

First off, I am going to create a copy of your script in the same directory: old_import.php

What follows will be a running commentary on what I am changing and why....

I am removing lines 2 through 7 because they won't have any effect on our shared servers.

I'm guessing that some of these other lines are important normally, but you have commented them out for the sake of testing. I will remove them. Please correct me if this is a bad assumption.

You might also like to know that there is a function called addslashes that performs the same function as your str_replace calls that are formed this way... str_replace("'", "\'", $parts_array[$i][0]);

...can be replaced with...

It's a bit less work, and makes the code easier to read.

Alright... line 32 is where I'll begin optimizing.
Typically, when you send large amounts of data across a network, you will see a performance gain by sending fewer large chunks of data rather than many small chunks. This is going to be my over-arching strategy.
MySQL allows for multiple record insertions with a single INSERT query. I'm going to create a buffer variable and pack it full of records before closing the INSERT.

The more code that you can remove from the loops, the faster your code will run. So I'm also moving lots of static code outside the loops. For instance....
$delimiter = ",";

Because the value $delimiter never changes, by moving it outside the loop, I will not have to assign it each iteration. For a dataset that is 77,000 records long, that's equivalent to saving 77,000 lines of code!

In this same way, I am also going to assign the size of the dataset to a static variable rather than calling the count($var) function each iteration of the for loop. Another 77,000 lines saved.

There are some new variables...

$block_size = 50;
This is the number of records that gets sent to the MySQL server per-query. Increasing this number will cause fewer queries that are larger, thereby reducing overhead.

$block_count = $total_records / $block_size;
This is the number of blocks that gets sent. You shouldn't touch this line. The code will calculate this constant automatically.

$k = 0;
Just an iterator variable.

$query = substr($query, 0, strlen($query)-2);
Once the INSERT block has been completed, this line trims away the unnecessary comma and terminates the statement with a semicolon (MySQL best practices).

Also, I cut out the echo($string) calls.

End result:
Using a $block_size of 200, I cut the execution time down to 16.330 seconds (including a 3.4MB upload!!).
I have improved your script's efficiency by roughly 300%. It will now work without problems for datasets much larger than what you currently have. Let me know if it gives you problems.