RFR 8193832: Performance of InputStream.readAllBytes() could be improved

classic Classic list List threaded Threaded
32 messages Options
12
Reply | Threaded
Open this post in threaded view
|

RFR 8193832: Performance of InputStream.readAllBytes() could be improved

Brian Burkhalter-2
https://bugs.openjdk.java.net/browse/JDK-8193832
http://cr.openjdk.java.net/~bpb/8193832/webrev.00/

The implementation of InputStream.readAllBytes() can be modified to be in general faster and to require less memory. The current implementation starts with an internal buffer of size 8192 bytes and doubles the size of the buffer each time more capacity is required resulting in a geometric progression in the buffer size. At the end if the buffer size is not exactly equal to the total number of bytes read, then another allocation equal to the total number read is performed. The amount of memory N required can be calculated for initial buffer size B and total bytes read L as

        M for L == B*(2^n)
N =
        M + L for L != B*(2^n)

where M = B*(2^(n + 1) - 1) and n = ceil(log2(L/B)).

An alternative implementation is not to increase the size of the internal buffer each time more capacity is needed, but to instead maintain a list of buffers of up to B bytes each and gather those into an output buffer at the end. If there is only a single buffer in the list then gathering is not needed and it can be returned directly. The amount of memory N required in this case is

        B + L for L <= B
N =
        B + 2*L for L > B

A comparison of memory usage for a number of lengths L with a buffer size B of 8192 is:

L       N_old      N_new      
8191       16383      16383    
8192       8192       16384    
8193       32769      24578    
16383      40959      40958    
16384      24576      40960    
16385      73729      40962    
32767      90111      73726    
32768      57344      73728    
32769      155649     73730    
65535      188415     139262    
65536      122880     139264    
65537      319489     139266    
131071     385023     270334    
131072     253952     270336    
131073     647169     270338    
262143     778239     532478    
262144     516096     532480    
262145     1302529    532482

In general if the size of the last intermediate buffer in the old version does not equal the total number of bytes read, then the old version will require more memory thaw the new. The performance for the same set of lengths was measured using JMH:

Benchmark                                     (base)  (offset)   Mode  Cnt        Score       Error  Units
BenchReadAllBytesParam.readAllBytes             8192        -1  thrpt    5   578064.253 ± 18955.667  ops/s
BenchReadAllBytesParam.readAllBytesArrayList    8192        -1  thrpt    5   530963.634 ±  4799.923  ops/s
BenchReadAllBytesParam.readAllBytes             8192         0  thrpt    5  1414243.593 ± 68520.245  ops/s
BenchReadAllBytesParam.readAllBytesArrayList    8192         0  thrpt    5   532019.149 ±  7051.738  ops/s
BenchReadAllBytesParam.readAllBytes             8192         1  thrpt    5   293586.365 ±  8196.939  ops/s
BenchReadAllBytesParam.readAllBytesArrayList    8192         1  thrpt    5   361682.265 ± 27081.111  ops/s
BenchReadAllBytesParam.readAllBytes            16384        -1  thrpt    5   216809.517 ±  4853.852  ops/s
BenchReadAllBytesParam.readAllBytesArrayList   16384        -1  thrpt    5   212346.244 ±  2980.916  ops/s
BenchReadAllBytesParam.readAllBytes            16384         0  thrpt    5   379757.470 ± 14524.249  ops/s
BenchReadAllBytesParam.readAllBytesArrayList   16384         0  thrpt    5   218802.256 ±  4361.080  ops/s
BenchReadAllBytesParam.readAllBytes            16384         1  thrpt    5   123570.712 ±  1665.661  ops/s
BenchReadAllBytesParam.readAllBytesArrayList   16384         1  thrpt    5   208801.577 ±  8005.196  ops/s
BenchReadAllBytesParam.readAllBytes            32768        -1  thrpt    5    93083.146 ±  1024.309  ops/s
BenchReadAllBytesParam.readAllBytesArrayList   32768        -1  thrpt    5   104398.304 ±  1310.509  ops/s
BenchReadAllBytesParam.readAllBytes            32768         0  thrpt    5   151104.878 ±  3461.359  ops/s
BenchReadAllBytesParam.readAllBytesArrayList   32768         0  thrpt    5   104849.088 ±  3841.224  ops/s
BenchReadAllBytesParam.readAllBytes            32768         1  thrpt    5    58039.827 ±   398.908  ops/s
BenchReadAllBytesParam.readAllBytesArrayList   32768         1  thrpt    5   104489.685 ±  2118.496  ops/s
BenchReadAllBytesParam.readAllBytes            65536        -1  thrpt    5    43144.695 ±   440.349  ops/s
BenchReadAllBytesParam.readAllBytesArrayList   65536        -1  thrpt    5    55583.338 ±  2115.162  ops/s
BenchReadAllBytesParam.readAllBytes            65536         0  thrpt    5    67216.536 ±  2055.057  ops/s
BenchReadAllBytesParam.readAllBytesArrayList   65536         0  thrpt    5    55680.238 ±  1235.571  ops/s
BenchReadAllBytesParam.readAllBytes            65536         1  thrpt    5    27908.000 ±   568.473  ops/s
BenchReadAllBytesParam.readAllBytesArrayList   65536         1  thrpt    5    55938.779 ±   813.991  ops/s
BenchReadAllBytesParam.readAllBytes           131072        -1  thrpt    5    20299.014 ±   568.616  ops/s
BenchReadAllBytesParam.readAllBytesArrayList  131072        -1  thrpt    5    28115.036 ±   842.392  ops/s
BenchReadAllBytesParam.readAllBytes           131072         0  thrpt    5    31617.475 ±   467.378  ops/s
BenchReadAllBytesParam.readAllBytesArrayList  131072         0  thrpt    5    28274.289 ±   259.699  ops/s
BenchReadAllBytesParam.readAllBytes           131072         1  thrpt    5    13640.406 ±   303.125  ops/s
BenchReadAllBytesParam.readAllBytesArrayList  131072         1  thrpt    5    28221.298 ±   515.030  ops/s
BenchReadAllBytesParam.readAllBytes           262144        -1  thrpt    5     9995.183 ±   249.368  ops/s
BenchReadAllBytesParam.readAllBytesArrayList  262144        -1  thrpt    5    14043.194 ±   138.026  ops/s
BenchReadAllBytesParam.readAllBytes           262144         0  thrpt    5    15309.238 ±   353.752  ops/s
BenchReadAllBytesParam.readAllBytesArrayList  262144         0  thrpt    5    14048.569 ±   348.699  ops/s
BenchReadAllBytesParam.readAllBytes           262144         1  thrpt    5     5940.442 ±   206.855  ops/s
BenchReadAllBytesParam.readAllBytesArrayList  262144         1  thrpt    5    14055.357 ±   412.470  ops/s

In each case the number of bytes read is the sum of the base and offset parameters. Similar behavior with respect to data length is observed for performance as for memory usage: if the data length is not equal to the size of the last intermediate buffer used, then the performance of the old version is in general worse than that of the new.

Thanks,

Brian
Reply | Threaded
Open this post in threaded view
|

Re: RFR 8193832: Performance of InputStream.readAllBytes() could be improved

Paul Sandoz
Hi,

For the case of reading 2^N bytes i believe you can avoid doing a last copy by checking if “n < 0" within the “nread > 0” block when “nread == DEAFULT_BUFFER_SIZE”. That might close the perf gap for smaller cases. You can also move "nread = 0” to the same block e.g.:

  var copy = (n < 0 && nread == DEAFULT_BUFFER_SIZE) ? buf : Arrays.copyOf(buf, nread);
  list.add(copy)
  nread = 0;


 262         byte[] output = new byte[total];
 263         int offset = 0;
 264         int numCached = list.size();
 265         for (int i = 0; i < numCached; i++) {
 266             byte[] b = list.get(i);
 267             System.arraycopy(b, 0, output, offset, b.length);
 268             offset += b.length;
 269         }

You can simplify to:

var result = new byte[total];
int offset = 0;
for (buf : list) {
  System.arraycopy(buf, 0, result, offset, buf.length);
  offset += buf.length;
}

s/list/bufs and then you can use var for the declarations at the start of the method.

Paul.


> On 19 Dec 2017, at 11:57, Brian Burkhalter <[hidden email]> wrote:
>
> https://bugs.openjdk.java.net/browse/JDK-8193832
> http://cr.openjdk.java.net/~bpb/8193832/webrev.00/
>
> The implementation of InputStream.readAllBytes() can be modified to be in general faster and to require less memory. The current implementation starts with an internal buffer of size 8192 bytes and doubles the size of the buffer each time more capacity is required resulting in a geometric progression in the buffer size. At the end if the buffer size is not exactly equal to the total number of bytes read, then another allocation equal to the total number read is performed. The amount of memory N required can be calculated for initial buffer size B and total bytes read L as
>
> M for L == B*(2^n)
> N =
> M + L for L != B*(2^n)
>
> where M = B*(2^(n + 1) - 1) and n = ceil(log2(L/B)).
>
> An alternative implementation is not to increase the size of the internal buffer each time more capacity is needed, but to instead maintain a list of buffers of up to B bytes each and gather those into an output buffer at the end. If there is only a single buffer in the list then gathering is not needed and it can be returned directly. The amount of memory N required in this case is
>
> B + L for L <= B
> N =
> B + 2*L for L > B
>
> A comparison of memory usage for a number of lengths L with a buffer size B of 8192 is:
>
> L       N_old      N_new
> 8191       16383      16383
> 8192       8192       16384
> 8193       32769      24578
> 16383      40959      40958
> 16384      24576      40960
> 16385      73729      40962
> 32767      90111      73726
> 32768      57344      73728
> 32769      155649     73730
> 65535      188415     139262
> 65536      122880     139264
> 65537      319489     139266
> 131071     385023     270334
> 131072     253952     270336
> 131073     647169     270338
> 262143     778239     532478
> 262144     516096     532480
> 262145     1302529    532482
>
> In general if the size of the last intermediate buffer in the old version does not equal the total number of bytes read, then the old version will require more memory thaw the new. The performance for the same set of lengths was measured using JMH:
>
> Benchmark                                     (base)  (offset)   Mode  Cnt        Score       Error  Units
> BenchReadAllBytesParam.readAllBytes             8192        -1  thrpt    5   578064.253 ± 18955.667  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList    8192        -1  thrpt    5   530963.634 ±  4799.923  ops/s
> BenchReadAllBytesParam.readAllBytes             8192         0  thrpt    5  1414243.593 ± 68520.245  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList    8192         0  thrpt    5   532019.149 ±  7051.738  ops/s
> BenchReadAllBytesParam.readAllBytes             8192         1  thrpt    5   293586.365 ±  8196.939  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList    8192         1  thrpt    5   361682.265 ± 27081.111  ops/s
> BenchReadAllBytesParam.readAllBytes            16384        -1  thrpt    5   216809.517 ±  4853.852  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList   16384        -1  thrpt    5   212346.244 ±  2980.916  ops/s
> BenchReadAllBytesParam.readAllBytes            16384         0  thrpt    5   379757.470 ± 14524.249  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList   16384         0  thrpt    5   218802.256 ±  4361.080  ops/s
> BenchReadAllBytesParam.readAllBytes            16384         1  thrpt    5   123570.712 ±  1665.661  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList   16384         1  thrpt    5   208801.577 ±  8005.196  ops/s
> BenchReadAllBytesParam.readAllBytes            32768        -1  thrpt    5    93083.146 ±  1024.309  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList   32768        -1  thrpt    5   104398.304 ±  1310.509  ops/s
> BenchReadAllBytesParam.readAllBytes            32768         0  thrpt    5   151104.878 ±  3461.359  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList   32768         0  thrpt    5   104849.088 ±  3841.224  ops/s
> BenchReadAllBytesParam.readAllBytes            32768         1  thrpt    5    58039.827 ±   398.908  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList   32768         1  thrpt    5   104489.685 ±  2118.496  ops/s
> BenchReadAllBytesParam.readAllBytes            65536        -1  thrpt    5    43144.695 ±   440.349  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList   65536        -1  thrpt    5    55583.338 ±  2115.162  ops/s
> BenchReadAllBytesParam.readAllBytes            65536         0  thrpt    5    67216.536 ±  2055.057  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList   65536         0  thrpt    5    55680.238 ±  1235.571  ops/s
> BenchReadAllBytesParam.readAllBytes            65536         1  thrpt    5    27908.000 ±   568.473  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList   65536         1  thrpt    5    55938.779 ±   813.991  ops/s
> BenchReadAllBytesParam.readAllBytes           131072        -1  thrpt    5    20299.014 ±   568.616  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList  131072        -1  thrpt    5    28115.036 ±   842.392  ops/s
> BenchReadAllBytesParam.readAllBytes           131072         0  thrpt    5    31617.475 ±   467.378  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList  131072         0  thrpt    5    28274.289 ±   259.699  ops/s
> BenchReadAllBytesParam.readAllBytes           131072         1  thrpt    5    13640.406 ±   303.125  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList  131072         1  thrpt    5    28221.298 ±   515.030  ops/s
> BenchReadAllBytesParam.readAllBytes           262144        -1  thrpt    5     9995.183 ±   249.368  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList  262144        -1  thrpt    5    14043.194 ±   138.026  ops/s
> BenchReadAllBytesParam.readAllBytes           262144         0  thrpt    5    15309.238 ±   353.752  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList  262144         0  thrpt    5    14048.569 ±   348.699  ops/s
> BenchReadAllBytesParam.readAllBytes           262144         1  thrpt    5     5940.442 ±   206.855  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList  262144         1  thrpt    5    14055.357 ±   412.470  ops/s
>
> In each case the number of bytes read is the sum of the base and offset parameters. Similar behavior with respect to data length is observed for performance as for memory usage: if the data length is not equal to the size of the last intermediate buffer used, then the performance of the old version is in general worse than that of the new.
>
> Thanks,
>
> Brian

Reply | Threaded
Open this post in threaded view
|

Re: RFR 8193832: Performance of InputStream.readAllBytes() could be improved

Remi Forax
----- Mail original -----
> De: "Paul Sandoz" <[hidden email]>
> À: "Brian Burkhalter" <[hidden email]>
> Cc: "core-libs-dev" <[hidden email]>
> Envoyé: Mardi 19 Décembre 2017 21:52:03
> Objet: Re: RFR 8193832: Performance of InputStream.readAllBytes() could be improved

> Hi,
>
> For the case of reading 2^N bytes i believe you can avoid doing a last copy by
> checking if “n < 0" within the “nread > 0” block when “nread ==
> DEAFULT_BUFFER_SIZE”. That might close the perf gap for smaller cases. You can
> also move "nread = 0” to the same block e.g.:
>
>  var copy = (n < 0 && nread == DEAFULT_BUFFER_SIZE) ? buf : Arrays.copyOf(buf,
>  nread);
>  list.add(copy)
>  nread = 0;
>
>
> 262         byte[] output = new byte[total];
> 263         int offset = 0;
> 264         int numCached = list.size();
> 265         for (int i = 0; i < numCached; i++) {
> 266             byte[] b = list.get(i);
> 267             System.arraycopy(b, 0, output, offset, b.length);
> 268             offset += b.length;
> 269         }
>
> You can simplify to:
>
> var result = new byte[total];
> int offset = 0;
> for (buf : list) {
>  System.arraycopy(buf, 0, result, offset, buf.length);
>  offset += buf.length;
> }
>
> s/list/bufs and then you can use var for the declarations at the start of the
> method.
>
> Paul.

About using var, IMO var declaration makes usually the code more readable apart if you mix var declaration and classical declaration and if you call a method that has several overloads.

Is there a usage guide somewhere ?

Rémi
Reply | Threaded
Open this post in threaded view
|

Re: RFR 8193832: Performance of InputStream.readAllBytes() could be improved

Brian Burkhalter-2
In reply to this post by Paul Sandoz
On Dec 19, 2017, at 12:52 PM, Paul Sandoz <[hidden email]> wrote:

> For the case of reading 2^N bytes i believe you can avoid doing a last copy by checking if “n < 0" within the “nread > 0” block when “nread == DEAFULT_BUFFER_SIZE”. That might close the perf gap for smaller cases. You can also move "nread = 0” to the same block e.g.:
>
>  var copy = (n < 0 && nread == DEAFULT_BUFFER_SIZE) ? buf : Arrays.copyOf(buf, nread);
>  list.add(copy)
>  nread = 0;

Definitely improves performance and memory footprint for this case.

> 262         byte[] output = new byte[total];
> 263         int offset = 0;
> 264         int numCached = list.size();
> 265         for (int i = 0; i < numCached; i++) {
> 266             byte[] b = list.get(i);
> 267             System.arraycopy(b, 0, output, offset, b.length);
> 268             offset += b.length;
> 269         }
>
> You can simplify to:
>
> var result = new byte[total];
> int offset = 0;
> for (buf : list) {
>  System.arraycopy(buf, 0, result, offset, buf.length);
>  offset += buf.length;
> }

Oh, yes, of course.

> s/list/bufs and then you can use var for the declarations at the start of the method.

Done. Updated: http://cr.openjdk.java.net/~bpb/8193832/webrev.01/

Good suggestions! Not sure however about line 237 as with var it has to be “var n = 0;” instead of simply “int n;” with no initialization. Also I’ve not tested the effect of the initial List capacity at line 233: the current value of 128 is arbitrary - might be better to go with the default?

Thanks,

Brian
Reply | Threaded
Open this post in threaded view
|

Re: RFR 8193832: Performance of InputStream.readAllBytes() could be improved

Paul Sandoz


> On 19 Dec 2017, at 13:43, Brian Burkhalter <[hidden email]> wrote:
>
> On Dec 19, 2017, at 12:52 PM, Paul Sandoz <[hidden email] <mailto:[hidden email]>> wrote:
>
>> For the case of reading 2^N bytes i believe you can avoid doing a last copy by checking if “n < 0" within the “nread > 0” block when “nread == DEAFULT_BUFFER_SIZE”. That might close the perf gap for smaller cases. You can also move "nread = 0” to the same block e.g.:
>>
>>  var copy = (n < 0 && nread == DEAFULT_BUFFER_SIZE) ? buf : Arrays.copyOf(buf, nread);
>>  list.add(copy)
>>  nread = 0;
>
> Definitely improves performance and memory footprint for this case.
>
>> 262         byte[] output = new byte[total];
>> 263         int offset = 0;
>> 264         int numCached = list.size();
>> 265         for (int i = 0; i < numCached; i++) {
>> 266             byte[] b = list.get(i);
>> 267             System.arraycopy(b, 0, output, offset, b.length);
>> 268             offset += b.length;
>> 269         }
>>
>> You can simplify to:
>>
>> var result = new byte[total];
>> int offset = 0;
>> for (buf : list) {
>>  System.arraycopy(buf, 0, result, offset, buf.length);
>>  offset += buf.length;
>> }
>
> Oh, yes, of course.
>
>> s/list/bufs and then you can use var for the declarations at the start of the method.
>
> Done. Updated: http://cr.openjdk.java.net/~bpb/8193832/webrev.01/ <http://cr.openjdk.java.net/~bpb/8193832/webrev.01/>
>

You can also simplify the “for(;;) + break" into a do while loop:

do {
  int nread = 0;
  ...
} while (n > 0);


> Good suggestions! Not sure however about line 237 as with var it has to be “var n = 0;” instead of simply “int n;” with no initialization.

I was only suggesting it’s use for the byte[] and ArrayList<byte[]>. IMHO it’s a little subjective but there is less upside for int, although one can argue consistent application and explicit initialization is a good thing here.


> Also I’ve not tested the effect of the initial List capacity at line 233: the current value of 128 is arbitrary - might be better to go with the default?
>

My inclination would be to use 8 instead of 128, that allows 2^16 in size by default, rather than 2^20.

Paul.
Reply | Threaded
Open this post in threaded view
|

Re: RFR 8193832: Performance of InputStream.readAllBytes() could be improved

Brian Burkhalter-2
On Dec 19, 2017, at 2:28 PM, Paul Sandoz <[hidden email]> wrote:

>> Done. Updated: http://cr.openjdk.java.net/~bpb/8193832/webrev.01/
>>
>
> You can also simplify the “for(;;) + break" into a do while loop:
>
> do {
>   int nread = 0;
>   ...
> } while (n > 0);

Good suggestion but I think that this needs to be "while (n >= 0)."

>> Good suggestions! Not sure however about line 237 as with var it has to be “var n = 0;” instead of simply “int n;” with no initialization.
>
> I was only suggesting it’s use for the byte[] and ArrayList<byte[]>. IMHO it’s a little subjective but there is less upside for int, although one can argue consistent application and explicit initialization is a good thing here.

I had left the ints as ints in an intermediate copy before Remi’s message. I also prefer to leave them as ints.

Brian
Reply | Threaded
Open this post in threaded view
|

Re: RFR 8193832: Performance of InputStream.readAllBytes() could be improved

Brian Burkhalter-2
On Dec 19, 2017, at 2:36 PM, Brian Burkhalter <[hidden email]> wrote:

>> You can also simplify the “for(;;) + break" into a do while loop:
>>
>> do {
>>  int nread = 0;
>>  ...
>> } while (n > 0);
>
> Good suggestion but I think that this needs to be "while (n >= 0)."

Updated version here:

http://cr.openjdk.java.net/~bpb/8193832/webrev.02/

Thanks,

Brian
Reply | Threaded
Open this post in threaded view
|

Re: RFR 8193832: Performance of InputStream.readAllBytes() could be improved

Peter Levart
Hi Brian,

On 12/20/2017 12:22 AM, Brian Burkhalter wrote:

> On Dec 19, 2017, at 2:36 PM, Brian Burkhalter <[hidden email]> wrote:
>
>>> You can also simplify the “for(;;) + break" into a do while loop:
>>>
>>> do {
>>>   int nread = 0;
>>>   ...
>>> } while (n > 0);
>> Good suggestion but I think that this needs to be "while (n >= 0)."
> Updated version here:
>
> http://cr.openjdk.java.net/~bpb/8193832/webrev.02/
>
> Thanks,
>
> Brian

Looks good. There is one case that could be further optimized. When
result.length <= DEFAULT_BUFFER_SIZE, the allocation of ArrayList could
be avoided. Imagine a use case where lots of small files are read into
byte[] arrays. For exmaple:

     public byte[] readAllBytes() throws IOException {
         List<byte[]> bufs = null;
         byte[] result = null;
         byte[] buf = new byte[DEFAULT_BUFFER_SIZE];
         int total = 0;
         int n;
         do {
             int nread = 0;

             // read to EOF which may read more or less than buffer size
             while ((n = read(buf, nread, buf.length - nread)) > 0) {
                 nread += n;
             }

             if (nread > 0) {
                 if (MAX_BUFFER_SIZE - total < nread) {
                     throw new OutOfMemoryError("Required array size too
large");
                 }
                 total += nread;
                 byte[] copy = (n < 0 && nread == DEFAULT_BUFFER_SIZE) ?
                     buf : Arrays.copyOf(buf, nread);
                 if (result == null) {
                     result = copy;
                 } else {
                     bufs = new ArrayList<>(8);
                     bufs.add(result);
                     bufs.add(copy);
                 }
             }
         } while (n >= 0); // if the last call to read returned -1, then
break

         if (bufs == null) {
             return result == null ? new byte[0] : result;
         }

         result = new byte[total];
         int offset = 0;
         for (byte[] b : bufs) {
             System.arraycopy(b, 0, result, offset, b.length);
             offset += b.length;
         }

         return result;
     }


There is a possibility that JIT already avoids allocating ArrayList
utilizing EA if all involved ArrayList methods inline, so this potential
optimization should be tested 1st to see if it actually helps improve
the "small file" case.

Regards, Peter

Reply | Threaded
Open this post in threaded view
|

Re: RFR 8193832: Performance of InputStream.readAllBytes() could be improved

Peter Levart
Hi Brian,

There's also a variation of copy-ing fragment possible, that replaces
copying with allocation:

                 byte[] copy;
                 if (nread == DEFAULT_BUFFER_SIZE) {
                     copy = buf;
                     if (n >= 0) {
                         buf = new byte[DEFAULT_BUFFER_SIZE];
                     }
                 } else {
                     copy = Arrays.copyOf(buf, nread);
                 }

For big FileInputStream(s), the buf will be fully read (nread ==
DEFAULT_BUFFER_SIZE) most of the times. So this might be an improvement
if allocation (involving pre-zeroing) is faster than Arrays.copyOf()
which avoids pre-zeroing, but involves copying.

Regards, Peter

On 12/20/2017 12:45 PM, Peter Levart wrote:

> Hi Brian,
>
> On 12/20/2017 12:22 AM, Brian Burkhalter wrote:
>> On Dec 19, 2017, at 2:36 PM, Brian Burkhalter
>> <[hidden email]> wrote:
>>
>>>> You can also simplify the “for(;;) + break" into a do while loop:
>>>>
>>>> do {
>>>>   int nread = 0;
>>>>   ...
>>>> } while (n > 0);
>>> Good suggestion but I think that this needs to be "while (n >= 0)."
>> Updated version here:
>>
>> http://cr.openjdk.java.net/~bpb/8193832/webrev.02/
>>
>> Thanks,
>>
>> Brian
>
> Looks good. There is one case that could be further optimized. When
> result.length <= DEFAULT_BUFFER_SIZE, the allocation of ArrayList
> could be avoided. Imagine a use case where lots of small files are
> read into byte[] arrays. For exmaple:
>
>     public byte[] readAllBytes() throws IOException {
>         List<byte[]> bufs = null;
>         byte[] result = null;
>         byte[] buf = new byte[DEFAULT_BUFFER_SIZE];
>         int total = 0;
>         int n;
>         do {
>             int nread = 0;
>
>             // read to EOF which may read more or less than buffer size
>             while ((n = read(buf, nread, buf.length - nread)) > 0) {
>                 nread += n;
>             }
>
>             if (nread > 0) {
>                 if (MAX_BUFFER_SIZE - total < nread) {
>                     throw new OutOfMemoryError("Required array size
> too large");
>                 }
>                 total += nread;
>                 byte[] copy = (n < 0 && nread == DEFAULT_BUFFER_SIZE) ?
>                     buf : Arrays.copyOf(buf, nread);
>                 if (result == null) {
>                     result = copy;
>                 } else {
>                     bufs = new ArrayList<>(8);
>                     bufs.add(result);
>                     bufs.add(copy);
>                 }
>             }
>         } while (n >= 0); // if the last call to read returned -1,
> then break
>
>         if (bufs == null) {
>             return result == null ? new byte[0] : result;
>         }
>
>         result = new byte[total];
>         int offset = 0;
>         for (byte[] b : bufs) {
>             System.arraycopy(b, 0, result, offset, b.length);
>             offset += b.length;
>         }
>
>         return result;
>     }
>
>
> There is a possibility that JIT already avoids allocating ArrayList
> utilizing EA if all involved ArrayList methods inline, so this
> potential optimization should be tested 1st to see if it actually
> helps improve the "small file" case.
>
> Regards, Peter
>

Reply | Threaded
Open this post in threaded view
|

Re: RFR 8193832: Performance of InputStream.readAllBytes() could be improved

Peter Levart
Hi Brian,

I found another improvement. If you are reading from files, there's no
difference. But if you read from say socket(s), there may be short reads
(read() returning 0). Your current code bails out from inner loop when
this happens and consequently does not fill the buf up to the top when
it could fill it if it retried the inner loop. This is a variant where
inner loop guarantees that either buf is filled to the top or stream is
at EOF:

     public byte[] readAllBytes() throws IOException {
         List<byte[]> bufs = null;
         byte[] result = null;
         byte[] buf = new byte[DEFAULT_BUFFER_SIZE];
         int total = 0;
         int remaining; // # of bytes remaining to fill buf
         do {
             remaining = buf.length;
             int n;
             // read to fill the buf or until EOF. Loop ends when either:
             // - remaining == 0 and there's possibly more to be read
from stream; or
             // - remaining > 0 and the stream is at EOF
             while (remaining > 0 &&
                    (n = read(buf, buf.length - remaining, remaining))
 >= 0) {
                 remaining -= n;
             }

             int nread = buf.length - remaining;
             if (nread > 0) {
                 if (MAX_BUFFER_SIZE - total < nread) {
                     throw new OutOfMemoryError("Required array size too
large");
                 }
                 total += nread;
                 byte[] copy;
                 if (remaining == 0) { // buf is filled to the top
                     copy = buf;
                     buf = new byte[DEFAULT_BUFFER_SIZE];
                 } else {
                     copy = Arrays.copyOf(buf, nread);
                 }
                 if (result == null) {
                     result = copy;
                 } else {
                     bufs = new ArrayList<>(8);
                     bufs.add(result);
                     bufs.add(copy);
                 }
             }
         } while (remaining == 0); // there may be more bytes in stream

         if (bufs == null) {
             return result == null ? new byte[0] : result;
         }

         result = new byte[total];
         int offset = 0;
         for (byte[] b : bufs) {
             System.arraycopy(b, 0, result, offset, b.length);
             offset += b.length;
         }

         return result;
     }



Regards, Peter

On 12/20/2017 12:59 PM, Peter Levart wrote:

> Hi Brian,
>
> There's also a variation of copy-ing fragment possible, that replaces
> copying with allocation:
>
>                 byte[] copy;
>                 if (nread == DEFAULT_BUFFER_SIZE) {
>                     copy = buf;
>                     if (n >= 0) {
>                         buf = new byte[DEFAULT_BUFFER_SIZE];
>                     }
>                 } else {
>                     copy = Arrays.copyOf(buf, nread);
>                 }
>
> For big FileInputStream(s), the buf will be fully read (nread ==
> DEFAULT_BUFFER_SIZE) most of the times. So this might be an
> improvement if allocation (involving pre-zeroing) is faster than
> Arrays.copyOf() which avoids pre-zeroing, but involves copying.
>
> Regards, Peter
>
> On 12/20/2017 12:45 PM, Peter Levart wrote:
>> Hi Brian,
>>
>> On 12/20/2017 12:22 AM, Brian Burkhalter wrote:
>>> On Dec 19, 2017, at 2:36 PM, Brian Burkhalter
>>> <[hidden email]> wrote:
>>>
>>>>> You can also simplify the “for(;;) + break" into a do while loop:
>>>>>
>>>>> do {
>>>>>   int nread = 0;
>>>>>   ...
>>>>> } while (n > 0);
>>>> Good suggestion but I think that this needs to be "while (n >= 0)."
>>> Updated version here:
>>>
>>> http://cr.openjdk.java.net/~bpb/8193832/webrev.02/
>>>
>>> Thanks,
>>>
>>> Brian
>>
>> Looks good. There is one case that could be further optimized. When
>> result.length <= DEFAULT_BUFFER_SIZE, the allocation of ArrayList
>> could be avoided. Imagine a use case where lots of small files are
>> read into byte[] arrays. For exmaple:
>>
>>     public byte[] readAllBytes() throws IOException {
>>         List<byte[]> bufs = null;
>>         byte[] result = null;
>>         byte[] buf = new byte[DEFAULT_BUFFER_SIZE];
>>         int total = 0;
>>         int n;
>>         do {
>>             int nread = 0;
>>
>>             // read to EOF which may read more or less than buffer size
>>             while ((n = read(buf, nread, buf.length - nread)) > 0) {
>>                 nread += n;
>>             }
>>
>>             if (nread > 0) {
>>                 if (MAX_BUFFER_SIZE - total < nread) {
>>                     throw new OutOfMemoryError("Required array size
>> too large");
>>                 }
>>                 total += nread;
>>                 byte[] copy = (n < 0 && nread == DEFAULT_BUFFER_SIZE) ?
>>                     buf : Arrays.copyOf(buf, nread);
>>                 if (result == null) {
>>                     result = copy;
>>                 } else {
>>                     bufs = new ArrayList<>(8);
>>                     bufs.add(result);
>>                     bufs.add(copy);
>>                 }
>>             }
>>         } while (n >= 0); // if the last call to read returned -1,
>> then break
>>
>>         if (bufs == null) {
>>             return result == null ? new byte[0] : result;
>>         }
>>
>>         result = new byte[total];
>>         int offset = 0;
>>         for (byte[] b : bufs) {
>>             System.arraycopy(b, 0, result, offset, b.length);
>>             offset += b.length;
>>         }
>>
>>         return result;
>>     }
>>
>>
>> There is a possibility that JIT already avoids allocating ArrayList
>> utilizing EA if all involved ArrayList methods inline, so this
>> potential optimization should be tested 1st to see if it actually
>> helps improve the "small file" case.
>>
>> Regards, Peter
>>
>

Reply | Threaded
Open this post in threaded view
|

Re: RFR 8193832: Performance of InputStream.readAllBytes() could be improved

Alan Bateman


On 20/12/2017 12:40, Peter Levart wrote:
> Hi Brian,
>
> I found another improvement. If you are reading from files, there's no
> difference. But if you read from say socket(s), there may be short
> reads (read() returning 0).
InputStreams are blocking so if someone creates an InputStream over a
socket configured non-blocking then they have to emulate blocking
behavior. So assuming a non-zero byte array, then read should return a
positive value or -1.

-Alan.
Reply | Threaded
Open this post in threaded view
|

Re: RFR 8193832: Performance of InputStream.readAllBytes() could be improved

Chris Hegarty
In reply to this post by Brian Burkhalter-2

On 19/12/17 23:22, Brian Burkhalter wrote:
> ..
> Updated version here:
>
> http://cr.openjdk.java.net/~bpb/8193832/webrev.02/

Looks good to me. This is a nice improvement on the
original implementation that I added in 9. Thanks.

-Chris.
Reply | Threaded
Open this post in threaded view
|

Re: RFR 8193832: Performance of InputStream.readAllBytes() could be improved

Peter Levart
In reply to this post by Alan Bateman


On 12/20/2017 03:09 PM, Alan Bateman wrote:

>
>
> On 20/12/2017 12:40, Peter Levart wrote:
>> Hi Brian,
>>
>> I found another improvement. If you are reading from files, there's
>> no difference. But if you read from say socket(s), there may be short
>> reads (read() returning 0).
> InputStreams are blocking so if someone creates an InputStream over a
> socket configured non-blocking then they have to emulate blocking
> behavior. So assuming a non-zero byte array, then read should return a
> positive value or -1.
>
> -Alan.

You are right Alan. I don't know why I assumed that 0 is a valid return
value (for non-empty array). So my last suggestion is unnecessary. Each
buf will be filled to the top before inner loop exits or the stream will
be at EOF.

Regards, Peter


Reply | Threaded
Open this post in threaded view
|

Re: RFR 8193832: Performance of InputStream.readAllBytes() could be improved

Brian Burkhalter-2
Hi Peter,

On Dec 20, 2017, at 8:04 AM, Peter Levart <[hidden email]> wrote:

>>>  found another improvement. If you are reading from files, there's no difference. But if you read from say socket(s), there may be short reads (read() returning 0).
>> InputStreams are blocking so if someone creates an InputStream over a socket configured non-blocking then they have to emulate blocking behavior. So assuming a non-zero byte array, then read should return a positive value or -1.
>>
>> -Alan.
>
> You are right Alan. I don't know why I assumed that 0 is a valid return value (for non-empty array). So my last suggestion is unnecessary. Each buf will be filled to the top before inner loop exits or the stream will be at EOF.

Thanks for the suggestions. I’ll take a look at the first two later today.

Brian
Reply | Threaded
Open this post in threaded view
|

Re: RFR 8193832: Performance of InputStream.readAllBytes() could be improved

John Rose-3
In reply to this post by Peter Levart
On Dec 20, 2017, at 3:45 AM, Peter Levart <[hidden email]> wrote:
>
> allocation of ArrayList could be avoided. Imagine a use case where lots of small files are read into byte[] arrays

+1 I was going to write a similar suggestion; thanks for sending it out.

These sorts of things often need to be designed to perform well at all
scales, not just large scales.

The new ArrayList(8) is dead weight if the data fits in the buffer.
So it's bad for scale < buffer length.

The earlier new ArrayList(128) is a noticeable overhead if the data
fits in two buffers.  So it's not so good for scale = M * (buffer length)
where M is about 2.

For a large scale result (M > 10), the footprint difference between
ArrayList(N) for various N is a second-order fraction.

— John

P.S. Road not taken:  Instead of new ArrayList(8) you could use
the default ArrayList constructor, and allocate it unconditionally.
It has a very low overhead if the ArrayList remains empty.  Using
that constructor could motivate you to simplify the code to use
the ArrayList unconditionally, since that would be just a single
heap node if it is not used to gather multiple results.  But I like
the "null" formulation despite its complexity.  So I'd probably
keep it the way Peter wrote it.

Reply | Threaded
Open this post in threaded view
|

Re: RFR 8193832: Performance of InputStream.readAllBytes() could be improved

Paul Sandoz


> On 20 Dec 2017, at 11:04, John Rose <[hidden email]> wrote:
>
> On Dec 20, 2017, at 3:45 AM, Peter Levart <[hidden email] <mailto:[hidden email]>> wrote:
>>
>> allocation of ArrayList could be avoided. Imagine a use case where lots of small files are read into byte[] arrays
>
> +1 I was going to write a similar suggestion; thanks for sending it out.
>

I was a little lassiaz-faire given that 8K bytes were anyway being allocated upfront. Peter’s changes look good.

Brian, i would double check the tests to make sure the various code paths are tested.

Paul.

> These sorts of things often need to be designed to perform well at all
> scales, not just large scales.
>
> The new ArrayList(8) is dead weight if the data fits in the buffer.
> So it's bad for scale < buffer length.
>
> The earlier new ArrayList(128) is a noticeable overhead if the data
> fits in two buffers.  So it's not so good for scale = M * (buffer length)
> where M is about 2.
>
> For a large scale result (M > 10), the footprint difference between
> ArrayList(N) for various N is a second-order fraction.
>
> — John
>
> P.S. Road not taken:  Instead of new ArrayList(8) you could use
> the default ArrayList constructor, and allocate it unconditionally.
> It has a very low overhead if the ArrayList remains empty.  Using
> that constructor could motivate you to simplify the code to use
> the ArrayList unconditionally, since that would be just a single
> heap node if it is not used to gather multiple results.  But I like
> the "null" formulation despite its complexity.  So I'd probably
> keep it the way Peter wrote it.
>

Reply | Threaded
Open this post in threaded view
|

Re: RFR 8193832: Performance of InputStream.readAllBytes() could be improved

Brian Burkhalter-2
In reply to this post by Peter Levart
Hi Peter,

On Dec 20, 2017, at 3:45 AM, Peter Levart <[hidden email]> wrote:

>                 if (result == null) {
>                     result = copy;
>                 } else {
>                     bufs = new ArrayList<>(8); // <— ?
>                     bufs.add(result);
>                     bufs.add(copy);
>                 }

I am probably missing something here, but if the do-while loop iterates three or more times with nread > 0 each time won’t data be lost? Should this not instead be:

                if (result == null) {
                    result = copy;
                } else {
                    if (bufs == null) {
                        bufs = new ArrayList<>(8);
                        bufs.add(result);
                    }
                    bufs.add(copy);
                }

Thanks,

Brian
Reply | Threaded
Open this post in threaded view
|

Re: RFR 8193832: Performance of InputStream.readAllBytes() could be improved

Brian Burkhalter-2
In reply to this post by Paul Sandoz
On Dec 20, 2017, at 11:52 AM, Paul Sandoz <[hidden email]> wrote:

> I was a little lassiaz-faire given that 8K bytes were anyway being allocated upfront. Peter’s changes look good.
>
> Brian, i would double check the tests to make sure the various code paths are tested.


http://cr.openjdk.java.net/~bpb/8193832/webrev.03/

The patch is updated to:

* use Peter’s approach to avoid allocating an ArrayList when length <= DEFAULT_BUFFER_SIZE;
* use the default ArrayList constructor instead of that with a specific initial capacity;
* update the test to ensure that lengths which require three buffers are covered.

Thanks,

Brian
Reply | Threaded
Open this post in threaded view
|

Re: RFR 8193832: Performance of InputStream.readAllBytes() could be improved

Peter Levart
In reply to this post by Brian Burkhalter-2
Hi Brian,

Brian Burkhalter je 20. 12. 2017 ob 22:54 napisal:

> Hi Peter,
>
> On Dec 20, 2017, at 3:45 AM, Peter Levart <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>>                 if (result == null) {
>> result = copy;
>>                 } else {
>> bufs = new ArrayList<>(8); // <— ?
>> bufs.add(result);
>> bufs.add(copy);
>>                 }
>
> I am probably missing something here, but if the do-while loop
> iterates three or more times with nread > 0 each time won’t data be
> lost? Should this not instead be:
>
>                 if (result == null) {
>                     result = copy;
>                 } else {
>                     if (bufs == null) {
>                         bufs = new ArrayList<>(8);
>                         bufs.add(result);
>                     }
>                     bufs.add(copy);
>                 }
>
> Thanks,
>
> Brian

Yes, of course. Good catch. Next time I should try running the code
before proposing it...

webrev.03 looks good.

Regards, Peter

Reply | Threaded
Open this post in threaded view
|

Re: RFR 8193832: Performance of InputStream.readAllBytes() could be improved

Peter Levart
In reply to this post by Brian Burkhalter-2
Hi Brian,

On 12/20/2017 11:30 PM, Brian Burkhalter wrote:

> On Dec 20, 2017, at 11:52 AM, Paul Sandoz <[hidden email]> wrote:
>
>> I was a little lassiaz-faire given that 8K bytes were anyway being allocated upfront. Peter’s changes look good.
>>
>> Brian, i would double check the tests to make sure the various code paths are tested.
>
> http://cr.openjdk.java.net/~bpb/8193832/webrev.03/
>
> The patch is updated to:
>
> * use Peter’s approach to avoid allocating an ArrayList when length <= DEFAULT_BUFFER_SIZE;
> * use the default ArrayList constructor instead of that with a specific initial capacity;
> * update the test to ensure that lengths which require three buffers are covered.
>
> Thanks,
>
> Brian

This is OK as is, but I see another possible improvement to the logic.
You decide whether it is worth trying to implement it. Currently the
logic reads stream into buffers of DEFAULT_BUFFER_SIZE and adds them to
an ArrayList, except the last buffer which is 1st copied into a shorter
buffer before being appended to the list. This copying is unnecessary.
The copied buffer has the same content, but shorter length. But the
information about the length of final buffer is contained elsewhere too
(for example implicitly in 'total'). So you copuld change the final
"gathering" loop to extract this information for the final buffer and
there would be no redundant copying of final buffer necessary.

What do you think?

Regards, Peter


12