One of my colleagues was running a batch job to insert records into an AWS Firehose stream and kept getting service unavailable exceptions. Our guess was that the batch was sending more data per second than the stream allows, so we added a log statement to print the size of the records being sent to Firehose.
After adding the log statement, we restarted the process, and when we checked the logs, every batch was reported as exactly the same size: 2464 bytes. We were using the put_record_batch API with a fixed batch size of 300 records, but the records in a given batch were certainly not all the same size, and the total should have been somewhere close to 1 MB.
This was happening because of how getsizeof works.
This was the log statement we added to print the batch size in bytes:
from sys import getsizeof

logger.info('batch size = ' + str(getsizeof(batch)))
firehoseClient.put_record_batch(
    DeliveryStreamName='my_delivery_stream',
    Records=batch
)
Here, the variable batch is a list containing the records to be sent to Firehose.
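To see the mismatch concretely, here is a minimal sketch. The batch below is a made-up stand-in for the real one: 300 records of roughly 3.5 KB each, about 1 MB of payload in total.

```python
from sys import getsizeof

# Hypothetical stand-in for the real batch: 300 records, ~1 MB of payload total.
batch = [{'Data': b'x' * 3500} for _ in range(300)]

payload = sum(len(record['Data']) for record in batch)  # actual bytes in the records
shallow = getsizeof(batch)                              # what the log statement printed

print(payload)  # over a megabyte
print(shallow)  # a few kilobytes, regardless of record contents
```

The logged number stays tiny no matter how large the individual records are.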
The way getsizeof works is that it returns the size of the object itself, not the sizes of the objects it refers to. For a list, that means the fixed overhead of the list object plus its internal array of pointers. Since our list always contained 300 records, it always returned the same number: 300 eight-byte pointers (2400 bytes) plus the 64-byte overhead of the list itself, i.e. 2464 bytes.
# size of an empty list
>>> from sys import getsizeof
>>> getsizeof([])
64

# a temporary class for demo
>>> class Temp():
...     pass

# size of a list containing a single object
>>> getsizeof([Temp()])
72
The size of an empty list is 64 bytes, and when an object is added, the list's size grows by only 8 bytes, because the list stores only a pointer to that object, not the object itself.
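And those 8 bytes per element are the same no matter how big the referenced object is. For example:

```python
from sys import getsizeof

small = getsizeof([0])           # list holding one pointer to a tiny int
big = getsizeof(['x' * 10_000])  # list holding one pointer to a ~10 KB string

print(small == big)  # True: either way, the list itself stores only a pointer
```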
To get the size including all the objects the list refers to, you can use this recipe, which recursively gets the size of each object in the list and adds it to the size of the list itself.
If you are interested, here is some more reading on the same topic: