A CSV file written from a numpy.float64 array (the default dtype of numpy.array) is roughly twice the size of one written from numpy.float32. float32 is therefore often the better choice when the extra precision is not needed (it is also the default PyTorch floating-point dtype).

Since numpy defaults to float64 for floats and int64 for integers, remember that defining the dtype explicitly when creating a numpy array can save a huge amount of memory.
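As a quick illustration (the array size here is an arbitrary example):

```python
import numpy as np

# Default dtype is float64: 8 bytes per element
a64 = np.array([0.5] * 1_000_000)

# Explicit float32 halves the per-element footprint to 4 bytes
a32 = np.array([0.5] * 1_000_000, dtype=np.float32)

print(a64.dtype, a64.nbytes)  # float64 8000000
print(a32.dtype, a32.nbytes)  # float32 4000000
```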

When working with a DataFrame, you will often meet another common dtype: "object". For a column whose values repeat frequently, converting it from object to category speeds up computation and reduces memory use.

Below is an example function to optimize pd.DataFrame dtypes for numeric and string columns.
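A minimal sketch of such a function. The name `optimize_dtypes` and the 50% uniqueness threshold for the category conversion are illustrative choices, not a fixed rule:

```python
import pandas as pd

def optimize_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns and convert repetitive strings to category."""
    df = df.copy()
    n_rows = max(len(df), 1)
    for col in df.columns:
        dtype = df[col].dtype
        if pd.api.types.is_float_dtype(dtype):
            # Downcast float64 -> float32 where possible
            df[col] = pd.to_numeric(df[col], downcast="float")
        elif pd.api.types.is_integer_dtype(dtype):
            # Downcast int64 -> int8/int16/int32 where possible
            df[col] = pd.to_numeric(df[col], downcast="integer")
        elif dtype == object:
            # Convert to category only when values repeat a lot
            if df[col].nunique() / n_rows < 0.5:
                df[col] = df[col].astype("category")
    return df
```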

Another easy and efficient way to reduce the pd.DataFrame memory footprint is to import only the columns you need, using the usecols parameter of pd.read_csv().
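For example (the column names and data are placeholders, and io.StringIO stands in for a real file):

```python
import io
import pandas as pd

csv_data = "user_id,price,comment\n1,9.99,ok\n2,4.50,meh\n"

# usecols loads only the listed columns and skips the rest entirely;
# combining it with dtype avoids a second conversion pass
df = pd.read_csv(io.StringIO(csv_data),
                 usecols=["user_id", "price"],
                 dtype={"price": "float32"})

print(df.columns.tolist())  # ['user_id', 'price']
print(df["price"].dtype)    # float32
```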

3. Avoid using global variables, instead utilize local objects

Python retrieves a local variable faster than a global one. Moreover, declaring too many variables as global can lead to an out-of-memory issue, since globals stay in memory until the program finishes, whereas local variables are deleted as soon as the function returns, releasing the memory they occupy. Read more at The real-life skill set that data scientists must master
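A small sketch of the lookup difference using timeit (the function names are illustrative; binding the global to a default argument is one common way to force a local lookup):

```python
import timeit

data = list(range(1000))

def sum_global():
    total = 0
    for x in data:       # 'data' is resolved via the slower global lookup
        total += x
    return total

def sum_local(items=data):
    total = 0
    for x in items:      # 'items' is resolved via the fast local lookup
        total += x
    return total

print(timeit.timeit(sum_global, number=5_000))
print(timeit.timeit(sum_local, number=5_000))
```

On CPython the local-lookup version is typically measurably faster, though the exact gap depends on the interpreter version.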

4. Use yield keyword

A function containing the yield keyword returns a generator object instead of computing all of its values up front. To get the values, the generator has to be iterated, which resumes the function until the next yield. You can read a generator's values with list(), a for loop, or next().

>>> def say_hello():
...     yield "HELLO!"
...
>>> SENTENCE = say_hello()
>>> print(next(SENTENCE))
HELLO!

However, generators are one-time-use objects: once exhausted, iterating them a second time produces nothing.

>>> def say_hello():
...     yield "HELLO!"
...
>>> SENTENCE = say_hello()
>>> print(next(SENTENCE))
HELLO!
>>> print("calling the generator again:", list(SENTENCE))
calling the generator again: []

Since no value is produced until the generator object is iterated, defining a generator function uses almost no memory, whereas a function that returns a full collection allocates memory for all of its values at once.

Hence, yield is suitable for large datasets, or when you only need one value per iteration rather than storing all the output values at once.

>>> import sys
>>> my_generator_list = (i*2 for i in range(100000))
>>> print(f"My generator is {sys.getsizeof(my_generator_list)} bytes")
My generator is 128 bytes
>>>...
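To make the saving concrete, here is a rough comparison of the same sequence materialized as a list versus produced lazily by a generator (exact byte counts vary by Python version):

```python
import sys

# The full sequence held in memory at once
squares_list = [i * 2 for i in range(100000)]

# The same sequence produced one value at a time
squares_gen = (i * 2 for i in range(100000))

print(sys.getsizeof(squares_list))  # hundreds of kilobytes
print(sys.getsizeof(squares_gen))   # a small fixed size, regardless of range
```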

Continue reading: https://towardsdatascience.com/optimize-memory-tips-in-python-3bbb44512937?source=rss—-7f60cf5620c9—4

Source: towardsdatascience.com