PySpark tips for beginners

Be careful when you use .collect()

Do not call .collect() on a large RDD or DataFrame. collect() returns every row to the driver, so the driver may run out of memory if the RDD or DataFrame is too large to fit in the driver's memory. Use the take() function instead. You can specify the count with take that reduces the number…

Create and initialize a list in Python

Create a list and initialize it with some default values.

    # create a list of 10 elements with default value 0
    >>> my_list = [0]*10
    >>> my_list
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Later you can assign the value to…
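As a rough sketch of how the pre-sized list might be used afterwards: the index and value below are made up for illustration, and the nested-list part is an extra caution that is not in the excerpt. Multiplying a list repeats references to the same items, which is harmless for immutable defaults such as 0 but surprising for nested lists.

```python
# Pre-size a list with a default value, then overwrite a slot by index.
my_list = [0] * 10
my_list[3] = 42          # hypothetical index and value, for illustration only

# Multiplying a list that holds a mutable item repeats the same object.
shared = [[0] * 3] * 2
shared[0][0] = 1         # also visible through shared[1], the same inner list

# A list comprehension builds an independent inner list for each row.
grid = [[0] * 3 for _ in range(2)]
grid[0][0] = 1           # changes only the first row
```

For flat lists of numbers or strings, `[default] * n` is fine; the comprehension form is mainly worth reaching for when the default value is itself a list, dict, or other mutable object.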