Spark Streaming for beginners

Whether you run an eCommerce store and want to put up a dashboard that shows the number of orders processed every minute, or run a very popular blog and would like to display trending articles on your website, or have any other scenario like this, all of these…

PySpark tips for beginners

Be careful when you use .collect(). Do not call .collect() on an RDD or DataFrame. Your driver may run out of memory if the RDD or DataFrame is too large to fit in memory on one node. Use the take() function instead. You can specify a count with take(), which reduces the number…

PySpark on AWS Fargate

I'm using AWS Batch to run a few PySpark batch jobs, and often a submitted job takes a few minutes to start. The delay ranges from 2 to 15 minutes, depending on the availability of EC2 machines and on the configuration provided. This is something…

Write your first Spark application

Apache Spark is a framework with which you can process huge amounts of data at lightning-fast speed. You can run it on a single node or in a cluster, where tasks are distributed among the nodes. One use of Spark is in the ETL process, where you extract data…