Spark Streaming for beginners

Whether you are running an eCommerce store and want to put up a dashboard that shows the number of orders processed every minute, or run a very popular blog and would like to display trending articles on your website, or have any other scenario like this, all of these…

PySpark tips for beginners

Be careful when you use .collect(). Do not call .collect() on an RDD or DataFrame. Your driver may run out of memory if the RDD or DataFrame is too large to fit on a node. Use the take() function instead. You can specify the count with take that reduces the number…

PySpark on AWS Fargate

I'm using AWS Batch to run a few PySpark batch jobs, and often a submitted job takes a few minutes to start. This delay may range from 2 to 15 minutes depending on the availability of EC2 machines and on the configuration provided. This is something…

Cost of workarounds

When we use a product for something other than what it was intended for, we have to make some workarounds. Otherwise it wouldn't work. People hacking the product to make it fit a new scope may get intellectual satisfaction, but ultimately it is going to cost a lot that we…

Capture text from the web with Flask and JavaScript

We read a lot of stuff on the web and sometimes would like to make a note of some of it so that we can refer to it later. There are a few products that charge a few $/month and provide this service, but we can easily write a browser extension to…

How not to use Athena !!!

For those who don't know, AWS Athena is a query service that makes it easy to read data stored in an S3 bucket using SQL queries. It is optimized for querying huge amounts of data, and you don't even need to set up any infrastructure. But little did I know, it…

Write your personal money manager for fun and free !!!

At some point you may have used a money manager to track your expenses, categorize them, etc. All of these work great, but the only issue is that you need to share very sensitive information with a third party. If they are tracking SMS, then they know all about…

How to process nested arrays in JSON with Athena.

Suppose you are writing an application for a library. Instead of storing the book inventory in a traditional database, you decide to use S3. Each book record is converted to JSON, stringified, written to a file, and stored in S3 as an object. To read this, you create tables in Athena and…