I documented our journey with AWS Glue pretty heavily: https://tymlez-au.medium.com/
I had been interested in AWS Glue ever since I saw the keynote on it at AWS re:Invent a few years back. It seemed to have a really nice balance of “tools you know” vs “no setup”, and for a while it worked nicely.
It filled the orchestration and serverless gap that these tools often have.
Unfortunately, in order to do this it has to wipe out some of the USPs that those tools have, and in doing so it creates complications.
AWS Glue served us well for our batch ingestions, but it became clear very quickly that it wouldn’t scale to real-time ingestion in its current version, despite AWS Kinesis being a supported input.
We needed an alternative solution that was less painful.
In a nutshell, our Glue pipeline took in batch data that our Behind the Meter technology and Smart Meter readings dumped into S3 (more on this later, as it’s important) and ran it through various scripts to ETL the data into the shape we require.
You can read more about the specific setup here: https://tymlez-au.medium.com/smart-meter-data-etl-systems-at-tymlez-5643e232dfbb
Orchestration was done serially via AWS Glue Triggers (e.g. when this step is done, run the next step). Unfortunately, this meant that everything had to be deployed each time before the triggers could be used, which slows down the development process massively. Due to other problems (discussed here: https://tymlez-au.medium.com/aws-glue-is-a-mess-886cc1d13ca9), there was also no easy way to set up a local environment to run these items manually.
In short — it was producing more hassles than it was solving.
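For context, serial chaining with Glue Triggers can be sketched roughly as below. This is not our actual deployment code: the job names and the trigger naming scheme are made up for illustration, and the helpers only build the keyword arguments that boto3's `create_trigger` call accepts, so nothing here touches AWS until you pass in a real client.

```python
from typing import Any, Dict, List


def conditional_trigger(name: str, upstream_job: str, downstream_job: str) -> Dict[str, Any]:
    """Build kwargs for glue.create_trigger: run downstream_job once upstream_job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": downstream_job}],
        "StartOnCreation": True,
    }


def chain_triggers(job_names: List[str]) -> List[Dict[str, Any]]:
    """One trigger per adjacent pair of jobs, so the jobs run one after another."""
    return [
        # Trigger name format is a hypothetical convention, not a Glue requirement.
        conditional_trigger(f"{up}-then-{down}", up, down)
        for up, down in zip(job_names, job_names[1:])
    ]


def deploy(glue_client: Any, job_names: List[str]) -> None:
    """Create the triggers in AWS; glue_client is expected to be boto3.client("glue")."""
    for kwargs in chain_triggers(job_names):
        glue_client.create_trigger(**kwargs)


# Usage (needs boto3 and AWS credentials; job names are hypothetical):
#   import boto3
#   deploy(boto3.client("glue"), ["extract", "transform", "load"])
```

Every step in the chain only exists once it is deployed, which is exactly why iterating on a pipeline like this was slow.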
Simply — because of BigQuery. There is nothing on the market like Google BigQuery when it comes to processing huge amounts of data easily.
It comes down to simplicity. Our workflow is quite standard when it comes to reading from devices: a device emits JSON every x minutes, and we need to take that JSON and write it to our data warehouse.
Device -> “Processor*” -> Kinesis -> AWS Glue Table -> Athena
Getting Kinesis working with AWS Glue Tables was nearly impossible (and trying to develop this locally was even harder).
Device -> “Processor*” -> BigQuery
*Processor is some form of processing capability such as EC2, Lambda, Cloud Functions, etc.
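To make the "Processor" step concrete, here is a minimal sketch of what it could look like in Python. This is an illustration, not the code from the original article: the table ID and the reading fields are hypothetical, and it assumes the `google-cloud-bigquery` client, whose `insert_rows_json` method streams a list of dicts into a table and returns a list of per-row errors (empty on success).

```python
import json
from typing import Any, Dict, Iterable, List


def to_rows(payloads: Iterable[str]) -> List[Dict[str, Any]]:
    """Parse the raw JSON strings emitted by devices into row dicts."""
    rows = []
    for raw in payloads:
        reading = json.loads(raw)
        if not isinstance(reading, dict):
            raise ValueError(f"expected a JSON object, got {type(reading).__name__}")
        rows.append(reading)
    return rows


def ingest(client: Any, table_id: str, payloads: Iterable[str]) -> int:
    """Stream device readings into a BigQuery table and return the number of rows written.

    client is expected to be a google.cloud.bigquery.Client, or anything
    exposing the same insert_rows_json(table, rows) method.
    """
    rows = to_rows(payloads)
    if not rows:
        return 0  # nothing to send, so skip the API call
    errors = client.insert_rows_json(table_id, rows)
    if errors:
        raise RuntimeError(f"BigQuery rejected rows: {errors}")
    return len(rows)


# Usage (needs google-cloud-bigquery and GCP credentials; names are hypothetical):
#   from google.cloud import bigquery
#   ingest(bigquery.Client(), "my-project.meters.readings_dev",
#          ['{"meter_id": "m1", "kwh": 0.42}'])
```

Because the client is passed in, the same function can be exercised locally with a stub object, which is the kind of local development loop that was hard to get with Glue.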
The GCP flow just works. It works so easily that I can include the code I use to ingest for Dev right here:
Continue reading: https://towardsdatascience.com/migrating-from-aws-glue-to-bigquery-for-etl-ac12980f2036