AWS re:Invent 2020: Serverless data preparation with AWS Glue
The new AWS Glue engine provides 10 times faster job start times and enhanced support for the data extraction process.
In this video, Mehul Shah, General Manager of AWS Glue, talks about the new AWS Glue engine, its faster job start times and its enhanced support for the data extraction process.
At 0:40, Mehul begins the session by discussing the need for data preparation. Data preparation is the process of collecting all the required data, transforming it, cleaning it and normalizing it.
These steps ensure the data is of sufficient quality to run analytics, gain business insights and build machine learning models. He adds that data preparation is hard because of the sheer volume of data involved.
In addition, customizing data and managing infrastructure make the process harder still.
In 2017, AWS decided to build and launch AWS Glue, a serverless extract, transform and load (ETL) service. The main purpose of AWS Glue is to make the data preparation process simpler, cheaper and faster.
At 5:01, Mehul introduces a few modern use cases that drive the growth of Glue. The centerpiece of Glue is its serverless ETL engine, which is based on Apache Spark.
Glue spins up the necessary servers to run your scripts and provides a number of visual tools for developing your jobs interactively. It also provides crawlers that automatically scan your S3 buckets, databases, schemas and table structures, and load the data catalog for you.
At 6:42, he talks about the major use case, which is building data lakes on AWS. Customers break down their data silos and operational databases, then use Glue to ingest the data from those sources into S3.
The data can then be processed and refined from stage to stage, while AWS Glue crawlers load and maintain the data catalog. At 8:09, Mehul mentions two other common use cases for Glue: building data warehouses, and preparing data for AI/ML and data science.
Trend: More demanding workloads
At 10:02, Mehul discusses the recent AWS Glue innovations. Three major trends are driving the new AWS Glue features. The first trend is ‘More demanding workloads’. In recent years, customers have begun to put many more real-time, micro-batch workloads on Glue.
These workloads are latency-sensitive and require continuous operation. To support this use case, a brand-new engine, AWS Glue 2.0, was built for real-time workloads. The new engine enables micro-batching and latency-sensitive workloads, starts jobs 10 times faster and is more cost-effective (1-minute minimum billing, 45% cost savings on average).
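The effect of a 1-minute billing minimum on short micro-batch jobs can be worked through with a little arithmetic. The sketch below is illustrative only: it assumes per-second billing in DPU-hours, a 10-minute minimum for earlier Glue versions, and a placeholder price of $0.44 per DPU-hour. None of these figures come from the talk, and the actual price varies by region. The function name billed_cost and the 90-second job are hypothetical.

```python
def billed_cost(run_seconds, dpus, min_seconds, price_per_dpu_hour=0.44):
    """Cost of one job run, billed per second subject to a minimum duration.

    price_per_dpu_hour is an assumed placeholder, not a quoted AWS price.
    """
    billed_seconds = max(run_seconds, min_seconds)
    return dpus * (billed_seconds / 3600) * price_per_dpu_hour


# A hypothetical 90-second micro-batch job running on 10 DPUs.
old = billed_cost(90, dpus=10, min_seconds=600)  # assumed 10-minute minimum
new = billed_cost(90, dpus=10, min_seconds=60)   # 1-minute minimum from the talk

print(f"10-minute minimum: ${old:.4f}")
print(f"1-minute minimum:  ${new:.4f}")
print(f"saving for this run: {100 * (1 - new / old):.0f}%")
```

The saving for any single run depends on how far its duration falls below the old minimum, so a short job like this one saves far more than the 45% average quoted in the session, while a job that already runs longer than the old minimum saves nothing from this change alone.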
At 11:45, he talks about the AWS Glue execution model. Apache Spark, on which Glue is based, runs data-parallel jobs: the jobs are divided into stages, and the data is divided into shards that are processed concurrently.
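The stage-and-shard model can be illustrated with a toy example in plain Python. This is not Spark or Glue code: a thread pool stands in for the cluster's workers, and the helper names (make_shards, run_stage, clean, count_words) are made up for the illustration. Each stage applies the same task to every shard concurrently, and the next stage starts once the previous one has finished.

```python
from concurrent.futures import ThreadPoolExecutor


def make_shards(records, num_shards):
    """Split the records into num_shards pieces (round-robin)."""
    return [records[i::num_shards] for i in range(num_shards)]


def run_stage(shards, task, max_workers=4):
    """Run one stage: apply the same task to every shard concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, shards))  # results keep shard order


def clean(shard):
    # Stage 1: normalize each record in the shard.
    return [record.strip().lower() for record in shard]


def count_words(shard):
    # Stage 2: compute partial counts for this shard only.
    counts = {}
    for record in shard:
        counts[record] = counts.get(record, 0) + 1
    return counts


records = [" Apple", "banana ", "APPLE", "Cherry", "banana", "apple "]
shards = make_shards(records, num_shards=3)

cleaned = run_stage(shards, clean)           # stage 1 over all shards
partials = run_stage(cleaned, count_words)   # stage 2 over all shards

# Final step: merge the per-shard partial results into one answer.
totals = {}
for partial in partials:
    for key, n in partial.items():
        totals[key] = totals.get(key, 0) + n
print(totals)
```

In a real Spark job the shards live on different machines and the merge step involves moving data between them, but the shape of the computation is the same.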
Trend: More personas want to prepare data
At 17:37, Mehul explains the next trend, which is ‘More personas want to prepare data’. Customers now have many more types of personas. Previously, Glue was used primarily by developers. In recent years, ETL developers, data engineers, data scientists and business analysts have all had requirements that Glue could fulfill.
To support this use case, AWS introduced AWS Glue Studio, a new visual ETL interface that makes it easy to author, run and monitor AWS Glue ETL jobs. It supports advanced transforms through code snippets and helps in monitoring thousands of jobs through a single pane of glass. AWS Glue Studio reduces the time spent learning Apache Spark and makes it faster to write and deploy ETL jobs.
Trend: Highly partitioned data
At 23:59, he talks about the next trend, which is ‘Highly partitioned data’. As micro-batch workloads grow and more data comes into the system, customers start storing millions of partitions in the data catalog, with an additional level of granularity.
A new feature, ‘Partition indexes’, was added to support this use case. Partition indexes improve query performance and also support range-based predicates. EMR Hive and EMR Spark use partition indexes today. With these indexes, overall query execution time is reduced by up to 75%. At 28:23, Mehul talks about the latest feature, ‘Glue custom connectors’.
You can build your own connector and reuse it in Glue Studio, or choose a connector from AWS Marketplace. With custom connectors, you can easily create a job in Glue Studio, filter data at the source and integrate with AWS Secrets Manager.
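Returning to the partition indexes described earlier, the reason an index helps with range-based predicates can be shown with a toy model. The sketch below is not how the Glue Data Catalog is implemented: a sorted Python list stands in for the index, the date-keyed partitions are made up, and the function names are hypothetical. Without an index, every partition is tested against the predicate; with one, a binary search finds both ends of the range and only the matching partitions are touched.

```python
from bisect import bisect_left, bisect_right

# Hypothetical catalog: one partition per day, keyed by an ISO date string.
partitions = [
    f"2020-{month:02d}-{day:02d}"
    for month in range(1, 13)
    for day in range(1, 29)
]


def scan_range(parts, low, high):
    """Without an index: compare every partition key to the predicate."""
    return [p for p in parts if low <= p <= high]


def build_index(parts):
    """A sorted list of partition keys, standing in for a partition index."""
    return sorted(parts)


def index_range(index, low, high):
    """With an index: binary-search both bounds, then take the slice."""
    return index[bisect_left(index, low):bisect_right(index, high)]


index = build_index(partitions)
low, high = "2020-06-10", "2020-06-20"

matches = index_range(index, low, high)
print(len(matches), "of", len(partitions), "partitions match")
```

Both functions return the same partitions; the difference is that the full scan does work proportional to the total number of partitions, which matters once the catalog holds millions of them.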
In this video, the speaker gives a clear overview of data preparation, the purpose of AWS Glue, the trends behind its new features, and the benefits those features bring.