{"id":6540,"date":"2020-07-22T10:07:25","date_gmt":"2020-07-22T10:07:25","guid":{"rendered":"https:\/\/bluetab.net\/bluetab\/"},"modified":"2020-07-22T10:07:25","modified_gmt":"2020-07-22T10:07:25","slug":"bluetab-2","status":"publish","type":"post","link":"https:\/\/bluetab.org\/en\/2020\/07\/bluetab-2\/","title":{"rendered":"Basic AWS Glue concepts"},"content":{"rendered":"<h1>Basic AWS Glue concepts<\/h1>\n<figure><a href=\"https:\/\/www.linkedin.com\/in\/alvsanand\/\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" width=\"150\" height=\"150\" src=\"https:\/\/bluetab.net\/wp-content\/uploads\/2020\/07\/Alvaro-Santos-150x150.jpg\" alt=\"\" loading=\"lazy\" srcset=\"https:\/\/bluetab.net\/wp-content\/uploads\/2020\/07\/Alvaro-Santos-150x150.jpg 150w, https:\/\/bluetab.net\/wp-content\/uploads\/2020\/07\/Alvaro-Santos-300x300.jpg 300w, https:\/\/bluetab.net\/wp-content\/uploads\/2020\/07\/Alvaro-Santos-75x75.jpg 75w, https:\/\/bluetab.net\/wp-content\/uploads\/2020\/07\/Alvaro-Santos.jpg 500w\" sizes=\"(max-width: 150px) 100vw, 150px\" \/><\/a><\/figure>\n<h4><a href=\"https:\/\/www.linkedin.com\/in\/alvsanand\/\" target=\"_blank\" rel=\"noopener\">\u00c1lvaro Santos<\/a><\/h4>\n<p>Senior Cloud Solution Architect<\/p>\n<p>At\u00a0<strong>Cloud Practice<\/strong>\u00a0we aim to encourage adoption of the cloud as a way of working in the IT world. To help with this task, we are going to publish numerous articles on good practices and use cases, as well as others covering key cloud services.<\/p>\n<p>We present the basic concepts of\u00a0<strong><a href=\"\/\/aws.amazon.com\/es\/glue\">AWS Glue<\/a><\/strong>\u00a0below.<\/p>\n<h2>What is AWS Glue?<\/h2>\n<p><strong>AWS Glue<\/strong>\u00a0is one of those\u00a0<em>AWS<\/em>\u00a0services that are relatively new but have enormous potential. 
In particular, this service can be very useful to companies that work with data but do not yet have a powerful Big Data infrastructure.<\/p>\n<p>Basically, Glue is a fully\u00a0<em>AWS<\/em>-managed, pay-as-you-go ETL service that requires no instance provisioning. To achieve this, it combines the speed and power of\u00a0<em>Apache Spark<\/em>\u00a0with the data organisation offered by\u00a0<em>Hive Metastore<\/em>.<\/p>\n<p><img decoding=\"async\" width=\"800\" height=\"454\" src=\"https:\/\/bluetab.net\/wp-content\/uploads\/2020\/07\/harmonize_glue_1.gif\" alt=\"\" loading=\"lazy\" \/><\/p>\n<h2>AWS Glue Data Catalogue<\/h2>\n<p>The Glue Data Catalogue stores the metadata for all the data sources and destinations used by Glue jobs.<\/p>\n<ul>\n<li><strong>Table<\/strong>\u00a0is a metadata definition that describes a data source; it does not hold the data itself.\u00a0<em>AWS Glue<\/em>\u00a0tables can refer to files stored in S3 (such as Parquet, CSV, etc.), RDBMS tables and so on.<\/li>\n<li><strong>Database<\/strong>\u00a0is the logical grouping of data sources to which the tables belong.<\/li>\n<li><strong>Connection<\/strong>\u00a0is a link configured between\u00a0<em>AWS Glue<\/em>\u00a0and an\u00a0<em>RDS<\/em>,\u00a0<em>Redshift<\/em>\u00a0or other\u00a0<em>JDBC<\/em>-compliant database cluster. Connections allow Glue to access the data held there.<\/li>\n<li><strong>Crawler<\/strong>\u00a0is the service that connects to a data store and works through a prioritised list of classifiers to determine the schema of the data and generate the metadata tables. Crawlers can infer the schema of complex unstructured or semi-structured data. This is especially important when working with\u00a0<em>Parquet<\/em>,\u00a0<em>Avro<\/em>, etc. 
data sources.<\/li>\n<\/ul>\n<h2>ETL<\/h2>\n<p>An\u00a0<em>ETL<\/em>\u00a0in\u00a0<em>AWS Glue<\/em>\u00a0consists primarily of scripts and other tools that use the sources configured in the\u00a0<em>Data Catalogue<\/em>\u00a0to extract, transform and load data into a defined destination.<\/p>\n<ul>\n<li><strong>Job<\/strong>\u00a0is the main ETL unit of work. A job consists of a script that loads data from the sources defined in the catalogue and transforms it. Glue can generate the script automatically, or you can write a custom one using the\u00a0<em>Apache Spark<\/em>\u00a0API in\u00a0<em>Python<\/em>\u00a0(<em>PySpark<\/em>) or\u00a0<em>Scala<\/em>. Jobs can also use external libraries, which are attached to the job as a zip file in S3.<\/li>\n<li><strong>Triggers<\/strong>\u00a0are responsible for starting the\u00a0<em>Jobs<\/em>. They can fire on a schedule, in response to a CloudWatch event or according to a cron expression.<\/li>\n<li><strong>Workflows<\/strong>\u00a0are sets of related\u00a0<em>triggers<\/em>,\u00a0<em>crawlers<\/em>\u00a0and\u00a0<em>jobs<\/em>\u00a0in\u00a0<em>AWS Glue<\/em>. 
You can use them to build a complex multi-step ETL that\u00a0<em>AWS Glue<\/em>\u00a0runs and tracks as a single entity.<\/li>\n<li><strong>ML Transforms<\/strong>\u00a0are specialised jobs that use\u00a0<em>Machine Learning<\/em>\u00a0models to create custom transforms for data cleaning, such as identifying duplicate records.<\/li>\n<li>Finally, you can also use\u00a0<strong>Dev Endpoints<\/strong>\u00a0and\u00a0<strong>Notebooks<\/strong>, which make it faster and easier to develop and test scripts.<\/li>\n<\/ul>\n<h2>Examples<\/h2>\n<p>Sample ETL script in Python:<\/p>\n<pre><code class='language-python'>import sys\nfrom awsglue.transforms import *\nfrom awsglue.utils import getResolvedOptions\nfrom pyspark.context import SparkContext\nfrom awsglue.context import GlueContext\nfrom awsglue.dynamicframe import DynamicFrame\nfrom awsglue.job import Job\n## Resolve the job arguments and initialise the job before reading any data\nargs = getResolvedOptions(sys.argv, ['JOB_NAME'])\nsc = SparkContext()\nglueContext = GlueContext(sc)\nspark = glueContext.spark_session\njob = Job(glueContext)\njob.init(args['JOB_NAME'], args)\n## Read data from an RDS DB using the JDBC driver (returns a DynamicFrame)\nconnection_option = {\n&quot;url&quot;: &quot;jdbc:mysql:\/\/mysql-instance1.123456789012.us-east-1.rds.amazonaws.com:3306\/database&quot;,\n&quot;user&quot;: &quot;test&quot;,\n&quot;password&quot;: &quot;password&quot;,\n&quot;dbtable&quot;: &quot;test_table&quot;,\n&quot;hashexpression&quot;: &quot;column_name&quot;,\n&quot;hashpartitions&quot;: &quot;10&quot;\n}\nsource_df = glueContext.create_dynamic_frame.from_options('mysql', connection_options = connection_option, transformation_ctx = &quot;source_df&quot;)\n## Convert to a Spark DataFrame (Spark transformations would go here) and back to an AWS Glue DynamicFrame\ndynamic_df = DynamicFrame.fromDF(source_df.toDF(), glueContext, &quot;dynamic_df&quot;)\n## Write the DynamicFrame to S3 in CSV format\ndatasink = glueContext.write_dynamic_frame.from_options(frame = dynamic_df, connection_type = &quot;s3&quot;, connection_options = 
{\n&quot;path&quot;: &quot;s3:\/\/glueuserdata&quot;\n}, format = &quot;csv&quot;, transformation_ctx = &quot;datasink&quot;)\njob.commit() <\/code><\/pre>\n<p>Creating a Job from the command line:<\/p>\n<pre><code class='language-python'>aws glue create-job --name python-job-cli --role Glue_DefaultRole \\\n--command '{&quot;Name&quot; : &quot;glueetl&quot;, &quot;ScriptLocation&quot; : &quot;s3:\/\/SOME_BUCKET\/etl\/my_python_etl.py&quot;}' <\/code><\/pre>\n<p>Running a Job from the command line:<\/p>\n<pre><code class='language-python'>aws glue start-job-run --job-name python-job-cli <\/code><\/pre>\n<p><em>AWS<\/em>\u00a0has also published a\u00a0<a href=\"\/\/github.com\/aws-samples\/aws-glue-samples\">repository<\/a>\u00a0with numerous example ETLs for\u00a0<em>AWS Glue<\/em>.<\/p>\n<h2>Security<\/h2>\n<p>Like all\u00a0<em>AWS<\/em>\u00a0services, Glue is designed and implemented to provide the greatest possible security. Here are some of the security features that\u00a0<em>AWS Glue<\/em>\u00a0offers:<\/p>\n<ul>\n<li><strong>Encryption at Rest<\/strong>: the service supports encryption at rest (<em>SSE-S3<\/em>\u00a0or\u00a0<em>SSE-KMS<\/em>) for all the objects it works with (metadata catalogue, connection passwords, ETL data being read or written, etc.).<\/li>\n<li><strong>Encryption in Transit<\/strong>:\u00a0<em>AWS<\/em>\u00a0offers Secure Sockets Layer (SSL) encryption for all data in motion, including\u00a0<em>AWS Glue<\/em>\u00a0API calls and traffic to other\u00a0<em>AWS<\/em>\u00a0services such as S3 and RDS.<\/li>\n<li><strong>Logging and monitoring<\/strong>: Glue is tightly integrated with\u00a0<em>AWS CloudTrail<\/em>\u00a0and\u00a0<em>Amazon CloudWatch<\/em>.<\/li>\n<li><strong>Network security<\/strong>: Glue can connect to resources inside a private\u00a0<em>VPC<\/em>\u00a0and works with\u00a0<em>Security Groups<\/em>.<\/li>\n<\/ul>\n<h2>Price<\/h2>\n<p>AWS bills for the execution time of the ETL\u00a0<em>crawlers<\/em>\u00a0\/\u00a0<em>jobs<\/em>\u00a0and for the 
use of the\u00a0<em>Data Catalogue<\/em>.<\/p>\n<ul>\n<li><strong>Crawlers<\/strong>: only\u00a0<em>crawler<\/em>\u00a0run time is billed. The price is $0.44 (eu-west-1) per DPU-hour (one DPU provides 4\u00a0vCPUs and 16\u00a0GB RAM), billed per second with a 10-minute minimum per run.<\/li>\n<li><strong>Data Catalogue<\/strong>: you can store up to one million objects at no cost; beyond that, storage costs $1.00 (eu-west-1) per 100,000 objects. In addition, $1 (eu-west-1) is billed for every 1,000,000 requests to the\u00a0<em>Data Catalogue<\/em>, of which the first million are free.<\/li>\n<li><strong>ETL Jobs<\/strong>: billed only for the time the ETL job takes to run. The price is $0.44 (eu-west-1) per DPU-hour (one DPU provides 4\u00a0vCPUs and 16\u00a0GB RAM), billed by the second.<\/li>\n<\/ul>\n<h2>Benefits<\/h2>\n<p>Although it is a young service, it is quite mature and is widely used by\u00a0<em>AWS<\/em>\u00a0customers around the world. The most important features it offers are:<\/p>\n<ul>\n<li>It automatically manages resource scaling, task retries and error handling.<\/li>\n<li>It is a serverless service:\u00a0<em>AWS<\/em>\u00a0manages the provisioning and scaling of the resources that execute the commands or queries in the Apache Spark environment.<\/li>\n<li>The crawlers can scan your data, infer schemas and store them in a centralised catalogue. They also detect schema changes.<\/li>\n<li>The Glue ETL engine automatically generates Python \/ Scala code and has a scheduler that handles job dependencies. This makes the ETLs easier to develop.<\/li>\n<li>You can query the S3 data directly with Athena and Redshift Spectrum through the Glue catalogue.<\/li>\n<\/ul>\n<h2>Conclusions<\/h2>\n<p>Like any database, tool or service,\u00a0<em>AWS Glue<\/em>\u00a0has certain limitations that need to be considered before adopting it as an ETL service. You therefore need to bear in mind that:<\/p>\n<ul>\n<li>It is highly focused on working with data sources in\u00a0<em>S3<\/em>\u00a0(CSV, Parquet, etc.) 
and\u00a0<em>JDBC<\/em>\u00a0(MySQL, Oracle, etc.).<\/li>\n<li>The learning curve is steep. If your team comes from the traditional ETL world, you will need to allow time for them to learn\u00a0<em>Apache Spark<\/em>.<\/li>\n<li>Unlike other ETL tools, it lacks out-of-the-box compatibility with many third-party services.<\/li>\n<li>It is not a conventional ETL tool through and through: because it runs on Spark, code optimisations need to be performed manually.<\/li>\n<li><em>AWS Glue<\/em>\u00a0did not support streaming data until April 2020, so it is still too early to rely on\u00a0<em>AWS Glue<\/em>\u00a0as an ETL tool for real-time data.<\/li>\n<\/ul>\n<figure><a href=\"https:\/\/www.linkedin.com\/in\/elipajares\/\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" width=\"150\" height=\"150\" src=\"https:\/\/bluetab.net\/wp-content\/uploads\/2020\/07\/Alvaro-Santos-150x150.jpg\" alt=\"\" loading=\"lazy\" srcset=\"https:\/\/bluetab.net\/wp-content\/uploads\/2020\/07\/Alvaro-Santos-150x150.jpg 150w, https:\/\/bluetab.net\/wp-content\/uploads\/2020\/07\/Alvaro-Santos-300x300.jpg 300w, https:\/\/bluetab.net\/wp-content\/uploads\/2020\/07\/Alvaro-Santos-75x75.jpg 75w, https:\/\/bluetab.net\/wp-content\/uploads\/2020\/07\/Alvaro-Santos.jpg 500w\" sizes=\"(max-width: 150px) 100vw, 150px\" \/><\/a><\/figure>\n<p>\u00c1lvaro Santos<br \/>\nSenior Cloud Solution Architect<\/p>\n<p>My name is \u00c1lvaro Santos and I have been working as a Solution Architect for over 5 years. I am certified in\u00a0<em>AWS<\/em>,\u00a0<em>GCP<\/em>,\u00a0<em>Apache Spark<\/em>\u00a0and a few others. I joined Bluetab in October 2018, and since then I have been involved in cloud Banking and Energy projects and I also take part as a Cloud Master Practitioner. 
I am passionate about new distributed patterns, Big Data, open-source software and anything else cool in the IT world.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Basic AWS Glue concepts \u00c1lvaro Santos Senior Cloud Solution Architect\u200b Share on twitter Share on linkedin At\u00a0Cloud Practice\u00a0we aim to encourage adoption of the cloud as a way of working in the IT world. To help with this task, we are going to publish numerous good practice articles and use cases and others will talk [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":17832,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"elementor_header_footer","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[7,29,30],"tags":[],"class_list":["post-6540","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-blog-es","category-practices-en","category-tech-en"],"acf":[],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/bluetab.org\/en\/wp-json\/wp\/v2\/posts\/6540","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/bluetab.org\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/bluetab.org\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/bluetab.org\/en\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/bluetab.org\/en\/wp-json\/wp\/v2\/comments?post=6540"}],"version-history":[{"count":0,"href":"https:\/\/bluetab.org\/en\/wp-json\/wp\/v2\/posts\/6540\/revisions"}],"wp:attachment":[{"href":"ht
tps:\/\/bluetab.org\/en\/wp-json\/wp\/v2\/media?parent=6540"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/bluetab.org\/en\/wp-json\/wp\/v2\/categories?post=6540"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/bluetab.org\/en\/wp-json\/wp\/v2\/tags?post=6540"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}