Test your knowledge of essential AWS data and analytics services with this quiz on Glue, Athena, and Redshift. It covers cloud-based data integration, querying, and warehousing concepts, and suits both foundational learning and exam preparation.
Which capability best describes the primary use of AWS Glue in a data workflow?
Explanation: The main functionality of AWS Glue is to catalog data and perform ETL (extract, transform, load) tasks. This helps in organizing, cleaning, and preparing data for analytics. Running SQL queries on static files is a core feature of Athena, not Glue. Data warehousing for structured data is better associated with Redshift. Managing NoSQL databases is unrelated to Glue's primary purpose.
If you need to run SQL queries on data stored in S3 buckets without loading it into a database, which tool should you use?
Explanation: Athena lets users query data in S3 directly with standard SQL, making it ideal for quickly analyzing raw or semi-structured files. Glue is designed for ETL and cataloging. Redshift is a data warehouse that normally expects data to be loaded into it before querying. Elasticsearch, while used for searching log data, is not a tool for running SQL queries on S3 files.
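As a rough illustration of the point above, the following sketch assembles the arguments for Athena's StartQueryExecution API. The table, database, and bucket names are hypothetical, and the block only builds the request so that it runs without AWS access; the commented lines show how it might be sent with boto3, assuming credentials are configured.

```python
def build_athena_request(sql, database, output_location):
    """Assemble keyword arguments for Athena's StartQueryExecution API."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Hypothetical names, used only for illustration.
request = build_athena_request(
    sql="SELECT status, COUNT(*) AS hits FROM access_logs GROUP BY status",
    database="example_db",
    output_location="s3://example-results-bucket/athena/",
)
print(request)

# With boto3 installed and AWS credentials configured, the request could be sent as:
#   import boto3
#   athena = boto3.client("athena")
#   response = athena.start_query_execution(**request)
#   print(response["QueryExecutionId"])
```

Note that nothing is loaded into a database first: the query runs against the files in S3, and the results are written to the output location.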
Redshift is primarily suited for which of the following tasks in a cloud environment?
Explanation: Redshift is designed for massively parallel processing, making it excellent for large-scale data warehousing and analytics. It is not intended for indexing JSON documents, which is handled by different tools. Scheduling ETL jobs is primarily a function of Glue. Real-time messaging is outside the core scope of Redshift.
What is the role of a crawler in Glue when connected to a new data source?
Explanation: Crawlers in Glue scan data sources to automatically discover schemas and create metadata tables in the data catalog. They do not delete duplicate files or optimize SQL query performance directly. Encrypting data in transit is usually managed by security configurations, not by crawlers themselves.
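To make schema discovery concrete, here is a toy sketch of the idea behind a crawler: sample the data, guess a type for each column, and record the result as table metadata. This is not Glue's actual classifier logic; the function names and the sample data are invented for illustration, and only three type names are considered.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest type that fits every sample value in a column."""
    for type_name, caster in (("bigint", int), ("double", float)):
        try:
            for value in values:
                caster(value)
            return type_name
        except ValueError:
            continue
    return "string"


def infer_schema(csv_text):
    """Return {column_name: type_name} inferred from a CSV sample with a header row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {name: infer_type([row[name] for row in rows]) for name in rows[0]}


# Hypothetical sample file contents.
sample = "order_id,amount,country\n1,19.99,DE\n2,5.00,US\n"
schema = infer_schema(sample)
print(schema)
```

A real crawler would then store this kind of column-to-type mapping as a table definition in the data catalog, where query engines can look it up.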
Which AWS analytics service allows you to apply schema-on-read when analyzing S3 data?
Explanation: Athena applies the schema to the data only when you read or query it, known as schema-on-read. Glue assists in cataloging schemas but does not perform the querying itself. Redshift typically uses schema-on-write by loading fully structured data. DynamoDB is a NoSQL database and not an analytics tool.
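Schema-on-read can be sketched in a few lines: the stored data carries no types, and a schema is applied only when the data is read. The example below is a simplified stand-in rather than how Athena is implemented; the raw data and both schemas are made up for illustration.

```python
import csv
import io

# Raw file contents, stored as-is with no schema attached.
RAW = "2024-01-05,42,3.50\n2024-01-06,17,4.25\n"


def read_with_schema(raw_text, schema):
    """Apply (name, caster) pairs to each raw row at read time."""
    for fields in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


# Two different schemas applied to the same unchanged bytes.
typed_schema = [("day", str), ("units", int), ("price", float)]
text_schema = [("day", str), ("units", str), ("price", str)]

typed_rows = list(read_with_schema(RAW, typed_schema))
text_rows = list(read_with_schema(RAW, text_schema))
print(typed_rows[0])
print(text_rows[0])
```

With schema-on-write, by contrast, the types would be enforced once at load time and the stored data could not be reinterpreted this way.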
Why does Redshift use columnar storage for its data tables?
Explanation: Columnar storage allows Redshift to read only the columns a query actually needs, which speeds up analytical queries on large datasets. This approach does not increase storage costs; it often reduces them, because values of the same type stored together compress well. Columnar storage is not a mechanism for handling unstructured data, as Redshift is designed primarily for structured data. Nor does it restrict Redshift to particular file types.
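The benefit of reading only the needed columns can be shown with a small in-memory comparison. This is a conceptual sketch, not Redshift's storage engine: the table contents are invented, and "values read" is simply a count of fields touched.

```python
rows = [
    {"order_id": 1, "customer": "a", "country": "DE", "amount": 20.0},
    {"order_id": 2, "customer": "b", "country": "US", "amount": 5.0},
    {"order_id": 3, "customer": "c", "country": "DE", "amount": 12.5},
]

# Columnar layout: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def total_from_rows(rows):
    """Row layout: every field of every row is touched to reach 'amount'."""
    values_read = sum(len(row) for row in rows)
    return sum(row["amount"] for row in rows), values_read


def total_from_columns(columns):
    """Columnar layout: only the 'amount' column is read."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)


row_total, row_reads = total_from_rows(rows)
col_total, col_reads = total_from_columns(columns)
print(row_total, row_reads)
print(col_total, col_reads)
```

Both layouts give the same total, but the columnar version touches a quarter of the values here, and the gap widens as tables gain more columns.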
You need to automate a daily ETL process. Which Glue feature handles this scheduling?
Explanation: Glue triggers can be set up to initiate ETL jobs on a schedule or based on events, automating regular data workflows. Athena workgroups are used for managing query execution. Redshift clusters refer to compute resources for warehousing, not scheduling. Lambda policies are related to permissions for serverless compute functions.
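For a daily schedule, a trigger definition might look like the sketch below, which assembles the arguments for Glue's CreateTrigger API. The trigger and job names are hypothetical, and the block only builds the request so that it runs without AWS access; the final comment shows how it might be submitted with boto3, assuming credentials are configured.

```python
def build_daily_trigger(trigger_name, job_name, hour_utc):
    """Assemble keyword arguments for Glue's CreateTrigger API (scheduled type)."""
    if not 0 <= hour_utc <= 23:
        raise ValueError("hour_utc must be between 0 and 23")
    return {
        "Name": trigger_name,
        "Type": "SCHEDULED",
        # AWS cron fields: minutes hours day-of-month month day-of-week year
        "Schedule": f"cron(0 {hour_utc} * * ? *)",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


# Hypothetical trigger and job names, used only for illustration.
trigger = build_daily_trigger("nightly-orders-trigger", "orders-etl-job", hour_utc=2)
print(trigger)

# With boto3 installed and AWS credentials configured:
#   import boto3
#   boto3.client("glue").create_trigger(**trigger)
```

The schedule expression here fires once a day at 02:00 UTC; an event-based workflow would use a different trigger type instead of a cron expression.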
If you have diverse data sources and want a unified metadata repository, which service's catalog should you use?
Explanation: The Glue Data Catalog provides a unified metadata repository, accessible by multiple analytics services for consistent data discovery and management. Athena uses the Glue Data Catalog but does not create one of its own. Redshift Engine relates to data warehousing, not cataloging. Workgroup Metadata is not a standard service feature in this context.
How is pricing typically calculated when using Athena to query data in S3?
Explanation: Athena charges according to the volume of data scanned by each query, which makes query optimization (partitioning, compression, columnar formats) important for controlling cost. Neither the number of tables nor schema complexity directly affects cost. Athena requires no monthly subscription; costs are pay-per-query.
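A back-of-the-envelope estimate shows why the amount of data scanned matters. In this sketch the default price per terabyte, the use of binary units, and the rounding rule (up to the next megabyte, with a 10 MB minimum per query) are all assumptions to be checked against current Athena pricing for your region.

```python
import math

MB = 1024 ** 2
TB = 1024 ** 4


def estimate_query_cost(bytes_scanned, price_per_tb=5.00, min_mb=10):
    """Estimate the charge for one query from the bytes it scanned.

    price_per_tb and min_mb are assumed values, not authoritative pricing.
    """
    billed_mb = max(math.ceil(bytes_scanned / MB), min_mb)
    return billed_mb * MB / TB * price_per_tb


# Scanning a full 2 TB dataset versus roughly one twentieth of it,
# for example after partition pruning or conversion to a columnar format.
full_scan = estimate_query_cost(2 * TB)
pruned_scan = estimate_query_cost(2 * TB // 20)
print(full_scan, pruned_scan)
```

Under these assumptions, cutting the data scanned by a factor of twenty cuts the estimated charge by about the same factor, regardless of how many tables or columns the schema defines.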
Which Redshift feature enables querying data stored directly in S3, extending the warehouse's capacity?
Explanation: Redshift Spectrum lets users run SQL queries that combine data held in Redshift with data stored directly in S3, so the warehouse can reach data that was never loaded into it. The Redshift COPY command loads data into Redshift; it does not query S3 in place. Glue jobs perform ETL operations, not direct querying on behalf of Redshift. Athena Connector is not a feature of Redshift.