What is Amazon Athena?
Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
Athena simplifies the process of querying large-scale datasets and integrates with various data formats, enabling quick analysis without the need to set up complex ETL processes. This service is ideal for quick, ad-hoc querying and also for complex analysis, such as large joins and window functions.
Benefits of Amazon Athena
Serverless Architecture:
Ease of Use & Quick Setup:
Built on Amazon S3 for an underlying data store:
Integrates with AWS QuickSight and Glue:
Pay Per Query:
Open Architecture:
Encryption Support:
Considerations & Limitations of Athena
You might avoid Amazon Athena when your workload requires complex data manipulation, indexing, or consistently high performance. These needs are better handled by solutions like Amazon Redshift, which offers optimized query performance and advanced database features.
Here are some limitations of Athena to keep in mind:
Limited Data Optimization: Athena optimizes at the query level only. It does not reorganize or tune the data already sitting in S3, so any gains from file layout, format, or compression have to come from how you prepare the data yourself.
Lack of Indexing: Unlike traditional databases, Athena does not support indexing. This absence can increase the operational load, especially when dealing with large datasets, leading to potential performance issues during query execution.
Partitioning Requirements: Efficient queries in Athena often require data partitioning. However, managing these partitions to ensure optimal performance can be complex and time-consuming. For example, every 500 partitions a query scans can add roughly a second to its run time.
Limited Data Manipulation Language (DML): Athena is primarily a query service. It supports INSERT INTO, but row-level UPDATE and DELETE are available only for transactional table formats such as Apache Iceberg. For ordinary tables, most data changes need to be handled outside Athena.
Resource Sharing: Athena operates on a multi-tenant model, meaning resources are shared among all users. This can lead to fluctuating performance, particularly during peak usage times.
Time-Outs on Large Tables: Queries on tables with thousands of partitions can time out, especially if partitions are not of the string type. This can require additional management to prevent such issues.
Unsupported SQL Statements: Several standard SQL features and statements, such as CREATE TABLE LIKE, UPDATE, MERGE (for non-transactional table formats), and stored procedures, are not supported in Athena. This limits its functionality compared to more traditional databases.
Hidden Files: Files in S3 that begin with a dot (.) or an underscore (_) are treated as hidden by Athena, potentially causing issues if these files contain relevant data.
Row and Column Size Limitations: The maximum size for a row or column in Athena is 32 megabytes. Exceeding this limit can result in errors, particularly when working with large datasets in formats like CSV or JSON.
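Several of the limitations above can be checked before data ever reaches a query. The following is a minimal sketch, not an official tool: the helper names are hypothetical, the hidden-file check assumes the rule applies to the final component of the object key, and the 32 MB limit is assumed to mean binary megabytes.

```python
import posixpath

MAX_ROW_BYTES = 32 * 1024 * 1024  # 32 MB limit from the text (binary MB is an assumption)
PARTITIONS_PER_SECOND = 500       # rule of thumb quoted in the list above


def is_hidden_object(key: str) -> bool:
    """True if the object's file name starts with '.' or '_'.

    Assumption: only the last path component of the key is checked.
    """
    name = posixpath.basename(key)
    return name.startswith((".", "_"))


def partition_overhead_seconds(partitions_scanned: int) -> float:
    """Rough overhead estimate: about one second per 500 partitions scanned."""
    return partitions_scanned / PARTITIONS_PER_SECOND


def oversized_rows(lines):
    """Return 1-based numbers of text rows whose UTF-8 size exceeds the limit."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if len(line.encode("utf-8")) > MAX_ROW_BYTES
    ]


# Hypothetical object keys, for illustration only.
keys = [
    "logs/year=2024/month=01/part-0000.parquet",
    "logs/year=2024/month=01/_SUCCESS",
    "logs/year=2024/month=01/.part-0001.crc",
]
hidden = [key for key in keys if is_hidden_object(key)]
print("Treated as hidden:", hidden)
print("Estimated overhead for 2,000 partitions (s):", partition_overhead_seconds(2000))
```

Running a check like this during ingestion is one way to catch data that Athena would silently skip or reject, rather than discovering it at query time.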
When do I need Athena vs Redshift vs EMR?
When to use Amazon Athena: interactive query service
When to use Amazon EMR to analyze data
When to use Amazon Redshift to analyze data
Use Amazon Redshift when you need to aggregate data from multiple sources, like inventory, financial, and sales systems, into a unified format for long-term storage and detailed analysis. Redshift is optimized for running complex queries on highly structured data, particularly those involving multiple joins across large tables. It’s the best choice when you need to build sophisticated business reports from historical data, leveraging its powerful query engine to handle large-scale, multi-table queries efficiently. For in-depth analytics on structured data over long periods, Redshift is your go-to service.
Athena vs Microsoft SQL Server
How to get started with Amazon Athena
To get started with Amazon Athena, create an Amazon S3 bucket in the same AWS Region and account as Athena (for example, US East (N. Virginia)) to hold your query results, then configure that bucket as your query output location. For the specific steps, consult the AWS documentation.
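Once the results bucket exists, a query can also be submitted programmatically. The sketch below only assembles the request parameters, so it runs without AWS credentials. The bucket, database, and table names are hypothetical placeholders, and `build_query_request` is an illustrative helper rather than part of any AWS SDK. The trailing comment shows how the result could be passed to boto3's `start_query_execution`, assuming boto3 is installed and configured.

```python
def build_query_request(sql: str, database: str, output_location: str) -> dict:
    """Assemble keyword arguments for Athena's StartQueryExecution call."""
    if not output_location.startswith("s3://"):
        raise ValueError("output_location must be an s3:// URI")
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # The S3 location where Athena writes query results.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


request = build_query_request(
    sql="SELECT status, COUNT(*) AS hits FROM access_logs GROUP BY status",
    database="example_db",                               # hypothetical database
    output_location="s3://example-athena-results/out/",  # hypothetical bucket
)
print(request["ResultConfiguration"]["OutputLocation"])

# With boto3 installed and AWS credentials configured, the request could be sent as:
#   import boto3
#   athena = boto3.client("athena", region_name="us-east-1")
#   query_id = athena.start_query_execution(**request)["QueryExecutionId"]
```

Keeping the request construction separate from the network call makes it easy to validate the output location before anything is billed.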
Amazon Athena pricing, simplified
Let’s break down exactly how much Amazon Athena will cost you, depending on your use case.
SQL Queries:
- What it is: Amazon Athena allows you to run SQL queries directly on data stored in Amazon S3 without needing to set up servers.
- Cost Basis: You are charged based on the amount of data scanned by each query.
- Price: $5.00 per terabyte (TB) of data scanned (the rate varies by AWS Region).
- Cost-Saving Tips: Compressing, partitioning, and converting your data to columnar formats (like Parquet) can reduce the amount of data scanned, potentially saving up to 90% on costs.
- Minimum Charge: Each query is billed for at least 10 MB of data scanned.
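To make the per-query arithmetic concrete, here is a rough estimator built from the figures above. It is a sketch under stated assumptions, not an official calculator: it assumes binary units (1 TB = 1024^4 bytes), assumes scanned bytes are rounded up to the next whole megabyte, and uses the $5.00 rate, which may differ in your Region.

```python
import math

MB = 1024 ** 2              # assumption: binary megabyte
TB = 1024 ** 4              # assumption: binary terabyte
PRICE_PER_TB = 5.00         # USD per TB scanned, from the list above
MIN_BILLED_BYTES = 10 * MB  # 10 MB minimum per query


def query_cost_usd(bytes_scanned: int, price_per_tb: float = PRICE_PER_TB) -> float:
    """Estimate the charge for one query from the bytes it scans."""
    # Assumption: usage is rounded up to the next whole megabyte.
    billed = math.ceil(bytes_scanned / MB) * MB
    # Apply the 10 MB per-query minimum.
    billed = max(billed, MIN_BILLED_BYTES)
    return billed / TB * price_per_tb


full_scan = query_cost_usd(1 * TB)   # scanning a full terabyte
columnar = query_cost_usd(TB // 10)  # same query touching only 10% of the bytes
tiny = query_cost_usd(1_000)         # a 1 KB scan is still billed as 10 MB

print(f"Full scan:   ${full_scan:.2f}")
print(f"10% scanned: ${columnar:.2f}")
print(f"Tiny query:  ${tiny:.6f}")
```

The second case illustrates the cost-saving tip: if compression, partitioning, or a columnar format cuts the bytes scanned to a tenth, the bill drops by about 90%.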
Provisioned Capacity:
- What it is: Provisioned Capacity in Athena provides dedicated compute resources (DPUs) for consistent performance in running SQL queries.
- Cost Basis: You pay for dedicated compute resources (Data Processing Units or DPUs) rather than per query.
- Price: $0.30 per DPU hour, billed per minute.
- Use Case: Ideal for predictable workloads where consistent performance is needed.
Apache Spark:
- What it is: Athena for Apache Spark enables you to run distributed data processing tasks using Apache Spark on data stored in Amazon S3.
- Cost Basis: You pay for the time your Spark application runs, based on the DPUs used.
- Price: $0.35 per DPU hour, billed per minute.
- Minimum Resources: Spark sessions start with at least two nodes (notebook and Spark driver).
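Provisioned Capacity and Athena for Apache Spark are both priced per DPU hour, so one small function can estimate either. This is a sketch only: the DPU counts and durations are hypothetical, and rounding partial minutes up to a whole billed minute is an assumption about how "billed per minute" is applied.

```python
import math

PROVISIONED_RATE = 0.30  # USD per DPU hour, from the Provisioned Capacity list
SPARK_RATE = 0.35        # USD per DPU hour, from the Apache Spark list


def dpu_cost_usd(dpus: int, seconds: float, rate_per_dpu_hour: float) -> float:
    """Estimate the charge for running a number of DPUs for a duration."""
    # Assumption: partial minutes are rounded up to a whole billed minute.
    billed_minutes = math.ceil(seconds / 60)
    return dpus * billed_minutes / 60 * rate_per_dpu_hour


# Hypothetical workloads, for illustration only.
spark_job = dpu_cost_usd(dpus=4, seconds=90 * 60, rate_per_dpu_hour=SPARK_RATE)
reserved_hour = dpu_cost_usd(dpus=8, seconds=3600, rate_per_dpu_hour=PROVISIONED_RATE)

print(f"Spark job, 4 DPUs for 90 minutes: ${spark_job:.2f}")
print(f"Provisioned, 8 DPUs for 1 hour:   ${reserved_hour:.2f}")
```

Comparing these figures with the per-query estimate for the same workload is one way to judge whether dedicated capacity is worth it.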
Additional Costs
- What it is: Additional costs related to using Athena include charges for S3 storage and the AWS Glue Data Catalog.
- S3 Charges: Athena queries data stored in S3, so you’ll incur standard S3 storage, request, and data transfer fees.
- Glue Data Catalog: If using AWS Glue for metadata, standard Glue pricing applies.
- Federated Queries: When querying data sources not stored in S3 using an Amazon Athena federated query, additional Lambda function costs may apply.
Understand and optimize Amazon Athena with nOps
If you’re looking to save on Amazon Athena, nOps Business Contexts makes it easy and painless to get the information you need to understand your spend, make informed decisions on purchase commitments, and reduce Athena costs.
Business Contexts transforms millions of rows of contextless data into the who, what, when and why of cloud spend — making it easy to get 100% visibility of your cloud costs and usage so your bills are never a surprise or mystery.
- Allocate 100% of your AWS costs, including EKS. Kubernetes costs are often a black box — no longer with nOps. Understand and allocate your unified AWS spend in one platform.
- Automated resource tagging. You don’t need to have all your resources tagged to allocate costs. Create dynamic rules by region, tags, operation, accounts, and usage types to allocate costs back to custom cost centers.
- 40+ views & filters. Map hourly costs by any relevant engineering concept (deployment, service, namespace, label, pod, container…) or finance concept (cost unit, purchase type, line item, cost allocation tag…).
- Custom reports & dashboards for the whole team. Monthly reporting and reconciliation can take hours; with nOps only minutes. Tailor dashboards and Slack/email reports to your needs, whether you’re a CFO or VP of Engineering.
The best part? nOps is an all-in-one solution for all of your cloud optimization needs: automated commitment management, rightsizing, resource scheduling, workload management, Spot usage, storage optimization, and more.
Join our customers using nOps to understand your cloud costs and leverage automation with complete confidence by booking a demo today!