This section focuses on a core set of exam objectives related to preparing data for machine learning workloads.
Domain 1: Data Preparation for Machine Learning
Task Statement 1.1: Ingest and store data
Selecting the appropriate data format is a foundational decision that directly affects storage efficiency, query performance, and downstream machine learning workflows. Each format offers trade-offs in terms of structure, performance, and compatibility.
Flat and semi-structured formats such as CSV and JSON are commonly used for initial data ingestion, legacy system integration, and API-based data exchange. These formats are human-readable and easy to generate but become inefficient at scale due to larger file sizes and slower query performance.
Columnar storage formats such as Parquet and ORC are optimized for analytical workloads. They support advanced compression and predicate pushdown, making them ideal for large-scale data lakes and ML training datasets. However, they are less intuitive to inspect manually and require additional processing tools.
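The advantage of a columnar layout can be sketched in plain Python. This is an illustration of the core idea behind Parquet and ORC, not an actual Parquet implementation: when data is stored one list per column, a query only has to scan the columns it references, which is what makes predicate pushdown and column pruning effective.

```python
# Illustrative sketch (plain Python, not real Parquet): the same records
# stored row-wise versus column-wise. A columnar layout lets a query touch
# only the columns it needs, which is the core idea behind Parquet and ORC.

rows = [
    {"user_id": 1, "country": "US", "spend": 120.0},
    {"user_id": 2, "country": "DE", "spend": 80.0},
    {"user_id": 3, "country": "US", "spend": 45.5},
]

# Column-oriented layout: one list per column.
columns = {
    "user_id": [r["user_id"] for r in rows],
    "country": [r["country"] for r in rows],
    "spend":   [r["spend"] for r in rows],
}

def total_spend_for(country):
    """Sum spend for one country, reading only two of the three columns."""
    return sum(
        s for c, s in zip(columns["country"], columns["spend"]) if c == country
    )

print(total_spend_for("US"))  # 165.5
```

A real columnar file adds per-column compression and min/max statistics on top of this layout, which is why analytical engines such as Athena can skip entire blocks of data.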
Binary row-based formats such as Avro and RecordIO are frequently used in streaming and ML-specific workflows. Avro supports schema evolution, making it well suited for event-driven pipelines, while RecordIO is optimized for high-throughput ML training workloads in Amazon SageMaker but has limited interoperability outside the AWS ecosystem.
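Avro's schema evolution works by resolving records written with an older writer schema against a newer reader schema, filling in defaults for fields that did not exist yet. The sketch below illustrates that resolution rule in plain Python; a real pipeline would use an Avro library, and the schema dictionary format here is a simplified stand-in for Avro's JSON schema.

```python
# Hypothetical sketch of Avro-style schema resolution in plain Python:
# a reader schema adds a field with a default, so records written with the
# older schema can still be read. Real pipelines would use an Avro library;
# this only illustrates the resolution rule.

reader_schema = {
    "fields": [
        {"name": "event_id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "currency", "type": "string", "default": "USD"},  # new field
    ]
}

def resolve(record, schema):
    """Apply the reader schema: keep known fields, fill defaults for missing ones."""
    out = {}
    for field in schema["fields"]:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            raise ValueError(f"missing required field {field['name']}")
    return out

old_record = {"event_id": "e-1", "amount": 9.99}  # written before 'currency' existed
print(resolve(old_record, reader_schema))
# {'event_id': 'e-1', 'amount': 9.99, 'currency': 'USD'}
```

This is why Avro suits long-lived event streams: producers and consumers can upgrade their schemas independently, as long as new fields carry defaults.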
Best Practices for Data Format Selection
CSV and JSON are appropriate for lightweight ingestion and interoperability but should be avoided for large analytical queries. Parquet and ORC are preferred for feature engineering and training due to efficient compression and query performance. Avro is ideal for streaming pipelines where schema evolution is required, while RecordIO is best used when training large models directly in SageMaker.
AWS offers multiple storage services, each optimized for specific performance, scalability, and cost requirements.
Object Storage (Amazon S3)
Amazon S3 is the primary foundation for data lakes, ML datasets, and unstructured data storage. It supports multiple storage classes ranging from low-latency access for frequently used data to archival tiers for long-term retention. Features such as S3 Select and Amazon Athena enable querying data directly in S3 without moving it, while S3 Transfer Acceleration improves global upload performance.
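To make "querying data in place" concrete, the sketch below shows the shape of an S3 Select request. The bucket and key names are hypothetical, and no AWS call is made here; in practice this dictionary would be passed to boto3's `s3.select_object_content()`, which runs the SQL expression server-side and returns only the matching bytes.

```python
# Sketch of an S3 Select request. Bucket and key names are hypothetical;
# only the request structure is shown, no AWS call is made. In practice
# these parameters go to boto3's s3.select_object_content().

select_request = {
    "Bucket": "example-ml-datasets",             # hypothetical bucket
    "Key": "clickstream/2024/01/events.csv",     # hypothetical object
    "ExpressionType": "SQL",
    # S3 Select evaluates the SQL server-side and streams back only matches.
    "Expression": "SELECT s.user_id, s.spend FROM s3object s WHERE s.country = 'US'",
    "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "NONE"},
    "OutputSerialization": {"JSON": {}},
}

print(select_request["Expression"])
```

Filtering at the storage layer like this avoids transferring the full object, which matters when only a small slice of a large dataset feeds a training job.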
Block Storage (Amazon EBS)
Amazon EBS is designed for low-latency, high-throughput workloads. General-purpose SSD volumes provide balanced performance, while Provisioned IOPS volumes are optimized for performance-sensitive databases. Throughput-optimized and cold HDD volumes offer cost-effective options for large sequential workloads and backups.
File Storage (Amazon EFS and Amazon FSx)
Amazon EFS provides a fully managed, elastic file system that automatically scales and supports shared access across compute instances. Amazon FSx offers specialized file systems, including NetApp ONTAP for enterprise NAS workloads and Lustre for high-performance ML training and HPC use cases with native S3 integration.
Streaming ingestion is critical for real-time analytics and event-driven machine learning pipelines.
Amazon Kinesis enables real-time data ingestion through Kinesis Data Streams for custom streaming applications, Kinesis Data Firehose for managed delivery to storage and analytics services, and Kinesis Video Streams for video-based ML workloads.
Apache Flink is commonly used with Amazon Kinesis Data Analytics (since renamed Amazon Managed Service for Apache Flink) to process streaming data with stateful, event-driven logic. For Kafka-based architectures, Amazon MSK provides a fully managed Apache Kafka service that scales streaming ML pipelines with minimal operational overhead.
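Shards are the unit of throughput in Kinesis Data Streams, and understanding how records land on shards explains why increasing shard count relieves ingestion lag. Kinesis hashes each record's partition key with MD5 into a 128-bit integer that falls into one shard's hash-key range; the sketch below approximates that routing with evenly split ranges (an illustration, not the service implementation).

```python
import hashlib

# Sketch of how Kinesis Data Streams routes records: the MD5 hash of the
# partition key, taken as a 128-bit integer, falls into one shard's hash-key
# range. Evenly splitting the range among N shards approximates resharding;
# this is an illustration, not the service implementation.

def shard_for(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index via its 128-bit MD5 hash."""
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2**128 // shard_count
    return min(h // range_size, shard_count - 1)  # guard against rounding at the top

# A single hot partition key always lands on one shard; many distinct keys
# spread load, which is why key design matters as much as shard count.
for key in ["user-1", "user-2", "user-3"]:
    print(key, "-> shard", shard_for(key, 4))
```

This also shows the failure mode behind "hot shard" throttling: if most records share one partition key, adding shards does not help until the key space is diversified.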
AWS provides multiple mechanisms to extract and process data from storage services for ML workflows.
Data stored in Amazon S3 can be queried directly using Amazon Athena or transformed using AWS Glue. Amazon RDS supports read replicas to offload ML training queries, while AWS Database Migration Service can be used to continuously replicate relational data into S3.
Amazon DynamoDB supports data extraction through DynamoDB Streams and can be queried using PartiQL. For shared file systems, Amazon EFS provides high availability through mount targets, while FSx for Lustre delivers high-throughput access for ML training jobs.
The optimal data format depends on the workload characteristics. Real-time streaming pipelines typically use JSON or Avro, while analytical queries and feature engineering workloads benefit from Parquet or ORC. API-based data exchange commonly relies on JSON, legacy systems favor CSV, and pipelines requiring schema evolution should use Avro.
Amazon SageMaker provides native tools to streamline ML data ingestion and preparation.
SageMaker Data Wrangler simplifies data exploration, transformation, and visualization while integrating directly with Amazon S3, Amazon Redshift, and third-party data sources such as Snowflake. SageMaker Feature Store enables centralized feature management with an online store for low-latency inference and an offline store for batch training, fully integrated with SageMaker Pipelines and AWS Glue.
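The online/offline split can be made concrete with a minimal in-memory sketch. This hypothetical class is a stand-in for SageMaker Feature Store, not its API: the online store keeps only the latest value per entity for low-latency lookup, while the offline store appends every version for batch training and point-in-time joins.

```python
from datetime import datetime, timezone

# Minimal sketch of the online/offline split in a feature store (a
# hypothetical in-memory stand-in for SageMaker Feature Store, not its API).

class TinyFeatureStore:
    def __init__(self):
        self.online = {}    # entity_id -> latest feature record (low-latency reads)
        self.offline = []   # append-only history of every ingested record

    def ingest(self, entity_id, features):
        record = {"entity_id": entity_id, **features,
                  "event_time": datetime.now(timezone.utc).isoformat()}
        self.online[entity_id] = record   # overwrite: latest value wins
        self.offline.append(record)       # keep full history for training

store = TinyFeatureStore()
store.ingest("user-1", {"avg_spend": 40.0})
store.ingest("user-1", {"avg_spend": 55.0})

print(store.online["user-1"]["avg_spend"])  # 55.0 (latest value, for inference)
print(len(store.offline))                   # 2 (full history, for training)
```

Serving the same feature definitions to both paths is what prevents training/serving skew, which is the main reason to centralize features rather than recompute them per pipeline.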
Data from multiple sources is commonly combined during feature engineering and model preparation.
AWS Glue supports large-scale ETL using PySpark, Amazon Athena enables SQL-based transformations directly on S3 data, and Apache Spark on Amazon EMR provides flexible, distributed processing.
A typical workflow involves extracting CSV logs from S3, converting them to Parquet using AWS Glue, storing the transformed data back in S3, and querying it with Athena for ML training.
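The transform step of that workflow can be sketched with the standard library alone. In practice AWS Glue would read the logs from S3 and write Parquet; here a small CSV log is parsed, types are cast, and records are emitted as JSON lines simply to make the extract-transform-load stages concrete. The field names are hypothetical.

```python
import csv
import io
import json

# Stdlib sketch of the extract-and-transform step of a log-processing job.
# A real pipeline would use AWS Glue to read from S3 and write Parquet;
# JSON lines stand in for the columnar output here. Field names are made up.

raw_csv = """timestamp,user_id,latency_ms
2024-01-01T00:00:00Z,1,120
2024-01-01T00:00:01Z,2,95
"""

records = []
for row in csv.DictReader(io.StringIO(raw_csv)):   # extract: parse CSV rows
    records.append({
        "timestamp": row["timestamp"],
        "user_id": int(row["user_id"]),            # transform: cast str -> int
        "latency_ms": float(row["latency_ms"]),    # transform: cast str -> float
    })

jsonl = "\n".join(json.dumps(r) for r in records)  # load: serialize for storage
print(jsonl)
```

The type casts are the part Glue's schema inference automates: CSV carries everything as strings, so an explicit schema is what makes the output queryable with correct types in Athena.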
Performance and reliability issues can occur at various stages of ingestion and storage.
Slow data retrieval can be mitigated by converting datasets to Parquet and enabling S3 Transfer Acceleration. Capacity limitations in Amazon RDS can be addressed using read replicas or migrating analytical workloads to Amazon Redshift. Kinesis ingestion lag is often resolved by increasing shard count or optimizing consumer applications. Data corruption risks can be reduced by enforcing schemas and validation rules using AWS Glue Schema Registry.
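Schema enforcement at ingestion time can be sketched as follows, in the spirit of what AWS Glue Schema Registry provides for streams (this plain-Python validator and its schema format are illustrative, not the registry's API): records that do not match the declared shape are rejected before they can corrupt downstream datasets.

```python
# Sketch of schema enforcement at ingestion, in the spirit of AWS Glue
# Schema Registry. The schema format and validator are hypothetical plain
# Python, shown only to illustrate the reject-before-write pattern.

schema = {"event_id": str, "amount": float}

def validate(record: dict) -> bool:
    """Accept only records with exactly the declared fields and types."""
    return (set(record) == set(schema)
            and all(isinstance(record[k], t) for k, t in schema.items()))

good = {"event_id": "e-1", "amount": 9.99}
bad = {"event_id": "e-2", "amount": "9.99"}   # wrong type: string, not float

print(validate(good))  # True
print(validate(bad))   # False
```

Rejecting (or dead-lettering) invalid records at the edge is far cheaper than cleaning a corrupted training dataset after the fact.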
Different workloads require different balances between cost and performance. High-speed, low-latency workloads benefit from Amazon FSx for Lustre or Provisioned IOPS EBS volumes. Long-term archival storage is best served by S3 Glacier or S3 Intelligent-Tiering. Scalable data lake architectures typically rely on S3 with Parquet, while real-time analytics pipelines often combine Amazon Kinesis with DynamoDB.
Understand storage service trade-offs: Amazon S3 emphasizes scalability and cost efficiency, Amazon EBS provides low-latency block storage, Amazon EFS supports scalable shared file systems, and Amazon FSx targets high-performance workloads.
Be fluent in streaming ingestion concepts: Kinesis Data Streams powers real-time pipelines, Kinesis Data Firehose enables managed, batched delivery to S3, and Amazon MSK supports Kafka-based architectures.
Optimize data formats for analytics: Parquet and ORC accelerate analytical queries, Avro supports schema evolution, and CSV and JSON are simple but inefficient at scale.
Leverage SageMaker tools effectively: Data Wrangler simplifies preprocessing, while Feature Store supports both training and inference workflows.
Know how to resolve ingestion bottlenecks: scaling Kinesis shards improves throughput, and combining Athena with Parquet significantly reduces query latency.