This section maps directly to the following AWS Certified Machine Learning Engineer – Associate exam objectives:
Domain 1: Data Engineering
Task Statement 1.2: Identify and Implement a Data Ingestion Solution
◆◆◆◆◆◆
Data ingestion refers to the process of collecting, transforming, and loading data into storage systems for downstream machine learning workflows. AWS offers multiple ingestion approaches depending on whether data arrives periodically in large volumes or continuously in real time.
Batch ingestion is best suited for workloads that process data at scheduled intervals or in large file-based transfers. Streaming ingestion is designed for real-time data pipelines that support continuous processing and low-latency ML inference.
Batch ingestion is commonly implemented using services such as AWS Glue, Amazon EMR, AWS Lambda for event-triggered batch jobs, and the Amazon Redshift COPY command. Streaming ingestion typically relies on Amazon Kinesis, Amazon Kinesis Data Firehose, and Amazon Managed Service for Apache Flink.
Batch ingestion is appropriate when machine learning models depend on periodically refreshed datasets rather than real-time inputs.
AWS Glue is a fully managed ETL service designed to process both structured and semi-structured data at scale.
Common use cases include cleansing and normalizing ML datasets, extracting data from Amazon S3, DynamoDB, or Amazon RDS, and loading transformed data into Amazon Redshift. Glue also registers metadata in the AWS Glue Data Catalog, enabling query access through Amazon Athena.
Key capabilities include schema discovery using crawlers, job orchestration through Glue Workflows, and no-code data preparation using AWS Glue DataBrew.
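A Glue ETL job is typically started programmatically once upstream data lands. The sketch below uses the real boto3 `start_job_run` API; the job name and the `--INPUT_PATH`/`--OUTPUT_PATH` argument names are hypothetical placeholders that a Glue script would read via `getResolvedOptions`.

```python
def build_job_args(input_path: str, output_path: str) -> dict:
    """Build the Arguments map passed to a Glue job run.
    The --INPUT_PATH/--OUTPUT_PATH names are illustrative; the
    bookmark flag tells Glue to skip already-processed data."""
    return {
        "--INPUT_PATH": input_path,
        "--OUTPUT_PATH": output_path,
        "--job-bookmark-option": "job-bookmark-enable",
    }

def start_etl_job(job_name: str, input_path: str, output_path: str) -> str:
    """Start a Glue job run and return its run ID (requires AWS credentials)."""
    import boto3  # deferred so build_job_args stays testable offline
    glue = boto3.client("glue")
    resp = glue.start_job_run(
        JobName=job_name,
        Arguments=build_job_args(input_path, output_path),
    )
    return resp["JobRunId"]
```

Enabling job bookmarks is what makes scheduled batch runs incremental rather than full reloads.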
Amazon EMR provides a managed environment for big data frameworks such as Apache Spark, Hadoop, and Hive.
It is frequently used for training ML models on very large datasets, performing feature engineering at scale, and processing terabytes or petabytes of training data before storing results in Amazon S3.
EMR supports decoupled storage using S3 or HDFS, automatic cluster scaling for cost efficiency, and native integration with Amazon SageMaker for Spark-based ML workflows.
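On EMR, Spark-based feature engineering is usually submitted as a cluster step. A minimal sketch, assuming a PySpark script already staged in S3 (the step name and script URI are placeholders); `command-runner.jar` with `spark-submit` is the standard way to run Spark steps on EMR:

```python
def build_spark_step(name: str, script_s3_uri: str) -> dict:
    """EMR step definition that runs a PySpark script via spark-submit.
    Passed in the Steps list of run_job_flow() or add_job_flow_steps()."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
        },
    }

def submit_step(cluster_id: str, step: dict) -> str:
    """Attach the step to a running cluster (requires AWS credentials)."""
    import boto3  # deferred so build_spark_step stays testable offline
    emr = boto3.client("emr")
    resp = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return resp["StepIds"][0]
```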
AWS Lambda is ideal for lightweight, event-driven batch ingestion scenarios triggered by S3 uploads or database updates.
Typical use cases include transforming CSV files before storage, validating incoming data, or performing lightweight feature extraction before persisting data to S3.
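An S3-triggered transformation function might look like the sketch below. The event shape is the standard S3 notification payload; the cleanup logic and the `clean/` output prefix are illustrative assumptions.

```python
import csv
import io
import urllib.parse

def normalize_csv(raw: str) -> str:
    """Lower-case the header row and strip whitespace from every field."""
    rows = [[field.strip() for field in row]
            for row in csv.reader(io.StringIO(raw))]
    rows[0] = [h.lower() for h in rows[0]]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()

def handler(event, context):
    """Lambda entry point for S3 ObjectCreated events: read the uploaded
    CSV, normalize it, and write a cleaned copy under a clean/ prefix."""
    import boto3  # deferred so normalize_csv stays testable offline
    s3 = boto3.client("s3")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        s3.put_object(Bucket=bucket, Key=f"clean/{key}", Body=normalize_csv(body))
```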
The Amazon Redshift COPY command is the most efficient way to load large volumes of data into Redshift from sources such as Amazon S3, DynamoDB, Amazon EMR, or remote hosts over SSH.
It is commonly used for ingesting historical datasets for ML training and for analyzing large feature sets prior to model inference.
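A COPY load can be issued without a persistent database connection through the Redshift Data API. The sketch below assumes gzipped CSV files and an IAM role authorized to read the bucket; the table name and ARNs are placeholders.

```python
def build_copy_statement(table: str, s3_uri: str, iam_role_arn: str) -> str:
    """Compose a COPY statement that loads gzipped CSV files from S3,
    skipping the header row of each file."""
    return (
        f"COPY {table} FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role_arn}' "
        "FORMAT AS CSV GZIP IGNOREHEADER 1;"
    )

def run_copy(cluster_id: str, database: str, sql: str) -> str:
    """Execute the statement via the Redshift Data API (requires AWS
    credentials); returns the statement ID for status polling."""
    import boto3  # deferred so build_copy_statement stays testable offline
    client = boto3.client("redshift-data")
    resp = client.execute_statement(
        ClusterIdentifier=cluster_id, Database=database, Sql=sql
    )
    return resp["Id"]
```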
Streaming ingestion is essential for real-time ML use cases such as fraud detection, anomaly detection, and continuous model updates.
Amazon Kinesis provides scalable and durable streaming ingestion services optimized for low-latency ML pipelines.
Amazon Kinesis Data Streams supports high-throughput, real-time ingestion with fine-grained control over scaling and parallel processing. Amazon Kinesis Data Firehose enables near real-time ETL and automatic delivery to downstream AWS services. Amazon Managed Service for Apache Flink supports stateful stream processing using Apache Flink for advanced ML use cases.
Kinesis Data Streams is commonly used for ingesting sensor data, application logs, IoT telemetry, and social media feeds.
It retains data for 24 hours by default (extendable up to 365 days), allows multiple consumers using the Kinesis Client Library, and integrates directly with AWS Lambda for real-time processing and feature extraction.
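A producer batches records into a `PutRecords` call, using the partition key to control shard placement. The sketch below assumes JSON telemetry events carrying a `device_id` field; the stream name is a placeholder.

```python
import json

def build_records(events: list) -> list:
    """Serialize events into PutRecords entries. Partitioning by
    device_id keeps each device's readings in order on one shard."""
    return [
        {"Data": json.dumps(e).encode("utf-8"), "PartitionKey": str(e["device_id"])}
        for e in events
    ]

def publish(stream_name: str, events: list) -> int:
    """Send one batch (max 500 records per call) and return the number
    of records that failed and should be retried."""
    import boto3  # deferred so build_records stays testable offline
    kinesis = boto3.client("kinesis")
    resp = kinesis.put_records(StreamName=stream_name, Records=build_records(events))
    return resp["FailedRecordCount"]
```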
Kinesis Data Firehose simplifies streaming ingestion by automatically scaling and delivering data to destinations such as Amazon S3, Amazon Redshift, Amazon OpenSearch Service, and Splunk.
It supports Lambda-based transformations for enrichment, built-in compression, and near real-time delivery with minimal operational overhead.
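A Firehose transformation Lambda receives base64-encoded records and must return each one with a `recordId`, a `result`, and re-encoded data. The enrichment field added below is purely illustrative.

```python
import base64
import json

def handler(event, context):
    """Firehose data-transformation Lambda: decode each record, add an
    enrichment field, re-encode, and mark it Ok so Firehose delivers it.
    Records that should be dropped would use result "Dropped" instead."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["ingested"] = True  # illustrative enrichment
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(
                json.dumps(payload).encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```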
Amazon Managed Service for Apache Flink is a fully managed platform for advanced stream processing.
It is widely used for real-time fraud detection, anomaly detection in IoT data, and stream-based log aggregation before long-term storage in Amazon S3.
Key features include stateful event-time processing, exactly-once delivery guarantees, and seamless integration with Kinesis Data Streams, Lambda, and Amazon S3.
Reliable ML ingestion pipelines require orchestration to manage dependencies, retries, and conditional logic.
AWS Step Functions is well suited for coordinating complex ML workflows that involve multiple services.
Common use cases include chaining ingestion, transformation, and training steps, as well as automating ETL pipelines that combine AWS Glue and Kinesis.
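Such a chain is expressed in Amazon States Language. A minimal sketch: the `glue:startJobRun.sync` and `lambda:invoke` service-integration ARNs are real, while the Glue job name and the `start-training` function are hypothetical.

```python
import json

def build_pipeline_definition(glue_job: str) -> str:
    """Minimal ASL definition: run a Glue ETL job synchronously (.sync
    waits for completion), then invoke a training-trigger Lambda."""
    definition = {
        "StartAt": "IngestAndTransform",
        "States": {
            "IngestAndTransform": {
                "Type": "Task",
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": glue_job},
                "Next": "TriggerTraining",
            },
            "TriggerTraining": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": "start-training"},  # hypothetical
                "End": True,
            },
        },
    }
    return json.dumps(definition)
```

The `.sync` integration pattern is what lets Step Functions block on the Glue job instead of fire-and-forget.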
Amazon Managed Workflows for Apache Airflow (MWAA) is ideal for orchestrating complex batch pipelines with interdependent tasks.
It is commonly used to coordinate Glue jobs, Redshift queries, and S3-based feature engineering workflows using DAG-based scheduling.
Amazon EventBridge enables event-driven ingestion architectures.
It is often used to trigger Glue jobs when new data arrives in S3 or to automate serverless ingestion pipelines for ML workloads.
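The trigger is defined by an event pattern on an EventBridge rule. The sketch below builds the standard pattern for S3 "Object Created" events (the bucket must have EventBridge notifications enabled); the bucket name and prefix are placeholders.

```python
def build_s3_event_pattern(bucket: str, prefix: str) -> dict:
    """EventBridge event pattern matching S3 Object Created events
    under a key prefix; passed as EventPattern when creating a rule
    whose target is, e.g., a Glue workflow or Lambda function."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": [bucket]},
            "object": {"key": [{"prefix": prefix}]},
        },
    }
```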
AWS provides multiple scheduling options depending on pipeline complexity and execution model.
AWS Glue Workflows support time-based and event-driven scheduling for batch ETL jobs. Amazon MWAA enables DAG-based scheduling for complex ML pipelines. Amazon EventBridge supports rule-based scheduled and event-driven execution, while AWS Lambda enables lightweight event-driven invocation through S3 events and DynamoDB Streams.
Access to ingestion services should be restricted using IAM policies and roles.
Sensitive data stored in Amazon S3 should be monitored using Amazon Macie.
All data should be encrypted in transit and at rest using AWS KMS and SSL/TLS.
Kinesis Data Streams shards should be sized appropriately based on throughput requirements.
The AWS Glue Data Catalog should be used for centralized metadata management.
Compression should be enabled in Kinesis Data Firehose to improve throughput and reduce storage costs.
S3 lifecycle policies should transition raw ingestion data to lower-cost tiers such as Glacier.
Kinesis capacity modes should be selected based on traffic predictability.
Amazon EC2 Spot Instances should be used for fault-tolerant large batch processing jobs to reduce compute costs.
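As one example of the cost practices above, an S3 lifecycle rule for raw ingestion data can be sketched as follows; the rule ID, prefix, and retention periods are illustrative, and the dictionary matches the shape accepted by `put_bucket_lifecycle_configuration`.

```python
def build_lifecycle_rule(prefix: str) -> dict:
    """Lifecycle rule transitioning raw ingestion data to S3 Glacier
    after 90 days and expiring it after a year (illustrative periods).
    Applied via s3.put_bucket_lifecycle_configuration(
        Bucket=..., LifecycleConfiguration={"Rules": [rule]})."""
    return {
        "ID": "archive-raw-ingest",  # hypothetical rule name
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        "Expiration": {"Days": 365},
    }
```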
Be clear on when to use Kinesis Data Streams, Kinesis Data Firehose, and Apache Flink.
Understand how AWS Glue supports batch ETL pipelines for ML workloads.
Know when Amazon EMR is the right choice for large-scale batch data processing.
Be familiar with orchestration and scheduling using Glue Workflows, Step Functions, and EventBridge.
Understand encryption, access control, and monitoring best practices for secure ingestion pipelines.