
Data Ingestion

This section focuses on the core exam objectives related to ingesting data into AWS environments for processing, storage, and analytics.

Domain 1: Data Ingestion and Transformation
Task Statement 1.1: Perform Data Ingestion


Understanding Data Ingestion in AWS

Data ingestion is the process of collecting, transferring, and loading data from diverse sources into AWS services where it can be stored, transformed, and analyzed. These sources may include applications, databases, IoT devices, SaaS platforms, or on-premises systems. AWS supports multiple ingestion patterns to accommodate different latency, volume, and processing requirements.


Types of Data Ingestion

Data ingestion approaches vary based on how frequently data arrives and how quickly it must be processed.

  • Streaming ingestion handles continuous, high-velocity data flows with minimal latency. This model is ideal for real-time analytics, monitoring, and event processing. Common services include Amazon Kinesis, Amazon MSK, and DynamoDB Streams.
  • Batch ingestion processes data at scheduled intervals, making it suitable for reporting, analytics, and data warehousing workloads. AWS Glue, Amazon EMR, AWS DMS, Amazon S3, and Amazon Redshift COPY are commonly used for batch ingestion.
  • Event-driven ingestion reacts to specific system events or data changes, enabling near-real-time processing without continuous polling. Typical services include Amazon S3 event notifications, Amazon EventBridge, and AWS Lambda.

Streaming Data Ingestion

AWS Streaming Services

AWS provides several managed services for ingesting and processing streaming data:

  • Amazon Kinesis Data Streams supports real-time data capture with low latency and horizontal scalability.
  • Amazon Kinesis Data Firehose enables continuous delivery of streaming data to destinations such as Amazon S3, Amazon Redshift, and Amazon OpenSearch Service with minimal operational overhead.
  • Amazon Managed Streaming for Apache Kafka (MSK) provides a fully managed Kafka environment for pub-sub messaging and distributed stream processing.
  • AWS Database Migration Service (DMS) supports ongoing database replication using change data capture (CDC).

Configuring Kinesis for Data Ingestion

The following example demonstrates how to publish a record to a Kinesis data stream using the AWS SDK for Python (Boto3):

import boto3

kinesis = boto3.client("kinesis")
kinesis.put_record(
    StreamName="my-data-stream",
    Data=b"sample data",  # the Data parameter expects bytes, not str
    PartitionKey="key1"
)

Fan-In and Fan-Out Streaming Patterns

Streaming architectures commonly use fan-in and fan-out patterns:

  • Fan-in allows multiple producers to send data into a single stream, which is typical for centralized logging or telemetry pipelines.
  • Fan-out enables a single data producer to feed multiple downstream consumers, often implemented using Kinesis Data Streams with AWS Lambda.
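The consumer side of these patterns can be sketched as a simple shard poller. The function below is illustrative only (the stream name and shard ID in the usage line are placeholders), and it takes the Kinesis client as a parameter; fan-out at scale is usually built on Lambda event source mappings or enhanced fan-out rather than polling:

```python
def read_shard(kinesis, stream_name, shard_id="shardId-000000000000", limit=10):
    """Poll one shard of a Kinesis stream and return its records.

    A minimal polling consumer for illustration; production fan-out
    consumers typically use Lambda event source mappings or enhanced
    fan-out instead of polling shards directly.
    """
    # Start reading from the oldest available record in the shard.
    iterator = kinesis.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]
    return kinesis.get_records(ShardIterator=iterator, Limit=limit)["Records"]
```

Usage: records = read_shard(boto3.client("kinesis"), "my-data-stream")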

Batch Data Ingestion

Batch Ingestion Services

Batch ingestion workloads are commonly implemented using the following AWS services:

  • AWS Glue for serverless ETL pipelines with schema discovery and transformation capabilities.
  • Amazon EMR for large-scale batch processing using frameworks such as Apache Spark and Hadoop.
  • AWS Lambda for lightweight, event-driven batch tasks.
  • AWS Database Migration Service (DMS) for bulk data migrations and CDC-based replication.
  • Amazon Redshift COPY for high-performance bulk data loading into data warehouses.
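As a sketch of the Redshift COPY approach, the bulk-load statement can be assembled in Python and then submitted through the Data API or any SQL client. The table name, S3 path, and IAM role ARN in the usage line are placeholders:

```python
def build_copy_statement(table, s3_uri, iam_role, fmt="CSV"):
    """Build a Redshift COPY statement for bulk-loading files from S3.

    COPY reads files in parallel from the given S3 prefix using the
    supplied IAM role for authorization. All arguments are placeholders.
    """
    return (
        f"COPY {table} "
        f"FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS {fmt};"
    )
```

Usage: build_copy_statement("sales", "s3://my-bucket/data/", "arn:aws:iam::123456789012:role/RedshiftCopyRole")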

Configuring Batch Ingestion with AWS Glue

The following example illustrates how to create an AWS Glue job programmatically:

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="batch-ingestion-job",
    Role="GlueServiceRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://scripts/batch-job.py"
    }
)
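Once the job exists, it can be started with start_job_run and monitored with get_job_run. This sketch assumes a job such as "batch-ingestion-job" has already been created and takes the Glue client as a parameter:

```python
import time

def run_glue_job(glue, job_name, poll_seconds=30):
    """Start a Glue job and poll until it reaches a terminal state.

    Assumes the named job already exists; returns the final JobRunState.
    """
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        state = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]["JobRunState"]
        if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"):
            return state
        time.sleep(poll_seconds)
```

Usage: run_glue_job(boto3.client("glue"), "batch-ingestion-job")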

Scheduling and Automating Data Ingestion

Automation plays a key role in reliable data ingestion pipelines. AWS offers several scheduling and orchestration options:

  • Amazon EventBridge enables time-based scheduling for services such as AWS Glue, Lambda, and Step Functions.
  • Amazon Managed Workflows for Apache Airflow (MWAA) orchestrates complex ETL and data pipelines.
  • AWS Glue Workflows automate multi-step ETL processes within Glue.

Example: Creating an EventBridge rule to trigger an AWS Glue job

aws events put-rule \
  --name "daily-glue-job" \
  --schedule-expression "rate(24 hours)"
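The rule above only defines the schedule; EventBridge also needs a target attached before anything runs. One common pattern is a Lambda function that calls start_job_run on the Glue job. A Boto3 sketch (the rule name and the StartGlueJob function ARN are hypothetical):

```python
def attach_lambda_target(events, rule_name, function_arn):
    """Attach a Lambda function as the target of an EventBridge rule.

    Both the rule name and function ARN are placeholders; the Lambda
    function would call glue.start_job_run when the rule fires.
    """
    return events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "1", "Arn": function_arn}],
    )
```

Usage: attach_lambda_target(boto3.client("events"), "daily-glue-job", "arn:aws:lambda:us-east-1:123456789012:function:StartGlueJob")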

Event-Driven Data Ingestion

Event-driven ingestion enables systems to react instantly to data changes or system events.

  • Amazon S3 event notifications can trigger processing when objects are created or modified.
  • Amazon EventBridge routes custom events to AWS services based on defined rules.
  • AWS Lambda processes events from services such as S3, Kinesis, and DynamoDB without managing infrastructure.

Example: Triggering a Lambda function from an S3 object creation event

{
  "LambdaFunctionConfigurations": [
    {
      "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:ProcessS3Event",
      "Events": ["s3:ObjectCreated:*"]
    }
  ]
}
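On the Lambda side, the handler receives the notification as a JSON event. A minimal sketch that extracts the bucket and key from each record (a real handler would go on to read and process each object):

```python
def lambda_handler(event, context):
    """Extract (bucket, key) pairs from an S3 event notification.

    S3 object-created events deliver one or more records, each carrying
    the bucket name and object key that triggered the notification.
    """
    objects = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        objects.append((s3["bucket"]["name"], s3["object"]["key"]))
    return objects
```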

Managing Data API Consumption

AWS also supports data ingestion through managed APIs:

  • AWS AppFlow securely transfers data between SaaS applications and AWS services.
  • Amazon Redshift Data API allows querying Redshift clusters without managing JDBC or ODBC connections.

Example: Querying Amazon Redshift using the Data API

import boto3

redshift = boto3.client("redshift-data")
response = redshift.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="awsuser",  # placeholder; provisioned clusters need DbUser or SecretArn
    Sql="SELECT * FROM my_table"
)
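Note that execute_statement is asynchronous: it returns a statement ID rather than rows, and results must be polled for. A sketch that waits for the statement to finish and then fetches its rows (the redshift-data client is passed in as a parameter):

```python
import time

def fetch_query_results(redshift_data, statement_id, poll_seconds=1):
    """Poll a Redshift Data API statement and return its rows when done.

    Raises if the statement fails or is aborted; otherwise returns the
    Records list from get_statement_result.
    """
    while True:
        status = redshift_data.describe_statement(Id=statement_id)["Status"]
        if status == "FINISHED":
            return redshift_data.get_statement_result(Id=statement_id)["Records"]
        if status in ("FAILED", "ABORTED"):
            raise RuntimeError(f"Statement {statement_id} ended with status {status}")
        time.sleep(poll_seconds)
```

Usage: fetch_query_results(boto3.client("redshift-data"), response["Id"])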

Security Best Practices for Data Ingestion

IAM and Access Control

Strong security controls are essential when ingesting data:

  • Apply least-privilege IAM roles and policies so each service can access only the data it needs.
  • Protect S3 data using bucket policies, encryption, and access logging.
  • Implement throttling and retries to handle DynamoDB API rate limits.

Example: IAM policy granting broad AWS Glue access (in production, scope Action and Resource down to only what the pipeline requires)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["glue:*"],
      "Resource": "*"
    }
  ]
}

Throttling and Rate Limits

AWS services enforce rate limits that must be managed to ensure reliable ingestion:

  • DynamoDB capacity limits can be mitigated using auto scaling.
  • Kinesis throughput limits can be addressed by increasing shard count.
  • Amazon RDS connection limits can be optimized using RDS Proxy.

Example: Implementing retry logic for Kinesis API throttling

import time
import boto3

kinesis = boto3.client("kinesis")

for attempt in range(10):
    try:
        kinesis.put_record(
            StreamName="my-stream",
            Data=b"data",
            PartitionKey="key"
        )
        break
    except kinesis.exceptions.ProvisionedThroughputExceededException:
        print("Rate limit exceeded, retrying...")
        time.sleep(2 ** attempt)  # exponential backoff

Key Exam Tips

Understand ingestion patterns and service selection

  • Real-time ingestion: Kinesis, MSK
  • Batch ingestion: AWS Glue, Amazon Redshift COPY
  • Event-driven ingestion: S3 event notifications, AWS Lambda

Know scheduling and orchestration tools

  • Use MWAA for complex workflows
  • Use EventBridge for time-based scheduling

Expect scenario-based questions

  • High-velocity streaming → Kinesis or MSK
  • Database migration → AWS DMS
  • Processing CSV or JSON files in S3 → AWS Glue or Lambda

Focus on security and reliability

  • Enforce IAM-based access control and encryption
  • Use retries and exponential backoff for API limits

Optimize cost and performance

  • Use Kinesis Data Firehose for cost-efficient streaming to S3
  • Leverage Spot Instances with EMR for batch processing

Final Thoughts

To perform well on the exam, develop a strong understanding of AWS data ingestion services, ingestion patterns, and architectural trade-offs. Be comfortable choosing between batch, streaming, and event-driven approaches, and expect real-world scenarios that test both technical knowledge and service selection strategy.
