This section aligns with the following AWS Certified Machine Learning Engineer – Associate exam objectives:
Domain 1: Data Engineering
Task Statement 1.1: Create Data Repositories for ML
◆◆◆◆◆◆
Designing effective machine learning solutions begins with identifying the right data sources. Data selection should consider the nature of the content, where the data originates, and how it will be accessed and processed throughout the ML lifecycle.
Machine learning workloads commonly rely on several categories of data:
Structured data is typically stored in relational systems and follows a predefined schema. Examples include transactional records stored in Amazon RDS or analytical datasets stored in Amazon Redshift.
Semi-structured data includes formats such as JSON, XML, and CSV. These datasets are often stored in Amazon S3, DynamoDB, or other NoSQL data stores that support flexible schemas.
Unstructured data consists of text documents, images, audio, and video files. These datasets are commonly stored in scalable object or file systems such as Amazon S3 or Amazon EFS.
Streaming data represents real-time data flows generated continuously, such as logs or clickstreams. Services like Amazon Kinesis Data Streams and Amazon MSK are commonly used to ingest and process this data.
IoT data originates from sensors and connected devices and is typically processed using AWS IoT Core and AWS IoT Analytics before being stored for ML analysis.
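For streaming sources such as clickstreams, each event is typically serialized as JSON and paired with a partition key before being sent to a stream. The sketch below builds such a record in the shape Amazon Kinesis Data Streams expects; the stream name, user IDs, and field names are illustrative, not part of any AWS API.

```python
import json
from datetime import datetime, timezone

def build_clickstream_record(user_id: str, page: str) -> dict:
    """Build a Kinesis-style record for one clickstream event.

    The event body is JSON; the partition key groups a user's events
    onto the same shard so per-user ordering is preserved.
    """
    event = {
        "user_id": user_id,
        "page": page,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return {
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": user_id,  # same user -> same shard
    }

record = build_clickstream_record("user-42", "/checkout")
# With boto3 this record would be sent via:
#   kinesis.put_record(StreamName="clickstream", **record)
```

Choosing the partition key is the main design decision here: a high-cardinality key (such as a user ID) spreads load evenly across shards, while a low-cardinality key can create hot shards.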
Different workloads rely on data generated by various systems and platforms:
Application logs are commonly collected with Amazon CloudWatch Logs (including logs emitted by AWS Lambda functions) and analyzed with Amazon OpenSearch Service.
Web analytics and event data are often ingested using Amazon Kinesis Data Firehose and transformed using AWS Glue.
Transactional data is typically sourced from Amazon RDS or Amazon DynamoDB.
Clickstream data is captured using Amazon Kinesis Data Streams and processed through AWS Glue.
Social media data is commonly ingested using AWS Lambda and API Gateway to integrate with external platforms.
IoT device data is processed and stored using AWS IoT Core and AWS IoT Analytics.
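Regardless of the source, ingested data usually lands in Amazon S3 under date-partitioned prefixes so that downstream tools such as Athena and AWS Glue can prune irrelevant files. A minimal sketch of building such a Hive-style key, with hypothetical prefix and file names:

```python
from datetime import datetime, timezone

def partitioned_key(prefix: str, source: str, ts: datetime, filename: str) -> str:
    """Return a Hive-style partitioned S3 object key, e.g.
    logs/source=iot/year=2024/month=05/day=17/part-0001.json

    Date partitions let Athena and Glue scan only the partitions a
    query actually touches instead of the whole dataset.
    """
    return (
        f"{prefix}/source={source}/"
        f"year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/{filename}"
    )

key = partitioned_key(
    "logs", "iot", datetime(2024, 5, 17, tzinfo=timezone.utc), "part-0001.json"
)
```

Kinesis Data Firehose can generate similar date-based prefixes automatically when it delivers streaming data to S3.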
Selecting an appropriate storage solution is critical for achieving the right balance between performance, scalability, durability, and cost in machine learning workloads.
Object storage is ideal for large-scale datasets, including images, videos, logs, and backups. Amazon S3 is the most commonly used service in this category.
File storage supports shared access and concurrent reads and writes, making it suitable for collaborative ML training environments. Amazon EFS is a common choice for these use cases.
Block storage provides low-latency, high-performance storage for compute-intensive ML workloads. Amazon EBS is typically used with EC2 instances for this purpose.
Relational databases are best suited for structured data requiring transactions and joins. Amazon RDS and Amazon Aurora support SQL-based access patterns.
NoSQL databases are optimized for low-latency key-value access and flexible schemas. Amazon DynamoDB is frequently used for real-time ML applications.
Data warehouses are designed for large-scale analytical workloads and batch processing. Amazon Redshift is widely used for analytics-driven ML workflows.
Amazon S3 is the most widely used storage service for ML workloads due to its scalability, durability, and cost efficiency.
S3 Standard supports frequently accessed datasets.
S3 Intelligent-Tiering automatically moves data between access tiers based on usage patterns.
S3 Glacier and Glacier Deep Archive are used for long-term archival of ML datasets.
S3 Select and Amazon Athena enable SQL-based querying of structured data directly in S3 without full data extraction.
S3 supports multiple encryption options, including SSE-S3, SSE-KMS, and SSE-C, to protect sensitive ML datasets.
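When uploading a sensitive dataset, the encryption choice is expressed as parameters on the PutObject call. The sketch below builds SSE-KMS parameters; the bucket, key, and KMS alias are placeholders, and the actual upload via boto3 is shown only as a comment.

```python
def sse_kms_put_params(bucket: str, key: str, body: bytes, kms_key_id: str) -> dict:
    """Parameters for an S3 PutObject call using SSE-KMS encryption.

    SSE-S3 would instead set ServerSideEncryption="AES256" with no key id;
    SSE-C requires the caller to supply and manage the key material.
    """
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": kms_key_id,
    }

params = sse_kms_put_params(
    "ml-datasets",              # hypothetical bucket
    "train/features.parquet",
    b"...",                     # object bytes
    "alias/ml-data-key",        # hypothetical KMS key alias
)
# With boto3: s3.put_object(**params)
```

SSE-KMS is usually preferred over SSE-S3 for ML datasets because KMS key policies and CloudTrail logging give per-key access control and an audit trail.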
Amazon EFS is well suited for ML workloads that require shared file access across multiple training instances.
It provides elastic scaling, high availability, and concurrent access.
General Purpose performance mode is optimized for low-latency access, while Max I/O mode trades slightly higher per-operation latency for greater aggregate throughput and IOPS in large-scale training jobs.
Amazon EBS is commonly used for high-performance ML training jobs that require persistent storage.
Provisioned IOPS SSD (io1/io2) volumes are ideal for performance-critical workloads.
General Purpose SSD (gp2/gp3) volumes provide a balance between cost and performance.
Throughput Optimized HDD (st1) and Cold HDD (sc1) volumes are typically used for large-scale batch processing and log archival.
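A gp3 volume illustrates the cost/performance trade-off well, because its IOPS and throughput can be provisioned independently of size. The sketch below builds parameters for an EC2 CreateVolume call; the Availability Zone and performance figures are illustrative.

```python
def gp3_volume_params(size_gib: int, az: str,
                      iops: int = 3000, throughput_mibps: int = 125) -> dict:
    """Parameters for an EC2 CreateVolume call for a gp3 volume.

    Unlike gp2, gp3 performance is not tied to volume size, so a
    training volume can get extra IOPS/throughput without over-sizing.
    """
    return {
        "VolumeType": "gp3",
        "Size": size_gib,            # GiB
        "AvailabilityZone": az,
        "Iops": iops,                # provisioned independently of size
        "Throughput": throughput_mibps,
    }

params = gp3_volume_params("us-east-1a" and 500, "us-east-1a",
                           iops=6000, throughput_mibps=250)
# With boto3: ec2.create_volume(**params)
```

For a training instance, the volume must be created in the same Availability Zone as the instance before it can be attached.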
Relational databases such as Amazon RDS and Amazon Aurora are best suited for transactional and structured datasets. Aurora integrates with Amazon SageMaker, enabling ML inference directly from SQL queries.
Amazon DynamoDB is optimized for real-time ML applications that require fast lookups and horizontal scalability. Features such as on-demand capacity and DynamoDB Accelerator (DAX) improve performance for latency-sensitive workloads.
Amazon Redshift is designed for large-scale analytical processing. It uses columnar storage, supports querying data in Amazon S3 through Redshift Spectrum, and enables model training and inference using Redshift ML.
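Redshift ML exposes model training through SQL. The statement below is a sketch of the CREATE MODEL syntax with entirely hypothetical table, column, role, and bucket names; behind the scenes, Redshift exports the query result to S3 and drives SageMaker Autopilot, then publishes the trained model as a SQL function.

```python
# Hypothetical Redshift ML training statement, held as a Python string.
create_model_sql = """
CREATE MODEL churn_model
FROM (SELECT age, plan, monthly_spend, churned FROM customer_activity)
TARGET churned
FUNCTION predict_churn
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftMLRole'
SETTINGS (S3_BUCKET 'my-redshift-ml-bucket');
"""
# Once training completes, inference is plain SQL:
#   SELECT predict_churn(age, plan, monthly_spend) FROM customer_activity;
```

The IAM_ROLE must allow Redshift to write to the staging S3 bucket and invoke SageMaker; the SETTINGS clause names that staging bucket.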
Access to ML data repositories should be tightly controlled using IAM roles and S3 bucket policies.
Data should be encrypted at rest using AWS KMS-managed keys and in transit using TLS.
Amazon Macie can be used to detect and protect sensitive data such as PII.
AWS Config helps track changes to storage configurations and access policies.
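Encryption in transit can be enforced at the bucket level with a policy that denies any request not made over TLS. The sketch below builds such a policy document; the bucket name is a placeholder, and the condition key `aws:SecureTransport` is the standard mechanism for this.

```python
import json

def deny_insecure_transport_policy(bucket: str) -> str:
    """S3 bucket policy (JSON) that denies all non-TLS requests."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                f"arn:aws:s3:::{bucket}",      # the bucket itself
                f"arn:aws:s3:::{bucket}/*",    # all objects in it
            ],
            # aws:SecureTransport is "false" for plain-HTTP requests
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }],
    }
    return json.dumps(policy, indent=2)

policy_json = deny_insecure_transport_policy("ml-datasets")
# With boto3: s3.put_bucket_policy(Bucket="ml-datasets", Policy=policy_json)
```

An explicit Deny like this overrides any Allow elsewhere, so even principals with broad S3 permissions cannot read the data over an unencrypted connection.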
Large dataset uploads can be accelerated using S3 Transfer Acceleration.
Partitioning datasets in Amazon S3 improves query performance for Athena and Redshift Spectrum by reducing the amount of data scanned per query.
Caching services such as DynamoDB DAX and Amazon ElastiCache reduce latency for frequently accessed data.
S3 lifecycle policies automatically transition data to lower-cost storage tiers.
Selecting the appropriate EBS volume type helps balance performance and cost.
Spot Instances can significantly reduce compute costs for ML training workloads.
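The lifecycle tip above can be made concrete. The sketch below builds a lifecycle configuration that tiers raw training data down over time; the prefix, day thresholds, and bucket name are illustrative choices, not requirements.

```python
def lifecycle_rule(prefix: str) -> dict:
    """One S3 lifecycle rule: Standard-IA after 30 days, Glacier after
    90, and deletion after a year, applied to objects under `prefix`."""
    return {
        "ID": "tier-down-raw-data",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": 365},
    }

config = {"Rules": [lifecycle_rule("raw/")]}
# With boto3:
#   s3.put_bucket_lifecycle_configuration(
#       Bucket="ml-datasets", LifecycleConfiguration=config)
```

For datasets with unpredictable access patterns, S3 Intelligent-Tiering is often simpler than hand-tuned transition thresholds like these.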
Understand when to use different storage services: Amazon S3 for object storage, Amazon EFS for shared file systems, and Amazon EBS for high-performance block storage.
Choose the correct database based on workload requirements. Amazon Redshift is ideal for analytics and ML inference, DynamoDB supports real-time applications, and Amazon RDS is best for transactional data.
Be familiar with S3 encryption options, including SSE-S3, SSE-KMS, SSE-C, and client-side encryption.
Know how AWS Glue supports ML data pipelines through ETL jobs, crawlers, and the Glue Data Catalog. Glue DataBrew simplifies data preparation for ML training.
Understand Amazon Redshift ML capabilities, including SQL-based model training and AutoML features.
Apply strong security practices by using IAM roles, enabling CloudTrail logging, and managing permissions with AWS Lake Formation.
Use Amazon Athena for serverless SQL querying of ML datasets stored in Amazon S3, integrated with the AWS Glue Data Catalog.