This section covers essential exam objectives related to preparing raw data for machine learning models through transformation and feature engineering.
Domain 1: Data Preparation for Machine Learning
Task Statement 1.2: Transform data and perform feature engineering
Effective data preprocessing is critical to improving model accuracy, reliability, and generalization. It ensures consistency, reduces noise, and minimizes bias in training data.
Missing values must be addressed carefully to avoid skewed results or model instability.
Common approaches include removing incomplete records when the dataset is small and the percentage of missing values is minimal. Statistical imputation techniques such as replacing missing values with the mean, median, or mode are widely used for numeric features when outliers have limited impact. For time-series data, forward fill and backward fill methods preserve temporal continuity, while interpolation estimates missing values based on observed trends, making it suitable for sensor and financial data. In more complex datasets, ML-based imputation techniques such as k-nearest neighbors or regression models can predict missing values by learning underlying patterns.
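As a concrete illustration, the statistical and time-series imputation strategies above can be sketched with pandas (the column names and values here are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy dataset with missing values (hypothetical sensor and sales columns)
df = pd.DataFrame({
    "temp": [20.1, np.nan, 21.5, np.nan, 23.0],
    "sales": [100.0, 110.0, np.nan, 130.0, 140.0],
})

# Statistical imputation: replace gaps with the column median
sales_filled = df["sales"].fillna(df["sales"].median())

# Forward fill carries the last observed value forward (temporal continuity)
temp_ffill = df["temp"].ffill()

# Linear interpolation estimates each gap from the surrounding trend
temp_interp = df["temp"].interpolate(method="linear")
```

ML-based imputers such as scikit-learn's KNNImputer follow the same pattern, replacing the simple statistic with a learned estimate at the cost of more computation.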
Outliers can significantly distort model performance and must be handled appropriately.
The Z-score method identifies values that fall beyond a specified number of standard deviations from the mean and is most effective for normally distributed data. The interquartile range (IQR) method detects extreme values in skewed distributions by flagging points more than 1.5 times the IQR below the first quartile or above the third quartile. Winsorization limits extreme values by capping them at chosen percentiles rather than removing them, which is common in financial datasets, while clipping caps values at predefined thresholds and is frequently used in image processing and transaction analysis.
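A minimal NumPy sketch of these techniques, using made-up values in which 95.0 plays the outlier (the thresholds shown, 2 standard deviations and the 5th/95th percentiles, are illustrative choices):

```python
import numpy as np

values = np.array([10.0, 12.0, 11.0, 13.0, 12.5, 95.0])  # 95.0 is the outlier

# Z-score method: flag points beyond 2 standard deviations of the mean
z = (values - values.mean()) / values.std()
z_outliers = np.abs(z) > 2

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Winsorization: cap extremes at the 5th/95th percentiles instead of dropping
low, high = np.percentile(values, [5, 95])
winsorized = np.clip(values, low, high)
```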
Duplicate records can introduce bias and inflate feature importance. AWS Glue DataBrew provides built-in transformations for identifying and removing duplicates in tabular data. For large-scale datasets, Apache Spark on Amazon EMR enables custom deduplication logic using PySpark. In streaming pipelines, AWS Lambda can be used to apply rule-based deduplication in real time.
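The deduplication logic itself is straightforward; a pandas sketch of the two common rules, exact-row match and key-based match, with hypothetical records:

```python
import pandas as pd

# Hypothetical transaction records containing duplicates
records = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "amount": [50.0, 50.0, 75.0, 20.0, 25.0],
})

# Exact-row deduplication: drop rows identical across all columns
deduped = records.drop_duplicates()

# Key-based deduplication: keep one row per user_id, the same rule
# PySpark's dropDuplicates(["user_id"]) applies at scale on EMR
per_user = records.drop_duplicates(subset=["user_id"])
```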
Feature engineering transforms raw attributes into meaningful inputs that improve model learning.
Feature scaling and standardization ensure that numeric features contribute equally to model training, particularly for distance-based and gradient-based algorithms. Feature splitting extracts components such as day, month, and year from timestamps, which is especially useful for time-series analysis and is natively supported by AWS Glue DataBrew.
Feature binning, also known as discretization, converts continuous variables into categorical ranges and is commonly used with decision trees and classification models. Log transformations compress long-tailed, skewed value distributions and are frequently applied in financial and transactional datasets. Normalization ensures consistent value ranges and is essential in neural networks and image preprocessing workflows.
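A short pandas sketch of the transformations above (timestamp splitting, min-max normalization, log transform, and binning), on a hypothetical two-row dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: a timestamp and a skewed numeric feature
df = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-15", "2024-06-30"]),
    "price": [10.0, 1000.0],
})

# Feature splitting: extract calendar components from the timestamp
df["month"] = df["ts"].dt.month
df["year"] = df["ts"].dt.year

# Min-max normalization rescales values into [0, 1]
p = df["price"]
df["price_minmax"] = (p - p.min()) / (p.max() - p.min())

# Log transform compresses the long-tailed range
df["price_log"] = np.log1p(p)

# Binning: discretize the continuous price into labeled ranges
df["price_band"] = pd.cut(p, bins=[0, 100, 10_000], labels=["low", "high"])
```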
Machine learning models require numerical inputs, making categorical encoding a fundamental step.
One-hot encoding is suitable for low-cardinality categorical variables without inherent ordering and is widely supported in SageMaker Data Wrangler. Label encoding is used when categories have a natural order, such as severity levels, and is commonly applied through AWS Glue DataBrew. Binary encoding efficiently represents high-cardinality categorical features and integrates well with feature stores. Feature hashing is effective for large-scale text data in NLP pipelines, while word embeddings capture semantic relationships and are commonly generated using models such as SageMaker BlazingText.
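The encoding strategies themselves can be sketched in a few lines of Python; the category values below are hypothetical, and the hashing helper is an illustrative stand-in for a library implementation:

```python
import hashlib
import pandas as pd

# One-hot encoding: one binary column per category (low cardinality, unordered)
colors = pd.Series(["red", "green", "red", "blue"])
one_hot = pd.get_dummies(colors, prefix="color")

# Label encoding: map naturally ordered categories to integers
severity = pd.Series(["low", "high", "medium"])
severity_encoded = severity.map({"low": 0, "medium": 1, "high": 2})

# Feature hashing: bucket arbitrary strings into a fixed number of slots,
# trading a small collision risk for bounded dimensionality
def hash_feature(value: str, n_buckets: int = 8) -> int:
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets
```

A stable hash (hashlib rather than Python's built-in hash) matters here: the same token must land in the same bucket across training and inference runs.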
AWS provides a rich ecosystem of tools to support feature engineering workflows.
SageMaker Data Wrangler enables low-code, interactive data preparation and feature creation. AWS Glue DataBrew simplifies cleaning and deduplication of structured data, while Apache Spark on Amazon EMR supports large-scale, distributed feature engineering. AWS Lambda is commonly used for lightweight transformations in event-driven pipelines. Amazon Athena enables SQL-based transformations directly on data stored in S3, and Amazon QuickSight supports visualization and exploratory analysis of engineered features.
Real-time ML pipelines often require feature transformation on continuously arriving data.
AWS Lambda supports event-driven transformations with minimal latency. Amazon Kinesis Data Analytics, powered by Apache Flink, enables windowed aggregations and real-time feature computation. For high-throughput or complex streaming workloads, Apache Spark on Amazon EMR provides scalable real-time feature transformation capabilities.
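The core idea behind windowed feature computation, such as Flink's tumbling windows, can be sketched in plain Python; the window size and event values below are hypothetical:

```python
from collections import defaultdict

def tumbling_window_sums(events, window_seconds=60):
    """Aggregate (timestamp, amount) events into fixed, non-overlapping windows.

    This mirrors the tumbling-window aggregation a Flink / Kinesis Data
    Analytics job performs on a stream, applied here to an in-memory list.
    """
    sums = defaultdict(float)
    for ts, amount in events:
        # Align each event to the start of its window
        window_start = (ts // window_seconds) * window_seconds
        sums[window_start] += amount
    return dict(sums)

events = [(5, 10.0), (30, 5.0), (65, 2.0), (125, 1.0)]
features = tumbling_window_sums(events)  # {0: 15.0, 60: 2.0, 120: 1.0}
```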
High-quality labeled data is essential for supervised learning.
Amazon SageMaker Ground Truth provides human-assisted labeling workflows for images, text, and audio. Amazon Mechanical Turk enables crowdsourced labeling at scale. Automated services such as Amazon Rekognition Custom Labels, Amazon Transcribe, and Amazon Comprehend generate labels for images, speech, and text, accelerating dataset preparation for ML models.
Amazon SageMaker Feature Store centralizes feature storage and supports both online retrieval for inference and offline access for training. AWS Glue DataBrew enables no-code feature transformations, while Apache Spark on Amazon EMR supports large-scale feature extraction. Amazon Athena allows SQL-based transformations on S3 data, and Amazon Kinesis Data Analytics enables real-time feature engineering for streaming workloads.
Focus on mastering data cleaning and transformation fundamentals. Imputation is generally preferred over deletion unless missing values are extensive. Use the IQR method for skewed distributions and Z-score analysis for normally distributed data. AWS Glue DataBrew is ideal for tabular deduplication, while Spark on EMR is better suited for large-scale datasets.
Understand core feature engineering techniques. Min–Max scaling is commonly used in deep learning, while Z-score standardization is preferred for PCA and regression models. Binning converts continuous features into categories, log transformation reduces skewness, and feature splitting enhances time-series analysis.
Be comfortable with encoding strategies. One-hot encoding applies to unordered categories, label encoding applies to ordered categories, and binary encoding handles high-cardinality features efficiently.
Know when to use each AWS feature engineering tool. SageMaker Feature Store supports both training and inference, Glue DataBrew simplifies no-code transformations, EMR with Spark scales feature engineering, and Kinesis Data Analytics enables real-time transformations.
Understand data annotation options. SageMaker Ground Truth is the preferred choice for supervised datasets, Mechanical Turk supports crowdsourcing, and Rekognition, Transcribe, and Comprehend enable automated labeling pipelines.