Feature engineering is the process of transforming raw data into formats that improve how a machine learning model trains, so that it reaches greater accuracy and better performance. It rests on knowing both the business problem and the data source. Working through it also gives you a deeper understanding of your data, which leads to more valuable insights.
Steps involved in Feature Engineering
1. Explore the dataset – Understand your dataset and its shape.
2. Handle missing data – Impute or remove missing data.
3. Encode variables – Convert categorical variables to numerical form.
4. Scale features – Standardize or normalize the numerical features.
5. Create features – Generate new features by combining existing features to capture relationships.
6. Handle outliers – Identify and address outliers by transforming or trimming the data.
7. Normalization – Normalize features to bring them to a common scale.
8. Binning or Discretization – Convert continuous features into discrete bins to capture patterns.
9. Text data processing – Tokenization, stemming and removal of stop words.
10. Time series features – Extract the relevant time-based features. E.g. rolling statistics or lag features.
11. Vector features – Represent each example as a numeric vector, the form in which most machine learning algorithms take their training input.
12. Feature selection – Identify and select the most relevant features to improve model interpretability and efficiency using techniques like univariate feature selection or recursive feature elimination.
13. Feature extraction – Reduce data complexity while retaining as much relevant information as possible.
14. Cross validation – Evaluate the impact of feature engineering on model performance using cross validation techniques.
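A few of the steps above (handling missing data, encoding, binning and lag features) can be sketched in plain Python. This is only a minimal illustration under stated assumptions: the toy records, the column names `age` and `color`, the bin edges and all helper function names are hypothetical, and in practice a library such as pandas or scikit-learn would normally do this work.

```python
from statistics import mean

# Hypothetical toy dataset; None marks a missing value.
rows = [
    {"age": 25, "color": "red"},
    {"age": None, "color": "blue"},
    {"age": 35, "color": "red"},
    {"age": 60, "color": "green"},
]


def impute_mean(rows, key):
    """Step 2: fill missing values in `key` with the mean of the observed values."""
    observed = [r[key] for r in rows if r[key] is not None]
    fill = mean(observed)
    return [{**r, key: fill if r[key] is None else r[key]} for r in rows]


def one_hot(rows, key):
    """Step 3: replace a categorical column with one 0/1 column per category."""
    categories = sorted({r[key] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != key}
        for c in categories:
            new_row[f"{key}_{c}"] = 1 if r[key] == c else 0
        encoded.append(new_row)
    return encoded


def bin_value(value, edges):
    """Step 8: return the index of the bin that `value` falls into.

    `edges` are ascending upper bounds; anything above the last edge
    lands in one final extra bin.
    """
    for i, edge in enumerate(edges):
        if value <= edge:
            return i
    return len(edges)


def lag_feature(series, lag):
    """Step 10: shift a series by `lag` steps; early entries have no history."""
    padded = [None] * lag + list(series)
    return padded[:len(series)]


# Chain the steps: impute first, then encode, then add a binned age column.
clean = one_hot(impute_mean(rows, "age"), "color")
for r in clean:
    r["age_bin"] = bin_value(r["age"], edges=[30, 50])  # assumed bin edges

sales = [10, 20, 30, 40]           # hypothetical daily values
sales_lag_1 = lag_feature(sales, 1)  # yesterday's value for each day
```

Imputation runs before binning here because `bin_value` cannot compare `None` with a number. The order of steps in a real pipeline depends on the data and the model.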
Types of features
- Numerical features – Numerical values. E.g. Float, Int.
- Categorical features – Take one of a limited number of values. E.g. Gender, Color.
- Binary features – Special case of categorical features with only two categories. E.g. is-smoker (Yes/No).
- Text features – Textual data.
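To make the distinction concrete, here is a rough sketch of how a column's type might be guessed from its values. The function name `infer_feature_type` and the `max_categories` cutoff are assumptions made for illustration; this is a simple heuristic, not a standard rule, and real datasets usually need a manual check.

```python
def infer_feature_type(values, max_categories=10):
    """Guess a feature type from a column's values (heuristic only)."""
    distinct = set(values)
    # Exactly two distinct values: treat as binary, whatever their type.
    if len(distinct) == 2:
        return "binary"
    # All ints or floats (bools excluded): numerical.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "numerical"
    # A small set of repeated labels: categorical.
    if len(distinct) <= max_categories:
        return "categorical"
    # Many distinct non-numeric values: assume free text.
    return "text"


# Hypothetical example columns.
height = infer_feature_type([1.5, 2.0, 3.7])
color = infer_feature_type(["red", "green", "blue", "red"])
is_smoker = infer_feature_type(["yes", "no", "yes"])
review = infer_feature_type(
    ["great product", "arrived late", "would buy again"],
    max_categories=2,  # lowered so this tiny sample counts as free text
)
```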
Normalization
Data can be measured on different scales, so it needs to be standardized before it is used with algorithms that are sensitive to the magnitude and scale of variables. Normalization standardizes the range of the independent variables, or features.
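Two common ways to do this are min-max scaling, which maps values onto the range 0 to 1, and z-score standardization, which rescales values to zero mean and unit standard deviation. The sketch below shows both in plain Python. The function names and the sample `incomes` and `ages` values are assumptions for illustration; libraries such as scikit-learn provide equivalent scalers.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Two hypothetical features measured on very different scales.
incomes = [20_000, 40_000, 60_000, 100_000]
ages = [20, 30, 40, 60]

scaled_incomes = min_max_scale(incomes)
scaled_ages = min_max_scale(ages)
standardized_ages = z_score(ages)
```

After min-max scaling, both features lie between 0 and 1, so neither dominates a distance-based or gradient-based algorithm purely because of its units.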
Normalization helps in: