Class DataCleaningService

java.lang.Object
com.leumanuel.woozydata.service.DataCleaningService

public class DataCleaningService extends Object
Service class for cleaning and transforming DataFrames. Provides methods for handling missing values, removing duplicates, replacing values, converting column types, grouping rows, rescaling numeric columns (standardization and normalization), computing column statistics, and interpolating missing values.
Version:
1.0
Author:
Leu A. Manuel
  • Constructor Details

    • DataCleaningService

      public DataCleaningService()
  • Method Details

    • dropNa

      public DataFrame dropNa(DataFrame df)
      Removes all rows containing missing values from the DataFrame. Missing values are identified as null, empty strings, "null", or "nan" (case insensitive).
      Parameters:
      df - the DataFrame to clean
      Returns:
      new DataFrame with rows containing missing values removed
      Throws:
      IllegalArgumentException - if df is null
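
      The missing-value rule described above can be sketched without the library. This is a minimal illustration, not the library's implementation: rows are modeled as Map<String, Object> instead of DataFrame, and the class name DropNaSketch and its methods are hypothetical. Whether the real method trims whitespace or treats Double.NaN as missing is not stated here, so the sketch does neither.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: rows are plain maps, not the library's DataFrame.
class DropNaSketch {

    // Documented rule: null, empty string, "null", or "nan" (case insensitive).
    static boolean isMissing(Object value) {
        if (value == null) {
            return true;
        }
        if (value instanceof String) {
            String s = (String) value;
            return s.isEmpty() || s.equalsIgnoreCase("null") || s.equalsIgnoreCase("nan");
        }
        return false;
    }

    // Returns a new list holding copies of the rows that have no missing cell.
    static List<Map<String, Object>> dropNa(List<Map<String, Object>> rows) {
        if (rows == null) {
            throw new IllegalArgumentException("rows must not be null");
        }
        List<Map<String, Object>> kept = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            boolean anyMissing = false;
            for (Object value : row.values()) {
                if (isMissing(value)) {
                    anyMissing = true;
                    break;
                }
            }
            if (!anyMissing) {
                kept.add(new LinkedHashMap<>(row));
            }
        }
        return kept;
    }
}
```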
    • fillNa

      public DataFrame fillNa(DataFrame df, Object value)
      Replaces all missing values in the DataFrame with a specified value. Missing values are identified as null, empty strings, "null", or "nan" (case insensitive).
      Parameters:
      df - the DataFrame to process
      value - the value to use for replacing missing values
      Returns:
      new DataFrame with missing values replaced
      Throws:
      IllegalArgumentException - if df is null
    • fillNaColumns

      public DataFrame fillNaColumns(DataFrame df, Object value, String... columns)
      Replaces missing values in specified columns with a given value.
      Parameters:
      df - the DataFrame to process
      value - the value to use for replacing missing values
      columns - array of column names where missing values should be replaced
      Returns:
      new DataFrame with missing values replaced in specified columns
      Throws:
      IllegalArgumentException - if df is null or if any specified column doesn't exist
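
      A column-restricted fill might look like the following sketch. It assumes rows modeled as Map<String, Object> and reuses the missing-value rule documented for fillNa; FillNaColumnsSketch is a hypothetical name, and the choice to leave all other columns untouched is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: rows are plain maps, not the library's DataFrame.
class FillNaColumnsSketch {

    // Same rule as documented for fillNa: null, "", "null", "nan" (case insensitive).
    static boolean isMissing(Object value) {
        if (value == null) {
            return true;
        }
        if (value instanceof String) {
            String s = (String) value;
            return s.isEmpty() || s.equalsIgnoreCase("null") || s.equalsIgnoreCase("nan");
        }
        return false;
    }

    // Fills missing cells only in the named columns and returns new row copies.
    static List<Map<String, Object>> fillNaColumns(
            List<Map<String, Object>> rows, Object value, String... columns) {
        if (rows == null) {
            throw new IllegalArgumentException("rows must not be null");
        }
        List<Map<String, Object>> filled = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            Map<String, Object> copy = new LinkedHashMap<>(row);
            for (String column : columns) {
                if (!copy.containsKey(column)) {
                    throw new IllegalArgumentException("Unknown column: " + column);
                }
                if (isMissing(copy.get(column))) {
                    copy.put(column, value);
                }
            }
            filled.add(copy);
        }
        return filled;
    }
}
```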
    • dropDuplicates

      public DataFrame dropDuplicates(DataFrame df, String... columns)
      Removes duplicate rows from the DataFrame. Two rows are duplicates when they hold equal values in the specified columns; if no columns are specified, rows are compared on all columns.
      Parameters:
      df - the DataFrame to process
      columns - array of column names to check for duplicates
      Returns:
      new DataFrame with duplicate rows removed
      Throws:
      IllegalArgumentException - if df is null or if any specified column doesn't exist
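
      One way to picture the duplicate check is a set of already-seen keys, where a key is the list of values in the chosen columns. The sketch below assumes rows modeled as Map<String, Object> and keeps the first occurrence of each key; which occurrence the real method keeps is not stated here, and DropDuplicatesSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in: rows are plain maps, not the library's DataFrame.
class DropDuplicatesSketch {

    // Keeps the first row for each distinct key (assumption: first occurrence wins).
    static List<Map<String, Object>> dropDuplicates(
            List<Map<String, Object>> rows, String... columns) {
        if (rows == null) {
            throw new IllegalArgumentException("rows must not be null");
        }
        Set<Object> seen = new HashSet<>();
        List<Map<String, Object>> unique = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            Object key;
            if (columns.length == 0) {
                // No columns given: the whole row is the key (Map.equals compares all entries).
                key = row;
            } else {
                List<Object> values = new ArrayList<>();
                for (String column : columns) {
                    if (!row.containsKey(column)) {
                        throw new IllegalArgumentException("Unknown column: " + column);
                    }
                    values.add(row.get(column));
                }
                key = values;
            }
            // Set.add returns false when an equal key was already seen.
            if (seen.add(key)) {
                unique.add(row);
            }
        }
        return unique;
    }
}
```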
    • replace

      public DataFrame replace(DataFrame df, Object oldValue, Object newValue)
      Replaces all occurrences of a specific value with a new value throughout the DataFrame.
      Parameters:
      df - the DataFrame to process
      oldValue - the value to be replaced
      newValue - the replacement value
      Returns:
      new DataFrame with values replaced
      Throws:
      IllegalArgumentException - if df is null
    • isna

      public DataFrame isna(DataFrame df)
      Creates a boolean mask indicating missing values in the DataFrame.
      Parameters:
      df - the DataFrame to analyze
      Returns:
      new DataFrame containing boolean values (true for missing values)
      Throws:
      IllegalArgumentException - if df is null
    • notna

      public DataFrame notna(DataFrame df)
      Creates a boolean mask indicating non-missing values in the DataFrame.
      Parameters:
      df - the DataFrame to analyze
      Returns:
      new DataFrame containing boolean values (true for non-missing values)
      Throws:
      IllegalArgumentException - if df is null
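
      The isna and notna masks are complements of each other. The sketch below builds both from one helper, assuming rows modeled as Map<String, Object> and the missing-value rule documented for dropNa; NaMaskSketch is a hypothetical name and the mask layout (one boolean per cell, same column names) is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: rows are plain maps, not the library's DataFrame.
class NaMaskSketch {

    static boolean isMissing(Object value) {
        if (value == null) {
            return true;
        }
        if (value instanceof String) {
            String s = (String) value;
            return s.isEmpty() || s.equalsIgnoreCase("null") || s.equalsIgnoreCase("nan");
        }
        return false;
    }

    // true marks a missing cell.
    static List<Map<String, Boolean>> isna(List<Map<String, Object>> rows) {
        return mask(rows, true);
    }

    // true marks a present (non-missing) cell.
    static List<Map<String, Boolean>> notna(List<Map<String, Object>> rows) {
        return mask(rows, false);
    }

    // flagMissing selects which state is reported as true.
    private static List<Map<String, Boolean>> mask(
            List<Map<String, Object>> rows, boolean flagMissing) {
        if (rows == null) {
            throw new IllegalArgumentException("rows must not be null");
        }
        List<Map<String, Boolean>> result = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            Map<String, Boolean> flags = new LinkedHashMap<>();
            for (Map.Entry<String, Object> cell : row.entrySet()) {
                flags.put(cell.getKey(), isMissing(cell.getValue()) == flagMissing);
            }
            result.add(flags);
        }
        return result;
    }
}
```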
    • astype

      public DataFrame astype(DataFrame df, Map<String,Class<?>> typeMap)
      Converts columns to specified data types.
      Parameters:
      df - the DataFrame to process
      typeMap - map of column names to their target Java types
      Returns:
      new DataFrame with converted column types
      Throws:
      IllegalArgumentException - if df is null or if conversion fails
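
      A type map pairs each column name with a target class such as Integer.class. The following sketch shows one plausible conversion strategy (parse the cell's string form), assuming rows modeled as Map<String, Object>. AstypeSketch, the set of supported target types, and the decision to pass null cells through unchanged are all assumptions of the sketch, not documented behavior.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: rows are plain maps, not the library's DataFrame.
class AstypeSketch {

    // Converts one cell by parsing its string form; null cells pass through (assumption).
    static Object convertValue(Object value, Class<?> target) {
        if (value == null) {
            return null;
        }
        String text = value.toString().trim();
        try {
            if (target == String.class) {
                return text;
            }
            if (target == Integer.class) {
                return Integer.valueOf(text);
            }
            if (target == Double.class) {
                return Double.valueOf(text);
            }
            if (target == Boolean.class) {
                return Boolean.valueOf(text);
            }
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                    "Cannot convert '" + value + "' to " + target.getSimpleName(), e);
        }
        throw new IllegalArgumentException("Unsupported target type: " + target.getName());
    }

    // Applies the type map column by column and returns new row copies.
    static List<Map<String, Object>> astype(
            List<Map<String, Object>> rows, Map<String, Class<?>> typeMap) {
        if (rows == null || typeMap == null) {
            throw new IllegalArgumentException("rows and typeMap must not be null");
        }
        List<Map<String, Object>> converted = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            Map<String, Object> copy = new LinkedHashMap<>(row);
            for (Map.Entry<String, Class<?>> entry : typeMap.entrySet()) {
                String column = entry.getKey();
                if (!copy.containsKey(column)) {
                    throw new IllegalArgumentException("Unknown column: " + column);
                }
                copy.put(column, convertValue(copy.get(column), entry.getValue()));
            }
            converted.add(copy);
        }
        return converted;
    }
}
```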
    • groupBy

      public Map<List<Object>,DataFrame> groupBy(DataFrame df, String... columns)
      Groups the rows of the DataFrame by their values in the specified columns.
      Parameters:
      df - the DataFrame to group
      columns - array of column names to group by
      Returns:
      Map from each group key (a List<Object>, per the method signature) to the DataFrame holding the rows of that group
      Throws:
      IllegalArgumentException - if df is null or if any specified column doesn't exist
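
      The Map<List<Object>, DataFrame> return type suggests that each key is the list of values found in the grouping columns. A minimal sketch of that idea follows, assuming rows modeled as Map<String, Object> and groups as lists of rows; GroupBySketch is a hypothetical name, and preserving first-seen group order is a choice of the sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: rows are plain maps, not the library's DataFrame.
class GroupBySketch {

    // Key = values of the grouping columns, in the order the columns were given.
    static Map<List<Object>, List<Map<String, Object>>> groupBy(
            List<Map<String, Object>> rows, String... columns) {
        if (rows == null) {
            throw new IllegalArgumentException("rows must not be null");
        }
        // LinkedHashMap keeps groups in first-seen order (a choice of this sketch).
        Map<List<Object>, List<Map<String, Object>>> groups = new LinkedHashMap<>();
        for (Map<String, Object> row : rows) {
            List<Object> key = new ArrayList<>();
            for (String column : columns) {
                if (!row.containsKey(column)) {
                    throw new IllegalArgumentException("Unknown column: " + column);
                }
                key.add(row.get(column));
            }
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
        }
        return groups;
    }
}
```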
    • clean

      public DataFrame clean(DataFrame df)
      Performs a combined cleaning pass: removes null values, removes duplicate rows, and fills any nulls that remain.
      Parameters:
      df - the DataFrame to clean
      Returns:
      cleaned DataFrame
      Throws:
      IllegalArgumentException - if df is null
    • standardize

      public DataFrame standardize(DataFrame df, String... columns)
      Standardizes the specified numeric columns using z-score normalization: z = (x - mean) / std, where mean and std are the column's mean and standard deviation.
      Parameters:
      df - the DataFrame to process
      columns - array of column names to standardize
      Returns:
      new DataFrame with standardized columns
      Throws:
      IllegalArgumentException - if df is null or if any specified column isn't numeric
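
      The z-score formula can be worked through on a single column of doubles. This sketch is not the library's code: StandardizeSketch is a hypothetical name, the population standard deviation (divide by n) is an assumption because the documentation does not say which variant is used, and mapping a constant column to zeros is also an assumption.

```java
// Hypothetical stand-in: one numeric column as a double[], not a DataFrame column.
class StandardizeSketch {

    // Returns (x - mean) / std for each value; the input array is not modified.
    static double[] standardize(double[] values) {
        if (values == null) {
            throw new IllegalArgumentException("values must not be null");
        }
        int n = values.length;
        double[] result = new double[n];
        if (n == 0) {
            return result;
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        double mean = sum / n;

        double squaredDiffs = 0.0;
        for (double v : values) {
            squaredDiffs += (v - mean) * (v - mean);
        }
        // Assumption: population standard deviation (divide by n, not n - 1).
        double std = Math.sqrt(squaredDiffs / n);
        if (std == 0.0) {
            // Constant column: return zeros instead of dividing by zero (assumption).
            return result;
        }
        for (int i = 0; i < n; i++) {
            result[i] = (values[i] - mean) / std;
        }
        return result;
    }
}
```

      After standardization the column has mean 0 and, under the population convention used here, standard deviation 1.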
    • normalize

      public DataFrame normalize(DataFrame df, String... columns)
      Normalizes the specified numeric columns to the range [0, 1].
      Parameters:
      df - the DataFrame to process
      columns - array of column names to normalize
      Returns:
      new DataFrame with normalized columns
      Throws:
      IllegalArgumentException - if df is null or if any specified column isn't numeric
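
      Rescaling to [0, 1] is commonly done with min-max scaling, (x - min) / (max - min). The documentation does not name the formula, so treat the sketch below as one plausible reading; NormalizeSketch is a hypothetical name and mapping a constant column to zeros is an assumption.

```java
// Hypothetical stand-in: one numeric column as a double[], not a DataFrame column.
class NormalizeSketch {

    // Min-max scaling: the smallest value maps to 0 and the largest to 1.
    static double[] normalize(double[] values) {
        if (values == null) {
            throw new IllegalArgumentException("values must not be null");
        }
        double[] result = new double[values.length];
        if (values.length == 0) {
            return result;
        }
        double min = values[0];
        double max = values[0];
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        if (range == 0.0) {
            // Constant column: return zeros instead of dividing by zero (assumption).
            return result;
        }
        for (int i = 0; i < values.length; i++) {
            result[i] = (values[i] - min) / range;
        }
        return result;
    }
}
```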
    • stats

      public Map<String,Double> stats(DataFrame df, String column)
      Calculates basic statistics for a specified column.
      Parameters:
      df - the DataFrame to analyze
      column - the column name to analyze
      Returns:
      Map containing statistics (mean, std, min, max, median)
      Throws:
      IllegalArgumentException - if df is null or column isn't numeric
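
      The five statistics listed above can be computed in one pass plus a sort. In this sketch the map keys "mean", "std", "min", "max", and "median" follow the names in the Returns description, but the exact keys used by the library are an assumption, as is the population standard deviation (divide by n). StatsSketch is a hypothetical name.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in: one numeric column as a double[], not a DataFrame column.
class StatsSketch {

    static Map<String, Double> stats(double[] values) {
        if (values == null || values.length == 0) {
            throw new IllegalArgumentException("values must be non-empty");
        }
        // Sort a copy so min, max, and median can be read by position.
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;

        double sum = 0.0;
        for (double v : sorted) {
            sum += v;
        }
        double mean = sum / n;

        double squaredDiffs = 0.0;
        for (double v : sorted) {
            squaredDiffs += (v - mean) * (v - mean);
        }
        // Assumption: population standard deviation (divide by n).
        double std = Math.sqrt(squaredDiffs / n);

        // Median: middle value, or the average of the two middle values for even n.
        double median = (n % 2 == 1)
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;

        Map<String, Double> result = new LinkedHashMap<>();
        result.put("mean", mean);
        result.put("std", std);
        result.put("min", sorted[0]);
        result.put("max", sorted[n - 1]);
        result.put("median", median);
        return result;
    }
}
```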
    • convert

      public DataFrame convert(DataFrame df, Map<String,Class<?>> typeMap)
      Converts column data types according to the provided type map.
      Parameters:
      df - the DataFrame to process
      typeMap - map of column names to target types
      Returns:
      new DataFrame with converted types
      Throws:
      IllegalArgumentException - if conversion fails or type is unsupported
    • interpolate

      public DataFrame interpolate(DataFrame df, String method, String... columns)
      Interpolates missing values in the specified columns using the interpolation method named by the method parameter.
      Parameters:
      df - the DataFrame to process
      method - interpolation method ("linear", "spline", "loess", or "neville")
      columns - array of column names to interpolate
      Returns:
      new DataFrame with interpolated values
      Throws:
      IllegalArgumentException - if method is invalid or interpolation fails
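
      To illustrate what the "linear" method could mean for a single column, the sketch below fills gaps on a straight line between the nearest known neighbors, using the row index as the x coordinate. It is an assumption-laden illustration only: missing values are represented as Double.NaN, leading and trailing gaps are left unfilled, and InterpolateSketch is a hypothetical name. The "spline", "loess", and "neville" methods are not covered.

```java
// Hypothetical stand-in: one numeric column as a double[], with NaN marking missing cells.
class InterpolateSketch {

    // Fills interior NaN runs linearly; the input array is not modified.
    static double[] interpolateLinear(double[] values) {
        if (values == null) {
            throw new IllegalArgumentException("values must not be null");
        }
        double[] result = values.clone();
        int previousKnown = -1; // index of the last non-missing value seen so far
        for (int i = 0; i < result.length; i++) {
            if (Double.isNaN(result[i])) {
                continue;
            }
            if (previousKnown >= 0 && i - previousKnown > 1) {
                // Slope of the straight line between the two known neighbors.
                double step = (result[i] - result[previousKnown]) / (i - previousKnown);
                for (int j = previousKnown + 1; j < i; j++) {
                    result[j] = result[previousKnown] + step * (j - previousKnown);
                }
            }
            previousKnown = i;
        }
        // Gaps before the first or after the last known value stay NaN (assumption).
        return result;
    }
}
```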