com.leumanuel.woozydata.service.DataCleaningService

public class DataCleaningService extends Object

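Service class responsible for data cleaning and transformation operations on DataFrames. Provides methods for handling missing values, removing duplicates, replacing values, converting column types, grouping rows, standardizing and normalizing numeric columns, computing basic column statistics, and interpolating missing values.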

Version: 1.0
Author: Leu A. Manuel

Constructor Summary

| Constructor | Description |
| --- | --- |
| DataCleaningService() | |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| DataFrame | astype(DataFrame df, Map<String,Class<?>> typeMap) | Converts columns to specified data types. |
| DataFrame | clean(DataFrame df) | Performs comprehensive data cleaning: removes nulls, removes duplicates, and fills any remaining nulls. |
| DataFrame | convert(DataFrame df, Map<String,Class<?>> typeMap) | Converts column data types according to the provided type map. |
| DataFrame | dropDuplicates(DataFrame df, String... columns) | Removes duplicate rows from the DataFrame based on specified columns. |
| DataFrame | dropNa(DataFrame df) | Removes all rows containing missing values from the DataFrame. |
| DataFrame | fillNa(DataFrame df, Object value) | Replaces all missing values in the DataFrame with a specified value. |
| DataFrame | fillNaColumns(DataFrame df, Object value, String... columns) | Replaces missing values in specified columns with a given value. |
| Map<List<Object>,DataFrame> | groupBy(DataFrame df, String... columns) | Groups DataFrame by specified columns. |
| DataFrame | interpolate(DataFrame df, String method, String... columns) | Interpolates missing values in specified columns using various methods. |
| DataFrame | isna(DataFrame df) | Creates a boolean mask indicating missing values in the DataFrame. |
| DataFrame | normalize(DataFrame df, String... columns) | Normalizes specified numeric columns to range [0,1]. |
| DataFrame | notna(DataFrame df) | Creates a boolean mask indicating non-missing values in the DataFrame. |
| DataFrame | replace(DataFrame df, Object oldValue, Object newValue) | Replaces all occurrences of a specific value with a new value throughout the DataFrame. |
| DataFrame | standardize(DataFrame df, String... columns) | Standardizes specified numeric columns using z-score normalization: z = (x - mean) / std. |
| Map<String,Double> | stats(DataFrame df, String column) | Calculates basic statistics for a specified column. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- DataCleaningService
  
  public DataCleaningService()

Method Details
- dropNa
  
  public DataFrame dropNa(DataFrame df)
  
  Removes all rows containing missing values from the DataFrame. Missing values are identified as null, empty strings, "null", or "nan" (case insensitive).
  
  Parameters:
  
  df - the DataFrame to clean
  
  Returns:
  
  new DataFrame with rows containing missing values removed
  
  Throws:
  
  IllegalArgumentException - if df is null
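
The missing-value rule described above can be sketched without the library. This page does not show the DataFrame API, so the sketch models rows as plain lists; the class and method names (`MissingValueSketch`, `isMissing`, `dropRowsWithMissing`) are hypothetical, and whether the real implementation also trims whitespace or treats `Double.NaN` as missing is not stated here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MissingValueSketch {

    // Treats null, "", "null", and "nan" (any case) as missing, per the description above.
    // Assumption: no whitespace trimming and no special handling of Double.NaN.
    static boolean isMissing(Object value) {
        if (value == null) {
            return true;
        }
        if (value instanceof String) {
            String s = (String) value;
            return s.isEmpty() || s.equalsIgnoreCase("null") || s.equalsIgnoreCase("nan");
        }
        return false;
    }

    // Returns a new list holding only the rows in which no cell is missing.
    static List<List<Object>> dropRowsWithMissing(List<List<Object>> rows) {
        if (rows == null) {
            throw new IllegalArgumentException("rows must not be null");
        }
        List<List<Object>> kept = new ArrayList<>();
        for (List<Object> row : rows) {
            boolean complete = true;
            for (Object cell : row) {
                if (isMissing(cell)) {
                    complete = false;
                    break;
                }
            }
            if (complete) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<Object>> rows = Arrays.asList(
                Arrays.<Object>asList("Ana", 31),
                Arrays.<Object>asList("NaN", 42),
                Arrays.<Object>asList("Rui", null));
        System.out.println(dropRowsWithMissing(rows)); // [[Ana, 31]]
    }
}
```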
- fillNa
  
  public DataFrame fillNa(DataFrame df, Object value)
  
  Replaces all missing values in the DataFrame with a specified value. Missing values are identified as null, empty strings, "null", or "nan" (case insensitive).
  
  Parameters:
  
  df - the DataFrame to process
  
  value - the value to use for replacing missing values
  
  Returns:
  
  new DataFrame with missing values replaced
  
  Throws:
  
  IllegalArgumentException - if df is null
- fillNaColumns
  
  public DataFrame fillNaColumns(DataFrame df, Object value, String... columns)
  
  Replaces missing values in specified columns with a given value.
  
  Parameters:
  
  df - the DataFrame to process
  
  value - the value to use for replacing missing values
  
  columns - array of column names where missing values should be replaced
  
  Returns:
  
  new DataFrame with missing values replaced in specified columns
  
  Throws:
  
  IllegalArgumentException - if df is null or if any specified column doesn't exist
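
A minimal sketch of column-scoped filling, assuming rows are modelled as `Map<String,Object>` because the DataFrame API is not shown on this page. `FillColumnsSketch` and `fillColumns` are hypothetical names, and the missing-value test reuses the rule documented for fillNa as an assumption, since fillNaColumns does not restate it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FillColumnsSketch {

    // Assumption: same missing-value rule as documented for fillNa.
    static boolean isMissing(Object v) {
        if (v == null) return true;
        if (!(v instanceof String)) return false;
        String s = (String) v;
        return s.isEmpty() || s.equalsIgnoreCase("null") || s.equalsIgnoreCase("nan");
    }

    // Copies every row and fills missing cells only in the named columns.
    static List<Map<String, Object>> fillColumns(
            List<Map<String, Object>> rows, Object value, String... columns) {
        if (rows == null) {
            throw new IllegalArgumentException("rows must not be null");
        }
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            Map<String, Object> copy = new LinkedHashMap<>(row);
            for (String column : columns) {
                if (!copy.containsKey(column)) {
                    throw new IllegalArgumentException("Unknown column: " + column);
                }
                if (isMissing(copy.get(column))) {
                    copy.put(column, value);
                }
            }
            result.add(copy);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("city", "");
        row.put("age", null);
        List<Map<String, Object>> rows = new ArrayList<>();
        rows.add(row);
        // Only "age" is filled; the empty "city" cell is left untouched.
        System.out.println(fillColumns(rows, 0, "age")); // [{city=, age=0}]
    }
}
```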
- dropDuplicates
  
  public DataFrame dropDuplicates(DataFrame df, String... columns)
  
  Removes duplicate rows from the DataFrame based on specified columns. If no columns are specified, checks all columns for duplicates.
  
  Parameters:
  
  df - the DataFrame to process
  
  columns - array of column names to check for duplicates
  
  Returns:
  
  new DataFrame with duplicate rows removed
  
  Throws:
  
  IllegalArgumentException - if df is null or if any specified column doesn't exist
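
The two modes described above (key columns given, or all columns when none are given) might look like the following sketch. Rows are modelled as maps, the names `DropDuplicatesSketch` and `row` are hypothetical, and keeping the first occurrence of each duplicate is an assumption; this page does not say which occurrence the library keeps.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class DropDuplicatesSketch {

    // Keeps the first row seen for each key (assumption) and preserves input order.
    static List<Map<String, Object>> dropDuplicates(
            List<Map<String, Object>> rows, String... columns) {
        if (rows == null) {
            throw new IllegalArgumentException("rows must not be null");
        }
        Set<Object> seen = new HashSet<>();
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            Object key;
            if (columns.length == 0) {
                key = row; // no columns given: compare whole rows
            } else {
                List<Object> values = new ArrayList<>();
                for (String column : columns) {
                    if (!row.containsKey(column)) {
                        throw new IllegalArgumentException("Unknown column: " + column);
                    }
                    values.add(row.get(column));
                }
                key = values;
            }
            if (seen.add(key)) {
                result.add(row);
            }
        }
        return result;
    }

    // Helper for building a two-column example row.
    static Map<String, Object> row(String name, String city) {
        Map<String, Object> r = new LinkedHashMap<>();
        r.put("name", name);
        r.put("city", city);
        return r;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = new ArrayList<>();
        rows.add(row("Ana", "Lisbon"));
        rows.add(row("Rui", "Lisbon"));
        rows.add(row("Ana", "Lisbon"));
        System.out.println(dropDuplicates(rows).size());         // 2
        System.out.println(dropDuplicates(rows, "city").size()); // 1
    }
}
```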
- replace
  
  public DataFrame replace(DataFrame df, Object oldValue, Object newValue)
  
  Replaces all occurrences of a specific value with a new value throughout the DataFrame.
  
  Parameters:
  
  df - the DataFrame to process
  
  oldValue - the value to be replaced
  
  newValue - the replacement value
  
  Returns:
  
  new DataFrame with values replaced
  
  Throws:
  
  IllegalArgumentException - if df is null
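
As an illustration of value replacement across every cell, the sketch below works on plain lists rather than a DataFrame. `ReplaceSketch` and `replaceAll` are hypothetical names, and matching with `Objects.equals` is an assumption; under that rule an `Integer` 1 does not match a `Long` 1L, and the library may compare values differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

final class ReplaceSketch {

    // Returns copies of the rows with every cell equal to oldValue swapped for newValue.
    // Objects.equals is null-safe, so oldValue may itself be null.
    static List<List<Object>> replaceAll(List<List<Object>> rows, Object oldValue, Object newValue) {
        if (rows == null) {
            throw new IllegalArgumentException("rows must not be null");
        }
        List<List<Object>> result = new ArrayList<>();
        for (List<Object> row : rows) {
            List<Object> copy = new ArrayList<>(row);
            copy.replaceAll(cell -> Objects.equals(cell, oldValue) ? newValue : cell);
            result.add(copy);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Object>> rows = Arrays.asList(
                Arrays.<Object>asList("N/A", 1),
                Arrays.<Object>asList(2, "N/A"));
        System.out.println(replaceAll(rows, "N/A", 0)); // [[0, 1], [2, 0]]
    }
}
```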
- isna
  
  public DataFrame isna(DataFrame df)
  
  Creates a boolean mask indicating missing values in the DataFrame.
  
  Parameters:
  
  df - the DataFrame to analyze
  
  Returns:
  
  new DataFrame containing boolean values (true for missing values)
  
  Throws:
  
  IllegalArgumentException - if df is null
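
A boolean mask of this kind, together with its complement (the notna case), can be sketched over a two-dimensional array. The names `NaMaskSketch`, `isna`, and `notna` below are illustrative stand-ins operating on `Object[][]`, not the library's DataFrame methods, and the missing-value rule is assumed to be the one documented for dropNa.

```java
import java.util.Arrays;

final class NaMaskSketch {

    // Assumption: same missing-value rule as documented for dropNa.
    static boolean isMissing(Object v) {
        if (v == null) return true;
        if (!(v instanceof String)) return false;
        String s = (String) v;
        return s.isEmpty() || s.equalsIgnoreCase("null") || s.equalsIgnoreCase("nan");
    }

    // true marks a missing cell.
    static boolean[][] isna(Object[][] cells) {
        if (cells == null) {
            throw new IllegalArgumentException("cells must not be null");
        }
        boolean[][] mask = new boolean[cells.length][];
        for (int i = 0; i < cells.length; i++) {
            mask[i] = new boolean[cells[i].length];
            for (int j = 0; j < cells[i].length; j++) {
                mask[i][j] = isMissing(cells[i][j]);
            }
        }
        return mask;
    }

    // true marks a present cell: the element-wise negation of isna.
    static boolean[][] notna(Object[][] cells) {
        boolean[][] mask = isna(cells);
        for (boolean[] row : mask) {
            for (int j = 0; j < row.length; j++) {
                row[j] = !row[j];
            }
        }
        return mask;
    }

    public static void main(String[] args) {
        Object[][] cells = { {"a", null}, {"nan", 3} };
        System.out.println(Arrays.deepToString(isna(cells)));  // [[false, true], [true, false]]
        System.out.println(Arrays.deepToString(notna(cells))); // [[true, false], [false, true]]
    }
}
```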
- notna
  
  public DataFrame notna(DataFrame df)
  
  Creates a boolean mask indicating non-missing values in the DataFrame.
  
  Parameters:
  
  df - the DataFrame to analyze
  
  Returns:
  
  new DataFrame containing boolean values (true for non-missing values)
  
  Throws:
  
  IllegalArgumentException - if df is null
- astype
  
  public DataFrame astype(DataFrame df, Map<String,Class<?>> typeMap)
  
  Converts columns to specified data types.
  
  Parameters:
  
  df - the DataFrame to process
  
  typeMap - map of column names to their target Java types
  
  Returns:
  
  new DataFrame with converted column types
  
  Throws:
  
  IllegalArgumentException - if df is null or if conversion fails
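
The `Map<String,Class<?>>` parameter suggests a per-column dispatch on the target class. The sketch below shows one way such a conversion could work on columns stored as lists; `TypeConversionSketch`, the supported target types, and the parse-from-text strategy are all assumptions, since this page does not list which types the library supports.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TypeConversionSketch {

    // Converts one value by parsing its trimmed text form. Nulls pass through unchanged.
    static Object convert(Object value, Class<?> target) {
        if (value == null) return null;
        String text = value.toString().trim();
        try {
            if (target == Integer.class) return Integer.valueOf(text);
            if (target == Long.class) return Long.valueOf(text);
            if (target == Double.class) return Double.valueOf(text);
            if (target == Boolean.class) return Boolean.valueOf(text);
            if (target == String.class) return value.toString();
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                    "Cannot convert '" + value + "' to " + target.getSimpleName(), e);
        }
        throw new IllegalArgumentException("Unsupported type: " + target.getName());
    }

    // Returns a new column map; only columns named in typeMap are converted.
    static Map<String, List<Object>> astype(
            Map<String, List<Object>> columns, Map<String, Class<?>> typeMap) {
        if (columns == null) {
            throw new IllegalArgumentException("columns must not be null");
        }
        Map<String, List<Object>> result = new LinkedHashMap<>(columns);
        for (Map.Entry<String, Class<?>> entry : typeMap.entrySet()) {
            List<Object> source = columns.get(entry.getKey());
            if (source == null) {
                throw new IllegalArgumentException("Unknown column: " + entry.getKey());
            }
            List<Object> converted = new ArrayList<>();
            for (Object v : source) {
                converted.add(convert(v, entry.getValue()));
            }
            result.put(entry.getKey(), converted);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<Object>> columns = new LinkedHashMap<>();
        columns.put("age", Arrays.<Object>asList("31", " 42 "));
        Map<String, Class<?>> typeMap = new LinkedHashMap<>();
        typeMap.put("age", Integer.class);
        System.out.println(astype(columns, typeMap)); // {age=[31, 42]}
    }
}
```

Note that this text-based approach rejects `"3.0"` as an `Integer`; a real implementation may choose to round or truncate instead.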
- groupBy
  
  public Map<List<Object>,DataFrame> groupBy(DataFrame df, String... columns)
  
  Groups DataFrame by specified columns.
  
  Parameters:
  
  df - the DataFrame to group
  
  columns - array of column names to group by
  
  Returns:
  
  Map of group keys to corresponding DataFrames
  
  Throws:
  
  IllegalArgumentException - if df is null or if any specified column doesn't exist
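
The `Map<List<Object>,DataFrame>` return type implies that each group key is the list of values taken from the grouping columns, in the order the columns were passed. A minimal sketch of that keying scheme follows, with each group held as a list of row maps instead of a DataFrame; `GroupBySketch` and `row` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GroupBySketch {

    // Groups rows by the values found in the given columns; keys follow first-seen order.
    static Map<List<Object>, List<Map<String, Object>>> groupBy(
            List<Map<String, Object>> rows, String... columns) {
        if (rows == null) {
            throw new IllegalArgumentException("rows must not be null");
        }
        Map<List<Object>, List<Map<String, Object>>> groups = new LinkedHashMap<>();
        for (Map<String, Object> row : rows) {
            List<Object> key = new ArrayList<>();
            for (String column : columns) {
                if (!row.containsKey(column)) {
                    throw new IllegalArgumentException("Unknown column: " + column);
                }
                key.add(row.get(column));
            }
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
        }
        return groups;
    }

    // Helper for building a two-column example row.
    static Map<String, Object> row(String name, String city) {
        Map<String, Object> r = new LinkedHashMap<>();
        r.put("name", name);
        r.put("city", city);
        return r;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = new ArrayList<>();
        rows.add(row("Ana", "Lisbon"));
        rows.add(row("Rui", "Porto"));
        rows.add(row("Eva", "Lisbon"));
        Map<List<Object>, List<Map<String, Object>>> groups = groupBy(rows, "city");
        System.out.println(groups.keySet());                              // [[Lisbon], [Porto]]
        System.out.println(groups.get(Arrays.asList("Lisbon")).size());   // 2
    }
}
```

Because `List.equals` compares elements in order, a group can be looked up with any list holding the same key values, as the last line of `main` does.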
- clean
  
  public DataFrame clean(DataFrame df)
  
  Performs comprehensive data cleaning: removes nulls, removes duplicates, and fills any remaining nulls.
  
  Parameters:
  
  df - the DataFrame to clean
  
  Returns:
  
  cleaned DataFrame
  
  Throws:
  
  IllegalArgumentException - if df is null
- standardize
  
  public DataFrame standardize(DataFrame df, String... columns)
  
  Standardizes specified numeric columns using z-score normalization: z = (x - mean) / std.
  
  Parameters:
  
  df - the DataFrame to process
  
  columns - array of column names to standardize
  
  Returns:
  
  new DataFrame with standardized columns
  
  Throws:
  
  IllegalArgumentException - if df is null or if any specified column isn't numeric
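
The z-score formula can be worked through on a single numeric column. The sketch below is not the library's code: `ZScoreSketch` is a hypothetical name, the population standard deviation (dividing by n) is an assumption because this page does not say whether the library divides by n or n - 1, and rejecting a zero-variance column is likewise a design choice made here.

```java
import java.util.Arrays;

final class ZScoreSketch {

    // Returns (x - mean) / std for every value, using the population standard deviation.
    static double[] standardize(double[] values) {
        if (values == null || values.length == 0) {
            throw new IllegalArgumentException("values must not be null or empty");
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        double mean = sum / values.length;
        double squared = 0.0;
        for (double v : values) {
            squared += (v - mean) * (v - mean);
        }
        double std = Math.sqrt(squared / values.length);
        if (std == 0.0) {
            // All values identical: the z-score is undefined.
            throw new IllegalArgumentException("column has zero variance");
        }
        double[] result = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            result[i] = (values[i] - mean) / std;
        }
        return result;
    }

    public static void main(String[] args) {
        // mean = 5, population std = 2
        double[] z = standardize(new double[] {2, 4, 4, 4, 5, 5, 7, 9});
        System.out.println(Arrays.toString(z)); // [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
    }
}
```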
- normalize
  
  public DataFrame normalize(DataFrame df, String... columns)
  
  Normalizes specified numeric columns to range [0,1].
  
  Parameters:
  
  df - the DataFrame to process
  
  columns - array of column names to normalize
  
  Returns:
  
  new DataFrame with normalized columns
  
  Throws:
  
  IllegalArgumentException - if df is null or if any specified column isn't numeric
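
Scaling to [0,1] is conventionally done with min-max normalization, (x - min) / (max - min). The sketch below assumes that formula, which this page does not spell out, and uses a hypothetical `MinMaxSketch` class. Mapping a constant column to 0.0 is also an assumption; the library's handling of that case is not documented here.

```java
import java.util.Arrays;

final class MinMaxSketch {

    // Rescales values so the smallest maps to 0.0 and the largest to 1.0.
    static double[] normalize(double[] values) {
        if (values == null || values.length == 0) {
            throw new IllegalArgumentException("values must not be null or empty");
        }
        double min = values[0];
        double max = values[0];
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] result = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // Assumption: a constant column (range 0) maps to 0.0 to avoid dividing by zero.
            result[i] = (range == 0.0) ? 0.0 : (values[i] - min) / range;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(normalize(new double[] {10, 20, 30}))); // [0.0, 0.5, 1.0]
    }
}
```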
- stats
  
  public Map<String,Double> stats(DataFrame df, String column)
  
  Calculates basic statistics for a specified column.
  
  Parameters:
  
  df - the DataFrame to analyze
  
  column - the column name to analyze
  
  Returns:
  
  Map containing statistics (mean, std, min, max, median)
  
  Throws:
  
  IllegalArgumentException - if df is null or column isn't numeric
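
For reference, the five statistics named above could be computed as in this sketch. The map keys ("mean", "std", "min", "max", "median") are taken from the Returns description but their exact spelling in the library is an assumption, as are the use of the population standard deviation and the averaging of the two middle values for an even-length column. `ColumnStatsSketch` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

final class ColumnStatsSketch {

    static Map<String, Double> stats(double[] values) {
        if (values == null || values.length == 0) {
            throw new IllegalArgumentException("values must not be null or empty");
        }
        double[] sorted = values.clone(); // sort a copy so the input stays untouched
        Arrays.sort(sorted);
        int n = sorted.length;

        double sum = 0.0;
        for (double v : sorted) {
            sum += v;
        }
        double mean = sum / n;

        double squared = 0.0;
        for (double v : sorted) {
            squared += (v - mean) * (v - mean);
        }

        // Odd length: middle element. Even length: average of the two middle elements.
        double median = (n % 2 == 1)
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;

        Map<String, Double> result = new LinkedHashMap<>();
        result.put("mean", mean);
        result.put("std", Math.sqrt(squared / n)); // population standard deviation (assumption)
        result.put("min", sorted[0]);
        result.put("max", sorted[n - 1]);
        result.put("median", median);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(stats(new double[] {2, 4, 4, 4, 5, 5, 7, 9}));
        // {mean=5.0, std=2.0, min=2.0, max=9.0, median=4.5}
    }
}
```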
- convert
  
  public DataFrame convert(DataFrame df, Map<String,Class<?>> typeMap)
  
  Converts column data types according to the provided type map.
  
  Parameters:
  
  df - the DataFrame to process
  
  typeMap - map of column names to target types
  
  Returns:
  
  new DataFrame with converted types
  
  Throws:
  
  IllegalArgumentException - if conversion fails or type is unsupported
- interpolate
  
  public DataFrame interpolate(DataFrame df, String method, String... columns)
  
  Interpolates missing values in specified columns using various methods.
  
  Parameters:
  
  df - the DataFrame to process
  
  method - interpolation method ("linear", "spline", "loess", or "neville")
  
  columns - array of column names to interpolate
  
  Returns:
  
  new DataFrame with interpolated values
  
  Throws:
  
  IllegalArgumentException - if method is invalid or interpolation fails
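
Of the four methods, "linear" is the simplest to illustrate: each gap is filled along the straight line between the nearest known values on either side. The sketch below covers only that case, on a `Double[]` column where null marks a missing value. `LinearInterpolationSketch` is a hypothetical name, equal spacing between rows is assumed, and leaving leading or trailing gaps unfilled is an assumption about edge behaviour that this page does not document.

```java
import java.util.Arrays;

final class LinearInterpolationSketch {

    // Fills null entries that lie between two known values; other nulls are left as-is.
    static Double[] interpolateLinear(Double[] values) {
        if (values == null) {
            throw new IllegalArgumentException("values must not be null");
        }
        Double[] result = values.clone();
        int prev = -1; // index of the most recent known value
        for (int i = 0; i < result.length; i++) {
            if (result[i] == null) {
                continue;
            }
            if (prev >= 0 && i - prev > 1) {
                double start = result[prev];
                double end = result[i];
                for (int j = prev + 1; j < i; j++) {
                    // Fraction of the way from prev to i, assuming equally spaced rows.
                    double t = (double) (j - prev) / (i - prev);
                    result[j] = start + t * (end - start);
                }
            }
            prev = i;
        }
        return result;
    }

    public static void main(String[] args) {
        Double[] column = {10.0, null, 20.0, null};
        System.out.println(Arrays.toString(interpolateLinear(column))); // [10.0, 15.0, 20.0, null]
    }
}
```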
