Package com.leumanuel.woozydata.service
Class DataCleaningService
java.lang.Object
com.leumanuel.woozydata.service.DataCleaningService
Service class responsible for data cleaning and transformation operations on DataFrames.
Provides methods for handling missing values, removing duplicates, standardizing data,
and performing various data cleaning tasks.
- Version:
- 1.0
- Author:
- Leu A. Manuel
-
Constructor Summary
Constructors
- DataCleaningService()
Method Summary
Method and Description
- astype: Converts columns to specified data types.
- clean: Performs comprehensive data cleaning including removing nulls, duplicates, and filling remaining nulls.
- convert: Converts column data types according to the provided type map.
- dropDuplicates(DataFrame df, String... columns): Removes duplicate rows from the DataFrame based on specified columns.
- dropNa: Removes all rows containing missing values from the DataFrame.
- fillNa: Replaces all missing values in the DataFrame with a specified value.
- fillNaColumns(DataFrame df, Object value, String... columns): Replaces missing values in specified columns with a given value.
- groupBy: Groups DataFrame by specified columns.
- interpolate(DataFrame df, String method, String... columns): Interpolates missing values in specified columns using various methods.
- isna: Creates a boolean mask indicating missing values in the DataFrame.
- normalize: Normalizes specified numeric columns to range [0,1].
- notna: Creates a boolean mask indicating non-missing values in the DataFrame.
- replace: Replaces all occurrences of a specific value with a new value throughout the DataFrame.
- standardize(DataFrame df, String... columns): Standardizes specified numeric columns using z-score normalization (x - mean) / std.
- stats: Calculates basic statistics for a specified column.
-
Constructor Details
-
DataCleaningService
public DataCleaningService()
-
-
Method Details
-
dropNa
Removes all rows containing missing values from the DataFrame. Missing values are identified as null, empty strings, "null", or "nan" (case insensitive).
- Parameters:
df - the DataFrame to clean
- Returns:
- new DataFrame with rows containing missing values removed
- Throws:
IllegalArgumentException - if df is null
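The missing-value rule described for dropNa can be illustrated with a minimal sketch. It does not use the library's DataFrame class: rows are modeled as a `List<Map<String, Object>>`, and the class and method names (`MissingValueSketch`, `isMissing`, `dropMissingRows`) are hypothetical. Whether the real implementation also treats `Double.NaN` or whitespace-only strings as missing is not stated in this page, so the sketch does not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of DataCleaningService. */
final class MissingValueSketch {

    /** True for null, empty strings, "null", or "nan" (case insensitive). */
    static boolean isMissing(Object value) {
        if (value == null) {
            return true;
        }
        if (value instanceof String) {
            String s = (String) value;
            return s.isEmpty() || s.equalsIgnoreCase("null") || s.equalsIgnoreCase("nan");
        }
        return false;
    }

    /** Returns a new list holding only the rows that have no missing cell. */
    static List<Map<String, Object>> dropMissingRows(List<Map<String, Object>> rows) {
        if (rows == null) {
            throw new IllegalArgumentException("rows must not be null");
        }
        List<Map<String, Object>> kept = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            boolean complete = true;
            for (Object cell : row.values()) {
                if (isMissing(cell)) {
                    complete = false;
                    break;
                }
            }
            if (complete) {
                kept.add(row);
            }
        }
        return kept;
    }
}
```

The input list is left untouched, mirroring the "new DataFrame" wording of the return description.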
-
fillNa
Replaces all missing values in the DataFrame with a specified value. Missing values are identified as null, empty strings, "null", or "nan" (case insensitive).
- Parameters:
df - the DataFrame to process
value - the value to use for replacing missing values
- Returns:
- new DataFrame with missing values replaced
- Throws:
IllegalArgumentException - if df is null
-
fillNaColumns
Replaces missing values in specified columns with a given value.
- Parameters:
df - the DataFrame to process
value - the value to use for replacing missing values
columns - array of column names where missing values should be replaced
- Returns:
- new DataFrame with missing values replaced in specified columns
- Throws:
IllegalArgumentException - if df is null or if any specified column doesn't exist
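A column-restricted fill along these lines might look like the sketch below. It is an assumption-laden illustration rather than the library's code: rows are plain maps instead of a DataFrame, the names `FillNaColumnsSketch` and `isMissing` are hypothetical, and the missing-value test reuses the rule documented for dropNa and fillNa.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of DataCleaningService. */
final class FillNaColumnsSketch {

    /** Assumed rule: null, empty string, "null", or "nan" (case insensitive). */
    static boolean isMissing(Object value) {
        if (value == null) {
            return true;
        }
        if (value instanceof String) {
            String s = (String) value;
            return s.isEmpty() || s.equalsIgnoreCase("null") || s.equalsIgnoreCase("nan");
        }
        return false;
    }

    /** Copies every row and fills missing cells only in the named columns. */
    static List<Map<String, Object>> fillNaColumns(
            List<Map<String, Object>> rows, Object value, String... columns) {
        if (rows == null) {
            throw new IllegalArgumentException("rows must not be null");
        }
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            Map<String, Object> copy = new LinkedHashMap<>(row);
            for (String column : columns) {
                if (!copy.containsKey(column)) {
                    throw new IllegalArgumentException("Unknown column: " + column);
                }
                if (isMissing(copy.get(column))) {
                    copy.put(column, value);
                }
            }
            result.add(copy);
        }
        return result;
    }
}
```

Columns that are not listed keep their missing markers, which is the difference from a whole-frame fill.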
-
dropDuplicates
Removes duplicate rows from the DataFrame based on specified columns. If no columns are specified, checks all columns for duplicates.
- Parameters:
df - the DataFrame to process
columns - array of column names to check for duplicates
- Returns:
- new DataFrame with duplicate rows removed
- Throws:
IllegalArgumentException - if df is null or if any specified column doesn't exist
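The key-based deduplication described above can be sketched as follows. This is only an illustration under stated assumptions: rows are modeled as maps rather than a DataFrame, the class name `DropDuplicatesSketch` is hypothetical, and keeping the first occurrence of each key is an assumption, since this page does not say which duplicate survives.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for illustration; not part of DataCleaningService. */
final class DropDuplicatesSketch {

    /** Keeps the first row seen for each distinct key (assumed policy). */
    static List<Map<String, Object>> dropDuplicates(
            List<Map<String, Object>> rows, String... columns) {
        if (rows == null) {
            throw new IllegalArgumentException("rows must not be null");
        }
        Set<List<Object>> seen = new HashSet<>();
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            List<Object> key = new ArrayList<>();
            if (columns.length == 0) {
                // No columns given: the whole row acts as the key.
                key.add(row);
            } else {
                for (String column : columns) {
                    if (!row.containsKey(column)) {
                        throw new IllegalArgumentException("Unknown column: " + column);
                    }
                    key.add(row.get(column));
                }
            }
            // Set.add returns false when the key was already present.
            if (seen.add(key)) {
                result.add(row);
            }
        }
        return result;
    }
}
```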
-
replace
Replaces all occurrences of a specific value with a new value throughout the DataFrame.
- Parameters:
df - the DataFrame to process
oldValue - the value to be replaced
newValue - the replacement value
- Returns:
- new DataFrame with values replaced
- Throws:
IllegalArgumentException - if df is null
-
isna
Creates a boolean mask indicating missing values in the DataFrame.
- Parameters:
df - the DataFrame to analyze
- Returns:
- new DataFrame containing boolean values (true for missing values)
- Throws:
IllegalArgumentException - if df is null
-
notna
Creates a boolean mask indicating non-missing values in the DataFrame.
- Parameters:
df - the DataFrame to analyze
- Returns:
- new DataFrame containing boolean values (true for non-missing values)
- Throws:
IllegalArgumentException - if df is null
-
astype
Converts columns to specified data types.
- Parameters:
df - the DataFrame to process
typeMap - map of column names to their target Java types
- Returns:
- new DataFrame with converted column types
- Throws:
IllegalArgumentException - if df is null or if conversion fails
-
groupBy
Groups DataFrame by specified columns.
- Parameters:
df - the DataFrame to group
columns - array of column names to group by
- Returns:
- Map of group keys to corresponding DataFrames
- Throws:
IllegalArgumentException - if df is null or if any specified column doesn't exist
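Grouping rows into a map keyed by the grouping-column values can be sketched as below. The type of the group key used by the real method is not given on this page, so the `List<Object>` key, the map-based rows, and the class name `GroupBySketch` are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of DataCleaningService. */
final class GroupBySketch {

    /** Maps each distinct combination of column values to its rows. */
    static Map<List<Object>, List<Map<String, Object>>> groupBy(
            List<Map<String, Object>> rows, String... columns) {
        if (rows == null) {
            throw new IllegalArgumentException("rows must not be null");
        }
        // LinkedHashMap keeps groups in first-seen order.
        Map<List<Object>, List<Map<String, Object>>> groups = new LinkedHashMap<>();
        for (Map<String, Object> row : rows) {
            List<Object> key = new ArrayList<>();
            for (String column : columns) {
                if (!row.containsKey(column)) {
                    throw new IllegalArgumentException("Unknown column: " + column);
                }
                key.add(row.get(column));
            }
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
        }
        return groups;
    }
}
```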
-
clean
Performs comprehensive data cleaning: removes nulls, removes duplicates, and fills any remaining nulls.
- Parameters:
df - the DataFrame to clean
- Returns:
- cleaned DataFrame
- Throws:
IllegalArgumentException - if df is null
-
standardize
Standardizes specified numeric columns using z-score normalization (x - mean) / std.
- Parameters:
df - the DataFrame to process
columns - array of column names to standardize
- Returns:
- new DataFrame with standardized columns
- Throws:
IllegalArgumentException - if df is null or if any specified column isn't numeric
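The z-score formula (x - mean) / std can be worked through on a single column with a minimal sketch. Several details are assumptions, because this page does not specify them: the column is modeled as a `double[]`, the population standard deviation (dividing by n) is used, and a constant column is mapped to zeros. The class name `ZScoreSketch` is hypothetical.

```java
/** Hypothetical helper for illustration; not part of DataCleaningService. */
final class ZScoreSketch {

    /** Returns (x - mean) / std for each value, using the population std. */
    static double[] standardize(double[] values) {
        if (values == null || values.length == 0) {
            throw new IllegalArgumentException("values must be non-empty");
        }
        int n = values.length;
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        double mean = sum / n;

        double squaredDeviations = 0.0;
        for (double v : values) {
            squaredDeviations += (v - mean) * (v - mean);
        }
        // Assumption: population standard deviation (divide by n, not n - 1).
        double std = Math.sqrt(squaredDeviations / n);

        double[] result = new double[n];
        for (int i = 0; i < n; i++) {
            // Assumption: a constant column (std == 0) standardizes to zeros.
            result[i] = (std == 0.0) ? 0.0 : (values[i] - mean) / std;
        }
        return result;
    }
}
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and the population standard deviation is 2, so 2 maps to -1.5 and 9 maps to 2.0.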
-
normalize
Normalizes specified numeric columns to range [0,1].
- Parameters:
df - the DataFrame to process
columns - array of column names to normalize
- Returns:
- new DataFrame with normalized columns
- Throws:
IllegalArgumentException - if df is null or if any specified column isn't numeric
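Scaling to [0,1] is commonly done with min-max normalization, (x - min) / (max - min). The sketch below assumes that is the technique meant here, which this page does not confirm. The `double[]` column model, the handling of a constant column (mapped to zeros), and the class name `MinMaxSketch` are likewise assumptions.

```java
/** Hypothetical helper for illustration; not part of DataCleaningService. */
final class MinMaxSketch {

    /** Returns (x - min) / (max - min) for each value. */
    static double[] normalize(double[] values) {
        if (values == null || values.length == 0) {
            throw new IllegalArgumentException("values must be non-empty");
        }
        double min = values[0];
        double max = values[0];
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;

        double[] result = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // Assumption: a constant column (range == 0) normalizes to zeros.
            result[i] = (range == 0.0) ? 0.0 : (values[i] - min) / range;
        }
        return result;
    }
}
```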
-
stats
Calculates basic statistics for a specified column.
- Parameters:
df - the DataFrame to analyze
column - the column name to analyze
- Returns:
- Map containing statistics (mean, std, min, max, median)
- Throws:
IllegalArgumentException - if df is null or column isn't numeric
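The five statistics listed in the return description can be computed for one column as in the sketch below. It is illustrative only: the map keys mirror the names in the return description, but the actual key strings, the value type, and the choice between population and sample standard deviation are not stated on this page. The sketch assumes `Map<String, Double>`, the population standard deviation, and a hypothetical class name `ColumnStatsSketch`.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of DataCleaningService. */
final class ColumnStatsSketch {

    /** Computes mean, std (population), min, max, and median. */
    static Map<String, Double> stats(double[] values) {
        if (values == null || values.length == 0) {
            throw new IllegalArgumentException("values must be non-empty");
        }
        // Sort a copy so the caller's array is not modified.
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;

        double sum = 0.0;
        for (double v : sorted) {
            sum += v;
        }
        double mean = sum / n;

        double squaredDeviations = 0.0;
        for (double v : sorted) {
            squaredDeviations += (v - mean) * (v - mean);
        }
        // Assumption: population standard deviation (divide by n).
        double std = Math.sqrt(squaredDeviations / n);

        // Median: middle element, or the average of the two middle elements.
        double median = (n % 2 == 1)
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;

        Map<String, Double> result = new LinkedHashMap<>();
        result.put("mean", mean);
        result.put("std", std);
        result.put("min", sorted[0]);
        result.put("max", sorted[n - 1]);
        result.put("median", median);
        return result;
    }
}
```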
-
convert
Converts column data types according to the provided type map.
- Parameters:
df - the DataFrame to process
typeMap - map of column names to target types
- Returns:
- new DataFrame with converted types
- Throws:
IllegalArgumentException - if conversion fails or type is unsupported
-
interpolate
Interpolates missing values in specified columns using various methods.
- Parameters:
df - the DataFrame to process
method - interpolation method ("linear", "spline", "loess", or "neville")
columns - array of column names to interpolate
- Returns:
- new DataFrame with interpolated values
- Throws:
IllegalArgumentException - if method is invalid or interpolation fails
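Of the listed methods, "linear" is the simplest to illustrate. The sketch below fills interior gaps in one column by drawing a straight line between the nearest known neighbours, using the row index as the x coordinate. It is not the library's implementation: the `Double[]` column model with null for missing values, the index-based x axis, leaving leading and trailing gaps unfilled, and the class name `LinearInterpolationSketch` are all assumptions.

```java
/** Hypothetical helper for illustration; not part of DataCleaningService. */
final class LinearInterpolationSketch {

    /** Fills null entries that lie between two known values; returns a copy. */
    static Double[] linear(Double[] values) {
        if (values == null) {
            throw new IllegalArgumentException("values must not be null");
        }
        Double[] out = values.clone();
        int previousKnown = -1; // index of the most recent non-null value
        for (int i = 0; i < out.length; i++) {
            if (out[i] == null) {
                continue;
            }
            if (previousKnown >= 0 && i - previousKnown > 1) {
                double start = out[previousKnown];
                double end = out[i];
                for (int j = previousKnown + 1; j < i; j++) {
                    // Fraction of the way from the left neighbour to the right one.
                    double t = (double) (j - previousKnown) / (i - previousKnown);
                    out[j] = start + t * (end - start);
                }
            }
            previousKnown = i;
        }
        // Assumption: gaps before the first or after the last known value stay null.
        return out;
    }
}
```

For the column 1.0, null, null, 4.0, null this yields 1.0, 2.0, 3.0, 4.0 and leaves the trailing entry null.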
-