java.lang.Object

com.leumanuel.woozydata.service.DataAnalysisService

public class DataAnalysisService extends Object

Service class for data analysis operations. Provides statistical calculations and data transformations.

Constructor Summary

| Constructor | Description |
| --- | --- |
| DataAnalysisService() | |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| DataFrame | analyze(DataFrame df, String column) | Analyzes a specified column in the DataFrame and calculates basic statistical measures. |
| double | avg(DataFrame df, String column) | Calculates the arithmetic mean (average) of values in a specified column. |
| double | calculateCoefficientOfVariation(DataFrame dataFrame, String column) | Calculates the coefficient of variation (CV) for a specified column. |
| double | calculateMean(DataFrame dataFrame, String column) | Calculates the arithmetic mean (average) of numeric values in a specified column. |
| double | calculateMedian(DataFrame dataFrame, String column) | Calculates the median value of a specified column. |
| Object | calculateMode(DataFrame dataFrame, String column) | Calculates the mode (most frequent value) of a specified column. |
| List<Object> | calculateMultipleMode(DataFrame dataFrame, String column) | Calculates multiple modes if they exist in the specified column. |
| double | calculatePopulationStandardDeviation(DataFrame dataFrame, String column) | Calculates the population standard deviation of a specified column. |
| double | calculateStandardDeviation(DataFrame dataFrame, String column) | Calculates the standard deviation of a specified column. |
| double | calculateVariance(DataFrame dataFrame, String column) | Calculates the variance of a specified column. |
| DataFrame | correlation(DataFrame df) | Performs correlation analysis between numeric columns. |
| Map<String,Double> | getDispersionStatistics(DataFrame dataFrame, String column) | Generates a summary of dispersion statistics for a specified column. |
| Map<Object,Long> | getFrequencyDistribution(DataFrame dataFrame, String column) | Gets the frequency distribution of values in a specified column. |
| Map<String,Double> | stats(DataFrame df, String column) | Calculates statistical measures for a specified column. |
| double | sum(DataFrame df, String column) | Calculates the sum of all values in a specified column. |
| DataFrame | timeSeriesAnalysis(DataFrame df, String timeColumn, String valueColumn, int windowSize) | Performs time series analysis. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- DataAnalysisService
  
  public DataAnalysisService()

Method Details
- analyze
  
  public DataFrame analyze(DataFrame df, String column)
  
  Analyzes a specified column in the DataFrame and calculates basic statistical measures.
  
  Parameters:
  
  df - DataFrame containing the data to analyze
  
  column - Name of the column to analyze
  
  Returns:
  
  DataFrame containing analysis results including count, mean, std, min, max, median, skewness, and kurtosis
  
  Throws:
  
  IllegalArgumentException - if column contains non-numeric data
- stats
  
  public Map<String,Double> stats(DataFrame df, String column)
  
  Calculates statistical measures for a specified column.
  
  Parameters:
  
  df - DataFrame containing the data
  
  column - Name of the column to analyze
  
  Returns:
  
  Map containing basic statistical measures (mean, std, min, max, median)
  
  Throws:
  
  IllegalArgumentException - if column contains non-numeric data
- sum
  
  public double sum(DataFrame df, String column)
  
  Calculates the sum of all values in a specified column.
  
  Parameters:
  
  df - DataFrame containing the data
  
  column - Name of the column to sum
  
  Returns:
  
  Sum of all numeric values in the column
  
  Throws:
  
  IllegalArgumentException - if column contains non-numeric data
- avg
  
  public double avg(DataFrame df, String column)
  
  Calculates the arithmetic mean (average) of values in a specified column.
  
  Parameters:
  
  df - DataFrame containing the data
  
  column - Name of the column to average
  
  Returns:
  
  Average of all numeric values in the column
  
  Throws:
  
  IllegalArgumentException - if column contains non-numeric data
- correlation
  
  public DataFrame correlation(DataFrame df)
  
  Performs correlation analysis between numeric columns.
  
  Parameters:
  
  df - DataFrame to analyze
  
  Returns:
  
  Correlation matrix as DataFrame
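
The entry above does not say which correlation coefficient is used. As an illustration only, the following sketch assumes Pearson correlation and works on plain `double[]` columns instead of a DataFrame; the class and method names (`CorrelationSketch`, `pearson`, `correlationMatrix`) are hypothetical and not part of this library.

```java
final class CorrelationSketch {
    // Pearson correlation coefficient of two equally long samples (assumed measure).
    // Returns NaN when either sample has zero variance (0/0).
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two samples of equal length >= 2");
        }
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    // Symmetric matrix whose entry [i][j] is the correlation of column i with column j.
    static double[][] correlationMatrix(double[][] columns) {
        int k = columns.length;
        double[][] matrix = new double[k][k];
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < k; j++) {
                matrix[i][j] = pearson(columns[i], columns[j]);
            }
        }
        return matrix;
    }
}
```

In a matrix of this kind the diagonal is 1 for any non-constant column, and the entries are mirrored across the diagonal.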
- timeSeriesAnalysis
  
  public DataFrame timeSeriesAnalysis(DataFrame df, String timeColumn, String valueColumn, int windowSize)
  
  Performs time series analysis.
  
  Parameters:
  
  df - DataFrame with time series data
  
  timeColumn - Column containing time values
  
  valueColumn - Column containing values to analyze
  
  windowSize - Rolling window size
  
  Returns:
  
  DataFrame with analysis results
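
The entry above names a rolling window but does not list which statistics are computed over it. The sketch below shows one common choice, a trailing rolling mean, purely as an assumption; it operates on a `double[]` that is taken to be already ordered by time, and `RollingMeanSketch` and `rollingMean` are hypothetical names.

```java
final class RollingMeanSketch {
    // Trailing rolling mean: out[i] averages values[i - windowSize + 1 .. i].
    // Positions that do not yet have a full window are set to NaN.
    static double[] rollingMean(double[] values, int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        double[] out = new double[values.length];
        double sum = 0.0;
        for (int i = 0; i < values.length; i++) {
            sum += values[i];
            if (i >= windowSize) {
                sum -= values[i - windowSize]; // drop the value that left the window
            }
            out[i] = (i >= windowSize - 1) ? sum / windowSize : Double.NaN;
        }
        return out;
    }
}
```

Keeping a running sum makes each step constant time, at the cost of some accumulated floating-point error on very long series.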
- calculateMedian
  
  public double calculateMedian(DataFrame dataFrame, String column)
  
  Calculates the median value of a specified column. The median is the value separating the higher half from the lower half of a data sample.
  
  Parameters:
  
  dataFrame - DataFrame containing the data
  
  column - Name of the column to analyze
  
  Returns:
  
  The median value of the specified column
  
  Throws:
  
  IllegalArgumentException - if column contains non-numeric data
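
The definition of the median above can be sketched on a plain list as follows. This is a minimal illustration, not this library's implementation; `MedianSketch` is a hypothetical name, and averaging the two middle values for an even count is the usual convention, assumed here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class MedianSketch {
    // Median of a non-empty list: the middle element of the sorted values,
    // or the mean of the two middle elements when the count is even.
    static double median(List<Double> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("values must not be empty");
        }
        List<Double> sorted = new ArrayList<>(values); // copy so the caller's list keeps its order
        Collections.sort(sorted);
        int n = sorted.size();
        int mid = n / 2;
        if (n % 2 == 1) {
            return sorted.get(mid);
        }
        return (sorted.get(mid - 1) + sorted.get(mid)) / 2.0;
    }
}
```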
- calculateVariance
  
  public double calculateVariance(DataFrame dataFrame, String column)
  
  Calculates the variance of a specified column. Variance measures how far a set of numbers is spread out from its average value.
  
  Parameters:
  
  dataFrame - DataFrame containing the data
  
  column - Name of the column to analyze
  
  Returns:
  
  The variance of the specified column
  
  Throws:
  
  IllegalArgumentException - if column contains non-numeric data
- calculateMode
  
  public Object calculateMode(DataFrame dataFrame, String column)
  
  Calculates the mode (most frequent value) of a specified column. If multiple modes exist, returns the first one found. Works with both numeric and non-numeric data.
  
  Parameters:
  
  dataFrame - DataFrame containing the data
  
  column - Name of the column to analyze
  
  Returns:
  
  The mode value of the specified column
  
  Throws:
  
  IllegalArgumentException - if the column is empty or contains only null values
- calculateMultipleMode
  
  public List<Object> calculateMultipleMode(DataFrame dataFrame, String column)
  
  Calculates multiple modes if they exist in the specified column. Returns all values that appear with the highest frequency.
  
  Parameters:
  
  dataFrame - DataFrame containing the data
  
  column - Name of the column to analyze
  
  Returns:
  
  List of mode values
  
  Throws:
  
  IllegalArgumentException - if the column is empty or contains only null values
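
The difference between calculateMode and calculateMultipleMode can be illustrated on a plain list. The sketch below is an assumption-laden example rather than this library's code: `ModeSketch` is a hypothetical name, nulls are skipped, and "first one found" is interpreted as first in encounter order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ModeSketch {
    // Every value that occurs with the highest frequency, in first-seen order.
    static List<Object> modes(List<?> values) {
        Map<Object, Long> counts = new LinkedHashMap<>(); // keeps encounter order
        for (Object v : values) {
            if (v != null) {
                counts.merge(v, 1L, Long::sum);
            }
        }
        if (counts.isEmpty()) {
            throw new IllegalArgumentException("no non-null values");
        }
        long max = Collections.max(counts.values());
        List<Object> result = new ArrayList<>();
        for (Map.Entry<Object, Long> e : counts.entrySet()) {
            if (e.getValue() == max) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    // Single mode: the first of the tied values.
    static Object mode(List<?> values) {
        return modes(values).get(0);
    }
}
```

Because the counts are keyed on `Object`, the same approach works for strings and other non-numeric values.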
- getFrequencyDistribution
  
  public Map<Object,Long> getFrequencyDistribution(DataFrame dataFrame, String column)
  
  Gets the frequency distribution of values in a specified column.
  
  Parameters:
  
  dataFrame - DataFrame containing the data
  
  column - Name of the column to analyze
  
  Returns:
  
  Map with values and their frequencies
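
A frequency distribution of this shape can be built with the standard `Collectors.groupingBy` and `Collectors.counting` collectors. The sketch below is illustrative only: `FrequencySketch` is a hypothetical name, it is generic where the documented method returns `Map<Object,Long>`, and dropping nulls is an assumption (the entry above does not say how nulls are treated).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.function.Function;
import java.util.stream.Collectors;

final class FrequencySketch {
    // Maps each distinct non-null value to the number of times it occurs.
    static <T> Map<T, Long> frequencies(List<T> values) {
        return values.stream()
                .filter(Objects::nonNull) // groupingBy rejects null keys
                .collect(Collectors.groupingBy(
                        Function.identity(),
                        LinkedHashMap::new, // keep first-seen order of the keys
                        Collectors.counting()));
    }
}
```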
- calculateStandardDeviation
  
  public double calculateStandardDeviation(DataFrame dataFrame, String column)
  
  Calculates the standard deviation of a specified column. Standard deviation measures the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean, while a high standard deviation indicates that the values are spread out over a wider range.
  
  Parameters:
  
  dataFrame - DataFrame containing the data
  
  column - Name of the column to analyze
  
  Returns:
  
  The standard deviation of the specified column
  
  Throws:
  
  IllegalArgumentException - if column contains non-numeric data or is empty
  
  NullPointerException - if dataFrame or column is null
- calculatePopulationStandardDeviation
  
  public double calculatePopulationStandardDeviation(DataFrame dataFrame, String column)
  
  Calculates the population standard deviation of a specified column. Similar to sample standard deviation but uses n instead of (n-1) in the denominator. Use this when working with complete populations rather than samples.
  
  Parameters:
  
  dataFrame - DataFrame containing the data
  
  column - Name of the column to analyze
  
  Returns:
  
  The population standard deviation of the specified column
  
  Throws:
  
  IllegalArgumentException - if column contains non-numeric data or is empty
  
  NullPointerException - if dataFrame or column is null
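
The description above implies that calculateStandardDeviation divides by (n-1) while this method divides by n. The sketch below contrasts the two denominators on a plain array; `StdDevSketch` and its method names are hypothetical, and the code is an illustration rather than this library's implementation.

```java
final class StdDevSketch {
    // Sum of squared deviations from the arithmetic mean.
    private static double sumSquaredDeviations(double[] x) {
        double mean = 0.0;
        for (double v : x) {
            mean += v;
        }
        mean /= x.length;
        double ss = 0.0;
        for (double v : x) {
            ss += (v - mean) * (v - mean);
        }
        return ss;
    }

    // Sample standard deviation: denominator n - 1.
    static double sampleStdDev(double[] x) {
        if (x.length < 2) {
            throw new IllegalArgumentException("need at least two values");
        }
        return Math.sqrt(sumSquaredDeviations(x) / (x.length - 1));
    }

    // Population standard deviation: denominator n.
    static double populationStdDev(double[] x) {
        if (x.length == 0) {
            throw new IllegalArgumentException("need at least one value");
        }
        return Math.sqrt(sumSquaredDeviations(x) / x.length);
    }
}
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 the squared deviations sum to 32, so the population figure is sqrt(32/8) = 2 and the sample figure is sqrt(32/7), which is slightly larger.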
- calculateCoefficientOfVariation
  
  public double calculateCoefficientOfVariation(DataFrame dataFrame, String column)
  
  Calculates the coefficient of variation (CV) for a specified column. CV is the ratio of the standard deviation to the mean, often expressed as a percentage. It shows the extent of variability in relation to the mean of the population.
  
  Parameters:
  
  dataFrame - DataFrame containing the data
  
  column - Name of the column to analyze
  
  Returns:
  
  The coefficient of variation as a percentage
  
  Throws:
  
  IllegalArgumentException - if column contains non-numeric data or is empty
  
  NullPointerException - if dataFrame or column is null
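
As a worked illustration of the ratio described above, the following sketch computes CV as a percentage from a plain array. It assumes the sample standard deviation (n-1) is the numerator, which the entry above does not state, and it rejects a zero mean because the ratio is undefined there; `CvSketch` is a hypothetical name.

```java
import java.util.Arrays;

final class CvSketch {
    // Coefficient of variation in percent: (standard deviation / mean) * 100.
    static double coefficientOfVariationPercent(double[] x) {
        if (x.length < 2) {
            throw new IllegalArgumentException("need at least two values");
        }
        double mean = Arrays.stream(x).average().getAsDouble();
        if (mean == 0.0) {
            throw new ArithmeticException("CV is undefined for a zero mean");
        }
        double ss = Arrays.stream(x).map(v -> (v - mean) * (v - mean)).sum();
        double sd = Math.sqrt(ss / (x.length - 1)); // sample standard deviation (assumed)
        return sd / mean * 100.0;
    }
}
```

For the values 10, 20, 30 the mean is 20 and the sample standard deviation is 10, giving a CV of 50 percent.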
- getDispersionStatistics
  
  public Map<String,Double> getDispersionStatistics(DataFrame dataFrame, String column)
  
  Generates a summary of dispersion statistics for a specified column. Includes standard deviation, variance, coefficient of variation, and range.
  
  Parameters:
  
  dataFrame - DataFrame containing the data
  
  column - Name of the column to analyze
  
  Returns:
  
  Map containing various dispersion statistics
  
  Throws:
  
  IllegalArgumentException - if column contains non-numeric data or is empty
  
  NullPointerException - if dataFrame or column is null
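
The entry above lists the statistics in the summary but not the map keys. The sketch below shows one way such a map could be assembled from a plain array; the key names, the use of the sample (n-1) variance, and the class name `DispersionSketch` are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.LinkedHashMap;
import java.util.Map;

final class DispersionSketch {
    // Builds a summary map; the key names are hypothetical, not the library's.
    static Map<String, Double> dispersion(double[] x) {
        if (x.length < 2) {
            throw new IllegalArgumentException("need at least two values");
        }
        DoubleSummaryStatistics s = Arrays.stream(x).summaryStatistics();
        double mean = s.getAverage();
        double ss = 0.0;
        for (double v : x) {
            ss += (v - mean) * (v - mean);
        }
        double variance = ss / (x.length - 1); // sample variance (assumed)
        double sd = Math.sqrt(variance);
        Map<String, Double> out = new LinkedHashMap<>();
        out.put("standardDeviation", sd);
        out.put("variance", variance);
        out.put("coefficientOfVariation", sd / mean * 100.0); // not finite when mean is 0
        out.put("range", s.getMax() - s.getMin());
        return out;
    }
}
```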
- calculateMean
  
  public double calculateMean(DataFrame dataFrame, String column)
  
  Calculates the arithmetic mean (average) of numeric values in a specified column. The mean is calculated by summing all numeric values and dividing by the count of values. Non-numeric values and null values in the column are ignored during calculation.
  
  Parameters:
  
  dataFrame - the DataFrame containing the data to analyze. Must not be null.
  
  column - the name of the column to calculate mean for. Must not be null.
  
  Returns:
  
  the arithmetic mean of all numeric values in the column. Returns 0.0 if no numeric values are found.
  
  Throws:
  
  IllegalArgumentException - if the specified column does not exist in the DataFrame
  
  NullPointerException - if either dataFrame or column parameter is null
  
  Since:
  
  1.0
  
  See Also:
  
  DescriptiveStatistics.getMean()
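
The skipping behavior described for calculateMean can be sketched on a plain list as follows. Here "numeric" is taken to mean an instance of `java.lang.Number`, which is an assumption (for example, numeric strings are not parsed), and `MeanSketch` is a hypothetical name rather than part of this library.

```java
import java.util.List;

final class MeanSketch {
    // Mean of the Number elements only; nulls and other types are skipped.
    // Returns 0.0 when the list holds no numeric values.
    static double mean(List<?> values) {
        double sum = 0.0;
        int count = 0;
        for (Object v : values) {
            if (v instanceof Number) { // false for null, so nulls are skipped too
                sum += ((Number) v).doubleValue();
                count++;
            }
        }
        return count == 0 ? 0.0 : sum / count;
    }
}
```

Returning 0.0 for a column without numeric values means the result cannot be told apart from a genuine mean of zero, so callers may want to check the column contents first.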
