Class DataAnalysisService

java.lang.Object
com.leumanuel.woozydata.service.DataAnalysisService

public class DataAnalysisService extends Object
Service class for data analysis operations. Provides statistical calculations and data transformations.
  • Constructor Details

    • DataAnalysisService

      public DataAnalysisService()
  • Method Details

    • analyze

      public DataFrame analyze(DataFrame df, String column)
      Analyzes a specified column in the DataFrame and calculates basic statistical measures.
      Parameters:
      df - DataFrame containing the data to analyze
      column - Name of the column to analyze
      Returns:
      DataFrame containing analysis results including count, mean, std, min, max, median, skewness, and kurtosis
      Throws:
      IllegalArgumentException - if column contains non-numeric data
    • stats

      public Map<String,Double> stats(DataFrame df, String column)
      Calculates statistical measures for a specified column.
      Parameters:
      df - DataFrame containing the data
      column - Name of the column to analyze
      Returns:
      Map containing basic statistical measures (mean, std, min, max, median)
      Throws:
      IllegalArgumentException - if column contains non-numeric data
    • sum

      public double sum(DataFrame df, String column)
      Calculates the sum of all values in a specified column.
      Parameters:
      df - DataFrame containing the data
      column - Name of the column to sum
      Returns:
      Sum of all numeric values in the column
      Throws:
      IllegalArgumentException - if column contains non-numeric data
    • avg

      public double avg(DataFrame df, String column)
      Calculates the arithmetic mean (average) of values in a specified column.
      Parameters:
      df - DataFrame containing the data
      column - Name of the column to average
      Returns:
      Average of all numeric values in the column
      Throws:
      IllegalArgumentException - if column contains non-numeric data
    • correlation

      public DataFrame correlation(DataFrame df)
      Computes pairwise correlations between the numeric columns of the DataFrame.
      Parameters:
      df - DataFrame to analyze
      Returns:
      Correlation matrix as a DataFrame
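The entry above does not say which correlation coefficient is used. As a minimal sketch, assuming the Pearson coefficient and plain double[] columns in place of a DataFrame (the class and method names are hypothetical and not part of this library), a correlation matrix could be built like this:

```java
final class PearsonSketch {
    /** Pearson correlation of two equal-length samples (assumed coefficient, not confirmed by the API docs). */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two samples of equal length >= 2");
        }
        double meanX = 0, meanY = 0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;

        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        // A constant column has zero variance, so the result is NaN in that case.
        return sxy / Math.sqrt(sxx * syy);
    }

    /** Symmetric matrix whose entry [i][j] is the correlation between columns i and j. */
    static double[][] correlationMatrix(double[][] columns) {
        int k = columns.length;
        double[][] matrix = new double[k][k];
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < k; j++) {
                matrix[i][j] = pearson(columns[i], columns[j]);
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3};
        double[] b = {2, 4, 6};
        System.out.println(pearson(a, b)); // 1.0
    }
}
```

Each entry lies between -1 and 1, and the diagonal is 1 for any non-constant column.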
    • timeSeriesAnalysis

      public DataFrame timeSeriesAnalysis(DataFrame df, String timeColumn, String valueColumn, int windowSize)
      Performs rolling-window time series analysis of the values in valueColumn against the time values in timeColumn.
      Parameters:
      df - DataFrame with time series data
      timeColumn - Column containing time values
      valueColumn - Column containing values to analyze
      windowSize - Rolling window size
      Returns:
      DataFrame with analysis results
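The entry above does not list which statistics are computed per window. A minimal sketch of one common choice, a rolling mean, is shown here; it assumes the values are already ordered by time, and the class and method names are hypothetical rather than part of this library:

```java
import java.util.Arrays;

final class RollingMeanSketch {
    /** Mean of each full window of windowSize consecutive values (values assumed sorted by time). */
    static double[] rollingMean(double[] values, int windowSize) {
        if (windowSize < 1 || windowSize > values.length) {
            throw new IllegalArgumentException("windowSize must be between 1 and values.length");
        }
        double[] out = new double[values.length - windowSize + 1];
        double sum = 0;
        for (int i = 0; i < values.length; i++) {
            sum += values[i];
            if (i >= windowSize) {
                sum -= values[i - windowSize]; // drop the value that left the window
            }
            if (i >= windowSize - 1) {
                out[i - windowSize + 1] = sum / windowSize;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] series = {1, 2, 3, 4, 5};
        System.out.println(Arrays.toString(rollingMean(series, 3))); // [2.0, 3.0, 4.0]
    }
}
```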
    • calculateMedian

      public double calculateMedian(DataFrame dataFrame, String column)
      Calculates the median value of a specified column. The median is the value separating the higher half from the lower half of a data sample.
      Parameters:
      dataFrame - DataFrame containing the data
      column - Name of the column to analyze
      Returns:
      The median value of the specified column
      Throws:
      IllegalArgumentException - if column contains non-numeric data
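The median definition above can be sketched on a plain double[] as follows. This is an illustration of the definition, not the library's implementation; the class name is hypothetical, and the averaging of the two middle values for even-sized samples is the usual convention, assumed here:

```java
import java.util.Arrays;

final class MedianSketch {
    /** Middle value of the sorted sample; average of the two middle values when the size is even. */
    static double median(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("empty sample");
        }
        double[] sorted = values.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    public static void main(String[] args) {
        System.out.println(median(new double[]{7, 1, 3}));    // 3.0
        System.out.println(median(new double[]{4, 1, 3, 2})); // 2.5
    }
}
```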
    • calculateVariance

      public double calculateVariance(DataFrame dataFrame, String column)
      Calculates the variance of a specified column. Variance measures how far a set of numbers is spread out from its average value.
      Parameters:
      dataFrame - DataFrame containing the data
      column - Name of the column to analyze
      Returns:
      The variance of the specified column
      Throws:
      IllegalArgumentException - if column contains non-numeric data
    • calculateMode

      public Object calculateMode(DataFrame dataFrame, String column)
      Calculates the mode (most frequent value) of a specified column. If multiple modes exist, returns the first one found. Works with both numeric and non-numeric data.
      Parameters:
      dataFrame - DataFrame containing the data
      column - Name of the column to analyze
      Returns:
      The mode value of the specified column
      Throws:
      IllegalArgumentException - if the column is empty or contains only null values
    • calculateMultipleMode

      public List<Object> calculateMultipleMode(DataFrame dataFrame, String column)
      Calculates multiple modes if they exist in the specified column. Returns all values that appear with the highest frequency.
      Parameters:
      dataFrame - DataFrame containing the data
      column - Name of the column to analyze
      Returns:
      List of mode values
      Throws:
      IllegalArgumentException - if the column is empty or contains only null values
    • getFrequencyDistribution

      public Map<Object,Long> getFrequencyDistribution(DataFrame dataFrame, String column)
      Gets the frequency distribution of values in a specified column.
      Parameters:
      dataFrame - DataFrame containing the data
      column - Name of the column to analyze
      Returns:
      Map with values and their frequencies
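The relationship between the frequency distribution, the single mode, and the multiple modes described above can be sketched on a plain List. The names are hypothetical, and two behaviors are assumptions rather than documented facts: null values are skipped, and "first one found" is taken to mean first in encounter order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FrequencySketch {
    /** Counts each non-null value, preserving first-seen order (assumed behavior). */
    static Map<Object, Long> frequencies(List<?> values) {
        Map<Object, Long> counts = new LinkedHashMap<>();
        for (Object v : values) {
            if (v != null) {
                counts.merge(v, 1L, Long::sum);
            }
        }
        return counts;
    }

    /** All values that share the highest frequency, in first-seen order. */
    static List<Object> modes(List<?> values) {
        Map<Object, Long> counts = frequencies(values);
        if (counts.isEmpty()) {
            throw new IllegalArgumentException("column is empty or contains only null values");
        }
        long max = Collections.max(counts.values());
        List<Object> result = new ArrayList<>();
        for (Map.Entry<Object, Long> e : counts.entrySet()) {
            if (e.getValue() == max) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> colors = Arrays.asList("red", "blue", "red", null, "blue", "green");
        System.out.println(frequencies(colors));  // {red=2, blue=2, green=1}
        System.out.println(modes(colors));        // [red, blue]
        System.out.println(modes(colors).get(0)); // red
    }
}
```

Under these assumptions, the single mode is simply the first element of the multiple-mode list.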
    • calculateStandardDeviation

      public double calculateStandardDeviation(DataFrame dataFrame, String column)
      Calculates the standard deviation of a specified column. Standard deviation measures the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean, while a high standard deviation indicates that the values are spread out over a wider range.
      Parameters:
      dataFrame - DataFrame containing the data
      column - Name of the column to analyze
      Returns:
      The standard deviation of the specified column
      Throws:
      IllegalArgumentException - if column contains non-numeric data or is empty
      NullPointerException - if dataFrame or column is null
    • calculatePopulationStandardDeviation

      public double calculatePopulationStandardDeviation(DataFrame dataFrame, String column)
      Calculates the population standard deviation of a specified column. Unlike the sample standard deviation, it divides by n rather than (n-1). Use this when the data covers a complete population rather than a sample.
      Parameters:
      dataFrame - DataFrame containing the data
      column - Name of the column to analyze
      Returns:
      The population standard deviation of the specified column
      Throws:
      IllegalArgumentException - if column contains non-numeric data or is empty
      NullPointerException - if dataFrame or column is null
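The difference between the n and (n-1) denominators described above can be seen in a short sketch on a plain double[]. The class and method names are hypothetical and this is not the library's implementation:

```java
final class StdDevSketch {
    /** Standard deviation; divides by (n-1) when sample is true, by n otherwise. */
    static double stdDev(double[] values, boolean sample) {
        int n = values.length;
        if (n == 0 || (sample && n < 2)) {
            throw new IllegalArgumentException("not enough values");
        }
        double mean = 0;
        for (double v : values) {
            mean += v;
        }
        mean /= n;

        double sumSquares = 0;
        for (double v : values) {
            sumSquares += (v - mean) * (v - mean);
        }
        return Math.sqrt(sumSquares / (sample ? n - 1 : n));
    }

    public static void main(String[] args) {
        double[] data = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println(stdDev(data, false)); // population: 2.0
        System.out.println(stdDev(data, true));  // sample: slightly larger, about 2.138
    }
}
```

The sample value is always at least as large as the population value, and the gap shrinks as n grows.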
    • calculateCoefficientOfVariation

      public double calculateCoefficientOfVariation(DataFrame dataFrame, String column)
      Calculates the coefficient of variation (CV) for a specified column. CV is the ratio of the standard deviation to the mean, often expressed as a percentage. It shows the extent of variability in relation to the mean of the population.
      Parameters:
      dataFrame - DataFrame containing the data
      column - Name of the column to analyze
      Returns:
      The coefficient of variation as a percentage
      Throws:
      IllegalArgumentException - if column contains non-numeric data or is empty
      NullPointerException - if dataFrame or column is null
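As a worked illustration of the ratio described above, the following sketch computes the CV as a percentage from a plain double[]. It assumes the sample standard deviation (the entry does not say which variant is used) and rejects a zero mean, for which the ratio is undefined; the names are hypothetical:

```java
final class CvSketch {
    /** Coefficient of variation in percent: (sample standard deviation / mean) * 100. */
    static double coefficientOfVariation(double[] values) {
        int n = values.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two values");
        }
        double mean = 0;
        for (double v : values) {
            mean += v;
        }
        mean /= n;
        if (mean == 0.0) {
            throw new IllegalArgumentException("CV is undefined when the mean is zero");
        }
        double sumSquares = 0;
        for (double v : values) {
            sumSquares += (v - mean) * (v - mean);
        }
        double std = Math.sqrt(sumSquares / (n - 1)); // sample standard deviation (assumed)
        return std / mean * 100.0;
    }

    public static void main(String[] args) {
        // mean = 20, sample standard deviation = 10
        System.out.println(coefficientOfVariation(new double[]{10, 20, 30})); // 50.0
    }
}
```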
    • getDispersionStatistics

      public Map<String,Double> getDispersionStatistics(DataFrame dataFrame, String column)
      Generates a summary of dispersion statistics for a specified column. Includes standard deviation, variance, coefficient of variation, and range.
      Parameters:
      dataFrame - DataFrame containing the data
      column - Name of the column to analyze
      Returns:
      Map containing various dispersion statistics
      Throws:
      IllegalArgumentException - if column contains non-numeric data or is empty
      NullPointerException - if dataFrame or column is null
    • calculateMean

      public double calculateMean(DataFrame dataFrame, String column)
      Calculates the arithmetic mean (average) of numeric values in a specified column. The mean is calculated by summing all numeric values and dividing by the count of values. Non-numeric values and null values in the column are ignored during calculation.
      Parameters:
      dataFrame - the DataFrame containing the data to analyze. Must not be null.
      column - the name of the column to calculate mean for. Must not be null.
      Returns:
      the arithmetic mean of all numeric values in the column. Returns 0.0 if no numeric values are found.
      Throws:
      IllegalArgumentException - if the specified column does not exist in the DataFrame
      NullPointerException - if either dataFrame or column parameter is null
      Since:
      1.0
      See Also:
      • DescriptiveStatistics.getMean()
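The rules documented for calculateMean (skip null and non-numeric values, return 0.0 when nothing numeric remains) can be sketched on a plain List. This is an illustration of the documented contract with hypothetical names, not the library's code:

```java
import java.util.Arrays;
import java.util.List;

final class MeanSketch {
    /** Mean of the Number entries; null and non-numeric entries are ignored. */
    static double mean(List<?> values) {
        double sum = 0;
        long count = 0;
        for (Object v : values) {
            if (v instanceof Number) { // false for null, so nulls are skipped too
                sum += ((Number) v).doubleValue();
                count++;
            }
        }
        return count == 0 ? 0.0 : sum / count;
    }

    public static void main(String[] args) {
        System.out.println(mean(Arrays.<Object>asList(1, "x", null, 3.0))); // 2.0
        System.out.println(mean(Arrays.<Object>asList("a", null)));         // 0.0
    }
}
```

Because an all-non-numeric column also yields 0.0, callers that need to tell "no data" apart from a true mean of zero would have to check the column separately.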