Package com.leumanuel.woozydata.service
Class DataAnalysisService
java.lang.Object
com.leumanuel.woozydata.service.DataAnalysisService
Service class for data analysis operations.
Provides statistical calculations and data transformations.
-
Constructor Summary
Constructors
DataAnalysisService()
Method Summary
Modifier and Type, Method, Description
analyze(df, column) - Analyzes a specified column in the DataFrame and calculates basic statistical measures.
double avg(df, column) - Calculates the arithmetic mean (average) of values in a specified column.
double calculateCoefficientOfVariation(DataFrame dataFrame, String column) - Calculates the coefficient of variation (CV) for a specified column.
double calculateMean(DataFrame dataFrame, String column) - Calculates the arithmetic mean (average) of numeric values in a specified column.
double calculateMedian(DataFrame dataFrame, String column) - Calculates the median value of a specified column.
calculateMode(DataFrame dataFrame, String column) - Calculates the mode (most frequent value) of a specified column.
calculateMultipleMode(DataFrame dataFrame, String column) - Calculates multiple modes if they exist in the specified column.
double calculatePopulationStandardDeviation(DataFrame dataFrame, String column) - Calculates the population standard deviation of a specified column.
double calculateStandardDeviation(DataFrame dataFrame, String column) - Calculates the standard deviation of a specified column.
double calculateVariance(DataFrame dataFrame, String column) - Calculates the variance of a specified column.
correlation(DataFrame df) - Performs correlation analysis between numeric columns.
getDispersionStatistics(DataFrame dataFrame, String column) - Generates a summary of dispersion statistics for a specified column.
getFrequencyDistribution(DataFrame dataFrame, String column) - Gets the frequency distribution of values in a specified column.
stats(df, column) - Calculates statistical measures for a specified column.
double sum(df, column) - Calculates the sum of all values in a specified column.
timeSeriesAnalysis(DataFrame df, String timeColumn, String valueColumn, int windowSize) - Performs time series analysis.
-
Constructor Details
-
DataAnalysisService
public DataAnalysisService()
-
Method Details
-
analyze
Analyzes a specified column in the DataFrame and calculates basic statistical measures.
Parameters:
    df - DataFrame containing the data to analyze
    column - Name of the column to analyze
Returns:
    DataFrame containing analysis results including count, mean, std, min, max, median, skewness, and kurtosis
Throws:
    IllegalArgumentException - if column contains non-numeric data
-
stats
Calculates statistical measures for a specified column.
Parameters:
    df - DataFrame containing the data
    column - Name of the column to analyze
Returns:
    Map containing basic statistical measures (mean, std, min, max, median)
Throws:
    IllegalArgumentException - if column contains non-numeric data
-
sum
Calculates the sum of all values in a specified column.
Parameters:
    df - DataFrame containing the data
    column - Name of the column to sum
Returns:
    Sum of all numeric values in the column
Throws:
    IllegalArgumentException - if column contains non-numeric data
-
avg
Calculates the arithmetic mean (average) of values in a specified column.
Parameters:
    df - DataFrame containing the data
    column - Name of the column to average
Returns:
    Average of all numeric values in the column
Throws:
    IllegalArgumentException - if column contains non-numeric data
-
correlation
Performs correlation analysis between numeric columns.
Parameters:
    df - DataFrame to analyze
Returns:
    Correlation matrix as DataFrame
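The page does not say which correlation coefficient is used. As a minimal sketch, assuming the Pearson coefficient and plain double arrays in place of DataFrame columns, a correlation matrix could be built as follows. The class and method names (PearsonSketch, pearson, correlationMatrix) are hypothetical and are not part of DataAnalysisService.

```java
final class PearsonSketch {
    /** Pearson correlation coefficient of two equal-length samples (assumed measure). */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two samples of equal length >= 2");
        }
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;

        double cov = 0.0, varX = 0.0, varY = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        // A constant column has zero variance, so the result is NaN in that case.
        return cov / Math.sqrt(varX * varY);
    }

    /** Symmetric matrix of pairwise coefficients; columns[i] holds the values of column i. */
    static double[][] correlationMatrix(double[][] columns) {
        int n = columns.length;
        double[][] matrix = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                double r = pearson(columns[i], columns[j]);
                matrix[i][j] = r;
                matrix[j][i] = r;
            }
        }
        return matrix;
    }
}
```

Each cell [i][j] holds the coefficient for columns i and j, so the diagonal is 1 for any non-constant column.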
-
timeSeriesAnalysis
public DataFrame timeSeriesAnalysis(DataFrame df, String timeColumn, String valueColumn, int windowSize)
Performs time series analysis.
Parameters:
    df - DataFrame with time series data
    timeColumn - Column containing time values
    valueColumn - Column containing values to analyze
    windowSize - Rolling window size
Returns:
    DataFrame with analysis results
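The page does not list which measures the analysis returns. To illustrate what a rolling window of size windowSize means, here is a minimal sketch of a rolling mean over values that are assumed to be already ordered by the time column. RollingMeanSketch and rollingMean are hypothetical names; the statistics that timeSeriesAnalysis actually computes may differ.

```java
final class RollingMeanSketch {
    /**
     * Mean of every full window of windowSize consecutive values.
     * Output element k covers input indices k .. k + windowSize - 1.
     */
    static double[] rollingMean(double[] values, int windowSize) {
        if (windowSize <= 0 || windowSize > values.length) {
            throw new IllegalArgumentException("windowSize must be in 1..values.length");
        }
        double[] means = new double[values.length - windowSize + 1];
        double windowSum = 0.0;
        for (int i = 0; i < values.length; i++) {
            windowSum += values[i];
            if (i >= windowSize) {
                // Drop the value that just slid out of the window.
                windowSum -= values[i - windowSize];
            }
            if (i >= windowSize - 1) {
                means[i - windowSize + 1] = windowSum / windowSize;
            }
        }
        return means;
    }
}
```

With values 1, 2, 3, 4, 5 and a window of 3, the windows are (1, 2, 3), (2, 3, 4), and (3, 4, 5), giving means 2, 3, and 4.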
-
calculateMedian
Calculates the median value of a specified column. The median is the value separating the higher half from the lower half of a data sample.
Parameters:
    dataFrame - DataFrame containing the data
    column - Name of the column to analyze
Returns:
    The median value of the specified column
Throws:
    IllegalArgumentException - if column contains non-numeric data
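The definition above can be sketched on a plain array. This is an illustration of the statistic, not the library's implementation; MedianSketch is a hypothetical name, and averaging the two middle values for an even count is the usual convention, assumed here.

```java
import java.util.Arrays;

final class MedianSketch {
    /** Middle value of the sorted data, or the average of the two middle values. */
    static double median(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        double[] sorted = values.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        if (sorted.length % 2 == 1) {
            return sorted[mid];
        }
        return (sorted[mid - 1] + sorted[mid]) / 2.0;
    }
}
```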
-
calculateVariance
Calculates the variance of a specified column. Variance measures how far a set of numbers is spread out from its average value.
Parameters:
    dataFrame - DataFrame containing the data
    column - Name of the column to analyze
Returns:
    The variance of the specified column
Throws:
    IllegalArgumentException - if column contains non-numeric data
-
calculateMode
Calculates the mode (most frequent value) of a specified column. If multiple modes exist, returns the first one found. Works with both numeric and non-numeric data.
Parameters:
    dataFrame - DataFrame containing the data
    column - Name of the column to analyze
Returns:
    The mode value of the specified column
Throws:
    IllegalArgumentException - if the column is empty or contains only null values
-
calculateMultipleMode
Calculates multiple modes if they exist in the specified column. Returns all values that appear with the highest frequency.
Parameters:
    dataFrame - DataFrame containing the data
    column - Name of the column to analyze
Returns:
    List of mode values
Throws:
    IllegalArgumentException - if the column is empty or contains only null values
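The relationship between a single mode and multiple modes can be sketched with a frequency count over a plain list. This is a sketch under assumptions: ModeSketch and its methods are hypothetical names, null values are skipped, and "first one found" is taken to mean first in encounter order, which the page does not confirm.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ModeSketch {
    /** Counts occurrences of each non-null value, keeping first-seen order. */
    static Map<Object, Integer> frequencies(List<?> values) {
        Map<Object, Integer> counts = new LinkedHashMap<>();
        for (Object value : values) {
            if (value != null) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** All values that share the highest frequency. */
    static List<Object> modes(List<?> values) {
        Map<Object, Integer> counts = frequencies(values);
        if (counts.isEmpty()) {
            throw new IllegalArgumentException("no non-null values");
        }
        int highest = Collections.max(counts.values());
        List<Object> result = new ArrayList<>();
        for (Map.Entry<Object, Integer> entry : counts.entrySet()) {
            if (entry.getValue() == highest) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    /** The first mode in encounter order (assumed tie-breaking rule). */
    static Object mode(List<?> values) {
        return modes(values).get(0);
    }
}
```

Because the count works on Object keys, the same code handles strings and numbers, which matches the note that the mode works with both numeric and non-numeric data.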
-
getFrequencyDistribution
Gets the frequency distribution of values in a specified column.
Parameters:
    dataFrame - DataFrame containing the data
    column - Name of the column to analyze
Returns:
    Map with values and their frequencies
-
calculateStandardDeviation
Calculates the standard deviation of a specified column. Standard deviation measures the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean, while a high standard deviation indicates that the values are spread out over a wider range.
Parameters:
    dataFrame - DataFrame containing the data
    column - Name of the column to analyze
Returns:
    The standard deviation of the specified column
Throws:
    IllegalArgumentException - if column contains non-numeric data or is empty
    NullPointerException - if dataFrame or column is null
-
calculatePopulationStandardDeviation
Calculates the population standard deviation of a specified column. Similar to sample standard deviation but uses n instead of (n-1) in the denominator. Use this when working with complete populations rather than samples.
Parameters:
    dataFrame - DataFrame containing the data
    column - Name of the column to analyze
Returns:
    The population standard deviation of the specified column
Throws:
    IllegalArgumentException - if column contains non-numeric data or is empty
    NullPointerException - if dataFrame or column is null
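The difference between the two denominators can be shown side by side. The following is a minimal sketch on a plain array, not the library's code; StdDevSketch and its method names are hypothetical.

```java
final class StdDevSketch {
    /** Sum of squared deviations from the mean. */
    private static double sumSquaredDeviations(double[] values) {
        double mean = 0.0;
        for (double v : values) {
            mean += v;
        }
        mean /= values.length;
        double total = 0.0;
        for (double v : values) {
            total += (v - mean) * (v - mean);
        }
        return total;
    }

    /** Sample standard deviation: divides by (n - 1). */
    static double sampleStdDev(double[] values) {
        if (values.length < 2) {
            throw new IllegalArgumentException("need at least two values");
        }
        return Math.sqrt(sumSquaredDeviations(values) / (values.length - 1));
    }

    /** Population standard deviation: divides by n. */
    static double populationStdDev(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        return Math.sqrt(sumSquaredDeviations(values) / values.length);
    }
}
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and the squared deviations sum to 32, so the population result is sqrt(32 / 8) = 2 while the sample result is sqrt(32 / 7), roughly 2.138.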
-
calculateCoefficientOfVariation
Calculates the coefficient of variation (CV) for a specified column. CV is the ratio of the standard deviation to the mean, often expressed as a percentage. It shows the extent of variability in relation to the mean of the population.
Parameters:
    dataFrame - DataFrame containing the data
    column - Name of the column to analyze
Returns:
    The coefficient of variation as a percentage
Throws:
    IllegalArgumentException - if column contains non-numeric data or is empty
    NullPointerException - if dataFrame or column is null
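The ratio described above can be sketched as follows. The page does not state whether the sample or the population standard deviation is used; this sketch assumes the sample form with (n - 1), and the zero-mean check is likewise an assumption. CvSketch is a hypothetical name.

```java
final class CvSketch {
    /** Coefficient of variation as a percentage: (standard deviation / mean) * 100. */
    static double coefficientOfVariation(double[] values) {
        if (values.length < 2) {
            throw new IllegalArgumentException("need at least two values");
        }
        double mean = 0.0;
        for (double v : values) {
            mean += v;
        }
        mean /= values.length;
        if (mean == 0.0) {
            // The ratio is undefined when the mean is zero.
            throw new IllegalArgumentException("mean is zero");
        }
        double squared = 0.0;
        for (double v : values) {
            squared += (v - mean) * (v - mean);
        }
        // Assumption: sample standard deviation, dividing by (n - 1).
        double stdDev = Math.sqrt(squared / (values.length - 1));
        return stdDev / mean * 100.0;
    }
}
```

Because CV is scaled by the mean, it allows the spread of columns measured in different units or magnitudes to be compared.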
-
getDispersionStatistics
Generates a summary of dispersion statistics for a specified column. Includes standard deviation, variance, coefficient of variation, and range.
Parameters:
    dataFrame - DataFrame containing the data
    column - Name of the column to analyze
Returns:
    Map containing various dispersion statistics
Throws:
    IllegalArgumentException - if column contains non-numeric data or is empty
    NullPointerException - if dataFrame or column is null
-
calculateMean
Calculates the arithmetic mean (average) of numeric values in a specified column. The mean is calculated by summing all numeric values and dividing by the count of values. Non-numeric values and null values in the column are ignored during calculation.
Parameters:
    dataFrame - the DataFrame containing the data to analyze. Must not be null.
    column - the name of the column to calculate mean for. Must not be null.
Returns:
    the arithmetic mean of all numeric values in the column. Returns 0.0 if no numeric values are found.
Throws:
    IllegalArgumentException - if the specified column does not exist in the DataFrame
    NullPointerException - if either dataFrame or column parameter is null
Since:
    1.0
See Also:
    DescriptiveStatistics.getMean()
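The skipping behavior described for calculateMean can be illustrated on a plain list. This is a sketch under assumptions rather than the library's code: MeanSketch and meanOfNumeric are hypothetical names, and "numeric" is taken to mean instances of java.lang.Number, so numeric-looking strings are not parsed here.

```java
import java.util.List;

final class MeanSketch {
    /** Mean of the Number entries; nulls and other types are skipped. */
    static double meanOfNumeric(List<?> values) {
        double sum = 0.0;
        int count = 0;
        for (Object value : values) {
            // instanceof is false for null, so nulls are skipped as well.
            if (value instanceof Number) {
                sum += ((Number) value).doubleValue();
                count++;
            }
        }
        // Mirrors the documented fallback of 0.0 when nothing numeric is found.
        return count == 0 ? 0.0 : sum / count;
    }
}
```

Note that the divisor is the count of numeric entries, not the column length, so skipped entries do not pull the mean toward zero.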
-