Analytics Framework Reporting DSL
Overview
The Analytics DSL interpreter is provided as a parameterized Spark job that interprets the DSL resources found on the classpath. For every interpreted report, a matching table is created in Hive with the report's output.
Report attribute expressions used to set values in a single table must evaluate to primitive values. Non-primitive or multi-cardinality values can be saved by defining child tables.
The reporting DSL also supports partitioning strategies (see com.eisgroup.genesis.report.partitioning.PartitioningStrategy).
The DSL works in batch, realtime, or checksum mode. Realtime reports provide at-least-once fault-tolerance semantics.
DSL
The reports are defined using a custom DSL grammar. The filenames must follow the pattern reportName.greport; the filename is used to find the matching report DSL at run-time. The reports must be stored in the reports folder on the classpath. Multiple reports can be defined in a single .greport file; however, all of the reports in the same file must be defined for the same models. It is currently possible to define the following using the DSL:
- Domain models for which the reports should be generated
- The expressions for report columns
- Top-level filters for filtering out the root data
- Attribute level filters for filtering out inner entities
- Joins with other tables based on expressions
- Children tables
- Joins using RDF relationships or external links
- Kafka event sources for initial data
- Versioning configuration for report view creation
- Variations for which the report is applicable
It is currently assumed that the module containing the report DSL files is compiled with the analytics-fgenerator plugin enabled and that dependencies on the models referenced in the DSL are added.
Syntax
The syntax for the DSL is as follows:
Import Model {ImportedModelName}
Report {ReportName} {
Partitioning {
{PartitioningColumns}
}
Versioning {
{AggregatorGroup}
{KeyGroup}
}
Custom from {CustomTableName} {
Timestamp {TimestampAttributeName}
}
Modeled using {ModelName}
Variations {
{VariationName}
}
Checksum behavior {ChecksumBehavior}
}
Stream {ReportName} from {StreamPath} in {StreamName},
Batch {ReportName}[{GlobalFilterName}] {
@{FeatureName}({FeatureValue})
Attr {AttributeName}: {AttributeType} is {AttributeExpression}
Filter {AttributeFilterName} {
{AttributeFilterExpression}
}
Ref {ChildTableName} is {MultipleAttributesExpression} {
{RefBody}
}
Join {TableName} using {
{ChildColumn} eq/== {ParentColumn}
}
// Join by RDF relationships
Join {TableName} using Relationships {
{SubjectOrObject} eq/== this
predicate eq/== '{PredicateName}'
}
// Join by ModeledLink
Join {TableName} using ModeledLink {
linkAttribute eq/== {ModeledLinkAttributeName}
}
// Join by InternalLink
Join {TypeName} using InternalLink {
linkAttribute eq/== {InternalLinkAttributeName}
}
// Join by SecurityLink
Join {TableName} using SecurityLink {
linkAttribute eq/== {SecurityLinkAttributeName}
}
}
Filter {GlobalFilterName} {
{GlobalFilterExpression}
}
with the following placeholders used in the curly brackets:
Placeholder ({..}) | Description |
---|---|
ImportedModelName | The name of the model to be imported. When it is used as the root for a report, the interpreter queries the specified model and finds the applicable table in Cassandra. All variations are queried. |
ReportName | The name of the actual report. |
PartitioningColumns | Optional comma-separated list of partitioning columns to be used for the report (e.g. year, weekofyear, modelname). See the Configuring Report Partitions in DSL section below for more information. |
AggregatorGroup | A comma-separated collection of Max/Min(colName1, colName2, ...colNameN) groups that specify the aggregation columns for the report view. |
KeyGroup | A comma-separated collection of Key(attrName1, attrName2, ...attrNameN) groups that specify the keys of the report entities. Do note that the keys must be present in all of the entities. |
CustomTableName | Schema name and table name of the Cassandra table to be used as the root for the report. The interpreter queries the specified table in Cassandra. Should be specified only if the report is generated for non-modeled entities. |
TimestampAttributeName | An optional parameter that specifies the root entity's attribute containing the date to be used for partitioning and filtering the initial set of data. Can be set for custom data sources only, as all modeled entities contain the _timestamp attribute. |
ModelName | The name of the model from which the report should be generated. |
VariationName | The name of the variation for which the data should be queried. Should be specified if the report is for modeled entities and the model has more than one variation. |
ChecksumBehavior | Specifies the default behavior for all of the attributes when calculating the checksum. Possible values are: include, exclude. Set to include by default. The default behavior can be overridden by applying features on the attributes. |
StreamPath | The name and path of the stream event to subscribe to. Global filters are supported anywhere in the path, for example: CommandExecutedEvent.output[MatchesCriteria]. |
StreamName | The stream name to subscribe to. |
GlobalFilterName | An optional parameter that specifies the filter name to apply on the root entities |
FeatureName | The name of the feature to apply on the attribute |
FeatureValue | The value of the feature |
AttributeName | The name of the resulting attribute |
AttributeType | Optional, the resulting type of the attribute, can be either of: Date, DateTime, Number, String. If no type is specified, String is used by default. |
AttributeExpression | The expression to resolve the value for the attribute. If no expression is defined, the interpreter tries to get the value from the root entity by the name of the attribute |
AttributeFilterName | The name of the filter that can be used to filter out child entities. Do note that the attribute filters are optional. |
AttributeFilterExpression | The expression for the attribute filter |
ChildTableName | Name of the child table to be created from a non-primitive or multi-cardinality attribute of the main table. |
MultipleAttributesExpression | Comma-separated list of AttributeExpressions. Be aware that every expression must be resolvable from the place where it is defined. |
RefBody | The body of the child table definition, can contain attribute expressions, filters, joins or children. |
TableName | The name of the table to join with. The interpreter automatically assumes the schema based on the root entity for which the report is generated. Do note that the joins are optional. |
ChildColumn | The name of the child column to join on |
ParentColumn | The name of the parent attribute to join on |
SubjectOrObject | "subject" or "object". Used for join by RDF relationships. If it is "subject" - interpreter looks for related "object" to be joined with the root table (and vice-versa). |
PredicateName | Name of the predicate that is used in RDF relationship. |
ModeledLinkAttributeName | Name of the attribute that holds value of modeled link. |
SecurityLinkAttributeName | Name of the attribute that holds value of security link. |
GlobalFilterName | The name of the root entity filter. Do note that the global filters are optional. |
GlobalFilterExpression | The global filter expression. |
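None of the examples below show a Versioning block, so here is a minimal sketch of one, based on the syntax template above and mirroring the default view behavior (distinct rootId, max revisionNo) described in the Schemas section below. The report name is hypothetical:
Import Model PersonalAuto
Report SomeVersionedReport {
    Versioning {
        Max(revisionNo)
        Key(rootId)
    }
    Modeled using PersonalAuto
}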
Attribute Expressions
Attribute expressions are used to define the actual output attributes for the report.
Currently, the following attribute operations are supported by the interpreter:
- Arithmetic operations that require the resolved attributes to be
numbers:
- Multiplication
- Division
- Addition
- Subtraction
- Aggregation operations that require the resolved attributes to be
collections:
- Sum
- Avg
- Min
- Max
- Count
- First
- Last
- Reduce - reduces the values of a ReportPrimitive collection to a single StringPrimitive, separating the values with the provided delimiter
- Special operations:
- Path - used to resolve the actual value for the subsequent operations. Collections are automatically exploded if found in path
- Precedence - used to denote some operation precedence using brackets
- ExtLinkAsType - resolves the entity type from the link attribute
- ExtLinkAsId - resolves the entity id from the link attribute
- ExtLinkAsRevision - resolves revision number (if any) from the link attribute
- Cast operations:
- (operation) as castType - used to implicitly convert the inner operation to the specified type. All attribute types are supported
- Logical operations:
- ?: - the Elvis operator that selects the left operand if it is not null, and the right operand otherwise
Attr mult is foo * bar
Attr div is foo / bar
Attr add is foo + bar
Attr rem is foo - bar
Attr sum is Sum(foo)
Attr avg is Avg(foo)
Attr min is Min(foo)
Attr max is Max(foo)
Attr count is Count(foo)
Attr first is First(foo)
Attr path is foo.bar[filter].baz
Attr precedence is (foo + bar) * bar
Attr linkType is ExtLinkAsType(foo)
Attr linkId is ExtLinkAsId(foo)
Attr linkRevision is ExtLinkAsRevision(foo)
Attr last is Last(foo)
Attr reduced is Reduce(foo, ",")
Attr casted is (foo) as Date
Attr elvis is foo ?: bar
Attr elvisDate is (foo ?: bar) as Date
The attribute types are automatically inferred from the context. It is possible to mix and match operations to create complex attribute expressions.
Arithmetic operations are not allowed on dates
Filter Expressions
Attribute Filter Expressions
Attribute filter expressions are used to define filters for path operations. It is implicitly assumed that the attribute on which the filter is applied will be an object or an array of objects.
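As an illustration, here is a minimal sketch of an attribute filter applied inside a path expression (the vehicles and state attribute names are hypothetical, and the filter is declared alongside the attribute that uses it):
Filter ActiveOnly {
    state == "active"
}
Attr activeVehicleCount is Count(vehicles[ActiveOnly])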
Global Filter Expressions
Global filter expressions are used to define filters for the root entity's attributes. Be aware that in batch reports global filter expressions are translated into an expression in the Cassandra WHERE clause. This means that a global filter expression can contain only operators supported by both the DSL and Cassandra; for instance, != cannot be used because Cassandra does not support it.
Patterns
The pattern for both filter expression types is as follows:
{AttributeName} {Operation} {Value}
with the following {..} placeholder values used:
Placeholder ({..}) | Description |
---|---|
AttributeName | The name of the context object's attribute |
Operation | The filter operation type; possible values include eq/==, ne/!=, in, <, <=, >, >= (see the example below) |
Value | The value to compare against |
Example:
Filter Global {
foo eq "bar"
foo == "bar"
bar ne "bar"
bar != "bar"
bar in ("bar", "baz", 'qux')
baz < 5
qux <= 5.5
quux > 5
quuux >= 5.5
}
Features
The following features are supported by the analytics DSL interpreter:
Feature Name | Feature Value | Description |
---|---|---|
Description | The description text | The custom description to set for the attribute in the DSL schema information table. |
ChecksumInclude | - | Specifies that this attribute should be included into a checksum when the default checksum behavior is set to exclude |
ChecksumExclude | - | Specifies that this attribute should be excluded from checksum when the default checksum behavior is set to include |
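As a sketch, features are applied directly above an attribute declaration. The attribute names are hypothetical, and the exact form of value-less features such as ChecksumExclude (written here without arguments) is an assumption:
@Description("The quote number assigned to the quote")
Attr quoteNumber is policyNumber
// Assumed form: excluded from the checksum when the default checksum behavior is include
@ChecksumExclude
Attr quoteStatus is state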
DSL Example
The following example contains a fully functional DSL that creates a report for personal auto quote premiums:
Import Model PersonalAuto
Report QuotePremiums {
Modeled using PersonalAuto
Variations {
quote
}
}
Stream QuotePremiums from CommandExecutedEvent.output[IsRatedQuote] in GEvents_PersonalAuto,
Batch QuotePremiums[IsRatedQuote] {
// Attribute declarations
Attr quoteNumber is policyNumber
Attr quoteStatus is state
Attr quoteInitiationDate: DateTime is accessTrackInfo.createdOn
Attr quoteTransactionType is transactionDetails.txType
Attr termPremium is Sum(PremiumComposite.premiumAggregates.premiumEntries[IsGWP].termAmount.amount)
// Attribute filter predicates
Filter IsGWP {
premiumCode == "GWP"
}
// Join-table statements
Join PremiumComposite
}
Filter IsRatedQuote {
state in ("rated", "bound", "proposed")
}
Root Level Join
Simple join statements within the same model (by "rootId" and "revisionNo") can be written more concisely, for example:
Join PremiumAggregate using {
rootId == rootId
revisionNo == revisionNo
}
Is equivalent to:
Join PremiumAggregate
Child Level Join
If a join is defined within a Ref:
Ref LineOfBusiness is AutoLOB {
Attr numberOfRiskItems is Count(AutoVehicle)
// Join-table statements
Join AutoVehicle
}
Then it is equivalent to:
Ref LineOfBusiness is AutoLOB {
Attr numberOfRiskItems is Count(AutoVehicle)
// Join-table statements
Join AutoVehicle using {
rootId == rootId
revisionNo == revisionNo
parentId == id
}
}
Optimizing the Number of Child Tables
The DSL creates a new Hive table for every Ref, which in complex reports leads to a huge number of tables with virtually the same structure. For instance, if the report has 10 places where the AutoAddressInfo entity is used in a Ref, there will be 10 different tables with address info. To avoid such duplication and make report generation faster, it is recommended in such cases to create a single Ref with multiple attribute expressions, like:
Ref AutoAddressInfo is
AutoVehicle.garagingAddress,
AutoVehicle.additionalInterests.address,
parties.personInfo.addressInfo,
prefillInfo.address,
billingInfo.address {
...
}
For such a definition, a single AutoAddressInfo table will be created with the information for the specified attributes.
Still, this feature has a limitation: it should be used only for modeled entities to ensure that parentId is set correctly. All modeled entities have a _key.parentId attribute that correctly points to the parent of the entity, and this attribute is used by the DSL. Non-modeled entities have no such attribute, so parentId is set to the id of the entity the Ref is defined in. For the example above, if the address is not a modeled entity and the Ref is defined inside Policy, all rows in the AutoAddressInfo table will have parentId == Policy.id despite the fact that they actually have different parents and the Policy is not the parent of any of them.
Schemas
There are three types of schemas generated for the report output in Hive: for realtime, batch, and joined (report) data. In every schema, a new table is created for every entity from the report DSL. The root table has the same name as the report; the child tables have the names assigned to the Ref attributes. The report schema contains the views that join the data from both other schemas using the Versioning rules in the DSL. By default, the tables are joined by selecting the distinct rootId and the max revisionNo from the entities. The views are generated if at least one of the report data sources is provided (either Batch or Realtime). If the DSL is updated, columns are never removed; new columns are added to the existing tables, and nulls are set in place of old columns for the new data.
The naming rules are as follows:
- For batch mode data - {ReportName}BaseDSL.{TableName}
- For realtime mode data - {ReportName}IncrementalDSL.{TableName}
- For report data - {ReportName}ReportDSL.{TableName}
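For instance, for the QuotePremiums report from the DSL example above with a hypothetical child table named LineOfBusiness, the generated names would be:
- QuotePremiumsBaseDSL.QuotePremiums and QuotePremiumsBaseDSL.LineOfBusiness for batch data
- QuotePremiumsIncrementalDSL.QuotePremiums and QuotePremiumsIncrementalDSL.LineOfBusiness for realtime data
- QuotePremiumsReportDSL.QuotePremiums and QuotePremiumsReportDSL.LineOfBusiness for the report views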
Running the Interpreter
The main entry point for the job is the com.eisgroup.genesis.report.dsl.SparkDslRunner class. This class runs the DSL interpreter. It expects to be provided with the following configuration properties:
Spark configuration parameter | Is required | Default value | Description |
---|---|---|---|
spark.genesis.report.dsl | true | - | The file name of the DSL to interpret from the fat-jar |
spark.genesis.report.type | false | batch | Type of job to start: batch, realtime, or checksum |
spark.genesis.schema.prefix | false | - | Specifies a prefix to be added to the Hive schema name |
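As a sketch only (the DSL file name, schema prefix, and jar name below are hypothetical, and in practice the job is normally deployed through the analytics-deployer described in the Deployment section), these properties would be passed as Spark configuration when submitting the job:
spark-submit \
    --class com.eisgroup.genesis.report.dsl.SparkDslRunner \
    --conf spark.genesis.report.dsl=QuotePremiums.greport \
    --conf spark.genesis.report.type=batch \
    --conf spark.genesis.schema.prefix=dev \
    my-report-module-fat.jar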
Deployment
The interpreter job can be deployed like any other Spark job using the existing analytics infrastructure provided by Genesis. A custom module needs to be created that depends on analytics-api. The module should be packaged as a fat-jar with the DSL definitions. The job can then be deployed using the analytics-deployer.
Do not forget to include analytics-fgenerator in the module definition as well as the dependencies on domains for which the report will be created.
Configuring Report Partitions in DSL
To specify partitioning columns in a DSL file, list the comma-separated partitioning columns in a "Partitioning" block within the "Report" block of the .greport file. Below is an example of the DSL syntax that should be used for this configuration.
Report SomeReport {
Partitioning {
{TimeColumns}, modelname, {BodyColumns}
}
}
Stream SomeReport from {StreamPath} in {StreamName},
Batch SomeReport {
{BodyColumns}
Ref SomeChildTable {
{BodyColumns}
}
}
There are three types (groups) of columns that can be specified in the "Partitioning" block (all of them should be named in lowercase):
- Time columns - (required) these are reserved time columns whose values are calculated automatically when running the Spark job. It is mandatory that a supported set of time columns is used for partitioning. Supported combinations of time columns (columns must be specified in the given order): year; year, month; year, month, dayofmonth; year, weekofyear; year, dayofyear.
- "modelname" column - (required) "modelname" is a mandatory column for partitioning, and its value is set automatically when running the Spark job.
- Body columns - (optional) these are custom partitioning columns that can be taken from the body of the report. Such columns must:
  - be of type Number or String;
  - be named in lowercase;
  - be also present in the "Batch" block (and "Stream" if it is defined as a separate block) and in all inner "Ref" blocks of the report;
  - not be empty.
Additional notes:
- Partitioning column groups in the "Partitioning" block can be listed in any chosen order (e.g.: modelname -> {BodyColumns} -> {TimeColumns} or {BodyColumns} -> modelname -> {TimeColumns}). However, the recommended order of partitioning columns is: {TimeColumns} -> "modelname" -> {BodyColumns}.
- Partitioning columns specified in the "Partitioning" block are propagated to all tables of the report and used for partitioning.
- Partitioning columns in Hive are created in the same order as they are listed in the "Partitioning" block.
- The number of partitioning columns is not limited.
- Usage of time-based columns (year, month, weekofyear, dayofyear, dayofmonth) in "Batch" and "Ref" blocks is not restricted by validation, but it is not recommended because the partitioning value would take precedence in case of a conflict (the value assigned in the report body would be lost). For example:
  - if "year" is used in the report body and in the Partitioning block, the value assigned in the report body will be overridden by the time value calculated in DslPartitioningStrategy (the "year" column will hold the calculated partitioning value, e.g. 2020).
  - if "dayofmonth" is used in the report body but is not used in the Partitioning block, "dayofmonth" will be created as a regular column and will contain the value that was assigned in the report body.
Validation
Below is a table with examples of valid and invalid DSL syntax for partition configuration.
Partitioning { <...> } | Batch { <...> } | Is Valid? | Comment |
---|---|---|---|
- No “Partitioning” block is defined in DSL | Attr attributeOne | Yes | If no “Partitioning” block is defined in DSL, then DefaultPartitioningStrategy will be used. |
year, weekofyear, modelname | - No attributes from "Partitioning" block | Yes | If partitioning column name matches one of predefined time column names it does not need to be declared in “Batch” block because its value will be calculated automatically from “timestamp” using SQL functions provided by Spark . |
year, weekofyear, modelname, attributeOne | Attr attributeOne: String | Yes | Same as above, except it has additional partitioning attribute from report body. Such attribute must be of String or Number type. |
attributeOne, year, weekofyear, modelname | Attr attributeOne: String | Yes | Same as above, except columns are in different order. |
year, dayofmonth, modelname | - No attributes from "Partitioning" block | No | Invalid combination of time columns. In order to use "dayofmonth", there must also be a "month" column (e.g. year, month, dayofmonth). |
year, weekofyear | - No attributes from "Partitioning" block | No | Mandatory attribute "modelname" is missing in "Partitioning" block. |
year, weekofyear, modelname, attributeOne | Attr attributeOne: DateTime | No | DateTime type is not supported for partitioning columns, only String or Number attributes may be used for partitioning. |
year, modelname, attributeOne | - No “attributeOne” attribute | No | “attributeOne” is not one of reserved time column names, so it must be present in the main body of report. |
year, modelname | Attr year | Yes | It will not fail, but the value from the "Batch" block will be overridden by the automatically calculated "year" value. |
attributeOne | Attr attributeOne | No | Partitioning columns must contain at least one time based column and "modelname" (e.g. year, modelname). |
Below is an example of a valid partitioning column configuration:
Report SomeReport {
Partitioning {
year, month, modelname, bodyattribute2, bodyattribute1
}
}
Stream SomeReport from {StreamPath} in {StreamName},
Batch SomeReport {
Attr bodyattribute1: Number is <...>
Attr bodyattribute2: String is <...>
Ref SomeChildTable {
Attr bodyattribute1: Number is <...>
Attr bodyattribute2: String is <...>
}
}
Such a configuration would result in the following partitioning columns (in order):
- year - value is calculated automatically from applied timestamp;
- month - value is calculated automatically from applied timestamp;
- modelname - value is set automatically;
- bodyattribute2 - value is taken from the attribute in "Batch" block;
- bodyattribute1 - value is taken from the attribute in "Batch" block.
Measuring DSL Performance
DSL report processing contains the following main stages:
- Updating schema
- Creating schema info
- Loading and transformation
- Saving the transformed data
- Saving checksums
There is instrumentation for every stage that allows collecting and logging execution time. By default, the instrumentation is disabled; it can be enabled and tuned with the following Spark configuration parameters:
Spark configuration parameter | Is required | Default value | Description |
---|---|---|---|
spark.genesis.gauge.enabled | No | false | Enables collection and logging execution time of stages. |
spark.genesis.gauge.logging.level | No | info | Measurements logging level. Can be either trace, debug, info, warn, or error. |
spark.genesis.gauge.threshold | No | 0 | The execution time threshold. If execution time is less than the threshold, it isn't logged. |
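For example, the instrumentation could be enabled with a non-default logging level and threshold like this (the values here are purely illustrative):
spark.genesis.gauge.enabled=true
spark.genesis.gauge.logging.level=debug
spark.genesis.gauge.threshold=100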
All records are logged under the com.eisgroup.genesis.report.util.ExecutionTimeGauge logger, and they have the following CSV-like format for further processing:
[logging prefix],[log message],[rootId],[revisionNo],[id],[execution time in milliseconds]
Here is an example of what such a log can look like:
Example of logged perf measurements
2020-09-28 05:52:56,157 INFO utils.LineBufferedStream: 2020-09-28 05:52:56,157 INFO util.ExecutionTimeGauge: ,Updating schema,,,,13964
2020-09-28 05:53:11,475 INFO utils.LineBufferedStream: 2020-09-28 05:53:11,475 INFO util.ExecutionTimeGauge: ,Creating schema info,,,,15318
2020-09-28 05:53:15,944 INFO utils.LineBufferedStream: 2020-09-28 05:53:15,944 INFO util.ExecutionTimeGauge: ,join with AutoBLOBDgig,e2e9f0a9-4285-4b5d-ab43-353022a626a6,1,,84
2020-09-28 05:53:15,958 INFO utils.LineBufferedStream: 2020-09-28 05:53:15,958 INFO util.ExecutionTimeGauge: ,join with AutoLOBDgig,e2e9f0a9-4285-4b5d-ab43-353022a626a6,1,,14
...
2020-09-28 05:53:16,220 INFO utils.LineBufferedStream: 2020-09-28 05:53:16,220 INFO util.ExecutionTimeGauge: ,root processing,e2e9f0a9-4285-4b5d-ab43-353022a626a6,1,,365
2020-09-28 05:53:16,234 INFO utils.LineBufferedStream: 2020-09-28 05:53:16,234 INFO util.ExecutionTimeGauge: ,exploding child,e2e9f0a9-4285-4b5d-ab43-353022a626a6,1,,4
2020-09-28 05:53:16,252 INFO utils.LineBufferedStream: 2020-09-28 05:53:16,252 INFO util.ExecutionTimeGauge: ,merging child,e2e9f0a9-4285-4b5d-ab43-353022a626a6,,4c26c594-0ee0-486c-abf5-29b050f299bd,1
2020-09-28 05:53:16,253 INFO utils.LineBufferedStream: 2020-09-28 05:53:16,253 INFO util.ExecutionTimeGauge: ,merging child,e2e9f0a9-4285-4b5d-ab43-353022a626a6,,65d7ae64-7386-48f5-8a4b-8cb164a8355a,1
...
2020-09-28 05:53:30,983 INFO utils.LineBufferedStream: 2020-09-28 05:53:30,983 INFO util.ExecutionTimeGauge: ,root processing,d185e197-9c52-485f-8970-923aa7e3f781,2,,30
2020-09-28 05:53:30,985 INFO utils.LineBufferedStream: 2020-09-28 05:53:30,985 INFO util.ExecutionTimeGauge: ,merging child,d185e197-9c52-485f-8970-923aa7e3f781,,95ab3f93-d7d0-4dab-9b06-a3737b0c1b10,1
...
2020-09-28 05:53:30,992 INFO utils.LineBufferedStream: 2020-09-28 05:53:30,992 INFO util.ExecutionTimeGauge: ,exploding child,d185e197-9c52-485f-8970-923aa7e3f781,2,,1
...
2020-09-28 05:54:24,467 INFO utils.LineBufferedStream: 2020-09-28 05:54:24,467 INFO util.ExecutionTimeGauge: ,saving DF for AutoPolicyCoverageCategory,,,,5940
2020-09-28 05:54:29,822 INFO utils.LineBufferedStream: 2020-09-28 05:54:29,822 INFO util.ExecutionTimeGauge: ,saving DF for Policy_AutoPolicyPersonPhoneCommunicationInfo,,,,5355
...
2020-09-28 05:57:01,990 INFO utils.LineBufferedStream: 2020-09-28 05:57:01,990 INFO util.ExecutionTimeGauge: ,Persist checksum for AutoPolicyCoverageCategory,,,,20230
2020-09-28 05:57:22,276 INFO utils.LineBufferedStream: 2020-09-28 05:57:22,276 INFO util.ExecutionTimeGauge: ,Persist checksum for Policy_AutoPolicyPersonPhoneCommunicationInfo,,,,20286
2020-09-28 05:57:40,697 INFO utils.LineBufferedStream: 2020-09-28 05:57:40,697 INFO util.ExecutionTimeGauge: ,Persist checksum for Policy_AutoVehicleOwner,,,,18421
Decreasing the number of DDL queries
During report execution, the engine generates various DDL queries to update both schemas and partitions, and these queries have a visible constant performance overhead, especially in cloud environments. Usually this is not a problem: even for a fairly big DSL (~3000 lines of code), the overhead in a cloud environment is usually no more than 20 minutes for a clean Hive (typically the first ever run of the report) and no more than 15 minutes for a Hive with an existing schema. Even that time can be decreased by about 20% by telling the engine that the report is going to work with entirely new data (in other words, the intersection of the already processed data and the data to be processed is empty). To do so, set the following property to true:
Spark configuration parameter | Is required | Default value | Description |
---|---|---|---|
spark.genesis.partition.keep | No | false | Disables all DDLs that are generated to support data consistency when partitions overlap between reports that are run on different dates |
Changing storage level
There are different phases in the DSL execution, and two of them work with a big amount of data:
- Data load and transformation
- Data conversion to data frames and saving the data in Hive
Those two phases use RDD persistence with the MEMORY_AND_DISK storage level, which shows optimal performance for big DSL reports with lots of data to process. Still, the storage level can be changed by setting the following properties to optimize the performance of a particular report:
Spark configuration parameter | Is required | Default value | Acceptable values | Description |
---|---|---|---|---|
spark.genesis.transformation.persistence.storageLevel | No | MEMORY_AND_DISK | NONE, DISK_ONLY, DISK_ONLY_2, MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_ONLY_SER, MEMORY_ONLY_SER_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, MEMORY_AND_DISK_SER, MEMORY_AND_DISK_SER_2, OFF_HEAP | Sets the storage level of RDD persistence for the load and transformation phase of DSL report execution |
spark.genesis.dataframes.persistence.storageLevel | No | MEMORY_AND_DISK | NONE, DISK_ONLY, DISK_ONLY_2, MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_ONLY_SER, MEMORY_ONLY_SER_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, MEMORY_AND_DISK_SER, MEMORY_AND_DISK_SER_2, OFF_HEAP | Sets the storage level of RDD persistence for the data frame conversion and saving phase of DSL report execution |
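As a sketch (whether this actually helps depends on the particular report and cluster), both phases could be switched to a serialized storage level to trade some CPU for a smaller memory footprint:
spark.genesis.transformation.persistence.storageLevel=MEMORY_AND_DISK_SER
spark.genesis.dataframes.persistence.storageLevel=MEMORY_AND_DISK_SER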
Hive Queries Logging
During report execution, the engine generates various Hive SQL queries. You can tune the logging of these queries with the following parameters:
Spark configuration parameter | Is required | Default value | Description |
---|---|---|---|
spark.genesis.hive.query.logging | No | false | Enables Hive queries logging. All other logging parameters are applicable only if this parameter is true |
spark.genesis.hive.query.logging.listDriversForEachQuery | No | false | Logs all registered JDBC drivers before executing any Hive SQL query. It can be used for tracing cluster instability related to missing drivers |
spark.genesis.hive.query.logging.driverManager | No | false | Enables logging/tracing in DriverManager (see DriverManager.setLogWriter for details). It can be used for tracing cluster instability related to missing or misconfigured drivers. Be cautious though, because enabling this property will generate lots of data in the Spark driver log. |
Query logs can be found by searching for the QueryLogger pattern in the logs, and they look like this example:
2020-11-06 11:56:23,256 INFO utils.LineBufferedStream: 2020-11-06 11:56:23,255 INFO util.QueryLogger: Executing Hive query: SELECT DISTINCT year, weekofyear, modelname FROM Policy_AutoPolicySummaryBaseDSL.Policy_AutoLicenseInfo WHERE modelName = 'PersonalAuto' ORDER BY year DESC, weekofyear DESC, modelname DESC
2020-11-06 11:56:23,256 INFO utils.LineBufferedStream: 2020-11-06 11:56:23,255 INFO util.QueryLogger: Executing Hive query: SELECT DISTINCT year, weekofyear, modelname FROM Policy_AutoPolicySummaryBaseDSL.AutoPolicyCoverages WHERE modelName = 'PersonalAuto' ORDER BY year DESC, weekofyear DESC, modelname DESC
2020-11-06 11:56:23,256 INFO utils.LineBufferedStream: 2020-11-06 11:56:23,255 INFO util.QueryLogger: Executing Hive query: SELECT DISTINCT year, weekofyear, modelname FROM Policy_AutoPolicySummaryBaseDSL.Policy_PremiumEntries WHERE modelName = 'PersonalAuto' ORDER BY year DESC, weekofyear DESC, modelname DESC
2020-11-06 11:56:23,256 INFO utils.LineBufferedStream: 2020-11-06 11:56:23,256 INFO util.QueryLogger: Executing Hive query: SELECT DISTINCT year, weekofyear, modelname FROM Policy_AutoPolicySummaryBaseDSL.Policy_AutoPolicyPersonEmailCommunicationInfo WHERE modelName = 'PersonalAuto' ORDER BY year DESC, weekofyear DESC, modelname DESC
2020-11-06 11:56:23,256 INFO utils.LineBufferedStream: 2020-11-06 11:56:23,256 INFO util.QueryLogger: Executing Hive query: SELECT DISTINCT year, weekofyear, modelname FROM Policy_AutoPolicySummaryBaseDSL.AutoPolicyCoverageCategory WHERE modelName = 'PersonalAuto' ORDER BY year DESC, weekofyear DESC, modelname DESC
2020-11-06 11:56:23,256 INFO utils.LineBufferedStream: 2020-11-06 11:56:23,256 INFO util.QueryLogger: Executing Hive query: SELECT DISTINCT year, weekofyear, modelname FROM Policy_AutoPolicySummaryBaseDSL.Policy_AutoPolicyPerson WHERE modelName = 'PersonalAuto' ORDER BY year DESC, weekofyear DESC, modelname DESC
2020-11-06 11:56:23,256 INFO utils.LineBufferedStream: 2020-11-06 11:56:23,256 INFO util.QueryLogger: Executing Hive query: SELECT DISTINCT year, weekofyear, modelname FROM Policy_AutoPolicySummaryBaseDSL.Policy_AutoVehicleRiskItem WHERE modelName = 'PersonalAuto' ORDER BY year DESC, weekofyear DESC, modelname DESC
2020-11-06 11:56:23,256 INFO utils.LineBufferedStream: 2020-11-06 11:56:23,256 INFO util.QueryLogger: Executing Hive query: SELECT DISTINCT year, weekofyear, modelname FROM Policy_AutoPolicySummaryBaseDSL.Policy_AutoPolicyPersonPhoneCommunicationInfo WHERE modelName = 'PersonalAuto' ORDER BY year DESC, weekofyear DESC, modelname DESC
2020-11-06 11:56:23,257 INFO utils.LineBufferedStream: 2020-11-06 11:56:23,257 INFO util.QueryLogger: Executing Hive query: SELECT DISTINCT year, weekofyear, modelname FROM Policy_AutoPolicySummaryBaseDSL.UnderwritingClaimCount WHERE modelName = 'PersonalAuto' ORDER BY year DESC, weekofyear DESC, modelname DESC
2020-11-06 11:56:23,258 INFO utils.LineBufferedStream: 2020-11-06 11:56:23,258 INFO util.QueryLogger: Executing Hive query: SELECT DISTINCT year, weekofyear, modelname FROM Policy_AutoPolicySummaryBaseDSL.ClaimScores WHERE modelName = 'PersonalAuto' ORDER BY year DESC, weekofyear DESC, modelname DESC
2020-11-06 11:56:23,259 INFO utils.LineBufferedStream: 2020-11-06 11:56:23,258 INFO util.QueryLogger: Executing Hive query: SELECT DISTINCT year, weekofyear, modelname FROM Policy_AutoPolicySummaryBaseDSL.Policy_CarRatingSpecification WHERE modelName = 'PersonalAuto' ORDER BY year DESC, weekofyear DESC, modelname DESC
2020-11-06 11:56:23,259 INFO utils.LineBufferedStream: 2020-11-06 11:56:23,259 INFO util.QueryLogger: Executing Hive query: SELECT DISTINCT year, weekofyear, modelname FROM Policy_AutoPolicySummaryBaseDSL.Policy_AutoPolicySummary WHERE modelName = 'PersonalAuto' ORDER BY year DESC, weekofyear DESC, modelname DESC