
Analyzing ETL Jobs with SnappyFlow

Analytics processing usually requires users to run a sequence of database read/write jobs and transformation operations. In many cases these jobs are complex and may take a long time to complete. Users commonly need to monitor individual job runs, summarize the performance of different jobs, and compare two runs of a specific job type.

SnappyFlow provides an easy yet powerful tool to analyze ETL jobs. The approach is explained in the steps below.

Step 1: Drop logs from ETL Jobs

SnappyFlow allows a job to have up to a three-level hierarchy: Job, Stage, Task. Logs in JSON format must be dropped whenever a job, stage, or task is started, completed, or terminated. These logs can be parsed using SnappyFlow’s ETL parser.

Log format for a Job:
{
    "jobName": <Job-name>,
    "jobId": <Unique JobId>,
    "time": <Time in epoch milliseconds format>,
    "type": "job",
    "status": <status: started, success, failed or aborted>
}
Log format for a Stage:
{
    "jobName": <Job-name>,
    "jobId": <Unique JobId>,
    "stageId": <stageId>,
    "stageName": <stageName>,
    "time": <Time in epoch milliseconds format>,
    "type": "stage",
    "status": <status: started, success, failed or aborted>
}
Log format for a Task:
{
    "jobName": <Job-name>,
    "jobId": <Unique JobId>,
    "stageId": <stageId>,
    "stageName": <stageName>,
    "time": <Time in epoch milliseconds format>,
    "type": "task",
    "status": <status: started, success, failed or aborted>
}
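For illustration, the sketch below shows one way a Python ETL job might drop such logs. The helper function, log file path, and job names are hypothetical; only the field names and values follow the formats above.

import json
import time

LOG_PATH = "/var/log/etl/etl_events.log"  # hypothetical path; must match log_path in Step 2

def drop_log(job_name, job_id, status, log_type="job", stage_id=None, stage_name=None):
    """Append one event in the format expected by SnappyFlow's ETL parser."""
    event = {
        "jobName": job_name,
        "jobId": job_id,
        "time": int(time.time() * 1000),  # epoch milliseconds
        "type": log_type,                 # "job", "stage" or "task"
        "status": status,                 # started, success, failed or aborted
    }
    if log_type in ("stage", "task"):
        event["stageId"] = stage_id
        event["stageName"] = stage_name
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(event) + "\n")

# Example: one job run containing a single stage
drop_log("nightly-load", "run-001", "started")
drop_log("nightly-load", "run-001", "started", "stage", stage_id=1, stage_name="extract")
drop_log("nightly-load", "run-001", "success", "stage", stage_id=1, stage_name="extract")
drop_log("nightly-load", "run-001", "success")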

Step 2: Forward logs to SnappyFlow

Add the following log parser to the logging section of sfAgent’s config.yaml:

logging:
  plugins:
    - name: etlRaw
      enabled: true
      config:
        log_path: <log file path>

Restart sfAgent with the new configuration.

service sfagent restart

Check that documents have been received in SnappyFlow. You will find up to three document types under metrics, with plugin name “etlRaw” and documentType “job”, “stage” or “task”, depending on your hierarchy.

Step 3: Generate an access URL for use by summarization module

Logs shipped to SnappyFlow are in raw form and cannot be directly used for reporting and analysis. The user therefore has to export this raw data to a summarization script that transforms it and sends it back to SnappyFlow as a new document.

  • Import an ETL template into your dashboard.

  • Go to the “Scratchpad” pane.

  • Click the ‘Export API Endpoint’ option in the component and create a component URL for each of the three components, for an interval of, say, Last 5 mins.

  • Click the ‘API Endpoints’ option for the project to view the API list. Copy the URLs for the three components and the authentication token; these need to be provided in Step 4. A quick way to verify an exported URL is sketched below.
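Before wiring the URLs into the script, you can optionally sanity-check one of them. This is a minimal sketch, assuming the exported endpoint returns JSON over HTTP GET and accepts the token as a bearer Authorization header; check the project’s API Endpoints page for the actual scheme.

import requests

COMPONENT_URL = "<component URL copied from the API list>"      # placeholder
AUTH_TOKEN = "<authentication token copied from the API list>"  # placeholder

# Assumption: the token is sent as a bearer token; adjust to the
# scheme shown on the project's API Endpoints page if it differs.
resp = requests.get(COMPONENT_URL, headers={"Authorization": f"Bearer {AUTH_TOKEN}"})
resp.raise_for_status()
print(resp.json())  # raw etlRaw documents for the chosen interval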

Step 4: Run summarization script as a cronjob

Install the pip package from the link below; refer to it for installation and usage instructions.

sfapmetl · PyPI (https://pypi.org/project/sfapmetl/)

  • The Python script takes a config file path as input.

  • Set values for key, appName, projectName, Name.

  • Provide the component URLs for Job, Stage and Task, and the authKey (from Step 3). A sketch of assembling these values into a config file follows this list.

  • The data will be available in the dashboard under the plugin ‘etlReport’ with documentType job, stage and task.
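The config file schema is defined by the sfapmetl package; the sketch below only illustrates assembling the values named above into a YAML config from Python. The field names for the three URLs (jobUrl, stageUrl, taskUrl) and the file location are hypothetical; follow the PyPI page for the actual schema.

import yaml  # PyYAML

# Field names below are illustrative; use the schema documented
# for the sfapmetl package on PyPI.
config = {
    "key": "<profile key>",
    "appName": "<application name>",
    "projectName": "<project name>",
    "Name": "<instance name>",
    "jobUrl": "<component URL for Job>",      # hypothetical field name
    "stageUrl": "<component URL for Stage>",  # hypothetical field name
    "taskUrl": "<component URL for Task>",    # hypothetical field name
    "authKey": "<authentication token from Step 3>",
}

with open("/etc/sfapmetl/config.yaml", "w") as f:  # hypothetical location
    yaml.safe_dump(config, f, sort_keys=False)

Then schedule the script with cron at the same interval used when exporting the URLs (for example, every 5 minutes), using the invocation documented on the PyPI page.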

Step 5: Review ETL Dashboards

You will now see the summarized data in the dashboard under etlReport for jobs, stages and tasks. Select a particular job and choose a timeline to see job duration trends over the selected period. Clicking a particular job ID provides a drilled-down view of the stages and tasks within that job.