Interpreters

Teragrep uses an Apache Zeppelin feature called Zeppelin Interpreters to implement different language backends. These language backends are categorized into interpreter groups.

Interpreters within the same group can share data with each other. For example, you can perform a search using DPL and then further process the results using Scala, as both belong to the Spark group of interpreters.

To choose an interpreter, enter a percent sign (%) followed by the interpreter name in the paragraph’s code editor. For example: %dpl
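The directive goes on the paragraph's first line, followed by the code for that interpreter. A minimal illustrative paragraph (the Markdown content below is only an example):

%md
This paragraph is rendered by the Markdown interpreter.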

The interpreter may crash if a programming error has been made. To restart it, click the reload icon in the Interpreter Binding Settings.

Data Flow

Teragrep takes user input through paragraphs in the Notebook View. When a notebook is created, it is stored as a new file in a Git repository. New notebooks are created with a single paragraph.

Select the desired interpreter in the paragraph's code editor. If the interpreter belongs to the Apache Spark group, the code is packaged and launched within Apache Spark. If not, the code is compiled and executed directly. The results of the execution are then pushed back to the Teragrep user interface, which processes the received output and displays it in the output section of the paragraph.

DPL abstracts the complexity of the tasks required to access the Teragrep Archive.

The code is executed using the user's credentials. Because of this, the program receives the same user-level permissions that the user has on the operating system running Teragrep, and the operating system's logging mechanism (such as SELinux) records all operations.

Interpreter Dependencies

Teragrep can download dependencies when configured with a remote Maven Repository.

For more information, see Apache Zeppelin’s documentation.

For security reasons, dynamic dependency loading is currently not shipped with Teragrep. However, dependencies can be configured for each interpreter individually.

Spark

Data Processing Language (DPL)

Teragrep’s Data Processing Language (DPL) is selected by entering %dpl in the paragraph’s code editor.

By default, DPL uses US time format (MM/DD/YYYY) for compatibility reasons.

The following example uses these criteria:

  • Matching the keywords "200000762939453" and "997913837433"

  • Searching only the dataset "f17" (specified with index=)

  • Limiting results to a specific stream (specified with sourcetype=) and a specific host (specified with host=)

  • Limiting the time range with earliest= and latest= (both set to "01/08/2020:02:00:00" in this example)

%dpl

index="F17" sourcetype=lOg:f17:0 AND host=sC-99-99-14-162 earliest="01/08/2020:02:00:00" latest="01/08/2020:02:00:00" 200000762939453 997913837433

Count Aggregation

The following example aggregates all data:

  • From column "_raw"

  • Grouped by "host"

  • Within the dataset "f17"

%dpl

index=f17 sourcetype=log:f17:0 _index_earliest="12/31/1970:10:15:30" _index_latest="12/31/2022:10:15:30" | chart count(_raw) by host

The parameters _index_earliest= and _index_latest= are currently aliases of earliest= and latest=.
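Because of this, the same aggregation can also be written with the shorter parameter names (an equivalent form, assuming the aliases behave as stated above):

%dpl

index=f17 sourcetype=log:f17:0 earliest="12/31/1970:10:15:30" latest="12/31/2022:10:15:30" | chart count(_raw) by host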


Data Output

DPL writes its results into a Spark Query named after the paragraph’s unique identifier. (You can find each paragraph’s identifier in the Paragraph Settings menu.) This output can then be processed with other interpreters that belong to the same interpreter group.
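For example, a DPL result could be picked up in a Scala paragraph roughly like this (a minimal sketch; the identifier below is taken from the Spark SQL example later in this section and must be replaced with your own paragraph’s identifier):

%spark

// Read the output of a DPL paragraph; replace the identifier below with
// the identifier of your own DPL paragraph (shown in Paragraph Settings).
val results = spark.sql("SELECT * FROM `paragraph_1622045160279_903527815`")
results.printSchema()
results.show(10)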

Scala

Scala is the default language for Apache Spark. Because of this, Scala is selected by entering %spark in the paragraph’s code editor.

%spark

println(sc.version)
println(util.Properties.versionString)
println(java.net.InetAddress.getLocalHost().getHostName())
println(System.getProperty("user.name"))
println(java.time.LocalDateTime.now.format(java.time.format.DateTimeFormatter.ofPattern("dd.MM.yyyy hh:mm:ss")))

PySpark

%pyspark
import sys
import datetime
import socket
import getpass
import numpy

print(sys.version)
print(numpy)
print(socket.gethostname())
print(getpass.getuser())
print(datetime.datetime.now())

Spark SQL

Due to limitations in Spark, the data is wrapped into an array when selecting from a dataset that is provided with a Recall function. SQL code for unwrapping it can be provided on request if necessary.

%sql

SELECT * FROM `paragraph_1622045160279_903527815`;
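As an illustrative sketch only (not the officially supported unwrapping code), Spark SQL’s explode() can flatten such a wrapped result, assuming here a hypothetical array column named wrapped:

%sql

-- Illustrative only: `wrapped` is a hypothetical name for the array column.
SELECT explode(wrapped) AS value
FROM `paragraph_1622045160279_903527815`;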

Spark SQL execution gives you detailed information about failures.


Kotlin

%kotlin

println("Hello this is Kotlin!")

Spark Configuration

You can alter the Spark configuration by using the interpreter %spark.conf.

Resource Allocation

The administrator of the Spark Cluster may limit the available resources per user. A request that exceeds this limit will crash the interpreter.

Dynamic Allocation profile:

%spark.conf
spark.dynamicAllocation.enabled=true

Static Allocation profile:

%spark.conf
spark.executor.instances=8
spark.executor.cores=1
spark.executor.memory=1g

Standalone Interpreters

Standalone interpreters are interpreters that don’t belong to a group with other interpreters. Because of this, they can’t share in-memory data with other interpreters. They can still access files on the file system, of course.

You can use standalone interpreters to demonstrate different code snippets within your organization. You can combine this with the Maven Dependency feature, which downloads Maven artifacts from your organization’s internal repository. This is a handy way of working when following DevOps practices.

Java

%java

public class JavaExample {
    public static void main(String[] args) {
        System.out.println("Java? Yes!");
    }
}

Shell

%sh
whoami
hostname
date

Markdown

%md
Markdown Example
===
List

* Item 1

Numbered list
1. Item

Python

Python and PySpark are two separate interpreters, even though they both use the same language. You can use PySpark to access Spark features.

%python
import sys
import datetime
import socket
import getpass
import numpy

print(sys.version)
print(numpy)
print(socket.gethostname())
print(getpass.getuser())
print(datetime.datetime.now())

Display Systems

Teragrep uses AngularJS to render output.

Angular can be used either in frontend mode or in backend mode. In frontend mode, the browser code controls the execution; in backend mode, other interpreters are allowed to modify the displayed content.

You can use the frontend mode by entering %angular in the paragraph’s code editor.

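For example, a minimal frontend paragraph could look like this (the markup below is only illustrative):

%angular

<div>
  <h3>Hello from the frontend</h3>
  <p>This markup is rendered directly in the browser.</p>
</div>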

You can use the backend mode by printing strings to standard output that begin with %angular:

%sh

echo '%angular <h1>Hey</h1>'