Data Engineering
How to Configure the Metastore in Apache Spark
This article explores the Hive metastore: how to set it up and which parameters are needed to integrate it with Spark. It covers both the local and the remote metastore option and the configuration each one requires.
Spark Exasol connector – java.sql.SQLException: [ERROR] Connection String does not support (workerID) argument.
The current Exasol connector (2.1.6) is not compatible with the latest Exasol JDBC driver (24.1.1). Context: We have a Spark data processing pipeline that uses the Exasol connector version 1.4 and a JDBC driver version 7. Everything was working fine,…
Subtle Difference Between Dockerfile and Docker Compose – Variables in Entrypoints
TLDR: Variables in entrypoints should be escaped. This can be done by using a second $. Background: While setting up a Spark Thrift Server, I encountered an oversight that is obvious in retrospect. I would always get the following error,…
Spark – Error with UTF8 encoding in Docker Image
In German, we encounter special characters known as Umlaute, including ä, ü, ö. If the configuration is not correctly set, encoding these symbols may result in information loss. Let’s explore a practical example where such a misconfiguration led to a…
Spark – java.nio.channels.UnresolvedAddressException
A very short write-up of the following error, which apparently this user also encountered and documented (GitHub). Be aware that this error might appear in several scenarios. It just so happened that in my specific situation, it was an easy…
Exasol – object XXX not found
TLDR: Identifiers in Exasol are stored in upper case internally. Selections should also be quoted. Observation: In Exasol I created a Python User Defined Function like this: Then I tried to call the UDF in a SELECT statement like this: However…
Python – Pass by object: Practical pitfall
Inside a loop, I was accessing an object within a dictionary multiple times to transform and visualize it. The intention was to keep all transformations isolated from each other. What actually happened, though, was that those transformations accumulated because of Python’s…
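The pitfall described in this teaser can be reproduced with a small sketch. The post's own code is not shown here, so the dictionary `data`, the helper `double_in_place`, and the use of `copy.deepcopy` as a remedy are all hypothetical choices made for illustration, not the author's solution.

```python
import copy


def double_in_place(values):
    """Hypothetical transformation that mutates the list it receives."""
    for i, v in enumerate(values):
        values[i] = v * 2
    return values


# Hypothetical data: one list stored inside a dictionary.
data = {"values": [1, 2, 3]}

# Each iteration receives the very same list object, so the changes accumulate.
for _ in range(2):
    double_in_place(data["values"])
accumulated = list(data["values"])

# Passing a deep copy keeps every iteration isolated from the stored object.
data = {"values": [1, 2, 3]}
for _ in range(2):
    isolated = double_in_place(copy.deepcopy(data["values"]))
untouched = data["values"]

print(accumulated)  # doubled twice
print(isolated)     # doubled once
print(untouched)    # original values
```

A deep copy is only one possible remedy; writing the transformation so that it returns a new list instead of mutating its argument would avoid the problem as well.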
Duplicate Keys when Generating a Json from a Dictionary in Python
TLDR: A JSON object treats all keys as strings, while a Python dict distinguishes keys not only by their content but also by their data type (see Stack Overflow). When saving a dictionary to JSON and reloading the dictionary from it, you…
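A minimal sketch of the behaviour summarized in the TLDR, using only the standard `json` module. The dictionary contents are made up for illustration; the post's own example is not shown here.

```python
import json

# Hypothetical data: an int key and a str key that look alike but are
# two distinct keys in a Python dict.
original = {1: "int key", "1": "str key"}

# json.dumps converts the int key 1 to the string "1", so the resulting
# JSON text contains the key "1" twice.
encoded = json.dumps(original)

# json.loads keeps only the last value it sees for a repeated key.
restored = json.loads(encoded)

print(len(original))  # two entries before the round trip
print(encoded)
print(restored)       # one entry after the round trip
```

If the key types matter after reloading, one option is to normalize all keys to strings before serializing, so that a collision shows up in the dict itself and not only in the JSON output.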
How To Create A Superset Guest Token With Python To Embed Dashboards
The ultimate goal is to embed a Superset dashboard into, e.g., a React application. To achieve this, one step is the creation of guest tokens (service accounts). This process is (in my opinion) not sufficiently well documented, which is why…
Airflow – Fill Dagbag takes too long
TLDR: It is possible to dynamically create DAGs with only one DAG script. However, at task execution the original DAG script will be parsed once again. This results in unnecessary parsing iterations of DAGs that are not the parent DAG…