<p>JRald blog: Gérald's technical blog about Java, Kafka, Elasticsearch, DevOps… (Gérald Quintana)</p>
<h1>Build your own CA with Ansible</h1>
<p>2020-11-28, /2020/11/28/Build-your-own-CA-with-Ansible</p>
<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>Securing your Kafka, Elasticsearch, Cassandra, or whatever distributed software you run requires configuring SSL (also known as TLS) to encrypt communications:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>Node to node communication</p>
</li>
<li>
<p>Client to node communication</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>Setting up SSL means providing SSL certificates for each node.
But generating SSL certificates is a cumbersome task:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>The <a href="https://kafka.apache.org/documentation/#security_ssl">Kafka documentation</a> describes extensively the process.</p>
</li>
<li>
<p>Elasticsearch brings its own <a href="https://www.elastic.co/guide/en/elasticsearch/reference/master/configuring-tls.html#node-certificates">elasticsearch-certutil</a> tool.</p>
</li>
<li>
<p>Datastax also <a href="https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/configuration/secureSSLCertWithCA.html">documents</a> a similar process for Cassandra.</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>I will describe here how to generate an SSL certificate for each node using Ansible.
It makes sense as I am also deploying Kafka, Elasticsearch and the like with Ansible.</p>
</div>
<div class="paragraph">
<p>There are several important rules to know when generating certificates:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>The name present in the certificate must match the public name of the host.
We cannot share the same certificate across all nodes unless we use wildcard certificates.
Any TLS client connecting to a node will check that the certificate name and the hostname match, unless hostname verification is disabled.</p>
</li>
<li>
<p>The name present in the certificate should match the reverse DNS name corresponding to the IP of the host.
Java clients connecting to a node will do a reverse DNS lookup to get the public name of the host they are connecting to.</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>These two rules are meant to prevent <strong>man-in-the-middle</strong> attacks.
A TLS certificate lets you check that you are talking to the intended target,
not to something in between that could spy on you and steal information.</p>
</div>
<div class="paragraph">
<p>When a machine has multiple names (think DNS aliases, virtual hosts), a certificate can contain multiple names.
The main name is called the CN (Common Name),
while the other names are called SANs (Subject Alternative Names).</p>
</div>
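As a rough illustration of the first rule, hostname verification boils down to comparing the client's target hostname with the names listed in the certificate (CN and SANs). The sketch below is my own and is far simpler than the full RFC 6125 rules a real TLS library implements; it only shows the principle, including one-level wildcard certificates:

```python
def name_matches(pattern, hostname):
    # Compare one certificate name (CN or SAN) with the target hostname.
    # Handles one-level wildcards: "*.example.com" matches "node1.example.com"
    # but not "a.b.example.com" and not "example.com" itself.
    pattern, hostname = pattern.lower(), hostname.lower()
    if pattern.startswith("*."):
        suffix = pattern[1:]  # ".example.com"
        return hostname.endswith(suffix) and "." not in hostname[: -len(suffix)]
    return pattern == hostname

def hostname_verified(cert_names, hostname):
    # The connection is accepted if any certificate name matches.
    return any(name_matches(n, hostname) for n in cert_names)

names = ["kafka1.example.com", "*.kafka.example.com"]
print(hostname_verified(names, "kafka1.example.com"))       # CN match
print(hostname_verified(names, "node2.kafka.example.com"))  # wildcard SAN match
print(hostname_verified(names, "evil.com"))                 # no match
```

Disabling hostname verification in a client amounts to skipping this check entirely, which is exactly what makes man-in-the-middle attacks possible.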
</div>
</div>
<div class="sect1">
<h2 id="the_certificate_authority">The certificate authority</h2>
<div class="sectionbody">
<div class="paragraph">
<p>As Kafka or Elasticsearch clusters should never be publicly exposed,
using a public certificate authority (Thawte, Verisign and the like) is not necessary.
A self-signed certificate authority local to the cluster or the environment (Dev, Q/A) should be enough.</p>
</div>
<div class="paragraph">
<p>So the first step is to create a certificate authority that will be used to sign the certificates of all hosts belonging to our cluster.
As this step will be done only once, I won’t automate it.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="shell">$ mkdir ownca
$ openssl req -new -x509 \
-days 1825 \ <i class="conum" data-value="1"></i><b>(1)</b>
-extensions v3_ca \ <i class="conum" data-value="2"></i><b>(2)</b>
-keyout ownca/root.key -out ownca/root.crt <i class="conum" data-value="3"></i><b>(3)</b>
Generating a RSA private key
......+++++
....+++++
writing new private key to 'ownca/root.key'
Enter PEM pass phrase: <i class="conum" data-value="4"></i><b>(4)</b>
Verifying - Enter PEM pass phrase:
-----
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Country Name (2 letter code) [AU]:FR <i class="conum" data-value="5"></i><b>(5)</b>
State or Province Name (full name) [Some-State]:.
Locality Name (eg, city) []:.
Organization Name (eg, company) [Internet Widgits Pty Ltd]:eNova Conseil
Organizational Unit Name (eg, section) []:.
Common Name (e.g. server FQDN or YOUR name) []:Root
Email Address []:rootca@enova-conseil.com</code></pre>
</div>
</div>
<div class="colist arabic">
<table>
<tr>
<td><i class="conum" data-value="1"></i><b>1</b></td>
<td>The CA root certificate will last 5 years</td>
</tr>
<tr>
<td><i class="conum" data-value="2"></i><b>2</b></td>
<td>This certificate will be used as a CA</td>
</tr>
<tr>
<td><i class="conum" data-value="3"></i><b>3</b></td>
<td>Generate both key and self-signed certificate</td>
</tr>
<tr>
<td><i class="conum" data-value="4"></i><b>4</b></td>
<td>The key is protected with a password</td>
</tr>
<tr>
<td><i class="conum" data-value="5"></i><b>5</b></td>
<td>Information describing the Root certificate</td>
</tr>
</table>
</div>
<div class="paragraph">
<p>For security reasons, the generated key must be kept secret and stored in a secure place:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>It must not be transferred to the target Kafka servers</p>
</li>
<li>
<p>It must not be kept in source control (Git) unless encrypted with Ansible Vault</p>
</li>
</ul>
</div>
</div>
</div>
<div class="sect1">
<h2 id="the_nodes_certificates">The node certificates</h2>
<div class="sectionbody">
<div class="paragraph">
<p>This is where Ansible comes in.
As your cluster might have many nodes, automating certificate generation makes sense.
For each target host, I will repeat the same process:</p>
</div>
<div class="imageblock">
<div class="content">
<img src="/images/2020-11-28-Build-your-own-CA-with-Ansible/process.svg" alt="Process">
</div>
</div>
<div class="olist arabic">
<ol class="arabic">
<li>
<p>On the target host, generate a key <code>target.key</code> and a CSR (Certificate signing request) <code>target.csr</code></p>
</li>
<li>
<p>Pull the CSR on the control host.</p>
</li>
<li>
<p>Sign the CSR with the CA key.
This will generate a certificate <code>target.crt</code>.</p>
</li>
<li>
<p>Push the generated certificate <code>target.crt</code> on the target host.
The CA certificate <code>root.crt</code> is also pushed.</p>
</li>
</ol>
</div>
<div class="paragraph">
<p>As the TLS keys (<code>.key</code>) are sensitive, they never travel: they stay where they were generated.
On the contrary, certificates (<code>.crt</code>) and CSRs (<code>.csr</code>) contain only public information.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="yaml"><span class="comment"># Step 1</span>
- <span class="string"><span class="content">name: Generate private key</span></span>
<span class="key">become</span>: <span class="string"><span class="content">true</span></span>
<span class="key">openssl_privatekey</span>:
<span class="key">path</span>: <span class="string"><span class="delimiter">"</span><span class="content">/etc/pki/tls/private/{{ openssl_name }}.key</span><span class="delimiter">"</span></span>
- <span class="string"><span class="content">name: Generate CSR</span></span>
<span class="key">become</span>: <span class="string"><span class="content">true</span></span>
<span class="key">openssl_csr</span>:
<span class="key">path</span>: <span class="string"><span class="delimiter">"</span><span class="content">/etc/pki/tls/private/{{ openssl_name }}.csr</span><span class="delimiter">"</span></span>
<span class="key">privatekey_path</span>: <span class="string"><span class="delimiter">"</span><span class="content">/etc/pki/tls/private/{{ openssl_name }}.key</span><span class="delimiter">"</span></span>
<span class="key">country_name</span>: <span class="string"><span class="content">FR</span></span>
<span class="key">organization_name</span>: <span class="string"><span class="delimiter">"</span><span class="content">eNova Conseil</span><span class="delimiter">"</span></span>
<span class="key">common_name</span>: <span class="string"><span class="delimiter">"</span><span class="content">{{ openssl_name }}</span><span class="delimiter">"</span></span>
<span class="key">subject_alt_name</span>: <span class="string"><span class="delimiter">"</span><span class="content">DNS:{{ ansible_host }},DNS:{{ ansible_fqdn }}</span><span class="delimiter">"</span></span>
<span class="comment"># Step 2</span>
- <span class="string"><span class="content">name: Pull CSR</span></span>
<span class="key">become</span>: <span class="string"><span class="content">true</span></span>
<span class="key">fetch</span>:
<span class="key">src</span>: <span class="string"><span class="delimiter">"</span><span class="content">/etc/pki/tls/private/{{ openssl_name }}.csr</span><span class="delimiter">"</span></span>
<span class="key">dest</span>: <span class="string"><span class="delimiter">"</span><span class="content">{{ openssl_ownca_dir }}/{{ openssl_name }}.csr</span><span class="delimiter">"</span></span>
<span class="key">flat</span>: <span class="string"><span class="content">true</span></span>
<span class="comment"># Step 3</span>
- <span class="string"><span class="content">name: Sign CSR with CA key</span></span>
<span class="key">connection</span>: <span class="string"><span class="content">local</span></span>
<span class="key">delegate_to</span>: <span class="string"><span class="content">localhost</span></span>
<span class="key">openssl_certificate</span>:
<span class="key">path</span>: <span class="string"><span class="delimiter">"</span><span class="content">{{ openssl_ownca_dir }}/{{ openssl_name }}.crt</span><span class="delimiter">"</span></span>
<span class="key">csr_path</span>: <span class="string"><span class="delimiter">"</span><span class="content">{{ openssl_ownca_dir }}/{{ openssl_name }}.csr</span><span class="delimiter">"</span></span>
<span class="key">ownca_path</span>: <span class="string"><span class="delimiter">"</span><span class="content">{{ openssl_ownca_dir }}/root.crt</span><span class="delimiter">"</span></span>
<span class="key">ownca_privatekey_path</span>: <span class="string"><span class="delimiter">"</span><span class="content">{{ openssl_ownca_dir }}/root.key</span><span class="delimiter">"</span></span>
<span class="key">provider</span>: <span class="string"><span class="content">ownca</span></span>
<span class="comment"># Step 4</span>
- <span class="string"><span class="content">name: Push certificate</span></span>
<span class="key">become</span>: <span class="string"><span class="content">true</span></span>
<span class="key">copy</span>:
<span class="key">src</span>: <span class="string"><span class="delimiter">"</span><span class="content">{{ openssl_ownca_dir }}/{{ openssl_name }}.crt</span><span class="delimiter">"</span></span>
<span class="key">dest</span>: <span class="string"><span class="delimiter">"</span><span class="content">/etc/pki/tls/private/{{ openssl_name }}.crt</span><span class="delimiter">"</span></span>
- <span class="string"><span class="content">name: Push CA</span></span>
<span class="key">become</span>: <span class="string"><span class="content">true</span></span>
<span class="key">copy</span>:
<span class="key">src</span>: <span class="string"><span class="delimiter">"</span><span class="content">{{ openssl_ownca_dir }}/root.crt</span><span class="delimiter">"</span></span>
<span class="key">dest</span>: <span class="string"><span class="delimiter">"</span><span class="content">/etc/pki/ca-trust/source/anchors/root.pem</span><span class="delimiter">"</span></span></code></pre>
</div>
</div>
<div class="paragraph">
<p>Once you have the key, the certificate and CA certificate chain on the target host, you can start using them:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="yaml">- <span class="string"><span class="content">name: Update CA Trust</span></span>
<span class="key">become</span>: <span class="string"><span class="content">true</span></span>
<span class="key">command</span>: <span class="string"><span class="delimiter">"</span><span class="content">update-ca-trust extract</span><span class="delimiter">"</span></span>
- <span class="string"><span class="content">name: Build PKCS12 file containing key and cert</span></span>
<span class="key">become</span>: <span class="string"><span class="content">true</span></span>
<span class="key">openssl_pkcs12</span>:
<span class="key">action</span>: <span class="string"><span class="content">export</span></span>
<span class="key">path</span>: <span class="string"><span class="delimiter">"</span><span class="content">/etc/pki/tls/private/{{ openssl_name }}.p12</span><span class="delimiter">"</span></span>
<span class="key">friendly_name</span>: <span class="string"><span class="delimiter">"</span><span class="content">{{ openssl_name }}</span><span class="delimiter">"</span></span>
<span class="key">privatekey_path</span>: <span class="string"><span class="delimiter">"</span><span class="content">/etc/pki/tls/private/{{ openssl_name }}.key</span><span class="delimiter">"</span></span>
<span class="key">certificate_path</span>: <span class="string"><span class="delimiter">"</span><span class="content">/etc/pki/tls/private/{{ openssl_name }}.crt</span><span class="delimiter">"</span></span>
<span class="key">other_certificates</span>: <span class="string"><span class="delimiter">"</span><span class="content">/etc/pki/ca-trust/source/anchors/root.pem</span><span class="delimiter">"</span></span>
<span class="key">state</span>: <span class="string"><span class="content">present</span></span></code></pre>
</div>
</div>
<div class="paragraph">
<p>The produced PKCS12 file can be used as a Java Keystore. The <code>java_keystore</code> Ansible module can be used to create a JKS file instead.</p>
</div>
<div class="paragraph">
<p>The attentive reader has noticed I am using a bunch of <code>openssl_xxx</code> Ansible modules (namely <code>openssl_privatekey</code>, <code>openssl_csr</code>, <code>openssl_certificate</code> and <code>openssl_pkcs12</code>).
These modules require OpenSSL and pyOpenSSL to be installed on each host.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="yaml">- <span class="string"><span class="content">name: Python OpenSSL package</span></span>
<span class="key">become</span>: <span class="string"><span class="content">true</span></span>
<span class="key">yum</span>:
<span class="key">name</span>:
- <span class="string"><span class="content">pyOpenSSL</span></span>
- <span class="string"><span class="content">python2-pip</span></span>
- <span class="string"><span class="content">ca-certificates</span></span>
- <span class="string"><span class="content">name: Upgrade Python OpenSSL</span></span>
<span class="key">become</span>: <span class="string"><span class="content">true</span></span>
<span class="key">pip</span>:
<span class="key">name</span>: <span class="string"><span class="content">pyOpenSSL>=0.15</span></span></code></pre>
</div>
</div>
</div>
</div>
<h1>Retrieving Kafka Lag</h1>
<p>Gérald Quintana, 2020-01-16, /2020/01/16/Retrieving-Kafka-lag</p>
<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>This article shows how to get Kafka lag for a given consumer group using the Java API.
It’s about implementing part of the <code>kafka-consumer-groups</code> command-line tool in pure Java.</p>
</div>
<div class="imageblock">
<div class="content">
<img src="/images/2020-01-16-Retrieving-Kafka-lag/lag.svg" alt="Consumer Lag">
</div>
</div>
<div class="paragraph">
<p>To get consumer lag we will go through several steps:</p>
</div>
<div class="olist arabic">
<ol class="arabic">
<li>
<p>Get consumer group current offset, 4 in the above example</p>
</li>
<li>
<p>Get topic end offset: the producer&#8217;s offset, 8 in the above example</p>
</li>
<li>
<p>Compute the lag: the difference between both</p>
</li>
</ol>
</div>
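The three steps can be sketched with plain Python dictionaries keyed by (topic, partition); the topic name and offset values here are made up for the example, the real maps come from the Kafka client API shown in the next sections:

```python
# Step 1: consumer group current offsets, per (topic, partition)
consumer_offsets = {("events", 0): 4, ("events", 1): 7}
# Step 2: topic end offsets (the producer side), per (topic, partition)
end_offsets = {("events", 0): 8, ("events", 1): 7}

# Step 3: lag is the difference, clamped to zero
lag = {tp: max(end_offsets[tp] - offset, 0)
       for tp, offset in consumer_offsets.items()}
print(lag)  # {('events', 0): 4, ('events', 1): 0}
```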
</div>
</div>
<div class="sect1">
<h2 id="getting_consumer_group_offset">Getting consumer group offset</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Kafka 2.0 added to the <code>AdminClient</code> class a very useful
<a href="https://kafka.apache.org/20/javadoc/org/apache/kafka/clients/admin/AdminClient.html#listConsumerGroupOffsets-java.lang.String-org.apache.kafka.clients.admin.ListConsumerGroupOffsetsOptions-">listConsumerGroupOffsets</a> method.
For a given consumer group, it returns a dictionary <em>(topic name, partition) → current offset</em>.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="java"> <span class="keyword">return</span> adminClient
.listConsumerGroupOffsets(groupId)
.partitionsToOffsetAndMetadata().get();</code></pre>
</div>
</div>
<div class="paragraph">
<p>Obviously, this solution expects consumer offsets to be stored in Kafka’s <code>__consumer_offsets</code> topic.
It does not apply, for example, to some Kafka Connect sink implementations which store their offsets in the target data store.</p>
</div>
<div class="paragraph">
<p>The <code>listConsumerGroupOffsets</code> method is asynchronous and returns a <code>KafkaFuture</code> (a kind of promise) which implements Java&#8217;s <code>Future</code>.
My code blocks on the result, so there is room for improvement.</p>
</div>
<div class="paragraph">
<p>To get consumer group ids, there is a <code>listConsumerGroups</code> method in the same <code>AdminClient</code> class:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="java"> <span class="keyword">return</span> adminClient
.listConsumerGroups()
.valid()
.thenApply(r -> r.stream()
.map(ConsumerGroupListing::groupId)
.collect(toList())
).get();</code></pre>
</div>
</div>
<div class="paragraph">
<p>By computing the current offset derivative, we could compute the consumer message rate.</p>
</div>
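Concretely, the rate is just the offset delta divided by the elapsed time between two samples. A hypothetical helper (not part of the Kafka API) could look like:

```python
def message_rate(offset_t0, offset_t1, seconds):
    """Approximate messages/second from two offset samples taken `seconds` apart."""
    return (offset_t1 - offset_t0) / seconds

# The committed offset went from 1000 to 1600 in 60 seconds
print(message_rate(1000, 1600, 60))  # 10.0 messages/second
```

The same computation on the topic end offset gives the producer message rate mentioned below.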
<div class="paragraph">
<p>There is another method to get consumer offsets: it lives in the consumer client and is named <a href="https://kafka.apache.org/24/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#committed-java.util.Set-">committed</a>.
Contrary to the <code>listConsumerGroupOffsets</code> method, it requires knowing the consumed topic partitions beforehand,
so it is useless in our case.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="getting_topic_end_offset">Getting topic end offset</h2>
<div class="sectionbody">
<div class="paragraph">
<p>The <code>KafkaConsumer</code> class contains an
<a href="https://kafka.apache.org/20/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#endOffsets-java.util.Collection-">endOffsets</a> method
to get the end offset of a topic partition.
It returns a dictionary <em>(topic name, partition) → end offset</em>.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="java"> <span class="keyword">return</span> consumer.endOffsets(partitions);</code></pre>
</div>
</div>
<div class="paragraph">
<p>By computing the end offset derivative, we could compute the producer message rate.</p>
</div>
<div class="paragraph">
<p>By getting the topic start offset with the <a href="https://kafka.apache.org/24/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#beginningOffsets-java.util.Collection-">beginningOffsets</a> method,
we could also compute the topic size per partition.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="joining_offsets_and_computing_lag">Joining offsets and computing lag</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Both consumer offsets and topic end offsets are given per partition.
To compute the lag we have to do a join using the topic partition as key.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="java"> <span class="predefined-type">Map</span><TopicPartition, OffsetAndMetadata> consumerGroupOffsets = getConsumerGroupOffsets(groupId);
<span class="predefined-type">Map</span><TopicPartition, <span class="predefined-type">Long</span>> topicEndOffsets = getTopicEndOffsets(groupId, consumerGroupOffsets.keySet());
    <span class="predefined-type">Map</span><TopicPartition, OffsetAndLag> consumerGroupLag = consumerGroupOffsets.entrySet().stream()
        .map(entry -> mapEntry(entry.getKey(), <span class="keyword">new</span> OffsetAndLag(topicEndOffsets.get(entry.getKey()), entry.getValue().offset())))
        .collect(toMap(<span class="predefined-type">Map</span>.Entry::getKey, <span class="predefined-type">Map</span>.Entry::getValue));</code></pre>
</div>
</div>
<div class="paragraph">
<p>As consumer lag is equal to <em>topic end offset - consumer current offset</em>,
computing it is straightforward:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="java"> <span class="type">long</span> lag = endOffset - currentOffset;
<span class="keyword">if</span> (lag < <span class="integer">0</span>) {
lag = <span class="integer">0</span>;
}
<span class="keyword">return</span> lag;</code></pre>
</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="conclusion">Conclusion</h2>
<div class="sectionbody">
<div class="paragraph">
<p>We managed to get consumer lag using the Java Kafka client API and a few lines of code.</p>
</div>
<div class="paragraph">
<p>However, I regret several things about this API:</p>
</div>
<div class="olist arabic">
<ol class="arabic">
<li>
<p>The <code>endOffsets</code> method is not in the <code>AdminClient</code> class.
If it were, instantiating a consumer would be unnecessary.</p>
</li>
<li>
<p>We have to open two connections and repeat the connection settings, such as <code>bootstrap.servers</code>, twice:
once for the admin client and once for the consumer client.
It would be interesting if they could share options and maybe even the TCP connection.</p>
</li>
<li>
<p>The <code>AdminClient</code> class often returns <code>KafkaFuture<Something></code>;
its API design is very different from the <code>Consumer</code> and <code>Producer</code> clients.
I wonder why they created a <code>KafkaFuture</code> class instead of reusing <code>CompletableFuture</code>.</p>
</li>
</ol>
</div>
</div>
</div>
<h1>Home temperature monitoring</h1>
<p>Gérald Quintana, 2020-01-10, /2020/01/10/Home-temperature-monitoring</p>
<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>I had an unused Raspberry Pi 3 and some free time during holidays.
Here is what I built to monitor temperature and humidity.
It’s just a working prototype but it was both simple and fun to do.</p>
</div>
<div class="paragraph">
<p>I used Grafana and InfluxDB because both run on ARM, and thus on the Raspberry Pi,
and are very easy to set up.
Both are written in Go, so no specific runtime environment is needed.</p>
</div>
<div class="imageblock">
<div class="content">
<img src="/images/2020-01-10-Home-temperature-monitoring/grafana-dashboard.png" alt="Grafana Dashboard">
</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="hardware">Hardware</h2>
<div class="sectionbody">
<div class="paragraph">
<p>It all started with the <a href="https://www.raspberryweather.com/">Raspberry Weather</a> web site,
which brought me to the DHT22 and DS18B20 sensors.</p>
</div>
<div class="paragraph">
<p>I bought the temperature module from <a href="https://www.az-delivery.com/products/dht22-temperatursensor-modul">AZ Delivery</a>.
It’s a DHT22 (AM2302) temperature and humidity sensor soldered on a tiny board with a small resistor.
As a result, everything you need is provided (even jumper wires): no breadboard, no extra resistor…
AZ Delivery also provides documentation about the product as a PDF <a href="https://www.az-delivery.com/products/dht-22-modul-kostenfreies-e-book">e-book</a>,
which explains how to plug this module into either an Arduino or a Raspberry Pi.</p>
</div>
<div class="imageblock">
<div class="content">
<img src="/images/2020-01-10-Home-temperature-monitoring/raspberrypi-dht22.jpg" alt="Raspberry Pi 3">
</div>
</div>
<div class="paragraph">
<p>I used the Raspberry Pi 3 I had, but a smaller or older board should be enough.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="software">Software</h2>
<div class="sectionbody">
<div class="sect2">
<h3 id="raspbian">Raspbian</h3>
<div class="paragraph">
<p>First I flashed a brand new Raspbian Buster on the micro SD card.
The image can be downloaded <a href="https://www.raspberrypi.org/downloads/raspbian/">on Raspberry Pi web site</a>.
With <a href="https://www.raspberrypi.org/documentation/configuration/raspi-config.md">raspi-config</a> I configured the network (WiFi),
and enabled SSH.</p>
</div>
</div>
<div class="sect2">
<h3 id="influxdb">InfluxDB</h3>
<div class="paragraph">
<p>Then I installed InfluxDB on Raspbian by adding the InfluxData Debian repository:</p>
</div>
<div class="listingblock">
<div class="title">/etc/apt/sources.list.d/influxdb.list</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="ini">deb https://repos.influxdata.com/debian buster stable</code></pre>
</div>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="shell">curl -sL https://repos.influxdata.com/influxdb.key | sudo apt-key add -
sudo apt-get update
sudo apt-get install influxdb
sudo systemctl start influxdb</code></pre>
</div>
</div>
<div class="paragraph">
<p>To check that InfluxDB is running, you can curl it:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="shell">curl -s -v http://localhost:8086/ping
...
< HTTP/1.1 204 No Content
< Content-Type: application/json
< Request-Id: c28feb69-30bc-11ea-8628-b827eb3ca438
< X-Influxdb-Build: OSS
< X-Influxdb-Version: 1.7.9</code></pre>
</div>
</div>
</div>
<div class="sect2">
<h3 id="grafana">Grafana</h3>
<div class="paragraph">
<p>As far as Grafana is concerned, I downloaded the .deb file and installed it as described
on the <a href="https://grafana.com/grafana/download?platform=arm">Grafana web site</a>:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="shell">wget https://dl.grafana.com/oss/release/grafana-rpi_6.5.2_armhf.deb
sudo dpkg -i grafana-rpi_6.5.2_armhf.deb</code></pre>
</div>
</div>
<div class="paragraph">
<p>The Grafana service is neither enabled nor started by default.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="shell">sudo systemctl daemon-reload
sudo systemctl enable grafana-server
sudo systemctl start grafana-server</code></pre>
</div>
</div>
<div class="paragraph">
<p>To check that Grafana is running, you can open a browser on <a href="http://myraspberry.local:3000" class="bare">http://myraspberry.local:3000</a>,
and log in.
The default admin/admin user account can be changed in <code>/etc/grafana/grafana.ini</code> config file.</p>
</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="code">Code</h2>
<div class="sectionbody">
<div class="paragraph">
<p>The code reads temperature and humidity from the sensor every minute or so,
and then writes the result into the InfluxDB database.</p>
</div>
<div class="paragraph">
<p>I hesitated to write it in Go,
but I chose Python because I immediately found simple code examples.</p>
</div>
<div class="sect2">
<h3 id="python">Python</h3>
<div class="paragraph">
<p>To bootstrap my Python environment, I installed several packages:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="shell">sudo apt-get install build-essential python3-dev python3-openssl \
python3-setuptools python3-pip python3-wheel python3-yaml python3-influxdb</code></pre>
</div>
</div>
<div class="paragraph">
<p>To read from my DHT22 sensor, I used the <a href="https://github.com/adafruit/Adafruit_Python_DHT">Adafruit_Python_DHT</a> package even if it’s deprecated.
I’ll try to use the newer <a href="https://github.com/adafruit/Adafruit_CircuitPython_DHT">Adafruit_CircuitPython_DHT</a> package later.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="shell">sudo pip3 install Adafruit_DHT</code></pre>
</div>
</div>
<div class="paragraph">
<p>The code I wrote is based on provided <a href="https://github.com/adafruit/Adafruit_Python_DHT/tree/master/examples">code samples</a>.</p>
</div>
<div class="paragraph">
<p>To write into the InfluxDB database I used the official Python client.
Its documentation can be found <a href="https://influxdb-python.readthedocs.io/en/latest/">here</a>.
The code I wrote is based on provided <a href="https://github.com/influxdata/influxdb-python/tree/master/examples">code samples</a>.</p>
</div>
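For illustration, here is a minimal sketch of how a measurement point could be built for influxdb-python's <code>write_points()</code> method. The measurement name, tag, and field names are my own choices for this sketch, not necessarily those used in the actual dht22.py:

```python
from datetime import datetime, timezone

def make_point(temperature, humidity, sensor="dht22"):
    # JSON body in the format expected by influxdb-python's write_points():
    # a measurement name, optional tags, a timestamp, and numeric fields.
    return {
        "measurement": "climate",          # hypothetical measurement name
        "tags": {"sensor": sensor},
        "time": datetime.now(timezone.utc).isoformat(),
        "fields": {
            "temperature": float(temperature),
            "humidity": float(humidity),
        },
    }

point = make_point(21.5, 45.2)
# With client = InfluxDBClient(database="dht22"), the write would be:
# client.write_points([point])
```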
</div>
</div>
</div>
<div class="sect1">
<h2 id="sources">Sources</h2>
<div class="sectionbody">
<div class="paragraph">
<p>The InfluxDB database can be created with the <code>influx</code> CLI tool:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code>influx -execute 'create database dht22'</code></pre>
</div>
</div>
<div class="paragraph">
<p>The sources are <a href="https://github.com/gquintana/gquintana.github.io/tree/develop/sources/2020-01-10-Home-temperature-monitoring">here</a>;
there are three files:</p>
</div>
<div class="ulist">
<ul>
<li>
<p><code>dht22.py</code>: The Python code to read DHT22 sensor and write the result into InfluxDB</p>
</li>
<li>
<p><code>dht22.yml</code>: The config file telling where the sensor is plugged and which database to use.</p>
</li>
<li>
<p><code>dht22.service</code>: The service unit to place in <code>/etc/systemd/system</code> to automatically start <code>dht22.py</code> when Raspberry Pi boots.</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>I ran into two kinds of problems:</p>
</div>
<div class="olist arabic">
<ol class="arabic">
<li>
<p>There were holes in the graphs because the polling loop was stuck for more than a minute.
I don&#8217;t know why yet.</p>
</li>
<li>
<p>The sensor sometimes returned out-of-range values (3200% humidity, -10°C inside),
so I added measure validation to avoid weird graphs.</p>
</li>
</ol>
</div>
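Such a validation can be a simple range check. This is my own sketch, based on the DHT22's documented operating range (-40 to 80 °C, 0 to 100% RH), not the exact code from the repository:

```python
def is_valid(temperature, humidity):
    # Reject missing readings and values outside the DHT22 datasheet's
    # operating range (-40..80 degC, 0..100% relative humidity).
    return (temperature is not None and -40.0 <= temperature <= 80.0
            and humidity is not None and 0.0 <= humidity <= 100.0)

print(is_valid(21.5, 45.0))     # plausible reading
print(is_valid(-10.0, 3200.0))  # 3200% humidity is rejected
```

Readings that fail the check are simply skipped instead of being written to InfluxDB.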
</div>
</div>
<h1>Kafka connect plugin install</h1>
<p>Gérald Quintana, 2019-12-10, /2019/12/10/Kafka-connect-plugin-install</p>
<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>If you want to use a Kafka Connect plugin with stock Apache Kafka,
or you cannot use the confluent-hub tool because your server is behind
a firewall, then this blog post is for you.</p>
</div>
<div class="paragraph">
<p>I’ll show how to get Kafka Connect JDBC running without using <code>confluent-hub install</code>.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="download">Download</h2>
<div class="sectionbody">
<div class="paragraph">
<p>First, use the <a href="https://www.confluent.io/hub/">Confluent Hub</a> to find Kafka Connect plugins.
Once you&#8217;ve found the plugin you were looking for, you should check its license.
Most plugins created by Confluent Inc. use the <a href="https://www.confluent.io/confluent-community-license/">Confluent Community License</a>
and are mostly open source.</p>
</div>
<div class="paragraph">
<p>When you click on the Download button, you’ll have to provide an email to get the plugin zip file.</p>
</div>
<div class="paragraph">
<p>I’ll take the Kafka Connect JDBC plugin as an example.
Once you&#8217;ve shown your passport at the Confluent toll booth, you&#8217;ll get <code>confluentinc-kafka-connect-jdbc-5.3.1.zip</code>.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="install">Install</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Unzip the <code>confluentinc-kafka-connect-jdbc-5.3.1.zip</code> and you’ll get a <code>confluentinc-kafka-connect-jdbc-5.3.1</code> folder containing:</p>
</div>
<div class="ulist">
<ul>
<li>
<p><code>lib</code> contains the binary JAR files</p>
</li>
<li>
<p><code>etc</code> contains sample configuration files</p>
</li>
<li>
<p><code>doc</code> contains some documentation and the license file</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>Then in your Kafka folder (<code>/opt/kafka_2.12-2.3.1</code>):</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="shell">$ cd /opt/kafka_2.12-2.3.1
$ mkdir plugins <i class="conum" data-value="1"></i><b>(1)</b>
$ cd plugins
$ ln -s /opt/confluentinc-kafka-connect-jdbc-5.3.1/lib jdbc <i class="conum" data-value="2"></i><b>(2)</b></code></pre>
</div>
</div>
<div class="colist arabic">
<table>
<tr>
<td><i class="conum" data-value="1"></i><b>1</b></td>
<td>Create a <code>plugins</code> folder to contain plugins</td>
</tr>
<tr>
<td><i class="conum" data-value="2"></i><b>2</b></td>
<td>Link the <code>lib</code> folder of the plugin in the <code>plugins</code> folder</td>
</tr>
</table>
</div>
</div>
</div>
<div class="sect1">
<h2 id="configure">Configure</h2>
<div class="sectionbody">
<div class="paragraph">
<p>In the Kafka Connect configuration file <code>connect-standalone.properties</code> (or <code>connect-distributed.properties</code>),
reference the <code>plugins</code> folder:</p>
</div>
<div class="listingblock">
<div class="title">connect-standalone.properties</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="ini">bootstrap.servers=localhost:9092
plugin.path=/opt/kafka_2.12-2.3.1/plugins <i class="conum" data-value="1"></i><b>(1)</b></code></pre>
</div>
</div>
<div class="colist arabic">
<table>
<tr>
<td><i class="conum" data-value="1"></i><b>1</b></td>
<td>Path to the <code>plugins</code> folder</td>
</tr>
</table>
</div>
<div class="paragraph">
<p>Finally use the sample config files in <code>confluentinc-kafka-connect-jdbc-5.3.1/etc</code> to create your own:</p>
</div>
<div class="listingblock">
<div class="title">thing-jdbc-sink.properties</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="ini">name=thing-jdbc-sink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics=thing <i class="conum" data-value="1"></i><b>(1)</b>
connection.url=jdbc:sqlite:thing.db <i class="conum" data-value="2"></i><b>(2)</b>
auto.create=true</code></pre>
</div>
</div>
<div class="colist arabic">
<table>
<tr>
<td><i class="conum" data-value="1"></i><b>1</b></td>
<td>Input topic containing Avro values. This means you’ll need the Avro Converter plugin and the Confluent Schema Registry as well.</td>
</tr>
<tr>
<td><i class="conum" data-value="2"></i><b>2</b></td>
<td>Output database</td>
</tr>
</table>
</div>
</div>
</div>
<div class="sect1">
<h2 id="run">Run</h2>
<div class="sectionbody">
<div class="paragraph">
<p>To run Kafka Connect in standalone mode, launch it with the above config files:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="shell">$ bin/connect-standalone.sh config/connect-standalone.properties config/thing-jdbc-sink.properties</code></pre>
</div>
</div>
<div class="paragraph">
<p>To check that the connector is properly running, you can query the Kafka Connect REST API with curl:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="shell">$ curl -s http://127.0.1.1:8083/connectors/thing-jdbc-sink/status | jq '.'
{
  "name": "thing-jdbc-sink",
  "connector": {
    "state": "RUNNING",
    "worker_id": "127.0.1.1:8083"
  },
  "tasks": [
    {
      "id": 0,
      "state": "RUNNING",
      "worker_id": "127.0.1.1:8083"
    }
  ],
  "type": "sink"
}</code></pre>
</div>
</div>
</div>
</div>Gérald QuintanaKafka integration tests2019-07-03T00:00:00+00:002019-07-03T00:00:00+00:00/2019/07/03/Kafka-integration-tests<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>You’re developing a Java application plugged into Kafka,
or maybe you’re programming a data processing pipeline based on Kafka Streams. How do you automate tests involving both Java code and Kafka brokers?</p>
</div>
<div class="paragraph">
<p>Such an integration test should be able to:</p>
</div>
<div class="olist arabic">
<ol class="arabic">
<li>
<p>Start Zookeeper and then Kafka</p>
</li>
<li>
<p>Send messages into Kafka so as to trigger business code</p>
</li>
<li>
<p>Consume messages from Kafka and check their content</p>
</li>
<li>
<p>Stop Kafka and then Zookeeper</p>
</li>
</ol>
</div>
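<div class="paragraph">
<p>All the test variants below call a <code>sendAndConsume</code> helper whose code is not shown in this post. A minimal sketch with the plain kafka-clients API could look like this (serializers, group id and timeout are my choices, not taken from the original tests):</p>
</div>

```java
import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class MessageServiceITSupport {
    static void sendAndConsume(String bootstrapServers, String topic) throws Exception {
        // Produce a message and wait for the broker acknowledgement
        Map<String, Object> producerProps = new HashMap<>();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        try (Producer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>(topic, "hello")).get();
        }
        // Consume it back and check something arrived
        Map<String, Object> consumerProps = new HashMap<>();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "it_group");
        consumerProps.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        try (Consumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(Collections.singleton(topic));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(10));
            if (records.isEmpty()) {
                throw new AssertionError("No message received on topic " + topic);
            }
        }
    }
}
```

<div class="paragraph">
<p>This sketch needs a running broker, so it cannot run on its own; each solution below provides the <code>bootstrapServers</code> address.</p>
</div>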
</div>
</div>
<div class="sect1">
<h2 id="kafka_embedded_in_the_test">Kafka embedded in the test</h2>
<div class="sectionbody">
<div class="paragraph">
<p>As both Kafka and Zookeeper are Java applications, it is possible to control them from Java code (have a look at <a href="https://github.com/apache/camel/blob/master/components/camel-kafka/src/test/java/org/apache/camel/component/kafka/embedded/EmbeddedKafkaBroker.java">camel-kafka</a> or <a href="https://github.com/danielwegener/logback-kafka-appender/blob/master/src/test/java/com/github/danielwegener/logback/kafka/util/EmbeddedKafkaCluster.java">logback-kafka-appender</a>), but it is not easy.</p>
</div>
<div class="paragraph">
<p>There are many libraries to run an embedded Kafka from JUnit without sweating:</p>
</div>
<div class="ulist">
<ul>
<li>
<p><a href="https://github.com/charithe/kafka-junit">Kafka JUnit</a> by Charith Ellawala</p>
</li>
<li>
<p>Another <a href="https://mguenther.github.io/kafka-junit/">Kafka JUnit</a> by Markus Günther</p>
</li>
<li>
<p><a href="https://docs.spring.io/spring-kafka/docs/2.2.6.RELEASE/reference/html/#testing">Spring Kafka Test</a> by the Spring team</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>The drawback of this solution is that the Kafka and Zookeeper servers are started in the same JVM as your test,
so they share its memory and classpath, and unexpected interactions can occur.</p>
</div>
<div class="sect2">
<h3 id="kafka_junit">Kafka JUnit</h3>
<div class="paragraph">
<p>Charith’s Kafka JUnit library is one of the simplest and most efficient.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="xml"> <span class="tag"><dependency></span>
<span class="tag"><groupId></span>com.github.charithe<span class="tag"></groupId></span>
<span class="tag"><artifactId></span>kafka-junit<span class="tag"></artifactId></span>
<span class="tag"><version></span>4.1.5<span class="tag"></version></span>
<span class="tag"></dependency></span></code></pre>
</div>
</div>
<div class="paragraph">
<p>This library supports both JUnit 4 & 5.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="java"><span class="annotation">@ExtendWith</span>(KafkaJunitExtension.class) <i class="conum" data-value="1"></i><b>(1)</b>
<span class="annotation">@KafkaJunitExtensionConfig</span>(startupMode = StartupMode.WAIT_FOR_STARTUP)
<span class="directive">public</span> <span class="type">class</span> <span class="class">CharitheMessageServiceIT</span> {
<span class="directive">private</span> <span class="directive">static</span> <span class="directive">final</span> <span class="predefined-type">String</span> TOPIC = <span class="string"><span class="delimiter">"</span><span class="content">kafka_junit</span><span class="delimiter">"</span></span>;
<span class="annotation">@Test</span>
<span class="type">void</span> testSendAndConsume(KafkaHelper kafkaHelper) <span class="directive">throws</span> <span class="exception">Exception</span> { <i class="conum" data-value="2"></i><b>(2)</b>
<span class="predefined-type">String</span> bootstrapServers = kafkaHelper.producerConfig().get(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG).toString();
sendAndConsume(bootstrapServers, TOPIC);
}</code></pre>
</div>
</div>
<div class="colist arabic">
<table>
<tr>
<td><i class="conum" data-value="1"></i><b>1</b></td>
<td>Load JUnit 5 extension that will start an embedded Kafka</td>
</tr>
<tr>
<td><i class="conum" data-value="2"></i><b>2</b></td>
<td>A <code>kafkaHelper</code> is injected to get embedded Kafka address</td>
</tr>
</table>
</div>
<div class="paragraph">
<p>This <code>KafkaHelper</code> contains several methods to easily produce and consume messages:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="java"> ListenableFuture<<span class="predefined-type">List</span><<span class="predefined-type">String</span>>> futureMessages = kafkaHelper.consumeStrings(TOPIC, <span class="integer">3</span>); <i class="conum" data-value="1"></i><b>(1)</b>
kafkaHelper.produceStrings(TOPIC, <span class="string"><span class="delimiter">"</span><span class="content">one</span><span class="delimiter">"</span></span>, <span class="string"><span class="delimiter">"</span><span class="content">two</span><span class="delimiter">"</span></span>, <span class="string"><span class="delimiter">"</span><span class="content">three</span><span class="delimiter">"</span></span>); <i class="conum" data-value="2"></i><b>(2)</b>
<span class="predefined-type">List</span><<span class="predefined-type">String</span>> messages = futureMessages.get(<span class="integer">5</span>, <span class="predefined-type">TimeUnit</span>.SECONDS);
assertThat(messages).contains(<span class="string"><span class="delimiter">"</span><span class="content">one</span><span class="delimiter">"</span></span>, <span class="string"><span class="delimiter">"</span><span class="content">two</span><span class="delimiter">"</span></span>, <span class="string"><span class="delimiter">"</span><span class="content">three</span><span class="delimiter">"</span></span>);</code></pre>
</div>
</div>
<div class="colist arabic">
<table>
<tr>
<td><i class="conum" data-value="1"></i><b>1</b></td>
<td>Start a non-blocking consumer</td>
</tr>
<tr>
<td><i class="conum" data-value="2"></i><b>2</b></td>
<td>Produce some messages in a Topic</td>
</tr>
</table>
</div>
</div>
<div class="sect2">
<h3 id="spring_kafka_test">Spring Kafka Test</h3>
<div class="paragraph">
<p>Spring Kafka Test is an addition to Spring Kafka library.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="xml"> <span class="tag"><dependency></span>
<span class="tag"><groupId></span>org.springframework.kafka<span class="tag"></groupId></span>
<span class="tag"><artifactId></span>spring-kafka-test<span class="tag"></artifactId></span>
<span class="tag"><version></span>${spring-kafka.version}<span class="tag"></version></span>
<span class="tag"><scope></span>test<span class="tag"></scope></span>
<span class="tag"></dependency></span></code></pre>
</div>
</div>
<div class="paragraph">
<p>This library only supports JUnit 4 at the moment;
as a result, it provides a JUnit rule to handle the embedded Kafka lifecycle.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="java"><span class="directive">public</span> <span class="type">class</span> <span class="class">SpringMessageServiceIT</span> {
<span class="directive">private</span> <span class="directive">static</span> <span class="directive">final</span> <span class="predefined-type">String</span> TOPIC = <span class="string"><span class="delimiter">"</span><span class="content">spring</span><span class="delimiter">"</span></span>;
<span class="annotation">@ClassRule</span> <i class="conum" data-value="1"></i><b>(1)</b>
<span class="directive">public</span> <span class="directive">static</span> EmbeddedKafkaRule kafka = <span class="keyword">new</span> EmbeddedKafkaRule(<span class="integer">1</span>,
<span class="predefined-constant">false</span>, TOPIC);
<span class="annotation">@Test</span>
<span class="directive">public</span> <span class="type">void</span> testSendAndConsume() <span class="directive">throws</span> <span class="exception">Exception</span> {
sendAndConsume(kafka.getEmbeddedKafka().getBrokersAsString(), TOPIC); <i class="conum" data-value="2"></i><b>(2)</b>
}</code></pre>
</div>
</div>
<div class="colist arabic">
<table>
<tr>
<td><i class="conum" data-value="1"></i><b>1</b></td>
<td>JUnit 4 Rule that will start an embedded Kafka and create a topic.</td>
</tr>
<tr>
<td><i class="conum" data-value="2"></i><b>2</b></td>
<td>The <code>kafka</code> rule is used to get the embedded Kafka address.</td>
</tr>
</table>
</div>
<div class="paragraph">
<p>Spring Kafka Test contains a <code>KafkaTestUtils</code> class which is a swiss army knife to write Kafka related tests.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="java"> <span class="keyword">try</span>(Consumer<<span class="predefined-type">Integer</span>, <span class="predefined-type">String</span>> consumer = <span class="keyword">new</span> KafkaConsumer<<span class="predefined-type">Integer</span>, <span class="predefined-type">String</span>>( <i class="conum" data-value="1"></i><b>(1)</b>
KafkaTestUtils.consumerProps(<span class="string"><span class="delimiter">"</span><span class="content">spring_group</span><span class="delimiter">"</span></span>, <span class="string"><span class="delimiter">"</span><span class="content">true</span><span class="delimiter">"</span></span>, kafka.getEmbeddedKafka()))) {
KafkaTemplate<<span class="predefined-type">Integer</span>, <span class="predefined-type">String</span>> template = <span class="keyword">new</span> KafkaTemplate<>( <i class="conum" data-value="2"></i><b>(2)</b>
<span class="keyword">new</span> DefaultKafkaProducerFactory<>(
KafkaTestUtils.producerProps(kafka.getEmbeddedKafka())));
consumer.subscribe(<span class="predefined-type">Collections</span>.singleton(TOPIC));
template.send(TOPIC, <span class="string"><span class="delimiter">"</span><span class="content">one</span><span class="delimiter">"</span></span>);
template.send(TOPIC, <span class="string"><span class="delimiter">"</span><span class="content">two</span><span class="delimiter">"</span></span>);
ConsumerRecords<<span class="predefined-type">Integer</span>, <span class="predefined-type">String</span>> records = KafkaTestUtils.getRecords(consumer); <i class="conum" data-value="3"></i><b>(3)</b>
assertThat(records).are(value(<span class="string"><span class="delimiter">"</span><span class="content">one</span><span class="delimiter">"</span></span>)); <i class="conum" data-value="4"></i><b>(4)</b></code></pre>
</div>
</div>
<div class="colist arabic">
<table>
<tr>
<td><i class="conum" data-value="1"></i><b>1</b></td>
<td>Use <code>KafkaTestUtils</code> to create a consumer.</td>
</tr>
<tr>
<td><i class="conum" data-value="2"></i><b>2</b></td>
<td>Use <code>KafkaTestUtils</code> along with the usual <code>KafkaTemplate</code> to quickly send messages.</td>
</tr>
<tr>
<td><i class="conum" data-value="3"></i><b>3</b></td>
<td>Use <code>KafkaTestUtils</code> to quickly consume messages.</td>
</tr>
<tr>
<td><i class="conum" data-value="4"></i><b>4</b></td>
<td><code>KafkaConditions</code> integrates with AssertJ to make assertions on received messages simpler.</td>
</tr>
</table>
</div>
<div class="paragraph">
<p>Spring Kafka Test is probably the way to go when you’re developing a Spring application.
However, this library lacks some syntactic sugar to make tests more readable.</p>
</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="kafka_in_docker">Kafka in docker</h2>
<div class="sectionbody">
<div class="paragraph">
<p><a href="https://www.testcontainers.org/">Testcontainers</a>’ purpose is to start Docker containers from JUnit in order to run integration tests against any product: MySQL, Elasticsearch, Kafka…​ There is a base module, a Kafka extension and a JUnit 5 extension.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="xml"> <span class="tag"><dependency></span>
<span class="tag"><groupId></span>org.testcontainers<span class="tag"></groupId></span>
<span class="tag"><artifactId></span>testcontainers<span class="tag"></artifactId></span>
<span class="tag"><version></span>${testcontainers.version}<span class="tag"></version></span>
<span class="tag"><scope></span>test<span class="tag"></scope></span>
<span class="tag"></dependency></span>
<span class="tag"><dependency></span>
<span class="tag"><groupId></span>org.testcontainers<span class="tag"></groupId></span>
<span class="tag"><artifactId></span>kafka<span class="tag"></artifactId></span>
<span class="tag"><version></span>${testcontainers.version}<span class="tag"></version></span>
<span class="tag"><scope></span>test<span class="tag"></scope></span>
<span class="tag"></dependency></span>
<span class="tag"><dependency></span>
<span class="tag"><groupId></span>org.testcontainers<span class="tag"></groupId></span>
<span class="tag"><artifactId></span>junit-jupiter<span class="tag"></artifactId></span>
<span class="tag"><version></span>${testcontainers.version}<span class="tag"></version></span>
<span class="tag"><scope></span>test<span class="tag"></scope></span>
<span class="tag"></dependency></span></code></pre>
</div>
</div>
<div class="paragraph">
<p>The Testcontainers library is strongly integrated with JUnit 5: a single annotation and you’re done. A JUnit 4 rule is also available.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="java"><span class="annotation">@Testcontainers</span> <i class="conum" data-value="1"></i><b>(1)</b>
<span class="directive">public</span> <span class="type">class</span> <span class="class">ContainersMessageServiceIT</span> {
<span class="directive">private</span> <span class="directive">static</span> <span class="directive">final</span> <span class="predefined-type">String</span> TOPIC = <span class="string"><span class="delimiter">"</span><span class="content">containers</span><span class="delimiter">"</span></span>;
<span class="annotation">@Container</span> <i class="conum" data-value="2"></i><b>(2)</b>
<span class="directive">public</span> KafkaContainer kafka = <span class="keyword">new</span> KafkaContainer(<span class="string"><span class="delimiter">"</span><span class="content">5.2.1</span><span class="delimiter">"</span></span>);
<span class="annotation">@Test</span>
<span class="directive">public</span> <span class="type">void</span> testSendAndConsume() <span class="directive">throws</span> <span class="exception">Exception</span> {
sendAndConsume(kafka.getBootstrapServers(), TOPIC);
}</code></pre>
</div>
</div>
<div class="colist arabic">
<table>
<tr>
<td><i class="conum" data-value="1"></i><b>1</b></td>
<td>Trigger Testcontainers start</td>
</tr>
<tr>
<td><i class="conum" data-value="2"></i><b>2</b></td>
<td>Create a Kafka container.
By default the <a href="https://hub.docker.com/r/confluentinc/cp-kafka">cp-kafka Docker image</a> created by Confluent is used.
As a consequence the version number matches the Confluent Platform version, not Apache Kafka.</td>
</tr>
</table>
</div>
<div class="paragraph">
<p>As Testcontainers is a generic library to run containers,
there is no helper class to read/write messages.
Starting a Docker container is slower than starting an embedded Kafka,
but process isolation is stronger.
You are starting the real thing, not a hacked Kafka broker, so you are closer to production.
Note that Testcontainers Kafka actually uses 3 Docker containers.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="kafka_consumer_subscriptions">Kafka Consumer subscriptions</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Dealing with asynchronous code in tests is often painful, and Kafka consumers don’t help.</p>
</div>
<div class="paragraph">
<p>It can take a while for the consumer group coordinator to be located and partitions to be assigned:
between the consumer bootstrap and the first message being received,
a second or more can elapse.</p>
</div>
<div class="paragraph">
<p>Using a <code>ConsumerRebalanceListener</code> to wait for partitions to be assigned and check which ones are assigned can be useful.</p>
</div>
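<div class="paragraph">
<p>For instance, a listener raising a flag once partitions are assigned (a sketch with the kafka-clients API; class and method names are mine):</p>
</div>

```java
import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.common.TopicPartition;

public class AssignmentWaiter implements ConsumerRebalanceListener {
    private volatile boolean assigned = false;

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Nothing to do before the rebalance
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        assigned = true; // The consumer is now ready to receive messages
    }

    public void subscribeAndWait(Consumer<?, ?> consumer, String topic) {
        consumer.subscribe(Collections.singleton(topic), this);
        // Rebalance callbacks are invoked from within poll()
        while (!assigned) {
            consumer.poll(Duration.ofMillis(100));
        }
    }
}
```

<div class="paragraph">
<p>Once <code>subscribeAndWait</code> returns, messages produced by the test will actually be received.</p>
</div>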
<div class="paragraph">
<p>The <a href="http://www.awaitility.org/">Awaitility</a> library can alleviate the burden of asynchronous testing.</p>
</div>
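<div class="paragraph">
<p>Awaitility basically polls a condition until it becomes true or a timeout expires. If you don’t want the extra dependency, a minimal equivalent is easy to write (a sketch, not the actual Awaitility API):</p>
</div>

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

public class Awaiter {
    /** Poll the condition every 100 ms until it is true, or fail after timeoutMillis. */
    public static void awaitUntil(BooleanSupplier condition, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() > deadline) {
                throw new AssertionError("Condition not met within " + timeoutMillis + " ms");
            }
            try {
                TimeUnit.MILLISECONDS.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new AssertionError("Interrupted while waiting", e);
            }
        }
    }
}
```

<div class="paragraph">
<p>In a Kafka test, the condition is typically the number of received messages, e.g. <code>awaitUntil(() -> messages.size() >= 3, 5000)</code>.</p>
</div>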
</div>
</div>Gérald QuintanaLogging configuration2019-05-17T00:00:00+00:002019-05-17T00:00:00+00:00/2019/05/17/Logging-configuration<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>There is nothing fundamentally new in this article.
It’s just a quick reminder (to myself) about Java logging framework configuration.</p>
</div>
<div class="paragraph">
<p>The configuration files will contain:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>Console aka stdout output</p>
</li>
<li>
<p>Rolling file output with both date and size rolling policies.
Rotating the file every day makes it easy to find yesterday’s failure cause.
Rotating when the file reaches a given size protects against disk flooding.</p>
</li>
<li>
<p>Sample log patterns to help format log files.</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>Logging libraries support multiple configuration file formats: XML, Properties, YAML…​ I chose XML; it’s a matter of taste.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="logback">Logback</h2>
<div class="sectionbody">
<div class="listingblock">
<div class="title">pom.xml</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="xml"> <span class="tag"><dependency></span>
<span class="tag"><groupId></span>ch.qos.logback<span class="tag"></groupId></span>
<span class="tag"><artifactId></span>logback-classic<span class="tag"></artifactId></span>
<span class="tag"><scope></span>runtime<span class="tag"></scope></span>
<span class="tag"></dependency></span></code></pre>
</div>
</div>
<div class="paragraph">
<p>Logback implements SLF4J from the ground up.</p>
</div>
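<div class="paragraph">
<p>As a reminder, application code only depends on the SLF4J API; the configuration below formats what calls like these produce (a minimal example, class and method names are mine):</p>
</div>

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class BlogService {
    private static final Logger LOGGER = LoggerFactory.getLogger(BlogService.class);

    public void doWork(String name) {
        // Parameterized message: the string is only built if DEBUG is enabled
        LOGGER.debug("Doing work for {}", name);
    }
}
```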
<div class="listingblock">
<div class="title">logback.xml</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="xml"><span class="preprocessor"><?xml version="1.0" encoding="UTF-8"?></span>
<span class="tag"><configuration</span> <span class="attribute-name">debug</span>=<span class="string"><span class="delimiter">"</span><span class="content">true</span><span class="delimiter">"</span></span><span class="tag">></span><i class="conum" data-value="1"></i><b>(1)</b>
<span class="comment"><!-- Properties --></span>
<span class="tag"><property</span> <span class="attribute-name">name</span>=<span class="string"><span class="delimiter">"</span><span class="content">log.dir</span><span class="delimiter">"</span></span> <span class="attribute-name">value</span>=<span class="string"><span class="delimiter">"</span><span class="content">target/log</span><span class="delimiter">"</span></span> <span class="tag">/></span><i class="conum" data-value="2"></i><b>(2)</b>
<span class="comment"><!-- Appenders --></span>
<span class="tag"><appender</span> <span class="attribute-name">name</span>=<span class="string"><span class="delimiter">"</span><span class="content">CONSOLE</span><span class="delimiter">"</span></span> <span class="attribute-name">class</span>=<span class="string"><span class="delimiter">"</span><span class="content">ch.qos.logback.core.ConsoleAppender</span><span class="delimiter">"</span></span><span class="tag">></span>
<span class="tag"><encoder></span><i class="conum" data-value="3"></i><b>(3)</b>
<span class="tag"><pattern></span>%date{HH:mm:ss.SSS} %-5level [%thread] %logger{1} - %msg%n<span class="tag"></pattern></span>
<span class="tag"></encoder></span>
<span class="tag"></appender></span>
<span class="tag"><appender</span> <span class="attribute-name">name</span>=<span class="string"><span class="delimiter">"</span><span class="content">FILE</span><span class="delimiter">"</span></span> <span class="attribute-name">class</span>=<span class="string"><span class="delimiter">"</span><span class="content">ch.qos.logback.core.rolling.RollingFileAppender</span><span class="delimiter">"</span></span><span class="tag">></span>
<span class="tag"><file></span>blog.log<span class="tag"></file></span>
<span class="tag"><rollingPolicy</span> <span class="attribute-name">class</span>=<span class="string"><span class="delimiter">"</span><span class="content">ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy</span><span class="delimiter">"</span></span><span class="tag">></span>
<span class="tag"><fileNamePattern></span>${log.dir}/blog.%d{yyyy-MM-dd}-%i.log<span class="tag"></fileNamePattern></span>
<span class="tag"><maxFileSize></span>10MB<span class="tag"></maxFileSize></span>
<span class="tag"><maxHistory></span>10<span class="tag"></maxHistory></span>
<span class="tag"><totalSizeCap></span>100MB<span class="tag"></totalSizeCap></span>
<span class="tag"></rollingPolicy></span>
<span class="tag"><encoder></span><i class="conum" data-value="4"></i><b>(4)</b>
<span class="tag"><pattern></span>%date{ISO8601} %-5level [%thread] %logger - %msg%n<span class="tag"></pattern></span>
<span class="tag"></encoder></span>
<span class="tag"></appender></span>
<span class="comment"><!-- Loggers --></span>
<span class="tag"><logger</span> <span class="attribute-name">name</span>=<span class="string"><span class="delimiter">"</span><span class="content">com.github.gquintana.logging</span><span class="delimiter">"</span></span> <span class="attribute-name">level</span>=<span class="string"><span class="delimiter">"</span><span class="content">DEBUG</span><span class="delimiter">"</span></span><span class="tag">/></span>
<span class="tag"><root</span> <span class="attribute-name">level</span>=<span class="string"><span class="delimiter">"</span><span class="content">DEBUG</span><span class="delimiter">"</span></span><span class="tag">></span>
<span class="tag"><appender-ref</span> <span class="attribute-name">ref</span>=<span class="string"><span class="delimiter">"</span><span class="content">CONSOLE</span><span class="delimiter">"</span></span><span class="tag">/></span>
<span class="tag"><appender-ref</span> <span class="attribute-name">ref</span>=<span class="string"><span class="delimiter">"</span><span class="content">FILE</span><span class="delimiter">"</span></span><span class="tag">/></span>
<span class="tag"></root></span>
<span class="tag"></configuration></span></code></pre>
</div>
</div>
<div class="colist arabic">
<table>
<tr>
<td><i class="conum" data-value="1"></i><b>1</b></td>
<td>The <code>debug</code> flag enables Logback startup logs.</td>
</tr>
<tr>
<td><i class="conum" data-value="2"></i><b>2</b></td>
<td>The <code>log.dir</code> property can be overridden at the JVM (<code>-Dlog.dir=…​</code>) or OS level.</td>
</tr>
<tr>
<td><i class="conum" data-value="3"></i><b>3</b></td>
<td>The <code>pattern</code> is documented in the <a href="https://logback.qos.ch/manual/layouts.html#ClassicPatternLayout">layout</a> section.</td>
</tr>
<tr>
<td><i class="conum" data-value="4"></i><b>4</b></td>
<td>Format the <code>date</code> in ISO8601.</td>
</tr>
</table>
</div>
<div class="paragraph">
<p>You can force Logback to use a specific configuration file using a JVM property <code>-Dlogback.configurationFile=/path/to/config.xml</code>.</p>
</div>
<div class="paragraph">
<p>The <a href="https://logback.qos.ch/manual/index.html">Logback Manual</a> contains detailed information.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="log4j2">Log4J2</h2>
<div class="sectionbody">
<div class="listingblock">
<div class="title">pom.xml</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="xml"> <span class="tag"><dependency></span>
<span class="tag"><groupId></span>org.apache.logging.log4j<span class="tag"></groupId></span>
<span class="tag"><artifactId></span>log4j-slf4j-impl<span class="tag"></artifactId></span>
<span class="tag"><scope></span>runtime<span class="tag"></scope></span>
<span class="tag"></dependency></span>
<span class="tag"><dependency></span>
<span class="tag"><groupId></span>org.apache.logging.log4j<span class="tag"></groupId></span>
<span class="tag"><artifactId></span>log4j-core<span class="tag"></artifactId></span>
<span class="tag"><scope></span>runtime<span class="tag"></scope></span>
<span class="tag"></dependency></span></code></pre>
</div>
</div>
<div class="paragraph">
<p>The Log4J2-SLF4J adapter is in the Log4J2 group.</p>
</div>
<div class="listingblock">
<div class="title">log4j2.xml</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="xml"><span class="preprocessor"><?xml version="1.0" encoding="UTF-8"?></span>
<span class="tag"><Configuration</span> <span class="attribute-name">status</span>=<span class="string"><span class="delimiter">"</span><span class="content">trace</span><span class="delimiter">"</span></span><span class="tag">></span><i class="conum" data-value="1"></i><b>(1)</b>
<span class="comment"><!-- Properties --></span>
<span class="tag"><Properties></span><i class="conum" data-value="2"></i><b>(2)</b>
<span class="tag"><Property</span> <span class="attribute-name">name</span>=<span class="string"><span class="delimiter">"</span><span class="content">logDir</span><span class="delimiter">"</span></span><span class="tag">></span>${sys:log.dir:-target/log}<span class="tag"></Property></span>
<span class="tag"></Properties></span>
<span class="comment"><!-- Appenders --></span>
<span class="tag"><Appenders></span>
<span class="tag"><Console</span> <span class="attribute-name">name</span>=<span class="string"><span class="delimiter">"</span><span class="content">CONSOLE</span><span class="delimiter">"</span></span><span class="tag">></span><i class="conum" data-value="3"></i><b>(3)</b>
<span class="tag"><PatternLayout</span> <span class="attribute-name">pattern</span>=<span class="string"><span class="delimiter">"</span><span class="content">%date{HH:mm:ss.SSS} %-5level [%thread] %logger{1} - %msg%n</span><span class="delimiter">"</span></span><span class="tag">/></span>
<span class="tag"></Console></span>
<span class="tag"><RollingFile</span> <span class="attribute-name">name</span>=<span class="string"><span class="delimiter">"</span><span class="content">FILE</span><span class="delimiter">"</span></span>
<span class="attribute-name">fileName</span>=<span class="string"><span class="delimiter">"</span><span class="content">${logDir}/blog.log</span><span class="delimiter">"</span></span>
<span class="attribute-name">filePattern</span>=<span class="string"><span class="delimiter">"</span><span class="content">${logDir}/blog.%d{yyyy-MM-dd}-%i.log.gz</span><span class="delimiter">"</span></span><span class="tag">></span>
<span class="tag"><PatternLayout></span><i class="conum" data-value="4"></i><b>(4)</b>
<span class="tag"><Pattern></span>%d{ISO8601} %-5level [%thread] %logger %m%n<span class="tag"></Pattern></span>
<span class="tag"></PatternLayout></span>
<span class="tag"><Policies></span>
<span class="tag"><TimeBasedTriggeringPolicy</span><span class="tag">/></span>
<span class="tag"><SizeBasedTriggeringPolicy</span> <span class="attribute-name">size</span>=<span class="string"><span class="delimiter">"</span><span class="content">1m</span><span class="delimiter">"</span></span> <span class="tag">/></span>
<span class="tag"></Policies></span>
<span class="tag"><Strategies></span>
<span class="tag"><DefaultRolloverStrategy</span> <span class="attribute-name">max</span>=<span class="string"><span class="delimiter">"</span><span class="content">10</span><span class="delimiter">"</span></span><span class="tag">/></span>
<span class="tag"></Strategies></span>
<span class="tag"></RollingFile></span>
<span class="tag"></Appenders></span>
<span class="comment"><!-- Loggers --></span>
<span class="tag"><Loggers></span>
<span class="tag"><Logger</span> <span class="attribute-name">name</span>=<span class="string"><span class="delimiter">"</span><span class="content">com.github.gquintana.logging</span><span class="delimiter">"</span></span> <span class="attribute-name">level</span>=<span class="string"><span class="delimiter">"</span><span class="content">debug</span><span class="delimiter">"</span></span><span class="tag">/></span>
<span class="tag"><Root</span> <span class="attribute-name">level</span>=<span class="string"><span class="delimiter">"</span><span class="content">info</span><span class="delimiter">"</span></span><span class="tag">></span>
<span class="tag"><AppenderRef</span> <span class="attribute-name">ref</span>=<span class="string"><span class="delimiter">"</span><span class="content">CONSOLE</span><span class="delimiter">"</span></span><span class="tag">/></span>
<span class="tag"><AppenderRef</span> <span class="attribute-name">ref</span>=<span class="string"><span class="delimiter">"</span><span class="content">FILE</span><span class="delimiter">"</span></span><span class="tag">/></span>
<span class="tag"></Root></span>
<span class="tag"></Loggers></span>
<span class="tag"></Configuration></span></code></pre>
</div>
</div>
<div class="colist arabic">
<table>
<tr>
<td><i class="conum" data-value="1"></i><b>1</b></td>
<td>Setting <code>status</code> to <code>trace</code> or <code>debug</code> shows Log4J2 internal logs.</td>
</tr>
<tr>
<td><i class="conum" data-value="2"></i><b>2</b></td>
<td>The <code>logDir</code> property is set from a JVM property (<code>-Dlog.dir=…​</code>) with a default value.
See <a href="https://logging.apache.org/log4j/2.x/manual/configuration.html#Property_Substitution">Property substitution</a> in documentation.</td>
</tr>
<tr>
<td><i class="conum" data-value="3"></i><b>3</b></td>
<td>The <code>PatternLayout</code> is documented in the <a href="https://logging.apache.org/log4j/2.x/manual/layouts.html#PatternLayout">layout</a> section.</td>
</tr>
<tr>
<td><i class="conum" data-value="4"></i><b>4</b></td>
<td>Format the <code>date</code> in ISO8601.</td>
</tr>
</table>
</div>
<div class="paragraph">
<p>You can tell Log4J2 to load a specific configuration file using a JVM property <code>-Dlog4j.configurationFile=/path/to/config.xml</code>.</p>
</div>
<div class="paragraph">
<p>The <a href="http://logging.apache.org/log4j/2.x/manual/index.html">Log4J2 Manual</a> contains extensive documentation.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="log4j1">Log4J1</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Even though it’s deprecated, let’s end with the venerable Log4J v1 library.</p>
</div>
<div class="listingblock">
<div class="title">pom.xml</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="xml"> <span class="tag"><dependency></span>
<span class="tag"><groupId></span>log4j<span class="tag"></groupId></span>
<span class="tag"><artifactId></span>log4j<span class="tag"></artifactId></span>
<span class="tag"><scope></span>runtime<span class="tag"></scope></span>
<span class="tag"></dependency></span>
<span class="tag"><dependency></span>
<span class="tag"><groupId></span>org.slf4j<span class="tag"></groupId></span>
<span class="tag"><artifactId></span>slf4j-log4j12<span class="tag"></artifactId></span>
<span class="tag"><scope></span>runtime<span class="tag"></scope></span>
<span class="tag"></dependency></span></code></pre>
</div>
</div>
<div class="paragraph">
<p>The Log4J1-SLF4J adapter is in the SLF4J group.</p>
</div>
<div class="listingblock">
<div class="title">log4j.xml</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="xml"><span class="preprocessor"><?xml version="1.0" encoding="UTF-8" ?></span>
<span class="doctype"><!DOCTYPE log4j:configuration SYSTEM "log4j.dtd"></span>
<span class="tag"><log4j:configuration</span> <span class="attribute-name">xmlns:log4j</span>=<span class="string"><span class="delimiter">"</span><span class="content">http://jakarta.apache.org/log4j/</span><span class="delimiter">"</span></span> <span class="attribute-name">debug</span>=<span class="string"><span class="delimiter">"</span><span class="content">true</span><span class="delimiter">"</span></span><span class="tag">></span><i class="conum" data-value="1"></i><b>(1)</b>
<i class="conum" data-value="2"></i><b>(2)</b>
<span class="comment"><!-- Appenders --></span>
<span class="tag"><appender</span> <span class="attribute-name">name</span>=<span class="string"><span class="delimiter">"</span><span class="content">CONSOLE</span><span class="delimiter">"</span></span> <span class="attribute-name">class</span>=<span class="string"><span class="delimiter">"</span><span class="content">org.apache.log4j.ConsoleAppender</span><span class="delimiter">"</span></span><span class="tag">></span>
<span class="tag"><layout</span> <span class="attribute-name">class</span>=<span class="string"><span class="delimiter">"</span><span class="content">org.apache.log4j.PatternLayout</span><span class="delimiter">"</span></span><span class="tag">></span><i class="conum" data-value="3"></i><b>(3)</b>
<span class="tag"><param</span> <span class="attribute-name">name</span>=<span class="string"><span class="delimiter">"</span><span class="content">ConversionPattern</span><span class="delimiter">"</span></span> <span class="attribute-name">value</span>=<span class="string"><span class="delimiter">"</span><span class="content">%d{HH:mm:ss.SSS} %-5p %c{1} - %m%n</span><span class="delimiter">"</span></span><span class="tag">/></span>
<span class="tag"></layout></span>
<span class="tag"></appender></span>
<span class="tag"><appender</span> <span class="attribute-name">name</span>=<span class="string"><span class="delimiter">"</span><span class="content">FILE</span><span class="delimiter">"</span></span> <span class="attribute-name">class</span>=<span class="string"><span class="delimiter">"</span><span class="content">org.apache.log4j.RollingFileAppender</span><span class="delimiter">"</span></span><span class="tag">></span><i class="conum" data-value="4"></i><b>(4)</b>
<span class="tag"><param</span> <span class="attribute-name">name</span>=<span class="string"><span class="delimiter">"</span><span class="content">File</span><span class="delimiter">"</span></span> <span class="attribute-name">value</span>=<span class="string"><span class="delimiter">"</span><span class="content">${log.dir}/blog.log</span><span class="delimiter">"</span></span><span class="tag">/></span>
<span class="tag"><param</span> <span class="attribute-name">name</span>=<span class="string"><span class="delimiter">"</span><span class="content">MaxFileSize</span><span class="delimiter">"</span></span> <span class="attribute-name">value</span>=<span class="string"><span class="delimiter">"</span><span class="content">10MB</span><span class="delimiter">"</span></span><span class="tag">/></span>
<span class="tag"><param</span> <span class="attribute-name">name</span>=<span class="string"><span class="delimiter">"</span><span class="content">MaxBackupIndex</span><span class="delimiter">"</span></span> <span class="attribute-name">value</span>=<span class="string"><span class="delimiter">"</span><span class="content">10</span><span class="delimiter">"</span></span><span class="tag">/></span>
<span class="tag"><layout</span> <span class="attribute-name">class</span>=<span class="string"><span class="delimiter">"</span><span class="content">org.apache.log4j.PatternLayout</span><span class="delimiter">"</span></span><span class="tag">></span><i class="conum" data-value="5"></i><b>(5)</b>
<span class="tag"><param</span> <span class="attribute-name">name</span>=<span class="string"><span class="delimiter">"</span><span class="content">ConversionPattern</span><span class="delimiter">"</span></span> <span class="attribute-name">value</span>=<span class="string"><span class="delimiter">"</span><span class="content">%d{ISO8601} %-5p [%t] %c - %m%n</span><span class="delimiter">"</span></span><span class="tag">/></span>
<span class="tag"></layout></span>
<span class="tag"></appender></span>
<span class="comment"><!-- Loggers --></span>
<span class="tag"><root></span>
<span class="tag"><priority</span> <span class="attribute-name">value</span>=<span class="string"><span class="delimiter">"</span><span class="content">INFO</span><span class="delimiter">"</span></span><span class="tag">/></span>
<span class="tag"><appender-ref</span> <span class="attribute-name">ref</span>=<span class="string"><span class="delimiter">"</span><span class="content">CONSOLE</span><span class="delimiter">"</span></span><span class="tag">/></span>
<span class="tag"><appender-ref</span> <span class="attribute-name">ref</span>=<span class="string"><span class="delimiter">"</span><span class="content">FILE</span><span class="delimiter">"</span></span><span class="tag">/></span>
<span class="tag"></root></span>
<span class="tag"></log4j:configuration></span></code></pre>
</div>
</div>
<div class="colist arabic">
<table>
<tr>
<td><i class="conum" data-value="1"></i><b>1</b></td>
<td>Like Logback, the <code>debug</code> flag enables Log4J1 internal logs.</td>
</tr>
<tr>
<td><i class="conum" data-value="2"></i><b>2</b></td>
<td>There aren’t any properties in Log4J1.</td>
</tr>
<tr>
<td><i class="conum" data-value="3"></i><b>3</b></td>
<td>The <code>PatternLayout</code> is documented in the <a href="http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/PatternLayout.html">JavaDoc</a>.</td>
</tr>
<tr>
<td><i class="conum" data-value="4"></i><b>4</b></td>
<td>Log4J1’s core <code>RollingFileAppender</code> only rolls by size, and <code>DailyRollingFileAppender</code> only by time; the <code>log4j-extras</code> extension adds more rollover policies.
Even with this extension, one cannot mix size and time rollover.</td>
</tr>
<tr>
<td><i class="conum" data-value="5"></i><b>5</b></td>
<td>Format the <code>date</code> in ISO8601.</td>
</tr>
</table>
</div>
<div class="paragraph">
<p>Don’t forget the <code>file:</code> prefix when using a specific configuration file with the JVM property <code>-Dlog4j.configuration=file:///path/to/config.xml</code>.</p>
</div>
<div class="paragraph">
<p>The <a href="https://logging.apache.org/log4j/1.2/manual.html">Log4J1 Manual</a> is only a quick introduction. Back in the day, there was even a <a href="https://www.amazon.com/Complete-Log4j-Manual-Ceki-Gulcu/dp/2970036908">book</a>.</p>
</div>
</div>
</div>Gérald QuintanaAnsible collection processing2019-04-25T00:00:00+00:002019-04-25T00:00:00+00:00/2019/04/25/Ansible-collection-processing<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>As a Java developer, I sometimes dream of using the Java 8+ Stream API in my Ansible playbooks to process list and dict variables.</p>
</div>
<div class="paragraph">
<p>In this article, I’ll show how you can process a list of users:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="yaml"><span class="key">users</span>:
- <span class="string"><span class="content">id: bouh</span></span>
<span class="key">name</span>: <span class="string"><span class="delimiter">"</span><span class="content">Mary</span><span class="delimiter">"</span></span>
<span class="key">admin</span>: <span class="string"><span class="content">True</span></span>
<span class="key">role</span>: <span class="string"><span class="content">child</span></span>
- <span class="string"><span class="content">id: sulli</span></span>
<span class="key">name</span>: <span class="string"><span class="delimiter">"</span><span class="content">James Sullivan</span><span class="delimiter">"</span></span>
<span class="key">admin</span>: <span class="string"><span class="content">False</span></span>
<span class="key">role</span>: <span class="string"><span class="content">monster</span></span>
- <span class="string"><span class="content">id: bob</span></span>
<span class="key">name</span>: <span class="string"><span class="delimiter">"</span><span class="content">Bob Wazowski</span><span class="delimiter">"</span></span>
<span class="key">admin</span>: <span class="string"><span class="content">False</span></span>
<span class="key">role</span>: <span class="string"><span class="content">assistant</span></span>
- <span class="string"><span class="content">id: celia</span></span>
<span class="key">name</span>: <span class="string"><span class="delimiter">"</span><span class="content">Celia Mae</span><span class="delimiter">"</span></span>
<span class="key">admin</span>: <span class="string"><span class="content">False</span></span>
<span class="key">role</span>: <span class="string"><span class="content">assistant</span></span></code></pre>
</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="jinja_filters">Jinja Filters</h2>
<div class="sectionbody">
<div class="paragraph">
<p>The main tools for transforming Ansible variables are Jinja filters. There are 2 libraries of filters available:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>The Jinja <a href="http://jinja.pocoo.org/docs/2.10/templates/#list-of-builtin-filters">builtin filters</a>.
This list can also be found in Jinja source code <a href="https://github.com/pallets/jinja/blob/master/jinja2/filters.py">filters.py</a>.</p>
</li>
<li>
<p>The <a href="https://docs.ansible.com/ansible/latest/user_guide/playbooks_filters.html#">Ansible filters</a>.
This list can also be found in the Ansible source code <a href="https://github.com/ansible/ansible/tree/devel/lib/ansible/plugins/filter">filter</a> directory.</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>Filters are similar to Unix or Angular pipes and can be chained.</p>
</div>
<div class="paragraph">
<p>Like in other data processing libraries, there are two kinds of operators:</p>
</div>
<div class="ulist">
<ul>
<li>
<p><strong>Mappers</strong> take a stream of elements and produce a stream of elements: selectattr, rejectattr, map, list</p>
</li>
<li>
<p><strong>Reducers</strong> take a stream of elements and produce a single element: join, first, last, max, min</p>
</li>
</ul>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="yaml"><span class="key">admin_user_ids</span>: <span class="string"><span class="delimiter">|</span><span class="content">
{{ users
|selectattr('admin')
|map(attribute='id')
|join(',') }} </span></span><i class="conum" data-value="1"></i><b>(1)</b>
<span class="key">normal_user_count</span>: <span class="string"><span class="delimiter">|</span><span class="content">
{{ users
|rejectattr('admin')
|list |count }} </span></span><i class="conum" data-value="2"></i><b>(2)</b></code></pre>
</div>
</div>
<div class="colist arabic">
<table>
<tr>
<td><i class="conum" data-value="1"></i><b>1</b></td>
<td>Take the <code>id</code> attribute of <code>users</code> having <code>admin</code> set to true and join them.</td>
</tr>
<tr>
<td><i class="conum" data-value="2"></i><b>2</b></td>
<td>Take the <code>users</code> having <code>admin</code> set to false and count them. As the <code>rejectattr</code> filter returns an iterator but the <code>count</code> filter requires a list, I have to use the <code>list</code> filter to convert it.</td>
</tr>
</table>
</div>
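These filter pipelines are plain Jinja, so they can be prototyped outside Ansible with the <code>jinja2</code> Python package (a quick sketch, assuming <code>jinja2</code> is installed; the data mirrors the article’s users list):

```python
# Prototype the selectattr/map/join pipeline outside Ansible.
from jinja2 import Environment

users = [
    {"id": "bouh", "admin": True},
    {"id": "sulli", "admin": False},
    {"id": "bob", "admin": False},
]

template = Environment().from_string(
    "{{ users|selectattr('admin')|map(attribute='id')|join(',') }}"
)
print(template.render(users=users))  # bouh
```

This makes it easy to iterate on a filter chain without re-running a whole playbook.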
<div class="paragraph">
<p>The <code>selectattr</code>/<code>rejectattr</code> filters can take 3 arguments: the attribute, a boolean operator and an argument.
The operator can be chosen among Jinja’s <a href="http://jinja.pocoo.org/docs/2.10/templates/#list-of-builtin-tests">builtin tests</a>.
This list can also be found in the Jinja source code <a href="https://github.com/pallets/jinja/blob/master/jinja2/tests.py">tests.py</a>.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="yaml"><span class="key">assistant_user_ids</span>: <span class="string"><span class="delimiter">|</span><span class="content">
{{ users
|selectattr('role', 'equalto', 'assistant')
|map(attribute='id')
|join(',') }}</span></span></code></pre>
</div>
</div>
<div class="paragraph">
<p>With Ansible 2.7+, the <code>map</code> filter can take 3 arguments: the attribute, an operator and arguments.
The operator can be chosen among Jinja filters, and will be applied to each element of the list.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="yaml"><span class="key">user_first_names</span>: <span class="string"><span class="delimiter">|</span><span class="content">
{{ users
|map(attribute='name')
|map('regex_replace', '(\\w+)( .*)?', '\\g<1>')
|join(',') }} </span></span><i class="conum" data-value="1"></i><b>(1)</b></code></pre>
</div>
</div>
<div class="colist arabic">
<table>
<tr>
<td><i class="conum" data-value="1"></i><b>1</b></td>
<td>For each user, take its name, apply the replacement when the regular expression matches, then join the results.</td>
</tr>
</table>
</div>
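Ansible’s <code>regex_replace</code> filter is backed by Python regular expressions, so the first-name extraction above can be checked in a plain Python interpreter (a sketch using the article’s user names):

```python
import re

# Mimic the regex_replace filter above: keep the first word of each name.
names = ["Mary", "James Sullivan", "Bob Wazowski", "Celia Mae"]
first_names = [re.sub(r"(\w+)( .*)?", r"\g<1>", name) for name in names]
print(",".join(first_names))  # Mary,James,Bob,Celia
```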
</div>
</div>
<div class="sect1">
<h2 id="json_query">JSON Query</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Another strategy is to use a JSON Path to walk down the YAML tree.
It’s a bit less verbose and a bit more powerful than the previous solution.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="yaml"><span class="key">jq_admin_user_ids</span>: <span class="string"><span class="delimiter">|</span><span class="content">
{{ users
|json_query("[?admin].id")
|join(',') }} </span></span><i class="conum" data-value="1"></i><b>(1)</b>
<span class="key">jq_assistant_user_ids</span>: <span class="string"><span class="delimiter">|</span><span class="content">
{{ users
|json_query("[?role == 'assistant'].id")
|join(',') }} </span></span><i class="conum" data-value="2"></i><b>(2)</b></code></pre>
</div>
</div>
<div class="colist arabic">
<table>
<tr>
<td><i class="conum" data-value="1"></i><b>1</b></td>
<td>Take the <code>id</code> attributes of <code>users</code> having <code>admin</code> set to true and then join them.</td>
</tr>
<tr>
<td><i class="conum" data-value="2"></i><b>2</b></td>
<td>Take the <code>id</code> attributes of <code>users</code> having <code>role</code> set to <code>assistant</code> and then join them.</td>
</tr>
</table>
</div>
<div class="paragraph">
<p>This <code>json_query</code> filter is based on the <code>jmespath</code> Python <a href="https://pypi.org/project/jmespath/">library</a>, which means two things:</p>
</div>
<div class="olist arabic">
<ol class="arabic">
<li>
<p>You can use <a href="http://jmespath.org/">jmespath.org</a> web site to cook your JSON path query.</p>
</li>
<li>
<p>You’ll have to add the <code>jmespath</code> library to your Python environment.</p>
</li>
</ol>
</div>
<div class="paragraph">
<p>Sadly, nesting JMESPath expressions inside Jinja template expressions inside YAML files can be tricky.
The following example fails, even though the JMESPath query is correct when run in a Python interpreter.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="yaml"><span class="key">jq_bid_user_ids</span>: <span class="string"><span class="delimiter">|</span><span class="content">
{{ users
|json_query("[?starts_with(id,'b')].id")
|join(',') }}</span></span></code></pre>
</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="conclusion">Conclusion</h2>
<div class="sectionbody">
<div class="paragraph">
<p>It’s possible to transform a variable containing a list into another list.
However, it’s still painful because neither YAML nor Jinja are programming languages.
I personally regret that I can’t invoke Python code from an Ansible playbook and use comprehensions, imagine something like:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="yaml"><span class="key">py_admin_user_ids</span>: <span class="string"><span class="delimiter">|</span><span class="content">
{{ ','.join([ user.id for user in users if user.admin ]) }}</span></span></code></pre>
</div>
</div>
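For the record, these comprehensions do work in plain Python; a sketch reproducing the article’s transformations with the same users list:

```python
# The article's Jinja/JMESPath transformations, rewritten as plain Python
# comprehensions (the users list mirrors the article's data).
users = [
    {"id": "bouh", "name": "Mary", "admin": True, "role": "child"},
    {"id": "sulli", "name": "James Sullivan", "admin": False, "role": "monster"},
    {"id": "bob", "name": "Bob Wazowski", "admin": False, "role": "assistant"},
    {"id": "celia", "name": "Celia Mae", "admin": False, "role": "assistant"},
]

admin_user_ids = ",".join(u["id"] for u in users if u["admin"])
normal_user_count = sum(1 for u in users if not u["admin"])
assistant_user_ids = ",".join(u["id"] for u in users if u["role"] == "assistant")
user_first_names = ",".join(u["name"].split()[0] for u in users)

print(admin_user_ids)      # bouh
print(normal_user_count)   # 3
print(assistant_user_ids)  # bob,celia
print(user_first_names)    # Mary,James,Bob,Celia
```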
</div>
</div>Gérald QuintanaStructured logging with SLF4J and Logback2017-12-01T00:00:00+00:002017-12-01T00:00:00+00:00/2017/12/01/Structured-logging-with-SL-FJ-and-Logback<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>I don’t know who first coined the term <strong>structured logging</strong>.
There is <a href="https://kartar.net/2015/12/structured-logging/">a 2015 blog post by James Turnbull</a> to get started.</p>
</div>
<div class="paragraph">
<p>Python and .Net developers have libraries dedicated to structured logging: <a href="http://www.structlog.org">structlog</a> and <a href="https://serilog.net/">serilog</a>. In this article I will describe how to do structured logging in Java with the usual logging libraries like SLF4J and Logback.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="structured_logging_with_slf4j">Structured logging with SLF4J</h2>
<div class="sectionbody">
<div class="paragraph">
<p>All Java developers know how to log a message:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="java"><span class="predefined-type">Logger</span> demoLogger = LoggerFactory.getLogger(<span class="string"><span class="delimiter">"</span><span class="content">logodyssey.DemoLogger</span><span class="delimiter">"</span></span>);
demoLogger.info(<span class="string"><span class="delimiter">"</span><span class="content">Hello world!</span><span class="delimiter">"</span></span>);</code></pre>
</div>
</div>
<div class="paragraph">
<p>Properly configured, it produces a log like</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code>21:10:29.178 [Thread-1] INFO logodyssey.DemoLogger - Hello world!</code></pre>
</div>
</div>
<div class="paragraph">
<p>Notice how this "Hello world!" message is qualified with several fields:
a timestamp, a thread Id, a level and a logger/category.</p>
</div>
<div class="paragraph">
<p>This is what the term "structured logging" means:
a log is more than a message string.
The message is associated with contextual information
that tells what was going on when this log was printed.</p>
</div>
<div class="paragraph">
<p>How can we enrich this contextual information provided by default,
and add the user Id for example?
It is the purpose of the MDC (Mapped Diagnostic Context):</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="java">MDC.put(<span class="string"><span class="delimiter">"</span><span class="content">userId</span><span class="delimiter">"</span></span>, <span class="string"><span class="delimiter">"</span><span class="content">gquintana</span><span class="delimiter">"</span></span>);
demoLogger.info(<span class="string"><span class="delimiter">"</span><span class="content">Hello world!</span><span class="delimiter">"</span></span>);
MDC.remove(<span class="string"><span class="delimiter">"</span><span class="content">userId</span><span class="delimiter">"</span></span>);</code></pre>
</div>
</div>
<div class="paragraph">
<p>The MDC is a map-like object filled in the Java code,
and used in the back-end logging library to output custom data.
With the adequate configuration, we can get the user Id in the log:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code>21:10:29.178 [Thread-1] gquintana INFO logodyssey.DemoLogger - Hello world!</code></pre>
</div>
</div>
<div class="paragraph">
<p>The MDC can store any information about the user (user Id, session Id, token Id), about the current request (request Id, transaction Id), about long running threads (batch instance Id, broker client Id).
Later on, this information will be part of the log.</p>
</div>
<div class="paragraph">
<p>Having this kind of information allows grouping logs by user, by request, or by processing run.</p>
Remember that logs may be scattered across different servers, on different time periods.
These additional fields allow correlating logs belonging to the same scenario and finding answers to questions like "what was the user X doing when he met this nasty error?"</p>
</div>
<div class="paragraph">
<p>Let’s get back to the example, we saw the MDC stores extra information about logs.
The MDC is usually based on a thread local variable, this has two drawbacks:</p>
</div>
<div class="olist arabic">
<ol class="arabic">
<li>
<p>It must be properly cleaned after being used, or you may experience information leaks if the thread is reused. Think about thread pools in web servers like Tomcat.</p>
</li>
<li>
<p>The information may not be properly transferred from one thread to another. Think about asynchronous calls.</p>
</li>
</ol>
</div>
<div class="paragraph">
<p>As a result, calling <code>MDC.remove</code> (or <code>MDC.clear</code>), as in the above example, is required to clean the MDC after usage.
In order not to forget to do the housework afterwards, we can use a try-with-resource construct:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="java"><span class="keyword">try</span>(MDC.MDCCloseable mdc = MDC.putCloseable(<span class="string"><span class="delimiter">"</span><span class="content">userId</span><span class="delimiter">"</span></span>, <span class="string"><span class="delimiter">"</span><span class="content">gquintana</span><span class="delimiter">"</span></span>)) {
demoLogger.info(<span class="string"><span class="delimiter">"</span><span class="content">Hello world!</span><span class="delimiter">"</span></span>);
}</code></pre>
</div>
</div>
<div class="paragraph">
<p>It’s better but still verbose.
Fortunately, this kind of code won’t make its way into your business code, because it is usually hidden in an interceptor like a Servlet filter, a Spring aspect or a JAX-RS interceptor. In Logback, there is an <code>MDCInsertingServletFilter</code> class which can serve as an example.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="json_logging_with_logback">JSON logging with Logback</h2>
<div class="sectionbody">
<div class="paragraph">
<p>At this point, a log is more than a simple string,
it is qualified with useful information: timestamp, level, thread, user Id…​
How can we write this data structure on disk or send it over the wire to a log collection tool?
We have to serialize it.
For a human being, a simple text format as shown above is readable enough.
However, for a machine, this is just a word soup without any structure.
In short, to send structured logs to a log collection tool
and benefit from this structure (search by user, by thread…​),
we must use a structured format, like JSON for example.</p>
</div>
<div class="paragraph">
<p>Compared to the Syslog format, another popular log format, the JSON format</p>
</div>
<div class="ulist">
<ul>
<li>
<p>Can properly handle multi-line logs like stack traces/call traces or messages containing line separators (wanted or not)</p>
</li>
<li>
<p>Is a versatile format and can have custom fields like user Id, transaction Id</p>
</li>
<li>
<p>Is more verbose, so compression (GZip or the like) may be required to reduce its size</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>Most popular log collection tools like Filebeat, Graylog, or Fluentd already use some kind of compressed JSON format under the hood.
You should too.</p>
</div>
<div class="paragraph">
<p>Generating JSON logs with Logback is very easy.
I’ll show how to use two Logback extensions,
the <a href="https://github.com/logstash/logstash-logback-encoder">Logstash Logback encoder</a>
and the <a href="https://github.com/qos-ch/logback-contrib/wiki">Logback Contrib</a> library.</p>
</div>
<div class="paragraph">
<p>The first one uses a Logback extension point known as <strong>encoder</strong> that you can plug into any appender:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="xml"> <span class="tag"><appender</span> <span class="attribute-name">name</span>=<span class="string"><span class="delimiter">"</span><span class="content">FILE</span><span class="delimiter">"</span></span> <span class="attribute-name">class</span>=<span class="string"><span class="delimiter">"</span><span class="content">ch.qos.logback.core.FileAppender</span><span class="delimiter">"</span></span><span class="tag">></span>
<span class="tag"><file></span>log/log-odyssey.log<span class="tag"></file></span>
<span class="tag"><encoder</span> <span class="attribute-name">class</span>=<span class="string"><span class="delimiter">"</span><span class="content">net.logstash.logback.encoder.LogstashEncoder</span><span class="delimiter">"</span></span><span class="tag">></span>
<span class="tag"><customFields></span>{"application":"log-odyssey"}<span class="tag"></customFields></span>
<span class="tag"></encoder></span>
<span class="tag"></appender></span></code></pre>
</div>
</div>
<div class="paragraph">
<p>It will produce the expected result:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="json">{
<span class="key"><span class="delimiter">"</span><span class="content">@timestamp</span><span class="delimiter">"</span></span>: <span class="string"><span class="delimiter">"</span><span class="content">2017-11-25T21:10:29.178+01:00</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">@version</span><span class="delimiter">"</span></span>: <span class="integer">1</span>,
<span class="key"><span class="delimiter">"</span><span class="content">message</span><span class="delimiter">"</span></span>: <span class="string"><span class="delimiter">"</span><span class="content">Hello world!</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">logger_name</span><span class="delimiter">"</span></span>: <span class="string"><span class="delimiter">"</span><span class="content">logodyssey.DemoLogger</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">thread_name</span><span class="delimiter">"</span></span>: <span class="string"><span class="delimiter">"</span><span class="content">Thread-1</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">level</span><span class="delimiter">"</span></span>: <span class="string"><span class="delimiter">"</span><span class="content">INFO</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">level_value</span><span class="delimiter">"</span></span>: <span class="integer">20000</span>,
<span class="key"><span class="delimiter">"</span><span class="content">HOSTNAME</span><span class="delimiter">"</span></span>: <span class="string"><span class="delimiter">"</span><span class="content">my-laptop</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">userId</span><span class="delimiter">"</span></span>: <span class="string"><span class="delimiter">"</span><span class="content">gquintana</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">application</span><span class="delimiter">"</span></span>: <span class="string"><span class="delimiter">"</span><span class="content">log-odyssey</span><span class="delimiter">"</span></span>
}</code></pre>
</div>
</div>
<div class="paragraph">
<p>The Maven coordinates for this library are <code>net.logstash.logback:logstash-logback-encoder:4.11</code>.</p>
</div>
<div class="paragraph">
<p>The second one uses a different extension point called <strong>layout</strong>.
In the end, it looks very similar to the first one, a bit more verbose though:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="xml"> <span class="tag"><appender</span> <span class="attribute-name">name</span>=<span class="string"><span class="delimiter">"</span><span class="content">FILE</span><span class="delimiter">"</span></span> <span class="attribute-name">class</span>=<span class="string"><span class="delimiter">"</span><span class="content">ch.qos.logback.core.FileAppender</span><span class="delimiter">"</span></span><span class="tag">></span>
<span class="tag"><file></span>log/log-odyssey.log<span class="tag"></file></span>
<span class="tag"><encoder</span> <span class="attribute-name">class</span>=<span class="string"><span class="delimiter">"</span><span class="content">ch.qos.logback.core.encoder.LayoutWrappingEncoder</span><span class="delimiter">"</span></span><span class="tag">></span>
<span class="tag"><layout</span> <span class="attribute-name">class</span>=<span class="string"><span class="delimiter">"</span><span class="content">ch.qos.logback.contrib.json.classic.JsonLayout</span><span class="delimiter">"</span></span><span class="tag">></span>
<span class="tag"><jsonFormatter</span> <span class="attribute-name">class</span>=<span class="string"><span class="delimiter">"</span><span class="content">ch.qos.logback.contrib.jackson.JacksonJsonFormatter</span><span class="delimiter">"</span></span><span class="tag">/></span>
<span class="tag"><appendLineSeparator></span>true<span class="tag"></appendLineSeparator></span>
<span class="tag"></layout></span>
<span class="tag"></encoder></span>
<span class="tag"></appender></span></code></pre>
</div>
</div>
<div class="paragraph">
<p>The result is very close as well, even though the fields are named differently:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="json">{
<span class="key"><span class="delimiter">"</span><span class="content">timestamp</span><span class="delimiter">"</span></span>:<span class="string"><span class="delimiter">"</span><span class="content">1511814391083</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">level</span><span class="delimiter">"</span></span>:<span class="string"><span class="delimiter">"</span><span class="content">INFO</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">thread</span><span class="delimiter">"</span></span>:<span class="string"><span class="delimiter">"</span><span class="content">Thread-1</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">mdc</span><span class="delimiter">"</span></span>: {
<span class="key"><span class="delimiter">"</span><span class="content">userId</span><span class="delimiter">"</span></span>:<span class="string"><span class="delimiter">"</span><span class="content">gquintana</span><span class="delimiter">"</span></span>
},
<span class="key"><span class="delimiter">"</span><span class="content">logger</span><span class="delimiter">"</span></span>:<span class="string"><span class="delimiter">"</span><span class="content">logodyssey.DemoLogger</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">message</span><span class="delimiter">"</span></span>:<span class="string"><span class="delimiter">"</span><span class="content">Hello world!</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">context</span><span class="delimiter">"</span></span>:<span class="string"><span class="delimiter">"</span><span class="content">default</span><span class="delimiter">"</span></span>
}</code></pre>
</div>
</div>
<div class="paragraph">
<p>In order to be on par with the first example, it is possible to subclass the <code>JsonLayout</code> and add custom fields:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="java"><span class="directive">public</span> <span class="type">class</span> <span class="class">CustomJsonLayout</span> <span class="directive">extends</span> JsonLayout {
<span class="annotation">@Override</span>
<span class="directive">protected</span> <span class="type">void</span> addCustomDataToJsonMap(<span class="predefined-type">Map</span><<span class="predefined-type">String</span>, <span class="predefined-type">Object</span>> map, ILoggingEvent event) {
map.put(<span class="string"><span class="delimiter">"</span><span class="content">application</span><span class="delimiter">"</span></span>, <span class="string"><span class="delimiter">"</span><span class="content">log-odyssey</span><span class="delimiter">"</span></span>);
<span class="keyword">try</span> {
map.put(<span class="string"><span class="delimiter">"</span><span class="content">host</span><span class="delimiter">"</span></span>, <span class="predefined-type">InetAddress</span>.getLocalHost().getHostName());
} <span class="keyword">catch</span> (<span class="exception">UnknownHostException</span> e) {
}
}
}</code></pre>
</div>
</div>
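<div class="paragraph">
<p>To use this subclass, reference it from the <code>layout</code> element of the appender shown earlier. The <code>com.example</code> package below is an assumption; adjust it to wherever <code>CustomJsonLayout</code> is packaged:</p>
</div>
<div class="listingblock">
<div class="content">

```xml
<appender name="FILE" class="ch.qos.logback.core.FileAppender">
  <file>log/log-odyssey.log</file>
  <encoder class="ch.qos.logback.core.encoder.LayoutWrappingEncoder">
    <!-- Hypothetical package: point this at your CustomJsonLayout class -->
    <layout class="com.example.CustomJsonLayout">
      <jsonFormatter class="ch.qos.logback.contrib.jackson.JacksonJsonFormatter"/>
      <appendLineSeparator>true</appendLineSeparator>
    </layout>
  </encoder>
</appender>
```

</div>
</div>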
<div class="paragraph">
<p>Several Maven dependencies are required for this library to work: <code>ch.qos.logback.contrib:logback-json-classic:0.1.5</code>,
<code>ch.qos.logback.contrib:logback-jackson:0.1.5</code>
and <code>com.fasterxml.jackson.core:jackson-databind</code>.</p>
</div>
<div class="paragraph">
<p>In the end, these libraries are similar: both use the Jackson library to generate JSON.
Contrary to the above JSON examples, which have been prettified to be human readable, producing one JSON document per line is better: it is more compact, and each end of line marks the end of a log event, so there are no multi-line logs.
This format is known as <a href="http://ndjson.org/">NDJSON</a> or <a href="http://jsonlines.org/">JSON Lines</a>.
Logstash and Filebeat can easily read this kind of JSON file.</p>
</div>
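<div class="paragraph">
<p>A quick sketch of why this convention is convenient for consumers: since each line is a complete JSON document, recovering individual log events only requires splitting on newlines, no JSON-aware framing. The file content below is made up for illustration:</p>
</div>
<div class="listingblock">
<div class="content">

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

public class NdjsonDemo {
    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("log-odyssey", ".log");
        // One complete JSON document per line: NDJSON / JSON Lines
        Files.write(log, List.of(
                "{\"level\":\"INFO\",\"message\":\"Hello world!\"}",
                "{\"level\":\"WARN\",\"message\":\"Goodbye\"}"),
                StandardCharsets.UTF_8);
        // A consumer only needs to split on newlines to get one event per line
        try (Stream<String> lines = Files.lines(log, StandardCharsets.UTF_8)) {
            System.out.println(lines.count()); // 2
        }
        Files.delete(log);
    }
}
```

</div>
</div>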
</div>
</div>
<div class="sect1">
<h2 id="conclusion">Conclusion</h2>
<div class="sectionbody">
<div class="paragraph">
<p>A log is more than a textual message, it can be enriched with information at different levels:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>Line of code: message, timestamp, level, threadId, appender…​</p>
</li>
<li>
<p>User or transaction: user Id, session Id…​</p>
</li>
<li>
<p>Deployment unit: application Id, container Id, host Id, environment Id (production, staging)…​</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>Once qualified with this contextual information,
the log message becomes a structured piece of information
and should be processed as such.
Producing logs in JSON format keeps that structure
and eases storing these logs in Elasticsearch.
More on that later, if time permits.</p>
</div>
</div>
</div>Gérald QuintanaLog collection in AWS land2017-09-30T00:00:00+00:002017-09-30T00:00:00+00:00/2017/09/30/Log-collection-in-AWS-land<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>In AWS, it is really easy to spin up new machines and scale.
The more machines you have, the more important it is to centralize logs.
In this article, I will describe what I discovered while trying to collect logs from applications deployed on Beanstalk and send them into Elasticsearch.</p>
</div>
<div class="paragraph">
<p>Disclaimer: I am an AWS newbie.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="the_family_picture">The family picture</h2>
<div class="sectionbody">
<div class="imageblock">
<div class="content">
<img src="2017-09-30-Log-collection-in-AWS-land/big-picture.svg" alt="Big picture">
</div>
</div>
<div class="dlist">
<dl>
<dt class="hdlist1">Beanstalk</dt>
<dd>
<p>contains a web server and runs an application (Java, JS, Go…​), both produce logs in a local <code>/var/log/something</code> directory.
There can be multiple instances of the same application for scalability, or it can be different applications in the same environment.</p>
</dd>
<dt class="hdlist1">Cloudwatch</dt>
<dd>
<p>is used to monitor EC2, Beanstalk…​ instances; it is the place where logs and metrics are gathered.
From there you can trigger alerts, schedule tasks…​</p>
</dd>
<dt class="hdlist1">S3</dt>
<dd>
<p>is a file storage which can be used to archive logs for the long term and keep them after instances stop.
However, these logs are not easily searchable because they are compressed files.</p>
</dd>
<dt class="hdlist1">Elasticsearch</dt>
<dd>
<p>can be used to index logs and make them searchable.
A Kibana UI is provided to make searching and dashboard building even easier.</p>
</dd>
<dt class="hdlist1">Lambda</dt>
<dd>
<p>a provided Lambda function is used to bridge logs from Cloudwatch to Elasticsearch.</p>
</dd>
</dl>
</div>
<div class="paragraph">
<p>To ship logs into Cloudwatch, an AWSLogs agent is provided.
To archive logs into S3, a script is cron-ed along with logrotate.</p>
</div>
<div class="paragraph">
<p>This article will skip the security (IAM) settings which are required to allow these components to communicate.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="beanstalk">Beanstalk</h2>
<div class="sectionbody">
<div class="paragraph">
<p>There are several components producing logs in a Beanstalk instance:</p>
</div>
<div class="olist arabic">
<ol class="arabic">
<li>
<p>Beanstalk deployment</p>
</li>
<li>
<p>Proxy server: either Apache or Nginx produce Access logs and Error logs</p>
</li>
<li>
<p>Web server: Tomcat, NodeJS…​</p>
</li>
<li>
<p>Application with its own logging framework</p>
</li>
</ol>
</div>
<div class="paragraph">
<p>Each component produces logs with its own format.
Beanstalk knows how to automatically collect logs for the first three components: deploy logs, access logs, web server logs…​
As it knows where these logs are located, it is in charge of rotating, archiving on S3, and purging log files.</p>
</div>
<div class="paragraph">
<p>The Beanstalk console allows you to <a href="http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features.logging.html">download the last 100</a> lines of each file for a given instance.
It is useful to understand why a deployment fails, but it’s not meant to dig into the logs of a running application cluster.</p>
</div>
<div class="paragraph">
<p>To tell Beanstalk to take care of your application-specific log files, just add some configuration files indicating where they are located:</p>
</div>
<div class="listingblock">
<div class="title">/opt/elasticbeanstalk/tasks/taillogs.d/my-app.conf</div>
<div class="content">
<pre class="CodeRay highlight"><code>/var/log/my-app/my-app.log</code></pre>
</div>
</div>
<div class="listingblock">
<div class="title">/opt/elasticbeanstalk/tasks/bundlelogs.d/my-app.conf</div>
<div class="content">
<pre class="CodeRay highlight"><code>/var/log/my-app/my-app.*.log</code></pre>
</div>
</div>
<div class="paragraph">
<p>The first file makes the content of <code>my-app.log</code> (the current log file) appear in the Beanstalk console.
The second one makes all the old log files get archived.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="cloudwatch_logs">Cloudwatch Logs</h2>
<div class="sectionbody">
<div class="paragraph">
<p>First of all, the log stream from Beanstalk to Cloudwatch must be enabled in Beanstalk configuration file:</p>
</div>
<div class="listingblock">
<div class="title">.ebextensions/cloudwatch.config</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="yaml"><span class="key">option_settings</span>:
<span class="error">aws:elasticbeanstalk:cloudwatch:logs</span>:
<span class="key">StreamLogs</span>: <span class="string"><span class="content">true</span></span>
<span class="key">DeleteOnTerminate</span>: <span class="string"><span class="content">false</span></span>
<span class="key">RetentionInDays</span>: <span class="string"><span class="content">30</span></span></code></pre>
</div>
</div>
<div class="paragraph">
<p>This sets up the Cloudwatch log agent.
Provided you’re using an image based on Amazon Linux, you can also install it on any EC2 instance with yum:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="shell">yum update -y
yum install -y awslogs
service awslogs start</code></pre>
</div>
</div>
<div class="paragraph">
<p>Then the Cloudwatch log agent must be configured to watch your custom log files.</p>
</div>
<div class="listingblock">
<div class="title">/etc/awslogs/config/my-app.conf</div>
<div class="content">
<pre class="CodeRay highlight"><code data-lang="toml">[/var/log/my-app/my-app.log]
log_group_name=/aws/elasticbeanstalk/my-app-dev/var/log/my-app/my-app.log
log_stream_name={instance_id}
file=/var/log/my-app/my-app*.log</code></pre>
</div>
</div>
<div class="paragraph">
<p>Log files produced by components (Apache, Tomcat…​) managed by Beanstalk are already configured.
The above config file is only for application-specific log files.
This agent supports multiline logs (stacktraces, call traces…​) provided the continuation lines start with a whitespace (space or tab).
It’s not as powerful as Filebeat or Logstash.</p>
</div>
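<div class="paragraph">
<p>According to the CloudWatch Logs agent reference, an explicit <code>multi_line_start_pattern</code> can replace the leading-whitespace heuristic. A sketch reusing the configuration above; the timestamp format is an assumption to adapt to your log layout:</p>
</div>
<div class="listingblock">
<div class="content">

```toml
[/var/log/my-app/my-app.log]
log_group_name=/aws/elasticbeanstalk/my-app-dev/var/log/my-app/my-app.log
log_stream_name={instance_id}
file=/var/log/my-app/my-app*.log
# A new log event starts with a timestamp; any other line is
# appended to the previous event (stacktraces, call traces...)
datetime_format=%Y-%m-%d %H:%M:%S
multi_line_start_pattern={datetime_format}
```

</div>
</div>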
<div class="paragraph">
<p>At this point, you’ll be able to see logs aggregated from multiple Beanstalk instances in the Cloudwatch console.</p>
</div>
<div class="imageblock">
<div class="content">
<img src="2017-09-30-Log-collection-in-AWS-land/cloudwatch_log_search.png" alt="Cloudwatch log viewer">
</div>
</div>
<div class="paragraph">
<p>It’s better than the Beanstalk console to monitor a running platform.
Yet, log search is still limited because logs are not structured (split into fields) and the full text search is simplistic.</p>
</div>
<div class="paragraph">
<p>Using Cloudwatch, it is also possible:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>to raise alerts when a specific pattern is found in logs</p>
</li>
<li>
<p>to extract metrics from logs (HTTP request time, number of 404 errors in Access logs for example) and draw charts</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>More information can be found <a href="https://aws.amazon.com/fr/blogs/aws/cloudwatch-log-service/">here</a>.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="elasticsearch">Elasticsearch</h2>
<div class="sectionbody">
<div class="paragraph">
<p>To send logs into Elasticsearch and get a better log search experience,
subscribe a log filter to each Cloudwatch log group.
There is a special Lambda function which filters logs and sends them to Elasticsearch.
This log filter can be used to split text logs into fields:</p>
</div>
<div class="imageblock">
<div class="content">
<img src="2017-09-30-Log-collection-in-AWS-land/cloudwatch_log_filter.png" alt="Cloudwatch log filter">
</div>
</div>
<div class="paragraph">
<p>This tool can split space-delimited logs, like access logs, into fields.
But it’s hard to "grok" more complicated logs with such a basic tool.
It supports JSON formatted logs, so a good solution for application logs is to configure your favorite logging framework to produce JSON logs.</p>
</div>
<div class="paragraph">
<p><a href="https://medium.com/wolox-driving-innovation/centralized-logging-in-microservices-using-aws-cloudwatch-elasticsearch-f5db7a57e553">This article</a> is worth reading.</p>
</div>
<div class="paragraph">
<p>At this point, we can open Kibana and configure an index pattern named <code>cwl-*</code>.
The Cloudwatch log filter mimics Logstash and uses a field named <code>@timestamp</code> as the timestamp.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="conclusion">Conclusion</h2>
<div class="sectionbody">
<div class="paragraph">
<p>AWS provides all the building blocks to centralize logs and monitor your whole infrastructure.
It’s not hard to collect logs and send them to Elasticsearch.
But it’s also far less powerful than the complete Elastic stack.</p>
</div>
</div>
</div>Gérald QuintanaJava File vs Path2017-09-02T00:00:00+00:002017-09-02T00:00:00+00:00/2017/09/02/Java-File-vs-Path<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>I’ve been using <code>java.io.File</code> and <code>java.io.File*Stream</code> since Java 1.1, a long time ago.
Java 7 introduced a new file API named <strong>NIO2</strong> containing, among others, the <code>java.nio.file.Path</code> and <code>java.nio.file.Files</code> classes.
It took me a while to lose my habits and embrace the new API.</p>
</div>
<div class="paragraph">
<p>Spoiler: the funniest part of this article is at the end!</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="quick_comparison">Quick comparison</h2>
<div class="sectionbody">
<table class="tableblock frame-all grid-all stretch">
<colgroup>
<col style="width: 50%;">
<col style="width: 50%;">
</colgroup>
<thead>
<tr>
<th class="tableblock halign-left valign-top">java.io.File (class)</th>
<th class="tableblock halign-left valign-top">java.nio.file.Path (interface)</th>
</tr>
</thead>
<tbody>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock"><code>file = new File("path/to/file.txt")</code></p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock"><code>path = Paths.get("path/to/file.txt")</code></p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock"><code>file = new File(parentFile, "file.txt")</code></p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock"><code>path = parentPath.resolve("file.txt")</code></p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock"><code>file.getFileName()</code></p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock"><code>path.getFileName().toString()</code></p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock"><code>file.getParentFile()</code></p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock"><code>path.getParent()</code></p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock"><code>file.mkdirs()</code></p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock"><code>Files.createDirectories(path)</code></p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock"><code>file.length()</code></p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock"><code>Files.size(path)</code></p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock"><code>file.exists()</code></p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock"><code>Files.exists(path)</code></p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock"><code>file.delete()</code></p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock"><code>Files.delete(path)</code></p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock"><code>new FileOutputStream(file)</code></p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock"><code>Files.newOutputStream(path)</code></p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock"><code>new FileInputStream(file)</code></p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock"><code>Files.newInputStream(path)</code></p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock"><code>file.listFiles(filter)</code></p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock"><code>Files.list(path) .filter(filter) .collect(toList())</code></p></td>
</tr>
</tbody>
</table>
<div class="paragraph">
<p>Some additional notes:</p>
</div>
<div class="ulist">
<ul>
<li>
<p><code>Path</code> throws <code>IOException</code> more often than <code>File</code>, and rarely returns a <code>boolean</code> to tell whether something was done (<code>mkdirs()</code>, <code>delete()</code>)</p>
</li>
<li>
<p><code>File</code> is more object oriented than <code>Path</code>: I regret that <code>size()</code>, <code>exists()</code>…​ methods are not on the <code>Path</code> interface. This is probably due to the fact that this API was added in Java 7, but default methods on interfaces were added later in Java 8.</p>
</li>
<li>
<p><code>Path</code>-based <code>InputStream</code>/<code>OutputStream</code>s are less expensive from a GC point of view. Thanks <a href="https://twitter.com/thekittster/status/905326864251670532">@kittster</a> for mentioning <a href="https://www.cloudbees.com/blog/fileinputstream-fileoutputstream-considered-harmful">this article from Cloudbees</a>.</p>
</li>
</ul>
</div>
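<div class="paragraph">
<p>Migrating code incrementally is easy because the two APIs interoperate. A minimal sketch bridging them with <code>File.toPath()</code> and <code>Path.toFile()</code>:</p>
</div>
<div class="listingblock">
<div class="content">

```java
import java.io.File;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FilePathInterop {
    public static void main(String[] args) {
        File file = new File("path/to/file.txt");
        // Legacy File to NIO2 Path
        Path path = file.toPath();
        // NIO2 Path back to legacy File, for APIs which only accept File
        File back = path.toFile();
        System.out.println(path.equals(Paths.get("path/to/file.txt"))); // true
        System.out.println(back.equals(file)); // true
    }
}
```

</div>
</div>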
</div>
</div>
<div class="sect1">
<h2 id="one_liners">One liners</h2>
<div class="sectionbody">
<div class="paragraph">
<p><code>java.nio.file.Files</code> allows reading, writing, and copying files in a single line:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="java">Files.write(Paths.get(<span class="string"><span class="delimiter">"</span><span class="content">image.png</span><span class="delimiter">"</span></span>), bytes); <i class="conum" data-value="1"></i><b>(1)</b>
<span class="predefined-type">List</span><<span class="predefined-type">String</span>> lines = Files.readAllLines(Paths.get(<span class="string"><span class="delimiter">"</span><span class="content">letter.txt</span><span class="delimiter">"</span></span>), StandardCharsets.UTF_8); <i class="conum" data-value="2"></i><b>(2)</b>
Files.lines(Paths.get(<span class="string"><span class="delimiter">"</span><span class="content">letter.txt</span><span class="delimiter">"</span></span>), StandardCharsets.UTF_8)
.forEach(<span class="predefined-type">System</span>.out::println);</code></pre>
</div>
</div>
<div class="colist arabic">
<table>
<tr>
<td><i class="conum" data-value="1"></i><b>1</b></td>
<td>Write a binary file</td>
</tr>
<tr>
<td><i class="conum" data-value="2"></i><b>2</b></td>
<td>Read a text file</td>
</tr>
</table>
</div>
<div class="paragraph">
<p>This nearly makes Guava IO and Commons IO useless. I regret that there isn’t any method out of the box to read/write a whole file as a single string.</p>
</div>
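<div class="paragraph">
<p>One workaround is to go through a byte array. A minimal sketch; the file name is arbitrary:</p>
</div>
<div class="listingblock">
<div class="content">

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class WholeFile {
    public static void main(String[] args) throws IOException {
        Path path = Paths.get("letter.txt");
        // Write a whole String in one call...
        Files.write(path, "Dear John,\nGoodbye.".getBytes(StandardCharsets.UTF_8));
        // ...and read it back as a single String through a byte array
        String content = new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
        System.out.println(content.startsWith("Dear John"));
        Files.delete(path);
    }
}
```

</div>
</div>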
<div class="paragraph">
<p>Many APIs (JAXB and Jackson, to name a few) don’t use <code>Path</code>s to read/write files; the workaround is usually to use an <code>InputStream</code> or an <code>OutputStream</code>.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="java"><span class="keyword">try</span>(<span class="predefined-type">InputStream</span> inputStream = Files.newInputStream(path)) {
Thing thing = (Thing) unmarshaller.unmarshal(inputStream);
}</code></pre>
</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="multiple_file_systems">Multiple file systems</h2>
<div class="sectionbody">
<div class="paragraph">
<p>While <code>File</code> is only for local files, <code>Path</code> can also be used to access remote files.
A <code>Path</code> is associated with a <code>FileSystem</code>.</p>
</div>
<div class="paragraph">
<p>To create a new <code>Path</code> instance, there is no constructor (<code>Path</code> is an interface); we need to call a factory method. The following two lines are equivalent:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="java">path = Paths.get(<span class="string"><span class="delimiter">"</span><span class="content">path/to/file.txt</span><span class="delimiter">"</span></span>);
path = FileSystems.getDefault().getPath(<span class="string"><span class="delimiter">"</span><span class="content">path/to/file.txt</span><span class="delimiter">"</span></span>);</code></pre>
</div>
</div>
<div class="paragraph">
<p>As the default file system is the local one, you get a path to a local file.
Depending on the underlying file system, you’ll get a different implementation: <code>sun.nio.fs.UnixPath</code>, <code>sun.nio.fs.WindowsPath</code>…​</p>
</div>
<div class="paragraph">
<p>With this trick in mind, we can read the content of a Zip file, as if we had extracted it:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="CodeRay highlight"><code data-lang="java"><span class="predefined-type">URI</span> zipUri = <span class="keyword">new</span> <span class="predefined-type">URI</span>(<span class="string"><span class="delimiter">"</span><span class="content">jar:file:/path/to/archive.zip</span><span class="delimiter">"</span></span>);
<span class="keyword">try</span>(FileSystem zipFS = FileSystems.newFileSystem(zipUri, emptyMap())) { <i class="conum" data-value="1"></i><b>(1)</b>
Path zipPath = zipFS.getPath(<span class="string"><span class="delimiter">"</span><span class="content">/archive</span><span class="delimiter">"</span></span>); <i class="conum" data-value="2"></i><b>(2)</b>
Files.list(zipPath)
.map(Path::toString)
.forEach(<span class="predefined-type">System</span>.out::println);
}</code></pre>
</div>
</div>
<div class="colist arabic">
<table>
<tr>
<td><i class="conum" data-value="1"></i><b>1</b></td>
<td>"Mount" the Zip file as a file system</td>
</tr>
<tr>
<td><i class="conum" data-value="2"></i><b>2</b></td>
<td><code>zipPath</code> is of type <code>com.sun.nio.zipfs.ZipPath</code></td>
</tr>
</table>
</div>
<div class="paragraph">
<p>You can even plug additional file systems: <a href="http://docs.oracle.com/javase/7/docs/technotes/guides/io/fsp/zipfilesystemprovider.html">ZIP</a>, <a href="https://github.com/maddingo/nio-fs-provider">SFTP, SMB, WebDAV</a>, <a href="https://github.com/lucastheisen/jsch-nio">SSH/SCP</a>, <a href="https://github.com/Upplication/Amazon-S3-FileSystem-NIO2">Amazon S3</a>, <a href="https://github.com/google/jimfs">In memory</a>, <a href="https://github.com/damiencarol/jsr203-hadoop">HDFS</a>, …​
This means you can read a remote file almost as if it were local.</p>
</div>
</div>
</div>Gérald Quintana