Is it possible to implement ETL as a fully visual process, change it while it is running, and write no code at all?
Yes, it is. With NiFi.
Implementing ETL is one of the most common tasks today. Building an aggregator site, or simply integrating several enterprise applications, sooner or later leads to an ETL task. You can solve it with well-known frameworks such as Apache Camel. But let's try doing it with NiFi.
So, what is NiFi?
NiFi was created by the NSA under the name "NiagaraFiles". In 2014 it was handed over to the Apache Software Foundation (ASF) as part of the NSA technology transfer program. It is currently distributed under the Apache License 2.0.
Its main features are:
- a web-based user interface;
- routes that can be changed at runtime;
- flexible configuration.
You can learn more about the main features of the system in the Apache NiFi Overview. This article covers only the main points: visibility and the ability to make changes at runtime. Let's look at a simple ETL task: reading data from FTP, converting the character set, and uploading the result to a database.
This walkthrough assumes that:
- NiFi is installed and ready for use.
- You have read the NiFi User Interface section of the Apache NiFi User Guide.
NiFi's design is based on the idea of flow-based programming. All input files pass through a chain of connected processors, each of which performs some action. To implement our task we need to build such a chain, which we will call a 'route'. The route starts with a processor that reads data from FTP. NiFi ships with many ready-made processors, so we can simply choose one of them.
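To make the idea of a chain of processors concrete, here is a minimal sketch in Python. It is not NiFi's API: the `FlowFile` class, the `run_route` function, and the two example processors are hypothetical names invented for this illustration. It only shows the principle that each processor receives a file, does one thing, and hands the result to the next one.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FlowFile:
    """Hypothetical stand-in for a file moving through a route: content plus attributes."""
    content: bytes
    attributes: Dict[str, str] = field(default_factory=dict)


# In this sketch a "processor" is just a function from FlowFile to FlowFile.
Processor = Callable[[FlowFile], FlowFile]


def to_upper(flow_file: FlowFile) -> FlowFile:
    """Example processor: upper-case the content, keep the attributes."""
    return FlowFile(flow_file.content.upper(), dict(flow_file.attributes))


def tag_size(flow_file: FlowFile) -> FlowFile:
    """Example processor: record the content size as an attribute."""
    attributes = dict(flow_file.attributes, size=str(len(flow_file.content)))
    return FlowFile(flow_file.content, attributes)


def run_route(route: List[Processor], flow_file: FlowFile) -> FlowFile:
    """Pass one file through every processor in the route, in order."""
    for processor in route:
        flow_file = processor(flow_file)
    return flow_file


result = run_route([to_upper, tag_size], FlowFile(b"hello"))
print(result.content, result.attributes)
```

In NiFi the same chain is assembled by dragging processors onto the canvas and connecting them, so none of this code has to be written.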
The sign in the corner of a processor shows its configuration state (correct or incorrect). The current sign means that the processor's configuration is incorrect. We can see what the problem is.
We set the correct settings for each component:
1. GetFTP processor
2. ConvertCharacterSet processor
3. PutDatabaseRecord processor
4. CSVReader controller service (the record reader used by PutDatabaseRecord)
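For comparison, the work these components do can be sketched by hand in Python. This is only an illustration under several assumptions: the FTP download is skipped and replaced by an in-memory byte string, an in-memory SQLite database stands in for the real target database, and the `people` table, its columns, and both function names are hypothetical.

```python
import csv
import io
import sqlite3


def convert_character_set(data: bytes, input_charset: str, output_charset: str) -> bytes:
    """Roughly what ConvertCharacterSet does: decode from one charset, encode to another."""
    return data.decode(input_charset).encode(output_charset)


def load_csv(connection: sqlite3.Connection, csv_bytes: bytes, charset: str) -> int:
    """Roughly what CSVReader plus PutDatabaseRecord do: parse records, insert them."""
    reader = csv.DictReader(io.StringIO(csv_bytes.decode(charset), newline=""))
    rows = [(row["name"], row["city"]) for row in reader]
    connection.executemany("INSERT INTO people (name, city) VALUES (?, ?)", rows)
    connection.commit()
    return len(rows)


# Stand-in for a file fetched from FTP, encoded as ISO-8859-1.
source = "name,city\nZoë,Malmö\n".encode("iso-8859-1")

converted = convert_character_set(source, "iso-8859-1", "utf-8")

connection = sqlite3.connect(":memory:")  # assumption: SQLite replaces the real database
connection.execute("CREATE TABLE people (name TEXT, city TEXT)")
inserted = load_csv(connection, converted, "utf-8")
stored = connection.execute("SELECT name, city FROM people").fetchall()
```

In NiFi the two charset names, the record reader, and the target table are ordinary property values typed into the processor configuration dialog.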
We also need to define how each type of result produced by a processor should be handled. Let's look at the details of the PutDatabaseRecord processor (SETTINGS tab), for example.
We can see three types of results: failure, retry, and success. We mark failure and success as Automatically Terminate Relationships. In a real system, failure results should definitely be handled differently, for example with a PutEmail or LogAttribute processor, but we are not going to do that now because this is just an example. According to its description, retry means that attempting the operation again may succeed, so we send these results back to the PutDatabaseRecord processor for another attempt.
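The handling of the three result types described above can be sketched as a small routing loop. This is an illustration rather than NiFi code: `route_results` and `flaky_insert` are hypothetical names, and the `max_attempts` cap is an assumption added so the example always terminates (the route in this article simply loops retry results back without such a limit).

```python
from collections import deque
from typing import Callable, Dict, List


def route_results(
    items: List[str],
    attempt: Callable[[str], str],
    max_attempts: int = 3,
) -> Dict[str, List[str]]:
    """Route each item by the relationship that attempt() returns for it.

    attempt(item) must return "success", "failure", or "retry".
    """
    queue = deque((item, 1) for item in items)
    outcome: Dict[str, List[str]] = {"success": [], "failure": []}
    while queue:
        item, tries = queue.popleft()
        relationship = attempt(item)
        if relationship == "retry" and tries < max_attempts:
            queue.append((item, tries + 1))  # loop back to the same processor
        elif relationship == "retry":
            outcome["failure"].append(item)  # assumption: give up after max_attempts
        else:
            outcome[relationship].append(item)
    return outcome


calls: Dict[str, int] = {}


def flaky_insert(item: str) -> str:
    """Hypothetical database insert: 'bad' always fails, 'busy' succeeds on the second try."""
    calls[item] = calls.get(item, 0) + 1
    if item == "bad":
        return "failure"
    if item == "busy" and calls[item] < 2:
        return "retry"
    return "success"


outcome = route_results(["ok", "busy", "bad"], flaky_insert)
```

Here the item that first came back as retry ends up in the success list after its second attempt, which is the behavior we get in NiFi by connecting the retry relationship back to PutDatabaseRecord.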
Our route is ready, and we can see it below:
Each processor shows the number of files it has handled. The content and attributes of each file can be inspected as well. This feature is called Data Provenance in NiFi.
It is very convenient to support a system without having to dig through logs to understand what exactly happened. Do you agree?
How to make changes at runtime
Let's change our route and upload the data to another FTP server, for example. I've stopped the PutDatabaseRecord processor and added a new PutFTP processor.
You can see that the input files are waiting in the queue to PutDatabaseRecord.
You need to change the destination of the existing connection from PutDatabaseRecord to the PutFTP processor, and that is all. Here is our new route. We can see that all the data waiting in the queue of the changed connection was handled by the PutFTP processor.
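Why no data is lost when the destination changes can be shown with a small model. The sketch below is not how NiFi is implemented; the `Connection` class and the two lists standing in for the database and the FTP server are hypothetical. The point is that queued files belong to the connection, not to the processor that was going to read them.

```python
from collections import deque
from typing import Callable, Deque, List


class Connection:
    """Hypothetical model of a connection: a queue plus a changeable destination."""

    def __init__(self, destination: Callable[[str], None]) -> None:
        self.queue: Deque[str] = deque()
        self.destination = destination

    def enqueue(self, flow_file: str) -> None:
        """Files wait here while the destination processor is stopped."""
        self.queue.append(flow_file)

    def drain(self) -> int:
        """Deliver every queued file to whatever the destination is right now."""
        delivered = 0
        while self.queue:
            self.destination(self.queue.popleft())
            delivered += 1
        return delivered


database_rows: List[str] = []  # stand-in for PutDatabaseRecord
ftp_files: List[str] = []      # stand-in for PutFTP

connection = Connection(destination=database_rows.append)
for name in ("a.csv", "b.csv"):
    connection.enqueue(name)  # the database processor is stopped, so files queue up

connection.destination = ftp_files.append  # re-point the connection to the new processor
delivered = connection.drain()
```

After the destination is switched, both queued files arrive at the FTP stand-in and nothing reaches the database stand-in, which mirrors what we observed on the canvas.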
We have changed the route in NiFi while it was running.
Let's sum up. We created a route that solved our simple ETL task. After that, we changed this route at runtime without losing any of the source data. All of this was done without writing any code, and it was rather simple.
I want to add that NiFi has a lot of ready-made processors, but you can also develop your own if you want. It's a really interesting tool with many other features.