Tableau: Delineate Belgian provinces explained

Tableau Software is a self-service BI-tool that allows data visualization, meaning that even business users should be able to easily visualise their data (without needing help of IT). You can check out the Tableau website for more information about this great tool.

As you may have already seen in several guides or tutorials, Tableau is able to link certain dimensions like postal code, countries, etc. to a certain geographical area. Unfortunately, most of these tutorials use data related to the United States of America. As of today, the Belgian geographical support is at a low point. One thing we can easily do though, is show the data based on provinces using their borders.

In this blog we’ll be showing the Belgian population per province in the year 2012 on a map. The data I used (and which you can see in the image below) can be downloaded from the Belgian Open Data website.

Data used in Excel

I’ve slightly altered the data so that we would have the English name for each province (else Tableau will not recognize it – you can also choose French or German).

I’ve opened Tableau and loaded the data. Make sure that province is set as a geographical data type (State/Province). If this is not the case you can change it by right clicking on province and then selecting “Geographical Role” -> “State/Province”.

Geographic Role - State Province

When using filled maps (=a type of map visualization that Tableau offers which will fill the area according to your chosen data), you can only use one measure. Therefore I’ve added a new calculated field “Total” (see “Analysis” -> “Create Calculated Field”) based on the male and female amount of people.

calculated field - Total population

Now we will select both the province and the total and click on “Filled maps”.

Select province and total for map2

Tableau will automatically colour the provinces according to the amount of people who live there.

Now the only thing left to do is format your layout and then you’re done! I have coloured the provinces based on the % of the total (Red -> High %, Green -> Low %). I’ve also put the % of total in the label because I think showing the normal total would be unclear. The last thing I did was add the amount of males, females, total and total percentage to the label.

end result

You can view the dashboard that I made for this blog on Tableau public. Tableau public is a free tool that Tableau offers which allows you to publish your data on the web for other people to see.

Extra tip: you can control click on multiple provinces to view the sum of the total % for the selected provinces.

Thank you for reading!

Talend: Schema compatibility check

Most of the time when talking about Talend jobs, people think of standard ETL (Extract, Transform, Load). But in some cases there’s the need to check the incoming data before loading them into the target rather than just transforming it. We refer to this process as E-DQ-L (Extract, Data Quality, Load).

One of the things that you might want to check before loading is schema compatibility. For example: you expect to get a String that’s 5 long. If you, for any reason, receive a String that is larger than 5, it will generate an error. Or perhaps you expect a percent (in format BigDecimal like 0.19), but you receive it as a string (“19%”). This example will result into a failing job with an error saying “Type mismatch: cannot convert from dataType to otherDataType”.

Before I continue this blog I would like to emphasize that all the solutions below are possible with the Data Integration version of Talend, except for the last one. The last option requires a Talend Data Quality license.

Let’s create an example case: We want to extract data on a regular basis from a third-party source which we cannot fully trust in terms of schema-settings. We know how many columns we can expect and we have a rough idea of what it contains, but we do not fully trust the source to not give incompatible data. We want to load the records that are valid and we want to separately store the ‘corrupt’ data for logging purposes. I’ve gathered several solutions for this problem:

  1. Use rejected flow on an input-component

One thing you can do is reject the records as soon as you import them. Disable “die on error” on the basic settings tab of you input-component and then right-click it and select “Reject”. The rows will be rejected based on the schema of the file. In the example below we put phone number as an integer and as you can see 1 records is begin rejected. This is because the phone number contains characters and therefore cannot be read as an integer. If you did not disable the “die on error”-option then this component would make the job fail.

reject on input

  1. In case of the target being a database: use rejected links

You can also choose to directly input the data into your database, but to reject any rows that would create an error. You can then create a separate flow to determine what to do with these rejected records.

In your database output component (for example tOracleOutput) change the following:

  • Basic settings: Uncheck “Die on error”
  • Advanced settings: Uncheck “Use batch size”

Now, right-click on your component and select “Row-Reject” and connect it to an output-component. The output you’ll receive will be the rejected rows and what error would have been generated if you tried inserting them, as you can see in the picture below.

rejected rows databank

  1. Use a tFilter-component

You can make the data go through a filter-component before inserting it into your target. You can (manually) decide what’s allowed to go through. This can be useful when your destination is not a database, in which case option 1 is most likely not available.

Schema compatibilty check with tFilterRox

A tFilterRow-component also has the possibility to output the rejected rows, including the reason why they got rejected. You can enable this by right-clicking on your filter and selecting “Row-Reject”. An example of rejected rows by the filter:

rejected rows tFilterRow

Note – You can also use self-defined routines in the tFilterRow-component by checking “Use advanced mode”. This can be useful when you want to check whether or not converting is possible. For example: you could define a routine called “isInterger” that returns true if the conversion is valid and false if it’s impossible.

  1. Use a tSchemaComplianceCheck-component

Another way of making sure that your schema is compatible is by using the tSchemaComplianceCheck-component. Unfortunately, this component is only integrated in the Data Quality version of Talend.

It’s a very easy component to use. The only thing you have to do is connect the incoming data to the tSchemaComplianceCheck-component and then continue its flow to the destination source. You can get the rejected rows the same way as previously (by right clicking on it and then selecting “Row->Reject”).

tSchemaComplianceCheck job

The rejected rows and their error message look like this:

rejected rows tSchemaCompatibilty

That’s it for now. There’s probably a lot of other ways of checking schema compatibility. Feel free to comment if you know any. Thank you for reading!

Talend: tips and tricks part 2

In the first part of these entries we discussed how to test your expressions, the importance of optimizing the appearance of a tLogRow component and how to handle windows and views within Talend. This time around, we will be talking about the different ways to get components into your job, how to trace your dataflow and how to easily sync columns. As last time, this post will be useful for both starting and experienced users.

4. Getting components into your job

There are many ways to get components into your job. Most people search the palette (by either the search-function or by manually exploring the folders) and drag/drop the components into their job. You can achieve the same thing by simply clicking on a random place in your job and then type the name of the component. Obviously this is only recommended once you’re familiar with the different components and their names.

insert component

When working with metadata, you can use certain shortcuts to save a bit of time. Usually people just click on the metadata and then drop it onto their job. This will pop up a window allowing you to choose which type of component you want to use. Holding the Control-key while dragging the component will directly create an Output-component. Holding Control+Shift will result into an Input-component.

5. Syncing columns

Occasionally, you may have to change the schema of a certain component in the middle of development. This might affect other components in your job. In some cases, Talend asks if you want to propagate the changes you’ve made (to the other components).

propogate changes

You may accidently close this window, click “No” or not get this message at all, resulting in the following error: “The schema from the input link “youroutputlink” is different from the schema defined in the component”.

different schema error

When this happens, you can go to the basic settings of the component that has the error and click on “Sync columns”. The error should now be gone.

schema sync columns

6. Tracing your dataflow (Debug Run)

Lastly, I would like to say a few words about the debug run. In some cases we want to closely watch our dataflow in order to get a better understanding of what’s exactly happening. You can achieve this by running your job in debug mode. This can be done by clicking on the Run-window, then click on the “Debug Run” tab on the left side of the window and start it by clicking on “Traces Debug”.

debug run

The moment you open the “Debug run” tab, you’ll immediately see extra icons in your job. These magnifying glass icons indicate that details will be shown when you debug-run your job. The result should look something like this:

debug run example

You can Pause and Resume the run at any time. You can also add breakpoints if you like. Do this by right-clicking on a dataflow and then selecting “Show Breakpoint Setup”.

show breakpoint setup

This brings you to the “Breakpoint” tab of the data flow you clicked on. You can also go there by clicking on the specific flow and manually selecting “Breakpoint”. Let’s add a breakpoint to pause our run whenever we come across a record with “Bloom” as last name. Firstly, make sure to check the “Activate conditional breakpoint” option. After that, click on the plus-icon underneath the conditions. Then select the InputColumn we want to put our condition on, in our case this is “Last_name”, and add a value (“Bloom” in this example). The default Operation is “Equals”, which is the one we want. You can also specify an Operation if you need to, but this is unnecessary for this case.

component breakpoint properties

You can add multiple breakpoints if you like. Whenever you debug run your job now, it will stop at a record where the Last_name is “Bloom” (if any exist).

That’s it for now. Thank you for reading!

Talend: tips and tricks part 1

This blog contains some convenient tips and tricks that will make working with the open source tool Talend for data integration a lot more efficient. This blogpost will be especially useful for people who are just discovering this amazing tool, yet I am sure that people who have been using it for a while will also find it very helpful. These series of tips will be spread over multiple blog entries so make sure to check back often for future tips!

1. Testing expressions in the tMap component

Using the tMap component, you have the possibility to test your expressions. This way you can easily see whether or not the result is what you expected it to be. You can also use this to determine whether or not your expression will error. Let’s create an example.

We’ve got details of employees as input for our tMap. We would like the first name to be shown in uppercase. First of all, go into the expression builder by clicking the ellipsis next to your expression.

Ellipsis expression builder

To convert the first name to uppercase, we have to use the StringHandling function “UPCASE”. This will result in the following expression: StringHandling.UPCASE(employee.First_name)

After you’re done filling in test values, click on the “Test!” button and wait for the result. If everything goes as expected, you should see your first name in uppercase on the right side of the window.

Test expression builder tMap

2. Optimizing the appearance of the tLogRow component output

tLogRow is one of the most frequently used components. It is recommended that you learn how to optimize its use. Firstly, make sure that you always have the right appearance selected for your output. You can find this property in the basic settings of your tLogRow-component.

tLogRow modes

There are three types of Modes that you can choose between:

  • Basic

Basic will generate a new line for each record, separated by the “Field Separator” you’ve chosen (see image above). When using basic mode, I highly recommend to check the “Print header” option when working with multiple column records or multiple outputs, purely for visibility reasons.

basic mode output

  • Table (print values in cells of a table)

The table mode shows the records and their headers in a table-format, including the name of the component that generated this output (in our case: “tLogRow_1”). This emphasizes the importance of properly naming everything, especially when you have multiple components that generate output. In this case, it would have been better to rename our component to “EMPLOYEES”. Personally, I prefer this mode.

table mode output

  • Vertical (each row is a key value/list)

Vertical mode will show a table for each one of your records.

vertical mode output

The output mode you decide to use depends on what you’re trying to visualize. For example, when your goal is to show a single string, I would recommend using the basic mode. But when you have multiple table outputs (for example: departments, customers and employees in a single output), I’m certain the table mode would be the best option.

Sometimes your data is spread over multiple lines, resulting in an unclear output, like shown in the image below.

output with wrap

To force the output to put all the data on one single line, you can uncheck the “Wrap” option. This option is located underneath your output and will enable a horizontal scrollbar.

output without wrap

Do you also want to be able to get data regarding tweets using Talend, as shown in the image above? Read my previous blogpost and find out how!

3. Resetting windows and maximizing/minimizing them

Sometimes you accidently close a window and have a hard time finding a way to get it back. You can very easily reset your environment by clicking on “Window” – “Reset Perspective”.

reset perspective

You can see all of the views by clicking on “Windows” – “Show View” – “Talend”. Some of the views are not shown by default, such as “Modules”. Modules can be used to import .jar-files without having to restart your studio, which will most likely save you some time.

Lastly, because Talend is Eclipse-based, you have the possibility to maximize and minimize windows. I personally use this function when examining the output of a tLogRow-component including a lot of data. You can achieve this by either double-clicking on the window or by right-clicking on it and selecting “Minimize”/”Maximize”.

That’s it for now. I hope you enjoyed reading this blog and make sure to return soon for future blogs!

Using Talend to read tweets

A project regarding visual discovery required me to look into the possibilities of reading tweets. During my search for a suitable method to accomplish this, I came across the software Talend Open Studio. This is an open source data integration product which allowed me to fulfil my need to read tweets in a very easy way. You won’t have to manually use the Twitter API to get raw JSON, but have the possibility to load and transform the structured data into a file.

In this post we’ll go through the steps you have to take to get the Talend software to use this API and get you the tweets you want. Please keep in mind that, in order to successfully do this, you will need a Twitter account to authenticate yourself and some custom components. Both will be discussed later on.

As you might know, it used to be possible to get an unlimited amount of tweets for free using the Twitter API v1. Unfortunately, Twitter decided to retire this API and replace it with the Twitter API v1.1 which requires you to authenticate before you get the (limited amount of) tweets you want.

Step 1: Adding custom Twitter components

Unfortunately, the components you need aren’t included in Talend by default. Therefore, you will need to add the following three items to your palette: tTwitterOAuth, tTwitterOAuthClose and tTwitterInput. These are made by Gabriele Baldassarre and can be downloaded on his website: http://gabrielebaldassarre.com. Place the components in a location that’s easy to remember, keeping in mind that you’ll need to use the locations path in a second. For example:

Location example

Now we’re going to add these to our palette. In Talend, go to “Window – Preferences – Talend – Components” and fill in the correct path next to ‘User component folder:’.

User component folder

Click on Apply/ok. As you can see the components can now be found in your palette:

Twitter palette

Step 2: Configuring tTwitterOAuth

tTwitterOAuth is responsible for the connection and authentication towards Twitter. As said before, in order to use this component you will need a Twitter account. Add the component to your job.

tTwitterOAuth

As you can see in the components settings, it requires 4 different strings (consumer key, consumer secret, access token and access token) in order to work. We’ll now explain how to get these.

Go to the Twitter developers website (https://dev.twitter.com/) and click on “My applications” (click on your profile picture in the upper right corner).

Twitter My applications

New Twitter application

Create a new application by clicking on “Create New App” (as seen in the picture above) and complete the given form. If you don’t have a website you can use a dummy link, for example your localhost (http://127.0.0.1).

Twitter application form

Upon returning to the applications page, you’ll see your recently created application. Go to its detail page by clicking on it.  If you click on “Test OAuth” you’ll find the information you need in order for your Talend component to work. This button can be found in the upper right corner as shown in the image below:

Test OAuth

Copy the consumer key, consumer secret, access token and access token secret into the matching fields of the tTwitterOAuth component. Keep in mind that these are strings and should therefore be surrounded by double quotes.

Settings tTwitterOAuth

Step 3: Linking tTwitterInput with tTwitterOAuth and tTwitterOAuthClose

Add a tTwitterInput component to your job. Whenever tTwitterOAuth has successfully authenticated us, we’ll tell our job to proceed to the tTwitterInput component. Do this by right-clicking tTwitterOAuth, selecting “Trigger – On subject OK“ and then clicking on tTwitterInput. The components are now linked. At this point, your job should look like this:

connection tTwitterOAuth/tTwitterInput

When tTwitterInput is done, we want to close our authentication. In order to do this, first we need to add a tTwitterOAuthClose component to the job. Next, tell the tTwitterInput component to proceed to the tTwitterOAuthClose component as done before: right click tTwitterInput, select “Trigger – On subject OK” and click on tTwitterOAuthClose.

Connection Twitter components

Step 4: Configuring tTwitterInput

tTwitterInput error

As you can see, at this moment we get the following error: “Parameter (Conditions) must have at least one value”, meaning we can’t just get any tweets we want but we have to specify at least one condition in order for this component to be able to search for tweets. For example, let’s say we only want tweets containing the hashtag Oracle. This is achieved by double-clicking on tTwitterInput and adding a condition (click on the green plus-sign) telling tTwitterInput to include the term “#Oracle”.

tTwitterInput basic settings

The error has now disappeared. You can add multiple operations and even decide whether they should have an AND or OR-structure. You can even add a filter telling it to only get negative or positive tweets, so technically this could be used for sentimental analysis.

Now we’re going to edit the column mapping in order to get the information we want. Keep in mind that as an example we’re only going to add a couple of basic operations. As you will see in the complete list there are many more possibilities then we’ll be showing. Click on the triple dots next to “Edit schema”.

tTwitterInput schema

TweetId is present in the schema by default. Let’s go ahead and add the content of the tweet (Text) and the date on which is was published (CreationDate).

tTwitterInput schema details

After you’re done editing the schema, click on OK. Now we have to change the operations of the column mapping to the right values. Change them to the following:

tTwitterInput mapping

Lastly, let’s specify some other characteristics of the tweets we want the component to return. Click on the “Advanced settings”-tab.  Let’s say we only want recently published information and limit the amount of tweets it should return by 100.

tTwitterInput advanced settings

As you can see, there are multiple conditions that you can edit.  Keep in mind that if you decide to specify a date, this should be done in a string-format (therefore, don’t forget to add double quotes or it won’t work). Also, there’s a limit on how far you can go back in time. As far as I’m aware the API is limited so that you can only receive tweets that have been published in the last week.

You’re now done and ready to get some tweets!

Step 5: Testing

Let’s put the received data into a CSV-document. Add a tFileOutputDelimited component to your job. Link it with your tTwitterInput by right clicking and selecting “Row – Structured” (you can also decide to write raw JSON if you prefer this) and clicking on tFileOutputDelimited. Edit the basic settings of your output file if you’d like to and run the job.

Twitter end result

Open the created file and voila, there are your tweets!

Tweets example