Talend MDM: How to use validation rules

When creating an MDM data model, Talend offers you standard constraint options.
You can make fields mandatory or optional by setting their minimum and maximum occurrence, and you can restrict a field to a fixed set of values.
Sometimes these options aren't enough: you may also want custom validations, such as email or URL validation. For this kind of requirement, Talend lets you define validation rules.

To illustrate this, I've created the entity 'Customer'. As you can see, FirstName and LastName are mandatory fields. For my business case, however, this data model doesn't meet all my requirements: I also want to validate the PostalCode (Belgian format: '9999') and the Email before the record is saved.

Datamodel

How can you solve this within Talend?

You can create a validation rule by right-clicking the Customer entity and selecting Set the Validation Rule, as shown below.

set validation rules

A window will pop up where you need to fill in a name for your validation rule. After you’ve chosen a name the following window will be shown.

set validation rule builder

You can add rules by clicking on:

select xpath

For each rule you have to select an XPath, create an expression and set a message. You can add each of these by clicking "…" in the corresponding field.

When creating an expression you can use the expression builder by clicking on “…”. The builder provides lots of predefined functions.

expression builder
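
For the two checks in this example, the rule entries could end up looking roughly like this. This is a sketch: it assumes the XPath matches() function offered by the expression builder, and the paths, patterns and messages are illustrative.

    XPath:      PostalCode
    Expression: matches(PostalCode, "^[0-9]{4}$")
    Message:    "Postal code must match the Belgian format 9999."

    XPath:      Email
    Expression: matches(Email, "^[^@\s]+@[^@\s]+\.[^@\s]+$")
    Message:    "Please enter a valid email address."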

After you have created your validation rules, you have to deploy your data model to the MDM server. Once it's deployed, you can test the validation rules by creating a new record. When you enter an invalid postal code or email address, you'll get the following message:

invalid postalcode

When entering the correct information you’ll get the following message:

save successfully

As you can see, we've created a single validation rule set containing several rules. A record can only be saved if all the rules in the set are met.

Talend: Schema compatibility check

Most of the time when talking about Talend jobs, people think of standard ETL (Extract, Transform, Load). But in some cases you need to check the incoming data before loading it into the target, rather than just transforming it. We refer to this process as E-DQ-L (Extract, Data Quality, Load).

One of the things that you might want to check before loading is schema compatibility. For example: you expect a String of length 5. If, for any reason, you receive a longer String, it will generate an error. Or perhaps you expect a percentage (as a BigDecimal like 0.19), but you receive it as a String ("19%"). This will result in a failing job with an error saying "Type mismatch: cannot convert from dataType to otherDataType".
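
To make the second example concrete: converting "19%" into the BigDecimal 0.19 is something you would have to code yourself, for instance in a Talend routine. Below is a minimal sketch; the routine name is made up for illustration.

    // Hypothetical routine: turns a percentage string such as "19%" into 0.19.
    // Returns null when the input cannot be parsed, so the caller can reject the row.
    public static java.math.BigDecimal parsePercent(String value) {
        if (value == null) {
            return null;
        }
        try {
            String cleaned = value.replace("%", "").trim();
            return new java.math.BigDecimal(cleaned).divide(new java.math.BigDecimal("100"));
        } catch (NumberFormatException e) {
            return null;
        }
    }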

Before I continue this blog, I'd like to emphasize that all the solutions below are possible with the Data Integration version of Talend, except for the last one, which requires a Talend Data Quality license.

Let's create an example case: we want to extract data on a regular basis from a third-party source that we cannot fully trust in terms of schema settings. We know how many columns to expect and have a rough idea of what they contain, but we do not fully trust the source not to send incompatible data. We want to load the valid records and store the 'corrupt' data separately for logging purposes. I've gathered several solutions for this problem:

  1. Use a reject flow on an input component

One thing you can do is reject records as soon as you import them. Disable "die on error" on the basic settings tab of your input component, then right-click it and select "Reject". The rows will be rejected based on the schema of the file. In the example below we defined the phone number as an integer and, as you can see, one record is being rejected: its phone number contains characters and therefore cannot be read as an integer. If you had not disabled the "die on error" option, this component would have made the job fail.

reject on input

  2. If the target is a database: use reject links

You can also choose to directly input the data into your database, but to reject any rows that would create an error. You can then create a separate flow to determine what to do with these rejected records.

In your database output component (for example tOracleOutput) change the following:

  • Basic settings: Uncheck “Die on error”
  • Advanced settings: Uncheck “Use batch size”

Now right-click your component, select "Row-Reject" and connect it to an output component. The output you receive will contain the rejected rows and the error that would have been generated if you had tried inserting them, as you can see in the picture below.

rejected rows databank

  3. Use a tFilterRow component

You can make the data go through a filter component before inserting it into your target. You decide (manually) what's allowed through. This can be useful when your destination is not a database, in which case option 2 is most likely not available.

Schema compatibility check with tFilterRow

A tFilterRow component can also output the rejected rows, including the reason why they were rejected. You can enable this by right-clicking your filter and selecting "Row-Reject". An example of rows rejected by the filter:

rejected rows tFilterRow

Note – You can also use self-defined routines in the tFilterRow component by checking "Use advanced mode". This can be useful when you want to check whether a conversion is possible. For example, you could define a routine called "isInteger" that returns true if the conversion is valid and false if it's impossible.
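
As a sketch, such a routine could look like the code below; the class name and column name are illustrative, not part of Talend. In advanced mode, the filter condition would then be something like CheckRoutines.isInteger(input_row.phoneNumber).

    public class CheckRoutines {

        // Returns true when the given string can be parsed as an integer,
        // so tFilterRow only lets a row pass when the conversion is safe.
        public static boolean isInteger(String value) {
            if (value == null) {
                return false;
            }
            try {
                Integer.parseInt(value.trim());
                return true;
            } catch (NumberFormatException e) {
                return false;
            }
        }
    }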

  4. Use a tSchemaComplianceCheck component

Another way of making sure that your schema is compatible is to use the tSchemaComplianceCheck component. Unfortunately, this component is only included in the Data Quality version of Talend.

It's a very easy component to use. The only thing you have to do is connect the incoming data to the tSchemaComplianceCheck component and then continue the flow to its destination. You can get the rejected rows the same way as before (by right-clicking the component and selecting "Row-Reject").

tSchemaComplianceCheck job

The rejected rows and their error message look like this:

rejected rows tSchemaComplianceCheck

That's it for now. There are probably a lot of other ways of checking schema compatibility. Feel free to comment if you know any. Thank you for reading!

Use of contexts within Talend

When developing jobs in Talend, it's sometimes necessary to run them on different environments. For other business cases, you need to pass values between multiple subjobs in a project. To solve these kinds of issues, Talend introduced the notion of "contexts".

In this blogpost we elaborate on the usage of contexts for easily switching between a development and a production environment by storing the connection data in context variables. This allows you to determine on which environment the job should run, at runtime, without having to recompile or modify your project.

To start using contexts in Talend you have two possible scenarios:
1) you can create a new context group and its corresponding context variables manually, or
2) you can export an existing connection as a context.
In this example we’ll go over exporting an existing Oracle connection as a context.

Double-click an existing database connection to edit it and click Next. Then click Export as context.

Image

NOTE There are some connections that don’t allow you to export them as a context. In that case you’ll have to create the context group and its variables manually, add the group/variables to your job, and use the variables in the properties of the components of your job.
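
For example, once the variables exist, the connection properties of a tOracleConnection component could be filled in with expressions like the ones below. The variable names are illustrative; Talend exposes each context variable in the generated Java code as context.<name>.

    Host:     context.HR_Server
    Port:     context.HR_Port
    Database: context.HR_SID
    Login:    context.HR_Login
    Password: context.HR_Password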

After you’ve clicked the Export as context button you’ll see the Create/Edit context group screen. Enter a name, purpose and description and click Next.

Image

Now you’ll see all the context variables that belong to this context group. Notice that Talend has already created all the context variables that are needed for the HR connection. If you want to change their names you can simply click them and they become editable.

Click the Values as table tab.

Image

In the Values as table tab you can edit the values of the context variables by simply clicking the value and changing it. To add a new context, click the context symbol in the upper right corner.

Image

The window that pops up is used to manage contexts. To create a new context, click New, enter the name of the context, in our example Production, and click Ok. To rename the Default context, select it, click Edit, enter Development and click Ok. When you’re done editing, click Ok.

Image

After the window closes, you’ll see that an extra column appeared. Enter the connection data of the production environment in the Production column and click Finish.

Image

In the connection window it's possible to check the connection again, but this time you'll be prompted to choose the context you want to check it against.

Image

Verify that both connections work and click Finish.

Now that we've exported the connection as a context, it's possible to use it in a job. Create a new job, use the connection that was exported as a context and connect it to a tLogRow component. Your job should look something like this:

Image

When using a connection that has been exported as a context in a job, you have to include the context variables in order for your job to be able to run. Go to the context tab and click the context button in the bottom left.

NOTE When using one of the newer versions of Talend, the Studio proposes to add missing context variables whenever you try to run a job, so you don't need to add them manually as described in this example.

Image

Select the context group that contains the context variables, in our case the HR context group.

Image

Select the contexts you want to include and click OK.

Image

NOTE A context group can also be added to a job by simply selecting the context from the repository, dragging it towards the context tab of the job, and dropping it there.

Once you’ve added the context group to the job, it’s possible to run the job for both the development and production environment by selecting the context in the dropdown menu of the Run tab.

Image
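
The same choice exists outside the Studio: when you export the job as a standalone script, the generated launcher accepts the context as a parameter, along the lines of the example below (the script name depends on your job; individual variables can also be overridden with --context_param name=value):

    ./HRJob_run.sh --context=Production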

Using Talend to read tweets

A project regarding visual discovery required me to look into the possibilities of reading tweets. During my search for a suitable method to accomplish this, I came across Talend Open Studio. This open source data integration product allowed me to read tweets in a very easy way: you don't have to call the Twitter API manually and parse raw JSON, but can load and transform the structured data into a file instead.

In this post we’ll go through the steps you have to take to get the Talend software to use this API and get you the tweets you want. Please keep in mind that, in order to successfully do this, you will need a Twitter account to authenticate yourself and some custom components. Both will be discussed later on.

As you might know, it used to be possible to get an unlimited amount of tweets for free using the Twitter API v1. Unfortunately, Twitter decided to retire this API and replace it with the Twitter API v1.1 which requires you to authenticate before you get the (limited amount of) tweets you want.

Step 1: Adding custom Twitter components

Unfortunately, the components you need aren't included in Talend by default. Therefore, you will need to add the following three items to your palette: tTwitterOAuth, tTwitterOAuthClose and tTwitterInput. These are made by Gabriele Baldassarre and can be downloaded from his website: http://gabrielebaldassarre.com. Place the components in a location that's easy to remember, keeping in mind that you'll need the location's path in a second. For example:

Location example

Now we’re going to add these to our palette. In Talend, go to “Window – Preferences – Talend – Components” and fill in the correct path next to ‘User component folder:’.

User component folder

Click Apply/OK. As you can see, the components can now be found in your palette:

Twitter palette

Step 2: Configuring tTwitterOAuth

tTwitterOAuth is responsible for the connection and authentication towards Twitter. As said before, in order to use this component you will need a Twitter account. Add the component to your job.

tTwitterOAuth

As you can see in the component's settings, it requires 4 different strings (consumer key, consumer secret, access token and access token secret) in order to work. We'll now explain how to get these.

Go to the Twitter developers website (https://dev.twitter.com/) and click on “My applications” (click on your profile picture in the upper right corner).

Twitter My applications

New Twitter application

Create a new application by clicking on “Create New App” (as seen in the picture above) and complete the given form. If you don’t have a website you can use a dummy link, for example your localhost (http://127.0.0.1).

Twitter application form

Upon returning to the applications page, you’ll see your recently created application. Go to its detail page by clicking on it.  If you click on “Test OAuth” you’ll find the information you need in order for your Talend component to work. This button can be found in the upper right corner as shown in the image below:

Test OAuth

Copy the consumer key, consumer secret, access token and access token secret into the matching fields of the tTwitterOAuth component. Keep in mind that these are strings and should therefore be surrounded by double quotes.

Settings tTwitterOAuth

Step 3: Linking tTwitterInput with tTwitterOAuth and tTwitterOAuthClose

Add a tTwitterInput component to your job. Whenever tTwitterOAuth has successfully authenticated us, we'll tell our job to proceed to the tTwitterInput component. Do this by right-clicking tTwitterOAuth, selecting "Trigger – On Subjob Ok" and then clicking on tTwitterInput. The components are now linked. At this point, your job should look like this:

connection tTwitterOAuth/tTwitterInput

When tTwitterInput is done, we want to close our authentication. In order to do this, first add a tTwitterOAuthClose component to the job. Next, tell the tTwitterInput component to proceed to the tTwitterOAuthClose component as done before: right-click tTwitterInput, select "Trigger – On Subjob Ok" and click on tTwitterOAuthClose.

Connection Twitter components

Step 4: Configuring tTwitterInput

tTwitterInput error

As you can see, at this moment we get the following error: “Parameter (Conditions) must have at least one value”, meaning we can’t just get any tweets we want but we have to specify at least one condition in order for this component to be able to search for tweets. For example, let’s say we only want tweets containing the hashtag Oracle. This is achieved by double-clicking on tTwitterInput and adding a condition (click on the green plus-sign) telling tTwitterInput to include the term “#Oracle”.

tTwitterInput basic settings

The error has now disappeared. You can add multiple conditions and decide whether they should be combined in an AND or OR structure. You can even add a filter telling it to only get negative or positive tweets, so technically this could be used for sentiment analysis.

Now we're going to edit the column mapping in order to get the information we want. Keep in mind that, as an example, we're only adding a couple of basic operations. As you will see in the complete list, there are many more possibilities than we'll be showing. Click on the triple dots next to "Edit schema".

tTwitterInput schema

TweetId is present in the schema by default. Let's go ahead and add the content of the tweet (Text) and the date on which it was published (CreationDate).

tTwitterInput schema details

After you’re done editing the schema, click on OK. Now we have to change the operations of the column mapping to the right values. Change them to the following:

tTwitterInput mapping

Lastly, let's specify some other characteristics of the tweets we want the component to return. Click on the "Advanced settings" tab. Let's say we only want recently published information and limit the number of tweets it should return to 100.

tTwitterInput advanced settings

As you can see, there are multiple conditions that you can edit. Keep in mind that if you decide to specify a date, this should be done in string format (so don't forget to add double quotes or it won't work). Also, there's a limit on how far you can go back in time: as far as I'm aware, the API only returns tweets that have been published in the last week.

You’re now done and ready to get some tweets!

Step 5: Testing

Let's put the received data into a CSV document. Add a tFileOutputDelimited component to your job. Link it to your tTwitterInput by right-clicking tTwitterInput, selecting "Row – Structured" (you can also choose to write raw JSON if you prefer) and clicking on tFileOutputDelimited. Edit the basic settings of your output file if you'd like, and run the job.

Twitter end result

Open the created file and voila, there are your tweets!

Tweets example

Migrate your MS Access data to an Oracle database using the ETL Tool Talend

APEX is promoted as the perfect replacement for MS Access applications. One thing you should consider, though, is how to migrate your data to the Oracle database. APEX contains a handy tool called the Data Workshop that can be used for this. You first export your data as Excel files from the MS Access database, and then follow the data upload wizard to import them into identical tables. Since you are not always working with a 1-1 relationship, you will most likely have to write some PL/SQL to get all the data into the right tables.

dataworkshop

The downside is that you will need to repeat this process when you go into production. This is not a big problem if you only have one table to migrate, but if you have multiple tables and/or your users also want fresh data during tests and trainings, you will spend a lot of time exporting and importing Excel files.

A recent APEX project for a client required a large data migration from MS Access databases to the Oracle database. Because we would need fresh data at several points in the development process, we decided to use the open source ETL tool Talend. We were impressed by how intuitive the tool is: it only took a few days before we were familiar with it. Once you get the hang of it, you can write (or should I say draw?) table migrations in no time. We needed to migrate from an MS Access database, but the tool supports a wide range of databases and documents to import your data from. In total we migrated around 30-40 tables to our Oracle database.

Let’s have a closer look at one of our migration jobs.

talendmsaccessoracle

On the left we see our MS Access database. Each tAccessInput component gets the data from one table. After that we join the tables in our tMap_1 component. The reason we don't just write our joins in one component is that this way we can see exactly how many rows every table returns.

At the bottom we have some Oracle database input connections. They join the persons from our MS Access database with the persons in our Oracle database, based on the national registration number. After that we write our data to the Oracle database. You may notice that we have two lines going to Excel files. This is our error logging; we use it to log the rows that did not find a match. In the first Excel file, for example, we write the persons that did not find a match in our Oracle database.

This is just one example; in total about 20 jobs were built. During development we also had to deal with certain calculations and data conversions. For most things there was a component ready to use, and if there wasn't, you could always write a Java expression in the tMap items.
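
To give an idea of what that looks like: a tMap expression is just a Java expression over the input columns. The two below are illustrative of the kind of cleanup and conversion we did; the column names are made up for this example.

    // Strip any non-digit characters from the national registration number before joining:
    row1.natRegNr == null ? null : row1.natRegNr.replaceAll("[^0-9]", "")

    // Parse a text date coming from MS Access into a real date, using Talend's built-in routine:
    TalendDate.parseDate("dd/MM/yyyy", row1.birthDate)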

I hope I convinced you of the benefits of using Talend as a migration tool for APEX projects, because we will certainly use this tool again!

Connecting to Salesforce and Mailchimp using Talend

A lot of companies use Salesforce to manage their customers and contacts. In addition, Mailchimp can be used for sending out mailings to these connections. Mailchimp also captures information about what people did with these mails, which can be useful information for your CRM. A while ago, I was asked to make a list of everyone who had opened a mail in Mailchimp. Let me show you how easy it is to do something like that with Talend.

In Talend:

  • we can get a list of email addresses from Mailchimp of receivers that opened a mail
  • and we can ask Salesforce for the email addresses and names of all our connections
  • and we can also use a mapping component to join these lists.

Talend has a standard interface with Salesforce. And Mailchimp offers lots of RESTful web services, which we can make use of in our Talend job.

  1. Connecting to Salesforce  

Right-click "Salesforce" under the Metadata and choose "Create Salesforce Connection".

pic1

After choosing a name for our connection, all we need to fill in is the username and password for our Salesforce connection. The rest is already filled in for us.

pic2

To enable the “Finish” button, we need to check our properties first, using the button “Check login”.

Under Metadata, we can now browse through all our Salesforce-data.

pic3

Now you're probably wondering how to use this data in your ETL flow. Well... that's even easier!

Simply drag one of the tables (with the blue icons) into your job and choose the "tSalesforceInput" component from its 3 suggestions.

pic4

After specifying the necessary mappings you should get something like this:

pic5

We’ve used Contact and Account data of Salesforce for this.

In the next part, let’s check out how we generated the list of email addresses.

  2. Connecting to Mailchimp

Accessing your Mailchimp data is a bit harder. We need two components from the Talend palette:

the 'tRest' component, because we need to use a RESTful web service to request our data from Mailchimp, and the 'tExtractJSONFields' component to interpret the data we receive back.

After dragging the tRest component into your job, choose 'POST' as the method and fill in the URL corresponding to the report you wish to receive.

pic6

If you want to receive your report in XML format instead of JSON, just add ".xml" at the end of the URL.

Here we needed the Mailchimp report that gives us information on opened emails.

If you are interested in other kinds of reports, you can find the list here:

http://apidocs.mailchimp.com/api/2.0/#lists-methods

Every request needs certain parameters. We can specify them in the HTTP body field, like this:

"{\"apikey\": \"your api key will be here\",\"cid\": \"put a campaign id here\"}"

The API key will always be needed as the first parameter. You can find it in Mailchimp under 'Account Settings' – 'Extras'.

pic7

pic8

The second component we need is called 'tExtractJSONFields'. After dragging it into our job, we link our first component to it.

pic9

We can use 'Edit schema' to define the data we want to extract.

pic10

Finally, all we need to do is specify the location of the data we are interested in, for example the 'email' field inside the 'member' field.

pic11
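
To make that concrete: the JSON we navigate has, roughly, a shape like the one below (illustrative and trimmed to the field we use), so the extraction loops over the data array and picks the email field inside each member.

    {
      "total": 1,
      "data": [
        { "member": { "email": "someone@example.com" } }
      ]
    }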

Now that we're able to access our data from Mailchimp, let's take a look at how we used it to generate the list of email addresses.

First we asked Mailchimp for all our campaigns, then we used the tFlowToIterate component so we could ask Mailchimp for the email addresses once for every campaign in the list:

pic12

Finally, all we had to do was put these two jobs together and press 'run'.

So... I hope you'll enjoy it as much as I did!