Get Inspire insights from former attendees in our AMA discussion thread on Inspire Buzz. ACEs and other community members are on call all week to answer!

Alteryx Designer Cloud Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Cloud.
SOLVED

How to anonymize data e.g. personal information into a anonymized format ?

TrifactaUsers
10 - Fireball

Description:

For a project, I need to anonymize a set of values in my data set. For example: There is a reporter row with a user name e.g. "mtruss". For each "mtruss", I want to anonymize the value to "Reporter 1".

 I have to repeat this for every unique user name.

 

As I need to create a classification / pattern recognition from the users and their relationship to other rows, it is important, that the "mtruss" is always "Reporter 1". For the next user e.g. "jdoe" it would always be "Reporter 2". This means: The relationship of "Reporter 1" to a certain entry should remain intact and representative.

 

Conditions:

I have access to a list (CSV) of each unique value, that needs to be anonymized. In this case all user names.

 

Question:

Is there a standardized way to do this with Dataprep?

 

6 REPLIES 6
Trifacta_Alumni
Alteryx Alumni (Retired)

Hi @Mario Truß? ,

 

Thank you for providing such excellent details about your task! This is definitely doable in Dataprep. You can use Dataprep to convert your list of unique values into a mapping table, and then join that mapping table back to your original dataset. Here's how this will work:

 

1. Import your CSV of unique values into Dataprep. Add that dataset to a flow, and create a new recipe. We will use this recipe to generate the mapping of usernames to anonymized values.

 

2. Open the recipe and add a step using the "New formula" transformation.

 

3. You can use the $sourcerownumber metadata parameter to generate unique IDs for each user name. Since you are looking to also add the word "Reporter", you can use the MERGE function to concatenate "Reporter" and the row number for each unique value. Here is the formula to enter in the "Formula" field:

MERGE(['Reporter ', $sourcerownumber])

Your preview and builder should look like the following image:

Note that the first anonymized ID reads "Reporter 2". Since you imported a CSV, the first row of your file contained the header, and your data started on row 2. If you want your anonymized IDs to start at 1 instead of 2, you can modify your formula as shown below:

MERGE(['Reporter ', ($sourcerownumber-1)])

You have now created your mapping table.

 

4. Open the recipe on your primary dataset. Add a join against the recipe that you created in step 3, with the user name fields as the join key. This will use your mapping table to generate anonymized values that are unique to each username.

 

Let me know if this helps! If so, please mark this as "Best Answer" so that other users can see your question has been resolved.

TrifactaUsers
10 - Fireball

Hi @Connor Carreras?,

 

Thanks for your answer. I created the table successfully. And the join also works for fields, where there is only the reporter inside the field.

 

There is also a field called "description", which contains text and sometimes mentions with the username. I would like to replace these mentions with the anonymized value as well.

 

Can you think of a way to replace this quickly? I can only think of replacing each user name step by step with the replace function, but then I would have to repeat the replacement 250 times for each unique user name available.

 

I was thinking of a way to give the replace function the list of user names and anonymized user names. But I don't know how to do that.

 

Further help would be greatly appreciated.

Trifacta_Alumni
Alteryx Alumni (Retired)

After you join your anonymized names back into your source dataset, you will want to add a step that uses the SUBSTITUTE function to find and replace all instances of the username in the "description" column.

 

The SUBSTITUTE function will look like this:

SUBSTITUTE(description,name, anonymized_name)

In this example, "description" is the name of the column that I am modifying, "name" is the name of the column that contains the values that I want to find, and "anonymized_name" is the name of the column that contains the values that I want to replace.

 

Here's how this function will work when applied to the same sample data I used earlier:

You can see that this function finds all occurrences of the username in the description column and replaces each occurrence with the anonymized name.

TrifactaUsers
10 - Fireball

Hey @Connor Carreras?,

 

this is not 100%, what I needed, but close. "mtruss" can occur in any row, not just the one, with "mtruss" is the reporter. I need to find all occurences of "mtruss" in all description.

 

Example would be a Support Ticket, where I am the reporter "mtruss", but I mention "jdoe" in the Description.

 

The result must be:

Reporter: "Reporter 1"

Description: General Support Text + Occurence of "Reporter 2

 

I my head I would need to check each description for any occurence of the anonmized_name list.

 

Makes sense? Any idea on how to accomplish this?

 

Best,

Mario

TrifactaUsers
10 - Fireball

@Connor Carreras? :

Just checked back with the requirements.

 

It would be enough to know, how to delete a list of stopwords from all descriptions. I don't need to replace the values in the descriptions.

 

Can I take a list of stop word e.g. the list "anonymized_name" and delete all occurrences of each value from rows in "description"?

TrifactaUsers
10 - Fireball

@Connor Carreras? any chance you can help me here?