Alteryx Designer Cloud Discussions

427eba845ac4c8b94b51 · ‎04-27-2021

Trifacta_Alumni · ‎04-27-2021

Hi @Talha Ardic?,

U.S. addresses without comma delimiters between street, city, state, and zip are really tricky to work with! I would recommend tackling this in a three step process:

Create columns containing the state and zip
Create a column containing the street only
Use the SUBSTITUTE function to delete all of the data in your original address column except the city name

Create columns containing the state and zip

It sounds like you've already successfully split state and zip into new columns. That's awesome! Just for reference, I would accomplish this using the "Extract text or patterns" transformation instead of the "Split on text or pattern" transformation. Doing the operation this way allows me to retain the state and zip data in my original column. These two address elements will be needed later on to define the boundaries for the more complex city and street values. If you want to retain your existing "Split on text or pattern" steps, just make a copy of your original column first.

To create columns containing the state and zip, I used the Trifacta patterns for these address elements in my "Extract text or pattern" step. I chose to extract both state and zip as a single element using this pattern:

`{state-abbrev} {zip}`

This ensures that I don't accidentally extract a street element that looks like a state (for example, "CT" can represent either "Court" or "Connecticut"), or a street number that looks like a zip code.

My recipe step looks like this in the transform builder:

Create a column containing the street only

Next, I want to create a column that contains the street number and name. Once again, I'm going to use the "Extract text or patterns" transformation to accomplish this. Here's how this step will look in the transform builder:

In this case, I'm going to tell Trifacta to extract data that matches the following pattern:

`{any}+`

This will capture any characters. To constrain my matches, I populated the "End extracting before" box with a further pattern:

` {city} {state-abbrev} {zip}`

This prevents Trifacta from extracting past the boundary defined by the city, state-abbrev, and zip patterns. Here is a preview of the data extracted by this step:

You might have noticed that Trifacta includes a "city" pattern. However, without comma delimiters in the address we cannot use the "city" pattern to identify city. This is because the "city" pattern is greedy; it will always match 3 or more words separated by spaces unless given additional constraints. If I were to use the "city" pattern directly, I would end up with the following preview:

You can see that the "city" pattern captures words that are actually part of the street name. This is why I chose to use the "city" pattern to define the end of the street name.

Use the SUBSTITUTE function to delete all of the data in your original address column except the city name

Now that we have created columns that contain the street, state, and zip code, we can use the SUBSTITUTE function to replace all of the data in your original column except the city.

You can use the SUBSTITUTE function in either the "Edit with formula" transformation or the "New formula" transformation, depending on whether you want to modify an existing column in your dataset or create a new column. For this example, I picked the "New formula" transformation. Here's what this step looks like in the transform builder:

I essentially perform two find-and-replace operations on the original data using the SUBSTITUTE function. The inner statement finds values in the original data that match the extracted state and zip code, and replaces those values with an empty string. The outer statement searches the result of the inner statement for values that match the extracted street, and replaces those values with an empty string. You're left with the city name.

This is the nested SUBSTITUTE formula that I used:

SUBSTITUTE(SUBSTITUTE(column1, column3, '', false), column2, '', false)

To tidy up your column further, you can use the "Trim leading and trailing whitespace" task to remove the extra spaces that remain around your city name. You can find this option in the column menu:

Let me know if this helps!

427eba845ac4c8b94b51 · ‎04-27-2021

Hello Connor,

Thank you so much for your response! Unfortunately creating a column containing the street name only step is not working properly as it can't detect 2 or longer word city names (e.g. Santa Barbara, West Chester...). I applied the same steps but it only selects first word of the city name. Please see the screenshot I am attaching. Is there any other way?

Trifacta_Alumni · ‎04-27-2021

I see what you mean. In this case, given the complexity of both the street name and the city name, and the unpredictable variations in the street name patterns, I would recommend downloading a zip code to city mapping table and using that to identify the cities. There should be a number of publicly available zip code mapping tables.

Once you have a mapping table available, you can perform a lookup against that table to create a column of city names. You can then use the SUBSTITUTE function to isolate the street name and number instead of the city name.

If you can't find a suitable mapping table, an alternate approach would be to create a pattern to match all of the possible permutations that would represent the end of a street name. However, be aware that you aren't guaranteed to catch 100% of your street names. In the screenshot you provided, "Calle de Los Amigos" jumps out as a particularly difficult street name to match programmatically. If you want to pursue this route, I've built an initial pattern to start from:

`( (Hwy|Hwy.|HWY|HWY.|hwy|hwy.|Highway|highway|HIGHWAY|Route|ROUTE|route|Rte|Rte.|rte|rte.|RTE|RTE.|rt|rt.|RT|RT.|Rt|Rt.|Rd|Rd.|rd|rd.|RD.|RD|Road|road|ROAD) {digit}+ )|( (avenue|Avenue|AVENUE|lane|Lane|LANE|road|Road|ROAD|boulevard|BOULEVARD|Boulevard|drive|DRIVE|Drive|square|SQUARE|Square|street|STREET|Street|highway|HIGHWAY|Highway|Route|ROUTE|Parkway|parkway|court|COURT|way|WAY|Way|PLAZA|Plaza|plaza|PLACE|place|Place|Circle|circle|Turnpike|turnpike|TURNPIKE|Cutoff|cutoff|CUTOFF|Bypass|bypass|bypass|pl|pl.|Pl|Pl.|PL|Pl.|Plz|Plz.|ave|ave.|Ave|Ave.|AVE|AVE.|dr|dr.|Dr|Dr.|DR|DR.|sq|sq.|Sq|Sq.|SQ|SQ.|Rte|Rte.|RTE|RTE.|rd|rd.|Rd|Rd.|RD|RD.|Cl|Cl.|CL|CL.|Blvd|Blvd.|blvd|blvd.|BLVD|BLVD.|LN|LN.|ln|ln.|Ln|Ln.|Ct|Ct.|Pkwy|Pkwy.|PKWY|PKWY.|Hwy|Hwy.|hwy|hwy.|ST|ST.|St|St.|st|st.|Mall|mall|MALL) )`

This is basically a list of all the possible endings for a street. It would fit into your "Extract text or patterns" transformation as shown below:

Since this approach will likely not catch all possible street names, I strongly recommend the zip code to city name mapping method.

427eba845ac4c8b94b51 · ‎04-28-2021

Thank you Connor! This solved my problem! One tiny detail though, when I use substitute function to replace city names with ' ' (empty string), it also replaces the part of the street name if the city name is included in it. In the screenshot below Hamilton Mason Rd. Hamilton is an example of this. Is there a way to get ahead of this by harmonizing substitute funct. with an if statement?

Trifacta_Alumni · ‎04-28-2021

Good catch! You're right, the SUBSTITUTE function acts as a "Find-and-Replace-All" operation, which is why "Hamilton Mason Rd. Hamilton" is being converted to "Mason Rd.".

Since there is a possibility that the city name occurs in the street name, you can adjust your approach to identifying the street name to use positional extraction instead of pattern matching.

Here's a screenshot of the step to build out:

I've circled the case with a repeated city name in the preview.

The formula to use is below:

LEFT(column2, (len(column2) - LEN(City) -1))

Because you have already removed the zip code and state from the original address, the city name always appears at the end of your column. This means that we can use positional math to identify where the city name begins, starting from the end of the string. LEN(City) tells us how many letters are contained in the city name. LEN(column2) tells us how many letters are contained in the entire street + city combination. By subtracting the length of the city name from the length of the entire string, we return the position where the city name begins, if the first letter in the string had a position of "1". Trifacta assigns a position of "0" to the first letter in a string, so to return the accurate position we need to subtract 1 from the result. We can then use this inside the LEFT function to extract all of the items from the string up until the start of the city name.

Looking at your screenshot, your version of this formula will probably read as follows:

LEFT(Address, (LEN(Address) - LEN(City) -1))

427eba845ac4c8b94b51 · ‎04-28-2021

This worked! Thank you so much Connor! I appreciate your help!

Alteryx Designer Cloud Discussions

How can I separate my address column in to street, city, state, zip if they are not separated by a comma (only space b/w them)? I'm able to split state and zip out but none of the patterns catch city correctly. (includes Rd. St. etc.)