SOLVED

Speeding up ingestion of very large datasets into Hive

4fd0839b3f6e23ee344e
6 - Meteoroid

Hi there,

 

I have a very large dataset that lives in Oracle. I have previously used Trifacta to create Hive datasets for downstream consumption, i.e.,

  1. Browse to the data through the Oracle connection in Trifacta
  2. Import the data of interest
  3. Create an empty flow that publishes to Hive

 

This has worked well! However, when I tried this for my very large dataset, it ran for almost a day before I decided to kill it.

 

Are there any recommendations or best practices for reading in large Oracle datasets and pushing them out to Hive for consumption by Trifacta? My team has suggested looking into something called Apache NiFi, but I'm not exactly sure how it works and would prefer to avoid introducing third-party tools if possible.

 

Any and all responses welcome.

 

Cheers,

 

Victor

2 REPLIES
Trifacta_Alumni
Alteryx Alumni (Retired)

Hi, Victor--

 

In Release 5.1 and later, Wrangler Enterprise supports JDBC ingestion, which is designed to manage the ingest of large database tables. Unfortunately, this feature does not currently support ingestion from Hive, and I don't have any roadmap information on that.

 

Have you experimented with breaking up your table into multiple datasets? Have you enabled custom SQL queries in your environment?
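For example, if custom SQL is enabled, each imported dataset can pull a non-overlapping slice of the table. This is only a sketch: the schema, table, and ID column names below are placeholders, and it assumes ID is a column you can hash on.

  -- Dataset 1 of 4: rows whose ID hashes into bucket 0
  SELECT *
  FROM MY_SCHEMA.BIG_TABLE
  WHERE MOD(ORA_HASH(ID), 4) = 0

  -- Datasets 2 through 4 use buckets 1, 2, and 3, respectively.

Each slice then imports and publishes independently, so a failure partway through costs you one slice rather than the whole table.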

 

You might also get clever with parameterized datasets within your custom SQL queries; there's a sketch of that below. If you can tell me which version of Enterprise you're running, I can try to provide more specific information.
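Roughly, you define a variable on the imported dataset and reference it in the SQL, so a single dataset definition covers all the slices. The <bucket> placeholder here is purely illustrative; the exact way you declare and reference the parameter depends on your release.

  -- One custom SQL dataset, parameterized by bucket number
  SELECT *
  FROM MY_SCHEMA.BIG_TABLE
  WHERE MOD(ORA_HASH(ID), 4) = <bucket>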

 

Cheers,

-SteveO

 

 

4fd0839b3f6e23ee344e
6 - Meteoroid

Hi @Steve Olson,

 

Thank you for your suggestions, and I'm sorry for my much-delayed response. I am running Enterprise 5.0.0. I will look into custom SQL, with or without parameterized datasets, and will post back if I hit any snags.

 

Cheers,

 

Victor