SOLVED

Speeding up ingestion of very large datasets into Hive

4fd0839b3f6e23ee344e
6 - Meteoroid

Hi there,

 

I have a very large dataset that lives in Oracle. I have previously used Trifacta to create Hive datasets for downstream consumption, i.e.,

  1. Browse to the data through the Oracle connection in Trifacta
  2. Import the data of interest
  3. Create an empty flow that publishes to Hive

 

This has worked well! However, when I tried this for my very large dataset, it ran for almost a day before I decided to kill it.

 

Are there any recommendations or best practices for reading in large Oracle datasets and pushing them out to Hive for consumption by Trifacta? My team has suggested looking into something called Apache NiFi, but I'm not exactly sure how it works and would prefer to avoid introducing third-party tools if possible.

 

Any and all responses welcome.

 

Cheers,

 

Victor

2 REPLIES
Trifacta_Alumni
Alteryx Alumni (Retired)

Hi, Victor--

 

In Release 5.1 and later, Wrangler Enterprise supports JDBC ingestion, which is designed to manage the ingest of large database tables. Unfortunately, this feature does not currently support ingestion from Hive, and I don't have any roadmap information on that.

 

Have you experimented with breaking up your table into multiple datasets? Have you enabled custom SQL queries in your environment?
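For example, if custom SQL is enabled, each imported dataset can pull a non-overlapping slice of the table. This is only a sketch: the schema, table, and ID column names below are placeholders, and it assumes ID is a column you can hash on.

  -- Dataset 1 of 4: rows whose ID hashes into bucket 0
  SELECT *
  FROM MY_SCHEMA.BIG_TABLE
  WHERE MOD(ORA_HASH(ID), 4) = 0

  -- Datasets 2 through 4 use buckets 1, 2, and 3, respectively.

Each slice then imports and publishes independently, so a failure partway through costs you one slice rather than the whole table.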

 

You might also get clever with parameterized datasets within your custom SQL queries; there's a sketch of that below. If you can tell me which version of Enterprise you're running, I can try to provide more specific information.
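Roughly, you define a variable on the imported dataset and reference it in the SQL, so a single dataset definition covers all the slices. The <bucket> placeholder here is purely illustrative; the exact way you declare and reference the parameter depends on your release.

  -- One custom SQL dataset, parameterized by bucket number
  SELECT *
  FROM MY_SCHEMA.BIG_TABLE
  WHERE MOD(ORA_HASH(ID), 4) = <bucket>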

 

Cheers,

-SteveO

 

 

4fd0839b3f6e23ee344e
6 - Meteoroid

Hi @Steve Olson,

 

Thank you for your suggestions, and I'm sorry for my much-delayed response. I am running Enterprise 5.0.0. I will look into custom SQL, with or without parameterized datasets, and will post back if I hit any snags.

 

Cheers,

 

Victor