Data Science Mixer

Tune in for data science and cocktails.
Episode Guide

Interested in a specific topic or guest? Check out the guide for a list of all our episodes!


Australia's Nikita Atkins joins us to share how GHD uses Alteryx Intelligence Suite and Machine Learning to find creative solutions and efficiencies in supply chain and infrastructure investment.

 

 


Cocktail Conversation

 


 

Join the conversation by commenting below!

 


Episode Transcription

SUSAN: 00:01

[music] Do you think you’ve got the one singular perfect mix of data science tools, or does that ideal combination always depend on the nature of a specific project? Welcome to Data Science Mixer, a podcast featuring lively and informative conversations that will change the way you do data science. I’m Susan Currie Sivek, Senior Data Science Journalist for the Alteryx Community. In this episode, I talk with Nikita Atkins of GHD, a global professional services company, about the diverse variety of data science tools and strategies he’s used in his work. We’ll hear how he and his teams have combined code with AutoML in Alteryx to tackle millions of rows of data affecting billions of dollars of commerce and governed by hundreds of rapidly changing business rules. Flexibility has been key to these projects’ success, and an open-minded approach also allowed for some creative new ways to address familiar problems. And Nikita will tell us about the effects of adopting a different mix of tools in terms of time, costs, and efficiency. Plus, he’s built new ways to help his team share data and custom-built tools to save time and increase everyone’s data science capabilities. Let’s meet Nikita and get right down to business.

NIKITA: 01:20

So hi everyone. My name’s Nikita Atkins. I have been doing data science, business intelligence, and data management for over 20 years. I went to the University of Wollongong and did my mathematics degree over 20 years ago now. And I always thought when I did my degree, I would get into something like banking or finance or actuarial work. But I ended up jumping in and getting a job originally in business intelligence. It’s something that I had not even heard of at the time. And obviously, for the first 10 years of my career, I did a lot in business intelligence and data warehousing, helping organisations get more out of their data. And so I worked on that all the way up to about 2008, 2009, and that was fascinating work and really interesting. But about 10 years ago - just over 10 years ago - I decided, “I really want to use my mathematics degree more and more.” And so I got into an area that was known back then as data mining. Data mining turned into data science, and turned into machine learning and artificial intelligence. And so I feel like I have been on a whirlwind journey over the last - what is it now? - 12, 13 years around data science.

NIKITA: 02:30

I have worked with some of the biggest organisations in this space - miners, oil and gas, transportation - and even worked with governments on a range of different options. All have very different but similar challenges. I joined GHD about three years ago. GHD is an engineering and design firm. It’s been around for 90 years. And we’re actually employee-owned, with about 10,000 employees worldwide. And we do anything to do with engineering and design. We don’t do construction, but we do everything else. So we help sectors such as energy and resources, water, transportation, and environment. We even have our own architecture team, so we do property and buildings, for example. And GHD realised about three years ago that they wanted to start up something called GHD Digital, and that was to help their clients and other clients transition into new ways of using digital technology. The digital intelligence and advanced analytics team started about three years ago, and they asked me to join them to help spin up and build this team in Australia and globally.

NIKITA: 03:37

When I first joined GHD, there was virtually no one there from an advanced analytics point of view. My role transitioned very quickly to half and half. So I have a 50% external focus where, as a consultant, I’m focused on, how do we do business development, how do we sell data science, and then obviously deliver that to our clients? And you’ll hear about some of the clients that we work with very soon. And then I also have an internal focus as well, and that internal focus is as the data science design leader. I am identifying the best approaches, standards, and methodologies, and building up training and education on how we can use data and data science better across the whole organisation. So working with our traditional engineering teams across, as I said, water, energy, and transportation, for example: how can we better use data? How can we better capture data? How can we then use that to deliver better insights to our clients, whether that’s designing the next bridge or the next freeway, or helping mines from an environmental point of view as well? So it’s been a fascinating journey. And every day is a different day. Different organisations, different clients, different challenges.

SUSAN: 04:57

Nice. I love it. So a couple of important questions before we go on to hear more about those projects that you’ve worked on. Could you share with us which pronouns you use?

NIKITA: 05:06

Yeah. So he, him.

SUSAN: 05:08

Cool. Thank you. And as you may know on Data Science Mixer, we typically try to enjoy some sort of special beverage or snack while we’re chatting. So do you have something there with you today?

NIKITA: 05:19

Yes. I’m enjoying my regular morning latte. So just a regular latte, but it’s a lovely one.

SUSAN: 05:24

Excellent. Good choice. So yeah, I’m once again, as on previous recent episodes, going through the selection of [inaudible] sparkling water flavours. So today’s is lime, so very [crosstalk].

NIKITA: 05:36

Very nice.

SUSAN: 05:37

Yes. Good, good, good. All right. So one of the projects you’ve worked on recently that I’ve heard a little bit about was with the Port of Melbourne, looking at commodities and shipping containers. Could you tell us just a little bit about that?

NIKITA: 05:48

Yeah. So the Port of Melbourne has been a fascinating project. For those people who aren’t aware, the Port of Melbourne is a large capital-city container and general cargo port in Australia. It’s worth about $7.5 billion to the Australian economy. And they manage a site that’s over 500 hectares - 1,200 acres - so it’s a large area that works 24/7. And they get approximately 3 million TEU (twenty-foot equivalent units) of containers every year. So it’s a very substantial port. One of the challenges that the port and the Victorian Government have is there’s very little visibility on where containers go once they leave the port, or where they come from, from an exports point of view. So the port - which is actually currently on a private lease from the Victorian Government - along with the Victorian Government, asked GHD if we would run what we call the origin-destination survey, to help basically capture a lot of data and understand the broader supply chain networks that are directly tied to the Port of Melbourne around container movements.

NIKITA: 06:53

It sounds like a simple problem, but it is a hugely challenging one, because there are so many different stakeholders that we had to deal with: empty container parks, train operators, truck operators. You’re dealing with everyone from a little truckie operator that may be family owned, all the way up to massive importers and exporters of goods, and we had to deal with everyone in between. And that often meant getting lots of slices of data from, I think it was, over 100 different stakeholders. In the end, we got 57 different data sources. All different formats. All different granularities. And the challenge was to stitch it all together to present a seamless view of how a single container, for example, moved out of the port, via train or truck, to the eventual customer, and then how that same container may have moved back into the port through an export process.

NIKITA: 07:46

So some of the numbers we had to deal with: it was 100 million records across 57 different data sources. This was for a two-month period - we looked at September and October 2019. And that was hugely challenging on its own. It was a huge amount of data cleansing, with, I think it was, close to 200 business rules that we had to apply throughout the whole process. Having said that, we were really successful. The data cleaning aspect of that, once we got all the data, took about 11 weeks, which was really rapid for this. We were running almost daily iterations. I mean, we would run a data source. We would then run a set of rules. Show the client. At the end of the day, they would give us feedback. And usually, the next day we would re-run that with a new iteration, a new feed. All of this was done through Alteryx. And some of the things that we did with Alteryx - obviously, the data cleansing was huge.

NIKITA: 08:37

We started doing some of this cleansing using Python and Pandas. But what we found really quickly was that it was very difficult for us to keep up with the rapid iterations and the changes we needed to make. So we made the decision very quickly to move from Python to Alteryx for the data cleansing aspects. And Alteryx actually became more critical, not just as a development tool - Alteryx became critical as part of our project management. So, as I said, we would run an iteration and show the client the end results. Alteryx actually was part of that project management stand-up that we would have. We would show them the rules. We’d explain how we had modified them. We would show them the results. And so Alteryx made that really easy. Because even though our client didn’t understand Alteryx, they understood pretty quickly what we were looking at and how we were modifying things, and that actually gave them a real sense of confidence, because they could understand what they were seeing on the screen very quickly, and thus they had confidence in our results.
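
To make the pain of iterating on fast-changing rules in code a bit more concrete, here is a minimal, hypothetical sketch of rule-driven cleansing in pandas. Nothing below comes from the GHD project - the columns, rules, and data are all invented - but it illustrates why re-editing and re-running a script of roughly 200 rules every day is harder to sustain than adjusting tools on a visual canvas.

```python
# Hypothetical sketch: rule-driven cleansing of container records in pandas.
# Columns, rules, and data are invented for illustration only.
import pandas as pd

# Toy container-movement records.
movements = pd.DataFrame({
    "container_id": ["ABCU1234567", "XYZU7654321", None, "DEFU1111111"],
    "teu": [1, 2, 1, -1],
    "mode": ["road", "RAIL", "road", "rail"],
})

def drop_missing_ids(df):
    """A movement without a container ID cannot be stitched to others."""
    return df[df["container_id"].notna()]

def normalise_mode(df):
    """Transport mode must be lowercase so sources join consistently."""
    return df.assign(mode=df["mode"].str.lower())

def positive_teu_only(df):
    """TEU counts must be positive to count toward totals."""
    return df[df["teu"] > 0]

# Each daily iteration meant editing, reordering, or adding functions here
# and re-running the whole pipeline end to end.
rules = [drop_missing_ids, normalise_mode, positive_teu_only]

for rule in rules:
    before = len(movements)
    movements = rule(movements)
    print(f"{rule.__name__}: {before} -> {len(movements)} rows")
```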

SUSAN: 09:35

Nice. Very cool.

NIKITA: 09:36

And the last part of that is: we ran that for the two months of September/October, but we had to deliver a report on the whole of 2019. So what we ended up doing was using the Intelligence Suite and its machine learning aspects to take that two-month period and extrapolate all of the commodities, the container movements, and the supply chains across the whole year. Traditionally, this would have just been done with someone saying, “Multiply these months by a weighting factor,” and that would be it. We decided that we’d use the AutoML because we wanted to make sure that we picked up any macro seasonality and micro seasonality across the whole year, across commodities. And so the Intelligence Suite and the AutoML were really important for us to do that. What we did was extrapolate that, so we had a detailed dataset that said, “Here are all your containers, and this is where they went, which postcodes they went to, whether they went via rail or road, and how many depots they stopped in along the way.”

NIKITA: 10:36

When we went back and looked at the results and compared them with the baseline data that the Port of Melbourne had - which is not as detailed as what we had, but obviously allowed us to compare from a container and commodity point of view - the AutoML capability got us to within 99.9995% accuracy. The way we did our validations, we looked at container counts and our total weight, or TEU. Our TEU was spot on - we had zero error for TEU. So that was really good validation. And we were only off on the container count by about 10 out of, as I said, 3 million container movements a year. So the errors we were getting were very small. Out of interest, the 10 containers that we missed were a very unusual, particular commodity that we had not seen before, and it only appeared in one or two months.

NIKITA: 11:25

So the port, once they saw that, basically had huge confidence, because this was a hugely rich dataset they got from the survey, which will now help them with capital planning for the next 10 years. It helps the Victorian Government make better planning decisions about what upgrades are needed as part of road and rail. And in Australia at the moment, we’re doing a very substantial infrastructure project called the Inland Rail, which will connect Melbourne and Brisbane as a very important freight corridor going forward. This project will help justify the numbers and help Victoria, New South Wales, and Queensland make better decisions about how they can encourage more freight movement onto that new inland rail when it’s completed in a few years’ time.

SUSAN: 12:15

Very cool. Yeah. It’s awesome to think of how that project - I mean, a huge project already - could have an even more massive impact across all of those different areas that you just mentioned. It’s neat to think about all of the consequences that could ripple forth from the work that you’ve done there. I’m curious - for those who are not familiar with the Alteryx Intelligence Suite, this is the suite of tools, in addition to Alteryx Designer, that includes assisted modelling, AutoML, some natural language processing tools, computer vision tools, and so forth. You mentioned your choice to use an AutoML approach here. Can you elaborate a little bit more on that, if you can? Kind of why that was a good option for you with this particular project?

NIKITA: 12:54

Yeah. So for this project, we used version 1 of the Intelligence Suite. It had the wizard, and we used the wizard back then for that. We didn’t know which was the right machine learning model for this. We always suspected that it would end up being a random forest. But the AutoML capability really just allowed us to focus on getting that data right, focus on the feature cleansing, and then let Alteryx run those models. In the end, I think we ran about 50 different models through it, to let it decide which model would give us the best accuracy and the best stability for our results. The power of this was - as I said, the business rules changed, often very quickly. The Intelligence Suite not only let us pick the right model for the situation, but as the data changed quickly, and the rules changed quickly, the inputs changed. And we could just kick it off and let it run, and it would then modify the model based on what-ifs or any new features. So it was very powerful. It allowed us to focus on what we needed to, which was getting the best quality data into those projects. And that was really powerful and just allowed us to do what needed to be done, which was focusing on that data cleansing and the final outputs.
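
For readers unfamiliar with what AutoML is doing under the hood, here is a simplified, hypothetical sketch of the model search it automates: fit several candidate models, score each the same way with cross-validation, and keep the winner. The Intelligence Suite performs this search (plus tuning and much more) without any code; the scikit-learn version below, with synthetic data, is only an illustration.

```python
# Simplified sketch of the model search AutoML automates: try several
# candidates, keep the one with the best cross-validated score.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared two-month training data.
X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

candidates = {
    "ridge": Ridge(),
    "random_forest": RandomForestRegressor(random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Score every candidate identically, then pick the best performer.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)
print(scores)
print(f"Selected model: {best}")
```

When the business rules change and the inputs are regenerated, re-running a search like this simply re-selects whichever model now fits best - the “kick it off and let it run” behaviour Nikita describes.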

SUSAN: 14:19

Yeah. Absolutely. Makes a lot of sense. So I’m curious too, speaking of ML and using ML in interesting ways, you had also told me a bit about a project you worked on - and this is kind of funny - when you were describing to me the Australian Clean Energy Regulator. And I was like, “Oh, I wonder what the actual agency name is?” But it is actually called, if I’m correct here, the Clean Energy Regulator, right?

NIKITA: 14:41

That’s exactly right. It is called the Clean Energy Regulator. So we had done the Port of Melbourne, and then we got this project about six months later. And so we were much wiser from the Port of Melbourne experience for this. The Clean Energy Regulator, every year, has a need to basically predict how many PV - or photovoltaic solar, as we know it - installations are going to happen across the country, in both a residential and commercial context. GHD had been doing this for two years previous to this, and we had traditionally used econometric modelling to do this work. We decided that this year we would do it slightly differently, and there were a few reasons for that. We wanted to use machine learning to do it, backed up by our advisory economists as well.

NIKITA: 15:34

And the reason we had to do that was, over the last 12 months to 2 years, we’ve had some really interesting problems in Australia. One, we had the major bushfires back at the end of 2019. And obviously, with the COVID lockdowns, there would have been an impact on solar. We weren’t sure if it was going to be a positive impact or a negative impact. And we definitely felt as an organisation that machine learning, through Alteryx particularly, was really required to help capture some of these nuances associated with natural disasters and pandemics and so forth. So as part of this, we had basically at this stage built a very interesting database which allowed us to capture some of the raw data that we use for a lot of these models across clients. So our Bureau of Statistics data, our census data. We get updates from market data, our Reserve Bank, and so forth. We actually capture that on a daily basis and store that historical information, to allow us to have a starting point for all our clients.

NIKITA: 16:38

So when the Clean Energy Regulator came to us, we basically grabbed that data, and we had really detailed population forecasts and really detailed building and dwelling information - we actually had a couple of machine learning models to break it down by postcode. So we had a really detailed data collection to start with. We brought that in, and brought in lots of data that we had never used before, such as the state of the economy, the number of visitors coming to Australia, and Australian citizens going away for holidays. We decided that we would use these rather than a measure of how many COVID cases there were. We actually used economic indicators across the nation as a proxy variable for the pandemic. The reason we chose that is that these models are now future-proof. So yeah, hopefully we’ll all be coming out of the pandemic, we’ll all be vaccinated - that will be terrific. But this model will still hold true, because it’s based on general indicators that can be used going forward.

NIKITA: 17:42

And the great thing about it is we used AutoML. We ran close to 1,000 different models across all these different indicators, across a variety of residential and commercial installations and a range of different installation sizes - 50-megawatt installations, 1-megawatt installations, and even lower - by state and by postcode. We ran all these different models, aggregated them back up, and then we basically had this really interesting model and dataset that successfully showed - believe it or not - that residential solar uptake was actually going up because of COVID. That’s counterintuitive, but the idea behind it is, “Well, I can’t go overseas. I’ve got a few thousand dollars left in my bank account. Let’s spend it on solar so that we’re spending less on electricity.” It sounds surprising, but it was really interesting. However, we knew that there was a natural tipping point, because a lot of people had done that last year; some had done it a little bit earlier in 2020. And what we successfully showed and modelled was that we had hit a saturation point, and that in the second half of 2020 there would be a slowing down of solar uptake in residential areas. It’s been a really interesting exercise, and the Clean Energy Regulator has definitely appreciated our insights on this. It allowed us to show how machine learning can be applied in what is normally an economics area, with really successful outcomes for the client, in a way that we can explain to them - and hopefully it puts us in good stead to help the Clean Energy Regulator in following years.
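
As a rough illustration of the segment-level approach Nikita describes - many models across states, sizes, and postcodes, aggregated back up - here is a hypothetical sketch. The data, column names, and the simple linear model are all invented; the real work used AutoML-selected models and genuine economic indicators as the pandemic proxy.

```python
# Hypothetical sketch: fit one model per (state, size band) segment using
# an economic indicator as a proxy variable, then aggregate the forecasts.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
history = pd.DataFrame({
    "state": rng.choice(["VIC", "NSW", "QLD"], size=300),
    "size_band": rng.choice(["residential", "commercial"], size=300),
    # Invented stand-in for an indicator such as outbound holiday travel.
    "econ_indicator": rng.normal(size=300),
    "installs": rng.poisson(50, size=300),
})

forecasts = []
for (state, band), segment in history.groupby(["state", "size_band"]):
    model = LinearRegression().fit(segment[["econ_indicator"]], segment["installs"])
    # Forecast installs under an assumed future indicator value.
    future = pd.DataFrame({"econ_indicator": [0.5]})
    forecasts.append({
        "state": state,
        "size_band": band,
        "forecast": float(model.predict(future)[0]),
    })

result = pd.DataFrame(forecasts)
print(result)
print("National total:", round(result["forecast"].sum(), 1))
```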

SUSAN: 19:25

Yeah. Absolutely. Good deal. So one other thing that you had mentioned in our previous conversation was sort of an internal algorithm catalogue or repository that you had created using Alteryx Server. And I thought that was a really interesting concept, that this was a way that internally you were working to make your data science work and data engineering more efficient and more repeatable. Is that something you can tell us a little bit about?

NIKITA: 19:51

Yeah. Absolutely. As I said, GHD’s been around for 90 years. And we have lots of engineers around the place, and a lot of them actually use R and Python in their day-to-day jobs for other reasons. So we have, for example, hydrologists who do very detailed modelling, and some of them actually use R and Python. And they have built a series of algorithms and code and libraries internally to help their clients. One of the jobs I have as the service line leader - the internal role - is to start to capture and identify these things. At the moment, they tend to be captured on various file systems. There’s a little bit of version control, but not much, and there’s a mishmash of this stuff all over the place. What we’re trying to do is start capturing some of those - initially in traditional data science, but also in these other areas - grab these algorithms, and put them through as an Alteryx workflow.

NIKITA: 20:46

Initially, we’re capturing them as macros within an Alteryx workflow, saving them to the Alteryx Server, and then having these actually live on the Alteryx Server. They have version control. They have their own little community - so we have a water community, an energy community, and so forth. People can go in there, actually call that project, call that macro, and run it. They basically upload a CSV or shapefile and run that. Or they can download the macro, see the Python code and R code, modify it, and update it back, with full version control. We’re using the Alteryx Server to do that, and we really want the Alteryx Server to be the one-stop shop so that people don’t reinvent the wheel. There is some really good stuff there, so we can have standard algorithms to use for our clients no matter what industry they’re from. It’s also a place to explore and learn. There’s huge interest, particularly in R and Python, and we think it’s the natural next step for people to go in and learn from what’s there.

SUSAN: 22:10

Yeah. That’s very cool. I had not thought about it as a learning opportunity in the sense that you just described, but that makes a lot of sense in that setting. So any other projects that you wanted to talk about that maybe have come to mind as we’ve been chatting?

NIKITA: 22:24

There’s another one, again, that’s a little bit more internally focused, but it’s one that is really interesting at the moment. So we’ve been around for a long time, and we’ve actually been using computers for a long time. We have a very comprehensive internal file drive, and we have our SharePoint system that we also use as well. The problem is, a lot of that data is captured by project. And so a project that was done last year or two years ago is often completed, archived, and we forget the inherent knowledge or content that we’ve stored in that project over time. So what we developed is a small robot, actually within Alteryx, to scan our shared file systems - the ones that are publicly open - and just catalogue all the different files that we’ve got. So we’ve got spatial files, we’ve got images, we’ve got Python scripts on that side of things.

NIKITA: 23:17

We’re still running that code at the moment, but just to give you a sense of the numbers we’re getting, we’ve captured 1 billion files. So that’s a huge amount of inherent knowledge. The metadata alone - so not the content, but the metadata about the filename, the size, the directory - is currently 500 megabytes on its own. It is a huge dataset, just the metadata. And we’re using that to catalogue and provide search capabilities. So, for example, at the moment someone’s come to me and said, “We’re looking for all data that may be associated with weather. Where have we captured weather information or brought in weather information, and where have we got that across different projects?” Well, we can now do a rather simple search across that metadata to help us identify all the different folders that may be labelled with weather or meteorology or climate, and particular meteorological file types as well. And that helps us to quickly identify that.
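
For a sense of how such a metadata “robot” could work - Nikita’s version was built in Alteryx; this Python sketch is purely illustrative, and every path and keyword is invented - the core idea is to record only lightweight metadata, never file contents, and then search it:

```python
# Hypothetical sketch of a metadata catalogue: walk a file share, record
# name/size/location only (never contents), then search it by keyword.
import os
import pandas as pd

def scan(root):
    """Collect lightweight metadata for every file under `root`."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                size = os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue  # skip files we cannot stat
            rows.append({"directory": dirpath, "filename": name, "bytes": size})
    return pd.DataFrame(rows, columns=["directory", "filename", "bytes"])

catalogue = scan(".")  # in practice this would point at a shared drive

# "Where have we captured weather information?" becomes a metadata search.
keywords = ["weather", "meteorolog", "climate"]
matches = catalogue[
    catalogue["filename"].str.lower().str.contains("|".join(keywords), na=False)
]
print(matches)
```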

NIKITA: 24:11

What we’re trying to do next is take those different file systems and identify the data that needs to be put into a centralised database or data lake, so that we’ll have a standardised approach going forward. The benefit of this is that we often have to pay for data - whether we have to pay for satellite imaging, for example, pay for weather information, or pay for other things. If we can identify some of this data and centralise it, then when a new project comes up, we can actually say, “Do you need to buy this? Because we’ve got this image - for example, satellite imaging - for this same location from 12 months ago.” If you’re happy with the satellite image being only 12 months old, then maybe you can save a couple thousand dollars. Now, we do tens of thousands of projects every year. So if you save $100 per project, that number can add up very quickly in terms of licences saved.

SUSAN: 25:05

Yeah. For sure. I love that. I love both the efficiency and the potential of that data. I mean, I can see even a recommendation engine or something being based on that. Like, “Oh, you’re looking at weather data. Would you also like to look at this other dataset?” [laughter]

NIKITA: 25:19

Absolutely.

SUSAN: 25:21

Awesome. Cool. So it’s come up a couple of times here, and I’m not sure if there’s more that you’d like to say about this, but you’ve mentioned a few things about how low-code or no-code and/or AutoML tools have come up in your various projects. And it sounds like these have integrated with and complemented the manual kinds of coding that your data scientists are doing. Can you talk a little bit more about that, and how those tools have come into play for you, generally speaking?

NIKITA: 25:51

Yeah. Absolutely. So I’ll give you an example. Most of the people that I hire as data scientists are either R programmers or Python programmers from way back, right? And although I’ve used Alteryx now for coming on five years and love it, a lot of the people in my team had not heard of or used Alteryx before. Initially, there was a little hesitation, particularly when I brought it in and said, “Now, let’s have a look at this” - particularly among hardcore data scientists. But we basically did a time-and-motion study around this, particularly around data cleansing. And there was a really interesting result in all of this, depending on the proficiency of the programmer - we did this comparing Alteryx with Python in particular - whether they’d been using Python for 10 or so years, versus those who were very new to Python. We found that no matter what their proficiency, there was definitely a cost saving when it comes to data cleansing.

NIKITA: 26:50

So we found that for data cleansing, data selection, filtering - all those wonderful data operations that you need to do at the start - if you were very proficient in Python, the ratio was about four to one. That is, what would take about three to four hours to do in Python, we could do in one hour in Alteryx Designer. Now, if we looked at someone like a graduate who had just come out of university with a basic understanding of Python - limited experience - that ratio jumped up to about 10 to 1. What would take them about 10 hours to do in Python, they could do in Alteryx in 1 hour. So the cost savings alone were huge. And one of the things that my hardcore data scientists said very quickly was, “We like this not because it’s low-code, but because it means that we can spend less time doing what we don’t like.” No data scientist in their right mind likes data cleansing. In fact, I actually do, but I know I’m the exception. I love data cleansing.

NIKITA: 28:00

What that meant for them is, every hour they don’t need to spend on data cleansing is another hour they can spend doing machine learning, doing feature selection, and doing the kind of statistical, mathematical work that they all enjoy. So they came around very quickly on that. The other benefit of this is that it just lowers the risk profile for our projects. When we look at any software, we look at basically how quickly a new person can pick it up and how long before they become proficient in it. If you’re honest with yourself about Python, depending on how immersed you can be, you can be very proficient in Python somewhere in between - well, some people say 6 to 12 months; I think it’s closer to 12 to 36 months. With Alteryx, we measure that in weeks. Thanks to the huge work that’s been done in the Alteryx Community and the free training materials that are available, I have had a hardcore data scientist go from not knowing Alteryx, to installing it, to finishing their first level - I think it’s the foundation certification - in eight weeks, just to give you an indication. And that means that if someone unfortunately leaves - and there’s huge demand for data scientists in this marketplace - I can bring someone else in, and they can pick up what that person has done in Alteryx quickly, because I know they can get up to speed and actually understand it because of that low-code, visual nature of Alteryx.

NIKITA: 29:39

So as a result of that, I think my team internally are huge advocates of Alteryx now. But having said that, they still understand that Alteryx is a framework for them, and there are still opportunities for us to program and develop our algorithms in R and Python, particularly in niche areas. Hydrology is a classic example. We do a lot of hydrology work, and Alteryx doesn’t have many hydrology macros out there, funnily enough. So what do we do? We still use Alteryx, but we program - particularly in R, in this example, because there are very neat, good models that we can use in R. We program those in an R node and then put it through as an Alteryx workflow. That cements it, and then we can just modify that code occasionally as well. So that’s been really powerful. The last thing I will say is that, more and more, we’re moving away from R to Python. But again, the process doesn’t change. So if we have something in R but it’s not efficient - it’s not running as quickly as we’d like - then I will take that R code, give it to one of my Python experts, and say, “Can we reprogram that in Python to make it a little bit more efficient?” But guess what? The great thing is, the macro doesn’t change. We take out the R node, we put in a Python node - great. It’s got [inaudible] in there. That’s all good. But fundamentally, Alteryx remains the framework that executes it, and that doesn’t change whether we’re running R or Python. And the great thing is, the end user doesn’t see any difference [from their perspective?].
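
The “swap the engine, keep the interface” idea can be sketched outside Alteryx too. Below is a purely hypothetical illustration - the function names, script path, and CSV are invented - of a stable entry point whose backend can change from an R script to a Python re-implementation without callers noticing, which is the role the macro plays in Nikita’s workflows.

```python
# Hypothetical sketch: a stable entry point whose backend (R via Rscript,
# or native Python) can be swapped without callers noticing.
import subprocess

def python_hydrology_model(input_csv: str) -> None:
    # A Python re-implementation of the R logic would live here.
    print(f"Running Python implementation on {input_csv}")

def run_hydrology_model(input_csv: str, engine: str = "python") -> None:
    """Fixed interface; only the implementation behind it changes."""
    if engine == "r":
        # Illustrative only: shell out to an existing R script.
        subprocess.run(["Rscript", "hydrology_model.R", input_csv], check=True)
    else:
        python_hydrology_model(input_csv)

# The caller - like the surrounding workflow - never changes.
run_hydrology_model("catchment_data.csv")
```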

SUSAN: 31:11

Right. Right. Yeah. Oh, that’s so interesting. I’m just thinking now of the comments on this podcast episode and how we’re going to get the R-versus-Python debate going here. [laughter]

NIKITA: 31:21

Funny you should say that. I have been an R programmer from way back, and you tend to see that people who feel comfortable with a traditional mathematics and statistics background will tend to lean towards R. Having said that, I am now using Python more and more. And I have come to the conclusion that there are some things Python does more efficiently than R, particularly because it is 64-bit and you have a range of different opportunities to parallelise and split up work in terms of multithreading. So there are cases for it. But having said that, at the end of the day, as a data scientist, as a consultant, we use the tools that our clients require us to use. There are a lot of organisations that say, “We need hydraulic modelling,” and some of them will actually say, “You must use these libraries in R.” So we are flexible. We need to be flexible. And at the end of the day, I think it’s great whether you use R or Python. But ultimately, we need to make sure that no matter what tool you use, you present the findings in a way that clients find interesting and can slice and dice into. One thing that we haven’t talked about - and maybe I’ll come back and talk about it another day - is that for every data science project we do at GHD, we always develop a series of interactive dashboards that we present back to the clients. We never provide a client with just a spreadsheet. Sometimes clients insist on a spreadsheet, but we know that the best thing for them is using dashboards to slice and dice their results. And I think that’s fundamentally important to everything we do. It doesn’t matter whether it’s in R or Python. It doesn’t matter if you use Tableau or Power BI - another debate that’s very rich at the moment. At the end of the day, it needs to provide the right results in the right format for our clients to get the maximum knowledge and insights out of it.

SUSAN: 33:13

Yeah. Absolutely. Makes sense. So one question that we always ask everybody on the podcast, and we call this the Alternative Hypothesis segment here, the question is: what is something that people think is true about data science or about being a data scientist, but that you have found in your experience to be incorrect?

NIKITA: 33:35

I think there are two things that I challenge my team on every day. Number one is the idea that there’s always a better algorithm. Some data scientists would love to spend time saying, “Okay. Now, maybe if I tweak this random forest, or maybe if I use these different ensemble methods, or maybe if I move from a random forest to a neural network, or maybe even to a deep learning network, I could get better results.” And sometimes that is true. But I would argue - and I have frequently argued with my team members - “Okay. For every hour that you spend looking for a better algorithm, what if you spent that hour on data cleansing, or doing better feature selection and feature engineering, or doing something around collecting better data? What’s the impact?”

NIKITA: 34:30

And on some projects we’ve actually found that going back and collecting better data, cleansing our data a little better, classifying it a little better, or using specialised feature engineering or feature selection gives us a more significant jump in accuracy than if we had moved to another algorithm. So I think sometimes in data science we get focused on the algorithm that we use - and that is an important aspect, because you don’t want to use the wrong algorithm for the wrong purpose - but sometimes data cleansing can provide the same jump in accuracy, depending on what you’re looking at. So that’s number one. And number two, on the flip side of that, is the assumption that data science projects fail because of data. I will say that I’ve had many a data science project fail, and the majority of the time a data science project fails, it’s not because of the data.

NIKITA: 35:33

Data and technology are predictable. I think that people - and being able to add things like change management, strong communication, and delivery of results in a clear and concise way that people can understand - are so important. And I will say that most of the time my projects have failed over the past 20 years, it has been because of people reasons. It is rarely because of data reasons. If I could put my old hat back on and think back on data mining: one of the reasons why data mining failed to take off back in the ’90s and early 2000s wasn’t because of a lack of algorithms or techniques. It was because there was an overemphasis on technical aspects and technical speak over business speak. And so people, and being able to translate from something mathematical to business and vice versa, is, I think, one of the reasons why some projects do not deliver as much as they could - why they are not as successful as they would like to be.

SUSAN: 36:41

Well, those are great points. I think maybe we need to do a special episode of Data Science Mixer just on failure. [laughter] [crosstalk] We often talk about the things that worked and the things that were successful, but I think some of the things you’ve just highlighted as causes for failure maybe need to see a little more discussion too.

NIKITA: 36:59

Absolutely. [music]

SUSAN: 37:02

Thanks for listening to our Data Science Mixer chat with Nikita Atkins. Join us on the Alteryx Community for this week’s Cocktail Conversation to share your thoughts. Nikita talked about handling projects where the data and the parameters of the situation are both changing rapidly. He mentioned using AutoML in the Alteryx Intelligence Suite to adapt quickly to those changes. Have you faced a similar issue of constant change while developing your own projects? What strategies have you used to deal with this kind of challenge? Share your thoughts and ideas by leaving a comment directly on the episode page at community.alteryx.com/podcast or post on social media with the hashtag #datasciencemixer and tag Alteryx. Cheers.

This episode of Data Science Mixer was produced by Susan Currie Sivek (@SusanCS) and Maddie Johannsen (@MaddieJ).
Special thanks to Ian Stonehouse for the theme music track, and @TaraM for our album artwork.