A recent survey from McKinsey showed that 56% of respondents reported AI adoption in at least one function, up from 50% the year prior, with the three most common use cases focused on service-operations optimization, AI-based enhancement of products, and contact-center automation. Businesses are committing huge amounts of money to AI initiatives. According to Appen’s 2021 State of AI report, AI budgets increased 55% year-over-year, reflecting a shift from an experimental project mindset to an expectation of business benefits and ROI.
One reason this shift is happening now is that many businesses have built expert data science teams and matured their understanding of the discipline. However, this has not proven to be enough to maximize the business potential of AI initiatives and deliver the desired ROI. What these businesses still lack is a best-practices approach to preparing data for the AI lifecycle. AI teams also need the right tools and techniques to help them gain greater insight into and better manage the lifecycle.
Moving forward, the success of AI and machine-learning initiatives will depend largely on a business’s ability to tie the right business use case to the right model, trained on high-quality, properly sourced data. Getting this rhythm down is at the heart of AI deployment: it reduces complexity within the lifecycle and helps initiatives scale and succeed sooner and for longer.
Data lifecycle steps and considerations
AI teams tend to say that their main challenge isn’t building the model itself but understanding how to source and label data at scale, manage models over the long term, and check real-world model performance. The AI data lifecycle is constantly changing, and the approaches we take to manage its components need to adapt with it.
Here are some key considerations to keep top of mind as you move through the lifecycle:
- Data sourcing. Once you understand why and how your AI models will be used (i.e., which use cases you’ll focus on), it’s time to source the data to build the model. This means first assessing the options available to you from internal sources and external vendors. As you acquire data, it is key to ensure that 1) the process is repeatable at scale for the sources you choose, and 2) the data is high quality and ethically sourced. There are also different types of data to consider, depending on the maturity of your program and the complexity of what you’re trying to accomplish. Pre-labeled datasets are ready to use and can streamline model training, while synthetic data can stand in for hard-to-find data and enhance model training.
- Data preparation. Next up is ensuring the data is properly annotated, rated, judged, and labeled to create optimal input for the model. In other words, this step turns your data into intelligence, and it should not be approached lightly. First, you need an ontology, or data model, that describes the contents of your data and how they relate to each other. You then use parts of this ontology to label unstructured data such as text and images and to extract its content, which becomes a knowledge graph. This is the process of taking an ocean of unlabeled, unstructured data and turning it into data that can train your model to recognize the patterns that matter for your business use cases. Organizations can approach data preparation in many ways, typically relying on in-house staff and resources, freelancers, or third-party data partners who use crowdsourcing and technology to help prep the data.
- Model testing, training, and deployment. Then it’s time to train your model on the prepared data and ensure that it’s properly connected to the model infrastructure. The complexity of your use case comes into play here. If the model is processing radiology images to identify disease, the required accuracy will be higher than for a model that identifies products on a grocery shelf in an online marketplace. This step requires testing the model with your labeled data and then with a different set of unlabeled data to see whether the predictions are accurate. The team members involved in the project need to check the predictions regularly and identify any issues or gaps in the data so they can train and retrain as needed. This is the “human-in-the-loop” approach. Once the model has been adequately tested and trained, it can be deployed by integrating it into existing production environments.
- Model evaluation. The process doesn’t end with deployment. AI and ML initiatives should be treated not as projects with an end date but as cycles that require continuous monitoring. The evaluation stage also helps teams guard against model drift, which occurs when the environment changes and affects the model’s predictive capabilities. Ideally, this is also when the team would validate real-world model performance, comparing it with that of competitors and peers to ensure best-in-class results. It’s all about continuous improvement at this point, which may be the most important, yet most often overlooked, stage.
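The testing and evaluation steps above can be illustrated with a small Python sketch. It is not drawn from any particular tool: the toy model, the example data, and the 5% tolerance are all assumptions made for illustration. The idea is to record accuracy on a labeled holdout set before deployment, then compare it with accuracy on recently reviewed production examples and flag possible drift when the gap grows too large.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]  # (input text, expected label)


def accuracy(model: Callable[[str], str], examples: Sequence[Example]) -> float:
    """Return the fraction of labeled examples the model predicts correctly."""
    if not examples:
        raise ValueError("need at least one labeled example")
    correct = sum(1 for text, label in examples if model(text) == label)
    return correct / len(examples)


def drift_detected(baseline_acc: float, recent_acc: float,
                   tolerance: float = 0.05) -> bool:
    """Flag drift when recent accuracy falls more than `tolerance` below baseline.

    The 0.05 default is an arbitrary illustrative threshold, not a recommendation.
    """
    return (baseline_acc - recent_acc) > tolerance


def toy_model(text: str) -> str:
    """Hypothetical stand-in for a trained model: a keyword rule."""
    return "produce" if ("apple" in text or "banana" in text) else "other"


# Labeled holdout data set aside before deployment (made-up examples).
holdout = [
    ("red apple", "produce"),
    ("banana bunch", "produce"),
    ("dish soap", "other"),
    ("paper towels", "other"),
]

# Production inputs later labeled by human reviewers (made-up examples).
# New products have appeared that the model never saw during training.
recent = [
    ("red apple", "produce"),
    ("mango", "produce"),
    ("kiwi", "produce"),
    ("dish soap", "other"),
]

baseline_acc = accuracy(toy_model, holdout)
recent_acc = accuracy(toy_model, recent)

if drift_detected(baseline_acc, recent_acc):
    print("Possible drift: review recent data and consider retraining.")
```

In this toy run the model scores perfectly on the holdout set but misses the new products in the recent batch, so the check fires. In practice the reviewed batch would come from the human-in-the-loop process described above, and the threshold would depend on how much accuracy the use case demands.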
It takes patience and dedication to realize the benefits of AI. You’ll know you’re doing it right not when you wrap up a project but when you can take your learnings and apply them to other scenarios and functions within your organization. Success in AI means iterating quickly and building in a repeatable, scalable way. If you keep these data lifecycle considerations in mind when building AI, and if you don’t skip any steps or take any shortcuts, you’ll be on your way.
Sujatha Sagiraju is Chief Product Officer at Appen.