MLOps Community Meetup #120! Last week we talked to Crag Wolfe, Infrastructure Team Lead at Unstructured.io, in a session hosted by Ben Epstein.
// Abstract
Modern ML pipelines still often need pre-processed documents. This isn't changing anytime soon; in fact, the appetite is growing.
Unstructured.io is focused on extracting structured data from raw documents (PDF, PPTX, HTML, etc.). In the near term, we're more NLP-focused.
Check out Unstructured.io's open-source libraries!
// Bio
Crag is a seasoned Back-End Engineer, with over a decade of experience working at Red Hat. In his most recent role, he served as the Technical Lead for a key product at an NLP startup, where he spent five years honing his skills and expertise.
// Jobs board
[ Link ]
// Related links
The open-source community: [ Link ]
Connect with Crag: crag@unstructured.io
Docs in an S3 bucket and saving the structured output:
[ Link ]
----------- ✌️Connect With Us ✌️-------------
Join our Slack community: [ Link ]
Follow us on Twitter: @mlopscommunity
Sign up for the next meetup: [ Link ]
Catch all episodes, Feature Store, Machine Learning Monitoring, and Blogs: [ Link ]
Connect with Demetrios on LinkedIn: [ Link ]
Connect with Ben on LinkedIn: [ Link ]
Connect with Crag on LinkedIn: [ Link ]
Timestamps:
[00:00] Introduction to Crag Wolfe
[01:19] Agenda
[01:50] Unstructured.io introduction
[03:47] The open-source community
[04:44] The goal
[05:50] Rapidly build custom preprocessing API
[08:47] Staging
[09:35] Demo
[10:02] Developer quick start
[11:20] SEC Filing Section Pipeline
[11:29] Section 1: Pulling in Raw Documents
[12:36] Section 2: Reading the Document
[15:10] Section 3: Custom Partitioning Bricks
[17:31] Section 4: Cleaning Bricks
[18:41] Section 5: Staging Bricks
[20:11] Section 6: Define the Pipeline API
[40:15] SEC Sentiment Analysis Model notebook
[41:18] Stage for transformers
[41:45] Training a summarization model with Unstructured + Argilla + Huggingface
[43:29] Crag's previous engineering experience
[44:33] Deciding what to tackle next
[46:01] Editing documents
[47:17] Scaling issues
[48:07] Moving out of NLP
[48:53] Wrap up