
(Creating) Data that is too dirty for "AI"

#1 lemmydividebyzero
When datasets are scaled up to the volume of (part of) the internet, on the idea that scale will average out the noise, the builders of large datasets came up with a human-not-in-the-loop, cheaper-than-cheap-labor method to clean them: heuristic filtering. Heuristics in this context are basically a set of rules that engineers come up with, from their own imagination and estimation, to work best for their perspective of “cleaning”. Most datasets adopt heuristics from existing ones, then add extra filtering rules for their own specific characteristics. I would like to invite you to have a taste of these silent, anonymous yet upheld estimations and not-guaranteed rationalities in current sociotechnical artifacts, and to reflect on for whom these estimations are good enough, as they will soon be part of our technological infrastructures.

In the 1980s, non-white women’s body size data was categorized as dirty data when the first women's sizing system was established in the US. Now, in the age of GPT, what is considered dirty data, and how is it removed from massive training materials?

Datasets for training large models have now been expanded to the volume of (part of) the internet. On the idea that “scale averages out noise”, these datasets were scaled up by scraping whatever data was available on the internet for free, then “cleaned” with a human-not-in-the-loop, cheaper-than-cheap-labor method: heuristic filtering. Heuristics in this context are basically a set of rules that engineers come up with, from their own imagination and estimation, that are “good enough” to remove “dirty data” from their perspective; they are not guaranteed to be optimal, perfect, or rational.
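To make "heuristic filtering" concrete, here is a minimal sketch of what such a rule set can look like. Everything in it is an assumption for illustration: the function name `keep_document`, the tiny `BLOCKLIST`, and the thresholds are hypothetical and are not taken from any of the datasets discussed in the talk.

```python
import re

# Hypothetical blocklist; an illustration only, not a real dataset's word list.
BLOCKLIST = {"nsfw", "badword"}


def keep_document(text: str, min_words: int = 5, max_symbol_ratio: float = 0.3) -> bool:
    """Return True if the text survives three hand-picked rules."""
    words = re.findall(r"[a-z']+", text.lower())

    # Rule 1: drop very short texts (threshold is an engineer's guess).
    if len(words) < min_words:
        return False

    # Rule 2: drop any text containing a blocklisted word, regardless of context.
    if any(word in BLOCKLIST for word in words):
        return False

    # Rule 3: drop texts dominated by punctuation and other symbols.
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    if symbols / max(len(text), 1) > max_symbol_ratio:
        return False

    return True


docs = [
    "This is a plain sentence about sewing patterns.",
    "too short",
    "A medical leaflet that mentions an nsfw term once here",
]
cleaned = [d for d in docs if keep_document(d)]
```

With these made-up rules, only the first document is kept. The third is dropped because of a single word, with no look at its context, which is the kind of "good enough" estimation the abstract is asking about.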

The talk will show some intriguing patterns of “dirty data” from 23 extraction-based datasets, such as how NSFW gradually comes to equal NSFTM (not safe for training models). It will reflect on these silent, anonymous yet upheld estimations and not-guaranteed rationalities in current sociotechnical artifacts, and ask for whom these estimations are good enough, as they will soon be part of our technological infrastructures.

Licensed to the public under http://creativecommons.org/licenses/by/4.0

#2 atrielienz
You copied and pasted the first paragraph a second time in the body of the post.
