
VOA Science and Technology 2024--Will AI Systems Run Out of Publicly Available Data on the Internet?

Date: 2024-08-12 08:23:04



A research group says artificial intelligence (AI) companies could run out of publicly available data for their systems in less than eight years.

Training data includes writing and information publicly available on the Internet. AI companies use the internet to "train" AI systems to create human-sounding writing. This "training" is what developers use to create large language models. Currently, many technology companies are developing large language models this way.

The nonprofit research group Epoch AI examines issues relating to AI. It has been following the development of large language models for a few years. In a recent paper, the group said technology companies will exhaust the supply of publicly available training data for AI language models between 2026 and 2032.

The team's latest paper has been reviewed by experts, or peer reviewed. It is to be presented at the International Conference on Machine Learning in Vienna, Austria, this summer. Epoch AI is linked to the research group Rethink Priorities based in San Francisco, California.

A 'gold rush'

Researcher Tamay Besiroglu is one of the paper's writers. He compared the current situation to a "gold rush" in which limited resources are depleted. He said the field of AI might face problems as the current speed of development uses up the current supply of human writing.

As a result, technology companies like OpenAI, the maker of ChatGPT, and Google are seeking to pay for high-quality data. Their goal is to ensure a flow of good material to train their systems. OpenAI has made deals with social media service Reddit and news provider News Corp. to use their material. The researchers consider this a short-term answer.

Over the long term, the group said, there will not be enough new blogs, news stories or social media writing to support the speed of AI development. That could lead companies to seek online data considered private, such as email and phone communications. They also might increasingly use AI-created data, such as chatbot content.

A 'bottleneck' in development?

Besiroglu described the issue as a "bottleneck" that can prevent companies from making improvements to their AI models, a process called "scaling up."

"...Scaling up models has been probably the most important way of expanding their capabilities and improving the quality of their output."

The Epoch AI group first made their predictions two years ago. That was weeks before the release of ChatGPT. At the time, the group said "high-quality language data" would be exhausted by 2026. Since then, AI researchers have developed new methods that make better use of data and that "overtrain" models on the same data many times. But there are limits to such methods.

While the amount of written information that is fed into AI systems has been growing, so has computing power, Epoch AI said. The parent company of Facebook, Meta Platforms, recently said the latest version of its Llama 3 model was trained on up to 15 trillion word pieces called tokens.

But whether a "bottleneck" in development is a concern remains the subject of debate.

Nicolas Papernot teaches computer engineering at the University of Toronto. He was not involved in the Epoch study. He said building more skilled AI systems can come from training them for specialized tasks. Papernot said he is concerned that training AI systems on AI-produced writing could lead to a situation known as "model collapse."

Permission and quality

Also, internet-based services such as Reddit and the information service Wikipedia are considering how they are being used by AI models. Wikipedia has placed few restrictions on how AI companies use its articles, which are written by volunteers.

But professional writers are worried about their protected materials. Last fall, 17 writers brought a legal action against OpenAI for what they called "systematic theft on a mass scale." They said ChatGPT was using their materials, which are protected by copyright laws, without permission.

AI developers are concerned about the quality of what they train their systems on. Epoch AI's study noted that paying millions of humans to write for AI models "is unlikely to be an economical way" to improve performance.

The chief of OpenAI, Sam Altman, told a group at a United Nations event last month that his company has experimented with "generating lots of synthetic data" for training. He said both humans and machines produce high- and low-quality data.

Altman expressed concerns, however, about depending too heavily on synthetic data over other technical methods to improve AI models.

"There'd be something very strange if the best way to train a model was to just generate...synthetic data and feed that back in," Altman said. "Somehow that seems inefficient."

Words in This Story

exhaust -v. to completely use up a resource

depleted -adj. describing a resource that is almost used up

trajectory -n. the direction that something is taking or is predicted to take

synthetic -adj. created by a process that is not natural

scale -n. the size or level of something

generate -v. to create something through a process



