
Alisa Davidson
Published: May 6, 2025 11:12 am Updated: May 6, 2025 11:38 am

Edited and fact-checked: May 6, 2025 11:12 am
In Brief
Despite growing concern about a shortage of data for training AI models, the constantly expanding public internet is unlikely to run dry as a source of real-world data for AI.

Today’s AI models can do amazing things. It is almost as if they run on some kind of magic, but of course they do not. Instead of magic, AI models run on data, and lots of it.
However, there is growing concern that a shortage of data could stall AI’s rapid pace of innovation. In recent months, several experts have warned that the world is running out of fresh data to train the next generation of models.
Data shortages would be especially troublesome for the development of large language models, the engines that power generative AI chatbots and image generators. They are trained on vast amounts of data, and each new leap forward demands even more of it as fuel.
The scarcity of AI training data has already pushed some businesses toward alternative solutions, such as using AI to create synthetic data for AI training, striking content deals with media companies, and deploying “Internet of Things” devices that provide real-time insights into consumer behavior.
But there is good reason to think these fears are exaggerated. The AI industry will probably never run out of data, because developers can always fall back on the largest store of information the world has ever known: the open internet.
A wealth of data
Most AI developers already source their training data from the internet. OpenAI’s GPT-3 model, the engine behind the viral ChatGPT chatbot that first introduced generative AI to the public, was trained on data from Common Crawl, an archive of content published on the internet. Roughly 410 billion tokens’ worth of information, essentially whatever had been posted online up to that point, was fed into it, giving it the knowledge needed to answer almost any question we could think of asking.
Web data is a broad term that covers everything published online, including government reports, scientific research, news articles and social media content. It is an astonishingly rich and diverse data set, reflecting everything from public sentiment and consumer trends to the state of the global economy and do-it-yourself educational content.
The internet is an ideal feeding ground for AI models. It can supply real-time information from millions of websites, including the many that actively try to keep bots out, thanks to specialized tools such as Bright Data’s Scraping Browser.
With features including CAPTCHA solvers, automated retries, APIs and a vast proxy IP network, developers can get past the toughest bot-blocking mechanisms used by sites such as eBay and Facebook and extract information at scale. Bright Data’s platform also integrates with data processing workflows, allowing scraped content to be structured and cleaned ahead of training.
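To make that workflow a little more concrete, here is a minimal sketch of what pulling the visible text of a public page through a proxy network might look like in Python. It uses the generic requests and BeautifulSoup libraries; the PROXY_URL placeholder, the fetch_page_text helper and the example URL are illustrative assumptions only and do not reflect Bright Data’s actual API.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical proxy endpoint -- a real provider supplies its own
# host and credentials; this placeholder is illustrative only.
PROXY_URL = "http://USERNAME:PASSWORD@proxy.example.com:22225"


def fetch_page_text(url: str) -> str:
    """Fetch a publicly accessible page through a proxy and return its
    visible text, stripped of markup."""
    response = requests.get(
        url,
        proxies={"http": PROXY_URL, "https": PROXY_URL},
        timeout=30,
    )
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    # Drop script/style blocks so only human-readable text remains.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())


if __name__ == "__main__":
    print(fetch_page_text("https://example.com")[:500])
```

In practice the raw text would then flow into a structuring and cleaning step before it is used for training, as discussed below.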
It is hard to say exactly how much data is available to use today. In 2018, International Data Corp. estimated that the total amount of data posted online would reach 175 zettabytes by the end of 2025, while more recent figures from Statista put it at 181 zettabytes. Needless to say, that is a mountain of information, and it keeps growing exponentially over time.
Challenges and ethical questions
Developers still face big challenges in feeding this information to AI models. Web data is notoriously messy, unstructured, often inconsistent and of uneven quality. It requires intensive processing and “cleaning” before algorithms can make sense of it. Web data also tends to contain plenty of inaccurate and irrelevant details, which can distort an AI model’s outputs and contribute to the so-called “hallucinations”.
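As a rough illustration of what that “cleaning” can involve, the short Python sketch below strips residual markup, collapses whitespace and drops exact duplicates using a content hash. Real pipelines go much further (language filtering, near-duplicate detection, quality scoring); the function names and sample pages here are invented for the example.

```python
import hashlib
import re


def clean_document(raw: str) -> str:
    """Strip leftover HTML tags and normalise whitespace in a scraped page."""
    text = re.sub(r"<[^>]+>", " ", raw)       # remove residual markup
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


def deduplicate(documents: list[str]) -> list[str]:
    """Drop exact duplicates via a SHA-256 content hash."""
    seen: set[str] = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


# Hypothetical scraped pages, including a duplicate and stray markup.
scraped_pages = [
    "<p>Central bank holds rates steady.</p>",
    "<p>Central   bank holds rates steady.</p>",
    "<div>How to fix a leaking tap, step by step</div>",
]
corpus = deduplicate([clean_document(page) for page in scraped_pages])
print(corpus)  # two cleaned, unique documents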
In addition, scraping internet data raises ethical questions, especially around copyrighted material and what constitutes “fair use”. Companies like OpenAI claim the right to scrape any information that can be consumed online for free, but many content producers argue this is unfair, since those companies profit from their work.
Despite the ambiguity over which web data can and cannot be used to train AI, its importance is hard to deny. In Bright Data’s recent public web data report, 88% of the developers surveyed agreed that public web data is “important” to AI model development, thanks to its accessibility and remarkable diversity.
That helps explain why 72% of developers worry that efforts by large companies such as Meta, Amazon and Google could make this data increasingly difficult to access over the next five years.
The case for web data
The challenges above explain why there has been so much talk about using synthetic data as an alternative to what is available online. Indeed, there is an emerging debate over the merits of synthetic data versus internet scraping, and there are clear arguments in favor of the former.
Advocates of synthetic data point to benefits such as stronger privacy protection, reduced bias and greater accuracy. On top of that, it arrives ready for AI models from the get-go, meaning developers do not need to invest resources in labeling it so AI models can read it.
On the other hand, over-reliance on synthetic data sets can lead to model collapse, and equally powerful arguments can be made for the superiority of public web data. For one thing, it is hard to beat the sheer diversity and richness of web-based data, which is vital for training AI models that must handle the complexity and unpredictability of real-world scenarios. Its blend of human perspectives and freshness, particularly when models can access it in real time, helps produce more reliable AI models.
In a recent interview, Bright Data’s CEO Or Lenchner stressed that the best way to ensure the accuracy of AI outputs is to feed them data from a variety of public sources with established reliability. An AI model that relies on only one or a handful of sources, he argued, is likely to be flawed. “Having multiple sources provides the ability to cross-reference the data and build a more balanced data set,” Lenchner said.
Developers can also be confident that they are allowed to use data from the web. In a legal decision last winter, a federal judge ruled in favor of Bright Data, which Meta had accused over its web scraping activities. In that case, the judge found that while Facebook’s and Instagram’s terms of service prohibit users from scraping those websites, they provide no legal basis for blocking access to data that is publicly available on those platforms.
Public data also has the advantage of being organic. Synthetic data sets are more likely to leave out small subcultures and the nuances of their behavior. Public data generated by real people, by contrast, is as authentic as it gets, which translates into better-informed AI models that perform better.
There is no future without the web
Finally, it is worth noting that the nature of AI itself is changing. As Lenchner pointed out, AI agents are playing a much bigger role in how AI is used, and they can help collect and process the data used for AI training. Beyond taking tedious manual work off developers’ hands, the speed at which AI agents operate means AI models can expand their knowledge in real time.
“AI agents can transform industries by letting AI systems access and learn from constantly changing data sets on the web, instead of relying on static, manually processed data,” he added. “This could lead to banking or cybersecurity AI chatbots that make decisions reflecting the most recent reality.”
These days, almost everyone relies on constant access to the internet. It has become a vital resource, giving access to thousands of essential services and making work and communication possible. If AI systems are to surpass human capabilities, they will need access to that same resource, and the web is the most important one of all.
About the author
Alisa, a dedicated reporter for MPost, specializes in the vast areas of cryptocurrency, zero-knowledge proofs, investments and Web3. With a keen eye for emerging trends and technology, she delivers comprehensive coverage that informs and engages readers in the constantly evolving digital finance landscape.