Discover the Internet Archive storage infrastructure

 

The Internet Archive (IA) is a non-profit institution based in San Francisco, California. It offers a collection of music, videos, films, books, studies, website code, software and games that are in the public domain, i.e., not owned by any individual or organization. It also offers historical website archiving services through its Archive-It and Wayback Machine tools.

The history of the IA began in 1996, when the commercial internet was not yet widespread. The organization currently offers 475 billion archived web pages; 28 million texts, books and studies; 14 million audio recordings (220 thousand of them live concerts); 6 million videos (2 million of them television programs); 3.5 million images; and 580 thousand software titles. “We have more than 20 years of history of the web accessible through the Wayback Machine and we work with more than 625 libraries and other partners through our Archive-It program,” writes the IA on the “About” page of its website.

According to Jonah Edwards, the Internet Archive's operations and infrastructure manager, the organization does not use cloud storage to handle this daunting amount of material while ensuring quality of service and the privacy of its users. All files uploaded to the Internet Archive are stored on more than 20 thousand hard drives, spread across 750 servers installed around the headquarters in California. In total, that is 200 petabytes of storage capacity, or 200 million gigabytes.
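The capacity figures above can be sanity-checked with some quick arithmetic. The following is only a minimal sketch: it assumes decimal (SI) units, where 1 petabyte equals 1,000,000 gigabytes, and it assumes capacity is spread evenly across drives and servers, which the article does not state. The function names are hypothetical, and the averages are illustrative rather than a description of the actual hardware.

```python
# Figures quoted in the article.
TOTAL_PETABYTES = 200
HARD_DRIVES = 20_000  # "more than 20 thousand", so treated as a lower bound
SERVERS = 750

# Assumption: decimal (SI) units, not binary (PiB/GiB).
GB_PER_PB = 1_000_000
TB_PER_PB = 1_000


def total_gigabytes(petabytes: float) -> float:
    """Convert petabytes to gigabytes using decimal units."""
    return petabytes * GB_PER_PB


def average_tb_per_drive(petabytes: float, drives: int) -> float:
    """Average terabytes per drive, assuming an even spread (illustrative only)."""
    return petabytes * TB_PER_PB / drives


def average_drives_per_server(drives: int, servers: int) -> float:
    """Average drives per server, assuming an even spread (illustrative only)."""
    return drives / servers


if __name__ == "__main__":
    print(f"Total capacity: {total_gigabytes(TOTAL_PETABYTES):,.0f} GB")
    print(f"Average per drive: {average_tb_per_drive(TOTAL_PETABYTES, HARD_DRIVES):.1f} TB")
    print(f"Average drives per server: {average_drives_per_server(HARD_DRIVES, SERVERS):.1f}")
```

Because the article says "more than" 20 thousand drives, the per-drive average computed here is best read as an upper bound.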

Map with the location of 4 Internet Archive data centers. Photo: Internet Archive.

The Internet Archive's storage capacity grows 25% per year, so new hard drives are constantly being purchased to meet the growing need. According to Jonah, the number of visits and the amount of material downloaded from the site grew sharply during the COVID-19 pandemic, which led the IA to invest even more in infrastructure.
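To get a feel for what 25% annual growth implies, here is a rough sketch of the compound-growth arithmetic. It assumes the growth rate stays constant and starts from the 200 petabytes cited earlier; both are simplifying assumptions for illustration, not a forecast from the Internet Archive, and the function names are hypothetical.

```python
import math


def projected_capacity_pb(current_pb: float, annual_growth: float, years: int) -> float:
    """Capacity after `years`, assuming a constant compound annual growth rate."""
    return current_pb * (1 + annual_growth) ** years


def doubling_time_years(annual_growth: float) -> float:
    """Years needed for capacity to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


if __name__ == "__main__":
    # Assumed starting point: the 200 PB figure quoted in the article.
    for year in range(1, 6):
        capacity = projected_capacity_pb(200, 0.25, year)
        print(f"Year {year}: {capacity:.1f} PB")
    print(f"Doubling time: {doubling_time_years(0.25):.2f} years")
```

Under these assumptions, capacity would double roughly every three years, which is consistent with hard drives being purchased constantly.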

An employee carries the equivalent of 10 petabytes of hard drives purchased for the Internet Archive infrastructure. Photo: Internet Archive.

Jonah explains that the organization uses local servers rather than cloud storage mainly because of cost. Comparable infrastructure from Amazon Web Services (AWS), for example, would cost much more than what is already spent on the physical structure.

In addition, owning the physical infrastructure helps uphold some of the IA's basic principles, such as transparency, simplicity, durability, performance and longevity. When there is a problem with a disk, the responsible team can track and fix it much more efficiently than cloud service customers can. Another advantage of local servers is that the IA can guarantee the privacy of its users, whereas cloud services can track and collect usage data from their users.

Funding

Jonah explains that because the Internet Archive is classified as an archive and library, it can access government funding, mainly from the United States Federal Communications Commission (FCC). The IA also accepts donations from companies, users and other institutions. On the “About” page of the IA website, you can find a list of all of the organization's sources of income.


Source: Internet Archive (1) (2).

See the original post at: https://thehack.com.br/conheca-a-infraestrutura-de-armazenamento-do-internet-archive/?rand=48873
