School of Information
 Previously School of Library & Information Studies

 Friday Afternoon Seminar: Summaries.
  296a-1 Seminar: Information Access, Fall 2022.

Fridays 3-5. Details will be added as they become available.
In person and also on Zoom, unless indicated otherwise. Campus policy requires all Zoom participants to sign into a Zoom account prior to joining meetings hosted by UC Berkeley. Face mask recommended but not required for in-person attendance. Zoom sessions are not recorded.
A link to each Seminar session is available only at the School's event listing: www.ischool.berkeley.edu/events.
Schedule. Weekly mailing list.

Aug 26: Clifford LYNCH: Overview of Stewardship Lectures.
    Michael BUCKLAND & Clifford LYNCH: Introduction.
    Introduction to the Seminar, and Plans for Fall, including a summary of upcoming sessions. Introductions of Participants; expectations for registered students.
    Clifford LYNCH: Context and Overview for the Stewardship Lectures.
    In 2016 I gave a series of talks in the seminar trying to summarize and synthesize work I've done over the past 20 years on stewardship of the scholarly and cultural record. These were transcribed and I hoped to use them as the basis for a book. As I covered the material it became clear that there were several major areas that were omitted from the survey (and I published several papers covering some of these areas); in the following years, several urgent personal issues and then a pandemic intervened. The landscape of preservation and stewardship also changed significantly. In this introductory discussion, I'll outline what I hope to cover in the 4-5 sessions we've scheduled on stewardship this fall, and try to provide a broad context for the upcoming lectures.

Sep 2: No Seminar meeting. Labor Day weekend.

Sep 9: Clifford LYNCH: Stewardship 1: The Scope of the Challenge.

    Following the August 26 introduction and framing of the stewardship challenge, my talks on September 9 and October 7 will begin an examination of the scope of that challenge. I'll look at the scope and nature of the scholarly record, and then explore and contrast it with the much larger and less well-defined cultural record. After some general discussion, I'll look at a series of specific case studies: music; books; moving and still images; geospatial, remote sensing and documenting the environment; the web at large, including the deep web, and the digital ephemera of the future; social media, personal digital archiving; factual biography; and news.
    After these case studies, I'll explore the continually shifting contested areas and boundaries at the fringes of the cultural record. I'll briefly discuss efforts to measure or approximate the scale of various segments of the cultural record, and of the parts of that record under the protection of stewardship institutions.

Sep 16: Short Reports by Arogya KOIRALA, Shai DHALIWAL, Calvin LEE, Alan KYLE, Siddharth ADELKAR, Sarah BARRINGTON and Ameya NAIK.
    Arogya KOIRALA: Monitoring War Destruction from Space Using Machine Learning.

    Extracting information on war-related destruction is difficult because it relies on eyewitness accounts and manual detection in the field, which is not only costly for institutions carrying out these efforts, but also unsafe for the individuals carrying out this task. The information gathered is also incomplete, which makes it difficult to use in media reporting, carrying out humanitarian relief, understanding human rights violations, or academic reporting. This seminar introduces an automated approach to measure destruction in war-damaged buildings by applying deep learning to publicly available satellite imagery. We adapt different neural network architectures and make them applicable for the building damage detection use case. As a proof of concept, we apply this method to the Syrian Civil War to reconstruct war destruction in these regions over time. We close the discussion by talking about how the nature and quality of the inputs used (publicly available satellite imagery) and different architectural choices made in the design of the machine learning system relate to the robustness and generalizability of the outputs produced. This work builds on prior work by Mueller et al. in the PNAS paper "Monitoring war destruction from space using machine learning".
    Shai DHALIWAL: Modernizing Mainframe RACF.

    I plan to explore how cloud modernization will improve cyber security for legacy mainframe information systems: consider the driving factors which influence organizations to leverage legacy mainframe systems today and existing capabilities for Identity & Access Management; explore opportunities for organizations to migrate IBM Mainframe workloads securely to the cloud and how this could benefit organizations for the next 10-20 years; and provide a recommended framework to execute Mainframe RACF modernization for improved Identity & Access Management security.
    Calvin LEE: Exploring Consumer Robocall Mitigations.
    Have you recently received a strange call from an unknown number in your area code requesting that you extend your expired vehicle warranty insurance? If you have, you received one of the 75.9 billion illegal robocalls that were made within the last 12 months. Counter-advances have been made by critical players such as the Federal Communications Commission and Tier-1 US network carriers, but are they proving futile? Perhaps we can look towards similar tactics such as email anti-spam or CAPTCHA techniques. Designing a solution to mitigate these unwanted calls is complex, but we will explore possible new mitigations.
    Alan KYLE: Drawing Lines Between Section 230 and Trust & Safety.
    In this presentation I will discuss what Section 230 and trust & safety are and how they are connected. I will use my experience as a trust & safety professional, and my research on Section 230 bills, as a jumping-off point for thinking about ways to contextualize current attempts to regulate the Internet.
    Siddharth ADELKAR: English Documentation of Non-English Stories: A Study of the People's Archive of Rural India.
    The People's Archive of Rural India (PARI) is primarily a journalism project. PARI reports on the "everyday lives of everyday people" by using the craft and structure of journalism to inform their predominantly English-speaking urban readership -- an important part of which is school and college students. PARI sees itself as an antidote to the structural problems in Indian media, education and historiography. While Indian media focuses on the urban rich, PARI focuses on the rural poor. While Indian English education focuses on skilling and emigration, PARI focuses on deep engagement with one's surroundings. While Indian history focuses on the narrative of kings and empires, PARI prepares an archive for future writings of people's histories. The unique themes that have stood out in PARI's reportage include the agrarian crisis in India, women's sexual and reproductive health, and chronicling the impending climatic catastrophe as "everyday" people feel it.
    An important question that I wish to study in PARI's coverage is the impact -- benefits and drawbacks -- of English-first documentation of livelihood and culture in a predominantly non-English-speaking country. PARI is translated into up to 11 Indian languages by over 100 accomplished translators. Yet if, in the future, only PARI were to survive, what would be lost due to the English-ness or English-firstness of the stories? Would something be gained? Does the presence of pictures and video improve the situation or worsen it?
    Sarah BARRINGTON: The ‘Fungibility’ of Non-Fungible Tokens: Vulnerabilities in an Over-Hyped Market.
    Non-Fungible Tokens (NFTs), digital certificates of ownership for virtual art, are becoming increasingly ubiquitous in Western media, from the release of the first notes of the Beatles’ ‘Hey Jude’ as an NFT to the sale of the first tweet as an NFT for $2.9 million. In 2021, the market was valued at $17.6bn, representing a growing and salient portion of the overall cryptocurrency and blockchain economy. Yet the NFT market is also speculative, variously described as irrational and overhyped. The emergence of vulnerabilities, along with a sustained market downturn, is now calling the role of NFTs into question: what exactly are NFTs? And most importantly, what gives them value? This project aims to address these questions, arguing that three fundamental properties (permanence, immutability and uniqueness) are necessary (but not sufficient) conditions for an NFT to have value. We explore both the underlying artworks and their associated metadata in order to define these metrics. Furthermore, we take a quantitative approach to testing these definitions against 6 months of real-world data, examining the true permanence and perceived value of over 7 million NFTs. We ultimately envision this work helping buyers and marketplaces identify and warn users against purchasing NFTs that may be overvalued, and bringing some much-needed rigour to a presently complex and recondite market.
    Ameya NAIK: Assessing Data Subject Access Requests.
    Any mobile or desktop application prompts you to sign a term of policy agreement that allows the application to gather information related to you, your activities, and your attributes. The level and type of information gathered depend on the organization, business model, kind of application, and geographical location. You can access this information through Data Subject Access Requests, which the applications are bound to honor. While GDPR (Article 15) and CCPA have broadly drawn rules and regulations for Data Subject Access Requests (DSARs), there still are differences in the way the data is stored and shared with the consumers (you). I plan to initiate DSARs, note the process, and then analyze the data shared. This analysis could potentially lead to interesting observations, and I would like to compare the data shared by similar applications (messaging, social media, etc.). Developing a catalog and visualizing the data would be the other aspect of the project.

Sep 23: Chris FREELAND, Internet Archive: Controlled Digital Lending.
    The Internet Archive’s Open Libraries program empowers libraries to lend digital books to patrons using controlled digital lending (CDL). In the course of this discussion we'll cover how CDL works, different implementations of the library practice, including Internet Archive's Open Libraries program, and the impact that the practice has on libraries and the communities we serve. We'll also cover Hachette v. Internet Archive, the lawsuit brought against the Internet Archive by four commercial publishers over controlled digital lending.
    Chris Freeland is the Director of Open Libraries at the Internet Archive, working in support of the organization's mission to provide "Universal access to all knowledge." Before joining the Internet Archive Chris was an Associate University Librarian at Washington University in St. Louis, managing Washington University Libraries' digital initiatives and related services, and the Director of the Center for Biodiversity Informatics at the Missouri Botanical Garden. He holds an M.S. in Biological Sciences from Eastern Illinois University and an M.S. in Library and Information Science from University of Missouri-Columbia.

Sep 30: Clifford LYNCH: Stewardship: The Scope of the Challenge (Continued).
    I'll continue the discussion from September 16 exploring the nature of the (digital) cultural record through a series of detailed examinations of developments involving particular genres of content. After completing the discussion of recorded music from the last session, we'll discuss moving images (video and film); geospatial and remote sensing broadly; the web, including the "deep" web and consideration of new modes of grey literature and ephemera; and, time permitting, social media and its implications for stewardship. This discussion will continue on November 4.

Oct 7: Michael BUCKLAND & Wayne de FREMERY.
    Jeanette ZERNEKE: Brief Report on the Recent PNC and ECAI Meetings.
    Michael BUCKLAND & Wayne de FREMERY: Contexts, Works, and Catalogs.

    Building on work previously presented at this seminar by Wayne de Fremery and myself, we propose some fundamental changes to bibliographic and library search and discovery:
    1. Consider the purpose of retrieval systems as a search for "families" or "contexts" of related works rather than for particular items;
    2. Diminish the privileged status accorded to individual creative works, notably by Seymour Lubetzky ’34 and others, and redefine and redirect the Functional Requirements for Bibliographic Records (FRBR) model accordingly; and
    3. Unbundle the tight relationship between library catalog and library collection in order to harmonize theory with contemporary technological reality.
    Wayne de Fremery is Professor of Information Science and Entrepreneurship and Director of the Francoise O. Lepage Center for Global Innovation at Dominican University of California. Previously, he was an associate professor in the School of Media, Arts, and Science at Sogang University in South Korea, where he has lived for twenty years. He currently represents the Korean National Body at ISO as Convener of a working group on document description, processing languages, and semantic metadata (ISO/IEC JTC 1/SC 34 WG 9). His recent research projects have concerned "Digital humanities in the iSchool" (JASIST, 2022), "Copy theory" (JASIST, 2022), "Context, relevance, and labor" (JASIST, 2022), as well as the use of deep learning to improve Korean OCR, for which he received a national citation of merit from the South Korean Ministry of Culture, Sports, and Tourism. More at www.pwdef.info.

Oct 14: Cathy MARSHALL: Who Broke Mechanical Turk?
    Crowdsourcing platforms provide a valuable way to perform a wide range of human intelligence tasks--e.g. data labeling, content moderation, text translation, citizen science--as well as a convenient venue for collecting participant data. I've been using Amazon Mechanical Turk in various capacities since 2010, and have followed worker forums, labor organizing efforts, and the development of worker-centered tools (on one side) and increasingly sophisticated uses of the crowd (on the other). Early on, my colleagues and I were (perhaps naively) delighted by the quality of the data we gathered and by generally positive interactions we had with workers. Using practical advice from the literature, we were able to vet work and encourage good-faith participation in our studies.
    More recently, a handful of researchers from diverse disciplines who use crowdsourcing platforms have described an uptick in unusable data from US-based workers. Frank Shipman and I saw this ourselves in 2018 and 2019 when we re-ran a survey we'd used successfully five years earlier: by 2019, we had to exclude more than 12% of the completed HITs according to our established cleaning heuristics. Even knowing this, what we saw on Mechanical Turk this spring and summer startled us. Almost 90% of the data was unusable. In this talk, I'll use a preliminary analysis of our own and other researchers' data in an effort to explain what seems to be happening on Mechanical Turk, present evidence of why it's not necessarily a symptom of bots, autocompletion tools, or bad faith work, and speculate why Amazon has little incentive to do anything about it.
    Cathy Marshall is a senior research scientist at Texas A&M University and a former principal researcher at Microsoft Research, and before that, at Xerox PARC. She's a fan of personal ephemera, special collections, and a quiet reading room.

Oct 21: Students' Progress Reports.
    Sarah BARRINGTON, Arogya KOIRALA, Shai DHALIWAL, Calvin LEE, Alan KYLE, Siddharth ADELKAR and Ameya NAIK.
    Sarah BARRINGTON: The Fungibility of Non-Fungible Tokens: A Quantitative Analysis of ERC-721 Metadata.

    Non-Fungible Tokens (NFTs), digital certificates of ownership for virtual art, have until recently been traded on a highly lucrative and speculative market. Yet the emergence of misconceptions, along with a sustained market downturn, is calling the value of NFTs into question. This project (1) describes three properties that any valuable NFT should possess (permanence, immutability and uniqueness), (2) creates a quantitative summary of permanence as an initial criterion, and (3) tests our measures on 6 months of NFTs on the Ethereum blockchain, finding that 45% of ERC-721 tokens in our corpus do not satisfy this initial criterion. Our work could help buyers and marketplaces identify and warn users against purchasing NFTs that may be overvalued.
    Arogya KOIRALA: Monitoring War Destruction from Space Using Machine Learning.
    Extracting information on war-related destruction is difficult because it relies on eyewitness accounts and manual detection in the field, which is not only costly for institutions carrying out these efforts, but also unsafe for the individuals carrying out this task. The information gathered is also incomplete, which makes it difficult to use in media reporting, carrying out humanitarian relief, understanding human rights violations, or academic reporting. This seminar introduces an automated approach to measure destruction in war-damaged buildings by applying deep learning to publicly available satellite imagery. We adapt different neural network architectures and make them applicable for the building damage detection use case. As a proof of concept, we apply this method to the Syrian Civil War to reconstruct war destruction in these regions over time. In the last talk we outlined the problem space, and talked about different data-related considerations to keep in mind when approaching the problem using machine learning. For this talk, we will take a closer look at the data, introduce the different machine learning architectures that we will employ, and (if time permits) discuss the potential benefits and drawbacks of these choices as they relate to our goal of identifying war-related building destruction.
    Shai DHALIWAL: Modernizing Mainframe RACF.
    During this seminar, I plan to explore how cloud modernization will improve cyber security for legacy mainframe information systems, focusing on research progress made in the following technical areas:
  -   Structuring Mainframe Information, Metadata, Databases
  -   Tokenization, Privacy, and Standards, and
  -   Time permitting: RACF User Migration to the Cloud & Benefits.
    Calvin LEE: Evaluating Consumer Robocall Mitigations.
    At the all-time high in October, 4.5B robocalls were made across the US. We will evaluate the most and least effective mitigations to date. We will provide an update and review the latest policies through which the Federal Communications Commission is placing new obligations on gateway providers to play a more active role in curbing this abuse.
    Alan KYLE: Artificial Intelligence and Machine Learning Fairness.
    In this presentation I will share my progress in formulating my project on machine learning fairness toolkits. As a part of a large team, my contribution will focus on the policy aspects of these tools. Questions to be answered are: How are fairness toolkits used? What are the current AI policies that practitioners need to be advised on? And, how can organizations promote fairness practices through their internal policies?
    Siddharth ADELKAR: English Documentation of Non-English Stories: A Study of the People's Archive of Rural India.
    In understanding what information is lost or gained in English-first documentation of Indian conditions, I investigate the discourse in Indian English and its closeness to other Indian languages. I focus on four dimensions of language, viz. phonology, lexicon, syntax and, most importantly, what is being said: the contexts. I consider the verbal and written communications of Indian English writers who are native Marathi speakers, and compare the distance between their English and Marathi. If Indian English is closer to Indian languages than to, say, English (UK), and if the distance between them is comparable to that between native dialects of English, then it would be fair to say that Indian English is only as good or bad in documenting local conditions as its native languages.
    Ameya NAIK: Assessing “Data” in Data Subject Access Requests.
    Any mobile or desktop application prompts you to sign a term of policy agreement even before using it. This term of policy agreement allows the application to gather certain information related to you, your activities, and your attributes. The level and type of information gathered depend on the organization, business model, kind of application, and geographical location. EU data protection law and CCPA grant consumers the right to the personal data the company holds on them. While GDPR (Article 15) and CCPA have broadly drawn rules and regulations for Data Subject Access Requests (DSARs), there still are differences in the way the data is stored and shared with the consumers (you). We have initiated DSARs for certain mobile applications that fall into similar categories, taking due note of the process, and are comparing and visualizing the data.

Oct 28: Early start! 2:30 pm.
    Seminar combined with the School's 104th Birthday Celebration.

    For program and registration go to www.ischool.berkeley.edu/events/2022/104th-birthday-celebration.

Nov 4: Clifford LYNCH: Stewardship: The Scope of the Challenge (continued).
    In today's talk, I'll conclude my survey of now-digital mass-market content with an examination of digital news material. I'll next examine the stewardship challenges involved in social media platforms, and, time permitting, start a discussion of personal digital archiving and related issues.

Nov 11: No Seminar meeting. Veterans' Day holiday.

Nov 18: Final Progress Reports by Sarah BARRINGTON, Arogya KOIRALA, Shai DHALIWAL, Calvin LEE, Alan KYLE, Siddharth ADELKAR and Ameya NAIK.
    Sarah BARRINGTON: The Fungibility of Non-Fungible Tokens: A Quantitative Analysis of ERC-721 Metadata.

    Non-Fungible Tokens (NFTs), digital certificates of ownership for virtual art, have until recently been traded on a highly lucrative and speculative market of $17.6bn. Yet the emergence of misconceptions, along with a sustained market downturn, is calling the value of NFTs into question. This project (1) describes three properties that any valuable NFT should possess (permanence, immutability and uniqueness), (2) creates a quantitative summary of permanence as an initial criterion, and (3) tests this measure on 6 months of NFTs on the Ethereum blockchain, finding that 45% of ERC-721 tokens in our corpus do not satisfy this initial criterion. We find that, ultimately, 75% of ERC-721 NFT assets are stored off the blockchain entirely. Our work could help buyers and marketplaces identify and warn users against purchasing NFTs that may be overvalued.
    Arogya KOIRALA: Monitoring War Destruction from Space Using Machine Learning.
    In the last talk we presented different deep neural network architectures we have employed to detect building destruction from satellite imagery. We also noted some challenges, particularly around computational efficiency and model understanding. In this talk we walk you through approaches that we have brainstormed for addressing some of these challenges, viz. improving computational efficiency, getting a better sense of what the model is learning, and clarifying the benefit (if any) that we derive from employing these deep networks.
    Shai DHALIWAL: Modernizing Mainframe RACF.
    During this seminar, I plan to provide a final update on my research into how cloud modernization will improve cyber security for legacy mainframe information systems:
  -   Provide an in-depth analysis of opportunities for organizations to migrate their mainframe workloads securely to the cloud.
  -   Provide a final recommendation on what will come next in this space and showcase a potential framework for organizations to implement a RACF modernization program.
    Calvin LEE: The Robocall & Robotext Trajectory.
    Since the last Federal Communications Commission robocall mitigation initiative, 9 phone companies have been shut down due to non-compliance. Since then, there has been a reduction of 1.5B robocalls. Although the measures are proving to be effective, bad actors have been migrating into the robotext space instead. In our exploration of the current state of robocalls and robotexts, we will uncover the latest happenings in politics and industry.
    Alan KYLE: Artificial Intelligence and Machine Learning Fairness.
    This final progress report will discuss the initial steps and findings of a MIMS capstone project to build an AI fairness toolkit for the healthcare space. Specifically, international public policy and ethics frameworks for AI will be discussed.
    Siddharth ADELKAR: English Documentation of Non-English Stories: A Study of the People's Archive of Rural India.
    In my final report I shall cover two other lines of investigation apart from English-first documentation of Indian life. I discuss the adaptability of Large Language Models such as OpenAI's Whisper to Indian-accented English and Indian languages. I also start a preliminary study of the multilingual "course articulation" problem in universities.
    Ameya NAIK: Assessing “Data” in Data Subject Access Requests.
    Any mobile or desktop application prompts you to sign a term of policy agreement even before using it. This term of policy agreement allows the application to gather certain information related to you, your activities, and your attributes. The level and type of information gathered depend on the organization, business model, kind of application, and geographical location. EU data protection law and CCPA (and the new CPRA) grant consumers the right to the personal data the company holds on them, and all major applications have mechanisms to access this information, often referred to as Data Subject Access Requests (DSARs). The project evaluates the process of raising these DSARs and analyzes the data applications hold on their users. Additionally, we analyzed how this process works in educational institutions.

Nov 25: No Seminar meeting. Thanksgiving.

Dec 2: In person and Zoom.
    Clifford LYNCH: Stewardship: The Scope of the Challenge (continued).

    Today's talk will focus on stewardship and preservation issues related to social media. This will include a look at the evolving social media "universe" and the various platforms that might be viewed (or not viewed) as part of this universe, consideration of reasons why some platforms might be prioritized over others and how to adapt to shifting usage patterns among platforms. The varying possible objectives of stewardship activities surrounding social media -- and the feasibility of achieving them -- will also be discussed at some length from technical, legal and policy perspectives.
    If time permits, we will continue on with a discussion of personal digital archiving (including its relationship to the stewardship of social media). I expect that this discussion will continue at the opening session of the Spring 2023 Seminar on January 20, 2023.

The Seminar will resume in the Spring semester.
Spring 2022 schedule and summaries. Spring 2023 schedule and summaries.