Web Foundations (URIs & HTTP)

Web Architecture and Information Management
Spring 2011 — INFO 153 (CCN 42509)

Erik Wilde and Dilan Mahendran, UC Berkeley School of Information
2011-02-16

This work is licensed under a CC Attribution 3.0 Unported License [http://creativecommons.org/licenses/by/3.0/]


(2) Abstract

The Web's architecture rests on a few simple principles: a consistent, global identification mechanism for resources, a standardized way of retrieving resource representations, and standardized media types that make those representations usable. Based on the Internet, the Web's transport protocol transmits representations of resources identified by a Uniform Resource Identifier (URI) between Web servers and clients. The most important protocol for data transfer on the Web is the Hypertext Transfer Protocol (HTTP).




(3) Web Server Service



Uniform Resource Identifier (URI)

Outline (Uniform Resource Identifier (URI))

  1. Uniform Resource Identifier (URI) [7]
  2. Hypertext Transfer Protocol (HTTP) [14]
    1. HTTP Basics [7]
    2. HTTP Authentication [5]

(5) Resource Identification

Global naming leads to global network effects... the value of an identifier increases the more it is used consistently

Architecture of the World Wide Web, Volume One [http://www.w3.org/TR/webarch/]



(6) URIs & Resources



(7) URIs & Resources



(8) URI Schemes

URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
http://dret.net/lectures/web-spring09/foundations#uri-schemes
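
The generic syntax above can be tried out on the example URI. The following is a minimal sketch using Python's standard `urllib.parse` module; the helper name `uri_components` is an assumption made for illustration, and `urlsplit` reports the authority part of the hier-part as `netloc`.

```python
from urllib.parse import urlsplit

def uri_components(uri):
    """Split a URI into the parts named in the generic syntax."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,  # first half of the hier-part
        "path": parts.path,         # second half of the hier-part
        "query": parts.query,       # empty string if there is no "?"
        "fragment": parts.fragment, # empty string if there is no "#"
    }

components = uri_components(
    "http://dret.net/lectures/web-spring09/foundations#uri-schemes"
)
for name, value in components.items():
    print(f"{name:10} {value!r}")
```

The fragment is only interpreted on the client side; it is not part of what a client sends to the server.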


(9) Resources & Representations



(10) 1 Resource, 2 Representations

img/google-representations-1.png


(11) 2 Resources, 1 Representation

img/google-representations-2.png


Hypertext Transfer Protocol (HTTP)


(13) DNS & HTTP

The two basic protocols that every Web browser relies on are DNS [Internet Architecture; Domain Name System (DNS) (1)] access and HTTP [Hypertext Transfer Protocol (HTTP) (1)]. Most operating systems, however, provide an API for DNS access, so the browser can use this service locally and only has to implement HTTP itself. TCP [Internet Architecture; Transmission Control Protocol (TCP) (1)] (which is required as the foundation for HTTP) is usually provided by the operating system as well.

browser-dns-http.png
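
As a rough illustration of using the operating system's resolver instead of implementing DNS, the sketch below calls Python's `socket.getaddrinfo`. The helper name `resolve` is an assumption; the lookup is restricted to IPv4 and TCP only to keep the result short.

```python
import socket

def resolve(host, port=80):
    """Ask the operating system's resolver for the IPv4 addresses of a host."""
    infos = socket.getaddrinfo(
        host, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry ends with a (address, port) pair; keep unique addresses in order.
    return list(dict.fromkeys(info[4][0] for info in infos))

# A numeric address needs no DNS query and resolves to itself.
print(resolve("127.0.0.1"))
```

Calling `resolve` with a host name such as `ischool.berkeley.edu` triggers an actual lookup, so the result depends on the network the code runs in.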

(14) The Web's Protocol

internet-traffic-trends.png

provided by CacheLogic Inc. [http://www.cachelogic.com/]



HTTP Basics

(16) HTTP Messages

  • HTTP needs a reliable connection
    • the foundation for HTTP is the Transmission Control Protocol (TCP) [Internet Architecture; Transmission Control Protocol (TCP) (1)]
    • DNS resolution yields an IP address
    • open a TCP connection to port 80 or to the port specified in the URI (http://rosetta.sims.berkeley.edu:8085/)
  • HTTP is a text-based protocol
    • the connection is used to transmit text messages
    • all HTTP messages are human-readable (not all entities, though)
    • basic HTTP operations can be carried out by hand
start-line
message-header *

message-body ?
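
Because messages are plain text, the structure above can be taken apart with a few string operations. The following is a minimal sketch under simplifying assumptions (no folded header lines, the whole message already in memory); the function names `connection_target` and `parse_message` are made up for this example.

```python
from urllib.parse import urlsplit

def connection_target(uri):
    """Return the host and TCP port to connect to (port 80 if the URI names none)."""
    parts = urlsplit(uri)
    return parts.hostname, parts.port or 80

def parse_message(raw):
    """Split a raw HTTP message into start line, header fields, and body."""
    # An empty line separates the header section from the message body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    headers = []
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers.append((name.strip(), value.strip()))
    return lines[0], headers, body

print(connection_target("http://rosetta.sims.berkeley.edu:8085/"))
print(parse_message(b"GET / HTTP/1.1\r\nHost: ischool.berkeley.edu\r\n\r\n"))
```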


(17) HTTP Header Fields

  • Header fields contain information about the message
    • general header: Date as the message origination date
    • request header: Accept-Language indicates language preferences
    • response header: Server contains system information
    • entity header: Content-Type specifies the media type of the entity
  • HTTP defines a number of header fields [http://www.cs.tut.fi/~jkorpela/http.html]
    • unknown fields must be ignored (extensibility)
    • unstandardized fields should use an X- prefix
  • HTTP is about acting on these fields
    • HTTP defines what HTTP implementations must or should do
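
The rule that unknown fields must be ignored can be sketched as a lookup table of handlers. Everything named here (`apply_headers`, the `state` dictionary, the `X-Example` field) is hypothetical and only serves the illustration; field names are compared case-insensitively, as HTTP requires.

```python
def apply_headers(headers, handlers):
    """Run the handler for every known field; skip fields nobody handles."""
    ignored = []
    for name, value in headers:
        handler = handlers.get(name.lower())  # field names are case-insensitive
        if handler is None:
            ignored.append(name)  # unknown field: ignore it, do not fail
            continue
        handler(value)
    return ignored

state = {}
handlers = {
    "content-type": lambda value: state.update(media_type=value),
    "date": lambda value: state.update(sent=value),
}
ignored = apply_headers(
    [("Content-Type", "text/html"), ("X-Example", "1")], handlers
)
print(state, ignored)
```

Skipping unknown fields instead of rejecting the message is what allows new header fields to be introduced without breaking existing implementations.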


(18) HTTP Requests

  • After opening a connection, the client sends a request
    • the method indicates the action to be performed on the resource
    • HTTP's most interesting methods are: GET, HEAD, POST
    • other interesting methods are: PUT, DELETE
  • The URI identifies the resource to which the request should be applied
    • absolute URIs are required when contacting proxies
    • absolute paths are required when contacting a server directly
    • the URI may contain query information
  • The Host header field must be included in every request
Method Request-URI HTTP/Major.Minor
[Header]*

[Entity]?
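
The difference between contacting a proxy and contacting a server directly shows up in the request line. The sketch below assembles a request without an entity; `build_request` and its `via_proxy` flag are assumptions for illustration, and URIs containing user information are not handled.

```python
from urllib.parse import urldefrag, urlsplit

def build_request(method, uri, via_proxy=False):
    """Assemble a minimal HTTP/1.1 request (no entity) as text."""
    uri = urldefrag(uri)[0]  # fragments are never sent to the server
    parts = urlsplit(uri)
    if via_proxy:
        target = uri  # proxies need the absolute URI
    else:
        target = parts.path or "/"  # servers get the absolute path
        if parts.query:
            target += "?" + parts.query
    lines = [
        f"{method} {target} HTTP/1.1",
        f"Host: {parts.netloc}",  # required in every HTTP/1.1 request
        "",  # empty line ends the header section
        "",
    ]
    return "\r\n".join(lines)

print(build_request("GET", "http://ischool.berkeley.edu/"))
```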


(19) HTTP GET

  • Retrieval action based on the URI
    • may be implemented by reading a file
    • may be implemented by processing a file (PHP)
    • may be implemented by invoking a process
  • Semantics may change based on header fields
    • If-*: only reply with the entity if necessary
    • Range: only reply with the requested part of the entity
  • Cacheability depends on header fields of the response
GET / HTTP/1.1
Host: ischool.berkeley.edu
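
How a Range header changes the outcome of a GET can be sketched on the server side. This is a simplified illustration, not a complete implementation: the function name `apply_range` is an assumption, only a single `bytes=first-last` or `bytes=first-` range is understood, and every other form is ignored so that the full entity is returned.

```python
def apply_range(entity, range_header=None):
    """Return (status code, body) for a GET with an optional Range header."""
    if range_header is None:
        return 200, entity
    unit, _, spec = range_header.partition("=")
    first, _, last = spec.strip().partition("-")
    simple = (
        unit.strip() == "bytes"
        and first.isdigit()
        and (last == "" or last.isdigit())
    )
    if not simple:
        return 200, entity  # unsupported form: ignore Range, send everything
    start = int(first)
    end = int(last) if last else len(entity) - 1
    if end < start:
        return 200, entity  # invalid range: treated as if Range were absent
    if start >= len(entity):
        return 416, b""  # Requested Range Not Satisfiable
    return 206, entity[start:end + 1]  # Partial Content; byte positions are inclusive

print(apply_range(b"0123456789", "bytes=2-5"))
```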


(20) HTTP Responses

  • The server's response to interpreting a request
    • the status code is given numerically and as text
    • 2** for variations of ok
    • 3** for redirections
    • 4** for client-side problems (404: not found)
    • 5** for server-side problems
  • Header fields specify additional information
    • information about the server
    • information about the entity (media type, encoding, language)
HTTP/Major.Minor Status-Code Text
[Header]*

[Entity]?
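
A client usually looks at the first digit of the status code before anything else. The following sketch parses a status line and maps the code to the classes listed above; the names `parse_status_line`, `STATUS_CLASSES`, and `status_class` are assumptions, and codes outside the four listed classes are simply reported as "other".

```python
def parse_status_line(line):
    """Split a status line into protocol version, numeric code, and text."""
    version, code, text = line.split(" ", 2)
    return version, int(code), text

STATUS_CLASSES = {
    2: "ok",
    3: "redirection",
    4: "client-side problem",
    5: "server-side problem",
}

def status_class(code):
    """Map a status code to its class using only the first digit."""
    return STATUS_CLASSES.get(code // 100, "other")

version, code, text = parse_status_line("HTTP/1.1 404 Not Found")
print(version, code, text, "->", status_class(code))
```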


(21) HTTP Performance

  • HTTP/1.0 allowed one transaction per connection
    • TCP connection setup and teardown are expensive
    • TCP's slow start slows down the initial phase of data transfer
    • typical Web pages use between 10 and 20 resources (HTML + images + CSS + scripts)
    • typically, these resources are stored on the same server
  • HTTP/1.1 introduces persistent connections
    • the TCP connection stays open for some time (10 sec is a popular choice)
    • additional requests to the same server use the same TCP connection
  • HTTP/1.1 introduces pipelined connections
    • instead of waiting for a response, requests can be queued
    • the server responds as fast as possible
    • responses must be sent in the same order as the requests (there is no sequence number to match them)
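
Persistent connections can be observed with a small local experiment. The sketch below starts a throwaway server from Python's standard library on 127.0.0.1, sends several requests over one `http.client.HTTPConnection`, and records the client's TCP source port for each request; the handler class and function names are assumptions for illustration. Pipelining is not shown, because `http.client` waits for each response before sending the next request.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class PathEchoHandler(BaseHTTPRequestHandler):
    # HTTP/1.1 makes http.server keep the connection open between requests.
    protocol_version = "HTTP/1.1"
    client_ports = []  # TCP source port of every request, recorded for the demo

    def do_GET(self):
        PathEchoHandler.client_ports.append(self.client_address[1])
        body = self.path.encode("ascii")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        # Content-Length tells the client where the entity ends.
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet

def fetch_over_one_connection(paths):
    """GET all paths over a single TCP connection to a local test server."""
    PathEchoHandler.client_ports.clear()
    server = HTTPServer(("127.0.0.1", 0), PathEchoHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
    bodies = []
    try:
        for path in paths:
            conn.request("GET", path)
            response = conn.getresponse()
            bodies.append(response.read().decode("ascii"))
    finally:
        conn.close()
        server.shutdown()
        server.server_close()
    return bodies, list(PathEchoHandler.client_ports)

bodies, ports = fetch_over_one_connection(["/index.html", "/style.css", "/logo.png"])
print(bodies, "distinct client ports:", len(set(ports)))
```

If every request opened its own TCP connection, each one would typically arrive from a different source port; a single distinct port indicates that the connection was reused.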


(22) HTTP Connection Handling

http-phttp-pipelining.png

HTTP Authentication


(24) HTTP Access Control

  • HTTP servers can deny access [http://en.wikipedia.org/wiki/List_of_HTTP_status_codes#4xx_Client_Error] because of access control
    • 401 Unauthorized means the resource is access controlled
    • 403 Forbidden means the resource is inaccessible
    • 405 Method Not Allowed signals a request using the wrong request method [HTTP Requests (1)]
  • Two different approaches to unauthorized access are possible
    • repeat the HTTP request with the proper authentication credentials
    • redirect to a Login Page [Login Page (1)] and establish an authenticated Session [State Management (Cookies); Session (1)]
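
One way a server might choose between the three status codes is sketched below. The function `access_status`, its parameters, and the order of the checks are assumptions made for illustration; real servers differ in how they combine these decisions.

```python
def access_status(method, allowed_methods, user, permitted_users):
    """Pick a status code for a request to an access-controlled resource.

    `user` is None when the request carried no valid credentials.
    """
    if method not in allowed_methods:
        return 405  # Method Not Allowed: wrong request method
    if user is None:
        return 401  # Unauthorized: credentials are needed
    if user not in permitted_users:
        return 403  # Forbidden: credentials do not help
    return 200

print(access_status("GET", {"GET", "HEAD"}, None, {"alice"}))
```

A 401 invites the client to repeat the request with credentials, whereas a 403 tells it that repeating the request will not change the outcome.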


(25) HTTP Authentication

HTTP Authentication

(26) Basic HTTP Authentication



(27) Repeated Access

  • Clients typically access more than one protected resource
    • a perfectly stateless client would always request authentication from the user
    • using the realm, clients can identify repeated accesses
  • Web interactions by default are perfectly stateless
    • each request is completely independent from other requests
    • stateless interactions make the Web loosely coupled and scalable
    • concepts like the realm or State Management (Cookies) [State Management (Cookies)] introduce state
  • Clients remember the authentication and replay it automatically
    • browsers provide little control over this feature
    • logging out of HTTP authenticated sessions is hard
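
The replay behavior described above can be sketched as a small cache keyed by host and realm. With Basic HTTP Authentication the credentials travel as the Base64 encoding of `user:password` in the Authorization header field. The class `CredentialCache` and the helper names are assumptions for illustration, the host name and realm in the usage lines are invented, and the realm parsing only covers the simplest form of a Basic challenge.

```python
import base64
import re

def basic_authorization(user, password):
    """Build the Authorization header value for Basic HTTP Authentication."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")

def challenge_realm(www_authenticate):
    """Extract the realm from a simple Basic challenge, or None if absent."""
    match = re.match(r'\s*Basic\s+realm="([^"]*)"', www_authenticate, re.IGNORECASE)
    return match.group(1) if match else None

class CredentialCache:
    """Remember credentials per (host, realm) so they can be replayed."""

    def __init__(self):
        self._by_realm = {}

    def remember(self, host, realm, user, password):
        self._by_realm[(host, realm)] = basic_authorization(user, password)

    def header_for(self, host, realm):
        return self._by_realm.get((host, realm))  # None: the user must be asked

    def forget(self, host, realm):
        self._by_realm.pop((host, realm), None)  # the "log out" browsers rarely offer

cache = CredentialCache()
realm = challenge_realm('Basic realm="Course Material"')
cache.remember("example.org", realm, "Aladdin", "open sesame")
print(cache.header_for("example.org", realm))
```

Base64 is an encoding, not encryption, so anyone who can read the request can recover the password.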


(28) Login Page

  • Basic HTTP Authentication [Basic HTTP Authentication (1)] works with browser controls (including the window)
    • no possibility to log out without using browser-specific controls
    • client side security depends on browser security measures
  • Using forms gives more freedom in session management
    • authentication and authorization are completely application-based
    • if there were secure personal browsers this would not work very well



(29) Conclusions


