Due dates:

Update 12/2:

Update 12/5:

Update 12/6:

Update 12/9:

Getting started

Milestone 1

Download CS201_Assign06_MS1.zip and import it into your Eclipse workspace (File → Import → General → Existing projects into workspace → Archive file.)

You should see a project called CS201_Assign06_MS1 in the Package Explorer. Your will be implementing the methods in the URL.

Update 12/2: The original version of URLTest.java did not have any tests for the getDirectoryPart method. Here is a fixed version:

URLTest.java

Milestone 2

Download CS201_Assign06_MS2.zip and import it into your Eclipse workspace (File → Import → General → Existing projects into workspace → Archive file.)

You should see a project called CS201_Assign06_MS2 in the Package Explorer. You will need to copy the URL class from your Milestone 1 project into the Milestone 2 project. Your will be implementing the methods in the LinkExtractor and Crawler class.

Your task

Milestone 1: URLs

Your first task is to implement the methods in the URL class so that the tests in URLTest pass.

Very important: Do not use the java.net.URL class, or any other classes in the java.net package, in your implementation of the URL class.

A URL is a string of text specifying the location of a web resource. A URL consists of a protocol (optional), followed by a host (optional), followed by a path (required).

A protocol is a sequence of characters ending in a colon (“:”).

A host starts with “//” and then consists of a sequence of non-slash characters. If the URL’s protocol is either http: or https:, then the host part of the URL is not optional.

A path is a sequence of components separated by slashes. Each component is a sequence of non-slash characters. As a special case, a single slash character (“/”) is also considered a valid path. Also note that if the URL specifies a host, the path must start with a slash.

Here are some example URLs with the parts (protocol, host, and path) identified and color-coded:

http://ycpcs.github.io/cs201-fall2016/assign/assign06.html

https://ycpcs.github.io/cs201-fall2016/assign/assign06.html

https://ycpcs.github.io/cs201-fall2016/

https://ycpcs.github.io/

//ycpcs.github.io/cs201-fall2016/assign/assign06.html

foo/bar/baz.html

foo/bar/baz/../thud/../../blat.html

https://ycpcs.github.io/cs201-fall2016/assign/../practice/index.html

A URL is absolute if its path begins with a slash.

A URL is a directory if its path ends with a slash.

A URL is canonical if its path does not contain any “.” or “..” components.

A URL object is an instance of the URL class, and represents the information extracted from the textual form of a URL. You will need to implement the following methods:

Note that the equals, hashCode, and compareTo methods are already defined: you will not need to modify them. (All of them call toString, so they will not work correctly unless toString works correctly.)

Also note that URL is an immutable class. This means that none of the methods supported by URL objects modify the object’s data. In other words, URL objects are “read only” once they are created.

Finding the canonical form of a URL

A URL is in canonical form if its path does not contain any occurrences of “.” and “..”. Many URLs which are not in canonical form can be converted to canonical form.

To understand how to convert a URL to canonical form, let’s consider what the path of a URL means. The path is a description of how to get from a starting point to a destination in a hierarchical directory structure. Consider the following hierarchy:

/
    reptiles/
        lizards/
            monitorLizard.html
            komodoDragon.html
            gilaMonster.html
    mammals/
        marsupials/
            koala.html
            potoroo.html
            wombat.html
        monotremes/
            platypus.html
            echidna.html
    dinosaurs/
        therizinosaurs/
            therizinosaurus.html
            nothronychus.html
            beipaosaurus.html

We can specify a path which describes how to get from the root (/) of the hierarchy to any resource in the hierarchy. For example, the path

/mammals/marsupials/potoroo.html

means “start in the root directory, then go into the mammals directory, then go into the marsupials directory, then find the resource called potoroo.html”. This is an example of an absolute path, because it starts in the root of the hierarchy. Paths can also be relative, which means that the location of a resource is specified in relation to another resource whose exact location in the hierarchy is known. For example, the path

monotremes/echidna.html

is relative. If the starting point is the directory

/mammals/

then the exact resource being named is the concatenation of the starting directory and the relative path, which in this case is

/mammals/monotremes/echidna.html

This brings us to the meaning of the “.” and “..” path components:

Consider the relative path

../../reptiles/lizards/./monitorLizard.html

If the starting point directory is

/dinosaurs/therizinosaurs/

then the complete path is

/dinosaurs/therizinosaurs/../../reptiles/lizards/./monitorLizard.html

What does this mean? Reading the path from left to right, it means

This does indeed allow us to find the resource, but the “..” and “.” components are redundant. A URL in canonical form, because it describes the simplest path to find a resource, is a better representation of the location of the resource. In addition, for any URL that can be put in canonical form, the canonical form is unique.

Here is how we can convert a URL not in canonical form to canonical form:

  1. convert the URL’s path into a sequence of components
  2. create a stack of string values
  3. for each component, in order:
    • if the component is “.”, ignore it
    • if the component is “..”, pop the stack
    • otherwise, push the component onto the stack

If the process completes successfully, the components in the stack (from bottom to top) are the components of the canonical form of the path. (One reason that conversion to canonical form may fail is if a “..” component is seen, but the stack is empty.)

In the example above, the canonical form of the path is

/reptiles/lizards/monitorLizard.html

Getting a referenced URL

So, if having a URL in canonical form is a good thing, why are the “.” and “..” components used at all?

The reason is that when one web page links to another resource, it may be convenient to specify the linked resource’s location relative to its own location. For example, if a web page with the path

/dinosaurs/therizinosaurs/nothronychus.html

contains a link

../sauropods/argentinosaurus.html

then the effective path of the linked resource is

/dinosaurs/therizinosaurs/../sauropods/argentinosaurus.html

In other words, we take the directory part of the document containing the link, and append to it the relative link. The idea is that a “..” path component allows a link to a resource “below” or “beside” the web page containing the link.

Because canonical form is always the best way to specify the location, of a resource, referenced URLs should be converted into canonical form. In the example above, the canonical form of the linked resource’s path would be

/dinosaurs/sauropods/argentinosaurus.html

The getReferencedURL method in the URL class should do the following:

  1. If the referenced URL’s path is absolute, convert it to canonical form and return the canonical form
  2. If the referenced URL’s path is not absolute, then
    1. Find the referenced URL’s effective path by appending it onto the directory part of the “base” URL (which is the URL object the method is called on)
    2. Return the canonical form of the URL whose protocol and host are the same as the base URL, but whose path is the effective path of the referenced URL

Milestone 2: Web crawler

In the second milestone, you will complete the implementation of two classes, LinkExtractor and Crawler.

LinkExtractor class

The LinkExtractor class is used to scan the text of a web page for links to other resources (including other web pages.)

For the purposes of this assignment, you can assume that a link is text between the delimiters href=" and ". You may also assume that any line of text in a web page will contain at most one link.

As an example, consider the following web page:

<html>
  <body>
    <a href="coolStuff.html">Some cool stuff</a>
    
    <a href="info/resources.html">Resources</a>
  </body>
</html>

This web page has two links:

coolStuff.html
info/resources.html

To implement the LinkExtractor class, you should do the following:

You can use the LinkExtractorTest class to test your LinkExtractor implementation. Make sure all of the tests pass.

Crawler class

The Crawler class implements a “web crawl”:

This algorithm is implemented in the crawl method, and is a form of the breadth-first search algorithm.

Your task is to complete the following methods:

Each method is described by a comment.

You will need to read the code in the crawl method carefully to understand the role of each of these methods in the crawling algorithm.

The FileCrawlerTest class has unit tests for the Crawler class: make sure these pass. You can also try the HttpCrawlerTest class, which runs the same tests as FileCrawlerTest, but loads web pages from the course website rather than from files.

Update: 12/9: Please replace the getStartURL method in the FileCrawlerTest class with the following one:

@Override
public URL getStartURL() throws Exception {
    File baseDir = new File(".").getCanonicalFile();
    return new URL("file:" +
        baseDir.toString().replace('\\', '/').replace("H:", "//storage/home") +
        "/exampleSite/index.html");
}

This should allow FileCrawlerTest to run correctly on the Windows computers in KEC. Note that this test will probably not run correctly on other Windows computers. It should run correctly on all Linux and Mac computers.

In any case, the HttpCrawlerTest class should run correctly on all computers (as long as there is a network connection), and its tests are equivalent to those in FileCrawlerTest.

Hints and specifications

For the makeCanonical method in URL, you can split a path into components as follows. Assume that the path variable is the path part of a URL. The following code will break the path into an array of components:

String[] components = path.split("/");

Note that if the path starts with “/” there will be an empty first component, and if the path ends with “/” there will be an empty last component.

For LinkExtractor the following methods in the String class may be useful. For example, if s is a string:

Grading criteria

Milestone 1:

Points may be deducted for poor coding style such as non-private fields, inconsistent indentation, etc.

Milestone 2:

Submitting

Save the project (CS201_Assign06_MS1 or CS201_Assign06_MS2) to a zip file by right-clicking it and choosing

Export…→Archive File

Upload the saved zip file to the Marmoset server as assign06_ms1 or assign06_ms2 (for Milestones 1 and 2, respectively). The server URL is

https://cs.ycp.edu/marmoset/