Methods for drawing down, editing and uploading data about documents.
Return the document with the provided DocumentCloud identifier.
>>> from documentcloud import DocumentCloud
>>> client = DocumentCloud(USERNAME, PASSWORD)
>>> client.documents.get('71072-oir-final-report')
<Document: Final OIR Report>
Return a list of documents that match the provided keyword.
>>> from documentcloud import DocumentCloud
>>> client = DocumentCloud()
>>> obj_list = client.documents.search('Ruben Salazar')
>>> obj_list[0]
<Document: Final OIR Report>
Save changes to a document back to DocumentCloud. You must be authorized to make these changes. Only the title, source, description, related_article, published_url, access and data attributes may be edited.
>>> # Grab a document
>>> obj = client.documents.get('71072-oir-final-report')
>>> print(obj.title)
Draft OIR Report
>>> # Change its title
>>> obj.title = "Brand new title"
>>> print(obj.title)
Brand new title
>>> # Save those changes
>>> obj.put()
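Because only a handful of attributes can be saved, it can help to sort a batch of proposed changes before applying them. The following is a minimal sketch and not part of the library: `EDITABLE_FIELDS` and `split_changes` are hypothetical names, and the field list is copied from the description above.

```python
# Hypothetical helper, not part of the documentcloud package.
# The names below are the editable attributes listed in the description above.
EDITABLE_FIELDS = frozenset([
    "title", "source", "description", "related_article",
    "published_url", "access", "data",
])


def split_changes(changes):
    """Split a dict of proposed changes into (editable, rejected) dicts."""
    editable = {k: v for k, v in changes.items() if k in EDITABLE_FIELDS}
    rejected = {k: v for k, v in changes.items() if k not in EDITABLE_FIELDS}
    return editable, rejected


proposed = {"title": "Brand new title", "pages": 12}
editable, rejected = split_changes(proposed)
print(sorted(editable))
print(sorted(rejected))
```

The `editable` dict could then be applied to a document with `setattr(obj, name, value)` for each item before calling `obj.put()`.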
Delete a document from DocumentCloud. You must be authorized to make these changes.
>>> obj = client.documents.get('71072-oir-final-report')
>>> obj.delete()
An alias for put that saves changes back to DocumentCloud.
Upload a PDF to DocumentCloud. You must be authorized to do this. Returns the object representing the new record you’ve created. You can submit either a file path or a file object.
>>> from documentcloud import DocumentCloud
>>> client = DocumentCloud(USERNAME, PASSWORD)
>>> obj = client.documents.upload("/home/ben/test.pdf", "Test PDF")
>>> # Now fetch it again by its identifier
>>> client.documents.get(obj.id)
<Document: Test PDF>
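Since the upload accepts either a file path or a file object, a script that gathers inputs from several places may want to tell the two apart before uploading. This is a rough sketch under that assumption: `describe_upload_source` is a hypothetical helper, and it does not reflect how the library itself inspects its argument.

```python
import os
import tempfile


def describe_upload_source(source):
    """Return 'path' or 'file object' for a prospective upload source.

    Hypothetical helper: it only distinguishes the two input kinds the
    description above mentions, and raises for anything else.
    """
    if hasattr(source, "read"):
        return "file object"
    if isinstance(source, str) and os.path.isfile(source):
        return "path"
    raise ValueError("expected an existing file path or an open file object")


# Create a throwaway file so the example runs anywhere.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as handle:
    handle.write(b"%PDF-1.4\n")
    pdf_path = handle.name

path_kind = describe_upload_source(pdf_path)
with open(pdf_path, "rb") as pdf_file:
    file_kind = describe_upload_source(pdf_file)
os.remove(pdf_path)

print(path_kind)
print(file_kind)
```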
Searches through the provided path and attempts to upload all the PDFs it can find. Metadata provided to the other keyword arguments will be recorded for all uploads. Returns a list of document objects that are created. Be warned, this will upload any documents in directories inside the path you specify.
>>> from documentcloud import DocumentCloud
>>> client = DocumentCloud(DOCUMENTCLOUD_USERNAME, DOCUMENTCLOUD_PASSWORD)
>>> obj_list = client.documents.upload_directory('/home/ben/pdfs/groucho_marx/')
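Given the warning about nested directories, it may be worth previewing which files a recursive search would pick up before uploading anything. The sketch below uses only the standard library. `find_pdfs` is a hypothetical helper, and matching on a case-insensitive `.pdf` extension is an assumption; the library's own matching rule may differ.

```python
import os
import tempfile


def find_pdfs(path):
    """Yield every file under path whose name ends in .pdf, recursively."""
    for dirpath, dirnames, filenames in os.walk(path):
        dirnames.sort()  # visit subdirectories in a predictable order
        for name in sorted(filenames):
            # Assumption: a case-insensitive extension check identifies a PDF.
            if name.lower().endswith(".pdf"):
                yield os.path.join(dirpath, name)


# Build a small throwaway tree to show that nested directories are included.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "nested"))
for relative in ("a.pdf", "notes.txt", os.path.join("nested", "b.pdf")):
    with open(os.path.join(root, relative), "wb") as handle:
        handle.write(b"")

found = [os.path.relpath(p, root) for p in find_pdfs(root)]
print(found)
```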
The privacy level of the resource within the DocumentCloud system. It will be either public, private or organization, the last of which means the resource is only visible to members of the contributor’s organization. Can be edited and saved with a put command.
A list of the annotations users have left on the document. The data are modeled by their own Python class, defined in the Annotations section.
>>> obj = client.documents.get('83251-fbi-file-on-christopher-biggie-smalls-wallace')
>>> obj.annotations
[<Annotation>, <Annotation>, <Annotation>, <Annotation>, <Annotation>]
The URL where the document is hosted at documentcloud.org.
The user who originally uploaded the document.
The organizational affiliation of the user who originally uploaded the document.
The date and time that the document was created, in Python’s datetime format.
A dictionary containing supplementary data linked to the document. This can be any old thing. It’s useful if you’d like to store additional metadata. Can be edited and saved with a put command.
>>> obj = client.documents.get('83251-fbi-file-on-christopher-biggie-smalls-wallace')
>>> obj.data
{'category': 'hip-hop', 'byline': 'Ben Welsh', 'pub_date': datetime.date(2011, 3, 1)}
A summary of the document. Can be edited and saved with a put command.
A list of the entities extracted from the document by OpenCalais. The data are modeled by their own Python class, defined in the Entities section.
>>> obj = client.documents.get('83251-fbi-file-on-christopher-biggie-smalls-wallace')
>>> obj.entities
[<Entity: Angeles>, <Entity: FD>, <Entity: OO>, <Entity: Los Angeles>, ...
Returns the full text of the document, as extracted from the original PDF by DocumentCloud. Results may vary, but this will give you what they got. Currently, DocumentCloud only makes this available for public documents.
>>> obj = client.documents.get('71072-oir-final-report')
>>> obj.full_text
"Review of the Los Angeles County Sheriff's\nDepartment's Investigation into the\nHomicide of Ruben Salazar\nA Special Report by the\nLos Angeles County Office of Independent Review\n ...
Returns the URL that contains the full text of the document, as extracted from the original PDF by DocumentCloud.
The unique identifier of the document in DocumentCloud’s system. Typically this is a string that begins with a number, like 83251-fbi-file-on-christopher-biggie-smalls-wallace
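To make the shape of a typical identifier concrete, here is a small sketch that separates the leading number from the rest. `split_document_id` is a hypothetical helper, and the number-hyphen-slug layout is an assumption inferred from the example identifiers on this page rather than a documented guarantee.

```python
import re

# Assumed layout: digits, a hyphen, then a free-form slug.
ID_PATTERN = re.compile(r"^(?P<number>\d+)-(?P<slug>.+)$")


def split_document_id(document_id):
    """Return (number, slug) for an identifier shaped like '123-some-title'."""
    match = ID_PATTERN.match(document_id)
    if match is None:
        raise ValueError("unexpected identifier format: %r" % (document_id,))
    return int(match.group("number")), match.group("slug")


number, slug = split_document_id(
    "83251-fbi-file-on-christopher-biggie-smalls-wallace"
)
print(number)
print(slug)
```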
Returns the binary data for the “large” sized image of the document’s first page. If you would like the data for some other page, pass the page number into document_obj.get_large_image(page). Currently, DocumentCloud only makes this available for public documents.
Returns a URL containing the “large” sized image of the document’s first page. If you would like the URL for some other page, pass the page number into document_obj.get_large_image_url(page).
Returns a list of URLs for the “large” sized image of every page in the document.
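The relationship between the per-page method and the full list can be pictured as one call per page. This is only an illustration with a stand-in function: `collect_page_urls` and `fake_get_large_image_url` are hypothetical, the URL shape is made up, and numbering pages from 1 is an assumption.

```python
def collect_page_urls(get_url, page_count):
    """Call get_url(page) for pages 1..page_count and return the results."""
    return [get_url(page) for page in range(1, page_count + 1)]


def fake_get_large_image_url(page):
    # Stand-in for document_obj.get_large_image_url; this URL is invented.
    return "https://example.invalid/large/p{}.gif".format(page)


urls = collect_page_urls(fake_get_large_image_url, 3)
for url in urls:
    print(url)
```

With a real document, the bound method `document_obj.get_large_image_url` would take the place of the stand-in function.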
When the document has been retrieved via a search, this returns a list of places the search keywords appear in the text. The data are modeled by their own Python class, defined in the Mentions section.
>>> obj_list = client.documents.search('Christopher Wallace')
>>> obj = obj_list[0]
>>> obj.mentions
[<Mention: Page 2>, <Mention: Page 3> ....
Returns the binary data for the “normal” sized image of the document’s first page. If you would like the data for some other page, pass the page number into document_obj.get_normal_image(page). Currently, DocumentCloud only makes this available for public documents.
Returns a URL containing the “normal” sized image of the document’s first page. If you would like the URL for some other page, pass the page number into document_obj.get_normal_image_url(page).
Returns a list of URLs for the “normal” sized image of every page in the document.
The number of pages in the document.
Returns the binary data for the document’s original PDF file. Currently, DocumentCloud only makes this available for public documents.
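Binary PDF data like this is usually written straight to disk. The sketch below shows one way to do that with a basic sanity check. `save_pdf_bytes` is a hypothetical helper, the sample bytes are a placeholder rather than a real document, and the check relies on the convention that PDF files begin with `%PDF-`.

```python
import os
import tempfile


def save_pdf_bytes(data, path):
    """Write PDF bytes to path after a header check; return bytes written."""
    if not data.startswith(b"%PDF-"):
        raise ValueError("data does not look like a PDF")
    with open(path, "wb") as handle:  # binary mode keeps the bytes intact
        return handle.write(data)


# Placeholder bytes standing in for the binary data returned for a document.
sample = b"%PDF-1.4\n%%EOF\n"
target = os.path.join(tempfile.mkdtemp(), "report.pdf")
written = save_pdf_bytes(sample, target)
print(written == os.path.getsize(target))
```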
Returns a URL containing the binary data for the document’s original PDF file.
Returns a URL outside of documentcloud.org where this document has been published.
Returns a URL for a news story related to this document.
A list of the sections earmarked in the text by a user. The data are modeled by their own Python class, defined in the Sections section.
>>> obj = client.documents.get('74103-report-of-the-calpers-special-review')
>>> obj.sections
[<Section: Letter to Avraham Shemesh and Richard Resller of SIM Group>, <Section: Letter to Ralph Whitworth, founder of Relational Investors>, ...
Returns the binary data for the “small” sized image of the document’s first page. If you would like the data for some other page, pass the page number into document_obj.get_small_image(page). Currently, DocumentCloud only makes this available for public documents.
Returns a URL containing the “small” sized image of the document’s first page. If you would like the URL for some other page, pass the page number into document_obj.get_small_image_url(page).
Returns a list of URLs for the “small” sized image of every page in the document.
The original source of the document. Can be edited and saved with a put command.
Returns the binary data for the “thumbnail” sized image of the document’s first page. If you would like the data for some other page, pass the page number into document_obj.get_thumbnail_image(page). Currently, DocumentCloud only makes this available for public documents.
Returns a URL containing the “thumbnail” sized image of the document’s first page. If you would like the URL for some other page, pass the page number into document_obj.get_thumbnail_image_url(page).
Returns a list of URLs for the “thumbnail” sized image of every page in the document.
The name of the document. Can be edited and saved with a put command.
The date and time that the document was last updated, in Python’s datetime format.