
Plaintext file import

In [21]:
file = open('the-zen-of-python.txt')
print(file.read(128))   # read the first 128 characters
print(file.closed)      # False - the file is still open
print(file.close())     # close() returns None
print(file.closed)      # True - the file is now closed
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
False
None
True
In [14]:
with open('the-zen-of-python.txt') as file:
    print(file.readline())
    print(file.readline())
The Zen of Python, by Tim Peters

Beautiful is better than ugly.

Flat file import

Flat file description:

  • contains records
  • tabular data, organized in rows and columns
  • can have a header row with names/descriptions for the columns

Numpy loadtxt

loadtxt can be used when all of the data is of a single type: e.g. only numbers or only strings.
As an example, a row from the "MNIST Dataset of Handwritten Digits" will be loaded and displayed.

In [36]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

digits = np.loadtxt('mnist_test.csv', delimiter=',')
print(type(digits))
number = digits[0, :1] # take the first column of the first row of data
pixel_data = digits[0, 1:] # take all following columns of the row. in total: 784 columns
img_grid_data = np.reshape(pixel_data, (28, 28)) # 28 rows with each 28 columns
plt.imshow(img_grid_data)
plt.show()
<class 'numpy.ndarray'>
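loadtxt also accepts any file-like object, and skiprows drops header lines that would otherwise break the numeric parse. A self-contained sketch using StringIO with made-up inline data instead of a file on disk:

```python
import numpy as np
from io import StringIO

# Inline CSV with a header row; StringIO stands in for a real file
csv_text = StringIO("a,b,c\n1,2,3\n4,5,6\n")

# skiprows=1 skips the header so only the numeric rows are parsed
arr = np.loadtxt(csv_text, delimiter=',', skiprows=1)
print(arr.shape)  # (2, 3)
print(arr.dtype)  # float64 by default
```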

numpy recfromcsv

In [35]:
import numpy as np
data = np.recfromcsv('f1data.csv')
# Print out first two entries
print(data[:2])
[(360, 571, b'Scuderia Ferrari') (351, 655, b'Mercedes Grand Prix')]
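Note that recfromcsv was deprecated in newer NumPy releases and removed in NumPy 2.0; genfromtxt with names=True produces the same kind of structured array. A self-contained sketch with the f1data.csv contents inlined (the column names points/laps/team are assumptions based on the values shown above):

```python
import numpy as np
from io import StringIO

# Inline stand-in for f1data.csv; column names are assumed for the example
csv_text = StringIO(
    "points,laps,team\n"
    "360,571,Scuderia Ferrari\n"
    "351,655,Mercedes Grand Prix\n"
)

# names=True takes field names from the header row;
# dtype=None infers a per-column dtype (structured array)
data = np.genfromtxt(csv_text, delimiter=',', names=True,
                     dtype=None, encoding='utf-8')
print(data[:2])
print(data.dtype.names)  # ('points', 'laps', 'team')
```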

Serialize Python data types - with pickle

In [49]:
import pickle
filename = 'pickle-test-file.pkl'
my_dict = {'A': 1, 'B': ('1', '2'), 'C': ['a', 'b', 'c']}
# save
with open(filename, 'wb') as outfile:
    pickle.dump(my_dict, outfile)
# load
with open(filename, 'rb') as file:
    data = pickle.load(file)

print(data)
print(type(data))
{'A': 1, 'B': ('1', '2'), 'C': ['a', 'b', 'c']}
<class 'dict'>

Database - SQL - import/export from/to DataFrame

In [1]:
import sqlite3
import pandas as pd

# prepare database
conn = sqlite3.connect('TestDB.db')
c = conn.cursor()
c.execute('DROP TABLE IF EXISTS posts')
conn.commit()

# create table and insert data by DataFrame
df1 = pd.DataFrame([['post A', 'john doe'], ['post B', 'jane doe']], columns=['title','author'])
df1.to_sql('posts', conn) 

# load DataFrame from db
df2 = pd.read_sql_query('SELECT * FROM posts', conn)
print(df2.head())
   index   title    author
0      0  post A  john doe
1      1  post B  jane doe
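The extra 'index' column in the output above comes from to_sql writing the DataFrame index by default. index=False skips it, and if_exists controls what happens when the table already exists ('fail', 'replace', or 'append'). A self-contained sketch against an in-memory database:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')  # in-memory database for the sketch
df1 = pd.DataFrame([['post A', 'john doe'], ['post B', 'jane doe']],
                   columns=['title', 'author'])

# index=False keeps the DataFrame index out of the table;
# if_exists='replace' drops and recreates the table if it is already there
df1.to_sql('posts', conn, index=False, if_exists='replace')

df2 = pd.read_sql_query('SELECT * FROM posts', conn)
print(df2.columns.tolist())  # ['title', 'author'] - no extra index column
conn.close()
```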

Import JSON from online source

In [79]:
import requests
url = 'https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro&explaintext&titles=Stack%20Overflow'
json_data = requests.get(url).json()  # avoid naming it 'json' - that shadows the stdlib module
print(type(json_data))
article = json_data['query']['pages']['21721040']['extract']
print(article[0:1000])
<class 'dict'>
Stack Overflow is a question and answer site for professional and enthusiast programmers. It is a privately held website, the flagship site of the Stack Exchange Network, created in 2008 by Jeff Atwood and Joel Spolsky. It features questions and answers on a wide range of topics in computer programming. It was created to be a more open alternative to earlier question and answer sites such as Experts-Exchange. The name for the website was chosen by voting in April 2008 by readers of Coding Horror, Atwood's popular programming blog.The website serves as a platform for users to ask and answer questions, and, through membership and active participation, to vote questions and answers up or down and edit questions and answers in a fashion similar to a wiki or Reddit. Users of Stack Overflow can earn reputation points and "badges"; for example, a person is awarded 10 reputation points for receiving an "up" vote on an answer given to a question and 5 points for the "up" vote of a question, and
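The page id '21721040' above is hard-coded; the MediaWiki API keys the pages dict by numeric page id, so taking the first value of that dict avoids hard-coding it. A sketch with the relevant part of the response mocked as a plain dict (the extract text is abbreviated):

```python
# Mocked shape of the MediaWiki response used above
json_data = {
    'query': {
        'pages': {
            '21721040': {'extract': 'Stack Overflow is a question and answer site ...'}
        }
    }
}

# The pages dict is keyed by page id; grab the first (only) entry
page = next(iter(json_data['query']['pages'].values()))
print(page['extract'][:14])  # Stack Overflow
```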

Import csv from online source to DataFrame

In [75]:
import pandas as pd
url = 'https://data.lacity.org/api/views/g3qu-7q2u/rows.csv?accessType=DOWNLOAD'
df = pd.read_csv(url)  # the file is comma-separated, so the default sep=',' applies
print(df.head())
print(type(df))
          DataExtractDate            ReportPeriod  ...
0  05/01/2014 12:00:00 AM  01/01/2006 12:00:00 AM  ...
1  05/01/2014 12:00:00 AM  01/01/2006 12:00:00 AM  ...
2  05/01/2014 12:00:00 AM  01/01/2006 12:00:00 AM  ...
3  05/01/2014 12:00:00 AM  01/01/2006 12:00:00 AM  ...
4  05/01/2014 12:00:00 AM  01/01/2006 12:00:00 AM  ...
<class 'pandas.core.frame.DataFrame'>
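A wrong sep argument makes read_csv collapse every row into a single column, so the separator has to match the file's actual delimiter. A self-contained sketch with made-up semicolon-separated data, read via StringIO instead of a URL:

```python
import pandas as pd
from io import StringIO

# Inline semicolon-separated data; StringIO stands in for a URL or file
csv_text = StringIO("a;b\n1;2\n3;4\n")

# sep must match the file's delimiter (the default is ',')
df = pd.read_csv(csv_text, sep=';')
print(df.shape)             # (2, 2)
print(df.columns.tolist())  # ['a', 'b']
```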

Extracting data from websites

In [2]:
import requests
from bs4 import BeautifulSoup

url = 'https://www.python.org/'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'html.parser')  # name a parser explicitly to avoid the warning

print('>>> webpage title')
print(soup.title)

print('>>> find menu/navi item labels')
menu_items = soup.find(id="top")
lis = menu_items.find_all('li')  # reuse the element found above
for li in lis:
    print(li.a.get('title'))
    
print(">>> Find first 5 'a' tags")
a_tags = soup.find_all('a')
for a_tag in a_tags[:5]:
    print(a_tag.get('href'))
>>> webpage title
<title>Welcome to Python.org</title>
>>> find menu/navi item labels
The Python Programming Language
The Python Software Foundation
Python Documentation
Python Package Index
Python Job Board
Python Community
>>> Find first 5 'a' tags
#content
#python-network
/
/psf-landing/
https://docs.python.org