Plaintext file import¶
In [21]:
file = open('the-zen-of-python.txt')
print(file.read(128))  # read the first 128 characters
print(file.closed)     # False - the file handle is still open
file.close()           # close() returns None, so there is nothing useful to print
print(file.closed)     # True
In [14]:
# the with statement closes the file automatically when the block ends
with open('the-zen-of-python.txt') as file:
    print(file.readline())
    print(file.readline())
Flat file import¶
A flat file:
- contains records
- holds tabular data, organized in rows and columns
- can have a header row with names/descriptions for the columns
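A minimal example of such a flat file (hypothetical contents: comma-delimited, with a header row):

name,age,city
Ada,36,London
Linus,29,Helsinki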
NumPy loadtxt¶
loadtxt is suited for data where every value has the same type: e.g. only numbers or only strings.
As an example, a row from the "MNIST Dataset of Handwritten Digits" will be loaded and displayed.
In [36]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
digits = np.loadtxt('mnist_test.csv', delimiter=',')
print(type(digits))
number = digits[0, :1]       # first column of the first row: the digit label
pixel_data = digits[0, 1:]   # all remaining columns of the row, 784 pixel values in total
img_grid_data = np.reshape(pixel_data, (28, 28))  # reshape into a 28x28 pixel grid
plt.imshow(img_grid_data)
plt.show()
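Beyond delimiter, a few loadtxt keyword arguments cover most flat-file cases. A short sketch, assuming a hypothetical tab-delimited data.txt with one header line:

In [ ]:
import numpy as np
# skiprows skips the header line, usecols picks specific columns
values = np.loadtxt('data.txt', delimiter='\t', skiprows=1, usecols=[0, 2])
# with dtype=str the same call reads everything as strings
labels = np.loadtxt('data.txt', delimiter='\t', skiprows=1, dtype=str)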
NumPy recfromcsv¶
In [35]:
import numpy as np
# recfromcsv reads the csv into a record array, inferring field names from the header row
data = np.recfromcsv('f1data.csv')
# print out the first two records
print(data[:2])
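Note that recfromcsv has been deprecated in recent NumPy releases. A sketch of the same read with np.genfromtxt, assuming f1data.csv has a header row:

In [ ]:
import numpy as np
# names=True takes the field names from the header row,
# dtype=None lets NumPy infer a type per column
data = np.genfromtxt('f1data.csv', delimiter=',', names=True, dtype=None, encoding='utf-8')
print(data[:2])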
Serialize python datatypes - with pickle¶
In [49]:
import pickle
filename = 'pickle-test-file.pkl'
my_dict = {'A': 1, 'B': ('1', '2'), 'C': ['a', 'b', 'c']}
# save: pickle writes bytes, so the file is opened in binary mode 'wb'
with open(filename, 'wb') as outfile:
    pickle.dump(my_dict, outfile)
# load: only unpickle data from sources you trust
with open(filename, 'rb') as file:
    data = pickle.load(file)
print(data)
print(type(data))
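pickle can also serialize to a bytes object in memory, without a file. A minimal sketch:

In [ ]:
import pickle
# dumps/loads work on bytes instead of file objects
blob = pickle.dumps(my_dict)
restored = pickle.loads(blob)
print(restored == my_dict)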
Database - SQL - import/export from/to DataFrame¶
In [1]:
import sqlite3
import pandas as pd
# prepare database
conn = sqlite3.connect('TestDB.db')
c = conn.cursor()
c.execute('DROP TABLE IF EXISTS posts')
conn.commit()
# create the table and insert data from a DataFrame
df1 = pd.DataFrame([['post A', 'john doe'], ['post B', 'jane doe']], columns=['title', 'author'])
df1.to_sql('posts', conn)  # the DataFrame index is written as an extra column
# load a DataFrame from the database
df2 = pd.read_sql_query('SELECT * FROM posts', conn)
print(df2.head())
conn.close()
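By default to_sql fails if the table already exists; the if_exists parameter controls this. A sketch appending one more row to the posts table from above:

In [ ]:
import sqlite3
import pandas as pd
conn = sqlite3.connect('TestDB.db')
# if_exists='append' adds rows instead of failing; 'replace' would drop and
# recreate the table. index=False skips the index column this time.
df3 = pd.DataFrame([['post C', 'max mustermann']], columns=['title', 'author'])
df3.to_sql('posts', conn, if_exists='append', index=False)
print(pd.read_sql_query('SELECT * FROM posts', conn))
conn.close()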
Import JSON from online source¶
In [79]:
import requests
url = 'https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro&explaintext&titles=Stack%20Overflow'
data = requests.get(url).json()  # parse the json response into a dict ('data' avoids shadowing the json module)
print(type(data))
article = data['query']['pages']['21721040']['extract']  # '21721040' is the page id
print(article[0:1000])
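The page id above is hard-coded. Since the pages object is keyed by id, a sketch that works for whatever pages the query returns:

In [ ]:
# iterate over the pages dict instead of relying on a known page id
for page_id, page in data['query']['pages'].items():
    print(page_id, page['title'])
    print(page['extract'][0:200])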
Import csv from online source to DataFrame¶
In [75]:
import pandas as pd
url = 'https://data.lacity.org/api/views/g3qu-7q2u/rows.csv?accessType=DOWNLOAD'
df = pd.read_csv(url, sep=';')  # pass the delimiter as a keyword argument
print(df.head())
print(type(df))
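For large remote files, read_csv can limit or chunk what gets parsed. A short sketch with the same (assumed semicolon-delimited) url:

In [ ]:
import pandas as pd
# read only the first 100 rows
df_small = pd.read_csv(url, sep=';', nrows=100)
# or process the file in chunks of 1000 rows each
for chunk in pd.read_csv(url, sep=';', chunksize=1000):
    print(chunk.shape)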
Extracting data from websites¶
In [2]:
import requests
from bs4 import BeautifulSoup
url = 'https://www.python.org/'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'html.parser')  # name a parser explicitly to avoid a warning
print('>>> webpage title')
print(soup.title)
print('>>> find menu/navi item labels')
top = soup.find(id="top")  # the top navigation container
for li in top.find_all('li'):
    print(li.a.get('title'))
print(">>> Find first 5 'a' tags")
a_tags = soup.find_all('a')
for a_tag in a_tags[:5]:
    print(a_tag.get('href'))
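BeautifulSoup also understands CSS selectors via select. A sketch pulling the same navigation labels, assuming the same #top container:

In [ ]:
# css-selector equivalent of the find/find_all chain above
for a in soup.select('#top li a'):
    print(a.get('title'))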